By Richard Dorman


2008-09-25 14:17:40 8 Comments

Is it possible to write a regular expression that matches a nested pattern that occurs an unknown number of times? For example, can a regular expression match an opening and closing brace when there are an unknown number of open/close braces nested within the outer braces?

For example:

public MyMethod()
{
  if (test)
  {
    // More { }
  }

  // More { }
} // End

Should match:

{
  if (test)
  {
    // More { }
  }

  // More { }
}

11 comments

@awwsmm 2018-03-28 11:20:37

YES

...assuming that there is some maximum number of nestings you'd be happy to stop at.

Let me explain.


@torsten-marek is right that a regular expression cannot check for nested patterns like this, BUT it is possible to define a nested regex pattern which will allow you to capture nested structures like this up to some maximum depth. I created one to capture EBNF-style comments, like:

(* This is a comment (* this is nested inside (* another level! *) hey *) yo *)

The regex (for single-depth comments) is the following:

m{1} = \(+\*+(?:[^*(]|(?:\*+[^)*])|(?:\(+[^*(]))*\*+\)+

This could easily be adapted for your purposes by replacing the \(+\*+ and \*+\)+ with { and } and replacing everything in between with a simple [^{}]:

p{1} = \{(?:[^{}])*\}

To nest, just allow this pattern within the block itself:

p{2} = \{(?:(?:p{1})|(?:[^{}]))*\}
  ...or...
p{2} = \{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\}

To find triple-nested blocks, use:

p{3} = \{(?:(?:p{2})|(?:[^{}]))*\}
  ...or...
p{3} = \{(?:(?:\{(?:(?:\{(?:[^{}])*\})|(?:[^{}]))*\})|(?:[^{}]))*\}

A clear pattern has emerged. To find blocks nested to a depth of N, simply use the regex:

p{N} = \{(?:(?:p{N-1})|(?:[^{}]))*\}

  where N > 1 and
  p{1} = \{(?:[^{}])*\}

A script could be written to recursively generate these regexes, but that's beyond the scope of what I need this for. (This is left as an exercise for the reader. 😉)
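For what it's worth, the recurrence above is mechanical enough to sketch in a few lines of Python. This is only an illustration, not the script the answer has in mind: the function name `nested_brace_pattern` and the sample string are assumptions, and it relies on nothing beyond the standard `re` module.

```python
import re


def nested_brace_pattern(depth: int) -> str:
    """Build p{depth} from the recurrence p{N} = \\{(?:(?:p{N-1})|(?:[^{}]))*\\}."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    pattern = r"\{(?:[^{}])*\}"  # p{1}: a block with no nested braces
    for _ in range(depth - 1):
        # Wrap the previous level so it may appear inside a new outer block.
        pattern = r"\{(?:(?:" + pattern + r")|(?:[^{}]))*\}"
    return pattern


# Sample input taken from the question; it needs three levels of nesting.
source = (
    "public MyMethod()\n{\n  if (test)\n  {\n    // More { }\n  }\n\n"
    "  // More { }\n} // End"
)
match = re.search(nested_brace_pattern(3), source)
print(match.group(0))  # the outer block, without the trailing "// End"
```

The pattern grows with every extra level, so this only makes sense when a modest maximum depth is acceptable.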

@Craig H 2008-09-25 14:19:02

No, you are getting into the realm of Context Free Grammars at that point.

@Pavlush 2008-12-05 06:35:28

Yes, if it is the .NET regex engine. The .NET engine supports a finite state machine supplemented with an external stack.

@Ben S 2010-03-15 00:18:46

As others have mentioned, .NET is not the only regex engine capable of this.

@Zsolt Botykai 2008-09-25 14:40:25

Probably working Perl solution, if the string is on one line:

my $NesteD ;
$NesteD = qr/ \{( [^{}] | (??{ $NesteD }) )* \} /x ;

if ( $Stringy =~ m/\b( \w+$NesteD )/x ) {
    print "Found: $1\n" ;
}

HTH

EDIT: check:

And one more thing by Torsten Marek (who had pointed out correctly, that it's not a regex anymore):

@Michael Carman 2008-09-25 15:09:02

Yup. Perl's "regular expressions" aren't (and haven't been for a very long time). It should be noted that recursive regexes are a new feature in Perl 5.10 and that even though you can do this you probably shouldn't in most of the cases that commonly come up (e.g. parsing HTML).

@Brad Gilbert 2008-10-16 16:30:24

@Pete B 2012-09-17 08:43:24

Using recursive matching in the PHP regex engine is massively faster than procedural matching of brackets, especially with longer strings.

http://php.net/manual/en/regexp.reference.recursive.php

e.g.

$patt = '!\( (?: (?: (?>[^()]+) | (?R) )* ) \)!x';

preg_match_all( $patt, $str, $m );

vs.

matchBrackets( $str );

function matchBrackets ( $str, $offset = 0 ) {

    $matches = array();

    list( $opener, $closer ) = array( '(', ')' );

    // Return early if there's no match
    if ( false === ( $first_offset = strpos( $str, $opener, $offset ) ) ) {
        return $matches;
    }

    // Step through the string one character at a time storing offsets
    $paren_score = -1;
    $inside_paren = false;
    $match_start = 0;
    $offsets = array();

    for ( $index = $first_offset; $index < strlen( $str ); $index++ ) {
        $char = $str[ $index ];

        if ( $opener === $char ) {
            if ( ! $inside_paren ) {
                $paren_score = 1;
                $match_start = $index;
            }
            else {
                $paren_score++;
            }
            $inside_paren = true;
        }
        elseif ( $closer === $char ) {
            $paren_score--;
        }

        if ( 0 === $paren_score ) {
            $inside_paren = false;
            $paren_score = -1;
            $offsets[] = array( $match_start, $index + 1 );
        }
    }

    while ( $offset = array_shift( $offsets ) ) {

        list( $start, $finish ) = $offset;

        $match = substr( $str, $start, $finish - $start );
        $matches[] = $match;
    }

    return $matches;
}

@MichaelRushton 2010-10-03 18:49:51

Using regular expressions to check for nested patterns is very easy.

'/(\((?>[^()]+|(?1))*\))/'

@ridgerunner 2011-03-12 06:35:37

I agree. However, one problem with the (?>...) atomic group syntax (under PHP 5.2) is that the ?> portion is interpreted as: "end-of-script"! Here is how I would write it: /\((?:[^()]++|(?R))*+\)/. This is a bit more efficient for both matching and non-matching. In its minimal form, /\(([^()]|(?R))*\)/, it is truly a beautiful thing!

@MichaelRushton 2011-03-19 12:27:41

Double +? I used (?1) to allow for comments to be within other text (I ripped it and simplified it from my email address regular expression). And (?> was used because I believe it makes it fail faster (if required). Is that not correct?

@Dwayne 2015-01-15 18:01:41

Can you add an explanation for each part of the regex?

@Cœur 2015-10-13 07:38:45

For string '(a (b c)) (d e)', using simple expression '/\([^()]*\)/' gives me the same result. Are there benefits to your long expression?

@MichaelRushton 2015-10-13 09:16:18

Try using /^(\((?>[^()]+|(?1))*\))+$/ and /^\([^()]*\)+$/ to match (a (b c))(d e). The former matches but the latter doesn't.

@elquimista 2016-02-01 15:18:56

@MichaelRushton your solution worked fine for me. But I'm just wondering what's the difference between ?> and ?: ? I tried both of them and they both seem to work.

@MichaelRushton 2016-02-01 18:58:31

It makes it an atomic group, and is used to prevent catastrophic backtracking.

@Sean Huber 2010-04-01 20:39:10

This seems to work: /(\{(?:\{.*\}|[^\{])*\})/m

@Stijn Sanders 2014-01-02 06:52:41

It also seems to match {{} which it shouldn't
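A claim like this is easy to check by running the pattern against a couple of unbalanced inputs. The sketch below assumes the pattern is ported as-is to Python's standard `re` module (the answer does not say which engine it was written for), and the name `PATTERN` is made up for the example.

```python
import re

# Pattern from the answer above, unchanged apart from Python raw-string quoting.
PATTERN = re.compile(r"(\{(?:\{.*\}|[^\{])*\})", re.M)

# An unanchored search on "{{}" still succeeds: the engine skips the first
# brace and matches the balanced tail.
print(PATTERN.search("{{}").group(0))

# Because [^\{] also accepts "}", a stray closing brace is swallowed and the
# whole unbalanced string "{}}" is accepted.
print(PATTERN.fullmatch("{}}") is not None)
```

So the pattern finds something in unbalanced input rather than rejecting it, which matters if it is being used for validation.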

@sirnotappearingonthissite 2008-09-25 15:25:10

As zsolt mentioned, some regex engines support recursion. Of course, these are typically the ones that use a backtracking algorithm, so it won't be particularly efficient. Example: /(?>[^{}]*){(?>[^{}]*)(?R)*(?>[^{}]*)}/sm

@Remo.D 2008-09-25 15:09:20

Proper regular expressions cannot do it, because you would leave the realm of regular languages and land in context-free language territory.

Nevertheless the "regular expression" packages that many languages offer are strictly more powerful.

For example, Lua regular expressions have the "%b()" recognizer that will match balanced parentheses. In your case you would use "%b{}".

Another sophisticated tool similar to sed is gema, where you will match balanced curly braces very easily with {#}.

So, depending on the tools you have at your disposal, your "regular expression" (in a broader sense) may be able to match nested parentheses.

@Rafał Dowgird 2008-09-25 14:47:07

The Pumping lemma for regular languages is the reason why you can't do that.

The generated automaton will have a finite number of states, say k, so a string of k+1 opening braces is bound to have a state repeated somewhere (as the automaton processes the characters). The part of the string between the two visits to that state can be repeated any number of times and the automaton will not know the difference.

In particular, if it accepts k+1 opening braces followed by k+1 closing braces (which it should) it will also accept the pumped number of opening braces followed by unchanged k+1 closing braces (which it shouldn't).
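To make the argument a little more concrete, here is a toy illustration in Python. It is not a proof and not a general DFA: it is one hand-built machine whose only memory is a nesting depth capped at k, so it has a fixed number of states. The helper name `make_depth_limited_matcher` is an assumption made up for this sketch.

```python
def make_depth_limited_matcher(k):
    """Return an acceptor with states 0..k plus a dead state.

    The state is the current nesting depth capped at k, which makes this a
    finite automaton in disguise (hypothetical helper, for illustration).
    """
    dead = -1

    def accepts(text):
        state = 0
        for ch in text:
            if ch == "{":
                state = min(state + 1, k)  # depths above k collapse into state k
            elif ch == "}":
                if state == 0:
                    state = dead  # closing brace with nothing open
                    break
                state -= 1
        return state == 0

    return accepts


accepts = make_depth_limited_matcher(3)
print(accepts("{" * 3 + "}" * 3))  # within the cap: classified correctly
print(accepts("{" * 4 + "}" * 3))  # unbalanced, yet accepted
print(accepts("{" * 4 + "}" * 4))  # balanced, yet rejected
```

After k opening braces and after k+1 opening braces the machine is in the same state, so nothing that follows can tell the two inputs apart. Raising k only moves the point at which this happens.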

@Torsten Marek 2008-09-25 14:27:12

No. It's that easy. A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.

You can match nested/paired elements up to a fixed depth, where the depth is only limited by your memory, because the automaton gets very large. In practice, however, you should use a push-down automaton, i.e. a parser for a context-free grammar, for instance LL (top-down) or LR (bottom-up). You have to take the worse runtime behavior into account: O(n^3) vs. O(n), with n = length(input).
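For the brace case specifically, the "extra memory" a push-down automaton needs can be sketched with an explicit stack. The Python below is a hand-rolled scanner, not a parser generated from a grammar; the function name `find_balanced_blocks` and its parameters are assumptions for illustration, and it makes no attempt to skip braces inside string literals or comments.

```python
def find_balanced_blocks(text, opener="{", closer="}"):
    """Return every outermost balanced opener...closer block in text."""
    blocks = []
    stack = []  # positions of openers that are still waiting for a closer
    for index, ch in enumerate(text):
        if ch == opener:
            stack.append(index)
        elif ch == closer and stack:
            start = stack.pop()
            if not stack:
                # The stack just emptied, so this closer ends an outermost block.
                blocks.append(text[start:index + 1])
        # A closer with an empty stack is unmatched and simply ignored.
    return blocks


print(find_balanced_blocks("a {b {c} d} e {f}"))
```

The scan is a single pass over the input, and the stack grows only with the nesting depth actually present, which is exactly what a fixed set of states cannot provide.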

There are many parser generators available, for instance ANTLR for Java. Finding an existing grammar for Java (or C) is also not difficult.
For more background: Automata Theory at Wikipedia

@daremon 2008-09-25 15:26:12

Torsten is correct as far as theory is concerned. In practice many implementations have some trick in order to allow you to perform recursive "regular expressions". E.g. see the chapter "Recursive patterns" in php.net/manual/en/regexp.reference.php

@Torsten Marek 2008-09-25 15:31:08

I am spoiled by my upbringing in Natural Language Processing and the automata theory it included.

@Ben Doom 2008-09-25 16:35:52

A refreshingly clear answer. Best "why not" I've ever seen.

@Novikov 2010-10-04 16:54:57

Regular expressions in language theory and regular expressions in practice are different beasts... since regular expressions can't have niceties such as back references, forward references etc.

@Rafael Eyng 2015-09-21 00:33:26

A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton. - best answer on this topic I've seen so far

@Andy Baker 2016-08-13 13:25:07

@TorstenMarek - can you confirm this is still true? Other sources state that if a regex engine supports features such as back-references it becomes a class 2 grammar (context-free) rather than a class 3 (regular grammar). Therefore PCRE for example - is capable of handling nested structures. The confusion comes from the fact that 'regex' in the real world are no longer regular in the technical sense. If this is correct it would be great to update this answer.

@Grant Eagon 2017-09-06 20:13:16

There is a way to accomplish this, but it will not be purely regex. You need to match every instance of braces/brackets/parens (global), then use some programming language to recursively replace/mark the nested matches within the parent.
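One way to read that suggestion, sketched in Python: keep replacing the innermost brace pairs (which a plain regex can find) with numbered markers until none are left, recording each block as it goes. The names `collapse_nested` and `expand` and the `<<n>>` marker format are assumptions for this example, and it assumes the input never contains text that already looks like `<<n>>`.

```python
import re

INNERMOST = re.compile(r"\{[^{}]*\}")  # a brace pair with no braces inside
MARKER = re.compile(r"<<(\d+)>>")      # hypothetical placeholder format


def collapse_nested(text):
    """Replace brace blocks, innermost first, with <<n>> markers.

    Returns the collapsed text and the list of recorded blocks; a parent
    block refers to its children through their markers.
    """
    blocks = []

    def stash(match):
        blocks.append(match.group(0))
        return f"<<{len(blocks) - 1}>>"

    while True:
        text, count = INNERMOST.subn(stash, text)
        if count == 0:
            return text, blocks


def expand(fragment, blocks):
    """Recursively substitute markers back to recover the original text."""
    return MARKER.sub(lambda m: expand(blocks[int(m.group(1))], blocks), fragment)


collapsed, blocks = collapse_nested("a {b {c} d} e")
print(collapsed)
print(blocks)
print(expand(collapsed, blocks))
```

Markers left in the collapsed text correspond to the outermost blocks, and each recorded block shows which children it contains, so the nesting is recovered by the surrounding code rather than by the regex itself.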

@Aaron Cicali 2018-04-19 19:34:37

This answer is way above my head. And then I found a working regex: drregex.com/2017/11/match-nested-brackets-with-regex-new.html

@Aaron Cicali 2018-04-19 20:07:26

Correction... that regex works in most cases :(
