2011-01-30 03:33:55 8 Comments
What class of languages do real modern regexes actually recognise?
Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1
) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε
— the context-free language of matching pairs of parens.
Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.
Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?
Links to relevant papers would be much appreciated.
Related Questions
Sponsored Content
41 Answered Questions
[SOLVED] A comprehensive regex for phone number validation
- 2008-09-23 20:13:42
- Nicholas Trandem
- 884909 View
- 901 Score
- 41 Answer
- Tags: regex validation phone-number
11 Answered Questions
[SOLVED] How to negate specific word in regex?
- 2009-08-06 17:20:45
- Bostone
- 644346 View
- 596 Score
- 11 Answer
- Tags: regex
7 Answered Questions
34 Answered Questions
3 Answered Questions
2 Answered Questions
[SOLVED] Converting PCRE recursive regex pattern to .NET balancing groups definition
- 2010-07-28 04:47:59
- kennytm
- 1661 View
- 21 Score
- 2 Answer
- Tags: .net regex pcre recursive-regex balancing-groups
3 Answered Questions
[SOLVED] How can we match a^n b^n with Java regex?
- 2010-09-04 22:49:41
- polygenelubricants
- 11234 View
- 94 Score
- 3 Answer
- Tags: java regex capturing-group lookaround nested-reference
3 Answered Questions
[SOLVED] What kind of formal languages can modern regex engines parse?
- 2012-07-03 07:52:24
- georg
- 1351 View
- 27 Score
- 3 Answer
- Tags: regex computer-science formal-languages
1 Answered Questions
[SOLVED] How does this Java regex detect palindromes?
- 2010-09-08 05:34:21
- polygenelubricants
- 5570 View
- 21 Score
- 1 Answer
- Tags: java regex palindrome lookaround nested-reference
1 comments
@tchrist 2011-01-30 15:07:51
Pattern Recursion
With recursive patterns, you have a form of recursive descent matching.
This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.
Restrictions on Recursion
The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.
BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. eg: Perl can handle
^(.|(.)(?1)\2)$
where PCRE requires^((.)(?1)\2|.)$
instead.Recursion Demos
The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:
I find that trickier to read because of its compact nature. This is easily curable with
/x
mode to make whitespace no longer significant:Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:
Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:
which you can test on most systems using something like this:
Note that PCRE’s implementation of recursion requires the more elaborate
This is because of restrictions on how PCRE recursion works.
Proper Parsing
To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:
As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.
As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the
%/
variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the=~
operator. It’s just a bit magical.@hobbs 2011-01-30 23:30:46
The limitation on left-recursion is definitely worth knowing about, but if I remember properly, it doesn't have an effect on "recognizing power" strictly, since for any left-recursive grammar, there's a right-recursive grammar that matches the same language -- it just might be a lot more cumbersome.
@tchrist 2011-01-31 12:38:21
@tobyodavies: I could have explained the PCRE restrictions further; they have to do with atomicity of groups: you cannot invoke recursion on a group that hasn't been completed yet in PCRE but you can in Perl. The grammatical RFC 5322 pattern should work equally well in PCRE; the whole
((DEFINE)…)
idea is extremely powerful and useful, allowing for separation of declaration (and its ordering) from execution, just like all top-down programming. I can't recall which other languages have group recursion; it may be something exotic like C♯ or its ilk.