By KTrover


2014-09-08 16:13:18 8 Comments

I understand the ? mark here means "lazy".

My question essentially is [0-9]{2}? vs [0-9]{2}

Are they same?
If so, why are we writing the former expression? Aren't lazy mode more expensive performance wise?
If not, can you tell the difference?

2 comments

@Unihedron 2014-09-08 18:43:52

What's "lazy" (reluctant) matching?

When matching with regex, the pointer is greedy by default:

Left | Right
\d+    12345
^      ^
\d+    12345
  ^    ^^^^^ Matched!

Lazy is the opposite of greedy:

Left | Right
\d+?   12345
^      ^
\d+?   12345
  ^^    ^
       12345
         ^
       12345
          ^
       12345
           ^ Matched!

Why does it matter?

In matches, quantifiers * + ? are greedy by default. This can lead to unwanted behaviour, especially when we would like certain characters only to match when necessary for match to complete, and omit otherwise.

A typical example is when we want to match a single XML tag: We will fail this with <.*>.

Left | Right
<.*>   <p>hi</p><br /><p>bye</p>
^      ^
<.*>   <p>hi</p><br /><p>bye</p>
 ^^     ^^^^^^^^^^^^^^^^^^^^^^^^
<.*>   <p>hi</p><br /><p>bye</p>
   ^                           < [backtrack!]
<.*>   <p>hi</p><br /><p>bye</p>
   ^                           ^ Matched "<p>hi</p><br /><p>bye</p>"!
Left* | Right
<.*?>   <p>hi</p><br /><p>bye</p>
^       ^
<.*?>   <p>hi</p><br /><p>bye</p>
 ^^^     ^ [can we stop? we're lazy [yes]]
<.*?>   <p>hi</p><br /><p>bye</p>
    ^     ^ Matched "<p>"!

What can I quantify as lazy?

You can add the ? construct behind quantifiers and ranges:

+ (one or more), * (zero or more), ? (optional);
{n,m} (between n and m where n < m), {n,} (n or more), {n} (exactly n times).
(n and m in the examples are real numbers and satisfies n, m ϵ N)

  1.     Reluctant quantifiers are unwilling to push on.
    The match is allowed to match as much as possible or as little as possible, under the consideration that the engine only attempts to match when absolutely necessary for the rest of the rest to succeed. See the following cases:

    Left | Right
    abc*   abccccd
    ^      ^
    abc*   abccccd
     ^      ^
    abc*   abccccd
      ^      ^
    abc*   abccccd
      ^^     ^^^^ Matched "abcccc"!
    
    Left* | Right
    abc*?   abccccd
    ^       ^
    abc*?   abccccd
     ^       ^
    abc*?   abccccd
      ^^^     ^ [must we do this? we're lazy [no]]
               Matched "ab"!
    

    As demonstrated, they match as few as possible.

  2.     Reluctant quantifiers give up to entertain other quantifiers.
    (Demonstration purposes; If anyone asked, I did not tell you it is OK to use RegExp like this.)

    Left | Right
    c+c+   abccccd
    ^        ^
    c+c+   abccccd
    ^^       ^^^^
    c+c+   abccccd
      ^         < [backtrack]
    c+c+   abccccd
      ^^        ^ Matched "cccc"!
                  (c+ -> @ccc; c+ -> @c)
    
    Left* | Right
    c+?c+   abccccd
    ^         ^
    c+?c+   abccccd
    ^^^       ^ [pass]
    c+?c+   abccccd
       ^^      ^^^ Matched "cccc"!
                   (c+? -> @c; c+ -> @c)
    
  3.     Exact range quantifiers are not affected.
    Between X{n} and X{n}?, there are virtually no differences; And most engines internally optimizes away the reluctant flag. This is because the lazy construct only applies when the match is dynamic, of which the engine can behave one way or the other for the quantifier (need or greed), but is not applicable for this case.

Check out regex101, a well-done regular expression engine which comes with explanation and a debugger log to show you the pointer steps. Also read The Stack Overflow Regex Reference!

@Sam 2014-09-08 18:46:15

Good work actually writing out the location of the internal pointer. I think we explained similar points, but I skipped the expanded examples.

@zx81 2014-09-09 21:35:45

I'm probably not understanding what you are trying to say, but you don't mean that against subject 12345 the first match of the pattern \d+? is the entire string, do you? To my eyes it looks like that's what you are saying — whereas the first match is 1, the second match is 2, and so on.

@Sam 2014-09-08 18:38:06

There is not a difference between [0-9]{2} and [0-9]{2}?.

The difference between greedy matching and lazy matching (the addition of a ?) has to do with backtracking. Regular expression engines are built to match text (from left to right). Therefore it is logical that when you ask an expression to match a range of character(s), it matches as many as possible.


Assume we have the string acac123.

If we use a greedy match of [a-z]+c (+ standing for 1+ repetitions or {1,}):

  • [a-z]+ would match acac and fail at 1
  • then we would try to match the c, but fail at 1
  • now we start backtracking, and successfully match aca and c

If we make this lazy ([a-z]+?c), we will get both a different response (in this case) and be more efficient:

  • [a-z]+? would match a, but stop because it sees the next character matches the rest of the expression c
  • the c would then match, successfully matching a and c (with no backtracking)

Now you can see that there will be no difference between X{#} and X{#}?, because {#} is not a range and even a greedy match will not experience any backtracking. Lazily matches are often used with * (0+ repetitions or {0,}) or +, but can also be used with ranges {m,n} (where n is optional).

This is essential when you want to match the least amount of characters possible and you will often see .*? in an expression when you want to fill up some space (foo.*?bar on a string foo bar filler text bar). However, many times a lazy match is an example of bad/inefficient regex. Many people will do something like foo:"(.*?)" to match everything within double quotes, when you can avoid a lazy match by writing your expression like foo:"([^"]+)" and match anything but "s.


Final note, ? typically means "optional" or match {0,1} times. ? only will make a match lazy if you use it on a range ({m,n}, *, +, or another ?). This means X? will not make X lazy (since we already said {#}? is pointless), but instead it will be optional. However, you can do a lazy "optional" match: [0-9]?? will lazily match 0-1 times.

Related Questions

Sponsored Content

75 Answered Questions

31 Answered Questions

[SOLVED] Regular expression to match a line that doesn't contain a word

22 Answered Questions

[SOLVED] How do you access the matched groups in a JavaScript regular expression?

  • 2009-01-11 07:21:20
  • nickf
  • 797331 View
  • 1409 Score
  • 22 Answer
  • Tags:   javascript regex

12 Answered Questions

19 Answered Questions

[SOLVED] Regular Expression for alphanumeric and underscores

  • 2008-12-03 04:25:27
  • Jim
  • 1141890 View
  • 616 Score
  • 19 Answer
  • Tags:   regex

15 Answered Questions

[SOLVED] What is a non-capturing group in regular expressions?

8 Answered Questions

[SOLVED] Is there a regular expression to detect a valid regular expression?

  • 2008-10-05 17:07:35
  • psytek
  • 213915 View
  • 1032 Score
  • 8 Answer
  • Tags:   regex

5 Answered Questions

[SOLVED] \d is less efficient than [0-9]

20 Answered Questions

[SOLVED] How do you use a variable in a regular expression?

  • 2009-01-30 00:11:05
  • JC Grubbs
  • 811171 View
  • 1439 Score
  • 20 Answer
  • Tags:   javascript regex

12 Answered Questions

[SOLVED] Regular Expressions: Is there an AND operator?

  • 2009-01-22 16:49:14
  • Hugoware
  • 798125 View
  • 734 Score
  • 12 Answer
  • Tags:   regex lookahead

Sponsored Content