By Pointer Null


2012-08-10 09:37:40 8 Comments

I need help with regular expression matching using the non-greedy option.

The match pattern is:

<img\s.*>

The text to match is:

<html>
<img src="test">
abc
<img
  src="a" src='a' a=b>
</html>

I tested this on http://regexpal.com

This expression matches all text from <img to the last >. I need it to stop at the first > encountered after the initial <img, so here I should get two matches instead of the one that I get.

I tried all combinations of non-greedy ?, with no success.

3 comments

@tripleee 2018-11-19 05:50:56

The other answers here presuppose that you have a regex engine which supports non-greedy matching, which is an extension introduced in Perl 5 and widely copied to other modern languages; but it is by no means ubiquitous.

Many older or more conservative languages and editors only support traditional regular expressions, which have no mechanism for controlling greediness of the repetition operator * - it always matches the longest possible string.

The trick then is to limit what it's allowed to match in the first place. Instead of .* you seem to be looking for

[^>]*

which still matches as many of something as possible; but the something is not just . "any character", but instead "any character which isn't >".
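As a minimal sketch of this idea, here is the negated character class applied to the sample text from the question. Python's re module is used only for illustration; the variable names are assumptions, not part of the original question.

```python
import re

# The sample text from the question; the second tag spans two lines.
html = """<html>
<img src="test">
abc
<img
  src="a" src='a' a=b>
</html>"""

# [^>]* is still greedy, but it cannot run past the first ">".
# A negated class also matches newlines, so no extra flag is needed.
tags = re.findall(r"<img\s[^>]*>", html)

for tag in tags:
    print(repr(tag))
```

Each tag is returned as its own match, including the one that contains a line break.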

Depending on your application, you may or may not want to enable an option to permit "any character" to include newlines.

Even if your regular expression engine supports non-greedy matching, it's better to spell out what you actually mean. If this is what you mean, you should probably say this, instead of relying on non-greedy matching to (hopefully, probably) Do What I Mean.

For example, a regular expression with a trailing context after the wildcard like .*?><br/> will jump over any nested > until it finds the trailing context (here, ><br/>) even if that requires straddling multiple > instances and newlines if you let it, where [^>]*><br/> (or even [^\n>]*><br/> if you have to explicitly disallow newline) obviously can't and won't do that.
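The difference can be reproduced with a short sketch. The input string below is made up for illustration, and Python's re module stands in for whichever engine you use.

```python
import re

# Hypothetical input: the <img> tag is NOT directly followed by <br/>.
text = "<img src='a'> some text <b>bold</b><br/>"

# Non-greedy .*? keeps expanding until "><br/>" finally appears,
# so it straddles several ">" characters along the way.
lazy = re.search(r"<img.*?><br/>", text)

# [^>]* must stop at the first ">", which is not followed by <br/>,
# so there is no match at all.
strict = re.search(r"<img[^>]*><br/>", text)

print(lazy.group(0) if lazy else None)
print(strict.group(0) if strict else None)
```

Here the non-greedy pattern swallows the whole string, while the negated class refuses to match.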

Of course, this is still not what you want if you need to cope with <img title="quoted string with > in it" src="other attributes"> and perhaps <img title="nested tags">, but at that point, you should finally give up on using regular expressions for this like we all told you in the first place.
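For the quoted > case, a rough sketch with an actual HTML parser might look like the following. It assumes Python's standard html.parser module; the ImgCollector class is a hypothetical helper written for this illustration, not something from the question.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> start tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


parser = ImgCollector()
parser.feed('<img title="quoted string with > in it" src="other attributes">')
parser.close()

images = parser.images
print(images)
```

The parser tracks quoting itself, so the > inside the title attribute does not end the tag.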

@Ilya 2012-08-10 09:43:05

Appending ? to a quantifier makes it non-greedy. E.g. .* is greedy while .*? isn't. So you can use something like <img.*?> to match the whole tag. Or <img[^>]*>.

But remember that HTML in general can't actually be parsed with regular expressions.

@Mario Marinato 2016-11-11 14:03:10

Your answer reminded me of this: stackoverflow.com/a/1732454/431

@golopot 2016-11-12 01:34:02

I think it's clearer to say that *? is the non-greedy version of *.

@Pavan Manjunath 2012-08-10 09:42:12

The non-greedy ? works perfectly fine. You just need to select the "dot matches all" option in the regex engine you are testing with (regexpal, the engine you used, also has this option). This is because regex engines generally don't match line breaks with . unless you explicitly tell them to.

For example,

<img\s.*?>

works fine!
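As an illustration of the flag's effect, here is a small sketch in Python, where the "dot matches all" option is called re.DOTALL. The input string is a made-up example in which the line break falls after the first attribute.

```python
import re

# Hypothetical input: the first tag has a line break after its first attribute.
html = '<p><img src="a"\n  alt="b"> text <img src="c"></p>'

pattern = r"<img\s.*?>"

# Without the flag, "." refuses to cross the newline, so the first tag is missed.
without_flag = re.findall(pattern, html)

# With re.DOTALL, "." matches newlines too and both tags are found.
with_flag = re.findall(pattern, html, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```

The pattern is the same in both calls; only the flag changes which tags are found.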

Check the results here.

Also, read about how dot behaves in various regex flavours.

@Tom Lord 2014-11-21 11:45:51

There is also a trick you can use to work around this: since \s means "any whitespace" and \S means "any non-whitespace", [\s\S] will match ANY character (like ".", but including newline)! Similarly, you could use [\d\D] or [\w\W]. This can be quite a handy little "hack", and it is certainly a very useful trick to be aware of.
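A quick sketch of the [\s\S] trick, again in Python and on a made-up input, with no flags set:

```python
import re

# Hypothetical input with a line break inside the first tag.
html = '<img src="a"\n  alt="b"> and <img src="c">'

# [\s\S] matches whitespace or non-whitespace, i.e. any character at all,
# so the pattern crosses the newline without a "dot matches all" option.
matches = re.findall(r"<img\s[\s\S]*?>", html)

print(matches)
```

This is mostly useful in engines or tools where the flag is unavailable or awkward to set.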

@Tom Lord 2014-11-21 11:52:54

Or even, in this example, you could use <img[^>]*> to achieve the same effect, since "any character other than >" DOES include newline!

@Thorsten Staerk 2015-03-22 08:47:48

good answer, but how about bash? echo "<img src=test>bla<img src=a>" | grep -P '<img\s.*?>' matches the whole string despite the ? operator.

@Joachim Wagner 2016-01-21 08:54:53

@Thorsten: -P selects Perl mode and perldoc says *? is non-greedy. Confirmed to work on a 10-year-old Linux and a recent Linux. Maybe you misinterpreted the output. "grep" prints any line (in full) that has a match somewhere. Add "-o" to only print the matches.
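As a rough cross-check, the same pattern can be run over the echoed string in Python. This is only a stand-in for grep -P (an assumption: the two engines are separate implementations, although both treat *? as non-greedy).

```python
import re

# The string from the echo command in the comment above.
line = "<img src=test>bla<img src=a>"

# findall returns only the matched substrings, much like grep's -o option.
matches = re.findall(r"<img\s.*?>", line)

print(matches)
```

Two separate matches come back, which fits the explanation that grep without -o simply prints the whole line containing them.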

@Mrinal Bhattacharjee 2016-03-17 14:29:01

I intend to find the pattern in the line below. line = "/ab[1].bc[2].cd[3]"; pattern="([a-zA-Z0-9].*?\[\\d*?\])"; I can find multiple matches in TextFX,notepad++ but in java it finds only 1 match
