By Teifion


2008-08-07 14:05:23 8 Comments

I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?

1 comments

@Greg Bacon 2010-05-03 16:09:21

The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.

If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn how to write and debug your own patterns but also how to understand patterns written by others.

Start simple

Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.

Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.

If you've ever used grep on Unix—even if only to search for ordinary looking strings—you've already been using regular expressions! (The re in grep refers to regular expressions.)

Order from the menu

Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.

The pattern . is special: rather than matching a literal dot only, it matches any character. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].

Think of character classes as menus: pick just one.

Helpful shortcuts

Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).

The uppercased variants are their complements, so \S matches any non-whitespace character, for example.

Once is not enough

From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are

  • * (zero or more times)
  • + (one or more times)
  • {n} (exactly n times)
  • {n,} (at least n times)
  • {n,m} (at least n times but no more than m times)

Putting some of these blocks together, the pattern [Nn]*ick matches all of

  • ick
  • Nick
  • nick
  • Nnick
  • nNick
  • nnick
  • (and so on)

The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.

A few other useful examples:

  • [0-9]+ (and its equivalent \d+) matches any non-negative integer
  • \d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01

Grouping

A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.

To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.

Alternation

Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).

For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.

Escaping

Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.

Greediness

Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.

For example, say the input is

"Hello," she said, "How are you?"

You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.

To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.

If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.

(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)

Anchors

Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.

Say you want to match comments of the form

-- This is a comment --

you'd write ^--\s+(.+)\s+--$.

Build your own

Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.

Tools for writing and debugging regexes:

Books

Free resources

Footnote

†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.

@Juraj.Lorinc 2015-09-09 10:25:21

You can also use trial and error method and than following online regex tester and debugger can be a huge help: regex101.com

@Nic Hartley 2016-03-31 12:12:19

It would be worth mentioning that, despite being a similar pattern, a{,m} isn't a thing, at least in Javascript, Perl, and Python.

@hek2mgl 2016-11-14 18:14:30

It would be very worth to mention that there are different kind of regular expression engines with all have different feature set's and syntactic rules.

@Saurabh Hooda 2017-08-16 07:14:04

hackr.io/tutorials/learn-regular-expressions-regex is a great place to find best online regex tutorials. All the tutorials here are submitted and recommended (upvoted like SO) by the programming community.

@Saurabh Tiwari 2019-01-14 08:37:13

Appreciate your efforts to bring it all here in a nutshell.

Related Questions

Sponsored Content

73 Answered Questions

28 Answered Questions

8 Answered Questions

[SOLVED] Is there a regular expression to detect a valid regular expression?

  • 2008-10-05 17:07:35
  • psytek
  • 103187 View
  • 672 Score
  • 8 Answer
  • Tags:   regex

18 Answered Questions

[SOLVED] How do you use a variable in a regular expression?

  • 2009-01-30 00:11:05
  • JC Grubbs
  • 622447 View
  • 1157 Score
  • 18 Answer
  • Tags:   javascript regex

16 Answered Questions

[SOLVED] How do you access the matched groups in a JavaScript regular expression?

  • 2009-01-11 07:21:20
  • nickf
  • 668497 View
  • 1196 Score
  • 16 Answer
  • Tags:   javascript regex

11 Answered Questions

[SOLVED] How to do a regular expression replace in MySQL?

15 Answered Questions

[SOLVED] Regular expression to search for Gadaffi

  • 2011-03-19 22:14:17
  • SiggyF
  • 52214 View
  • 361 Score
  • 15 Answer
  • Tags:   regex search

10 Answered Questions

[SOLVED] How to match "anything up until this sequence of characters" in a regular expression?

  • 2011-08-19 16:45:44
  • callum
  • 521221 View
  • 405 Score
  • 10 Answer
  • Tags:   regex

12 Answered Questions

[SOLVED] Regular Expressions: Is there an AND operator?

  • 2009-01-22 16:49:14
  • Hugoware
  • 656647 View
  • 614 Score
  • 12 Answer
  • Tags:   regex lookahead

22 Answered Questions

[SOLVED] Why are regular expressions so controversial?

  • 2009-04-18 21:33:07
  • Gumbo
  • 34735 View
  • 205 Score
  • 22 Answer
  • Tags:   regex

Sponsored Content