By Igor Katson


2008-12-29 23:32:04 8 Comments

I am tired of always trying to guess whether I should escape special characters like '()[]{}|' etc. when using the many implementations of regexps.

It is different with, for example, Python, sed, grep, awk, Perl, rename, Apache, find and so on. Is there any rule set which tells when I should, and when I should not, escape special characters? Does it depend on the regexp type, like PCRE, POSIX or extended regexps?

8 comments

@Rob Wells 2008-12-30 00:09:19

Sometimes simple escaping is not possible with the characters you've listed. For example, using a backslash to escape a bracket isn't going to work in the left hand side of a substitution string in sed, namely

sed -e 's/foo\(bar/something_else/'

I tend to just use a simple character class definition instead, so the above expression becomes

sed -e 's/foo[(]bar/something_else/'

which I find works for most regexp implementations.

BTW Character classes are pretty vanilla regexp components so they tend to work in most situations where you need escaped characters in regexps.
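As a rough illustration that the character-class form carries over to other flavors, here is a minimal Python sketch. It assumes Python's built-in `re` module (a Perl-style flavor, not sed); the sample input string is made up for the example.

```python
import re

# The character-class form of a literal "(" from the sed example above.
pattern = r"foo[(]bar"

# Python's re module accepts the same pattern unchanged.
result = re.sub(pattern, "something_else", "xx foo(bar yy")
print(result)  # xx something_else yy
```

The same pattern text works in sed and in Python because `[(]` does not depend on whether the flavor treats `(` or `\(` as the grouping operator.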

Edit: After the comment below, just thought I'd mention the fact that you also have to consider the difference between finite state automata and non-finite state automata when looking at the behaviour of regexp evaluation.

You might like to look at "the shiny ball book" aka Effective Perl, specifically the chapter on regular expressions, to get a feel for the difference in regexp engine evaluation types.

Not all the world's a PCRE!

Anyway, regexps are so clunky compared to SNOBOL! Now that was an interesting programming course! Along with the one on Simula.

Ah the joys of studying at UNSW in the late '70's! (-:

@Jonathan Leffler 2008-12-30 08:43:36

'sed' is a command for which plain '(' is not special but '\(' is special; in contrast, PCRE reverses the sense, so '(' is special, but '\(' is not. This is exactly what the OP is asking about.

@Rob Wells 2008-12-31 01:32:40

sed is a *nix utility that uses one of the most primitive sets of regexp evaluation. PCRE doesn't enter into the situation I described, as it involves a different class of (in)finite automata in the way it evaluates regexps. I think my suggestion for the minimum set of regexp syntax still holds.

@Jan Goyvaerts 2008-12-31 07:30:02

On a POSIX-compliant system, sed uses POSIX BRE, which I cover in my answer. The GNU version on modern Linux systems uses POSIX BRE with a few extensions.

@Beejor 2015-08-25 19:12:56

Modern RegEx Flavors (PCRE)

Includes C, C++, Delphi, EditPad, Java, JavaScript, Perl, PHP (preg), PostgreSQL, PowerGREP, PowerShell, Python, REALbasic, Real Studio, Ruby, TCL, VB.Net, VBScript, wxWidgets, XML Schema, Xojo, XRegExp.
PCRE compatibility may vary

    Anywhere: . ^ $ * + - ? ( ) [ ] { } \ |


Legacy RegEx Flavors (BRE/ERE)

Includes awk, ed, egrep, emacs, GNUlib, grep, PHP (ereg), MySQL, Oracle, R, sed.
PCRE support may be enabled in later versions or by using extensions

ERE/awk/egrep/emacs

    Outside a character class: . ^ $ * + ? ( ) [ { } \ |
    Inside a character class: ^ - [ ]

BRE/ed/grep/sed

    Outside a character class: . ^ $ * [ \
    Inside a character class: ^ - [ ]
    For literals, don't escape: + ? ( ) { } |
    For standard regex behavior, escape: \+ \? \( \) \{ \} \|


Notes

  • If unsure about a specific character, it can be escaped like \xFF
  • Alphanumeric characters cannot be escaped with a backslash
  • Arbitrary symbols can be escaped with a backslash in PCRE, but not BRE/ERE (they must only be escaped when required). For PCRE ] - only need escaping within a character class, but I kept them in a single list for simplicity
  • Quoted expression strings must also have the surrounding quote characters escaped, and often with backslashes doubled-up (like "(\")(/)(\\.)" versus /(")(\/)(\.)/ in JavaScript)
  • Aside from escapes, different regex implementations may support different modifiers, character classes, anchors, quantifiers, and other features. For more details, check out regular-expressions.info, or use regex101.com to test your expressions live

@Jan Goyvaerts 2017-02-23 08:05:50

There are many errors in your answer, including but not limited to: None of your "modern" flavors require - or ] to be escaped outside character classes. POSIX (BRE/ERE) doesn't have an escape character inside character classes. The regex flavor in Delphi's RTL is actually based on PCRE. Python, Ruby, and XML have their own flavors that are closer to PCRE than to the POSIX flavors.

@Beejor 2017-03-07 03:15:58

@JanGoyvaerts Thanks for the correction. The flavors you mentioned are indeed closer to PCRE. As for the escapes, I kept them that way for simplicity; it's easier to remember just to escape everywhere than a few exceptions. Power users will know what's up, if they want to avoid a few backslashes. Anyway, I updated my answer with a few clarifications that hopefully address some of this stuff.

@zylstra 2013-10-01 11:22:23

For PHP, "it is always safe to precede a non-alphanumeric with "\" to specify that it stands for itself." - http://php.net/manual/en/regexp.reference.escape.php.

Except if it's a " or '. :/

To escape regex pattern variables (or partial variables) in PHP use preg_quote()
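Other languages offer a comparable helper. As a sketch in Python (not PHP), `re.escape()` plays a similar role to `preg_quote()`; the sample string is invented for illustration, and the exact set of characters `re.escape()` backslashes differs between Python versions, so the escaped text itself is not shown.

```python
import re

# Hypothetical user-supplied text that should be matched literally.
user_input = "1+1=2 (approx.)"

# Used as a pattern directly, "+", "(", ")" and "." keep their special meaning,
# so the text does not even match itself.
naive = re.search(user_input, user_input)

# re.escape() adds the backslashes needed to treat the text as a literal.
pattern = re.escape(user_input)
safe = re.fullmatch(pattern, user_input)

print(naive is None, safe is not None)  # True True
```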

@Jan Goyvaerts 2008-12-30 14:01:58

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\
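To make the first list concrete, here is a small sketch in Python, whose `re` module is Perl-like but not PCRE itself. The helper name `escape_outside_class` is hypothetical; it only backslashes the characters listed above for use outside a character class.

```python
import re

# Characters listed above as needing a backslash outside a character class.
SPECIAL_OUTSIDE_CLASS = set(".^$*+?()[{\\|")

def escape_outside_class(text: str) -> str:
    """Backslash-escape only the listed metacharacters (hypothetical helper)."""
    return "".join("\\" + ch if ch in SPECIAL_OUTSIDE_CLASS else ch for ch in text)

literal = "price: $5.00 (each)"
escaped = escape_outside_class(literal)
print(escaped)  # price: \$5\.00 \(each\)

# The escaped pattern matches the original text exactly.
matched = re.fullmatch(escaped, literal) is not None
```

In practice a library function such as `re.escape()` is the safer choice; the point here is only to show how short the required list is.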

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
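Python's `re` module is not a POSIX flavor, but it accepts the same placement rules, so it can serve as a quick way to see what that class matches. This is only a sketch; the test string is made up.

```python
import re

# "]" first, "^" not first, "-" last: all three are literals, no backslashes.
bracket_class = r"[]^-]"

# The backslash in the input is not part of the class, so it is not matched.
found = re.findall(bracket_class, "a]b^c-d\\e")
print(found)  # [']', '^', '-']
```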

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as \? and \+. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.
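If you build BRE patterns programmatically (for example, to pass to grep or sed), a sketch like the following may help. The helper name `escape_bre_literal` is hypothetical, and the set it escapes is an assumption: the four characters listed above plus `[` and `\`, which are also special outside a character class in BRE. It does not handle a sed delimiter such as `/`, and it leaves `+ ? ( ) { } |` alone because escaping them would turn them into operators.

```python
# Characters treated as special outside a character class in a BRE
# (assumption: the list above plus "[" and "\").
BRE_SPECIAL = set(".^$*[\\")

def escape_bre_literal(text: str) -> str:
    """Escape text so a BRE matches it literally (hypothetical helper)."""
    return "".join("\\" + ch if ch in BRE_SPECIAL else ch for ch in text)

print(escape_bre_literal("foo(bar)"))  # foo(bar)
print(escape_bre_literal("a.b*[c]"))   # a\.b\*\[c]
```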

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

@jackthehipster 2015-01-14 08:23:29

It seems to me you forgot the "/", which also needs to be escaped outside a class.

@Jan Goyvaerts 2015-02-06 23:39:05

/ is not a metacharacter in any of the regular expression flavors I mentioned, so the regular expression syntax does not require escaping it. When a regular expression is quoted as a literal in a programming language, then the string or regex formatting rules of that language may require / or " or ' to be escaped, and may even require `\` to be doubly escaped.
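The two layers of escaping are easy to see in Python, used here only as an example of a host language. In this sketch the regex is always the two characters `\.`; only the way it is written in source code changes.

```python
import re

# Ordinary string literal: the backslash must be doubled for Python itself.
plain = "\\."

# Raw string literal: Python leaves the backslash alone.
raw = r"\."

# Both spell the same two-character regex, which matches a literal dot.
same = plain == raw
dot_only = re.fullmatch(raw, ".") is not None and re.fullmatch(raw, "x") is None

print(same, len(raw), dot_only)  # True 2 True
```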

@nicolallias 2015-05-22 14:05:50

what about colon, ":"? Shall it be escaped inside character classes as well as outside? en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions says "PCRE has consistent escaping rules: any non-alpha-numeric character may be escaped to mean its literal value [...]"

@Jan Goyvaerts 2015-06-09 07:52:00

MAY be escaped is not the same as SHOULD be escaped. The PCRE syntax never requires a literal colon to be escaped, so escaping literal colons only makes your regex harder to read.
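A quick check in Python's `re` module, which behaves like PCRE on this point, shows that the escaped and unescaped colon are equivalent. This is a sketch with a made-up sample string.

```python
import re

text = "key:value"

# Both patterns match the same text; the backslash adds nothing but noise.
unescaped = re.fullmatch(r"key:value", text)
escaped = re.fullmatch(r"key\:value", text)

print(unescaped is not None, escaped is not None)  # True True
```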

@slebetman 2015-08-21 04:47:49

For non-POSIX ERE (the one I use most often because it's what's implemented by Tcl), escaping other things doesn't generate errors.

@goyote 2015-10-08 09:11:45

If you put the hyphen at the start of the character class it does not need escaping.

@AndreKR 2015-11-19 05:59:04

Now should I escape any characters inside character classes or not? At the top you say I should, later you say I shouldn't. Or do you mean "Those characters have a special meaning inside character classes"? That's not the same as "need to be escaped", or is it?

@Jan Goyvaerts 2015-11-20 02:06:31

@AndreKR: The first part of my answer talks about PCRE, which allows you to escape special characters inside character classes with backslashes, and requires backslashes inside character classes to be escaped. The second part of my answer talks about POSIX, which treats the backslash as a literal in character classes, requiring you to use "clever placement" of characters with special meanings in character classes. Clever placement also works with PCRE for all special characters inside character classes except the backslash.

@AndreKR 2015-11-20 04:22:48

Aaah, I see. :)

@Max Starkenburg 2016-04-29 14:57:25

@jackthehipster, I too got stuck on thinking "/" was missing because I temporarily forgot that sed can use delimiters other than a slash.

@Константин Ван 2016-09-24 14:28:56

For JavaScript developers: const escapePCRE = string => string.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); from Mozilla developer network.

@Kevin 2017-02-08 19:10:39

Outside a character class, with PCRE and other Perl-like regexes, if the x flag is activated, you also have to escape # (because it introduces a comment). Inside a character class this is unnecessary.
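Python's `re.VERBOSE` flag behaves the same way as the x flag described here, so it makes a convenient sketch of the effect. The sample text is made up.

```python
import re

text = "a#b"

# Unescaped: under re.VERBOSE, "#b" is a comment, so the pattern is just "a".
unescaped = re.fullmatch(r"a#b", text, re.VERBOSE)

# Escaped with a backslash: "#" is matched literally.
escaped = re.fullmatch(r"a\#b", text, re.VERBOSE)

# Inside a character class no escape is needed.
in_class = re.fullmatch(r"a[#]b", text, re.VERBOSE)

print(unescaped is None, escaped is not None, in_class is not None)  # True True True
```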

@RaisingAgent 2017-06-07 09:41:43

what about " ?

@Jan Goyvaerts 2017-06-08 04:05:27

" and ' are not metacharacters in any regex flavor that I know of. Do not confuse escaping metacharacters in a regex (which this question is about) with escaping delimiters when formatting a regex as a string in source code (which adds a second set of escape rules).

@Mizar 2019-01-10 13:56:21

Did you forget }?

@Jan Goyvaerts 2019-01-12 01:00:16

No, I did not forget }. It does not need to be escaped in any of the flavors mentioned in my reply. Very few flavors require it to be escaped.

@Jonathan Leffler 2008-12-30 00:05:08

POSIX recognizes multiple variations on regular expressions - basic regular expressions (BRE) and extended regular expressions (ERE). And even then, there are quirks because of the historical implementations of the utilities standardized by POSIX.

There isn't a simple rule for when to use which notation, or even which notation a given command uses.

Check out Jeff Friedl's Mastering Regular Expressions book.

@Darron 2008-12-29 23:44:33

Unfortunately, the meaning of things like ( and \( are swapped between Emacs style regular expressions and most other styles. So if you try to escape these you may be doing the opposite of what you want.

So you really have to know what style you are trying to quote.

@Dillie-O 2008-12-29 23:42:45

Unfortunately, there really isn't a fixed set of escape codes, since it varies with the language you are using.

However, keeping a page like the Regular Expression Tools Page or this Regular Expression Cheatsheet can go a long way to help you quickly filter things out.

@Alan Moore 2017-03-07 05:00:26

The Addedbytes cheat sheet is grossly oversimplified, and has some glaring errors. For example, it says \< and \> are word boundaries, which is true only (AFAIK) in the Boost regex library. But elsewhere it says < and > are metacharacters and must be escaped (to \< and \>) to match them literally, which is not true in any flavor.

@Charlie Martin 2008-12-29 23:37:02

Really, there isn't. There are about a half-zillion different regex syntaxes; they seem to come down to Perl, EMACS/GNU, and AT&T in general, but I'm always getting surprised too.
