By Ben


2019-03-14 20:00:41 8 Comments

I'm trying to extract Twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so:

library(stringr)

# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")

[[1]]
character(0)

[[2]]
[1] "Ahello" "Ame"   

Great. Now let's try the same thing using "@" instead of "A":

str_extract_all(c("h@i", "hi @hello @me"), "(?<=\\b)\\@[^\\s]+")

[[1]]
[1] "@i"

[[2]]
character(0)

Why does this example give the opposite of the result I was expecting, and how can I fix it?

3 comments

@Wiktor Stribiżew 2019-03-14 21:47:14

A couple of things about your regex:

  • (?<=\b) is the same as \b because a word boundary is already a zero width assertion
  • \@ is the same as @, as @ is not a special regex metacharacter and you do not have to escape it
  • [^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
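The three equivalences above can be checked directly. The following is a minimal sketch in base R with perl = TRUE (PCRE) rather than stringr, assuming the question's first test string is "h@i"; the helper name extract_all is hypothetical.

```r
# Test strings from the question (first element assumed to be "h@i").
x <- c("h@i", "hi @hello @me")

# Hypothetical helper: return all matches of `pattern` for each element of `text`.
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))
}

original   <- extract_all("(?<=\\b)\\@[^\\s]+", x)  # pattern as written in the question
simplified <- extract_all("\\b@\\S+", x)            # same pattern, simplified

# Both patterns should extract exactly the same substrings.
identical(original, simplified)
```

If the two patterns are equivalent, the last line evaluates to TRUE, so the simplification changes readability only, not behaviour.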

So, your regex, \b@\S+, matches @i in h@i because there is a word boundary between h (a letter, a word char) and @ (a non-word char, not a letter, digit or underscore). Check this regex debugger.

\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B@\S+, and it will match an @ that is either preceded by a non-word char or at the start of the string.

x <- c("h@i", "hi @hello @me")
regmatches(x, gregexpr("\\B@\\S+", x))
## => [[1]]
## character(0)
## 
## [[2]]
## [1] "@hello" "@me"   

See the regex demo.

If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries built from lookarounds, either with stringr methods or with base R regex functions and the perl=TRUE argument:

regmatches(x, gregexpr("(?<!\\w)@\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)@\\S+", x, perl=TRUE))

where:

  • (?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
  • (?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
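To see where the two lookbehinds differ, here is a small sketch on a made-up input string (the string and variable names are illustrative, not from the original answer). The difference shows up when "@" follows punctuation.

```r
# Illustrative input: "@" after "(", after ".", after a space, and after a letter.
x <- "(@paren) .@dot @plain h@i"

# (?<!\w): the "@" must not be preceded by a word character.
after_non_word <- regmatches(x, gregexpr("(?<!\\w)@\\S+", x, perl = TRUE))[[1]]

# (?<!\S): the "@" must not be preceded by any non-whitespace character.
after_space <- regmatches(x, gregexpr("(?<!\\S)@\\S+", x, perl = TRUE))[[1]]

after_non_word  # "@paren)" "@dot" "@plain"
after_space     # "@plain"
```

Neither pattern matches the "@i" in "h@i"; only the stricter (?<!\S) version also rejects handles glued to punctuation.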

See this regex demo and another regex demo here.

Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
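As a rough illustration of the right-hand boundaries, the sketch below pairs each left boundary with its counterpart. Using \w+ for the handle body and the sample sentence are assumptions made for the example, not part of the original answer.

```r
# Illustrative input with handles followed by ",", a space, and "!".
x <- "ping @alice, @bob and @carol!"

# Word-style boundaries on both sides: trailing punctuation is simply left out.
word_bounded <- regmatches(x, gregexpr("(?<!\\w)@\\w+(?!\\w)", x, perl = TRUE))[[1]]

# Whitespace-style boundaries on both sides: the handle must stand alone
# between whitespace (or the string edges).
space_bounded <- regmatches(x, gregexpr("(?<!\\S)@\\w+(?!\\S)", x, perl = TRUE))[[1]]

word_bounded   # "@alice" "@bob" "@carol"
space_bounded  # "@bob"
```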

@MokeEire 2019-03-14 20:33:00

The answer above should suffice. This will remove the @ symbol in case you are trying to get the users' names only.

str_extract_all(c("@tweeter tweet", "h@i", "tweet @tweeter2"), "(?<=\\B\\@)[^\\s]+")
[[1]]
[1] "tweeter"

[[2]]
character(0)

[[3]]
[1] "tweeter2"

While I am no expert with regex, it seems like the issue may be that the @ symbol does not correspond to a word character, and thus matching the empty string at the beginning of a word (\\b) does not work because there is no empty string when @ is preceding the word.

Here are two great regex resources in case you hadn't seen them:

@MrFlick 2019-03-14 20:34:46

This seems to also grab the "i" in "h@i", which the OP was trying to avoid.

@MokeEire 2019-03-14 21:00:04

You're absolutely right, I missed that. I updated it now.

@MrFlick 2019-03-14 20:08:37

It looks like you probably mean

str_extract_all(c("h@i", "hi @hello @me", "@twitter"), "(?<=^|\\s)@[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "@hello" "@me" 
# [[3]]
# [1] "@twitter"

The \b in a regular expression is a word boundary, and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." (see here). Since a space and "@" are both non-word characters, there is no boundary before the "@".
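A quick way to check this reasoning is to test whether a word boundary (\b) or a non-boundary (\B) sits directly before the "@". This is a sketch in base R with perl = TRUE; the three test strings are chosen for illustration.

```r
# "@" after a letter, after a space, and at the start of the string.
x <- c("h@i", "hi @me", "@me")

# TRUE where a word boundary immediately precedes the "@".
boundary_before_at <- grepl("\\b@", x, perl = TRUE)

# TRUE where a non-boundary immediately precedes the "@".
non_boundary_before_at <- grepl("\\B@", x, perl = TRUE)

boundary_before_at      # TRUE FALSE FALSE
non_boundary_before_at  # FALSE TRUE TRUE
```

Only the letter-then-"@" case has a boundary, which is why the original \b pattern matched the unwanted string and skipped the real handles.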

With this revision you match either the start of the string or values that come after spaces.

@Gregor 2019-03-14 20:11:01

As a side-note, I think (?<=\\b) (from OP's original) is equivalent to \\b, since the match will not include the boundary change. However, in the solution the positive look-behind is indeed needed since we do not want the space to be part of the match.
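The point about the look-behind can be illustrated by comparing it with a plain group that consumes the space. This sketch uses base R with perl = TRUE instead of stringr, and the variable names are made up for the example.

```r
x <- "hi @hello @me"

# Plain group: the leading whitespace becomes part of each match.
consumed <- regmatches(x, gregexpr("(?:^|\\s)@\\S+", x, perl = TRUE))[[1]]

# Look-behind: the whitespace is checked but not included in the match.
looked_behind <- regmatches(x, gregexpr("(?<=^|\\s)@\\S+", x, perl = TRUE))[[1]]

consumed       # " @hello" " @me"
looked_behind  # "@hello" "@me"
```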
