By Divers

2012-01-19 18:16:23 8 Comments

I have the following code:

public static void createTokens(){
    String test = "test is a word word word word big small";
    Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)").matcher(test);
    while (mtch.find()){
        for (int i = 1; i <= mtch.groupCount(); i++){
            // The loop body is cut off in the post; printing each group is an assumption.
            System.out.println(mtch.group(i));
        }
    }
}

And I get the following output:


But in my opinion it should be:


Could somebody please explain why?


@Garrett Hall 2012-01-19 18:23:35

Because \\s* matches any number of spaces, including zero, the single character w is enough to satisfy (\\s*.+?\\s*). To make sure the group matches a word delimited by spaces, try (\\s+.+?\\s+)
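To make the point above concrete, here is a minimal sketch that runs the pattern from the question and returns both groups. The class name LazyGroupDemo and the method capture() are made up for this illustration; the pattern and the input string are taken from the question as posted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class LazyGroupDemo {
    // Returns the two captured groups for the pattern from the question.
    static String[] capture() {
        String test = "test is a word word word word big small";
        Matcher m = Pattern
                .compile("test is a (\\s*.+?\\s*) word (\\s*.+?\\s*)")
                .matcher(test);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = capture();
        // Group 1 must grow until the literal " word " can follow it.
        System.out.println("[" + groups[0] + "]"); // prints [word]
        // Group 2 has nothing after it, so the lazy .+? stops after one character.
        System.out.println("[" + groups[1] + "]"); // prints [w]
    }
}
```

Both \\s* parts of the second group match zero characters here, so the group consists of the single character that .+? is required to consume.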

@Alan Moore 2012-01-19 18:46:02

Trouble is, the regex is already consuming the space characters before and after the word, so now you're trying to consume them twice.
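A small sketch of what this comment describes, assuming the suggested (\\s+.+?\\s+) groups are dropped into the original pattern with its literal spaces left in place. The class name MandatorySpaceDemo and the method found() are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class MandatorySpaceDemo {
    // Reports whether the pattern with mandatory \s+ finds any match at all.
    static boolean found() {
        String test = "test is a word word word word big small";
        // The literal "test is a " already consumes the space before the first
        // "word", so the \s+ that opens group 1 has no whitespace left to match.
        return Pattern
                .compile("test is a (\\s+.+?\\s+) word (\\s+.+?\\s+)")
                .matcher(test)
                .find();
    }

    public static void main(String[] args) {
        System.out.println(found()); // prints false
    }
}
```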

@Daniel Gray 2017-07-05 10:21:29

All you would need to do is remove the spaces around word in the regex, like ...\\s+)word(\\s+...

@theglauber 2012-01-19 18:22:03

Because your patterns are non-greedy, they match as little text as possible while still producing a match.

Remove the ? in the second group, and you'll get
word word big small

Matcher mtch = Pattern.compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)").matcher(test);
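For reference, a minimal runnable sketch of the change described in this answer. The class name GreedySecondGroupDemo and the method capture() are invented for the example; the pattern is the one shown just above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original post.
class GreedySecondGroupDemo {
    // Returns both groups when only the second group is greedy.
    static String[] capture() {
        String test = "test is a word word word word big small";
        Matcher m = Pattern
                .compile("test is a (\\s*.+?\\s*) word (\\s*.+\\s*)")
                .matcher(test);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = capture();
        System.out.println("[" + groups[0] + "]"); // prints [word]
        // The greedy .+ runs to the end of the input.
        System.out.println("[" + groups[1] + "]"); // prints [word word big small]
    }
}
```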

@Alan Moore 2012-01-19 18:41:13

And now the second group is capturing too much instead of too little. Non-greediness is not the problem, and greediness is not the solution.

@theglauber 2012-01-19 18:49:09

You're correct, but IMHO the non-greediness of the second capturing group explains why it captures just "w". The first capturing group has to capture "word" because of the "word" literal that follows it. I don't know exactly what he's looking for, and he edited the question after I submitted my answer, so I can't supply a correct regexp.
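The thread never states the output the asker expected. Purely as an illustration, and assuming the goal is to capture one whole whitespace-delimited token on each side of the literal word, a sketch using \\S+ could look like this. The class name WholeTokenDemo, the method capture(), and the choice of \\S+ are assumptions of this example, not something proposed in the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the intended output is an assumption.
class WholeTokenDemo {
    // Captures one run of non-whitespace characters on each side of " word ".
    static String[] capture() {
        String test = "test is a word word word word big small";
        // \S+ cannot cross a space, so neither group can shrink to a fragment
        // of a token or swallow the rest of the input.
        Matcher m = Pattern
                .compile("test is a (\\S+) word (\\S+)")
                .matcher(test);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] groups = capture();
        System.out.println("[" + groups[0] + "]"); // prints [word]
        System.out.println("[" + groups[1] + "]"); // prints [word]
    }
}
```

Because the character class itself excludes spaces, the result here does not depend on whether the quantifier is greedy or lazy in the first group, which is in line with the comment that greediness is neither the problem nor the solution.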

