By Ondra Žižka


2009-10-12 00:28:47 8 Comments

I'd like to port a generic text processing tool, Texy!, from PHP to Java.

This tool does ungreedy matching everywhere, using preg_match_all("/.../U"). So I am looking for a library that has some UNGREEDY flag.

I know I could use the .*? syntax, but there are really many regular expressions I would have to rewrite, and I would have to re-check them with every updated version.

I've checked

  • ORO - seems to be abandoned
  • Jakarta Regexp - no support
  • java.util.regex - no support

Is there any such library?

Thanks, Ondra

4 comments

@brianegge 2009-10-12 02:58:12

You may be able to use 'com.caucho.quercus.lib.regexp.JavaRegexpModule'. Quercus is a Java implementation of PHP, and the regex library implements the PHP regex syntax and method names.

@brianegge 2009-10-12 02:52:45

I suggest you create your own modified Java library. Simply copy the java.util.regex source into your own package.

The Sun JDK 1.6 Pattern.java class defines these constants:

    static final int GREEDY     = 0;
    static final int LAZY       = 1;
    static final int POSSESSIVE = 2;

You'll notice that these flags are only used a couple of times, and it would be trivial to modify. Take the following example:

    case '*':
        ch = next();
        if (ch == '?') {
            next();
            return new Curly(prev, 0, MAX_REPS, LAZY);
        } else if (ch == '+') {
            next();
            return new Curly(prev, 0, MAX_REPS, POSSESSIVE);
        }
        return new Curly(prev, 0, MAX_REPS, GREEDY);

Simply change the last line to use the 'LAZY' flag instead of the GREEDY flag. Since you want a regex library that behaves like the PHP one, this might be the best way to go.

@Ondra Žižka 2009-10-12 03:04:19

Actually, the patch for this RFE would be as simple as replacing the GREEDY in the default return path with a variable created from the flags. Great, I'm gonna submit a patch to JDK :)

@EmFi 2009-10-12 02:08:03

Update: After checking the docs, I found the LAZY flag, which is another term for non-greedy. However, it only appears to be available in OpenJDK.

p = Pattern.compile("your regex here", LAZY);
p.matcher("string to match")

Original (deprecated) response: I honestly don't think there's one.

The whole point of the +? and *? is so that you can choose which sections to do greedily and which ones to do lazily.

Greedy is the default behaviour because that's the most common use of + and * in regular expressions. In fact, I can't think of a single regex parser that does it the other way around, that is, one where the default is lazy matching and a modifier is used to make something greedy.
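To make the difference concrete, here is a small sketch using only java.util.regex. The class name QuantifierDemo, the helper firstMatch, and the sample input are made up for illustration; the point is just how the same quantifier behaves in its greedy, lazy (`?` suffix), and possessive (`+` suffix) forms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    /** Returns the first match of regex in input, or null if nothing matches. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "<b>one</b> <b>two</b>";

        // Greedy: .* takes as much as it can, then backtracks to the LAST </b>.
        System.out.println(firstMatch("<b>.*</b>", input));   // <b>one</b> <b>two</b>

        // Lazy: .*? takes as little as it can, stopping at the FIRST </b>.
        System.out.println(firstMatch("<b>.*?</b>", input));  // <b>one</b>

        // Possessive: .*+ takes everything and never gives it back, so </b> cannot match.
        System.out.println(firstMatch("<b>.*+</b>", input));  // null
    }
}
```

With PHP's /U modifier, the first pattern would behave like the second one here, which is the behaviour the question is trying to get in Java.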

I know this isn't the answer you're looking for, but the only way I think you'll be able to make it work is to add the ? to your *'s and +'s. On the upside, you can use regular expressions to help determine which ones need to be changed, or even to make the changes for you if all of them need to be changed or if you can describe a pattern that identifies which ones do.
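Automating that change could look roughly like the sketch below. It is a hand-written scanner rather than a single regular expression, because quantifier characters inside character classes, after a backslash, or right after "(" must be left alone. The class UngreedyRewriter and its methods are hypothetical names, not part of any library. It swaps greedy and lazy quantifiers and leaves possessive ones untouched, which is my understanding of what /U does. It does not handle \Q...\E quoting, comments mode, or a literal ] placed first in a character class, so treat it as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UngreedyRewriter {

    /** Matches a brace quantifier: {n}, {n,} or {n,m}. */
    private static final Pattern BRACE = Pattern.compile("\\{\\d+(,\\d*)?\\}");

    /** Swaps greedy and lazy quantifiers; possessive quantifiers are left as they are. */
    static String invertGreediness(String regex) {
        StringBuilder out = new StringBuilder();
        int classDepth = 0; // greater than 0 while inside [...]
        int i = 0;
        while (i < regex.length()) {
            char c = regex.charAt(i);
            if (c == '\\' && i + 1 < regex.length()) {
                out.append(c).append(regex.charAt(i + 1)); // escaped character: copy as-is
                i += 2;
                continue;
            }
            if (c == '[') { classDepth++; out.append(c); i++; continue; }
            if (c == ']' && classDepth > 0) { classDepth--; out.append(c); i++; continue; }
            if (classDepth > 0) { out.append(c); i++; continue; } // literal inside a class
            if (c == '(' && i + 1 < regex.length() && regex.charAt(i + 1) == '?') {
                out.append("(?"); // group syntax such as (?: or (?=, not a quantifier
                i += 2;
                continue;
            }
            int end = quantifierEnd(regex, i);
            if (end < 0) { out.append(c); i++; continue; } // ordinary character
            out.append(regex, i, end);
            i = end;
            if (i < regex.length() && regex.charAt(i) == '?') {
                i++;                 // lazy becomes greedy: drop the ?
            } else if (i < regex.length() && regex.charAt(i) == '+') {
                out.append('+');     // possessive stays possessive
                i++;
            } else {
                out.append('?');     // greedy becomes lazy
            }
        }
        return out.toString();
    }

    /** Returns the index just past the quantifier starting at i, or -1 if there is none. */
    private static int quantifierEnd(String regex, int i) {
        char c = regex.charAt(i);
        if (c == '*' || c == '+' || c == '?') return i + 1;
        if (c == '{') {
            Matcher m = BRACE.matcher(regex).region(i, regex.length());
            if (m.lookingAt()) return m.end();
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(invertGreediness("<b>.*</b>"));   // <b>.*?</b>
        System.out.println(invertGreediness("[*+]+x{2,3}")); // [*+]+?x{2,3}?
    }
}
```

Running the converter once at load time and passing the result to Pattern.compile would avoid hand-editing the original expressions, so updated versions of the PHP patterns could be picked up unchanged.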

@hhafez 2009-10-12 02:20:22

So are you saying there is no way to change the default behavior? The fact that the default is "the most common[...]" doesn't mean that having a switch is a bad idea.

@EmFi 2009-10-12 02:32:13

I wasn't saying it's impossible, or even unnecessary. I was just speaking from my experience in a number of languages: I had never even seen a laziness switch for a regular expression until the asker mentioned preg_match_all("/.../U").

@Ondra Žižka 2009-10-12 02:57:47

Wow, if it's in OpenJDK, then there's a good chance of this making it into Sun JDK! And, hopefully, I can take OpenJDK's implementation and use it in Sun JDK. But where did you find it? It's not in the doc: jdocs.com/javase/7.b12/java/util/regex/Pattern.html (which should be OpenJDK's doc).

@EmFi 2009-10-12 03:58:26

Here's where I found it. Take note that it's listed as a final int, which usually means a flag. But there isn't exactly a description, so it may be unimplemented. docjar.com/docs/api/java/util/regex/Pattern.html

@Ondra Žižka 2009-10-12 21:17:01

Unfortunately, this is just a constant used for parsing and processing closures. docjar.com creates docs from the source, and it shows private scope.

@Jeremy Huiskamp 2009-10-12 02:17:32

About the idea of checking and rechecking all regular expressions: are you sure that the PHP and Java libraries agree enough on syntax that you wouldn't have to do this anyway? What I'd do up front is go through them all, write some tests (input and output), and make sure that they work the same in both implementations. Then devise a way to run them automatically, and you will be covered for future upgrades and incompatibilities. You'll still need to tweak stuff, but at least you'll know where.
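A minimal sketch of such an automated check, on the Java side only, might look like this. The class RegexCompatCheck, the Case type, and the sample cases are all hypothetical; in practice the expected matches would be recorded from the PHP original rather than written by hand as they are here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexCompatCheck {

    /** One recorded expectation: a pattern, an input, and the matches the reference produced. */
    static final class Case {
        final String regex;
        final String input;
        final List<String> expected;

        Case(String regex, String input, String... expected) {
            this.regex = regex;
            this.input = input;
            this.expected = Arrays.asList(expected);
        }
    }

    /** Collects every full match of regex in input, in order. */
    static List<String> allMatches(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** Returns the cases whose Java result differs from the recorded expectation. */
    static List<Case> failures(List<Case> cases) {
        List<Case> failed = new ArrayList<Case>();
        for (Case c : cases) {
            if (!allMatches(c.regex, c.input).equals(c.expected)) {
                failed.add(c);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Placeholder expectations written by hand for illustration only.
        List<Case> cases = Arrays.asList(
            new Case("a.*?b", "aXbaYb", "aXb", "aYb"),
            new Case("\\d+", "10 apples, 20 pears", "10", "20"));
        for (Case c : failures(cases)) {
            System.out.println("MISMATCH: /" + c.regex + "/ on \"" + c.input + "\"");
        }
    }
}
```

Keeping the cases as plain data makes it easy to regenerate them whenever the PHP patterns change and to rerun the comparison unchanged.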

@Ondra Žižka 2009-10-12 02:44:00

Well, java.util.regex should be Perl 5 compatible, apart from a few features which, besides this one, are not used in the tool. And sure, I've asked the author of the PHP original to create some tests which would kind of certify other implementations.
