By Codemonkey

2015-08-12 11:05:21

(And colons and periods, though I'm sure I can make that modification myself!)

After extracting data from a PDF I have lots of "merged" fields where they've overlapped, such as


Which I want to split into

John Doe

I have a couple hundred of these, so I'm hoping this is possible with a regex - I feel it should be, but can't quite get my head around matching multiple parts from a string and returning them concatenated together?


Works, but returns separate matches for each character, rather than one string?


@Andris Leduskrasts 2015-08-12 11:29:01

A regex match is always one contiguous piece of the input, so parts that are not adjacent will come back as separate matches; that's just how regex works. Also, \d+|[:.] is probably slightly better, as each run of digits will be matched together.

As for your predicament, you can use something like (\d+|[:.])|[\s\S]*? on regex101 and substitute with $1. The added alternation is there to remove all the other characters (though, granted, it leaves a space for each one of them, so it looks odd).
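
Since each match comes back separately, one way to end up with a single string is to join the matches in code. Below is a minimal sketch in Java, assuming Java is an acceptable target (the question does not name a language). The class name MatchJoiner, the method joinMatches, and the sample input are made up for illustration and are not from the original data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchJoiner {
    // Matches a run of digits, or a single colon or period.
    private static final Pattern NUMERIC_PART = Pattern.compile("\\d+|[:.]");

    // Concatenates every match found in the input, in order of appearance.
    static String joinMatches(String input) {
        Matcher matcher = NUMERIC_PART.matcher(input);
        StringBuilder joined = new StringBuilder();
        while (matcher.find()) {
            joined.append(matcher.group());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // Hypothetical merged field, invented for this sketch.
        System.out.println(joinMatches("Jo15:hn 24.D81oe"));
    }
}
```

The loop is the part that does the concatenation; the pattern itself still produces one match per run of digits or per punctuation character.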

@m.cekiera 2015-08-12 11:40:46

Another solution, though the details depend on the language: use two regexes, such as [\d:.] and [^\d:.], or [a-zA-Z] and [^a-zA-Z], and then apply a regex-replace function of the kind found in many languages, such as replaceAll. An example in Java:

String str = example.replaceAll("[\\d:.]", ""); // result: JohnDode
String time = example.replaceAll("[^\\d:.]", ""); // result: 15:24.81

This takes two operations, but there is no need to use groups, etc.
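
For anyone who wants to try this directly, a self-contained version of the two-replacement idea might look like the sketch below. The class FieldSplitter, its method names, and the sample input are hypothetical, since the original merged string is not shown in the thread.

```java
final class FieldSplitter {
    // Removes digits, colons and periods, keeping everything else.
    static String textPart(String merged) {
        return merged.replaceAll("[\\d:.]", "");
    }

    // Removes everything except digits, colons and periods.
    static String timePart(String merged) {
        return merged.replaceAll("[^\\d:.]", "");
    }

    public static void main(String[] args) {
        // Hypothetical merged field, invented for this sketch.
        String merged = "Jo15:hn 24.D81oe";
        System.out.println(textPart(merged));
        System.out.println(timePart(merged));
    }
}
```

Note that the first replacement keeps spaces, because a space is not in the removed character class.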

