By Matt McManis


2019-03-11 05:55:51 8 Comments

I want to find and separate words in a title that has no spaces.

Before:

ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]"Test"'Test'

After:

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'


I'm looking for a regular expression rule that can do the following.

I thought I'd identify each word by its leading uppercase letter.

But I also want to preserve all-uppercase words, so that they are not spaced out into A L L U P P E R C A S E.

Additional rules:

  • Space a letter if it touches a number: Hello2019World becomes Hello 2019 World
  • Don't space initials that contain periods, hyphens, or underscores: T.E.S.T.
  • Don't add spaces inside brackets, parentheses, or quotes: [Test] (Test) "Test" 'Test'
  • Preserve hyphens: Hello-World

C#

https://rextester.com/GAZJS38767

// Title without spaces
string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";

// Detect where to space words
string[] split =  Regex.Split(title, "(?<!^)(?=(?<![.\\-'\"([{])[A-Z][\\d+]?)");

// Trim each word of extra spaces before joining
split = (from e in split
         select e.Trim()).ToArray();

// Join into new title
string newtitle = string.Join(" ", split);

// Display
Console.WriteLine(newtitle);

Regular expression

I'm having trouble with spacing before the numbers, brackets, parentheses, and quotes.

https://regex101.com/r/9IIYGX/1

(?<!^)(?=(?<![.\-'"([{])(?<![A-Z])[A-Z][\d+?]?)

(?<!^)          // Negative look behind

(?=             // Positive look ahead

(?<![.\-'"([{]) // Ignore if starts with punctuation
(?<![A-Z])      // Ignore if starts with double Uppercase letter
[A-Z]           // Space after each Uppercase letter
[\d+]?          // Space after number

)

Solution

Thanks for all your combined effort in the answers. Here's a Regex example. I'm applying this to file names and have excluded the special characters \/:*?"<>|.

https://rextester.com/FYEVE73725

https://regex101.com/r/xi8L4z/1

4 comments

@Mukyuu 2019-03-11 10:26:25

The first few parts are similar to @revo's answer: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}. In addition, I added the following regex to put a space between a number and a letter: (?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z]). To handle cases like OTPIsADevice, I use a lookahead and a lookbehind to find an uppercase letter next to a lowercase one: (((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Note that | is the alternation (OR) operator, which lets all of these patterns be combined into a single regex.

Regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])|(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))

Update

Improved it a bit:

From: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=[a-z])(?=\d)|(?<=\d)(?=[a-z])|(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])

into: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d which does the same thing.

(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}]) is adapted from the OP's comment, which adds exceptions for some punctuation: (((?<!^)(?<!['([{])[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\\]}!&}])

Final regex: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d|(((?<!^)(?<!\p{P})[A-Z](?=[a-z]))|((?<=[a-z])[A-Z]))|(?<!^)(?=[[({&])|(?<=[)\]}!&}])

@Matt McManis 2019-03-11 20:48:20

This is almost working perfectly. One issue: somewhere in the last part, |(((?<!^)[A-Z](?=[a-z]))|((?<=[a-z])[A-Z])) is not preserving the parentheses, brackets, and quotes. rextester.com/BTA83734

@Matt McManis 2019-03-12 01:53:56

Thanks, your regex has solved the single letter problem. I've added some extra rules at the end to handle the other issues. rextester.com/FYEVE73725

@Michał Turczyn 2019-03-11 07:29:19

Aiming for simplicity rather than one huge regex, I would recommend this code, which uses several small, simple patterns (the explanations are in the code comments):

string str = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)\"Test\"'Test'[Test]";
// insert a space where a lowercase letter is followed by an uppercase letter
str = Regex.Replace(str, "(?<=[a-z])(?=[A-Z])", " ");
// insert a space where a digit is followed by a letter
str = Regex.Replace(str, @"(?<=\d)(?=[A-Za-z])", " ");
// insert a space where a letter is followed by a digit
str = Regex.Replace(str, @"(?<=[A-Za-z])(?=\d)", " ");
// insert a space before one of the characters ("'[ when it is followed by a letter or digit
str = Regex.Replace(str, @"(?=[(\[""'][a-zA-Z0-9])", " ");
// insert a space after one of the characters ])"'
str = Regex.Replace(str, @"(?<=[)\]""'])", " ");

@revo 2019-03-11 07:45:10

If commenting was your main concern, you could enable x-mode or use inline comments, i.e. (?#insert space when there's letter followed by digit).

@Michał Turczyn 2019-03-11 07:46:51

@revo I used standard C# comments :) I think it's more readable.

@revo 2019-03-11 07:49:48

You could also write that kind of readable comment by setting the standard x modifier, which lets you write multiline, indented comments. It's not simple, by the way. Just split.
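For readers unfamiliar with x-mode: in .NET it is enabled with RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The following is only a minimal sketch of how the letter/digit rules from the answer above might look in that style; the class and method names (VerboseSpacing, InsertSpaces) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerboseSpacing
{
    // Hypothetical helper: one commented pattern instead of several Replace calls.
    // IgnorePatternWhitespace lets the pattern span lines and carry # comments.
    private static readonly Regex Boundary = new Regex(@"
        (?<=[a-z])(?=[A-Z])        # lowercase letter followed by an uppercase letter
      | (?<=\d)(?=[A-Za-z])        # digit followed by a letter
      | (?<=[A-Za-z])(?=\d)        # letter followed by a digit
    ", RegexOptions.IgnorePatternWhitespace);

    // Every alternative is zero-width, so Replace only inserts a space.
    public static string InsertSpaces(string text) => Boundary.Replace(text, " ");

    public static void Main()
    {
        Console.WriteLine(InsertSpaces("Hello2019World")); // Hello 2019 World
    }
}
```

Whether this reads better than plain C# comments around separate Replace calls is a matter of taste; the behavior for these three rules should be the same.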

@revo 2019-03-11 07:06:39

You could reduce the requirements, and so shorten the regular expression, by interpreting them differently. For example, the first requirement is the same as saying: put a space before a capital letter unless it is preceded by a punctuation mark or another capital letter.

The following regex works for almost all of the mentioned requirements and may be extended to include or exclude other situations:

(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}

You have to use the Replace() method with " $0" (a space followed by the whole match) as the substitution string.

.NET (See it in action):

string input = @"ThisIsAnExample.TitleHELLO-WORLD2019T.E.S.T.(Test)""Test""'Test'[Test]";
Regex regex = new Regex(@"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}", RegexOptions.Multiline);
Console.WriteLine(regex.Replace(input, @" $0"));

@Matt McManis 2019-03-11 07:10:52

This is an interesting way. Which rule can be added to fix HELLO-WORLD2019 by spacing the 2019?

@revo 2019-03-11 07:17:31

Add (?<=\p{L})\d within an alternation: (?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d.

@Matt McManis 2019-03-11 08:23:42

I have one other issue, single letter words like A and I won't space. ATitleExample becomes ATitle Example.
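To make the discussion so far concrete, here is a small sketch that applies the combined pattern from the comment above with " $0" as the replacement. The wrapper names (TitleSpacer, Space) are hypothetical, and RegexOptions.Multiline is left out because the input is a single line.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleSpacer
{
    // Pattern taken from the comments above: a capital not preceded by the start,
    // a capital, or punctuation; punctuation preceded by punctuation; a digit
    // preceded by a letter.
    private static readonly Regex WordStart = new Regex(
        @"(?<!^|[A-Z\p{P}])[A-Z]|(?<=\p{P})\p{P}|(?<=\p{L})\d");

    // " $0" re-emits each match with a space in front of it.
    public static string Space(string title) => WordStart.Replace(title, " $0");

    public static void Main()
    {
        string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
        // This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'
        Console.WriteLine(Space(title));
        // ATitle Example (the single-letter word A is not separated)
        Console.WriteLine(Space("ATitleExample"));
    }
}
```

The second call reproduces the single-letter limitation: the T in ATitle is preceded by a capital, so the pattern treats AT like the start of an all-uppercase word.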

@revo 2019-03-11 08:29:29

What about something like OTPIsADevice?

@Matt McManis 2019-03-11 08:54:56

It starts to get complicated: OTPIs ADevice. Maybe I can run the output through a second filter. Rules: if a word starts with two uppercase letters (ADevice), add a space after the first letter (A Device). And if an ALL UPPERCASE word ends in a lowercase letter (OTPIs), add a space before the last two letters (OTP Is).
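One possible way to approximate these two rules, offered only as a sketch and not as the filter the OP ended up using, is to also split between two uppercase letters when the second one is followed by a lowercase letter. This is a heuristic: it assumes a run of capitals ends one letter before the first lowercase letter, which will be wrong for some inputs. The names AcronymSpacer and Space are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AcronymSpacer
{
    // First alternative: lowercase followed by uppercase (Is|A).
    // Second alternative: uppercase followed by an uppercase that starts a
    // capitalized word (OTP|Is, A|Device). Assumption: the last capital of a
    // run belongs to the following word.
    private static readonly Regex Boundary = new Regex(
        @"(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    public static string Space(string text) => Boundary.Replace(text, " ");

    public static void Main()
    {
        Console.WriteLine(Space("OTPIsADevice"));  // OTP Is A Device
        Console.WriteLine(Space("ATitleExample")); // A Title Example
        Console.WriteLine(Space("HELLO-WORLD"));   // HELLO-WORLD
    }
}
```

This sketch covers only the letter-case rules; digits, brackets, and quotes would still need the patterns discussed elsewhere in the thread.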

@Tim Biegeleisen 2019-03-11 06:00:57

Here is a regex which seems to work well, at least for your sample input:

(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\W)(?=\W)

This pattern says to split at any boundary where one of the following conditions holds:

  • what precedes is a lowercase letter and what follows is an uppercase letter
  • what precedes is a digit and what follows is a letter (or vice-versa)
  • what precedes and what follows are both non-word characters (e.g. quote, parenthesis, etc.)


string title = "ThisIsAnExampleTitleHELLO-WORLD2019T.E.S.T.(Test)[Test]\"Test\"'Test'";
string[] split =  Regex.Split(title, "(?<=[a-z])(?=[A-Z])|(?<=[0-9])(?=[A-Za-z])|(?<=[A-Za-z])(?=[0-9])|(?<=\\W)(?=\\W)"); 
split = (from e in split select e.Trim()).ToArray();
string newtitle = string.Join(" ", split);

This Is An Example Title HELLO-WORLD 2019 T.E.S.T. (Test) [Test] "Test" 'Test'

Note: You might also want to add this assertion to the regex alternation:

(?<=\W)(?=\w)|(?<=\w)(?=\W)

We got away with this here, because this boundary condition never happened. But you might need it with other inputs.

@Matt McManis 2019-03-11 07:35:29

I ran into one issue, when it comes to single letter words like A and I, it will not separate because it uses the ALL UPPERCASE rule (two uppercase next to each other). ATitleExample becomes ATitle Example.

@Tim Biegeleisen 2019-03-11 07:36:27

@MattMcManis This is an edge case which will potentially break all of the answers given here. You would need to do more work to cover such cases.

@Matt McManis 2019-03-11 07:38:29

Maybe I can run the output of this through a second regex to fix those.
