C# More intuitive way to split a string into tokens? - c#

I have a method which takes in a string, which contains various characters, but I'm only concerned about underscores '_' and dollar signs '$'. I want to split up the string into tokens by underscores as each piece b/w the underscores contains important information.
However, if a $ is contained in an area between underscores, then a token should be created from the last occurrence of an underscore to the end (ignoring any underscores in this last section).
Example
input: Hello_To_The$Great_World
expected tokens: Hello, To, The$Great_World
Question
I have a solution below, but I'm wondering is there a cleaner/more intuitive way of doing this than what I have below?
var aTokens = new List<string>();
var aPos = 0;
for (var aNum = 0; aNum < item.Length; aNum++)
{
if (aNum == item.Length - 1)
{
aTokens.Add(item.Substring(aPos, item.Length - aPos));
break;
}
if (item[aNum] == '$')
{
aTokens.Add(item.Substring(aPos, item.Length - aPos));
break;
}
if (item[aNum] == '_')
{
aTokens.Add(item.Substring(aPos, aNum - aPos));
aPos = aNum + 1;
}
}

You can split string by _ not having $ before them.
For that you can use the following regex:
(?<!\$.*)_
Sample code:
string input = "Hello_To_The$Great_World";
string[] output = Regex.Split(input, #"(?<!\$.*)_");
You also can do the task without regex and without loops, but with the help of 2 splits:
string input = "Hello_To_The$Great_World";
string[] temp = input.Split(new[] { '$' }, 2);
string[] output = temp[0].Split('_');
if (temp.Length > 1)
output[output.Length - 1] = output[output.Length - 1] + "$" + temp[1];

This method is not efficient or clean, but it gives you a general idea of how to do this:
Split your string into tokens
Find the index of the first string to contain $
Return a new array with the first n tokens and the final token is the remaining strings concatenated.
It's probably more useful to take advantage of IEnumerable or do things over a for loop instead of all this Array.Copy stuff... but you get the gist of it.
private string[] SomeMethod(string arg)
{
var strings = arg.Split(new[] { '_' });
var indexedValue = strings.Select((v, i) => new { Value = v, Index = i }).FirstOrDefault(x => x.Value.Contains("$"));
if (indexedValue != null)
{
var count = indexedValue.Index + 1;
string[] final = new string[count];
Array.Copy(strings, 0, final, 0, indexedValue.Index);
final[indexedValue.Index] = String.Join("_", strings, indexedValue.Index, strings.Length - indexedValue.Index);
return final;
}
return strings;
}

Here's my version (loops are so last year...)
const char dollar = '$';
const char underscore = '_';
var item = "Hello_To_The$Great_World";
var aTokens = new List<string>();
int dollarIndex = item.IndexOf(dollar);
if (dollarIndex >= 0)
{
int lastUnderscoreIndex = item.LastIndexOf(underscore, dollarIndex);
if (lastUnderscoreIndex >= 0)
{
aTokens.AddRange(item.Substring(0, lastUnderscoreIndex).Split(underscore));
aTokens.Add(item.Substring(lastUnderscoreIndex + 1));
}
else
{
aTokens.Add(item);
}
}
else
{
aTokens.AddRange(item.Split(underscore));
}
Edit:
I should have added, cleaner/more intuitive is very subjective, as you have found out by the variety of answers provided. From a maintainability point of view, it's much more important that the method you write to do the parsing is unit tested!
It's also an interesting exercise to test the performance of the various methods posted here - it quickly becomes apparent that your original version is much faster than using regular expressions! (Although in a real life situation, it's probably quite unlikely that the performance of this method will make any difference to your application!)

Related

C#: Need to split a string into a string[] and keeping the delimiter (also a string) at the beginning of the string

I think I am too dumb to solve this problem...
I have some formulas which need to be "translated" from one syntax to another.
Let's say I have a formula that goes like that (it's a simple one, others have many "Ceilings" in it):
string formulaString = "If([Param1] = 0, 1, Ceiling([Param2] / 0.55) * [Param3])";
I need to replace "Ceiling()" with "Ceiling(; 1)" (basically, insert "; 1" before the ")").
My attempt is to split the fomulaString at "Ceiling(" so I am able to iterate through the string array and insert my string at the correct index (counting every "(" and ")" to get the right index)
What I have so far:
//splits correct, but loses "CEILING("
string[] parts = formulaString.Split(new[] { "CEILING(" }, StringSplitOptions.None);
//splits almost correct, "CEILING(" is in another group
string[] parts = Regex.Split(formulaString, #"(CEILING\()");
//splits almost every letter
string[] parts = Regex.Split(formulaString, #"(?=[(CEILING\()])");
When everything is done, I concat the string so I have my complete formula again.
What do I have to set as Regex pattern to achieve this sample? (Or any other method that will help me)
part1 = "If([Param1] = 0, 1, ";
part2 = "Ceiling([Param2] / 0.55) * [Param3])";
//part3 = next "CEILING(" in a longer formula and so on...
As I mention in a comment, you almost got it: (?=Ceiling). This is incomplete for your use case unfortunately.
I need to replace "Ceiling()" with "Ceiling(; 1)" (basically, insert "; 1" before the ")").
Depending on your regex engine (for example JS) this works:
string[] parts = Regex.Split(formulaString, #"(?<=Ceiling\([^)]*(?=\)))");
string modifiedFormula = String.join("; 1", parts);
The regex
(?<=Ceiling\([^)]*(?=\)))
(?<= ) Positive lookbehind
Ceiling\( Search for literal "Ceiling("
[^)] Match any char which is not ")" ..
* .. 0 or more times
(?=\)) Positive lookahead for ")", effectively making us stop before the ")"
This regex is a zero-assertion, therefore nothing is lost and it will cut your strings before the last ")" in every "Ceiling()".
This solution would break whenever you have nested "Ceiling()". Then your only solution would be writing your own parser for the same reasons why you can't parse markup with regex.
Regex.Replace(formulaString, #"(?<=Ceiling\()(.*?)(?=\))","$1; 1");
Note: This will not work for nested "Ceilings", but it does for Ceiling(), It will also not work fir Ceiling(AnotherFunc(x)). For that you need something like:
Regex.Replace(formulaString, #"(?<=Ceiling\()((.*\((?>[^()]+|(?1))*\))*|[^\)]*)(\))","$1; 1$3");
but I could not get that to work with .NET, only in JavaScript.
This is my solution:
private string ConvertCeiling(string formula)
{
int ceilingsCount = formula.CountOccurences("Ceiling(");
int startIndex = 0;
int bracketCounter;
for (int i = 0; i < ceilingsCount; i++)
{
startIndex = formula.IndexOf("Ceiling(", startIndex);
bracketCounter = 0;
for (int j = 0; j < formula.Length; j++)
{
if (j < startIndex) continue;
var c = formula[j];
if (c == '(')
{
bracketCounter++;
}
if (c == ')')
{
bracketCounter--;
if (bracketCounter == 0)
{
// found end
formula = formula.Insert(j, "; 1");
startIndex++;
break;
}
}
}
}
return formula;
}
And CountOccurence:
public static int CountOccurences(this string value, string parameter)
{
int counter = 0;
int startIndex = 0;
int indexOfCeiling;
do
{
indexOfCeiling = value.IndexOf(parameter, startIndex);
if (indexOfCeiling < 0)
{
break;
}
else
{
startIndex = indexOfCeiling + 1;
counter++;
}
} while (true);
return counter;
}

How to stop String.Concat(); from removing whitespaces? [duplicate]

I would like to split a string with delimiters but keep the delimiters in the result.
How would I do this in C#?
If the split chars were ,, ., and ;, I'd try:
using System.Text.RegularExpressions;
...
string[] parts = Regex.Split(originalString, #"(?<=[.,;])")
(?<=PATTERN) is positive look-behind for PATTERN. It should match at any place where the preceding text fits PATTERN so there should be a match (and a split) after each occurrence of any of the characters.
If you want the delimiter to be its "own split", you can use Regex.Split e.g.:
string input = "plum-pear";
string pattern = "(-)";
string[] substrings = Regex.Split(input, pattern); // Split on hyphens
foreach (string match in substrings)
{
Console.WriteLine("'{0}'", match);
}
// The method writes the following to the console:
// 'plum'
// '-'
// 'pear'
So if you are looking for splitting a mathematical formula, you can use the following Regex
#"([*()\^\/]|(?<!E)[\+\-])"
This will ensure you can also use constants like 1E-02 and avoid having them split into 1E, - and 02
So:
Regex.Split("10E-02*x+sin(x)^2", #"([*()\^\/]|(?<!E)[\+\-])")
Yields:
10E-02
*
x
+
sin
(
x
)
^
2
Building off from BFree's answer, I had the same goal, but I wanted to split on an array of characters similar to the original Split method, and I also have multiple splits per string:
public static IEnumerable<string> SplitAndKeep(this string s, char[] delims)
{
int start = 0, index;
while ((index = s.IndexOfAny(delims, start)) != -1)
{
if(index-start > 0)
yield return s.Substring(start, index - start);
yield return s.Substring(index, 1);
start = index + 1;
}
if (start < s.Length)
{
yield return s.Substring(start);
}
}
Just in case anyone wants this answer aswell...
Instead of string[] parts = Regex.Split(originalString, #"(?<=[.,;])") you could use string[] parts = Regex.Split(originalString, #"(?=yourmatch)") where yourmatch is whatever your separator is.
Supposing the original string was
777- cat
777 - dog
777 - mouse
777 - rat
777 - wolf
Regex.Split(originalString, #"(?=777)") would return
777 - cat
777 - dog
and so on
This version does not use LINQ or Regex and so it's probably relatively efficient. I think it might be easier to use than the Regex because you don't have to worry about escaping special delimiters. It returns an IList<string> which is more efficient than always converting to an array. It's an extension method, which is convenient. You can pass in the delimiters as either an array or as multiple parameters.
/// <summary>
/// Splits the given string into a list of substrings, while outputting the splitting
/// delimiters (each in its own string) as well. It's just like String.Split() except
/// the delimiters are preserved. No empty strings are output.</summary>
/// <param name="s">String to parse. Can be null or empty.</param>
/// <param name="delimiters">The delimiting characters. Can be an empty array.</param>
/// <returns></returns>
public static IList<string> SplitAndKeepDelimiters(this string s, params char[] delimiters)
{
var parts = new List<string>();
if (!string.IsNullOrEmpty(s))
{
int iFirst = 0;
do
{
int iLast = s.IndexOfAny(delimiters, iFirst);
if (iLast >= 0)
{
if (iLast > iFirst)
parts.Add(s.Substring(iFirst, iLast - iFirst)); //part before the delimiter
parts.Add(new string(s[iLast], 1));//the delimiter
iFirst = iLast + 1;
continue;
}
//No delimiters were found, but at least one character remains. Add the rest and stop.
parts.Add(s.Substring(iFirst, s.Length - iFirst));
break;
} while (iFirst < s.Length);
}
return parts;
}
Some unit tests:
text = "[a link|http://www.google.com]";
result = text.SplitAndKeepDelimiters('[', '|', ']');
Assert.IsTrue(result.Count == 5);
Assert.AreEqual(result[0], "[");
Assert.AreEqual(result[1], "a link");
Assert.AreEqual(result[2], "|");
Assert.AreEqual(result[3], "http://www.google.com");
Assert.AreEqual(result[4], "]");
A lot of answers to this! One I knocked up to split by various strings (the original answer caters for just characters i.e. length of 1). This hasn't been fully tested.
public static IEnumerable<string> SplitAndKeep(string s, params string[] delims)
{
var rows = new List<string>() { s };
foreach (string delim in delims)//delimiter counter
{
for (int i = 0; i < rows.Count; i++)//row counter
{
int index = rows[i].IndexOf(delim);
if (index > -1
&& rows[i].Length > index + 1)
{
string leftPart = rows[i].Substring(0, index + delim.Length);
string rightPart = rows[i].Substring(index + delim.Length);
rows[i] = leftPart;
rows.Insert(i + 1, rightPart);
}
}
}
return rows;
}
This seems to work, but its not been tested much.
public static string[] SplitAndKeepSeparators(string value, char[] separators, StringSplitOptions splitOptions)
{
List<string> splitValues = new List<string>();
int itemStart = 0;
for (int pos = 0; pos < value.Length; pos++)
{
for (int sepIndex = 0; sepIndex < separators.Length; sepIndex++)
{
if (separators[sepIndex] == value[pos])
{
// add the section of string before the separator
// (unless its empty and we are discarding empty sections)
if (itemStart != pos || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, pos - itemStart));
}
itemStart = pos + 1;
// add the separator
splitValues.Add(separators[sepIndex].ToString());
break;
}
}
}
// add anything after the final separator
// (unless its empty and we are discarding empty sections)
if (itemStart != value.Length || splitOptions == StringSplitOptions.None)
{
splitValues.Add(value.Substring(itemStart, value.Length - itemStart));
}
return splitValues.ToArray();
}
Recently I wrote an extension method do to this:
public static class StringExtensions
{
public static IEnumerable<string> SplitAndKeep(this string s, string seperator)
{
string[] obj = s.Split(new string[] { seperator }, StringSplitOptions.None);
for (int i = 0; i < obj.Length; i++)
{
string result = i == obj.Length - 1 ? obj[i] : obj[i] + seperator;
yield return result;
}
}
}
I'd say the easiest way to accomplish this (except for the argument Hans Kesting brought up) is to split the string the regular way, then iterate over the array and add the delimiter to every element but the last.
To avoid adding character to new line try this :
string[] substrings = Regex.Split(input,#"(?<=[-])");
result = originalString.Split(separator);
for(int i = 0; i < result.Length - 1; i++)
result[i] += separator;
(EDIT - this is a bad answer - I misread his question and didn't see that he was splitting by multiple characters.)
(EDIT - a correct LINQ version is awkward, since the separator shouldn't get concatenated onto the final string in the split array.)
Iterate through the string character by character (which is what regex does anyway.
When you find a splitter, then spin off a substring.
pseudo code
int hold, counter;
List<String> afterSplit;
string toSplit
for(hold = 0, counter = 0; counter < toSplit.Length; counter++)
{
if(toSplit[counter] = /*split charaters*/)
{
afterSplit.Add(toSplit.Substring(hold, counter));
hold = counter;
}
}
That's sort of C# but not really. Obviously, choose the appropriate function names.
Also, I think there might be an off-by-1 error in there.
But that will do what you're asking.
veggerby's answer modified to
have no string items in the list
have fixed string as delimiter like "ab" instead of single character
var delimiter = "ab";
var text = "ab33ab9ab"
var parts = Regex.Split(text, $#"({Regex.Escape(delimiter)})")
.Where(p => p != string.Empty)
.ToList();
// parts = "ab", "33", "ab", "9", "ab"
The Regex.Escape() is there just in case your delimiter contains characters which regex interprets as special pattern commands (like *, () and thus have to be escaped.
using System.Collections.Generic;
using System.Text.RegularExpressions;
namespace ConsoleApplication9
{
class Program
{
static void Main(string[] args)
{
string input = #"This;is:a.test";
char sep0 = ';', sep1 = ':', sep2 = '.';
string pattern = string.Format("[{0}{1}{2}]|[^{0}{1}{2}]+", sep0, sep1, sep2);
Regex regex = new Regex(pattern);
MatchCollection matches = regex.Matches(input);
List<string> parts=new List<string>();
foreach (Match match in matches)
{
parts.Add(match.ToString());
}
}
}
}
I wanted to do a multiline string like this but needed to keep the line breaks so I did this
string x =
#"line 1 {0}
line 2 {1}
";
foreach(var line in string.Format(x, "one", "two")
.Split("\n")
.Select(x => x.Contains('\r') ? x + '\n' : x)
.AsEnumerable()
) {
Console.Write(line);
}
yields
line 1 one
line 2 two
I came across same problem but with multiple delimiters. Here's my solution:
public static string[] SplitLeft(this string #this, char[] delimiters, int count)
{
var splits = new List<string>();
int next = -1;
while (splits.Count + 1 < count && (next = #this.IndexOfAny(delimiters, next + 1)) >= 0)
{
splits.Add(#this.Substring(0, next));
#this = new string(#this.Skip(next).ToArray());
}
splits.Add(#this);
return splits.ToArray();
}
Sample with separating CamelCase variable names:
var variableSplit = variableName.SplitLeft(
Enumerable.Range('A', 26).Select(i => (char)i).ToArray());
I wrote this code to split and keep delimiters:
private static string[] SplitKeepDelimiters(string toSplit, char[] delimiters, StringSplitOptions splitOptions = StringSplitOptions.None)
{
var tokens = new List<string>();
int idx = 0;
for (int i = 0; i < toSplit.Length; ++i)
{
if (delimiters.Contains(toSplit[i]))
{
tokens.Add(toSplit.Substring(idx, i - idx)); // token found
tokens.Add(toSplit[i].ToString()); // delimiter
idx = i + 1; // start idx for the next token
}
}
// last token
tokens.Add(toSplit.Substring(idx));
if (splitOptions == StringSplitOptions.RemoveEmptyEntries)
{
tokens = tokens.Where(token => token.Length > 0).ToList();
}
return tokens.ToArray();
}
Usage example:
string toSplit = "AAA,BBB,CCC;DD;,EE,";
char[] delimiters = new char[] {',', ';'};
string[] tokens = SplitKeepDelimiters(toSplit, delimiters, StringSplitOptions.RemoveEmptyEntries);
foreach (var token in tokens)
{
Console.WriteLine(token);
}

C# How to generate a new string based on multiple ranged index

Let's say I have a string like this one, left part is a word, right part is a collection of indices (single or range) used to reference furigana (phonetics) for kanjis in my word:
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす"
The pattern in detail:
word,<startIndex>(-<endIndex>):<furigana>
What would be the best way to achieve something like this (with a space in front of the kanji to mark which part is linked to the [furigana]):
子[こ]で 子[こ]にならぬ 時鳥[ほととぎす]
Edit: (thanks for your comments guys)
Here is what I wrote so far:
static void Main(string[] args)
{
string myString = "ABCDEF,1:test;3:test2";
//Split Kanjis / Indices
string[] tokens = myString.Split(',');
//Extract furigana indices
string[] indices = tokens[1].Split(';');
//Dictionnary to store furigana indices
Dictionary<string, string> furiganaIndices = new Dictionary<string, string>();
//Collect
foreach (string index in indices)
{
string[] splitIndex = index.Split(':');
furiganaIndices.Add(splitIndex[0], splitIndex[1]);
}
//Processing
string result = tokens[0] + ",";
for (int i = 0; i < tokens[0].Length; i++)
{
string currentIndex = i.ToString();
if (furiganaIndices.ContainsKey(currentIndex)) //add [furigana]
{
string currentFurigana = furiganaIndices[currentIndex].ToString();
result = result + " " + tokens[0].ElementAt(i) + string.Format("[{0}]", currentFurigana);
}
else //nothing to add
{
result = result + tokens[0].ElementAt(i);
}
}
File.AppendAllText(#"D:\test.txt", result + Environment.NewLine);
}
Result:
ABCDEF,A B[test]C D[test2]EF
I struggle to find a way to process ranged indices:
string myString = "ABCDEF,1:test;2-3:test2";
Result : ABCDEF,A B[test] CD[test2]EF
I don't have anything against manually manipulating strings per se. But given that you seem to have a regular pattern describing the inputs, it seems to me that a solution that uses regex would be more maintainable and readable. So with that in mind, here's an example program that takes that approach:
class Program
{
private const string _kinvalidFormatException = "Invalid format for edit specification";
private static readonly Regex
regex1 = new Regex(#"(?<word>[^,]+),(?<edit>(?:\d+)(?:-(?:\d+))?:(?:[^;]+);?)+", RegexOptions.Compiled),
regex2 = new Regex(#"(?<start>\d+)(?:-(?<end>\d+))?:(?<furigana>[^;]+);?", RegexOptions.Compiled);
static void Main(string[] args)
{
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす";
string result = EditString(myString);
}
private static string EditString(string myString)
{
Match editsMatch = regex1.Match(myString);
if (!editsMatch.Success)
{
throw new ArgumentException(_kinvalidFormatException);
}
int ichCur = 0;
string input = editsMatch.Groups["word"].Value;
StringBuilder text = new StringBuilder();
foreach (Capture capture in editsMatch.Groups["edit"].Captures)
{
Match oneEditMatch = regex2.Match(capture.Value);
if (!oneEditMatch.Success)
{
throw new ArgumentException(_kinvalidFormatException);
}
int start, end;
if (!int.TryParse(oneEditMatch.Groups["start"].Value, out start))
{
throw new ArgumentException(_kinvalidFormatException);
}
Group endGroup = oneEditMatch.Groups["end"];
if (endGroup.Success)
{
if (!int.TryParse(endGroup.Value, out end))
{
throw new ArgumentException(_kinvalidFormatException);
}
}
else
{
end = start;
}
text.Append(input.Substring(ichCur, start - ichCur));
if (text.Length > 0)
{
text.Append(' ');
}
ichCur = end + 1;
text.Append(input.Substring(start, ichCur - start));
text.Append(string.Format("[{0}]", oneEditMatch.Groups["furigana"]));
}
if (ichCur < input.Length)
{
text.Append(input.Substring(ichCur));
}
return text.ToString();
}
}
Notes:
This implementation assumes that the edit specifications will be listed in order and won't overlap. It makes no attempt to validate that part of the input; depending on where you are getting your input from you may want to add that. If it's valid for the specifications to be listed out of order, you can also extend the above to first store the edits in a list and sort the list by the start index before actually editing the string. (In similar fashion to the way the other proposed answer works; though, why they are using a dictionary instead of a simple list to store the individual edits, I have no idea…that seems arbitrarily complicated to me.)
I included basic input validation, throwing exceptions where failures occur in the pattern matching. A more user-friendly implementation would add more specific information to each exception, describing what part of the input actually was invalid.
The Regex class actually has a Replace() method, which allows for complete customization. The above could have been implemented that way, using Replace() and a MatchEvaluator to provide the replacement text, instead of just appending text to a StringBuilder. Which way to do it is mostly a matter of preference, though the MatchEvaluator might be preferred if you have a need for more flexible implementation options (i.e. if the exact format of the result can vary).
If you do choose to use the other proposed answer, I strongly recommend you use StringBuilder instead of simply concatenating onto the results variable. For short strings it won't matter much, but you should get into the habit of always using StringBuilder when you have a loop that is incrementally adding onto a string value, because for long string the performance implications of using concatenation can be very negative.
This should do it (and even handle ranged indices), based on the formatting of the input string you have-
using System;
using System.Collections.Generic;
public class stringParser
{
private struct IndexElements
{
public int start;
public int end;
public string value;
}
public static void Main()
{
//input string
string myString = "子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす";
int wordIndexSplit = myString.IndexOf(',');
string word = myString.Substring(0,wordIndexSplit);
string indices = myString.Substring(wordIndexSplit + 1);
string[] eachIndex = indices.Split(';');
Dictionary<int,IndexElements> index = new Dictionary<int,IndexElements>();
string[] elements;
IndexElements e;
int dash;
int n = 0;
int last = -1;
string results = "";
foreach (string s in eachIndex)
{
e = new IndexElements();
elements = s.Split(':');
if (elements[0].Contains("-"))
{
dash = elements[0].IndexOf('-');
e.start = int.Parse(elements[0].Substring(0,dash));
e.end = int.Parse(elements[0].Substring(dash + 1));
}
else
{
e.start = int.Parse(elements[0]);
e.end = e.start;
}
e.value = elements[1];
index.Add(n,e);
n++;
}
//this is the part that takes the "setup" from the parts above and forms the result string
//loop through each of the "indices" parsed above
for (int i = 0; i < index.Count; i++)
{
//if this is the first iteration through the loop, and the first "index" does not start
//at position 0, add the beginning characters before its start
if (last == -1 && index[i].start > 0)
{
results += word.Substring(0,index[i].start);
}
//if this is not the first iteration through the loop, and the previous iteration did
//not stop at the position directly before the start of the current iteration, add
//the intermediary chracters
else if (last != -1 && last + 1 != index[i].start)
{
results += word.Substring(last + 1,index[i].start - (last + 1));
}
//add the space before the "index" match, the actual match, and then the formatted "index"
results += " " + word.Substring(index[i].start,(index[i].end - index[i].start) + 1)
+ "[" + index[i].value + "]";
//remember the position of the ending for the next iteration
last = index[i].end;
}
//if the last "index" did not stop at the end of the input string, add the remaining characters
if (index[index.Keys.Count - 1].end + 1 < word.Length)
{
results += word.Substring(index[index.Keys.Count-1].end + 1);
}
//trimming spaces that may be left behind
results = results.Trim();
Console.WriteLine("INPUT - " + myString);
Console.WriteLine("OUTPUT - " + results);
Console.Read();
}
}
input - 子で子にならぬ時鳥,0:こ;2:こ;7-8:ほととぎす
output - 子[こ]で 子[こ]にならぬ 時鳥[ほととぎす]
Note that this should also work with characters the English alphabet if you wanted to use English instead-
input - iliketocodeverymuch,2:A;4-6:B;9-12:CDEFG
output - il i[A]k eto[B]co deve[CDEFG]rymuch

How to split a string with delimited as pipe (which is not inside double quotes

I have a string like below, which is pipe separated. it has double quotes around string (ex: "ANI").
How do I split this with pipe delimiter (which are not inside double quotes) ?
511186|"ANI"|"ABCD-102091474|E|EFG"||"2013-07-20 13:47:19.556"
And splitted values shoule be like below:
511186
"ANI"
"ABCD-102091474|E|EFG"
"2013-07-20 13:47:19.556"
Any help would be appreciated!
EDIT
The answer that I accepted, did not work for those strings which has double quotes inside. Any idea, what should be the issue ?
using System.Text.RegularExpressions;
string regexFormat = string.Format(#"(?:^|\{0})(""[^""]*""|[^\{0}]*)", '|');
string[] result = Regex.Matches("111001103|\"E\"|\"BBB\"|\"XXX\"|||10000009|153086649|\"BCTV\"|\"REV\"|||1.00000000|||||\"ABC-BT AD\"|\"\"\"ABC - BT\"\" AD\"|||\"N\"||\"N\"|||\"N\"||\"N",regexFormat)
.Cast<Match>().Select(m => m.Groups[1].Value).ToArray();
foreach(var i in result)
Console.WriteLine(i)
You can use a regular expression to match the items in the string:
string[] result = Regex.Matches(s, #"(?:^|\|)(""[^""]*""|[^|]*)")
.Cast<Match>()
.Select(m => m.Groups[1].Value)
.ToArray();
Explanation:
(?: A non-capturing group
^|\| Matches start of string or a pipe character
) End of group
( Capturing group
"[^"]*" Zero or more non-quotes surrounded by quotes
| Or
[^|]* Zero or more non-pipes
) End of group
Here is one way to do it:
public List<string> Parse(string str)
{
var parts = str.Split(new[] {"|"}, StringSplitOptions.None);
List<string> result = new List<string>();
for (int i = 0; i < parts.Length; i++)
{
string part = parts[i];
if (IsPartStart(part))
{
List<string> sub_parts = new List<string>();
do
{
sub_parts.Add(part);
i++;
part = parts[i];
} while (!IsPartEnd(part));
sub_parts.Add(part);
part = string.Join("|", sub_parts);
}
result.Add(part);
}
return result;
}
private bool IsPartStart(string part)
{
return (part.StartsWith("\"") && !part.EndsWith("\"")) ;
}
private bool IsPartEnd(string part)
{
return (!part.StartsWith("\"") && part.EndsWith("\""));
}
This works by splitting everything, and it then joins some of the parts that needs joining by searching for parts that starts with " and corresponding parts that ends with ".
string.Split("|", inputString);
...will give you the individual parts, but will fail if any of the parts have a pipe separator in them.
If it's a CSV file, following all the usual CSV rules about character-escaping, etc. (but using a pipe symbol instead of comma), then you should look at using CsvHelper, a NuGet package designed for reading and writing CSV files. It does all the hard work, and deals with all the corner cases that you'd otherwise have to do yourself.
Here's how I'd do it. It's fairly simple and I think you'll find it's very fast as well. I haven't run any tests, but I'm pretty confident that it's faster than regular expressions.
IEnumerable<string> Parse(string s)
{
int pos = 0;
while (pos < s.Length)
{
char endChar = '|';
// Test for quoted value
if (s[pos] == '"')
{
pos++;
endChar = '"';
}
// Extract this value
int newPos = s.IndexOf(endChar, pos);
if (newPos < 0)
newPos = s.Length;
yield return s.Substring(pos, newPos - pos);
// Move to start of next value
pos = newPos + 1;
if (pos < s.Length && s[pos] == '|')
pos++;
}
}

Add spaces before Capital Letters

Given the string "ThisStringHasNoSpacesButItDoesHaveCapitals" what is the best way to add spaces before the capital letters. So the end string would be "This String Has No Spaces But It Does Have Capitals"
Here is my attempt with a RegEx
System.Text.RegularExpressions.Regex.Replace(value, "[A-Z]", " $0")
The regexes will work fine (I even voted up Martin Browns answer), but they are expensive (and personally I find any pattern longer than a couple of characters prohibitively obtuse)
This function
string AddSpacesToSentence(string text, bool preserveAcronyms)
{
if (string.IsNullOrWhiteSpace(text))
return string.Empty;
StringBuilder newText = new StringBuilder(text.Length * 2);
newText.Append(text[0]);
for (int i = 1; i < text.Length; i++)
{
if (char.IsUpper(text[i]))
if ((text[i - 1] != ' ' && !char.IsUpper(text[i - 1])) ||
(preserveAcronyms && char.IsUpper(text[i - 1]) &&
i < text.Length - 1 && !char.IsUpper(text[i + 1])))
newText.Append(' ');
newText.Append(text[i]);
}
return newText.ToString();
}
Will do it 100,000 times in 2,968,750 ticks, the regex will take 25,000,000 ticks (and thats with the regex compiled).
It's better, for a given value of better (i.e. faster) however it's more code to maintain. "Better" is often compromise of competing requirements.
Update
It's a good long while since I looked at this, and I just realised the timings haven't been updated since the code changed (it only changed a little).
On a string with 'Abbbbbbbbb' repeated 100 times (i.e. 1,000 bytes), a run of 100,000 conversions takes the hand coded function 4,517,177 ticks, and the Regex below takes 59,435,719 making the Hand coded function run in 7.6% of the time it takes the Regex.
Update 2
Will it take Acronyms into account? It will now!
The logic of the if statment is fairly obscure, as you can see expanding it to this ...
if (char.IsUpper(text[i]))
if (char.IsUpper(text[i - 1]))
if (preserveAcronyms && i < text.Length - 1 && !char.IsUpper(text[i + 1]))
newText.Append(' ');
else ;
else if (text[i - 1] != ' ')
newText.Append(' ');
... doesn't help at all!
Here's the original simple method that doesn't worry about Acronyms
string AddSpacesToSentence(string text)
{
if (string.IsNullOrWhiteSpace(text))
return "";
StringBuilder newText = new StringBuilder(text.Length * 2);
newText.Append(text[0]);
for (int i = 1; i < text.Length; i++)
{
if (char.IsUpper(text[i]) && text[i - 1] != ' ')
newText.Append(' ');
newText.Append(text[i]);
}
return newText.ToString();
}
Your solution has an issue in that it puts a space before the first letter T so you get
" This String..." instead of "This String..."
To get around this look for the lower case letter preceding it as well and then insert the space in the middle:
newValue = Regex.Replace(value, "([a-z])([A-Z])", "$1 $2");
Edit 1:
If you use #"(\p{Ll})(\p{Lu})" it will pick up accented characters as well.
Edit 2:
If your strings can contain acronyms you may want to use this:
newValue = Regex.Replace(value, #"((?<=\p{Ll})\p{Lu})|((?!\A)\p{Lu}(?>\p{Ll}))", " $0");
So "DriveIsSCSICompatible" becomes "Drive Is SCSI Compatible"
Didn't test performance, but here in one line with linq:
var val = "ThisIsAStringToTest";
val = string.Concat(val.Select(x => Char.IsUpper(x) ? " " + x : x.ToString())).TrimStart(' ');
I know this is an old one, but this is an extension I use when I need to do this:
public static class Extensions
{
public static string ToSentence( this string Input )
{
return new string(Input.SelectMany((c, i) => i > 0 && char.IsUpper(c) ? new[] { ' ', c } : new[] { c }).ToArray());
}
}
This will allow you to use MyCasedString.ToSentence()
I set out to make a simple extension method based on Binary Worrier's code which will handle acronyms properly, and is repeatable (won't mangle already spaced words). Here is my result.
public static string UnPascalCase(this string text)
{
if (string.IsNullOrWhiteSpace(text))
return "";
var newText = new StringBuilder(text.Length * 2);
newText.Append(text[0]);
for (int i = 1; i < text.Length; i++)
{
var currentUpper = char.IsUpper(text[i]);
var prevUpper = char.IsUpper(text[i - 1]);
var nextUpper = (text.Length > i + 1) ? char.IsUpper(text[i + 1]) || char.IsWhiteSpace(text[i + 1]): prevUpper;
var spaceExists = char.IsWhiteSpace(text[i - 1]);
if (currentUpper && !spaceExists && (!nextUpper || !prevUpper))
newText.Append(' ');
newText.Append(text[i]);
}
return newText.ToString();
}
Here are the unit test cases this function passes. I added most of tchrist's suggested cases to this list. The three of those it doesn't pass (two are just Roman numerals) are commented out:
Assert.AreEqual("For You And I", "ForYouAndI".UnPascalCase());
Assert.AreEqual("For You And The FBI", "ForYouAndTheFBI".UnPascalCase());
Assert.AreEqual("A Man A Plan A Canal Panama", "AManAPlanACanalPanama".UnPascalCase());
Assert.AreEqual("DNS Server", "DNSServer".UnPascalCase());
Assert.AreEqual("For You And I", "For You And I".UnPascalCase());
Assert.AreEqual("Mount Mᶜ Kinley National Park", "MountMᶜKinleyNationalPark".UnPascalCase());
Assert.AreEqual("El Álamo Tejano", "ElÁlamoTejano".UnPascalCase());
Assert.AreEqual("The Ævar Arnfjörð Bjarmason", "TheÆvarArnfjörðBjarmason".UnPascalCase());
Assert.AreEqual("Il Caffè Macchiato", "IlCaffèMacchiato".UnPascalCase());
//Assert.AreEqual("Mister Dženan Ljubović", "MisterDženanLjubović".UnPascalCase());
//Assert.AreEqual("Ole King Henry Ⅷ", "OleKingHenryⅧ".UnPascalCase());
//Assert.AreEqual("Carlos Ⅴº El Emperador", "CarlosⅤºElEmperador".UnPascalCase());
Assert.AreEqual("For You And The FBI", "For You And The FBI".UnPascalCase());
Assert.AreEqual("A Man A Plan A Canal Panama", "A Man A Plan A Canal Panama".UnPascalCase());
Assert.AreEqual("DNS Server", "DNS Server".UnPascalCase());
Assert.AreEqual("Mount Mᶜ Kinley National Park", "Mount Mᶜ Kinley National Park".UnPascalCase());
Welcome to Unicode
All these solutions are essentially wrong for modern text. You need to use something that understands case. Since Bob asked for other languages, I'll give a couple for Perl.
I provide four solutions, ranging from worst to best. Only the best one is always right. The others have problems. Here is a test run to show you what works and what doesn’t, and where. I’ve used underscores so that you can see where the spaces have been put, and I’ve marked as wrong anything that is, well, wrong.
Testing TheLoneRanger
Worst: The_Lone_Ranger
Ok: The_Lone_Ranger
Better: The_Lone_Ranger
Best: The_Lone_Ranger
Testing MountMᶜKinleyNationalPark
[WRONG] Worst: Mount_MᶜKinley_National_Park
[WRONG] Ok: Mount_MᶜKinley_National_Park
[WRONG] Better: Mount_MᶜKinley_National_Park
Best: Mount_Mᶜ_Kinley_National_Park
Testing ElÁlamoTejano
[WRONG] Worst: ElÁlamo_Tejano
Ok: El_Álamo_Tejano
Better: El_Álamo_Tejano
Best: El_Álamo_Tejano
Testing TheÆvarArnfjörðBjarmason
[WRONG] Worst: TheÆvar_ArnfjörðBjarmason
Ok: The_Ævar_Arnfjörð_Bjarmason
Better: The_Ævar_Arnfjörð_Bjarmason
Best: The_Ævar_Arnfjörð_Bjarmason
Testing IlCaffèMacchiato
[WRONG] Worst: Il_CaffèMacchiato
Ok: Il_Caffè_Macchiato
Better: Il_Caffè_Macchiato
Best: Il_Caffè_Macchiato
Testing MisterDženanLjubović
[WRONG] Worst: MisterDženanLjubović
[WRONG] Ok: MisterDženanLjubović
Better: Mister_Dženan_Ljubović
Best: Mister_Dženan_Ljubović
Testing OleKingHenryⅧ
[WRONG] Worst: Ole_King_HenryⅧ
[WRONG] Ok: Ole_King_HenryⅧ
[WRONG] Better: Ole_King_HenryⅧ
Best: Ole_King_Henry_Ⅷ
Testing CarlosⅤºElEmperador
[WRONG] Worst: CarlosⅤºEl_Emperador
[WRONG] Ok: CarlosⅤº_El_Emperador
[WRONG] Better: CarlosⅤº_El_Emperador
Best: Carlos_Ⅴº_El_Emperador
BTW, almost everyone here has selected the first way, the one marked "Worst". A few have selected the second way, marked "OK". But no one else before me has shown you how to do either the "Better" or the "Best" approach.
Here is the test program with its four methods:
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
# First I'll prove these are fine variable names:
my (
$TheLoneRanger ,
$MountMᶜKinleyNationalPark ,
$ElÁlamoTejano ,
$TheÆvarArnfjörðBjarmason ,
$IlCaffèMacchiato ,
$MisterDženanLjubović ,
$OleKingHenryⅧ ,
$CarlosⅤºElEmperador ,
);
# Now I'll load up some string with those values in them:
my #strings = qw{
TheLoneRanger
MountMᶜKinleyNationalPark
ElÁlamoTejano
TheÆvarArnfjörðBjarmason
IlCaffèMacchiato
MisterDženanLjubović
OleKingHenryⅧ
CarlosⅤºElEmperador
};
my($new, $best, $ok);
my $mask = " %10s %-8s %s\n";
for my $old (#strings) {
print "Testing $old\n";
($best = $old) =~ s/(?<=\p{Lowercase})(?=[\p{Uppercase}\p{Lt}])/_/g;
($new = $old) =~ s/(?<=[a-z])(?=[A-Z])/_/g;
$ok = ($new ne $best) && "[WRONG]";
printf $mask, $ok, "Worst:", $new;
($new = $old) =~ s/(?<=\p{Ll})(?=\p{Lu})/_/g;
$ok = ($new ne $best) && "[WRONG]";
printf $mask, $ok, "Ok:", $new;
($new = $old) =~ s/(?<=\p{Ll})(?=[\p{Lu}\p{Lt}])/_/g;
$ok = ($new ne $best) && "[WRONG]";
printf $mask, $ok, "Better:", $new;
($new = $old) =~ s/(?<=\p{Lowercase})(?=[\p{Uppercase}\p{Lt}])/_/g;
$ok = ($new ne $best) && "[WRONG]";
printf $mask, $ok, "Best:", $new;
}
When you can score the same as the "Best" on this dataset, you’ll know you’ve done it correctly. Until then, you haven’t. No one else here has done better than "Ok", and most have done it "Worst". I look forward to seeing someone post the correct ℂ♯ code.
I notice that StackOverflow’s highlighting code is miserably stoopid again. They’re making all the same old lame as (most but not all) of the rest of the poor approaches mentioned here have made. Isn’t it long past time to put ASCII to rest? It doens’t make sense anymore, and pretending it’s all you have is simply wrong. It makes for bad code.
This Regex places a space character in front of every capital letter:
using System.Text.RegularExpressions;
const string myStringWithoutSpaces = "ThisIsAStringWithoutSpaces";
var myStringWithSpaces = Regex.Replace(myStringWithoutSpaces, "([A-Z])([a-z]*)", " $1$2");
Mind the space in front if "$1$2", this is what will get it done.
This is the outcome:
"This Is A String Without Spaces"
Binary Worrier, I have used your suggested code, and it is rather good, I have just one minor addition to it:
public static string AddSpacesToSentence(string text)
{
if (string.IsNullOrEmpty(text))
return "";
StringBuilder newText = new StringBuilder(text.Length * 2);
newText.Append(text[0]);
for (int i = 1; i < result.Length; i++)
{
if (char.IsUpper(result[i]) && !char.IsUpper(result[i - 1]))
{
newText.Append(' ');
}
else if (i < result.Length)
{
if (char.IsUpper(result[i]) && !char.IsUpper(result[i + 1]))
newText.Append(' ');
}
newText.Append(result[i]);
}
return newText.ToString();
}
I have added a condition !char.IsUpper(text[i - 1]). This fixed a bug that would cause something like 'AverageNOX' to be turned into 'Average N O X', which is obviously wrong, as it should read 'Average NOX'.
Sadly this still has the bug that if you have the text 'FromAStart', you would get 'From AStart' out.
Any thoughts on fixing this?
Inspired from #MartinBrown,
Two Lines of Simple Regex, which will resolve your name, including Acyronyms anywhere in the string.
public string ResolveName(string name)
{
var tmpDisplay = Regex.Replace(name, "([^A-Z ])([A-Z])", "$1 $2");
return Regex.Replace(tmpDisplay, "([A-Z]+)([A-Z][^A-Z$])", "$1 $2").Trim();
}
Here's mine:
private string SplitCamelCase(string s)
{
Regex upperCaseRegex = new Regex(#"[A-Z]{1}[a-z]*");
MatchCollection matches = upperCaseRegex.Matches(s);
List<string> words = new List<string>();
foreach (Match match in matches)
{
words.Add(match.Value);
}
return String.Join(" ", words.ToArray());
}
Make sure you aren't putting spaces at the beginning of the string, but you are putting them between consecutive capitals. Some of the answers here don't address one or both of those points. There are other ways than regex, but if you prefer to use that, try this:
Regex.Replace(value, #"\B[A-Z]", " $0")
The \B is a negated \b, so it represents a non-word-boundary. It means the pattern matches "Y" in XYzabc, but not in Yzabc or X Yzabc. As a little bonus, you can use this on a string with spaces in it and it won't double them.
What you have works perfectly. Just remember to reassign value to the return value of this function.
value = System.Text.RegularExpressions.Regex.Replace(value, "[A-Z]", " $0");
Here is how you could do it in SQL
create FUNCTION dbo.PascalCaseWithSpace(#pInput AS VARCHAR(MAX)) RETURNS VARCHAR(MAX)
BEGIN
declare #output varchar(8000)
set #output = ''
Declare #vInputLength INT
Declare #vIndex INT
Declare #vCount INT
Declare #PrevLetter varchar(50)
SET #PrevLetter = ''
SET #vCount = 0
SET #vIndex = 1
SET #vInputLength = LEN(#pInput)
WHILE #vIndex <= #vInputLength
BEGIN
IF ASCII(SUBSTRING(#pInput, #vIndex, 1)) = ASCII(Upper(SUBSTRING(#pInput, #vIndex, 1)))
begin
if(#PrevLetter != '' and ASCII(#PrevLetter) = ASCII(Lower(#PrevLetter)))
SET #output = #output + ' ' + SUBSTRING(#pInput, #vIndex, 1)
else
SET #output = #output + SUBSTRING(#pInput, #vIndex, 1)
end
else
begin
SET #output = #output + SUBSTRING(#pInput, #vIndex, 1)
end
set #PrevLetter = SUBSTRING(#pInput, #vIndex, 1)
SET #vIndex = #vIndex + 1
END
return #output
END
replaceAll("(?<=[^^\\p{Uppercase}])(?=[\\p{Uppercase}])"," ");
static string AddSpacesToColumnName(string columnCaption)
{
if (string.IsNullOrWhiteSpace(columnCaption))
return "";
StringBuilder newCaption = new StringBuilder(columnCaption.Length * 2);
newCaption.Append(columnCaption[0]);
int pos = 1;
for (pos = 1; pos < columnCaption.Length-1; pos++)
{
if (char.IsUpper(columnCaption[pos]) && !(char.IsUpper(columnCaption[pos - 1]) && char.IsUpper(columnCaption[pos + 1])))
newCaption.Append(' ');
newCaption.Append(columnCaption[pos]);
}
newCaption.Append(columnCaption[pos]);
return newCaption.ToString();
}
In Ruby, via Regexp:
"FooBarBaz".gsub(/(?!^)(?=[A-Z])/, ' ') # => "Foo Bar Baz"
I took Kevin Strikers excellent solution and converted to VB. Since i'm locked into .NET 3.5, i also had to write IsNullOrWhiteSpace. This passes all of his tests.
<Extension()>
Public Function IsNullOrWhiteSpace(value As String) As Boolean
If value Is Nothing Then
Return True
End If
For i As Integer = 0 To value.Length - 1
If Not Char.IsWhiteSpace(value(i)) Then
Return False
End If
Next
Return True
End Function
<Extension()>
Public Function UnPascalCase(text As String) As String
If text.IsNullOrWhiteSpace Then
Return String.Empty
End If
Dim newText = New StringBuilder()
newText.Append(text(0))
For i As Integer = 1 To text.Length - 1
Dim currentUpper = Char.IsUpper(text(i))
Dim prevUpper = Char.IsUpper(text(i - 1))
Dim nextUpper = If(text.Length > i + 1, Char.IsUpper(text(i + 1)) Or Char.IsWhiteSpace(text(i + 1)), prevUpper)
Dim spaceExists = Char.IsWhiteSpace(text(i - 1))
If (currentUpper And Not spaceExists And (Not nextUpper Or Not prevUpper)) Then
newText.Append(" ")
End If
newText.Append(text(i))
Next
Return newText.ToString()
End Function
The question is a bit old but nowadays there is a nice library on Nuget that does exactly this as well as many other conversions to human readable text.
Check out Humanizer on GitHub or Nuget.
Example
"PascalCaseInputStringIsTurnedIntoSentence".Humanize() => "Pascal case input string is turned into sentence"
"Underscored_input_string_is_turned_into_sentence".Humanize() => "Underscored input string is turned into sentence"
"Underscored_input_String_is_turned_INTO_sentence".Humanize() => "Underscored input String is turned INTO sentence"
// acronyms are left intact
"HTML".Humanize() => "HTML"
Seems like a good opportunity for Aggregate. This is designed to be readable, not necessarily especially fast.
someString
.Aggregate(
new StringBuilder(),
(str, ch) => {
if (char.IsUpper(ch) && str.Length > 0)
str.Append(" ");
str.Append(ch);
return str;
}
).ToString();
Found a lot of these answers to be rather obtuse but I haven't fully tested my solution, but it works for what I need, should handle acronyms, and is much more compact/readable than the others IMO:
private string CamelCaseToSpaces(string s)
{
if (string.IsNullOrEmpty(s)) return string.Empty;
StringBuilder stringBuilder = new StringBuilder();
for (int i = 0; i < s.Length; i++)
{
stringBuilder.Append(s[i]);
int nextChar = i + 1;
if (nextChar < s.Length && char.IsUpper(s[nextChar]) && !char.IsUpper(s[i]))
{
stringBuilder.Append(" ");
}
}
return stringBuilder.ToString();
}
In addition to Martin Brown's Answer, I had an issue with numbers as well. For Example: "Location2", or "Jan22" should be "Location 2", and "Jan 22" respectively.
Here is my Regular Expression for doing that, using Martin Brown's answer:
"((?<=\p{Ll})\p{Lu})|((?!\A)\p{Lu}(?>\p{Ll}))|((?<=[\p{Ll}\p{Lu}])\p{Nd})|((?<=\p{Nd})\p{Lu})"
Here are a couple great sites for figuring out what each part means as well:
Java Based Regular Expression Analyzer (but works for most .net regex's)
Action Script Based Analyzer
The above regex won't work on the action script site unless you replace all of the \p{Ll} with [a-z], the \p{Lu} with [A-Z], and \p{Nd} with [0-9].
Here's my solution, based on Binary Worriers suggestion and building in Richard Priddys' comments, but also taking into account that white space may exist in the provided string, so it won't add white space next to existing white space.
public string AddSpacesBeforeUpperCase(string nonSpacedString)
{
if (string.IsNullOrEmpty(nonSpacedString))
return string.Empty;
StringBuilder newText = new StringBuilder(nonSpacedString.Length * 2);
newText.Append(nonSpacedString[0]);
for (int i = 1; i < nonSpacedString.Length; i++)
{
char currentChar = nonSpacedString[i];
// If it is whitespace, we do not need to add another next to it
if(char.IsWhiteSpace(currentChar))
{
continue;
}
char previousChar = nonSpacedString[i - 1];
char nextChar = i < nonSpacedString.Length - 1 ? nonSpacedString[i + 1] : nonSpacedString[i];
if (char.IsUpper(currentChar) && !char.IsWhiteSpace(nextChar)
&& !(char.IsUpper(previousChar) && char.IsUpper(nextChar)))
{
newText.Append(' ');
}
else if (i < nonSpacedString.Length)
{
if (char.IsUpper(currentChar) && !char.IsWhiteSpace(nextChar) && !char.IsUpper(nextChar))
{
newText.Append(' ');
}
}
newText.Append(currentChar);
}
return newText.ToString();
}
For anyone who is looking for a C++ function answering this same question, you can use the following. This is modeled after the answer given by #Binary Worrier. This method just preserves Acronyms automatically.
using namespace std;
void AddSpacesToSentence(string& testString)
stringstream ss;
ss << testString.at(0);
for (auto it = testString.begin() + 1; it != testString.end(); ++it )
{
int index = it - testString.begin();
char c = (*it);
if (isupper(c))
{
char prev = testString.at(index - 1);
if (isupper(prev))
{
if (index < testString.length() - 1)
{
char next = testString.at(index + 1);
if (!isupper(next) && next != ' ')
{
ss << ' ';
}
}
}
else if (islower(prev))
{
ss << ' ';
}
}
ss << c;
}
cout << ss.str() << endl;
The tests strings I used for this function, and the results are:
"helloWorld" -> "hello World"
"HelloWorld" -> "Hello World"
"HelloABCWorld" -> "Hello ABC World"
"HelloWorldABC" -> "Hello World ABC"
"ABCHelloWorld" -> "ABC Hello World"
"ABC HELLO WORLD" -> "ABC HELLO WORLD"
"ABCHELLOWORLD" -> "ABCHELLOWORLD"
"A" -> "A"
A C# solution for an input string that consists only of ASCII characters. The regex incorporates negative lookbehind to ignore a capital (upper case) letter that appears at the beginning of the string. Uses Regex.Replace() to return the desired string.
Also see regex101.com demo.
using System;
using System.Text.RegularExpressions;
public class RegexExample
{
public static void Main()
{
var text = "ThisStringHasNoSpacesButItDoesHaveCapitals";
// Use negative lookbehind to match all capital letters
// that do not appear at the beginning of the string.
var pattern = "(?<!^)([A-Z])";
var rgx = new Regex(pattern);
var result = rgx.Replace(text, " $1");
Console.WriteLine("Input: [{0}]\nOutput: [{1}]", text, result);
}
}
Expected Output:
Input: [ThisStringHasNoSpacesButItDoesHaveCapitals]
Output: [This String Has No Spaces But It Does Have Capitals]
Update: Here's a variation that will also handle acronyms (sequences of upper-case letters).
Also see regex101.com demo and ideone.com demo.
using System;
using System.Text.RegularExpressions;
public class RegexExample
{
public static void Main()
{
var text = "ThisStringHasNoSpacesASCIIButItDoesHaveCapitalsLINQ";
// Use positive lookbehind to locate all upper-case letters
// that are preceded by a lower-case letter.
var patternPart1 = "(?<=[a-z])([A-Z])";
// Used positive lookbehind and lookahead to locate all
// upper-case letters that are preceded by an upper-case
// letter and followed by a lower-case letter.
var patternPart2 = "(?<=[A-Z])([A-Z])(?=[a-z])";
var pattern = patternPart1 + "|" + patternPart2;
var rgx = new Regex(pattern);
var result = rgx.Replace(text, " $1$2");
Console.WriteLine("Input: [{0}]\nOutput: [{1}]", text, result);
}
}
Expected Output:
Input: [ThisStringHasNoSpacesASCIIButItDoesHaveCapitalsLINQ]
Output: [This String Has No Spaces ASCII But It Does Have Capitals LINQ]
Here is a more thorough solution that doesn't put spaces in front of words:
Note: I have used multiple Regexs (not concise but it will also handle acronyms and single letter words)
Dim s As String = "ThisStringHasNoSpacesButItDoesHaveCapitals"
s = System.Text.RegularExpressions.Regex.Replace(s, "([a-z])([A-Z](?=[A-Z])[a-z]*)", "$1 $2")
s = System.Text.RegularExpressions.Regex.Replace(s, "([A-Z])([A-Z][a-z])", "$1 $2")
s = System.Text.RegularExpressions.Regex.Replace(s, "([a-z])([A-Z][a-z])", "$1 $2")
s = System.Text.RegularExpressions.Regex.Replace(s, "([a-z])([A-Z][a-z])", "$1 $2") // repeat a second time
In:
"ThisStringHasNoSpacesButItDoesHaveCapitals"
"IAmNotAGoat"
"LOLThatsHilarious!"
"ThisIsASMSMessage"
Out:
"This String Has No Spaces But It Does Have Capitals"
"I Am Not A Goat"
"LOL Thats Hilarious!"
"This Is ASMS Message" // (Difficult to handle single letter words when they are next to acronyms.)
All the previous responses looked too over complicated.
I had string that had a mixture of capitals and _ so used, string.Replace() to make the _, " " and used the following to add a space at the capital letters.
for (int i = 0; i < result.Length; i++)
{
if (char.IsUpper(result[i]))
{
counter++;
if (i > 1) //stops from adding a space at if string starts with Capital
{
result = result.Insert(i, " ");
i++; //Required** otherwise stuck in infinite
//add space loop over a single capital letter.
}
}
}
Inspired by Binary Worrier answer I took a swing at this.
Here's the result:
/// <summary>
/// String Extension Method
/// Adds white space to strings based on Upper Case Letters
/// </summary>
/// <example>
/// strIn => "HateJPMorgan"
/// preserveAcronyms false => "Hate JP Morgan"
/// preserveAcronyms true => "Hate JPMorgan"
/// </example>
/// <param name="strIn">to evaluate</param>
/// <param name="preserveAcronyms" >determines saving acronyms (Optional => false) </param>
public static string AddSpaces(this string strIn, bool preserveAcronyms = false)
{
if (string.IsNullOrWhiteSpace(strIn))
return String.Empty;
var stringBuilder = new StringBuilder(strIn.Length * 2)
.Append(strIn[0]);
int i;
for (i = 1; i < strIn.Length - 1; i++)
{
var c = strIn[i];
if (Char.IsUpper(c) && (Char.IsLower(strIn[i - 1]) || (preserveAcronyms && Char.IsLower(strIn[i + 1]))))
stringBuilder.Append(' ');
stringBuilder.Append(c);
}
return stringBuilder.Append(strIn[i]).ToString();
}
Did test using stopwatch running 10000000 iterations and various string lengths and combinations.
On average 50% (maybe a bit more) faster than Binary Worrier answer.
private string GetProperName(string Header)
{
if (Header.ToCharArray().Where(c => Char.IsUpper(c)).Count() == 1)
{
return Header;
}
else
{
string ReturnHeader = Header[0].ToString();
for(int i=1; i<Header.Length;i++)
{
if (char.IsLower(Header[i-1]) && char.IsUpper(Header[i]))
{
ReturnHeader += " " + Header[i].ToString();
}
else
{
ReturnHeader += Header[i].ToString();
}
}
return ReturnHeader;
}
return Header;
}
This one includes acronyms and acronym plurals and is a bit faster than the accepted answer:
public string Sentencify(string value)
{
if (string.IsNullOrWhiteSpace(value))
return string.Empty;
string final = string.Empty;
for (int i = 0; i < value.Length; i++)
{
if (i != 0 && Char.IsUpper(value[i]))
{
if (!Char.IsUpper(value[i - 1]))
final += " ";
else if (i < (value.Length - 1))
{
if (!Char.IsUpper(value[i + 1]) && !((value.Length >= i && value[i + 1] == 's') ||
(value.Length >= i + 1 && value[i + 1] == 'e' && value[i + 2] == 's')))
final += " ";
}
}
final += value[i];
}
return final;
}
Passes these tests:
string test1 = "RegularOTs";
string test2 = "ThisStringHasNoSpacesASCIIButItDoesHaveCapitalsLINQ";
string test3 = "ThisStringHasNoSpacesButItDoesHaveCapitals";
An implementation with fold, also known as Aggregate:
public static string SpaceCapitals(this string arg) =>
new string(arg.Aggregate(new List<Char>(),
(accum, x) =>
{
if (Char.IsUpper(x) &&
accum.Any() &&
// prevent double spacing
accum.Last() != ' ' &&
// prevent spacing acronyms (ASCII, SCSI)
!Char.IsUpper(accum.Last()))
{
accum.Add(' ');
}
accum.Add(x);
return accum;
}).ToArray());
In addition to the request, this implementation correctly saves leading, inner, trailing spaces and acronyms, for example,
" SpacedWord " => " Spaced Word ",
"Inner Space" => "Inner Space",
"SomeACRONYM" => "Some ACRONYM".

Categories