toTitleCase to ignore ordinals in C# - c#

I'm trying to figure out a way to use toTitleCase to ignore ordinals. It works as I want it to for all string except for ordinals (e.g. 1st, 2nd, 3rd becomes 1St, 2Nd, 3Rd).
Any help would be appreciated. A regular expression may be the way to handle this, I'm just not sure how such a regex would be constructed.
Update: Here is the solution I used (Using John's answer I wrote below extension method):
public static string ToTitleCaseIgnoreOrdinals(this string text)
{
string input = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase(text);
string result = System.Text.RegularExpressions.Regex.Replace(input, "([0-9]st)|([0-9]th)|([0-9]rd)|([0-9]nd)", new System.Text.RegularExpressions.MatchEvaluator((m) => m.Captures[0].Value.ToLower()), System.Text.RegularExpressions.RegexOptions.IgnoreCase);
return result;
}

string input = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ToTitleCase("hello there, this is the 1st");
string result = System.Text.RegularExpressions.Regex.Replace(input, "([0-9]st)|([0-9]th)|([0-9]rd)|([0-9]nd)", new System.Text.RegularExpressions.MatchEvaluator((m) =>
{
return m.Captures[0].Value.ToLower();
}), System.Text.RegularExpressions.RegexOptions.IgnoreCase);

You can use regular expressions to check if the string starts with a digit before you convert to Title Case, like this:
if (!Regex.IsMatch(text, #"^\d+"))
{
CultureInfo.CurrentCulture.TextInfo.toTitleCase(text);
}
Edit: forgot to reverse the conditional... changed so it will apply toTitleCase if it DOESN'T match.
2nd edit: added loop to check all words in a sentence:
string text = "150 east 40th street";
string[] array = text.Split(' ');
for (int i = 0; i < array.Length; i++)
{
if (!Regex.IsMatch(array[i], #"^\d+"))
{
array[i] = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(array[i]);
}
}
string newText = string.Join(" ",array);

I would split the text up and iterate through the resulting array, skipping things that don't start with a letter.
using System.Globalization;
TextInfo textInfo = new CultureInfo("en-US", false).TextInfo;
string[] text = myString.Split();
for(int i = 0; i < text.Length; i++)
{ //Check for zero-length strings, because these will throw an
//index out of range exception in Char.IsLetter
if (text[i].Length > 0 && Char.IsLetter(text[i][0]))
{
text[i] = textInfo.ToTitleCase(text[i]);
}
}

You could simply use String.Replace (or StringBuilder.Replace):
string[] ordinals = { "1St", "2Nd", "3Rd" }; // add all others
string text = "This is just sample text which contains some ordinals, the 1st, the 2nd and the third.";
var sb = new StringBuilder(CultureInfo.InvariantCulture.TextInfo.ToTitleCase(text));
foreach (string ordinal in ordinals)
sb.Replace(ordinal, ordinal.ToLowerInvariant());
text = sb.ToString();
this is not elegant at all. It requires you to maintain an infinite
list of ordinals on the first line. I'm assuming that's why someone
downvoted you.
It's not elegant but it works better than other simple approaches like the regex. You want to title-case words in longer text. But only words which are not ordinal-numbers. An ordinal number is f.e. 1st, 2nd or 3rd and 31st but not 31th. So the simple regex sollutions will fail fast. You also want to title-case words like 10m to 10M (where M could be the abbreviation for million).
So i don't understand why it's so bad to maintain a list of ordinal numbers.
You could even generate them automatically with an upper-limit, for example:
public static IEnumerable<string> GetTitleCaseOrdinalNumbers()
{
for (int num = 1; num <= int.MaxValue; num++)
{
switch (num % 100)
{
case 11:
case 12:
case 13:
yield return num + "Th";
break;
}
switch (num % 10)
{
case 1:
yield return num + "St"; break;
case 2:
yield return num + "Nd"; break;
case 3:
yield return num + "Rd"; break;
default:
yield return num + "Th"; break;
}
}
}
So if you want to check for the first 1000 ordinal numbers:
foreach (string ordinal in GetTitleCaseOrdinalNumbers().Take(1000))
sb.Replace(ordinal, ordinal.ToLowerInvariant());
Update
For what it's worth, here is my try to provide an efficient way that really checks words (and not only substrings) and skips ToTitleCase on words which really represent ordinal numbers(so not 31th but 31st for example). It also takes care of separator chars that are not white-spaces (like dots or commas):
private static readonly char[] separator = { '.', ',', ';', ':', '-', '(', ')', '\\', '{', '}', '[', ']', '/', '\\', '\'', '"', '"', '?', '!', '|' };
public static bool IsOrdinalNumber(string word)
{
if (word.Any(char.IsWhiteSpace))
return false; // white-spaces are not allowed
if (word.Length < 3)
return false;
var numericPart = word.TakeWhile(char.IsDigit);
string numberText = string.Join("", numericPart);
if (numberText.Length == 0)
return false;
int number;
if (!int.TryParse(numberText, out number))
return false; // handle unicode digits which are not really numeric like ۵
string ordinalNumber;
switch (number % 100)
{
case 11:
case 12:
case 13:
ordinalNumber = number + "th";
break;
}
switch (number % 10)
{
case 1:
ordinalNumber = number + "st"; break;
case 2:
ordinalNumber = number + "nd"; break;
case 3:
ordinalNumber = number + "rd"; break;
default:
ordinalNumber = number + "th"; break;
}
string checkForOrdinalNum = numberText + word.Substring(numberText.Length);
return checkForOrdinalNum.Equals(ordinalNumber, StringComparison.CurrentCultureIgnoreCase);
}
public static string ToTitleCaseIgnoreOrdinalNumbers(string text, TextInfo info)
{
if(text.Trim().Length < 3)
return info.ToTitleCase(text);
int whiteSpaceIndex = FindWhiteSpaceIndex(text, 0, separator);
if(whiteSpaceIndex == -1)
{
if(IsOrdinalNumber(text.Trim()))
return text;
else
return info.ToTitleCase(text);
}
StringBuilder sb = new StringBuilder();
int wordStartIndex = 0;
if(whiteSpaceIndex == 0)
{
// starts with space, find word
wordStartIndex = FindNonWhiteSpaceIndex(text, 1, separator);
sb.Append(text.Remove(wordStartIndex)); // append leading spaces
}
while(wordStartIndex >= 0)
{
whiteSpaceIndex = FindWhiteSpaceIndex(text, wordStartIndex + 1, separator);
string word;
if(whiteSpaceIndex == -1)
word = text.Substring(wordStartIndex);
else
word = text.Substring(wordStartIndex, whiteSpaceIndex - wordStartIndex);
if(IsOrdinalNumber(word))
sb.Append(word);
else
sb.Append(info.ToTitleCase(word));
wordStartIndex = FindNonWhiteSpaceIndex(text, whiteSpaceIndex + 1, separator);
string whiteSpaces;
if(wordStartIndex >= 0)
whiteSpaces = text.Substring(whiteSpaceIndex, wordStartIndex - whiteSpaceIndex);
else
whiteSpaces = text.Substring(whiteSpaceIndex);
sb.Append(whiteSpaces); // append spaces between words
}
return sb.ToString();
}
public static int FindWhiteSpaceIndex(string text, int startIndex = 0, params char[] separator)
{
bool checkSeparator = separator != null && separator.Any();
for (int i = startIndex; i < text.Length; i++)
{
char c = text[i];
if (char.IsWhiteSpace(c) || (checkSeparator && separator.Contains(c)))
return i;
}
return -1;
}
public static int FindNonWhiteSpaceIndex(string text, int startIndex = 0, params char[] separator)
{
bool checkSeparator = separator != null && separator.Any();
for (int i = startIndex; i < text.Length; i++)
{
char c = text[i];
if (!char.IsWhiteSpace(text[i]) && (!checkSeparator || !separator.Contains(c)))
return i;
}
return -1;
}
Note that this is really not tested yet but should give you an idea.

This would work for those strings, you could override ToTitleCase() via an Extension method.
string s = "1st";
if ( s[0] >= '0' && s[0] <= '9' ) {
//this string starts with a number
//so don't call ToTitleCase()
}
else { //call ToTileCase() }

Related

How to create a sequence of strings between "start" and "end" strings [duplicate]

I have a question about iterate through the Alphabet.
I would like to have a loop that begins with "a" and ends with "z". After that, the loop begins "aa" and count to "az". after that begins with "ba" up to "bz" and so on...
Anybody know some solution?
Thanks
EDIT: I forgot that I give a char "a" to the function then the function must return b. if u give "bnc" then the function must return "bnd"
First effort, with just a-z then aa-zz
public static IEnumerable<string> GetExcelColumns()
{
for (char c = 'a'; c <= 'z'; c++)
{
yield return c.ToString();
}
char[] chars = new char[2];
for (char high = 'a'; high <= 'z'; high++)
{
chars[0] = high;
for (char low = 'a'; low <= 'z'; low++)
{
chars[1] = low;
yield return new string(chars);
}
}
}
Note that this will stop at 'zz'. Of course, there's some ugly duplication here in terms of the loops. Fortunately, that's easy to fix - and it can be even more flexible, too:
Second attempt: more flexible alphabet
private const string Alphabet = "abcdefghijklmnopqrstuvwxyz";
public static IEnumerable<string> GetExcelColumns()
{
return GetExcelColumns(Alphabet);
}
public static IEnumerable<string> GetExcelColumns(string alphabet)
{
foreach(char c in alphabet)
{
yield return c.ToString();
}
char[] chars = new char[2];
foreach(char high in alphabet)
{
chars[0] = high;
foreach(char low in alphabet)
{
chars[1] = low;
yield return new string(chars);
}
}
}
Now if you want to generate just a, b, c, d, aa, ab, ac, ad, ba, ... you'd call GetExcelColumns("abcd").
Third attempt (revised further) - infinite sequence
public static IEnumerable<string> GetExcelColumns(string alphabet)
{
int length = 0;
char[] chars = null;
int[] indexes = null;
while (true)
{
int position = length-1;
// Try to increment the least significant
// value.
while (position >= 0)
{
indexes[position]++;
if (indexes[position] == alphabet.Length)
{
for (int i=position; i < length; i++)
{
indexes[i] = 0;
chars[i] = alphabet[0];
}
position--;
}
else
{
chars[position] = alphabet[indexes[position]];
break;
}
}
// If we got all the way to the start of the array,
// we need an extra value
if (position == -1)
{
length++;
chars = new char[length];
indexes = new int[length];
for (int i=0; i < length; i++)
{
chars[i] = alphabet[0];
}
}
yield return new string(chars);
}
}
It's possible that it would be cleaner code using recursion, but it wouldn't be as efficient.
Note that if you want to stop at a certain point, you can just use LINQ:
var query = GetExcelColumns().TakeWhile(x => x != "zzz");
"Restarting" the iterator
To restart the iterator from a given point, you could indeed use SkipWhile as suggested by thesoftwarejedi. That's fairly inefficient, of course. If you're able to keep any state between call, you can just keep the iterator (for either solution):
using (IEnumerator<string> iterator = GetExcelColumns())
{
iterator.MoveNext();
string firstAttempt = iterator.Current;
if (someCondition)
{
iterator.MoveNext();
string secondAttempt = iterator.Current;
// etc
}
}
Alternatively, you may well be able to structure your code to use a foreach anyway, just breaking out on the first value you can actually use.
Edit: Made it do exactly as the OP's latest edit wants
This is the simplest solution, and tested:
static void Main(string[] args)
{
Console.WriteLine(GetNextBase26("a"));
Console.WriteLine(GetNextBase26("bnc"));
}
private static string GetNextBase26(string a)
{
return Base26Sequence().SkipWhile(x => x != a).Skip(1).First();
}
private static IEnumerable<string> Base26Sequence()
{
long i = 0L;
while (true)
yield return Base26Encode(i++);
}
private static char[] base26Chars = "abcdefghijklmnopqrstuvwxyz".ToCharArray();
private static string Base26Encode(Int64 value)
{
string returnValue = null;
do
{
returnValue = base26Chars[value % 26] + returnValue;
value /= 26;
} while (value-- != 0);
return returnValue;
}
The following populates a list with the required strings:
List<string> result = new List<string>();
for (char ch = 'a'; ch <= 'z'; ch++){
result.Add (ch.ToString());
}
for (char i = 'a'; i <= 'z'; i++)
{
for (char j = 'a'; j <= 'z'; j++)
{
result.Add (i.ToString() + j.ToString());
}
}
I know there are plenty of answers here, and one's been accepted, but IMO they all make it harder than it needs to be. I think the following is simpler and cleaner:
static string NextColumn(string column){
char[] c = column.ToCharArray();
for(int i = c.Length - 1; i >= 0; i--){
if(char.ToUpper(c[i]++) < 'Z')
break;
c[i] -= (char)26;
if(i == 0)
return "A" + new string(c);
}
return new string(c);
}
Note that this doesn't do any input validation. If you don't trust your callers, you should add an IsNullOrEmpty check at the beginning, and a c[i] >= 'A' && c[i] <= 'Z' || c[i] >= 'a' && c[i] <= 'z' check at the top of the loop. Or just leave it be and let it be GIGO.
You may also find use for these companion functions:
static string GetColumnName(int index){
StringBuilder txt = new StringBuilder();
txt.Append((char)('A' + index % 26));
//txt.Append((char)('A' + --index % 26));
while((index /= 26) > 0)
txt.Insert(0, (char)('A' + --index % 26));
return txt.ToString();
}
static int GetColumnIndex(string name){
int rtn = 0;
foreach(char c in name)
rtn = rtn * 26 + (char.ToUpper(c) - '#');
return rtn - 1;
//return rtn;
}
These two functions are zero-based. That is, "A" = 0, "Z" = 25, "AA" = 26, etc. To make them one-based (like Excel's COM interface), remove the line above the commented line in each function, and uncomment those lines.
As with the NextColumn function, these functions don't validate their inputs. Both with give you garbage if that's what they get.
Here’s what I came up with.
/// <summary>
/// Return an incremented alphabtical string
/// </summary>
/// <param name="letter">The string to be incremented</param>
/// <returns>the incremented string</returns>
public static string NextLetter(string letter)
{
const string alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
if (!string.IsNullOrEmpty(letter))
{
char lastLetterInString = letter[letter.Length - 1];
// if the last letter in the string is the last letter of the alphabet
if (alphabet.IndexOf(lastLetterInString) == alphabet.Length - 1)
{
//replace the last letter in the string with the first leter of the alphbat and get the next letter for the rest of the string
return NextLetter(letter.Substring(0, letter.Length - 1)) + alphabet[0];
}
else
{
// replace the last letter in the string with the proceeding letter of the alphabet
return letter.Remove(letter.Length-1).Insert(letter.Length-1, (alphabet[alphabet.IndexOf(letter[letter.Length-1])+1]).ToString() );
}
}
//return the first letter of the alphabet
return alphabet[0].ToString();
}
just curious , why not just
private string alphRecursive(int c) {
var alphabet = "abcdefghijklmnopqrstuvwxyz".ToCharArray();
if (c >= alphabet.Length) {
return alphRecursive(c/alphabet.Length) + alphabet[c%alphabet.Length];
} else {
return "" + alphabet[c%alphabet.Length];
}
}
This is like displaying an int, only using base 26 in stead of base 10. Try the following algorithm to find the nth entry of the array
q = n div 26;
r = n mod 26;
s = '';
while (q > 0 || r > 0) {
s = alphabet[r] + s;
q = q div 26;
r = q mod 26;
}
Of course, if you want the first n entries, this is not the most efficient solution. In this case, try something like daniel's solution.
I gave this a go and came up with this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace Alphabetty
{
class Program
{
const string alphabet = "abcdefghijklmnopqrstuvwxyz";
static int cursor = 0;
static int prefixCursor;
static string prefix = string.Empty;
static bool done = false;
static void Main(string[] args)
{
string s = string.Empty;
while (s != "Done")
{
s = GetNextString();
Console.WriteLine(s);
}
Console.ReadKey();
}
static string GetNextString()
{
if (done) return "Done";
char? nextLetter = GetNextLetter(ref cursor);
if (nextLetter == null)
{
char? nextPrefixLetter = GetNextLetter(ref prefixCursor);
if(nextPrefixLetter == null)
{
done = true;
return "Done";
}
prefix = nextPrefixLetter.Value.ToString();
nextLetter = GetNextLetter(ref cursor);
}
return prefix + nextLetter;
}
static char? GetNextLetter(ref int letterCursor)
{
if (letterCursor == alphabet.Length)
{
letterCursor = 0;
return null;
}
char c = alphabet[letterCursor];
letterCursor++;
return c;
}
}
}
Here is something I had cooked up that may be similar. I was experimenting with iteration counts in order to design a numbering schema that was as small as possible, yet gave me enough uniqueness.
I knew that each time a added an Alpha character, it would increase the possibilities 26x but I wasn't sure how many letters, numbers, or the pattern I wanted to use.
That lead me to the code below. Basically you pass it an AlphaNumber string, and every position that has a Letter, would eventually increment to "z\Z" and every position that had a Number, would eventually increment to "9".
So you can call it 1 of two ways..
//This would give you the next Itteration... (H3reIsaStup4dExamplf)
string myNextValue = IncrementAlphaNumericValue("H3reIsaStup4dExample")
//Or Loop it resulting eventually as "Z9zzZzzZzzz9zZzzzzzz"
string myNextValue = "H3reIsaStup4dExample"
while (myNextValue != null)
{
myNextValue = IncrementAlphaNumericValue(myNextValue)
//And of course do something with this like write it out
}
(For me, I was doing something like "1AA000")
public string IncrementAlphaNumericValue(string Value)
{
//We only allow Characters a-b, A-Z, 0-9
if (System.Text.RegularExpressions.Regex.IsMatch(Value, "^[a-zA-Z0-9]+$") == false)
{
throw new Exception("Invalid Character: Must be a-Z or 0-9");
}
//We work with each Character so it's best to convert the string to a char array for incrementing
char[] myCharacterArray = Value.ToCharArray();
//So what we do here is step backwards through the Characters and increment the first one we can.
for (Int32 myCharIndex = myCharacterArray.Length - 1; myCharIndex >= 0; myCharIndex--)
{
//Converts the Character to it's ASCII value
Int32 myCharValue = Convert.ToInt32(myCharacterArray[myCharIndex]);
//We only Increment this Character Position, if it is not already at it's Max value (Z = 90, z = 122, 57 = 9)
if (myCharValue != 57 && myCharValue != 90 && myCharValue != 122)
{
myCharacterArray[myCharIndex]++;
//Now that we have Incremented the Character, we "reset" all the values to the right of it
for (Int32 myResetIndex = myCharIndex + 1; myResetIndex < myCharacterArray.Length; myResetIndex++)
{
myCharValue = Convert.ToInt32(myCharacterArray[myResetIndex]);
if (myCharValue >= 65 && myCharValue <= 90)
{
myCharacterArray[myResetIndex] = 'A';
}
else if (myCharValue >= 97 && myCharValue <= 122)
{
myCharacterArray[myResetIndex] = 'a';
}
else if (myCharValue >= 48 && myCharValue <= 57)
{
myCharacterArray[myResetIndex] = '0';
}
}
//Now we just return an new Value
return new string(myCharacterArray);
}
}
//If we got through the Character Loop and were not able to increment anything, we retun a NULL.
return null;
}
Here's my attempt using recursion:
public static void PrintAlphabet(string alphabet, string prefix)
{
for (int i = 0; i < alphabet.Length; i++) {
Console.WriteLine(prefix + alphabet[i].ToString());
}
if (prefix.Length < alphabet.Length - 1) {
for (int i = 0; i < alphabet.Length; i++) {
PrintAlphabet(alphabet, prefix + alphabet[i]);
}
}
}
Then simply call PrintAlphabet("abcd", "");

Replace character in string with Uppercase of next in line (Pascal Casing)

I want to remove all underscores from a string with the uppercase of the character following the underscore. So for example: _my_string_ becomes: MyString similarly: my_string becomes MyString
Is there a simpler way to do it? I currently have the following (assuming no input has two consecutive underscores):
StringBuilder sb = new StringBuilder();
int i;
for (i = 0; i < input.Length - 1; i++)
{
if (input[i] == '_')
sb.Append(char.ToUpper(input[++i]));
else if (i == 0)
sb.Append(char.ToUpper(input[i]));
else
sb.Append(input[i]);
}
if (i < input.Length && input[i] != '_')
sb.Append(input[i]);
return sb.ToString();
Now I know this is not totally related, but I thought to run some numbers on the implementations provided in the answers, and here are the results in Milliseconds for each implementation using 1000000 iterations of the string: "_my_string_121_a_" :
Achilles: 313
Raj: 870
Damian: 7916
Dmitry: 5380
Equalsk: 574
method utilised:
Stopwatch stp = new Stopwatch();
stp.Start();
for (int i = 0; i < 1000000; i++)
{
sb = Test("_my_string_121_a_");
}
stp.Stop();
long timeConsumed= stp.ElapsedMilliseconds;
In the end I think I'll go with Raj's implementation, because it's just very simple and easy to understand.
This must do it using ToTitleCase using System.Globalization namespace
static string toCamel(string input)
{
TextInfo info = CultureInfo.CurrentCulture.TextInfo;
input= info.ToTitleCase(input).Replace("_", string.Empty);
return input;
}
Shorter (regular expressions), but I doubt if it's better (regular expressions are less readable):
string source = "_my_string_123__A_";
// MyString123A
string result = Regex
// _ + lower case Letter -> upper case letter (thanks to Wiktor Stribiżew)
.Replace(source, #"(_+|^)(\p{Ll})?", match => match.Groups[2].Value.ToUpper())
// all the other _ should be just removed
.Replace("_", "");
Loops over each character and converts to uppercase as necessary.
public string GetNewString(string input)
{
var convert = false;
var sb = new StringBuilder();
foreach (var c in input)
{
if (c == '_')
{
convert = true;
continue;
}
if (convert)
{
sb.Append(char.ToUpper(c));
convert = false;
continue;
}
sb.Append(c);
}
return sb.ToString().First().ToString().ToUpper() + sb.ToString().Substring(1);
}
Usage:
GetNewString("my_string");
GetNewString("___this_is_anewstring_");
GetNewString("___this_is_123new34tring_");
Output:
MyString
ThisIsAnewstring
ThisIs123new34tring
Try with Regex:
var regex = new Regex("^[a-z]|_[a-z]?");
var result = regex.Replace("my_string_1234", x => x.Value== "_" ? "" : x.Value.Last().ToString().ToUpper());
Tested with:
my_string -> MyString
_my_string -> MyString
_my_string_ -> MyString
You need convert snake case to camel case, You can use this code it's working for me
var x ="_my_string_".Split(new[] {"_"}, StringSplitOptions.RemoveEmptyEntries)
.Select(s => char.ToUpperInvariant(s[0]) + s.Substring(1, s.Length - 1))
.Aggregate(string.Empty, (s1, s2) => s1 + s2);
x = MyString
static string toCamel(string input)
{
StringBuilder sb = new StringBuilder();
int i;
for (i = 0; i < input.Length; i++)
{
if ((i == 0) || (i > 0 && input[i - 1] == '_'))
sb.Append(char.ToUpper(input[i]));
else
sb.Append(char.ToLower(input[i]));
}
return sb.ToString();
}

Parse comma seperated string with a complication in C#

I know how to get substrings from a string which are coma seperated but here's a complication: what if substring contains a coma.
If a substring contains a coma, new line or double quotes the entire substring is encapsulated with double quotes.
If a substring contains a double quote the double quote is escaped with another double quote.
Worst case scenario would be if I have something like this:
first,"second, second","""third"" third","""fourth"", fourth"
In this case substrings are:
first
second, second
"third" third
"fourth", fourth
second, second is encapsulated with double quotes, I don't want those double quotes in a list/array.
"third" third is encapsulated with double quotes because it contains double quotes and those are escaped with aditional double quotes. Again I don't want the encapsulating double quotes in a list/array and i don't want the double quotes that escape double quotes, but I want original double quotes which are a part of the substring.
One way using TextFieldParser:
using (var reader = new StringReader("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader))
{
parser.Delimiters = new[] { "," };
parser.HasFieldsEnclosedInQuotes = true;
while (!parser.EndOfData)
{
foreach (var field in parser.ReadFields())
Console.WriteLine(field);
}
}
For
first
second, second
"third" third
"fourth", fourth
Try this
string input = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
string[] output = input.Split(new string[] {"\",\""}, StringSplitOptions.RemoveEmptyEntries);
I would suggest you to construct a small state machine for this problem. You would have states like:
Out - before the first field is reached
InQuoted - you were Out and " arrived; now you're in and the field is quoted
InQuotedMaybeOut - you were InQuoted and " arrived; now you wait for the next character to figure whether it is another " or something else; if else, then select the next valid state (character could be space, new line, comma, so you decide the next state); otherwise, if " arrived, you push " to the output and step back to InQuoted
In - after Out, when any character has arrived except , and ", you are automatically inside a new field which is not quoted.
This will certainly read CSV correctly. You can also make the separator configurable, so that you support TSV or semicolon-separated format.
Also keep in mind one very important case in CSV format: Quoted field may contain new line! Another special case to keep an eye on: empty field (like: ,,).
This is not the most elegant solution but it might help you. I would loop through the characters and do an odd-even count of the quotes. For example you have a bool that is true if you have encountered an odd number of quotes and false for an even number of quotes.
Any comma encountered while this bool value is true should not be considered as a separator. If you know it is a separator you can do several things with that information. Below I replaced the delimiter with something more manageable (not very efficient though):
bool odd = false;
char replacementDelimiter = "|"; // Or some very unlikely character
for(int i = 0; i < str.len; ++i)
{
if(str[i] == '\"')
odd = !odd;
else if (str[i] == ',')
{
if(!odd)
str[i] = replacementDelimiter;
}
}
string[] commaSeparatedTokens = str.Split(replacementDelimiter);
At this point you should have an array of strings that are separated on the commas that you have intended. From here on it will be simpler to handle the quotes.
I hope this can help you.
Mini parser
using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApp
{
class Program
{
private static IEnumerable<string> Parse(string input)
{
if (string.IsNullOrWhiteSpace(input))
{
// empty string => nothing to do
yield break;
}
int count = input.Length;
StringBuilder sb = new StringBuilder();
int j;
for (int i = 0; i < count; i++)
{
char c = input[i];
if (c == ',')
{
yield return sb.ToString();
sb.Clear();
}
else if (c == '"')
{
// begin quoted string
sb.Clear();
for (j = i + 1; j < count; j++)
{
if (input[j] == '"')
{
// quote
if (j < count - 1 && input[j + 1] == '"')
{
// double quote
sb.Append('"');
j++;
}
else
{
break;
}
}
else
{
sb.Append(input[j]);
}
}
yield return sb.ToString();
// clear buffer and skip to next comma
sb.Clear();
for (i = j + 1; i < count && input[i] != ','; i++) ;
}
else
{
sb.Append(c);
}
}
}
[STAThread]
static void Main(string[] args)
{
foreach (string str in Parse("first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\""))
{
Console.WriteLine(str);
}
Console.WriteLine();
Console.WriteLine("Press any key to continue...");
Console.ReadKey();
}
}
}
Result
first
second, second
"third" third
"fourth", fourth
Thank you for your answers, but before I got to see them I wrote this solution, it's not pretty but it works for me.
string line = "first,\"second, second\",\"\"\"third\"\" third\",\"\"\"fourth\"\", fourth\"";
var substringArray = new List<string>();
string substring = null;
var doubleQuotesCount = 0;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == ',' && (doubleQuotesCount % 2) == 0)
{
substringArray.Add(substring);
substring = null;
doubleQuotesCount = 0;
continue;
}
else
{
if (line[i] == '"')
doubleQuotesCount++;
substring += line[i];
//If it is a last character
if (i == line.Length - 1)
{
substringArray.Add(substring);
substring = null;
doubleQuotesCount = 0;
}
}
}
for(var i = 0; i < substringArray.Count; i++)
{
if (substringArray[i] != null)
{
//remove first double quote
if (substringArray[i][0] == '"')
{
substringArray[i] = substringArray[i].Substring(1);
}
//remove last double quote
if (substringArray[i][substringArray[i].Length - 1] == '"')
{
substringArray[i] = substringArray[i].Remove(substringArray[i].Length - 1);
}
//Replace double double quotes with single double quote
substringArray[i] = substringArray[i].Replace("\"\"", "\"");
}
}

split a string with max character limit

I'm attempting to split a string into many strings (List) with each one having a maximum limit of characters. So say if I had a string of 500 characters, and I want each string to have a max of 75, there would be 7 strings, and the last one would not have a full 75.
I've tried some of the examples I have found on stackoverflow, but they 'truncate' the results. Any ideas?
You can write your own extension method to do something like that
static class StringExtensions
{
public static IEnumerable<string> SplitOnLength(this string input, int length)
{
int index = 0;
while (index < input.Length)
{
if (index + length < input.Length)
yield return input.Substring(index, length);
else
yield return input.Substring(index);
index += length;
}
}
}
And then you could call it like this
string temp = new string('#', 500);
string[] array = temp.SplitOnLength(75).ToArray();
foreach (string x in array)
Console.WriteLine(x);
I think this is a little cleaner than the other answers:
public static IEnumerable<string> SplitByLength(string s, int length)
{
while (s.Length > length)
{
yield return s.Substring(0, length);
s = s.Substring(length);
}
if (s.Length > 0) yield return s;
}
I would tackle this with a loop using C# String.Substring method.
Note that this isn't exact code, but you get the idea.
var myString = "hello world";
List<string> list = new List();
int maxSize
while(index < myString.Length())
{
if(index + maxSize > myString.Length())
{
// handle last case
list.Add(myString.Substring(index));
break;
}
else
{
list.Add(myString.Substring(index,maxSize));
index+= maxSize;
}
}
When you say split, are you referring to the split function? If not, something like this will work:
List<string> list = new List<string>();
string s = "";
int num = 75;
while (s.Length > 0)
{
list.Add(s.Substring(0, num));
s = s.Remove(0, num);
}
i am assuming maybe a delimiter - like space character.
search on the string (instr) until you find the next position of the delimiter.
if that is < your substring length (75) then append to the current substring.
if not, start a new substring.
special case - if there is no delimiter in the entire substring - then you need to define what happens - like add a '-' then continue.
public static string SplitByLength(string s, int length)
{
ArrayList sArrReturn = new ArrayList();
String[] sArr = s.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sconcat in sArr)
{
if (((String.Join(" ", sArrReturn.ToArray()).Length + sconcat.Length)+1) < length)
sArrReturn.Add(sconcat);
else
break;
}
return String.Join(" ", sArrReturn.ToArray());
}
public static string SplitByLengthOld(string s, int length)
{
try
{
string sret = string.Empty;
String[] sArr = s.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string sconcat in sArr)
{
if ((sret.Length + sconcat.Length + 1) < length)
sret = string.Format("{0}{1}{2}", sret, string.IsNullOrEmpty(sret) ? string.Empty : " ", sconcat);
}
return sret;
}
catch
{
return string.Empty;
}
}

Parse an integer from a string with trailing garbage

I need to parse a decimal integer that appears at the start of a string.
There may be trailing garbage following the decimal number. This needs to be ignored (even if it contains other numbers.)
e.g.
"1" => 1
" 42 " => 42
" 3 -.X.-" => 3
" 2 3 4 5" => 2
Is there a built-in method in the .NET framework to do this?
int.TryParse() is not suitable. It allows trailing spaces but not other trailing characters.
It would be quite easy to implement this but I would prefer to use the standard method if it exists.
You can use Linq to do this, no Regular Expressions needed:
public static int GetLeadingInt(string input)
{
return Int32.Parse(new string(input.Trim().TakeWhile(c => char.IsDigit(c) || c == '.').ToArray()));
}
This works for all your provided examples:
string[] tests = new string[] {
"1",
" 42 ",
" 3 -.X.-",
" 2 3 4 5"
};
foreach (string test in tests)
{
Console.WriteLine("Result: " + GetLeadingInt(test));
}
foreach (var m in Regex.Matches(" 3 - .x. 4", #"\d+"))
{
Console.WriteLine(m);
}
Updated per comments
Not sure why you don't like regular expressions, so I'll just post what I think is the shortest solution.
To get first int:
Match match = Regex.Match(" 3 - .x. - 4", #"\d+");
if (match.Success)
Console.WriteLine(int.Parse(match.Value));
There's no standard .NET method for doing this - although I wouldn't be surprised to find that VB had something in the Microsoft.VisualBasic assembly (which is shipped with .NET, so it's not an issue to use it even from C#).
Will the result always be non-negative (which would make things easier)?
To be honest, regular expressions are the easiest option here, but...
public static string RemoveCruftFromNumber(string text)
{
int end = 0;
// First move past leading spaces
while (end < text.Length && text[end] == ' ')
{
end++;
}
// Now move past digits
while (end < text.Length && char.IsDigit(text[end]))
{
end++;
}
return text.Substring(0, end);
}
Then you just need to call int.TryParse on the result of RemoveCruftFromNumber (don't forget that the integer may be too big to store in an int).
I like #Donut's approach.
I'd like to add though, that char.IsDigit and char.IsNumber also allow for some unicode characters which are digits in other languages and scripts (see here).
If you only want to check for the digits 0 to 9 you could use "0123456789".Contains(c).
Three example implementions:
To remove trailing non-digit characters:
var digits = new string(input.Trim().TakeWhile(c =>
("0123456789").Contains(c)
).ToArray());
To remove leading non-digit characters:
var digits = new string(input.Trim().SkipWhile(c =>
!("0123456789").Contains(c)
).ToArray());
To remove all non-digit characters:
var digits = new string(input.Trim().Where(c =>
("0123456789").Contains(c)
).ToArray());
And of course: int.Parse(digits) or int.TryParse(digits, out output)
This doesn't really answer your question (about a built-in C# method), but you could try chopping off characters at the end of the input string one by one until int.TryParse() accepts it as a valid number:
for (int p = input.Length; p > 0; p--)
{
int num;
if (int.TryParse(input.Substring(0, p), out num))
return num;
}
throw new Exception("Malformed integer: " + input);
Of course, this will be slow if input is very long.
ADDENDUM (March 2016)
This could be made faster by chopping off all non-digit/non-space characters on the right before attempting each parse:
for (int p = input.Length; p > 0; p--)
{
char ch;
do
{
ch = input[--p];
} while ((ch < '0' || ch > '9') && ch != ' ' && p > 0);
p++;
int num;
if (int.TryParse(input.Substring(0, p), out num))
return num;
}
throw new Exception("Malformed integer: " + input);
string s = " 3 -.X.-".Trim();
string collectedNumber = string.empty;
int i;
for (x = 0; x < s.length; x++)
{
if (int.TryParse(s[x], out i))
collectedNumber += s[x];
else
break; // not a number - that's it - get out.
}
if (int.TryParse(collectedNumber, out i))
Console.WriteLine(i);
else
Console.WriteLine("no number found");
This is how I would have done it in Java:
int parseLeadingInt(String input)
{
NumberFormat fmt = NumberFormat.getIntegerInstance();
fmt.setGroupingUsed(false);
return fmt.parse(input, new ParsePosition(0)).intValue();
}
I was hoping something similar would be possible in .NET.
This is the regex-based solution I am currently using:
int? parseLeadingInt(string input)
{
int result = 0;
Match match = Regex.Match(input, "^[ \t]*\\d+");
if (match.Success && int.TryParse(match.Value, out result))
{
return result;
}
return null;
}
Might as well add mine too.
string temp = " 3 .x£";
string numbersOnly = String.Empty;
int tempInt;
for (int i = 0; i < temp.Length; i++)
{
if (Int32.TryParse(Convert.ToString(temp[i]), out tempInt))
{
numbersOnly += temp[i];
}
}
Int32.TryParse(numbersOnly, out tempInt);
MessageBox.Show(tempInt.ToString());
The message box is just for testing purposes, just delete it once you verify the method is working.
I'm not sure why you would avoid Regex in this situation.
Here's a little hackery that you can adjust to your needs.
" 3 -.X.-".ToCharArray().FindInteger().ToList().ForEach(Console.WriteLine);
public static class CharArrayExtensions
{
public static IEnumerable<char> FindInteger(this IEnumerable<char> array)
{
foreach (var c in array)
{
if(char.IsNumber(c))
yield return c;
}
}
}
EDIT:
That's true about the incorrect result (and the maintenance dev :) ).
Here's a revision:
public static int FindFirstInteger(this IEnumerable<char> array)
{
bool foundInteger = false;
var ints = new List<char>();
foreach (var c in array)
{
if(char.IsNumber(c))
{
foundInteger = true;
ints.Add(c);
}
else
{
if(foundInteger)
{
break;
}
}
}
string s = string.Empty;
ints.ForEach(i => s += i.ToString());
return int.Parse(s);
}
private string GetInt(string s)
{
int i = 0;
s = s.Trim();
while (i<s.Length && char.IsDigit(s[i])) i++;
return s.Substring(0, i);
}
Similar to Donut's above but with a TryParse:
private static bool TryGetLeadingInt(string input, out int output)
{
var trimmedString = new string(input.Trim().TakeWhile(c => char.IsDigit(c) || c == '.').ToArray());
var canParse = int.TryParse( trimmedString, out output);
return canParse;
}

Categories