Regex.Split and string.Split not working as expected

I am attempting to split strings using '?' as the delimiter. My code reads data from a CSV file, and certain symbols (like fractions) are not recognized by C#, so I am trying to replace them with a relevant piece of data (bond coupon in this case). I have print statements in the following code (which is embedded in a loop with index variable i) to test the output:
string[] l = lines[i][1].Split('?');
//string[] l = Regex.Split(lines[i][1], @"\?");
System.Console.WriteLine("L IS " + l.Length.ToString() + " LONG");
for (int j = 0; j < l.Length; j++)
System.Console.WriteLine("L["+ j.ToString() + "] IS " + l[j]);
if (l.Length > 1)
{
double cpn = Convert.ToDouble(lines[i][12]);
string couponFrac = (cpn - Math.Floor(cpn)).ToString().Remove(0,1);
lines[i][1] = l[0].Remove(l[0].Length-1) + couponFrac + l[1]; // Recombine, replacing '?' with CPN
}
The issue is that both split methods (string.Split() and Regex.Split()) produce inconsistent results: some of the string elements in lines split correctly, while others do not split at all, so the question mark is still in the string.
Any thoughts? I've looked at similar posts on split methods and they haven't been much help.

I had no problem using String.Split. Could you post your input and output?
Failing that, you could try String.Replace to swap the '?' for a character that does not occur in the string, and then use String.Split on that character for the same effect. (Just something to try.)

I didn't have any trouble parsing the following.
var qsv = "now?is?the?time";
var keywords = qsv.Split('?');
keywords.Dump();
UPDATE:
There doesn't appear to be any problem with Split. There is a problem somewhere else because in this small scale test it works just fine. I would suggest you use LinqPad to test out these kinds of scenarios small scale.
var qsv = "TII 0 ? 04/15/15";
var keywords = qsv.Split('?');
keywords.Dump();
qsv = "TII 0 ? 01/15/22";
keywords = qsv.Split('?');
keywords.Dump();
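If Split works in isolation but some strings in the real data still come back unsplit, one thing worth checking is whether the character shown as '?' is actually U+003F. The sketch below rests on an assumption about the cause, not a confirmed diagnosis: it supposes the CSV holds fraction characters such as '½' that the console merely renders as '?'. The names CharInspector and DescribeChars are made up for illustration.

```csharp
using System;
using System.Linq;

static class CharInspector
{
    // Returns each character with its Unicode code point, e.g. "?(U+003F)".
    public static string DescribeChars(string s) =>
        string.Join(" ", s.Select(c => $"{c}(U+{(int)c:X4})"));

    static void Main()
    {
        // A real question mark splits; a vulgar fraction (U+00BD) does not.
        string real = "TII 0 ? 04/15/15";
        string fraction = "TII 0 \u00BD 04/15/15";

        Console.WriteLine(real.Split('?').Length);     // 2
        Console.WriteLine(fraction.Split('?').Length); // 1

        // Print the code points to see what is really stored in the string.
        Console.WriteLine(DescribeChars(fraction));
    }
}
```

If the code points of the unsplit strings turn out not to be U+003F, splitting on that actual character (or fixing the encoding used to read the file) would be the next thing to try.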

Related

How to concatenate the whole text from a file into a string, avoiding empty lines between strings

How can I get the whole text of a document concatenated into a single string? I'm trying to split text by dot: string[] words = s.Split('.'); I want to take this text from a text document. But if my text document contains empty lines between strings, for example:
pat said, “i’ll keep this ring.”
she displayed the silver and jade wedding ring which, in another time track,
she and joe had picked out; this
much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
result looks like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track
3. she and joe had picked out this
4. much of the alternate world she had elected to retain
5. he wondered what if any legal basis she had kept in addition
6. none he hoped wisely however he said nothing
7. better not even to ask
but desired correct output should be like this:
1. pat said ill keep this ring
2. she displayed the silver and jade wedding ring which in another time track she and joe had picked out this much of the alternate world she had elected to retain
3. he wondered what if any legal basis she had kept in addition
4. none he hoped wisely however he said nothing
5. better not even to ask
So to do this first I need to process text file content to get whole text as single string, like this:
pat said, “i’ll keep this ring.” she displayed the silver and jade wedding ring which, in another time track, she and joe had picked out; this much of the alternate world she had elected to retain. he wondered what - if any - legal basis she had kept in addition. none, he hoped; wisely, however, he said nothing. better not even to ask.
I can't do this the same way as I would with list content, for example: string concat = String.Join(" ", text.ToArray());
I'm not sure how to concatenate the text from a text document into a string.

I think this is what you want:
var fileLocation = @"c:\myfile.txt";
var stringFromFile = File.ReadAllText(fileLocation);
//replace Environment.NewLine with whatever line break your file uses; a space keeps words on adjacent lines apart
var withoutNewLines = stringFromFile.Replace(Environment.NewLine, " ");
//modify to remove any unwanted character
var withoutUglyCharacters = Regex.Replace(withoutNewLines, "[“’”,;-]", "");
//collapse runs of whitespace into a single space
var withoutTwoSpaces = Regex.Replace(withoutUglyCharacters, @"\s+", " ");
var result = withoutTwoSpaces.Split('.').Select(i => i.Trim()).Where(i => i != "").ToList();
So first you read all the text from your file, then you remove all unwanted characters, and then you split by '.' and return the non-empty items.

Have you tried replacing double new-lines before splitting using a period?
static string[] GetSentences(string filePath) {
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
var lines = string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line)));
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
return sentences;
}
I haven't tested this, but it should work.
Explanation:
if (!File.Exists(filePath))
throw new FileNotFoundException($"Could not find file { filePath }!");
Throws an exception if the file could not be found. It is advisable to surround the method call with a try/catch.
var lines = string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line)));
Creates a single string, ignoring any lines that are empty or purely whitespace.
var sentences = Regex.Split(lines, @"\.[\s]{1,}?");
Creates a string array, where the string is split at every period and whitespace following the period.
E.g:
The string "I came. I saw. I conquered" would become
I came
I saw
I conquered
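The joining and splitting steps can be tried on in-memory lines before touching a file. This is a minimal sketch, assuming lines are joined with a single space (so words on adjacent lines do not run together) and that every period ends a sentence; SentenceSplitter and SplitSentences are illustrative names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class SentenceSplitter
{
    public static string[] SplitSentences(IEnumerable<string> lines)
    {
        // Drop blank lines, then join the rest into one string.
        string text = string.Join(" ",
            lines.Where(l => !string.IsNullOrWhiteSpace(l)).Select(l => l.Trim()));

        // Split on a period plus any whitespace that follows it,
        // and discard the empty piece left after the final period.
        return Regex.Split(text, @"\.\s*")
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    static void Main()
    {
        var lines = new[] { "I came.", "", "I saw. I", "conquered." };
        foreach (string sentence in SplitSentences(lines))
            Console.WriteLine(sentence);
    }
}
```

For a real file, the same method could be called as SplitSentences(File.ReadLines(path)). Note that this pattern also splits on periods inside numbers or abbreviations, which may or may not matter for the text in question.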
Update:
Here's the method as a one-liner, if that's your style?
static string[] SplitSentences(string filePath) => File.Exists(filePath) ? Regex.Split(string.Join(" ", File.ReadLines(filePath).Where(line => !string.IsNullOrWhiteSpace(line))), @"\.[\s]{1,}?") : null;

I would suggest iterating through all the characters and checking whether each one is in the range 'a' <= char <= 'z' or is a space (char == ' '). If it matches the condition, add it to the newly created string; otherwise check whether it is the '.' character, and if it is, end the current line and start another one:
List<string> lines = new List<string>();
string line = string.Empty;
foreach(char c in str)
{
if((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20)
line += c;
else if(c == '.')
{
lines.Add(line.Trim());
line = string.Empty;
}
}
Or if you prefer "one-liner"s :
IEnumerable<string> lines = new string(str.Select(c => (char)(((char.ToLower(c) >= 'a' && char.ToLower(c) <= 'z') || c == 0x20) ? c : c == '.' ? '\n' : '\0')).ToArray()).Split('\n').Select(s => s.Trim());

I may be wrong about this, but I would think you may not want to alter the string if you are only splitting it. For example, there are single/double quotes (“) in part of the string, and removing them may not be desired. That raises a related issue: reading a text file that contains such quotes (as your example text does) like below:
var stringFromFile = File.ReadAllText(fileLocation);
will not display those characters properly in a text box or the console, because the default encoding used by the ReadAllText method is UTF-8. For example, the single/double quotes will display as diamonds (replacement characters) in a text box on a form and as question marks (?) on the console. To keep the single/double quotes and have them display properly, you can pass the OS's current ANSI encoding as a parameter to the ReadAllText method, like below:
string stringFromFile = File.ReadAllText(fileLocation, Encoding.Default);
Below is code using a simple Split call to split the string on periods (.). Hope this helps.
private void button1_Click(object sender, EventArgs e) {
string fileLocation = @"C:\YourPath\YourFile.txt";
string stringFromFile = File.ReadAllText(fileLocation, Encoding.Default);
string bigString = stringFromFile.Replace(Environment.NewLine, "");
string[] result = bigString.Split('.');
int count = 1;
foreach (string s in result) {
if (s != "") {
textBox1.Text += count + ". " + s.Trim() + Environment.NewLine;
Console.WriteLine(count + ". " + s.Trim());
count++;
}
else {
// period at the end of the string
}
}
}

How to write sorting in more efficient way?

I have a project where I have to write efficient code that works as fast as possible, but I lack the knowledge to do it, so...
I have an ASP.NET (MVC) project using Entity Framework, and I also have to use a web service to get details from it.
First I make a request to the web service, and it responds with a long string, which I have to parse into a list of strings for further processing.
I parse this string like this:
string resultString;
char[] delimiterChars = { ',', ':', '"', '}', '{' };
List<string> words = resultString.Split(delimiterChars).ToList();
This gives me a list with a lot of rows that contain information and a lot of junk rows (empty strings and tokens such as "data" and "array").
I decided to clear the junk out of this list, so that later methods don't have to work with it or check these rows with ifs and so on:
for (int i = words.Count - 1; i >= 0; i--)
{
if (words[i] == "" || words[i] == "data" || words[i] == "array") words.RemoveAt(i);
}
After this I had a clean list, but every decimal number (prices, sizes and so on) had been split at its ',': a price of 21,55 now appears in my list as two elements, 21 and 55. I can't just delete ',' from the separators, because the response string from the web service mainly separates its fields with ','.
So I decided to glue the decimal numbers back together (before this block the list elements looked like 1) attrValue 2) 21 3) 55, and after it like 1) attrValue 2) 21.55):
for (int i = 0; i < words.Count(); i++)
{
if (words[i] == "attrValue")
{
try
{
var seconPartInt = Int32.Parse(words[i + 2]);
words[i + 1] += "." + words[i + 2];
}
catch { }
}
if (words[i].Contains("\\/")) words[i] = words[i].Replace("\\/", "/"); // Replace returns a new string, so assign it back
}
Everything works: the list is sorted out and the decimals are rejoined, but the program is about 30% slower. After some tests with Stopwatch and commenting out blocks of code, it became clear that the code above is what slows the whole program down...
To sum up:
I can't use code that slow, and I don't know how to make it faster. Maybe the problem is that I convert a string to int to check whether the next element in the list is the second part of my number.
How can I optimize my code?

The first thing you should do is use this version of Split to avoid getting empty entries (https://msdn.microsoft.com/en-us/library/ms131448(v=vs.110).aspx).
List<string> words = resultString.Split(delimiterChars, StringSplitOptions.RemoveEmptyEntries)
.ToList();
Also, if you know that "data" and "array" are in the string and you never want them, replace them with blanks before you split the string.
resultString = resultString.Replace("data", String.Empty)
.Replace("array", String.Empty);
What I don't understand is how the comma can be both a field delimiter and a meaningful character, and how you can possibly know the difference (i.e. whether 25,50 should be a single value or two values).
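Separately from the splitting, the try/catch around Int32.Parse in the question is a plausible source of the slowdown, because an exception is thrown and caught whenever the element two positions after attrValue is missing or is not an integer. Below is a hedged sketch of the same gluing step using int.TryParse and a bounds check instead. It assumes, as the question does, that the element after attrValue is the integer part and that a following all-digit element is the fractional part. DecimalGluer and GlueDecimals are illustrative names, and this has not been timed against the original.

```csharp
using System;
using System.Collections.Generic;

static class DecimalGluer
{
    // Turns "attrValue", "21", "55" into "attrValue", "21.55" in a new list.
    public static List<string> GlueDecimals(List<string> words)
    {
        var result = new List<string>(words.Count);
        for (int i = 0; i < words.Count; i++)
        {
            // TryParse reports failure through its return value, so nothing is thrown.
            if (words[i] == "attrValue"
                && i + 2 < words.Count
                && int.TryParse(words[i + 2], out _))
            {
                result.Add(words[i]);
                result.Add(words[i + 1] + "." + words[i + 2]);
                i += 2; // skip the two fragments that were just merged
            }
            else
            {
                result.Add(words[i]);
            }
        }
        return result;
    }

    static void Main()
    {
        var words = new List<string> { "attrValue", "21", "55", "name", "Widget" };
        Console.WriteLine(string.Join(" | ", GlueDecimals(words)));
        // attrValue | 21.55 | name | Widget
    }
}
```

Building a new list avoids removing elements from the middle of the original, which would shift everything after them on each removal.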

C# how to split a string backwards?

What I'm trying to do is split a string backwards, meaning right to left.
string startingString = "<span class=\"address\">Hoopeston,, IL 60942</span><br>";
What I would do normally is this.
string[] splitStarting = startingString.Split('>');
so my splitStarting[1] would be "Hoopeston,, IL 60942</span"
Then I would do
string[] splitAgain = splitStarting[1].Split('<');
so splitAgain[0] would be "Hoopeston,, IL 60942"
Now this is what I want to do: split by ' ' (a space) in reverse, on the last 2 instances of ' '.
For example my array would come back like so:
[0]="60942"
[1]="IL"
[2] = "Hoopeston,,"
To make this even harder, I only ever want the first two reverse splits, so normally I would do something like this
string[] splitCityZip = splitAgain[0].Split(new char[] { ' ' }, 3);
but how would you do that backwards? The reason is that it could be a two-word city name, so an extra ' ' would break the city name.

A regular expression with named groups makes this much simpler. No need to reverse strings. Just pluck out what you want.
var pattern = @">(?<city>.*) (?<state>.*) (?<zip>.*?)<";
var expression = new Regex(pattern);
Match m = expression.Match(startingString);
if (m.Success) {
Console.WriteLine("Zip: " + m.Groups["zip"].Value);
Console.WriteLine("State: " + m.Groups["state"].Value);
Console.WriteLine("City: " + m.Groups["city"].Value);
}
Run against an address whose city is "Las Vegas,,", it gives the following results:
Found 1 match:
1. >Las Vegas,, IL 60942< has 3 groups:
1. Las Vegas,, (city)
2. IL (state)
3. 60942 (zip)
String literals for use in programs:
C#
@">(?<city>.*) (?<state>.*) (?<zip>.*?)<"

One possible solution - not optimal but easy to code - is to reverse the string, then to split that string using the "normal" function, then to reverse each of the individual split parts.
Another possible solution is to use regular expressions instead.

I think you should do it like this:
var s = splitAgain[0];
var zipCodeStart = s.LastIndexOf(' ');
var zipCode = s.Substring(zipCodeStart + 1);
s = s.Substring(0, zipCodeStart);
var stateStart = s.LastIndexOf(' ');
var state = s.Substring(stateStart + 1);
var city = s.Substring(0, stateStart );
var result = new [] {zipCode, state, city};
Result will contain what you requested.

If Split could do everything there would be so many overloads that it would become confusing.
Don't use Split; just custom code it with Substring and LastIndexOf.
string str = "Hoopeston,, IL 60942";
string[] parts = new string[3];
int place = str.LastIndexOf(' ');
parts[0] = str.Substring(place+1);
int place2 = str.LastIndexOf(' ',place-1);
parts[1] = str.Substring(place2 + 1, place - place2 -1);
parts[2] = str.Substring(0, place2);

You can use a regular expression to get the three parts of the string inside the tag, and use LINQ extensions to get the strings in the right order.
Example:
string startingString = "<span class=\"address\">East St Louis,, IL 60942</span><br>";
string[] city =
Regex.Match(startingString, @"^.+>(.+) (\S+) (\S+?)<.+$")
.Groups.Cast<Group>().Skip(1)
.Select(g => g.Value)
.Reverse().ToArray();
Console.WriteLine(city[0]);
Console.WriteLine(city[1]);
Console.WriteLine(city[2]);
Output:
60942
IL
East St Louis,,

How about
using System.Linq;
...
splitAgain[0].Split(' ').Reverse().ToArray()
-edit-
OK, I missed the last part about multi-word cities; you can still use LINQ though:
splitAgain[0].Split(' ').Reverse().Take(2).ToArray()
would get you the
[0]="60942"
[1]="IL"
The city would not be included here though, you could still do the whole thing in one statement but it would be a little messy:
var elements = splitAgain[0].Split(' ');
var result = elements
.Reverse()
.Take(2)
.Concat( new[ ] { String.Join( " " , elements.Take( elements.Length - 2 ).ToArray( ) ) } )
.ToArray();
So we're
Splitting the string,
Reversing it,
Taking the two first elements (the last two originally)
Then we make a new array with a single string element, built from the original array of elements minus the last 2 elements (state and ZIP code)
As I said, a little messy, but it will get you the array you want. If you don't need an array in that format, you could obviously simplify the above code a little.
You could also do:
var result = new[ ]{
elements[elements.Length - 1], //last element
elements[elements.Length - 2], //second to last
String.Join( " " , elements.Take( elements.Length - 2 ).ToArray( ) ) //rebuild original string - 2 last elements
};

At first I thought you should use the Array.Reverse() method, but I see now that it is the splitting on ' ' (space) that is the issue.
Your first value could have a space in it (e.g. "New York"), so you don't want to split on every space.
If you know the string is only ever going to have 3 values in it, then you could use String.LastIndexOf(" ") and String.Substring() to trim the last value off, do the same again to find the middle value, and you will be left with the first value, with or without spaces.

Was facing similar issue with audio FileName conventions.
Followed this way: String to Array conversion, reverse and split, and reverse each part back to normal.
char[] addressInCharArray = fullAddress.ToCharArray();
Array.Reverse(addressInCharArray);
string[] parts = (new string(addressInCharArray)).Split(new char[] { ' ' }, 3);
string[] subAddress = new string[parts.Length];
int j = 0;
foreach (string part in parts)
{
addressInCharArray = part.ToCharArray();
Array.Reverse(addressInCharArray);
subAddress[j++] = new string(addressInCharArray);
}

Splitting a CSV and excluding commas within elements

I've got a CSV string and I want to separate it into an array. However, the CSV is a mix of strings and numbers, where the strings are enclosed in quotes and may contain commas.
For example, I might have a CSV as follows:
1,"Hello",2,"World",3,"Hello, World"
I would like it so the string is split into:
1
"Hello"
2
"World"
3
"Hello, World"
If I use String.Split(','); I get:
1
"Hello"
2
"World"
3
"Hello
World"
Is there an easy way of doing this? A library that is already written or do I have to parse the string character by character?

The "A Fast CSV Reader" article on Code Project. I've used it happily many times.

String.Split() is icky for this. Not only does it have nasty corner cases where it doesn't work, like the one you just found (and others you haven't seen yet), but performance is less than ideal as well. The FastCSVReader posted by others will work, there's a decent CSV parser built into the framework (Microsoft.VisualBasic.FileIO.TextFieldParser), and I have a simple parser that behaves correctly posted to this question.
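As a rough illustration of the framework parser mentioned above, the sketch below feeds the sample line from the question to TextFieldParser. It assumes the project can reference the Microsoft.VisualBasic assembly. Note that the parser returns field values without their enclosing quotes, which differs slightly from the output requested in the question. CsvLine.Split is an illustrative wrapper name.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

static class CsvLine
{
    public static string[] Split(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Commas inside a quoted field stay part of that field.
            parser.HasFieldsEnclosedInQuotes = true;
            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }

    static void Main()
    {
        foreach (string field in Split("1,\"Hello\",2,\"World\",3,\"Hello, World\""))
            Console.WriteLine(field);
    }
}
```

For a whole file, the same parser can be constructed from a file path and ReadFields called in a loop while EndOfData is false.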

I would suggest using one of the following solutions, was just testing a few of them (hence the delay):-
Regex matching commas not found within enclosing double quotes
A Fast CSV Reader - for read CSV only
FileHelpers Library 2.0 - for read/write CSV
Hope this helps.

It's not the most elegant solution, but the quickest if you want to just quickly copy and paste code (avoiding having to import DLLs or other code libraries):
private string[] splitQuoted(string line, char delimeter)
{
    List<string> list = new List<string>();
    bool more = true; // true while another field follows the current one
    while (more)
    {
        if (line.StartsWith("\""))
        {
            line = line.Substring(1);
            // Find the closing quote, skipping doubled quotes ("") inside the field
            int idx = line.IndexOf("\"");
            while (idx >= 0 && idx + 1 < line.Length && line[idx + 1] == '"')
            {
                idx = line.IndexOf("\"", idx + 2);
            }
            if (idx < 0) idx = line.Length; // unterminated quote: take the rest
            list.Add(line.Substring(0, idx));
            // Anything after the closing quote starts with the delimiter
            more = idx + 1 < line.Length;
            line = more ? line.Substring(idx + 2) : "";
        }
        else
        {
            int d = line.IndexOf(delimeter);
            more = d >= 0;
            list.Add(more ? line.Substring(0, d) : line);
            line = more ? line.Substring(d + 1) : "";
        }
    }
    return list.ToArray();
}

Parse Text Row with Empty Spaces

I have a file, the text format is like this:
.640 .070 -.390 -.740 -1.030 -1.410 -1.780 -1.840
-1.360 -.360 .860 1.880 2.340 2.250 1.950 1.710
1.410 .700 -.300 -.840 -.280 1.020 1.860 1.460
.310 -.460 -.320 .350 1.020 1.650 2.430 3.070
2.840 1.440 -.460 -1.650 -1.520 -.520 .250 .190
-.420 -.870 -.800 -.280 .570 1.660 2.500 2.220
.520 -1.560 -2.530 -2.030 -1.200 -1.060 -1.230 -.600
.990 2.300 2.180 .940 -.090 -.140 .320 .470
.330 .420 .830 1.080 1.090 1.530 2.740 3.800
3.410 1.610 -.150 -.900 -1.120 -1.640 -2.140 -1.590
.210 2.210 3.290 3.170 2.380 1.880 2.530 4.210
5.280 3.820 -.040 -3.670 -4.190 -1.260 2.930 5.740
5.980 3.920 .540 -2.890 -5.010 -4.780 -2.150 1.640
4.670 5.540 4.230 1.950 .120 -.470 -.010 .340
-.710 -2.940 -4.070 -1.810 3.000 6.590 6.140 2.750
-.490 -2.460 -4.180 -5.660 -4.800 -.560 4.510 6.630
5.140 2.860 2.230 2.510 1.670 -.440 -2.030 -2.330
Note that there is a lot of whitespace between one value and the next.
I tried to read each line and then split it on the ' ' character. My code is something like this:
public List<double> Parse(StreamReader sr)
{
var dataList = new List<double>();
while (sr.Peek() >= 0)
{
string line = sr.ReadLine();
if (lineCount > 1)
{
string[] columns = line.Split(' ');
for (var j = 0; j < columns.Length; j++)
{
dataList.Add(double.Parse(columns[j]));
}
}
}
return dataList ;
}
The problem with the above code is that it can only handle values separated by a single whitespace character.
Any ideas?

The simplest way is probably to use an overload of String.Split which includes a StringSplitOptions parameter, and specify StringSplitOptions.RemoveEmptyEntries.
I would also personally just call ReadLine until that returned null, rather than using TextReader.Peek. Aside from anything else, it's more general - it will work even if the underlying stream (if any) doesn't support seeking.
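Both suggestions can be combined in a short sketch. It assumes every line holds only numeric columns (the lineCount check from the question is left out) and that the numbers use '.' as the decimal separator, hence CultureInfo.InvariantCulture. ColumnParser is an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

static class ColumnParser
{
    public static List<double> Parse(TextReader reader)
    {
        var values = new List<double>();
        string line;
        // ReadLine returns null at the end of the input, so Peek is not needed.
        while ((line = reader.ReadLine()) != null)
        {
            // A null separator array means "split on whitespace";
            // RemoveEmptyEntries drops the gaps between repeated spaces.
            string[] columns = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            foreach (string column in columns)
                values.Add(double.Parse(column, CultureInfo.InvariantCulture));
        }
        return values;
    }

    static void Main()
    {
        var sample = new StringReader("   .640    .070   -.390\n  -1.360   -.360    .860\n");
        Console.WriteLine(string.Join(", ", Parse(sample)));
    }
}
```

Because Parse takes a TextReader, the StreamReader from the question can be passed in unchanged.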

Before you do the split, replace all multi spaces with a single space, something like:
line = System.Text.RegularExpressions.Regex.Replace(line, @" +", @" ");

You can do this with a single line of code. Assuming your text is in a string named input:
string[] values = System.Text.RegularExpressions.Regex.Split(input, @"\s+");
This gives you all the values in a string array. If input starts or ends with whitespace, the array will also contain an empty first or last element.
