CSV Parsing with double quotes

CSV Parsing with double quotes - c#

I am trying to use C# to parse CSV. I used regular expressions to find "," and read string if my header counts were equal to my match count.
Now this will not work if I have a value like:
"a",""b","x","y"","c"
then my output is:
'a'
'"b'
'x'
'y"'
'c'
but what I want is:
'a'
'"b","x","y"'
'c'
Is there any regex or any other logic I can use for this ?

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well.
*=remember that some locales use [tab] as the C in CSV...

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

I would use FileHelpers if I were you. Regular Expressions are fine but hard to read, especially if you go back, after a while, for a quick fix.
Just for sake of exercising my mind, quick & dirty working C# procedure:
public static List<string> SplitCSV(string line)
{
if (string.IsNullOrEmpty(line))
throw new ArgumentException();
List<string> result = new List<string>();
bool inQuote = false;
StringBuilder val = new StringBuilder();
// parse line
foreach (var t in line.Split(','))
{
int count = t.Count(c => c == '"');
if (count > 2 && !inQuote)
{
inQuote = true;
val.Append(t);
val.Append(',');
continue;
}
if (count > 2 && inQuote)
{
inQuote = false;
val.Append(t);
result.Add(val.ToString());
continue;
}
if (count == 2 && !inQuote)
{
result.Add(t);
continue;
}
if (count == 2 && inQuote)
{
val.Append(t);
val.Append(',');
continue;
}
}
// remove quotation
for (int i = 0; i < result.Count; i++)
{
string t = result[i];
result[i] = t.Substring(1, t.Length - 2);
}
return result;
}

There's an oft quoted saying:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems. (Jamie Zawinski)
Given that there's no official standard for CSV files (instead there are a large number of slightly incompatible styles), you need to make sure that what you implement suits the files you will be receiving. No point in implementing anything fancier than what you need - and I'm pretty sure you don't need Regular Expressions.
Here's my stab at a simple method to extract the terms - basically, it loops through the line looking for commas, keeping track of whether the current index is within a string or not:
public IEnumerable<string> SplitCSV(string line)
{
int index = 0;
int start = 0;
bool inString = false;
foreach (char c in line)
{
switch (c)
{
case '"':
inString = !inString;
break;
case ',':
if (!inString)
{
yield return line.Substring(start, index - start);
start = index + 1;
}
break;
}
index++;
}
if (start < index)
yield return line.Substring(start, index - start);
}
Standard caveat - untested code, there may be off-by-one errors.
Limitations
The quotes around a value aren't removed automatically.
To do this, add a check just before the yield return statement near the end.
Single quotes aren't supported in the same way as double quotes
You could add a separate boolean inSingleQuotedString, renaming the existing boolean to inDoubleQuotedString and treating both the same way. (You can't make the existing boolean do double work because you need the string to end with the same quote that started it.)
Whitespace isn't automatically removed
Some tools introduce whitespace around the commas in CSV files to "pretty" the file; it then becomes difficult to tell intentional whitespace from formatting whitespace.

In order to have a parseable CSV file, any double quotes inside a value need to be properly escaped somehow. The two standard ways to do this are by representing a double quote either as two double quotes back to back, or a backslash double quote. That is one of the following two forms:
""
\"
In the second form your initial string would look like this:
"a","\"b\",\"x\",\"y\"","c"
If your input string is not formatted against some rigorous format like this then you have very little chance of successfully parsing it in an automated environment.

If all your values are guaranteed to be in quotes, look for values, not for commas:
("".*?""|"[^"]*")
This takes advantage of the fact that "the earliest longest match wins" - it looks for double quoted values first, and with a lower priority for normal quoted values.
If you don't want the enclosing quote to be part of the match, use:
"(".*?"|[^"]*)"
and go for the value in match group 1.
As I said: Prerequisite for this to work is well-formed input with guaranteed quotes or double quotes around each value. Empty values must be quoted as well! A nice side-effect is that it does not care for the separator char. Commas, TABs, semi-colons, spaces, you name it. All will work.

FileHelpers supports multiline fields.
You could parse files like these:
a,"line 1
line 2
line 3"
b,"line 1
line 2
line 3"
Here is the datatype declaration:
[DelimitedRecord(",")]
public class MyRecord
{
public string field1;
[FieldQuoted('"', QuoteMode.OptionalForRead, MultilineMode.AllowForRead)]
public string field2;
}
Here is the usage:
static void Main()
{
FileHelperEngine engine = new FileHelperEngine(typeof(MyRecord));
MyRecord[] res = engine.ReadFile("file.csv");
}

Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

FileHelpers for .Net is your friend.

See the link "Regex fun with CSV" at:
http://snippets.dzone.com/posts/show/4430

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Well, I'm no regex wiz, but I'm certain they have an answer for this.
Procedurally it's going through letter by letter. Set a variable, say dontMatch, to FALSE.
Each time you run into a quote toggle dontMatch.
each time you run into a comma, check dontMatch. If it's TRUE, ignore the comma. If it's FALSE, split at the comma.
This works for the example you give, but the logic you use for quotation marks is fundamentally faulty - you must escape them or use another delimiter (single quotes, for instance) to set major quotations apart from minor quotations.
For instance,
"a", ""b", ""c", "d"", "e""
will yield bad results.
This can be fixed with another patch. Rather than simply keeping a true false you have to match quotes.
To match quotes you have to know what was last seen, which gets into pretty deep parsing territory. You'll probably, at that point, want to make sure your language is designed well, and if it is you can use a compiler tool to create a parser for you.
-Adam

I have just try your regular expression in my code..its work fine for formated text with quote ...
but wondering if we can parse below value by Regex..
"First_Bat7679",""NAME","ENAME","FILE"","","","From: "DDD,_Ala%as"#sib.com"
I am looking for result as:
'First_Bat7679'
'"NAME","ENAME","FILE"'
''
''
'From: "DDD,_Ala%as"#sib.com'
Thanx

Related

Split string with plus sign as a delimiter

I have an issue with a string containing the plus sign (+).
I want to split that string (or if there is some other way to solve my problem)
string ColumnPlusLevel = "+-J10+-J10+-J10+-J10+-J10";
string strpluslevel = "";
strpluslevel = ColumnPlusLevel;
string[] strpluslevel_lines = Regex.Split(strpluslevel, "+");
foreach (string line in strpluslevel_lines)
{
MessageBox.Show(line);
strpluslevel_summa = strpluslevel_summa + line;
}
MessageBox.Show(strpluslevel_summa, "summa sumarum");
The MessageBox is for my testing purpose.
Now... The ColumnPlusLevel string can have very varied entry but it is always a repeated pattern starting with the plus sign.
i.e. "+MJ+MJ+MJ" or "+PPL14.1+PPL14.1+PPL14.1" as examples.
(It comes form Another software and I cant edit the output from that software)
How can I find out what that pattern is that is being repeated?
That in this exampels is the +-J10 or +MJ or +PPL14.1
In my case above I have tested it by using only a MessageBox to show the result but I want the repeated pattering stored in a string later on.
Maybe im doing it wrong by using Split, maybe there is another solution.
Maybe I use Split in the wrong way.
Hope you understand my problem and the result I want.
Thanks for any advice.
/Tomas

How can I find out what that pattern is that is being repeated?
Maybe i didn't understand the requirement fully, but isn't it easy as:
string[] tokens = ColumnPlusLevel.Split(new[]{'+'}, StringSplitOptions.RemoveEmptyEntries);
string first = tokens[0];
bool repeatingPattern = tokens.Skip(1).All(s => s == first);
If repeatingPattern is true you know that the pattern itself is first.
Can you maybe explain how the logic works
The line which contains tokens.Skip(1) is a LINQ query, so you need to add using System.Linq at the top of your code file. Since tokens is a string[] which implements IEnumerable<string> you can use any LINQ (extension-)method. Enumerable.Skip(1) will skip the first because i have already stored that in a variable and i want to know if all others are same. Therefore i use All which returns false as soon as one item doesn't match the condition(so one string is different to the first). If all are same you know that there is a repeating pattern which is already stored in the variable first.

You should use String.Split function :
string pattern = ColumnPlusLevel.Split("+")[0];

...but it is always a repeated pattern starting with the plus sign.
Why do you even need String.Split() here if the pattern always only repeats itself?
string input = #"+MJ+MJ+MJ";
int indexOfSecondPlus = input.IndexOf('+', 1);
string pattern = input.Remove(indexOfSecondPlus, input.Length - indexOfSecondPlus);
//pattern is now "+MJ"
No need of string split, no need to use LinQ

String has a method called Split which let's you split/divide the string based on a given character/character-set:
string givenString = "+-J10+-J10+-J10+-J10+-J10"'
string SplittedString = givenString.Split("+")[0] ///Here + is the character based on which the string would be splitted and 0 is the index number
string result = SplittedString.Replace("-","") //The mothod REPLACE replaces the given string with a targeted string,i added this so that you can get the numbers only from the string

Most efficient way to parse a delimited string in C#

This has been asked a few different ways but I am debating on "my way" vs "your way" with another developer. Language is C#.
I want to parse a pipe delimited string where the first 2 characters of each chunk is my tag.
The rules. Not my rules but rules I have been given and must follow.
I can't change the format of the string.
This function will be called possibly many times so efficiency is key.
I need to keep is simple.
The input string and tag I am looking for may/will change during runtime.
Example input string: AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4
Example tag I may need value for: AB
I split string into an array based on delimiter and loop through the array each time the function is called. I then looked at the first 2 characters and return the value minus the first 2 characters.
The "other guys" way is to take the string and use a combination of IndexOf and SubString to find the starting point and ending point of the field I am looking for. Then using SubString again to pullout the value minus the first 2 characters. So he would say IndexOf("|AB") the find then next pipe in the string. This would be the start and end. Then SubString that out.
Now I should think that IndexOf and SubString would parse the string each time at a char by char level so this would be less efficient than using large chunks and reading the string minus the first 2 characters. Or is there another way the is better then what both of us has proposed?

The other guy's approach is going to be more efficient in time given that input string needs to be reevaluated each time. If the input string is long, it is also won't require the extra memory that splitting the string would.
If I'm trying to code a really tight loop I prefer to directly use array/string operators rather than LINQ to avoid that additional overhead:
string inputString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4";
static string FindString(string tag)
{
int startIndex;
if (inputString.StartsWith(tag))
{
startIndex = tag.Length;
}
else
{
startIndex = inputString.IndexOf(string.Format("|{0}", tag));
if (startIndex == -1)
return string.Empty;
startIndex += tag.Length + 1;
}
int endIndex = inputString.IndexOf('|', startIndex);
if (endIndex == -1)
endIndex = inputString.Length;
return inputString.Substring(startIndex, endIndex - startIndex);
}

I've done a lot of parsing in C# and I would probably take the approach suggested by the "other guys" just because it would be a bit lighter on resources used and likely to be a little faster as well.
That said, as long as the data isn't too big, there's nothing wrong with the first approach and it will be much easier to program.

Something like this may work ok
string myString = "AOVALUE1|ABVALUE2|ACVALUE3|ADVALUE4";
string selector = "AB";
var results = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Replace(selector, ""));
Returns: list of the matches, in this case just one "VALUE2"
If you are just looking for the first or only match this will work.
string result = myString.Split('|').Where(x => x.StartsWith(selector)).Select(x => x.Replace(selector, "")).FirstOrDefault();

SubString does not parse the string.
IndexOf does parse the string.
My preference would be the Split method, primarily code coding efficiency:
string[] inputArr = input.Split("|".ToCharArray()).Select(s => s.Substring(3)).ToArray();
is pretty concise. How many LoC does the substring/indexof method take?

Regular expression for parsing CSV

I'm trying to parse a CSV file in C#. Split on commas (,). I got it to work with this:
[\t,](?=(?:[^\"]|\"[^\"]*\")*$)
Splitting this string:
2012-01-06,"Some text with, comma",,"300,00","143,52"
Gives me:
2012-01-06
"Some text with, comma"
"300,00"
"143,52"
But I can't figure out how to lose the "" from the output so I get this instead:
2012-01-06
Some text with, comma
300,00
143,52
Any suggestions?

If you are trying to parse a CSV and using .NET, don't use regular expressions. Use a component that was created for this purpose. See the question CSV File Imports in .Net.
I know the CSV specification looks simple enough, but trust me, you are in for heartache and destruction if you continue down this path.

Why are you using regular expressions for this? Ensuring the file is well-formed?
You can use String.Replace()
String s = "Some text with, comma";
s = s.Replace("\"", "");
// After matched
String line = 2012-01-06,"Some text with, comma",,"300,00","143,52";
String []fields = line.Split(',');
for (int i = 0; i < fields.Length; i++)
{
// Call a function to remove quotes
fields[i] = removeQuotes(fields[i]);
}
String removeQuotes(String s)
{
return s.Replace("\"", "");
}

So, something like this. Again, I wouldn't use RegEx for this purpose, but YMMV.
var sp = Regex.Split(a, "[\t,](?=(?:[^\"]|\"[^\"]*\")*$)")
.Select(s => Regex.Replace(s.Replace("\"\"","\""),"^\"|\"$","")).ToArray();
So, the idea here is that first of all, you want to replace double double quotes with a single double quote. And then that string is fed to the second regex which simply removes double quotes at the beginning and end of the string.
The reason for the first replace is because of strings like this:
var a = "1999,Chevy,\"Venture \"\"Extended Edition, Very Large\"\" Dude\",\"\",\"5000.00\"";
So, this would give you a string like this: ""Extended Edition"", and the double quotes need to be changed to single quotes.

Find if string contains at least 2 characters similar to another? C#

I need a method to check if a string contains one or more similar characters to another. I dont want to find all strings containing the letter "D".
For example, if I have a string "Christopher" and want to see if "Chris" is contained in "Christopher", I want that to return. However, if I want to see if "Candy" is in the string "Christopher", I wont want it to return just because it has a "C" in common.
I have tried the .Contains() method but cant give that rules for 2 or more similar characters and I have thought about using regular expressions but that might be a bit over kill. The similar letters must be next to eachother.
Thank you :)

This looks for each 2-character-gram of s1 and looks for it in s2.
string s1 = "Chrx";
string s2 = "Christopher";
IsMatchOn2Characters(s1, s2);
static bool IsMatchOn2Characters(string a, string b)
{
string s1 = a.ToLowerInvariant();
string s2 = b.ToLowerInvariant();
for (int i = 0; i < s1.Length - 1; i++)
{
if (s2.IndexOf(s1.Substring(i,2)) >= 0)
return true; // match
}
return false; // no match
}

This looks a lot like a longest common substring problem. This can be solved easily using DP in O(m*n).
If you are not worried about performance and don't really want to implement this, you can also go with the brute force solution of searching every substring of s1 into s2.

Parse without string split

This is a spin-off from the discussion in some other question.
Suppose I've got to parse a huge number of very long strings. Each string contains a sequence of doubles (in text representation, of course) separated by whitespace. I need to parse the doubles into a List<double>.
The standard parsing technique (using string.Split + double.TryParse) seems to be quite slow: for each of the numbers we need to allocate a string.
I tried to make it old C-like way: compute the indices of the beginning and the end of substrings containing the numbers, and parse it "in place", without creating additional string. (See http://ideone.com/Op6h0, below shown the relevant part.)
int startIdx, endIdx = 0;
while(true)
{
startIdx = endIdx;
// no find_first_not_of in C#
while (startIdx < s.Length && s[startIdx] == ' ') startIdx++;
if (startIdx == s.Length) break;
endIdx = s.IndexOf(' ', startIdx);
if (endIdx == -1) endIdx = s.Length;
// how to extract a double here?
}
There is an overload of string.IndexOf, searching only within a given substring, but I failed to find a method for parsing a double from substring, without actually extracting that substring first.
Does anyone have an idea?

There is no managed API to parse a double from a substring. My guess is that allocating the string will be insignificant compared to all the floating point operations in double.Parse.
Anyway, you can save the allocation by creating a "buffer" string once of length 100 consisting of whitespace only. Then, for every string you want to parse, you copy the chars into this buffer string using unsafe code. You fill the buffer string with whitespace. And for parsing you can use NumberStyles.AllowTrailingWhite which will cause trailing whitespace to be ignored.
Getting a pointer to string is actually a fully supported operation:
string l_pos = new string(' ', 100); //don't write to a shared string!
unsafe
{
fixed (char* l_pSrc = l_pos)
{
// do some work
}
}
C# has special syntax to bind a string to a char*.

if you want to do it really fast, i would use a state machine
this could look like:
enum State
{
Separator, Sign, Mantisse etc.
}
State CurrentState = State.Separator;
int Prefix, Exponent, Mantisse;
foreach(var ch in InputString)
{
switch(CurrentState)
{ // set new currentstate in dependence of ch and CurrentState
case Separator:
GotNewDouble(Prefix, Exponent, Mantisse);
}
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

CSV Parsing with double quotes - c#

CSV, when dealing with things like multi-line, quoted, different delimiters* etc - can get trickier than you might think... perhaps consider a pre-rolled answer? I use this, and it works very well. *=remember that some locales use [tab] as the C in CSV...

CSV is a great example for code reuse - No matter which one of the csv parsers you choose, don't choose your own. Stop Rolling your own CSV parser

Try CsvHelper (a library I maintain) or FastCsvReader. Both work well. CsvHelper does writing also. Like everyone else has been saying, don't roll your own. :P

FileHelpers for .Net is your friend.

See the link "Regex fun with CSV" at: http://snippets.dzone.com/posts/show/4430

The Lumenworks CSV parser (open source, free but needs a codeproject login) is by far the best one I've used. It'll save you having to write the regex and is intuitive to use.

Related

Split string with plus sign as a delimiter

Most efficient way to parse a delimited string in C#

Regular expression for parsing CSV

Find if string contains at least 2 characters similar to another? C#

Parse without string split

Categories

Resources