I can't seem to handle a CSV I got. It's a file generated by a bank, which looks like this:
"000,""PLN"",""XYZ"",""2011-08-31"",""2011-08-31"",""0,00"""
1,""E"",""2011-08-30"",""2011-08-31"",""2011-08-31"",""399,00"",""0000103817846977"",""UZNANIE OTRZYMANE ELIXIR"",""23103015080000000550217023"",""XXX"",""POLISA UBEZPIECZENIA NR XXX "",""000""
3,""E"",""2011-08-31"",""2011-08-31"",""2011-08-31"",""1433,00"",""0000154450232753"",""UZNANIE OTRZYMANE ELIXIR"",""000"",""XXX"",""POLISA UBEZPIECZENIA XXX "",""000""
(I changed all sensitive information).
I've been trying to parse it since morning but no biggie. I used the LINQ to CSV example found somwhere on the net, the CodeProject one (both of them threw an error which said that the CSV is corrupted) and I ended with FileHelpers which SEEMS to work BUT:
It splits the "399,00" and similar values into two fields.
When I use the [(FieldQuoted()] attribute it all goes to hell, since all the fields are quoted in DOUBLE quotation marks. I suspect that is the reason why the other parsers wouldn't work.
Any ideas how to handle it?
If the problem seems to be the double quote, you could preprocess each line by substituting the double double quotes by single double quotes:
line = line.Replace( "\"\"", "\"" );
Once the whole file has been processed, you can let it handled by any other CSV processor.
It will be probably easier to write your own, anyway.
I have been using Lumen, CommonLibrary, FileHelpers etc. and I ended up with TextFieldParser class (from Visual Basic namespace, but can be used in C# without any problem). I recommend you try that. The only downside is that it's relatively slow. But it seems to cope with edge cases quite well.
I even invented a trick getting it to work with obviously invalid CSV files (""" etc.; OpenOffice Calc couldn't handle them properly) - when I'd encounter such a line and got a MalformedLineException, I'd still parse it within the catch block with the HasFieldsEnclosedInQuotes property set to false, for a change.
It would split the line properly, just leaving all the values in double apostrophes. All I had to do then was to remove these double quotes "manually".
Related
This is likely a very basic question that I could not, despite trying, find a satsifying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my unity projects. After some initial research I concluded it would be best to use a .csv file read by a streamreader, so that translators would only ever have to interact with the csv table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for linebreaks and special characters in the actual fields. As such I could not use the normal readLine() method.
This I worked with by using Read() and checking if a linebreak is within a text delimiter bracket. But as I check for the text delimiter, I am afraid it might run into an un-escaped delimiter part of the normal in-cell text (since the normal text delimiter is quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as a text delimiter in OpenOfficeCalc, probably due to encoding differences. Which is annoying but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with NotePad++, revealing a difference in linebreaks (/r instead of /r/n) and obviously it's within a text delimiter bracket, but when it comes to how it seperates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
A B
--------- --
1 | 1,",2", 3
--------- --
2 | a c
| b
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
Ok, I'm racking my brains over this one. It's pretty simple though (I think).
I'm currently creating a text file as a comma separated string of values.
Later, I read in that file data and then use the .split function to split the data by commas.
I discovered that sometimes one of the description fields in the data conatins an embedded comma, which ends up throwing the split command off.
Is there any special character I could use that could pretty much guarantee wouldn't be in the data, or is there a better way to accomplish this? Thanks!
// Initial Load
fullString = fileName + "," + String.Join(",", fieldValues);
// Access later
String[] valuesArray = myString.Split(',');
Short answer, there's no "simple" way to do it using Split. The best you can hope for is to set the deliminator as something cooky that wouldn't ever get used (but even that's not a guarantee).
The simple method would be to used something like CsvHelper (get it through Nuget) or any of the other dozen or so packages that are designed for parsing CSV.
Having used SQL Server Bulk insert of CSV file with inconsistent quotes (CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So.. rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your csv to xml? Then you would be able to verify your data against an xsd before storing into a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
scan until quote (go to 2) or comma (go to 3)
if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
emit the field, go to 1
scan until quote (go to 5)
if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
I ended up using the csv parser that I don't know we had already (comes as part of our code generation tool) - and noting that ""17.5179C,"" is not valid and will cause errors.
I have a CSV file I need to process which is a bit of a nightmare. Esentially it is the following
"Id","Name","Description"
"1","Test1","Test description text"
"2","Test2","<doc><style>body{font-family:"Calibri","sans-serif";}</style><p class="test_class"
name="test_name">Lots of word xdoc content here.</p></doc>"
"guid-xxxx-xxxx-xxxx-xxxx","Test3","Test description text 3"
I'm using the File Helpers library to process the CSV rather than reinvent the wheel. However, due to the description field containing unescaped Word xdoc xml which contains quotes it's getting rather confused when it comes to the start and end points of each record.
The following is an example mapping class.
[DelimitedRecord(","), IgnoreFirst(1), IgnoreEmptyLines()]
public class CSVDoc
{
#region Properties
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Id;
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Name;
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Description;
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
}
I considered (despite my hate of regex for this kind of task) replacing all " with ' and then using ((?<=(^|',))'|'(?=($|,'))) pattern to replace all ' with " at the start and end of lines and where they are formatted ','. However, the dirty file contains some lines which end with a " and some css style attributes which are formatted ","
So now I'm left scratching my head trying to figure out how to do this and how it can be automated.
Any ideas?
You're going to have to re-invent the wheel, because that's not valid CSV or indeed a reasonable file at all - it doesn't have any sort of provably consistent escaping rules (e.g. we don't know if the plain-text columns are escaped correctly or not).
Your best bet is to ask the person producing this to fix the bug, it should be e.g.:
"2","Test2","<doc><style>body{font-family:""Calibri"",""sans-serif"";}</style><p class=""test_class""
name=""test_name"">Lots of word xdoc content here.</p></doc>"
Which your parser should handle fine, and which should not be hard for them to produce in a simple and efficient manner.
Failing that, you'll have to hand-code the parser to:
Read a line.
Check for unescaped " (any "that isn't followed by a " a , or whitespace.
If none found, parse as CSV.
If any found, parse as this horrible thing until you hit the line ending with "
It may be easier to look for < if that is consistently not used in the other lines. Or perhaps for <doc if it consistently identifies the correct rows.
If you don't mind doing some pre-processing before, you can change the first and second "," to "|" and then use FileHelper to parse the file normally (Assuming you don't have | in the last column where there are HTML tags)
The pre-processing could be something like (Simple pseudo code) :
var sb = new StringBuilder()
var regex = new Regex("\",\"");
foreach(string line in textFileLines)
{
sb.AppendLine(regex.Replace(line , "\"|\"", 2));
}
I worked on the CSV-1203 File Format standard a few months ago, so the first thing to realise is that you're not dealing with a CSV file - even though it's named "xyz.CSV".
As said by others here, it will be easier to write your own reader, they're not too difficult. I too have a hatred of everything regex, but the good news is you can code any solution without ever using it.
A couple of things: There's a really weird thing Excel does to CSV files that begin with the two capital letters ID (without quotes). It thinks your CSV is a corrupted SYLK file! Try it.
For details of this issue and a detailed CSV File Format specification, please refer to http://mastpoint.curzonnassau.com/csv-1203
I have a CSV file (which I didn't design and I can't change now nor will I ever be able to change it) that contains lines like the following:
"Surname, Firstname", yes, no, somestring, whatever, etc
As you can see here, the first , is not a comma on which I'd want to split the string. Notice that this particular comma is enclosed within the quotation marks.
Because of this, a simple string.split(',') obviously won't work, as it would give me an array of length 7 for the above string instead of 6.
Is there a way to get around this? I was thinking of using regex to split the string instead but I'm not competent enough in regex to think of a pattern that would only split on commas that are not enclosed inside quotation marks.
I can think of ugly, hacky ways to do it by reading each string char by char but this would have to be a last resort as I'm sure there's a better way to do it!
You can handle this easily by using the TextFieldParser class. Just set HasFieldsEnclosedInQuotes to true.
I would suggest using a CSV parser library - there are other cases that you wouldn't have thought of (new line as part of a quoted field).
The VisualBasic namespace has a nice library that can help - the TextFieldParser.
I know there's a lot of people here who think character-by-character comparisons should never be used and will strongly disagree with me but I'm not convinced companies like Microsoft aren't the only ones who should be doing that sort of programming.
Afterall, Split does character-by-character comparisons so why is it any less ugly when you call existing code that doesn't quite do exactly what you want?
At any rate, my approach was to write my own code. And I've posted the code online at http://www.blackbeltcoder.com/Articles/files/reading-and-writing-csv-files-in-c.