String processing / CSV challenge

Having used "SQL Server Bulk insert of CSV file with inconsistent quotes" (the CsvToOtherDelimiter option) as my basis, I discovered a few oddities in the RemoveCSVQuotes part: it chopped the last character from quoted strings that contained a comma! So I rewrote that bit (maybe a mistake?).
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume that if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the regex, but that's way beyond me. What's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.

The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your csv to xml? Then you would be able to verify your data against an xsd before storing into a database.

To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. Scan until quote (go to 2) or comma (go to 3).
2. If the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. Emit the field, go to 1.
4. Scan until quote (go to 5).
5. If the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
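As a rough illustration, here is a minimal C# sketch of a single-line scanner in this spirit. It is a simplified, stricter variant rather than a literal transcription of the steps above: quotes may only open a field, a doubled quote inside a quoted field becomes one literal quote, and anything other than a comma after the closing quote is an error. The names `CsvLineParser` and `ParseLine` are made up for this example, and skipping spaces before a field is an assumption made so that the sample lines parse.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineParser
{
    // Splits one CSV line into fields; throws FormatException on malformed quoting.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        int i = 0;
        while (true)
        {
            field.Clear();
            // Assumption: spaces before a field are not part of the field.
            while (i < line.Length && line[i] == ' ') i++;

            if (i < line.Length && line[i] == '"')
            {
                i++; // consume the opening quote
                bool closed = false;
                while (i < line.Length)
                {
                    char c = line[i++];
                    if (c != '"') { field.Append(c); continue; }
                    // A doubled quote inside a quoted field is one literal quote.
                    if (i < line.Length && line[i] == '"') { field.Append('"'); i++; continue; }
                    closed = true;
                    break;
                }
                if (!closed) throw new FormatException("Unterminated quoted field.");
                while (i < line.Length && line[i] == ' ') i++;
                if (i < line.Length && line[i] != ',')
                    throw new FormatException($"Expected a separator at position {i}.");
            }
            else
            {
                // Unquoted field: read up to the next comma; quotes are not allowed here.
                while (i < line.Length && line[i] != ',')
                {
                    if (line[i] == '"')
                        throw new FormatException($"Unexpected quote at position {i}.");
                    field.Append(line[i++]);
                }
            }

            fields.Add(field.ToString());
            if (i >= line.Length) return fields;
            i++; // consume the comma
        }
    }
}
```

With this sketch, the first three sample lines split into four, three and two fields respectively, while the `""17.5179C,""` line throws a FormatException because the empty quoted string is not followed by a separator.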
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
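One way to try this in C# is PLINQ. The sketch below assumes every record sits on exactly one line; `ParallelLineConverter` and the `convertLine` delegate are placeholders for whatever per-line conversion you end up with, not existing APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelLineConverter
{
    // Applies convertLine to every line in parallel.
    // AsOrdered() keeps the output in the same order as the input lines.
    public static List<string> ConvertAll(IEnumerable<string> lines, Func<string, string> convertLine)
    {
        return lines
            .AsParallel()
            .AsOrdered()
            .Select(convertLine)
            .ToList();
    }
}
```

For a large file, `File.ReadLines(path)` can be passed in as `lines` so that the input is streamed rather than loaded up front. Whether this is actually faster than a single-threaded loop depends on how expensive the per-line work is, so it is worth measuring.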

I ended up using a CSV parser that I didn't know we already had (it comes as part of our code generation tool), and noting that ""17.5179C,"" is not valid and will cause errors.

Related

Escape semicolon in interpolated string C#

I am writing data to .csv file and I need the below expression to be read correctly:
csvWriter.WriteLine($"Security: {sec.ID} ; Serial number: {sec.SecuritySerialNo}");
The semicolon in between is used to put the serial number in a separate cell.
The problem is that ID can also contain semicolons and mess up the data, therefore I need to escape it.
I have tried to use replace:
csvWriter.WriteLine($"Security: {sec.ID.Replace(";", "")} ; Serial number: {sec.SecuritySerialNo}");
However, deleting semicolons is not what I want to achieve; I just want to escape them.

Let's emphasize again that the best way to create a CSV file is through a specialized CSV Parser library.
However, just to resolve the simple case presented by your question you could add double quotes around each field. This should be enough to explain to the Excel parser how to handle your fields.
So, export the fields in this way:
csvWriter.WriteLine($"\"Security: {sec.ID}\";\"Serial number: {sec.SecuritySerialNo}\"");
Notice that I have removed the blank space around the semicolon. This is important; otherwise Excel will not parse the line correctly.
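If a value can itself contain a double quote, wrapping it in quotes is not enough on its own; the usual CSV convention is to double each embedded quote. A minimal sketch of that, where `CsvField.Quote` is a hypothetical helper name and not part of any library:

```csharp
using System;

public static class CsvField
{
    // Wraps a value in double quotes and doubles any quotes inside it,
    // so separators and quotes in the value stay inside one cell.
    public static string Quote(string value)
    {
        string text = value ?? string.Empty;
        return "\"" + text.Replace("\"", "\"\"") + "\"";
    }
}
```

The line from the question could then be written as `csvWriter.WriteLine(CsvField.Quote($"Security: {sec.ID}") + ";" + CsvField.Quote($"Serial number: {sec.SecuritySerialNo}"));`.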

Field and text delimiters within cells in csv files

This is likely a very basic question that, despite trying, I could not find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such, I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls inside a pair of text delimiters. But since I check for the text delimiter, I am afraid the reader might run into an unescaped delimiter that is part of the normal in-cell text (since the normal text delimiter is the quotation mark).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as a text delimiter in OpenOfficeCalc, probably due to encoding differences. Which is annoying but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with Notepad++, which reveals a difference in line breaks (\r instead of \r\n), and the in-cell line break is obviously within a text delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.

From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
      A          B
   ---------    --
1 | 1,",2",      3
   ---------    --
2 | a            c
  | b
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]
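To connect this back to reading the file with Read(): a rough C# sketch of a character-by-character reader that follows the same quoting rules is shown below. The names `QuotedCsvReader` and `ReadRecords` are invented for the example. It is deliberately lenient (it does not reject stray quotes) and it assumes that any CR, LF or CR+LF outside quotes ends a record.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuotedCsvReader
{
    // Yields one list of fields per record. Separators and line breaks inside
    // quotes are kept as data; a doubled quote inside quotes is one literal quote.
    public static IEnumerable<List<string>> ReadRecords(TextReader reader, char separator = ',')
    {
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        bool pending = false; // true once the current record has any content
        int next;
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (inQuotes)
            {
                if (c != '"') field.Append(c);
                else if (reader.Peek() == '"') { reader.Read(); field.Append('"'); }
                else inQuotes = false; // closing quote
            }
            else if (c == '"')
            {
                inQuotes = true;
                pending = true;
            }
            else if (c == separator)
            {
                record.Add(field.ToString());
                field.Clear();
                pending = true;
            }
            else if (c == '\r' || c == '\n')
            {
                // Treat CR+LF as a single record terminator.
                if (c == '\r' && reader.Peek() == '\n') reader.Read();
                if (pending)
                {
                    record.Add(field.ToString());
                    field.Clear();
                    yield return record;
                    record = new List<string>();
                    pending = false;
                }
            }
            else
            {
                field.Append(c);
                pending = true;
            }
        }
        // Emit the last record if the input did not end with a line break.
        if (pending)
        {
            record.Add(field.ToString());
            yield return record;
        }
    }
}
```

Fed the two-record example above through a StringReader, this yields the fields `1,",2",` and `3` for the first record, and `a`, a line break, `b` in one field plus `c` for the second.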

How do I determine a delimiter in a text file

I have 2 types of input files:
1. comma delimited (e.g. lastName, firstName, Address)
2. space delimited (e.g. lastName firstName Address)
The comma delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with ?
I am using C# btw

I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delimited files) a delimited file will have a text qualifier character (often double quotes). A couple of examples of this point:
Space Delimited:
lastName "Von Marshall" is impossible without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
Addresses are generally unworkable unless they are broken into separate fields, or unless having a single solid string is acceptable for your use case.
So the space delim should be easy enough to determine since you're looking for " ". If this is the case I'd (personally) replace all " " with "," to change it to comma-delim. That way you'd only have to build a single method for handling the text, otherwise I imagine you'll need methods for spaces and commas separately.
If your comma-delim file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability: unless you can establish something unique about each of the 2 file types, there's no way. However, if the text has proper text qualifiers, if the files have different file extensions, or if they are generated in different directories, you could use any of those qualities or a mix thereof to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but for anyone to build a perfect example it'd be best if you showed example strings for each file.

As other users have said, without some guarantee that there are no commas in the space-delimited version, you cannot do this with 100% accuracy.
With some extra information, say that there will always be three fields for every record when parsed correctly, you could just parse each line both ways and test the results for the correct number of fields. Address is a big obstacle here, though, since we do not know what that format could be. Also, these rules seem odd at best when talking about addresses... is
1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?

You could count the number of commas per line of the file. If you have at least 2 commas on every line (given that your fields are last name, first name, address), you probably have a comma-separated file. If at least one line has fewer than 2 commas, you should treat the file as space-separated.
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them by spaces and would implement a single read/grab information procedure (considering only space separated files).
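For reference, the comma-counting heuristic from this answer could look roughly like the following in C#. `DelimiterDetector`, `DelimiterKind` and the `expectedFields` parameter are names made up for this sketch, and the result is only a guess: a space-delimited file whose addresses happen to contain commas on every line would be misclassified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum DelimiterKind { Comma, Space }

public static class DelimiterDetector
{
    // Guesses the delimiter: if every non-blank line has at least
    // (expectedFields - 1) commas, assume comma-delimited; otherwise space-delimited.
    public static DelimiterKind Detect(IEnumerable<string> lines, int expectedFields = 3)
    {
        int minCommas = expectedFields - 1;
        bool everyLineHasEnoughCommas = lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .All(line => line.Count(ch => ch == ',') >= minCommas);
        return everyLineHasEnoughCommas ? DelimiterKind.Comma : DelimiterKind.Space;
    }
}
```

Note that an input with no non-blank lines is reported as comma-delimited, simply because there is no line that fails the check.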

Parsing a CSV file

We have an integration with another system that relies on passing CSV files back and forth (really old school).
The structure is generally:
ID, Name, PhoneNumber, comments, fathersname
1, tom, 555-1234, just some random text, bill
2, jill smith, 555-4234, other random text, richard
Every so often we see this:
3, jacked up, 999-1231, here
be dragons
amongst us, ted
The primary problem I care about is detecting that a line break (\n) occurs in the middle of a record when that same character is the record terminator.
Is there any way I can preprocess this to reliably fix it?
Note that we have zero control over what the other system emits.

So you should be able to do something more or less like this:
// Requires: using System; using System.Linq;
for (int i = 0; i < lines.Count; i++)
{
    var fields = lines[i].Split(',').ToList();
    // Too few fields means a line break split this record ("here be dragons amongst us").
    while (fields.Count < numFields)
    {
        i++; // include the next line in this record
        // Make sure we haven't run out of lines.
        if (i >= lines.Count)
            throw new FormatException("File ended in the middle of a record.");
        // Combine the end of the previous field with the start of the next one,
        // and add the line break back in.
        var innerFields = lines[i].Split(',');
        fields[fields.Count - 1] += "\n" + innerFields[0];
        fields.AddRange(innerFields.Skip(1));
    }
    // We now know we have a "real" full line.
    processFields(fields);
}
(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)

Let me start by saying that the CSV file in your example is invalid. If a line break occurs inside a string, the string should be wrapped in double quote characters.
Now for the answer: in order to parse this invalid CSV format you must make several assumptions. In this case I made 2 assumptions: 1) the ID column must be numeric; 2) the comment field cannot contain digits.
Based on these assumptions you can check the first character after the line break. If it is a digit, you assume it's a new record. If not, you treat the line as a continuation of the comment field.
I don't know whether the second assumption is valid; if not, you can enhance the logic so that it covers the business rules of the system.
Good Luck!
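A small C# sketch of that heuristic, under exactly the assumptions stated in this answer (a new record always starts with a digit, and a continuation line never does). `BrokenRecordJoiner` is a made-up name, and the first line is always kept as its own entry so that a header row survives.

```csharp
using System;
using System.Collections.Generic;

public static class BrokenRecordJoiner
{
    // Re-joins physical lines into logical records. A line that does not start
    // with a digit is treated as the continuation of the previous record.
    public static List<string> JoinRecords(IEnumerable<string> lines)
    {
        var records = new List<string>();
        foreach (string line in lines)
        {
            bool startsNewRecord = line.Length > 0 && char.IsDigit(line[0]);
            if (startsNewRecord || records.Count == 0)
                records.Add(line);
            else
                records[records.Count - 1] += "\n" + line; // put the line break back
        }
        return records;
    }
}
```

The joined records can then be split on commas as usual, since each one now holds the full set of fields.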

Firstly, I would recommend using a tool to manage reading and writing your CSV files. I use the FileHelpers library, which is great.
You can essentially type your records and it will do all the validation and such for you. Worth the effort.
As for your question, perhaps you can do some preprocessing on the file and use Regex to replace any line breaks with a space?
I do something similar (though not with files); try:
line.Replace(Environment.NewLine, " ");
With FileHelpers you could write a custom converter to do this during processing, or hook into the BeforeRead event.

Easiest way of working with a comma separated list

I'm about to build a solution where I receive a comma separated list every night. It's a list with around 14000 rows, and I need to go through the list and select some of the values in it.
The document I receive is built up with around 50 semicolon separated values for every "case". How the document is structured:
"";"2010-10-17";"";"";"";Period-Last24h";"Problem is that the customer cant find....";
and so on, with 43 more semicolon statements. And every "case" ends with the value "Total 515";
What I need to do is go through all these "cases" and extract some of the values from them. The "cases" are always built up in the same order, and I know that it's always the 3rd, 15th and 45th semicolon value that I need to extract.
How can I do this in the easiest way?

I think you should decompose this problem into smaller problems. Here are the steps I'd take:
1. Each semicolon-separated record represents a single object. C# is an object-oriented language. Stop thinking in terms of .csv records and start thinking in terms of objects. Break up the input into semicolon-delimited records.
2. Given a single semicolon-separated record, the values represent the properties of your object. Give them meaningful names.
3. Parse each semicolon-separated record into an object. When you're done, you'll have a collection of objects that you can deal with.
4. Use C#'s collections and LINQ to filter your list based on those cases that you need to withdraw. When you're done, you'll have a collection of objects with the desired cases removed.
Don't worry about the "easiest" way. You need one way that works. Whatever you do, get something working and worry about optimizing it to make it easiest, fastest, smallest, etc. later on.

Assuming the "rows" are lines and that you read line by line, your main tool should be string.Split:
foreach (string line in ... )
{
    string[] parts = line.Split(';');
    string part3 = parts[2];   // 3rd value
    string part15 = parts[14]; // 15th value
    // etc
}
Note that this is a simple approach that will fail if the content of any column can contain ';'

You could use String.Split twice.
The first time using "Total 515"; as the split string using this overload. This will give you an array of cases.
The second time using ";" as the split character using this overload on each of the cases. This will give you a data array for each case. As the data is consistent you can extract the 3rd, 15th and 45th elements of this array.
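A rough sketch of the two-pass split, using the `Split(string[], StringSplitOptions)` and `Split(params char[])` overloads. `CaseExtractor` is a made-up name, the terminator string is taken literally from the question, and (as with any plain Split) it assumes no value contains a semicolon.

```csharp
using System;
using System.Collections.Generic;

public static class CaseExtractor
{
    // Splits the document into cases on the "Total 515"; terminator, then splits
    // each case on ';' and keeps the 3rd, 15th and 45th values.
    public static List<string[]> Extract(string document)
    {
        var result = new List<string[]>();
        string[] cases = document.Split(
            new[] { "\"Total 515\";" }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string rawCase in cases)
        {
            string[] parts = rawCase.Trim().Split(';');
            if (parts.Length < 45) continue; // skip incomplete cases (e.g. trailing whitespace)
            result.Add(new[] { parts[2], parts[14], parts[44] });
        }
        return result;
    }
}
```

The values still carry their surrounding quotes at this point; `Trim('"')` on each one would strip them if that is all the cleanup needed.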

I'd search for an existing csv library. The escaping rules are probably not that easily mapped to regex.
If writing a library myself I'd first parse each line into a list/an array of strings. And then in a second step(probably outside of the csv library itself) convert the stringlist to a strongly typed object.

A simple but slow approach would be reading single characters from the input (with the StringReader class, for example). Write a ReadItem method that reads a quote, continues to read until the next quote, and then looks at the next character. If it is a newline or a semicolon, one item has been read. If it is another quote, add a single quote to the item being read. Otherwise, throw an exception. Then use this method to split up the input data into a series of items, with each line stored e.g. in a string[number of items in a row] and the lines stored in a List<>. Then you can use this class to read the CSV data inside another class that decodes the data read into objects that you can get your data out of.
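One possible shape for that ReadItem method is sketched below. It assumes every item is quoted, as in the sample data, and it additionally tolerates a trailing semicolon before the line break, which the description above does not mention. `QuotedItemReader` and `ReadAll` are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuotedItemReader
{
    // Reads one quoted item. endOfLine is set when the item was followed
    // by a line break or by the end of the input instead of a semicolon.
    public static string ReadItem(TextReader reader, out bool endOfLine)
    {
        endOfLine = false;
        if (reader.Read() != '"')
            throw new FormatException("Item must start with a quote.");

        var item = new StringBuilder();
        while (true)
        {
            int c = reader.Read();
            if (c == -1) throw new FormatException("Unterminated item.");
            if (c != '"') { item.Append((char)c); continue; }

            int after = reader.Read(); // the character following a quote decides what it meant
            if (after == '"') { item.Append('"'); continue; } // doubled quote: literal quote
            if (after == ';') return item.ToString();
            if (after == '\n' || after == '\r' || after == -1)
            {
                endOfLine = true;
                return item.ToString();
            }
            throw new FormatException("Unexpected character after closing quote.");
        }
    }

    // Reads all rows; each row is the array of items found on one line.
    public static List<string[]> ReadAll(TextReader reader)
    {
        var rows = new List<string[]>();
        var row = new List<string>();
        while (reader.Peek() != -1)
        {
            int p = reader.Peek();
            if (p == '\r' || p == '\n')
            {
                // Line break after a trailing semicolon (or a blank line): close the row.
                reader.Read();
                if (row.Count > 0) { rows.Add(row.ToArray()); row.Clear(); }
                continue;
            }
            row.Add(ReadItem(reader, out bool endOfLine));
            if (endOfLine) { rows.Add(row.ToArray()); row.Clear(); }
        }
        if (row.Count > 0) rows.Add(row.ToArray());
        return rows;
    }
}
```

Unlike a plain Split(';'), this keeps a semicolon that appears inside a quoted item as part of that item.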
