How do I determine a delimiter in a text file - c#

I have 2 types of input files:
1. comma-delimited (e.g. lastName, firstName, Address)
2. space-delimited (e.g. lastName firstName Address)
The comma-delimited file HAS spaces between the ',' and the next word.
How do I go about determining which file I am dealing with?
I am using C#, by the way.

I've done tons of work with various delimited file types and as everyone else is saying, without normalization you can't really handle the whole thing programmatically.
Generally (and it seems like it would be totally necessary for space-delimited files) a delimited file will have a text-qualifier character (often double quotes). A couple of examples of these points:
Space Delimited:
lastName "Von Marshall" is impossible
without qualifiers.
Addresses would be altogether impossible as well.
Comma Delimited:
Addresses are generally unworkable unless they are broken into separate fields, or unless a single unbroken string is acceptable for your use case.
So the space-delimited case should be easy enough to determine since you're looking for " ". If this is the case, I'd (personally) replace all " " with "," to change it to comma-delimited. That way you'd only have to build a single method for handling the text; otherwise I imagine you'll need separate methods for spaces and commas.
If your comma-delimited file does not have a text qualifier, you're in a really tricky spot. I haven't found any "perfect" way of addressing this without any human work, but it can be minimized. I've used Notepad++ a lot to do batch replacement with its regular expression functions.
However, you can also use C#'s regex abilities. Here's what MSDN says on that.
So, to answer your question to the best of my ability: unless you can establish some uniqueness between the 2 file types, there's no way. However, if the text has proper text qualifiers, the files have different file extensions, or they are generated in different directories, you could use any of those qualities, or a mix thereof, to decide what type of file it is. I have no experience doing this as yet (though I've just started a project using it), so I can't give an exact example, but I can say that for anyone to build a concrete example, it'd be best if you showed sample strings from each file.

As other users have said, without some guarantee that there are no commas in the space-delimited version, you cannot do this with 100% accuracy.
With some extra information, say that there will always be three fields for all records when parsed correctly, you could just parse both ways and test the results for the correct number of fields (a sketch follows below). The address is a big stumbling block here, though, since we do not know what that format could be. Also, these rules seem odd at best when talking about addresses... is
1111somestreest.houston,tx11111 or
1111 somestreet st. Houston, Tx 11111
a valid format?
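To illustrate that parse-both-ways-and-check-the-field-count idea, here's a rough C# sketch; the expected field count of three and the sample line are assumptions, and a comma inside the address will still defeat it:

using System;

class FieldCountCheck
{
    const int ExpectedFields = 3; // assumed: lastName, firstName, address

    // Parse with commas first; if the field count is wrong, fall back to spaces.
    // Returns null when neither delimiter yields the expected number of fields
    // (e.g. when the address itself contains extra commas or spaces).
    static string[] ParseRecord(string line)
    {
        string[] byComma = Array.ConvertAll(line.Split(','), f => f.Trim());
        if (byComma.Length == ExpectedFields)
            return byComma;

        string[] bySpace = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return bySpace.Length == ExpectedFields ? bySpace : null;
    }

    static void Main()
    {
        string[] fields = ParseRecord("Smith, John, 123 Main St");
        Console.WriteLine(fields == null ? "could not tell" : string.Join(" | ", fields));
    }
}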

You could count the number of commas per line of the file. If you have at least 2 commas per line (considering your fields are last name, first name, address), you probably have a comma-separated file. If you have, in at least one line, fewer than 2 commas, you should consider it space-separated.
I, however, would skip this step and ignore the commas when evaluating the input by replacing all of them with spaces, and would implement a single read/parse procedure (considering only space-separated files).
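If you do want the counting heuristic from the first paragraph, a minimal C# sketch might look like this (the file name and the two-comma threshold for a three-field record are assumptions):

using System;
using System.IO;
using System.Linq;

class DelimiterSniffer
{
    // At least 2 commas on every non-empty line suggests comma-separated;
    // otherwise treat the file as space-separated.
    static char DetectDelimiter(string path)
    {
        var lines = File.ReadLines(path).Where(l => l.Trim().Length > 0);
        return lines.All(l => l.Count(c => c == ',') >= 2) ? ',' : ' ';
    }

    static void Main()
    {
        // "input.txt" is a placeholder path for this example.
        Console.WriteLine("Detected delimiter: '" + DetectDelimiter("input.txt") + "'");
    }
}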

Related

Field and text delimiters within cells in csv files

This is likely a very basic question that I could not, despite trying, find a satisfying answer to. Feel free to skip to the question at the end if you aren't interested in the background.
The task:
I wish to create an easy localisation solution for my Unity projects. After some initial research I concluded it would be best to use a .csv file read by a StreamReader, so that translators would only ever have to interact with the CSV table, where information is neatly organized.
The main problem:
Due to the nature of the text, I need to account for line breaks and special characters in the actual fields. As such I could not use the normal ReadLine() method.
I worked around this by using Read() and checking whether a line break falls inside a text-delimiter bracket. But as I check for the text delimiter, I am afraid it might run into an unescaped delimiter that is part of the normal in-cell text (since the normal text delimiter is quotation marks).
So I switched the delimiter to §. But now every time I open the file I have to re-enter § as the text delimiter in OpenOffice Calc, probably due to encoding differences. That is annoying, but not the end of the world.
My question:
How does OpenOffice (or similar software) usually tell in-cell commas/quotation marks apart from the ones used as delimiters? If I knew that, I could probably incorporate a similar approach in my reading of the file.
I've tried to look at the files with Notepad++, revealing a difference in line breaks (\r instead of \r\n), and obviously it's within a text-delimiter bracket, but when it comes to how it separates its delimiters from ones just entered in the text/field, I am drawing a blank.
Translation file in OpenOffice Calc:
Translation file in NotePad++, showing all characters:
I'd appreciate any insight or links on the topic.
From https://en.wikipedia.org/wiki/Comma-separated_values:
The CSV file format is not fully standardized. The basic idea of separating fields with a comma is clear, but that idea gets complicated when the field data may also contain commas or even embedded line breaks.
LibreOffice Calc has a reasonable way to handle these things.
Use LF for line breaks and CR at the end of each record. It seems your code already handles this.
Use quotes to delimit strings when needed. If the string contains one or more quotes, then duplicate the quote to make it literal.
From the example in your question, it looks like you told Calc not to use any quotes as string delimiters. Why did you do this? When I tried it, LibreOffice (or Apache OpenOffice) showed the fields in different columns after opening the file saved that way.
The following example CSV file has fields that contain commas, quotes and line breaks.
When viewed in Calc:
    |     A     | B
----+-----------+----
  1 | 1,",2",   | 3
----+-----------+----
  2 | a         | c
    | b         |
Calc correctly reads and saves the file as shown below. Settings when saving are Field delimiter , and String delimiter " which are the defaults.
"1,"",2"",",3[CR]
"a
b",c[CR]

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (like 20 chars, for example)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as a misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. it originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely to find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think you need not a regex, but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes for both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. If you use SQLite, it has a soundex() function (though it is only included when compiled with SQLITE_SOUNDEX).
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go; this way you can get the correct word back.
But this would not look anything like regex - sorry.
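If you want to stay in C#, a simplified Soundex sketch is shown below; it skips the h/w adjacency rule of strict Soundex, but it is enough to show the bucketing idea:

using System;
using System.Text;

class SoundexDemo
{
    // Simplified American Soundex: keep the first letter, then encode the
    // remaining consonants, collapsing adjacent duplicates, up to 4 chars.
    static string Soundex(string word)
    {
        const string codes = "01230120022455012623010202"; // codes for a..z
        var sb = new StringBuilder();
        char prev = '0';

        foreach (char ch in word.ToLowerInvariant())
        {
            if (ch < 'a' || ch > 'z') continue;
            char code = codes[ch - 'a'];

            if (sb.Length == 0)
                sb.Append(char.ToUpperInvariant(ch)); // keep the first letter
            else if (code != '0' && code != prev)
            {
                sb.Append(code);
                if (sb.Length == 4) break;
            }
            prev = code;
        }
        return sb.ToString().PadRight(4, '0');
    }

    static void Main()
    {
        Console.WriteLine(Soundex("misspelling")); // M214
        Console.WriteLine(Soundex("mshpeln"));     // M214 -> same bucket
    }
}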
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for 1 or 2 people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, a collection of elements guaranteed to be unique) with all of the unique letters that are in misspelling: {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform a set intersection on this set and the set of the word you are matching against; this will give you the letters that are contained in both sets. If this set has 5 or more characters left in it, you have a possible match (a sketch follows below).
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words, and so on, based on what your requirements are.
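A rough C# sketch of steps 1-3 above; the threshold of 5 shared letters and the sample text are assumptions:

using System;
using System.Collections.Generic;

class FuzzyLetterMatch
{
    // Flag words whose unique letters share at least `threshold` letters
    // with the target word (set intersection, as described above).
    static IEnumerable<string> PossibleMatches(string target, string text, int threshold)
    {
        var targetLetters = new HashSet<char>(target.ToLowerInvariant());

        foreach (string word in text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            var letters = new HashSet<char>(word.ToLowerInvariant());
            letters.IntersectWith(targetLetters);
            if (letters.Count >= threshold)
                yield return word;
        }
    }

    static void Main()
    {
        foreach (string hit in PossibleMatches("misspelling", "the OCR read mshpeln instead", 5))
            Console.WriteLine(hit); // prints "mshpeln"
    }
}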
I have no solution for this problem, in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error that was made by the OCR algorithm, as it can range anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Let nello world be the first guess for "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that these words would have been similar if they had been better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Importing csv values having multiple comma in one column in Asp.net

I am having an issue importing a CSV file. The problem arises when an address field has multiple comma-separated values, e.g. home no, street no, town, etc.
I tried to use http://forums.asp.net/t/1705264.aspx/1 (this article), but the problem was not solved because a single field contains multiple comma-separated values.
Any idea or solution? I didn't find any help elsewhere.
Thanks
Don't split the string yourself. Parsing CSV files is not trivial, and using str.Split(',') will give you a lot of headaches. Try using a more robust library like CsvHelper
- https://github.com/JoshClose/CsvHelper
If that doesn't work then the format of the file is probably incorrect. Check to make sure the text fields are properly quoted.
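A minimal CsvHelper sketch might look like the following; the file path and the Record properties are assumptions (it also assumes the file has a header row), and the exact CsvReader constructor varies a little between CsvHelper versions:

using System;
using System.Globalization;
using System.IO;
using CsvHelper;

class Record
{
    public string Name { get; set; }
    public string Address { get; set; } // a quoted "home no, street no, town" stays one field
}

class Program
{
    static void Main()
    {
        using (var reader = new StreamReader("addresses.csv"))
        using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
        {
            foreach (var record in csv.GetRecords<Record>())
                Console.WriteLine(record.Name + " -> " + record.Address);
        }
    }
}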
Do you control the format of the CSV file? You could see about changing it to qualify the values by surrounding them with double quotes (""). Another option is to switch to a different delimiter like tabs.
Barring that, if the address is always in the same format, you could read in the file, and then inspect each record and manually concatenate columns 2, 3, and 4.
Are the fields surrounded by quotation marks? If so, split on "," rather than just ,.
Is the address field at the beginning or end of a record? If so, you can ignore the first x commas (if at the beginning) or split only the correct number of fields (if at the end).
Ideally, if you have control of the source file's creation, you would change the delimiter of either the address sub-fields, or the record fields.
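As a rough sketch of the 'split only the correct number of fields' idea mentioned above, assuming the address is the last of four fields:

using System;

class LimitedSplit
{
    static void Main()
    {
        // Assumed layout: id, name, city, address (address last, may contain commas).
        string line = "42, John Smith, Houston, home no 1, street no 2, some town";

        // Split into at most 4 pieces; everything after the 3rd comma stays together.
        string[] fields = line.Split(new[] { ',' }, 4);

        foreach (string field in fields)
            Console.WriteLine(field.Trim());
    }
}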

String processing / CSV challenge

Having used SQL Server Bulk insert of CSV file with inconsistent quotes (the CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So.. I rewrote that bit (maybe a mistake?)
One wrinkle is that the client has asked 'what about data like this?'
""17.5179C,""
I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?
To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.
The accepted answer from your link starts with:
You are going to need to preprocess the file, period.
Why not transform your CSV to XML? Then you would be able to verify your data against an XSD before storing it in a database.
To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.
The state machine could go like this:
1. scan until quote (go to 2) or comma (go to 3)
2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
3. emit the field, go to 1
4. scan until quote (go to 5)
5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.
This should correctly scan stuff like:
hello, world, 123, 456
"hello world", 123, 456
"He said ""Hello, world!""", "and I said hi"
""17.5179C,"" (correctly reports an error, since there should be a
separator between the first quoted string "" and the second field
17.5179C).
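Here is a rough C# sketch along those lines; it collapses the states a little, keeps surrounding spaces on unquoted fields, and follows the error-reporting behaviour claimed above for ""17.5179C,"":

using System;
using System.Collections.Generic;
using System.Text;

static class CsvLineSplitter
{
    // One line in, a list of fields out. Doubled quotes ("") inside a quoted
    // field become a literal quote, and a quoted field must be followed by a
    // separator (or end of line), so ""17.5179C,"" is rejected.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        int i = 0;

        while (i < line.Length)
        {
            char c = line[i];

            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    field.Append('"');               // "" -> literal quote
                    i += 2;
                }
                else if (c == '"')
                {
                    inQuotes = false;                // closing quote
                    i++;
                    while (i < line.Length && line[i] == ' ') i++;
                    if (i < line.Length && line[i] != ',')
                        throw new FormatException("Expected separator after quoted field at position " + i);
                }
                else
                {
                    field.Append(c);
                    i++;
                }
            }
            else if (c == '"')
            {
                if (field.ToString().Trim().Length > 0)
                    throw new FormatException("Unexpected quote at position " + i);
                field.Clear();                       // drop whitespace before the opening quote
                inQuotes = true;
                i++;
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());        // emit field, back to scanning
                field.Clear();
                i++;
            }
            else
            {
                field.Append(c);
                i++;
            }
        }

        if (inQuotes)
            throw new FormatException("Unterminated quoted field.");

        fields.Add(field.ToString());
        return fields;
    }

    static void Main()
    {
        // The third sample line above comes back as two fields.
        foreach (string f in SplitLine("\"He said \"\"Hello, world!\"\"\", \"and I said hi\""))
            Console.WriteLine(f);
    }
}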
Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?
edit:
You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.
I ended up using the CSV parser that I didn't know we already had (it comes as part of our code generation tool), noting that ""17.5179C,"" is not valid and will cause errors.

Splitting string on commas when data can contain commas

I have a CSV file (which I didn't design and I can't change now nor will I ever be able to change it) that contains lines like the following:
"Surname, Firstname", yes, no, somestring, whatever, etc
As you can see here, the first , is not a comma on which I'd want to split the string. Notice that this particular comma is enclosed within the quotation marks.
Because of this, a simple string.Split(',') obviously won't work, as it would give me an array of length 7 for the above string instead of 6.
Is there a way to get around this? I was thinking of using regex to split the string instead but I'm not competent enough in regex to think of a pattern that would only split on commas that are not enclosed inside quotation marks.
I can think of ugly, hacky ways to do it by reading each string char by char but this would have to be a last resort as I'm sure there's a better way to do it!
You can handle this easily by using the TextFieldParser class. Just set HasFieldsEnclosedInQuotes to true.
I would suggest using a CSV parser library; there are other cases that you might not have thought of (e.g. a newline as part of a quoted field).
The Microsoft.VisualBasic.FileIO namespace has a nice class that can help: the TextFieldParser.
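For example, a minimal TextFieldParser sketch could look like this (the file path is a placeholder, and you'll need a reference to the Microsoft.VisualBasic assembly or NuGet package):

using System;
using Microsoft.VisualBasic.FileIO;

class Program
{
    static void Main()
    {
        using (var parser = new TextFieldParser("data.csv"))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // keeps "Surname, Firstname" as one field
            parser.TrimWhiteSpace = true;            // drops the space after each comma

            while (!parser.EndOfData)
            {
                string[] fields = parser.ReadFields();
                Console.WriteLine(string.Join(" | ", fields));
            }
        }
    }
}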
I know there are a lot of people here who think character-by-character parsing should never be used and will strongly disagree with me, but I'm not convinced that companies like Microsoft are the only ones who should be doing that sort of programming.
After all, Split does character-by-character comparisons, so why is it any less ugly when you call existing code that doesn't quite do exactly what you want?
At any rate, my approach was to write my own code. And I've posted the code online at http://www.blackbeltcoder.com/Articles/files/reading-and-writing-csv-files-in-c.
