Data processing puzzle/headache

I have a CSV file I need to process, and it's a bit of a nightmare. Essentially it looks like this:
"Id","Name","Description"
"1","Test1","Test description text"
"2","Test2","<doc><style>body{font-family:"Calibri","sans-serif";}</style><p class="test_class"
name="test_name">Lots of word xdoc content here.</p></doc>"
"guid-xxxx-xxxx-xxxx-xxxx","Test3","Test description text 3"
I'm using the FileHelpers library to process the CSV rather than reinvent the wheel. However, because the description field contains unescaped Word xdoc XML, which itself contains quotes, the parser gets confused about where each record starts and ends.
The following is an example mapping class.
[DelimitedRecord(","), IgnoreFirst(1), IgnoreEmptyLines()]
public class CSVDoc
{
#region Properties
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Id;
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Name;
[FieldQuoted('"', QuoteMode.AlwaysQuoted), FieldTrim(TrimMode.Both)]
public string Description;
#endregion
}
I considered (despite my hate of regex for this kind of task) replacing all " with ' and then using ((?<=(^|',))'|'(?=($|,'))) pattern to replace all ' with " at the start and end of lines and where they are formatted ','. However, the dirty file contains some lines which end with a " and some css style attributes which are formatted ","
So now I'm left scratching my head trying to figure out how to do this and how it can be automated.
Any ideas?

You're going to have to re-invent the wheel, because that's not valid CSV or indeed a reasonable file at all - it doesn't have any sort of provably consistent escaping rules (e.g. we don't know if the plain-text columns are escaped correctly or not).
Your best bet is to ask whoever produces this file to fix the bug. The output should look like this, for example:
"2","Test2","<doc><style>body{font-family:""Calibri"",""sans-serif"";}</style><p class=""test_class""
name=""test_name"">Lots of word xdoc content here.</p></doc>"
Which your parser should handle fine, and which should not be hard for them to produce in a simple and efficient manner.
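For what it's worth, the fix on the producing side is small. The following is only a sketch of the usual doubling rule; the names CsvEscaping, QuoteField and ToCsvLine are made up for the example and are not part of FileHelpers or any other library.

```csharp
using System.Linq;

public static class CsvEscaping
{
    // Wraps a field in double quotes and doubles every quote inside it,
    // which is the escaping shown in the corrected sample line.
    public static string QuoteField(string value)
    {
        return "\"" + (value ?? string.Empty).Replace("\"", "\"\"") + "\"";
    }

    // Joins already-escaped fields with commas to form one CSV record.
    public static string ToCsvLine(params string[] fields)
    {
        return string.Join(",", fields.Select(QuoteField));
    }
}
```

Calling CsvEscaping.ToCsvLine("2", "Test2", xdocText) would then produce a record in which every quote inside the xdoc content is doubled, so embedded commas and line breaks stay inside the quoted field.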
Failing that, you'll have to hand-code the parser to:
Read a line.
Check for an unescaped " (any " that isn't followed by a ", a comma, or whitespace).
If none found, parse as CSV.
If any found, parse as this horrible thing until you hit the line ending with "
It may be easier to look for < if that is consistently not used in the other lines. Or perhaps for <doc if it consistently identifies the correct rows.
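Here is one way the hand-coded route could look, as a minimal sketch rather than a finished parser. It deviates a little from the steps above: in the sample, the first physical line of record 2 itself ends with a quote, so instead of looking for the end of a record it looks for the start of the next one. It assumes exactly three columns, that Id and Name never contain quotes, and that no continuation line happens to begin with two quoted fields. DirtyCsvReader and Parse are names invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DirtyCsvReader
{
    // A record start: two simple quoted fields, then the opening quote of the third.
    private static readonly Regex RecordStart =
        new Regex("^\"([^\"]*)\",\"([^\"]*)\",\"(.*)$");

    // Returns one string[3] per record: Id, Name, Description.
    // The header row is returned as the first record; callers can skip it.
    public static List<string[]> Parse(IEnumerable<string> lines)
    {
        var records = new List<string[]>();
        string[] current = null;

        foreach (string line in lines)
        {
            Match m = RecordStart.Match(line);
            if (m.Success)
            {
                // A new record begins, so the previous one is complete.
                if (current != null) records.Add(Finish(current));
                current = new[] { m.Groups[1].Value, m.Groups[2].Value, m.Groups[3].Value };
            }
            else if (current != null)
            {
                // Anything else is treated as a continuation of the description.
                current[2] += "\n" + line;
            }
        }

        if (current != null) records.Add(Finish(current));
        return records;
    }

    // Drops the closing quote (and trailing whitespace) from the description.
    private static string[] Finish(string[] record)
    {
        string description = record[2].TrimEnd();
        if (description.EndsWith("\"", StringComparison.Ordinal))
        {
            description = description.Substring(0, description.Length - 1);
        }
        record[2] = description;
        return record;
    }
}
```

Fed the lines of the sample file (for instance from File.ReadLines), this should give back the header plus three records, with the xdoc content of record 2 left exactly as it was in the file, inner quotes included.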

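If you don't mind doing some pre-processing first, you can change the first and second "," on each line to "|" and then use FileHelpers to parse the file normally with | as the delimiter (assuming you don't have "|" in the last column, where the HTML tags are).
The pre-processing could be something like this (a simple sketch; textFileLines is assumed to hold the lines of the file):
var sb = new StringBuilder();
var regex = new Regex("\",\"");
foreach (string line in textFileLines)
{
// Replace at most the first two "," separators on this line with "|".
// Caveat: a continuation line of a multi-line record is treated the same way.
sb.AppendLine(regex.Replace(line, "\"|\"", 2));
}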

I worked on the CSV-1203 File Format standard a few months ago, so the first thing to realise is that you're not dealing with a CSV file - even though it's named "xyz.CSV".
As others here have said, it will be easier to write your own reader; they're not too difficult. I too have a hatred of everything regex, but the good news is you can code any solution without ever using it.
A couple of things: There's a really weird thing Excel does to CSV files that begin with the two capital letters ID (without quotes). It thinks your CSV is a corrupted SYLK file! Try it.
For details of this issue and a detailed CSV File Format specification, please refer to http://mastpoint.curzonnassau.com/csv-1203

Related

character to use when splitting strings in visual c#?

Ok, I'm racking my brains over this one. It's pretty simple though (I think).
I'm currently creating a text file as a comma separated string of values.
Later, I read in that file data and then use the .split function to split the data by commas.
I discovered that sometimes one of the description fields in the data contains an embedded comma, which ends up throwing the split command off.
Is there any special character I could use that could pretty much guarantee wouldn't be in the data, or is there a better way to accomplish this? Thanks!
// Initial Load
fullString = fileName + "," + String.Join(",", fieldValues);
// Access later
String[] valuesArray = myString.Split(',');

Short answer: there's no "simple" way to do it using Split. The best you can hope for is to set the delimiter to something kooky that wouldn't ever get used (but even that's not a guarantee).
The simple method would be to use something like CsvHelper (get it through NuGet) or any of the other dozen or so packages that are designed for parsing CSV.

Best Method of standard string to XML legal string - C#

Currently my understanding of XML legal strings is that all that is required is to replace any instances of &, ", ', <, > with &amp; &quot; &apos; &lt; &gt;
So I made the following parser:
private static string ToXmlCompliantStr(string uriStr)
{
string uriXml = uriStr;
uriXml = uriXml.Replace("&", "&amp;");
uriXml = uriXml.Replace("\"", "&quot;");
uriXml = uriXml.Replace("'", "&apos;");
uriXml = uriXml.Replace("<", "&lt;");
uriXml = uriXml.Replace(">", "&gt;");
return uriXml;
}
I am aware that there are similar questions out there with good answers (which is how I was able to write this function). What I'm asking is whether this code will translate ANY string that C# can throw at it so that XDocument parses it as part of a whole document without complaint. All the questions I've found state that these are the only escape characters, not that replacing them yields a 100% valid XML string. I've gone as far as reading through the decompiled XNode class to see how it parses it.
Thanks

Firstly, you should absolutely not do this yourself. Use an XML API - that way you can trust that to do the right thing, rather than worrying about covering corner cases etc. You generally shouldn't be trying to come up with an "escaped string" at all - you should pass the string to the XElement constructor (or XAttribute, or whatever your situation is).
In other words, I think you should try really hard to design your application so that you don't need a method like the one in your question at all. Look at where you'd be using that method, and see whether you can just create an XElement (or whatever) instead. In my experience, you'll have a much better time if you treat XML as a data structure in itself rather than just as text.
Secondly, you need to understand that in XML 1.0 at least, there are Unicode characters that cannot be validly represented in XML, no matter how much escaping you use. In particular, values U+0000 to U+001F are unrepresentable other than U+0009 (tab), U+000A (line feed) and U+000D (carriage return). Also if you have a string which contains invalid UTF-16 (e.g. an unmatched half of a surrogate pair), that can't be correctly represented in XML.
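To make both points concrete, here is a small sketch. The class and method names (XmlTextSketch, ToElementXml, IsRepresentableInXml10) are invented for the example. The first method simply hands the raw string to XElement and lets LINQ to XML do the escaping; the second is one possible pre-check for the characters described above, built on XmlConvert.IsXmlChar with surrogates handled by hand.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlTextSketch
{
    // Passes the raw string to XElement; escaping happens when the
    // element is serialized, so no hand-written Replace calls are needed.
    public static string ToElementXml(string elementName, string value)
    {
        return new XElement(elementName, value).ToString();
    }

    // Returns false if the string contains a character that XML 1.0
    // cannot represent, or an unmatched half of a surrogate pair.
    public static bool IsRepresentableInXml10(string value)
    {
        for (int i = 0; i < value.Length; i++)
        {
            char c = value[i];
            if (char.IsHighSurrogate(c) && i + 1 < value.Length && char.IsLowSurrogate(value[i + 1]))
            {
                i++; // a well-formed surrogate pair; skip its second half
                continue;
            }
            if (char.IsSurrogate(c) || !XmlConvert.IsXmlChar(c))
            {
                return false;
            }
        }
        return true;
    }
}
```

With this, a value such as "a & b < c" goes in unescaped and comes out of ToElementXml with the ampersand and angle bracket written as entities, while a string containing U+0001 is reported as not representable before you ever try to serialize it.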

"Evaluate" a c# string

I am reading a C# source file.
When I encounter a string, I want to get its value.
For instance, in the following example:
public class MyClass
{
public MyClass()
{
string fileName = "C:\\Temp\\A Weird\"FileName";
}
}
I would like to retrieve
C:\Temp\A Weird"FileName
Is there an existing procedure to do that?
Coding a solution with all the possible cases should be quite tricky (#, escape sequences. ...).
I am convinced such procedure exists...
I would like to have the dual function too (to inject a string into a C# source file)
Thanks in advance.
Philippe
P.S:
I gave an example with a filename, but I look for a solution working for all kinds of strings.

I'm pretty sure you can use CodeDOM to read a C# code file and parse its elements. It generates a code tree, and then you can look for nodes representing strings.
http://www.codeproject.com/Articles/2502/C-CodeDOM-parser
Other CodeDom parsers:
http://www.codeproject.com/Articles/14383/An-Expression-Parser-for-CodeDom
NRefactory: https://github.com/icsharpcode/NRefactory and http://www.codeproject.com/Articles/408663/Using-NRefactory-for-analyzing-Csharp-code

There is a way of extracting these strings using a regular expression:
("(\\"|[^"])*")
This particular one works on your simple example and gives the filename (complete with leading and trailing quote characters); whether it would work on more complex ones I can't easily tell unfortunately.
For clarity, (\\"|[^"]) matches any character apart from ", except where it has a leading \ character.

Just use the regex ".*" to match all string values, then remove the leading and trailing inverted commas and unescape the result.
This will allow \" and "" sequences inside your string,
so both "C:\\Temp\\A Weird\"FileName" and "Hello ""World""" will match
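A rough sketch of the "match, strip the quotes, unescape" idea for regular (non-verbatim) literals follows. The pattern is a slightly stricter variant of the ones above: it consumes every backslash together with the character after it, so a literal ending in \\ is not mistaken for one ending in an escaped quote. Regex.Unescape is used as an approximation of C# unescaping; it is not identical (for example \U and variable-length \x escapes are handled differently), and verbatim @"..." strings, comments and char literals are not handled at all. StringLiteralFinder and ExtractValues are made-up names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StringLiteralFinder
{
    // A quote, then any run of "backslash plus one character" or
    // "anything except a quote or backslash", then the closing quote.
    private static readonly Regex RegularLiteral =
        new Regex("\"((?:\\\\.|[^\"\\\\])*)\"");

    // Returns the unescaped contents of each regular string literal found.
    public static List<string> ExtractValues(string source)
    {
        var values = new List<string>();
        foreach (Match m in RegularLiteral.Matches(source))
        {
            // Group 1 is the literal without its surrounding quotes.
            values.Add(Regex.Unescape(m.Groups[1].Value));
        }
        return values;
    }
}
```

Run over the source text of the example class in the question, this should return a single value, C:\Temp\A Weird"FileName. For anything beyond simple cases a real C# parser, as suggested in the other answer, is the safer route.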

CSV + FileHelpers + Double Quotes = Nightmare

I can't seem to handle a CSV I got. It's a file generated by a bank, which looks like this:
"000,""PLN"",""XYZ"",""2011-08-31"",""2011-08-31"",""0,00"""
1,""E"",""2011-08-30"",""2011-08-31"",""2011-08-31"",""399,00"",""0000103817846977"",""UZNANIE OTRZYMANE ELIXIR"",""23103015080000000550217023"",""XXX"",""POLISA UBEZPIECZENIA NR XXX "",""000""
3,""E"",""2011-08-31"",""2011-08-31"",""2011-08-31"",""1433,00"",""0000154450232753"",""UZNANIE OTRZYMANE ELIXIR"",""000"",""XXX"",""POLISA UBEZPIECZENIA XXX "",""000""
(I changed all sensitive information).
I've been trying to parse it since morning, but with no luck. I used the LINQ to CSV example found somewhere on the net and the CodeProject one (both threw an error saying the CSV is corrupted), and I ended up with FileHelpers, which SEEMS to work BUT:
It splits the "399,00" and similar values into two fields.
When I use the [FieldQuoted()] attribute it all goes to hell, since all the fields are quoted in DOUBLE quotation marks. I suspect that is the reason why the other parsers wouldn't work.
Any ideas how to handle it?

If the problem seems to be the doubled quotes, you could preprocess each line, replacing each pair of double quotes with a single double quote:
line = line.Replace( "\"\"", "\"" );
Once the whole file has been processed, you can hand it to any other CSV processor.
It will probably be easier to write your own, anyway.

I have been using Lumen, CommonLibrary, FileHelpers etc. and I ended up with TextFieldParser class (from Visual Basic namespace, but can be used in C# without any problem). I recommend you try that. The only downside is that it's relatively slow. But it seems to cope with edge cases quite well.
I even invented a trick to get it to work with obviously invalid CSV files (""" etc.; OpenOffice Calc couldn't handle them properly): when I encountered such a line and got a MalformedLineException, I'd parse it again within the catch block, this time with the HasFieldsEnclosedInQuotes property set to false.
It would split the line properly, just leaving all the values wrapped in double quotes. All I had to do then was remove these double quotes "manually".
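Roughly, that trick could look like the sketch below. LenientCsv, ReadAll and SplitIgnoringQuotes are names invented for the example; the TextFieldParser members used (SetDelimiters, HasFieldsEnclosedInQuotes, ReadFields, ErrorLine) are the real ones. One caveat worth stating up front: in the fallback mode a comma inside a value (such as 399,00) still splits the value in two, so whether the result is acceptable depends on your data.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class LenientCsv
{
    // Parses comma-delimited text; malformed lines are re-split with
    // quote handling switched off instead of aborting the whole read.
    public static List<string[]> ReadAll(TextReader input)
    {
        var rows = new List<string[]>();
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            while (!parser.EndOfData)
            {
                try
                {
                    rows.Add(parser.ReadFields());
                }
                catch (MalformedLineException)
                {
                    // ErrorLine holds the text of the line that failed.
                    rows.Add(SplitIgnoringQuotes(parser.ErrorLine));
                }
            }
        }
        return rows;
    }

    // Splits on commas only, then strips the leftover quote characters.
    private static string[] SplitIgnoringQuotes(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = false;

            string[] fields = parser.ReadFields();
            for (int i = 0; i < fields.Length; i++)
            {
                fields[i] = fields[i].Trim('"');
            }
            return fields;
        }
    }
}
```

You would call it with something like LenientCsv.ReadAll(new StreamReader(path)). On .NET Framework this needs a reference to the Microsoft.VisualBasic assembly.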

Parsing csv file with commas and quotes as delimiters

So I'm reading a CSV file and splitting each line with "," as the delimiter,
but some fields are quoted so that they are not split, because they contain a comma.
1530,Pasadena CA,"2008, 05/01","2005, 12/14"
with just comma it would be:
1530
Pasadena CA
"2008
05/01"
"2005
12/14"
I need it to take the quoted commas into consideration when splitting, so it's like this:
1530
Pasadena CA
"2008 05/01"
"2005 12/14"

Take a look at this page for a library that offers quick and easy CSV reading.

While it still may be a new reference, there is a class within the Visual Basic assemblies that should handle this well. At least then you know it's a part of the framework. You can find details here: http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx
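As a minimal sketch of using that class on a single line (QuotedCsvLine and Split are names made up for the example):

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedCsvLine
{
    // Splits one CSV line on commas, keeping commas that sit inside
    // a quoted field as part of that field.
    public static string[] Split(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields();
        }
    }
}
```

For the sample line this gives four fields: 1530, Pasadena CA, and the two dates with their commas intact and the surrounding quotes removed. If you don't want the comma in the output, you can strip it from each field afterwards.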
