I am currently using the FileHelpers library (v2.0.0.0) to parse a CSV file. The CSV file is mapped to a class that has a handful of public properties, let's say there are N. The problem is that, by default, FileHelpers doesn't seem to correctly handle cases where the user specifies a CSV file that has more than N-1 commas. The remaining commas just get appended to the last property value.
I figured this must be configurable via FileHelpers' attributes, but I didn't see anything that would ignore fields that don't have a matching property in the record.
I looked into the RecordConditions, but something like ExcludeIfEnds(",") looks like it will skip the line entirely if it ends with a comma, whereas I just want the trailing commas stripped.
It's possible that my only recourse is to pre-process the file and strip any trailing commas, which is totally fine, but I wanted to know if FileHelpers can do this as well, and perhaps I'm just not seeing it in the docs.
Just an idea for a hack / workaround: you could create a property called "ExtraCommas" and add it to your class, so that extra commas are serialized there and not in the real properties of your object...
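A rough sketch of that workaround, relying on the behaviour described in the question (the extra delimiters get appended to the last field). The class and field names are illustrative; [FieldOptional] marks the catch-all field so that well-formed rows still parse:
[DelimitedRecord(",")]
public class MyRecord
{
    public string Field1;
    public string Field2;
    // ...the rest of your N real fields...

    [FieldOptional]            // well-formed rows won't have this field
    public string ExtraCommas; // soaks up anything past the real fields
}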
If the number of commas varies, I think you are out of luck and will have to do post-processing. However, if there is a fixed number of extra commas, you can add blank fields to your class:
[FieldOrder(5)]
public string Blank1;

[FieldOrder(6)]
public string Blank2;
This doesn't really ever bite me, because I don't use a FileHelpers class as a business class; I use it as an object to build the business class from, and I store it for auditing. I think at one point I played around with making the Blank fields private; I'm not sure how that turned out.
Here is a custom method you can use. It might not be the best solution and the code could certainly be optimized, but it should solve the trailing-comma problem and give you the idea of how to get around it.
static void Main()
{
    // Read each line, strip any trailing commas, then hand the
    // cleaned-up line on to the parser.
    using (var sr = new StreamReader(@"C:\Users\musab.shaheed\Desktop\csv.csv"))
    {
        string fileText;
        while ((fileText = sr.ReadLine()) != null)
        {
            fileText = fileText.TrimEnd(',');
            // store your data in here
            Console.WriteLine(fileText);
        }
    }
}
Ok, I'm racking my brains over this one. It's pretty simple though (I think).
I'm currently creating a text file as a comma separated string of values.
Later, I read in that file data and then use the .split function to split the data by commas.
I discovered that sometimes one of the description fields in the data contains an embedded comma, which ends up throwing the split command off.
Is there any special character I could use that could pretty much guarantee wouldn't be in the data, or is there a better way to accomplish this? Thanks!
// Initial load
fullString = fileName + "," + String.Join(",", fieldValues);

// Access later (myString is the line read back from the file)
String[] valuesArray = myString.Split(',');
Short answer: there's no "simple" way to do it using Split. The best you can hope for is to set the delimiter to something kooky that would never get used (and even that's not a guarantee).
The simple method would be to use something like CsvHelper (get it through NuGet) or any of the other dozen or so packages that are designed for parsing CSV.
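For example, with CsvHelper the read side looks roughly like this (a sketch; the exact constructor has varied between versions, and MyRecord stands in for your own type):
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;

using (var reader = new StreamReader("data.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    // CsvHelper quotes fields on write and honours quotes on read,
    // so embedded commas survive the round trip.
    var records = csv.GetRecords<MyRecord>().ToList();
}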
We have an integration with another system that relies on passing CSV files back and forth (really old school).
The structure is generally:
ID, Name, PhoneNumber, comments, fathersname
1, tom, 555-1234, just some random text, bill
2, jill smith, 555-4234, other random text, richard
Every so often we see this:
3, jacked up, 999-1231, here
be dragons
amongst us, ted
The primary problem I care about is detecting that a line break (\n) occurs in the middle of a record, given that the line break is also the record terminator.
Is there any way I can preprocess this to reliably fix it?
Note that we have zero control over what the other system emits.
So you should be able to do something more or less like this:
for (int i = 0; i < lines.Count; i++)
{
    var fields = lines[i].Split(',').ToList();
    while (fields.Count < numFields) // here be dragons amongst us
    {
        i++; // include the next line in this record
        if (i >= lines.Count)
            break; // make sure we haven't run out of lines
        // Combine the end of the previous field with the start of
        // the next one, and add the line break back in.
        var innerFields = lines[i].Split(',');
        fields[fields.Count - 1] += "\n" + innerFields[0];
        fields.AddRange(innerFields.Skip(1));
    }
    // We now know we have a "real" full line.
    processFields(fields);
}
(For simplicity I assumed all lines were read in at the start; I assume you could alter it to lazily fetch each line easily enough.)
Let me start by saying that the CSV file in your example is invalid: if a line break occurs inside a string, the string should be wrapped in double quote characters.
Now for the answer. In order to parse this invalid CSV format you must make several assumptions. In this case I made two: 1) the ID column must be numeric; 2) the comment field cannot contain digits.
Based on these assumptions, you can check the first character after the line break. If it is a digit, you assume it starts a new record; if not, you treat it as a continuation of the comment field.
I don't know whether the second assumption is valid; if not, you can enhance the logic to cover the business rules of the system.
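A minimal sketch of that heuristic (the RepairLines name is just for illustration; it assumes using System.Collections.Generic):
static IEnumerable<string> RepairLines(IEnumerable<string> rawLines)
{
    string current = null;
    foreach (var line in rawLines)
    {
        // Assumption 1: a real record starts with a numeric ID.
        var firstField = line.Split(',')[0].Trim();
        if (int.TryParse(firstField, out _))
        {
            if (current != null)
                yield return current; // previous record is complete
            current = line;
        }
        else if (current != null)
        {
            // Assumption 2: anything else continues the comment field.
            current += " " + line;
        }
    }
    if (current != null)
        yield return current;
}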
Good Luck!
Firstly, I would recommend using a tool to manage reading and writing your CSV files; I use the FileHelpers library, which is great.
You can essentially type your records and it will do all the validation and such for you. Worth the effort.
To your question: perhaps you can do some preprocessing on the file and use a Regex to replace the offending line breaks with a space?
I do something similar (though not with files); try
line = line.Replace(Environment.NewLine, " "); // Replace returns a new string
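Note that a blanket replace like that would also merge the line breaks that legitimately terminate records. A more targeted sketch, assuming (as in the answer above) that a real record starts with a numeric ID, collapses only the line breaks that aren't followed by one; the pattern is an assumption about this particular feed, and the file names are placeholders:
using System.IO;
using System.Text.RegularExpressions;

var text = File.ReadAllText("feed.csv");
// Replace any line break NOT followed by a numeric ID and a comma.
text = Regex.Replace(text, @"\r?\n(?!\d+,)", " ");
File.WriteAllText("feed.fixed.csv", text);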
With FileHelpers you could write a custom converter to do this during processing, or hook into the BeforeRead event.
I have a CSV file that is being exported from another system in which the column order and definitions may change. I have found that FileHelpers is perfect for reading CSV files, but it seems you cannot use it unless you know the ordering of the columns before compiling the application. I want to know if it's at all possible to use FileHelpers in a non-typed way. Currently I am using it to read the file, but everything else I am doing by hand, so I have a class:
[DelimitedRecord(",")]
public class CSVRow
{
public string Content { get; set; }
}
This means each row ends up in Content, which is fine, as I then split the row myself. However, I am now having issues with this method because of commas embedded within the fields, so a line might be:
"something",,,,0,,1,,"something else","","",,,"something, else"
My simple split on commas doesn't work on this string, because the comma inside `"something, else"` gets split on too. Obviously this is where something like FileHelpers comes in really handy, parsing these values and taking the quote marks into consideration. So is it possible to use FileHelpers in this way, without a known column definition, or at least to pass it a CSV string and get a list of values back? Or is there any good library that does this?
You can use FileHelpers' RunTime records if you know (or can deduce) the order and definitions of the columns at runtime.
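Runtime records are built with DelimitedClassBuilder. A minimal sketch, assuming you can discover the column names at runtime, say from a header row (headerColumns is hypothetical):
using System.Data;
using FileHelpers;
using FileHelpers.RunTime; // FileHelpers.Dynamic in later versions

var cb = new DelimitedClassBuilder("ImportRecord", ",");
cb.IgnoreFirstLines = 1; // skip the header row itself
foreach (var columnName in headerColumns) // discovered at runtime
    cb.AddField(columnName, typeof(string));

var engine = new FileHelperEngine(cb.CreateRecordClass());
DataTable data = engine.ReadFileAsDT("import.csv");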
Otherwise, there are lots of questions about CSV libraries, e.g. Reading CSV files in C#
I can't seem to handle a CSV I got. It's a file generated by a bank, which looks like this:
"000,""PLN"",""XYZ"",""2011-08-31"",""2011-08-31"",""0,00"""
1,""E"",""2011-08-30"",""2011-08-31"",""2011-08-31"",""399,00"",""0000103817846977"",""UZNANIE OTRZYMANE ELIXIR"",""23103015080000000550217023"",""XXX"",""POLISA UBEZPIECZENIA NR XXX "",""000""
3,""E"",""2011-08-31"",""2011-08-31"",""2011-08-31"",""1433,00"",""0000154450232753"",""UZNANIE OTRZYMANE ELIXIR"",""000"",""XXX"",""POLISA UBEZPIECZENIA XXX "",""000""
(I changed all sensitive information).
I've been trying to parse it since morning, with no luck. I used the LINQ to CSV example found somewhere on the net and the CodeProject one (both threw an error saying the CSV is corrupted), and I ended up with FileHelpers, which SEEMS to work, BUT:
It splits the "399,00" and similar values into two fields.
When I use the [FieldQuoted()] attribute it all goes to hell, since all the fields are wrapped in DOUBLED quotation marks. I suspect that is also the reason why the other parsers wouldn't work.
Any ideas how to handle it?
If the problem seems to be the doubled quotes, you could preprocess each line by replacing the double double quotes with single double quotes:
line = line.Replace( "\"\"", "\"" );
Once the whole file has been processed, you can let it be handled by any other CSV processor.
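For example, a whole-file version of that substitution (file names are placeholders):
using System.IO;
using System.Linq;

var cleaned = File.ReadLines("bank.csv")
    .Select(line => line.Replace("\"\"", "\""));
File.WriteAllLines("bank.cleaned.csv", cleaned);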
It will probably be easier to write your own, anyway.
I have been using Lumen, CommonLibrary, FileHelpers, etc., and I ended up with the TextFieldParser class (from the Visual Basic namespace, but it can be used from C# without any problem). I recommend you try it. The only downside is that it's relatively slow, but it seems to cope with edge cases quite well.
I even invented a trick to get it to work with obviously invalid CSV files (""" and the like; OpenOffice Calc couldn't handle them properly): when I encountered such a line and got a MalformedLineException, I would still parse it within the catch block, this time with the HasFieldsEnclosedInQuotes property set to false.
It would split the line properly, just leaving all the values wrapped in double quotes. All I had to do then was remove those quotes "manually".
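A sketch of that catch-block trick, assuming parser is an open TextFieldParser; its ErrorLine property holds the line that just failed, so one way to re-parse it is through a StringReader with quote handling turned off:
using System.IO;   // StringReader
using System.Linq; // Select
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

string[] fields;
try
{
    fields = parser.ReadFields();
}
catch (MalformedLineException)
{
    using (var retry = new TextFieldParser(new StringReader(parser.ErrorLine)))
    {
        retry.TextFieldType = FieldType.Delimited;
        retry.SetDelimiters(",");
        retry.HasFieldsEnclosedInQuotes = false;
        fields = retry.ReadFields();
    }
    // The values are still wrapped in doubled quotes; strip them "manually".
    fields = fields.Select(f => f.Trim('"')).ToArray();
}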
I've got a text file full of records where each field in each record is a fixed width. My first approach would be to parse each record simply using string.Substring(). Is there a better way?
For example, the format could be described as:
<Field1(8)><Field2(16)><Field3(12)>
And an example file with two records could look like:
SomeData0000000000123456SomeMoreData
Data2 0000000000555555MoreData
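For illustration, the Substring version I have in mind is along these lines (offsets taken from the format above):
// <Field1(8)><Field2(16)><Field3(12)>
string field1 = record.Substring(0, 8);
string field2 = record.Substring(8, 16);
string field3 = record.Substring(24, 12);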
I just want to make sure I'm not overlooking a more elegant way than Substring().
Update: I ultimately went with a regex like Killersponge suggested:
private readonly Regex reLot = new Regex(REGEX_LOT, RegexOptions.Compiled);
const string REGEX_LOT = "^(?<Field1>.{6})" +
"(?<Field2>.{16})" +
"(?<Field3>.{12})";
I then use the following to access the fields:
Match match = reLot.Match(record);
string field1 = match.Groups["Field1"].Value;
Use FileHelpers.
Example:
[FixedLengthRecord()]
public class MyData
{
[FieldFixedLength(8)]
public string someData;
[FieldFixedLength(16)]
public int SomeNumber;
[FieldFixedLength(12)]
[FieldTrim(TrimMode.Right)]
public string someMoreData;
}
Then, it's as simple as this:
var engine = new FileHelperEngine<MyData>();
// To Read Use:
var res = engine.ReadFile("FileIn.txt");
// To Write Use:
engine.WriteFile("FileOut.txt", res);
Substring sounds good to me. The only downside I can immediately think of is that it means copying the data each time, but I wouldn't worry about that until you prove it's a bottleneck. Substring is simple :)
You could use a regex to match a whole record at a time and capture the fields, but I think that would be overkill.
Why reinvent the wheel? Use .NET's TextFieldParser class per this how-to for Visual Basic: How to read from fixed-width text files.
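A minimal sketch in fixed-width mode, using the widths from the question (the file name is a placeholder):
using Microsoft.VisualBasic.FileIO; // reference Microsoft.VisualBasic.dll

using (var parser = new TextFieldParser("records.txt"))
{
    parser.TextFieldType = FieldType.FixedWidth;
    parser.SetFieldWidths(8, 16, 12);
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        // fields[0]..fields[2] map to Field1..Field3
    }
}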
You may have to watch out: if the ends of the lines aren't padded with spaces to fill the last field, your Substring won't work without a bit of fiddling to work out how much of the line is left to read. This of course only applies to the last field :)
Unfortunately out of the box the CLR only provides Substring for this.
Someone over at CodeProject made a custom parser using attributes to define fields; you might want to look at that.
Nope, Substring is fine. That's what it's for.
You could set up an ODBC data source for the fixed format file, and then access it as any other database table.
This has the added advantage that specific knowledge of the file format is not compiled into your code for that fateful day that someone decides to stick an extra field in the middle.
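A hedged sketch of that setup with the Microsoft Text Driver: the layout lives in a schema.ini next to the data file, so a new field means editing a text file rather than recompiling. Driver availability, paths, and file names here are assumptions about your environment:
using System;
using System.Data.Odbc;

// schema.ini, placed next to records.txt:
//   [records.txt]
//   Format=FixedLength
//   ColNameHeader=False
//   Col1=Field1 Text Width 8
//   Col2=Field2 Text Width 16
//   Col3=Field3 Text Width 12
var connStr = @"Driver={Microsoft Text Driver (*.txt; *.csv)};Dbq=C:\data;";
using (var conn = new OdbcConnection(connStr))
using (var cmd = new OdbcCommand("SELECT * FROM [records.txt]", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
        while (reader.Read())
            Console.WriteLine(reader["Field1"]);
}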