I am working with files that range between 150MB and 250MB, and I need to append a form feed (\f) character to each match found in a match collection. Currently, my regular expression for each match is this:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f", RegexOptions.Singleline);
Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.
I see a ton of examples on stack overflow for replacing text, but not so much for larger files. Typical examples of what to do would include:
1. Using StreamReader to read the entire file into a string, then doing a find and replace in that string.
2. Using MatchCollection in combination with File.ReadAllText().
3. Reading the file line by line and looking for matches there.
The problem with the first two is that they just eat up a ton of memory, and I worry about the program being able to handle all of that. The problem with the third option is that my regular expression spans many lines, and thus will not be found in a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.
What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?
Edit:
Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, see if it matched my expression, and then based on that result I would write to a file. If it matched the expression, I would write to the file. If it didn't match the expression, I would just append it to a string until it did match the expression. Like this:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";
//For keep track of trailing bits after our match.
int matchlength = 0;
using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
//While we are still reading lines in the file...
while ((line = sr.ReadLine()) != null)
{
//Keep adding lines to buildingmatch until we can match the regular expression.
buildingmatch = buildingmatch + line + "\r\n";
if (myreg.IsMatch(buildingmatch))
{
match = myreg.Match(buildingmatch).Value;
matchlength = match.Length;
//Make sure we are not at the end of the file.
if (matchlength < buildingmatch.Length)
{
whatremains = buildingmatch.Substring(matchlength, buildingmatch.Length - matchlength);
}
sw.Write(match + "\f\f");
buildingmatch = whatremains;
whatremains = "";
}
}
}
The problem is that this took about 55 minutes to run a roughly 150MB file. There HAS to be a better way to do this...
If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:
string text = File.ReadAllText(srcFile);
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
sw.Write(myregex.Replace(text, "$&\f\f"));
}
Details:
string text = File.ReadAllText(srcFile); - reads the srcFile file to the text variable (match would be confusing)
myregex.Replace(text, "$&\f\f") - replaces all occurrences of myregex matches with themselves ($& is a backreference to the whole match value) while appending two \f chars right after each match.
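To see what $& does in isolation, here is a minimal sketch on a tiny in-memory string. The shortened pattern, the class name FormFeedAppender and the sample text are made up for illustration; only Regex.Replace and the $& substitution come from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FormFeedAppender
{
    // Hypothetical, shortened pattern: a record starts at "ABC:" and
    // ends at the first form feed that follows it.
    private static readonly Regex RecordPattern =
        new Regex("ABC:(.*?)\f", RegexOptions.Singleline);

    // Returns the input with one extra "\f" appended after every match.
    // "$&" stands for the entire matched text, so the match itself is kept
    // and any text between matches is left untouched.
    public static string AppendFormFeed(string text)
    {
        return RecordPattern.Replace(text, "$&\f");
    }
}
```

Because Replace only rewrites the matched spans, text that sits between two records survives unchanged, which is not the case if you collect only the match values.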
I was able to find a solution that works in a reasonable time; it can process my entire 150MB file in under 5 minutes.
First, as mentioned in the comments, it's a waste to compare the string to the Regex after every iteration. Rather, I started with this:
string match = File.ReadAllText(srcFile);
MatchCollection mymatches = myregex.Matches(match);
Strings can hold up to 2GB of data, so while not ideal, I figured roughly 150MB worth wouldn't hurt to be stored in a string. Then, as opposed to checking a match every x amount of lines read in from the file, I can check the file for matches all at once!
Next, I used this:
StringBuilder matchsb = new StringBuilder(134217728);
foreach (Match m in mymatches)
{
matchsb.Append(m.Value + "\f\f");
}
Since I already know (roughly) the size of my file, I can go ahead and initialize my stringbuilder. Not to mention, it's a lot more efficient to use string builder if you are doing multiple operations on a string (which I was). From there, it's just a matter of appending the form feed to each of my matches.
Finally, the part that cost the most in performance:
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
sw.Write(matchsb.ToString());
}
The way that you initialize StreamWriter is critical. Normally, you just declare it as:
StreamWriter sw = new StreamWriter(destfile);
This is fine for most use cases, but the problem becomes apparent when you are dealing with larger files. When declared like this, you are writing to the file with a default buffer of 4KB. For a smaller file, this is fine. But for 150MB files? This will end up taking a long time. So I corrected the issue by changing the buffer to approximately 5MB.
I found this resource really helped me to understand how to write to files more efficiently: https://www.jeremyshanks.com/fastest-way-to-write-text-files-to-disk-in-c/
Hopefully this will help the next person along as well.
So in my program the user can choose a file with an OpenFileDialog and if he wants to save the file with a SaveFileDialog, the columns and rows of the csv file should change. For this I have already tried this
SaveFileDialog:
List<string> liste = new List<string>();
// Build the header row into the output:
liste.Add(String.Join(',', Enum.GetNames<CsvColumn>()));
CultureInfo ci = new CultureInfo("de-DE"); // necessary only when running the code from other cultures.
SaveFileDialog dialog = new SaveFileDialog();
dialog.Filter = "CSV (*.csv)|*.csv|All files (*.*)|*.*";
if (dialog.ShowDialog() == true)
{
string line;
// Read the file and display it line by line.
try
{
System.IO.StreamReader file = new System.IO.StreamReader(path);
while ((line = file.ReadLine()) != null)
{
var cellArray = Regex.Split(line, @"[\t,](?=(?:[^\""]|\""[^\""]*\"")*$)")
.Select(s => Regex.Replace(s.Replace("\"\"", "\""), "^\"|\"$", "")).ToArray();
// Check value of Betrag, only operate on it if there is a decimal value there
if (decimal.TryParse(cellArray[(int)CsvColumn.Betrag], NumberStyles.Any, ci, out decimal betrag))
{
if (betrag >= 0)
{
cellArray[(int)CsvColumn.Soll] = "42590";
cellArray[(int)CsvColumn.Haben] = "441206";
}
else
{
cellArray[(int)CsvColumn.Soll] = "441206";
cellArray[(int)CsvColumn.Haben] = "42590";
}
// Assuming we only write to the purple field when the green field was a decimal:
cellArray[(int)CsvColumn.Belegnummer] = "a dummy text";
}
// Make sure you escape the columns that have a comma
liste.Add(string.Join(",", cellArray.Select(x => x.Contains(',') ? $"\"{x}\"" : x)) + "\n");
}
File.WriteAllLines(dialog.FileName, liste);
file.Close();
}
catch
{
MessageBox.Show("Der Gewählte Prozess wird bereits von einem anderen verwendet,\n " +
" bitte versuchen sie es erneut");
}
}
With that I can now change the header, but I also want to fill in the fields that I marked with colors in my screenshot of the file.
Operations I want to perform:
When the green field is positive, 42590 should be in the blue field and 441206 in the orange field.
If the green value is negative, then 441206 should be in the blue field and 42590 in the orange field.
In the purple field, a dummy text should simply be written automatically.
So how do I use my C# code to fill in the fields that I have marked?
EDIT
An Example from my Input CSV File in Text Format:
Datum;Wertstellung;Kategorie;Name;Verwendungszweck;Konto;Bank;Betrag;Währung
31.10.2019;01.11.2019;;Some Text;;;;-42,89;EUR
31.10.2019;01.11.2019;;Some Text;;;;-236,98;EUR
31.10.2019;31.10.2019;;Some Text;;;;-200;EUR
30.10.2019;29.10.2019;;Some Text;;;;204,1;EUR
30.10.2019;31.10.2019;;Some Text;;;;-646,98;EUR
The task itself is pretty simple, but your attempt shows many external influences and almost no documentation. This leads to many comments to your post, but a best-practise answer really needs to address many of the smaller elements that you have so far overlooked. You already have the file management sorted out, so I'll try to focus on the array logic.
Make sure you have run and debugged your code before posting; the output from the initial post has a few quirks:
Your input file uses a semi-colon, so you need to split the line by ONLY THAT CHARACTER in your regular expression:
var cellArray = Regex.Split(line, @"[;](?=(?:[^\""]|\""[^\""]*\"")*$)")
.Select(s => Regex.Replace(s.Replace("\"\"", "\""), "^\"|\"$", "")).ToArray();
You can't split the string by multiple delimiters at the same time, because only values that contain the file-specific delimiter will be quote-escaped.
This line does nothing; it looks like a leftover from a previous attempt. .Split() and .ToArray() return new values rather than modifying the source value, and since you are not using the result of this line of code, just remove it:
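One way to make the delimiter explicit is to pass it in rather than hard-coding it in the pattern. The following is only a sketch under that assumption; the class and method names (DelimitedLine.Split) are hypothetical, and the lookahead is the same one used in the snippet above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DelimitedLine
{
    // Splits one line on the given delimiter, ignoring delimiters that sit
    // inside double-quoted cells, then strips the surrounding quotes and
    // turns doubled quotes back into single ones.
    public static string[] Split(string line, char delimiter)
    {
        // Regex.Escape guards against delimiters such as '|' that are
        // regex metacharacters.
        string pattern = Regex.Escape(delimiter.ToString())
                         + "(?=(?:[^\"]|\"[^\"]*\")*$)";

        return Regex.Split(line, pattern)
            .Select(cell => Regex.Replace(cell, "^\"|\"$", "").Replace("\"\"", "\""))
            .ToArray();
    }
}
```

With the sample input from the question, DelimitedLine.Split(line, ';') yields nine cells and leaves the decimal comma in the Betrag cell alone.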
//line.Split(new char[] { '\t' }).ToArray();
The header row is being written into the first cell of the first row, while it may look like it works, I challenge you to explain the intent. You have also used a semicolon as the delimiter, even though the rest of your output is using comma, so this is fixed too. You will also find it far simpler to write this header row first, before we even read the input file:
List<String> liste = new List<string>();
// Build the header row into the output:
liste.Add("Belegdatum,Buchungsdatum,Belegnummer,Buchungstext,Verwendungszweck,Soll,Haben,Betrag,Währung");
With the German decimal separator being a comma, you will also need to escape the Betrag decimal value in the output:
liste.Add(string.Join(",", cellArray.Select(x => x.Contains(',') ? $"\"{x}\"" : x)) + "\n");
Alternatively, you could use a semi-colon like your input data however it is still good practise to test for and escape the values that might contain the delimiter character.
Do you really want the additional line break in the output?
It is not necessary to append each line with the "\n" line feed character because you are later using WriteAllLines(). This method accepts an array of lines and will inject the line break between each line for you. In file processing like this it is only necessary to manually include the line feed if you were storing the output as a single string variable and perhaps later using WriteAllText() to write the final output to file.
This is often not clear when referencing different guidance material on text file manipulations, be aware of this if you copy one technique from an article that maintains an array of the lines, and a separate example that uses a single string variable or StringBuilder or StringWriter approaches.
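The difference is easy to check for yourself. The sketch below (the class name LineBreakDemo and the two sample lines are made up for illustration) writes the same lines once with File.WriteAllLines and once with File.WriteAllText, then returns the raw contents of both files.

```csharp
using System;
using System.IO;

public static class LineBreakDemo
{
    // Writes the same two lines both ways and returns the raw file
    // contents so they can be compared.
    public static (string viaLines, string viaText) WriteBothWays()
    {
        string[] lines = { "a,b", "c,d" };
        string linesPath = Path.GetTempFileName();
        string textPath = Path.GetTempFileName();
        try
        {
            // WriteAllLines adds a line terminator after every element.
            File.WriteAllLines(linesPath, lines);

            // WriteAllText writes the string verbatim, so the separators
            // have to be added by hand.
            File.WriteAllText(textPath, string.Join(Environment.NewLine, lines));

            return (File.ReadAllText(linesPath), File.ReadAllText(textPath));
        }
        finally
        {
            File.Delete(linesPath);
            File.Delete(textPath);
        }
    }
}
```

If each element passed to WriteAllLines already ended in "\n", every row in the output would be followed by an extra empty line.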
The line from above now becomes this, note the trailing \n has been removed:
liste.Add(string.Join(",", cellArray.Select(x => x.Contains(',') ? $"\"{x}\"" : x)));
tldr; - Show me the codez!
Simple Forward Processing with Enum for Cell References
Object Oriented Solution
A Simple forward processing approach
It makes for light-weight code but complex logic can be much harder to read, however as you parse each line into the array, you can simply manipulate the values based on your rules. We can refer to this as sequential, in-line or forward processing because we read the input, process and prepare the output one row at a time.
List<string> liste = new List<string>();
// Build the header row into the output:
liste.Add("Belegdatum,Buchungsdatum,Belegnummer,Buchungstext,Verwendungszweck,Soll,Haben,Betrag,Währung");
CultureInfo ci = new CultureInfo("de-DE"); // necessary only when running the code from other cultures.
SaveFileDialog dialog = new SaveFileDialog();
dialog.Filter = "CSV (*.csv)|*.csv|All files (*.*)|*.*";
if (dialog.ShowDialog() == true)
{
string line;
// Read the file and display it line by line.
try
{
System.IO.StreamReader file = new System.IO.StreamReader(path);
int counter = 0;
while ((line = file.ReadLine()) != null)
{
counter++;
var cellArray = Regex.Split(line, @"[;](?=(?:[^\""]|\""[^\""]*\"")*$)")
.Select(s => Regex.Replace(s.Replace("\"\"", "\""), "^\"|\"$", "")).ToArray();
// Skip lines that fail for any reason
try
{
// Check value of Betrag, only operate on it if there is a decimal value there
if (decimal.TryParse(cellArray[7], NumberStyles.Any, ci, out decimal betrag))
{
if (betrag >= 0)
{
cellArray[5] = "42590";
cellArray[6] = "441206";
}
else
{
cellArray[5] = "441206";
cellArray[6] = "42590";
}
// Assuming we only write to the purple field when the green field was a decimal:
cellArray[2] = "a dummy text";
}
else
{
// Skip lines where the Betrag is not a decimal
// this will cover the case when or if the first line is the header.
continue;
}
}
catch(Exception ex)
{
// Construct a message box to help the user resolve the issue.
// You can use the MessageBox API to allow the user to cancel the process if you want to extend this.
// or remove the message altogether if you want it to silently skip the erroneous rows.
MessageBox.Show("Fehler beim Analysieren der Eingabezeile,\n" +
$"{ex.Message}\n\n " +
$"{counter}:> {line} \n " +
$"{new String(' ', counter.ToString().Length)} - {cellArray.Length} Cells\n " +
$"|{String.Join("|", cellArray)}|\n " +
"\n " +
" Zeile wird verworfen, weiter!");
continue; // advance to the next iteration of the while loop.
}
// Make sure you escape the columns that have a comma
liste.Add(string.Join(",", cellArray.Select(x => x.Contains(',') ? $"\"{x}\"" : x)));
}
File.WriteAllLines(dialog.FileName, liste);
file.Close();
}
catch
{
MessageBox.Show("Der Gewählte Prozess wird bereits von einem anderen verwendet,\n " +
" bitte versuchen sie es erneut");
}
}
Use Named Constants
If you are trying to avoid an OO approach, you can make the code easier to read by introducing some constants to refer to the indexes. There are many variations of this, but making the code more human-readable will assist in future maintenance and understanding of the code.
Define the constants. I recommend doing this inside a static class definition to group these values together, rather than just defining them as local or instance variables.
An enum is another way to do this if you simply need to map strings to ints, or just want to give integer values a name.
public enum CsvColumn
{
Belegdatum = 0,
Buchungsdatum = 1,
Belegnummer = 2,
Buchungstext = 3,
Verwendungszweck = 4,
Soll = 5,
Haben = 6,
Betrag = 7,
Währung = 8
}
Enums have the added benefit of simple commands to retrieve all the names of the columns, so now we can use this to build the header line AND as the index references in the code:
List<string> liste = new List<string>();
// Build the header row into the output:
liste.Add(String.Join(',', Enum.GetNames<CsvColumn>()));
In previous versions of .NET the generic overloads of the Enum functions were not defined; in that case you will need to pass the type of the enum using typeof:
liste.Add(String.Join(',', Enum.GetNames(typeof(CsvColumn))));
https://learn.microsoft.com/en-us/dotnet/api/system.enum.getnames?view=netframework-4.7.2
In the following logic using the enum references we need to explicitly cast the enum values to int. If you were using int constants instead, then the (int) explicit cast is not required. Either way, we can now immediately understand the intent of the logic, without having to remember what the columns at index 5 and 6 are supposed to mean.
if (decimal.TryParse(cellArray[(int)CsvColumn.Betrag], NumberStyles.Any, ci, out decimal betrag))
{
if (betrag >= 0)
{
cellArray[(int)CsvColumn.Soll] = "42590";
cellArray[(int)CsvColumn.Haben] = "441206";
}
else
{
cellArray[(int)CsvColumn.Soll] = "441206";
cellArray[(int)CsvColumn.Haben] = "42590";
}
// Assuming we only write to the purple field when the green field was a decimal:
cellArray[(int)CsvColumn.Belegnummer] = "a dummy text";
}
View a fiddle of this implementation: https://dotnetfiddle.net/Cd10Cd
Of course, a similar technique could be used for the "42590" and "441206" values; these must have some sort of business relevance/significance, so store them as named string constants too.
What you have here I refer to as Magic Strings: they carry no meaning of their own and could easily be corrupted during code refactoring. If a discrete value has a specific business meaning, then it should also have a specific name in the code.
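As a rough sketch of that idea: the names below (LedgerAccounts, AccountA, AccountB, Resolve) are placeholders, because the question never says what the two numbers stand for. In real code you would pick names that reflect their business meaning.

```csharp
using System;

public static class LedgerAccounts
{
    // Placeholder names: the source does not say what these account
    // numbers represent, so rename them to match the business meaning.
    public const string AccountA = "42590";
    public const string AccountB = "441206";

    // Returns the (Soll, Haben) pair for an amount, following the rule
    // from the question: non-negative amounts put AccountA in Soll and
    // AccountB in Haben, negative amounts swap the two.
    public static (string Soll, string Haben) Resolve(decimal betrag)
    {
        return betrag >= 0
            ? (AccountA, AccountB)
            : (AccountB, AccountA);
    }
}
```

The if/else block in the loop then shrinks to one call, for example var accounts = LedgerAccounts.Resolve(betrag); followed by assigning accounts.Soll and accounts.Haben to the two cells.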
OO Approach
Using an Object-Oriented approach can mean a lot of things. In this context there are 3 different concerns or processes that we want to separate: parsing the input, executing business logic, and formatting the output. You could simply make 3 methods that accept an array of strings, but this code becomes hard to understand. By using a structured object to model our business domain concept of a row from the CSV file, we can remove many assumptions, for instance, which element in the array is the Betrag (amount).
View the OO Fiddle here: https://dotnetfiddle.net/tjxcQN
You could use this Object-Oriented concept in the above code directly, parsing each line into the object, processing and serializing back to a string value all in one code block; however, that makes it hard to gain a higher-level view of the process, which is necessary to understand the code itself. Even if you do this in your head, when we look at our peers' code, we break it down into blocks or discrete steps. So to be a good coding citizen, modularise your logic into functional methods where you can: it will assist you in the future when you need to write unit tests, it will help to keep your code clean, and it will allow us to extend your code in the future.
For this example we will create a simple model to represent each line. Note that this example takes the extra step of parsing the date fields into DateTime properties even though you do not need them for this example. I am deliberately using constants instead of an enum to show you a different concept. You use whatever makes sense on the day. This is still a first-principles approach; there are different libraries you can use to manage serialization to and from CSV, XML, JSON and other text formats.
If your business needs are to display this information in an application, rather than just reading a file and writing directly back out to another file, then this information may be helpful to you, otherwise it is a good habit to get into if you are just practising because larger applications or larger teams will require this level of modularisation, which itself is not specifically an OO concept... The OO part comes from where we define the processing logic, in this example the BankRecord contains the logic to parse the CSV string input and how to serialize back to a CSV output.
public class BankRecord
{
/// <summary> Receipt Date </summary>
public DateTime Belegdatum { get; set; }
/// <summary> Entry Date </summary>
public DateTime Buchungsdatum { get; set; }
/// <summary>Sequence number</summary>
public string Belegnummer { get; set; }
/// <summary>Memo - Description</summary>
public string Buchungstext { get; set; }
/// <summary>Purpose</summary>
public string Verwendungszweck { get; set; }
/// <summary>Debit</summary>
public string Soll { get; set; }
/// <summary>Credit</summary>
public string Haben { get; set; }
/// <summary>Amount</summary>
public decimal Betrag { get; set; }
/// <summary>Currency</summary>
public string Währung { get; set; }
/// <summary> Column Index Definitions to simplify the CSV parsing</summary>
public static class Columns
{
public const int Belegdatum = 0;
public const int Buchungsdatum = 1;
public const int Belegnummer = 2;
public const int Buchungstext = 3;
public const int Verwendungszweck = 4;
public const int Soll = 5;
public const int Haben = 6;
public const int Betrag = 7;
public const int Währung = 8;
/// <summary>
/// Construct a CSV Header row from these column definitions
/// </summary>
public static string BuildCsvHeader()
{
return String.Join(',',
nameof(Belegdatum),
nameof(Buchungsdatum),
nameof(Belegnummer),
nameof(Buchungstext),
nameof(Verwendungszweck),
nameof(Soll),
nameof(Haben),
nameof(Betrag),
nameof(Währung));
}
}
/// <summary>
/// Parse a CSV string using the <see cref="Columns"/> definitions as the index for each of the named properties in this class
/// </summary>
/// <param name="csvLine">The CSV Line to parse</param>
/// <param name="provider">An object that supplies culture-specific formatting information.</param>
/// <returns>BankRecord populated from the input string</returns>
public static BankRecord FromCSV(string csvLine, IFormatProvider provider)
{
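// NOTE: this pattern splits on tab or comma; use [;] instead if the input is semicolon-delimited like the sample in the question.
var cellArray = Regex.Split(csvLine, @"[\t,](?=(?:[^\""]|\""[^\""]*\"")*$)")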
.Select(s => Regex.Replace(s.Replace("\"\"", "\""), "^\"|\"$", "")).ToArray();
// TODO: add in some validation, today we'll just check the number of cells.
if (cellArray.Length != 9)
throw new NotSupportedException("Input CSV did not contain the expected number of columns. (Expected 9)");
// The following is rudimentary and doesn't perform any active error checking, the good news is that when it fails you
// will at least know that it was in this specific method. Proper production level error handling is out of scope for this issue.
var transaction = new BankRecord();
transaction.Belegdatum = DateTime.Parse(cellArray[Columns.Belegdatum], provider);
transaction.Buchungsdatum = DateTime.Parse(cellArray[Columns.Buchungsdatum], provider);
transaction.Belegnummer = cellArray[Columns.Belegnummer];
transaction.Buchungstext = cellArray[Columns.Buchungstext];
transaction.Verwendungszweck = cellArray[Columns.Verwendungszweck];
transaction.Soll = cellArray[Columns.Soll];
transaction.Haben = cellArray[Columns.Haben];
transaction.Betrag = Decimal.Parse(cellArray[Columns.Betrag], provider);
transaction.Währung = cellArray[Columns.Währung];
return transaction;
}
/// <summary>
/// Write this object out to a CSV string that can be interpreted using the <see cref="Columns"/> definitions as the index for each of the named properties in this class
/// </summary>
/// <param name="provider">An object that supplies culture-specific formatting information.</param>
/// <returns>CSV string that represents this record.</returns>
public string ToCSV(IFormatProvider provider)
{
return String.Join(',',
CsvEscape(Belegdatum, provider),
CsvEscape(Buchungsdatum, provider),
CsvEscape(Belegnummer, provider),
CsvEscape(Buchungstext, provider),
CsvEscape(Verwendungszweck, provider),
CsvEscape(Soll, provider),
CsvEscape(Haben, provider),
CsvEscape(Betrag, provider),
CsvEscape(Währung, provider));
}
/// <summary>
/// Simple routine to format a value for CSV output
/// </summary>
/// <param name="value">The value to serialize</param>
/// <param name="provider">An object that supplies culture-specific formatting information.</param>
/// <returns>Value escaped and safe for direct inclusion in a CSV output</returns>
private string CsvEscape(object value, IFormatProvider provider)
{
if (value == null)
return string.Empty;
string stringValue = String.Format(provider, "{0}", value);
if (stringValue.Contains(','))
return $"\"{stringValue}\"";
else
return stringValue;
}
/// <summary>
/// Format a Date value for CSV output
/// </summary>
/// <param name="value">The value to serialize</param>
/// <param name="provider">An object that supplies culture-specific formatting information.</param>
/// <remarks>Simple overload to allow for common syntax between types, removes the need for the caller to understand the differences</remarks>
/// <returns>Value escaped and safe for direct inclusion in a CSV output</returns>
private string CsvEscape(DateTime value, IFormatProvider provider)
{
string stringValue = String.Format(provider, "{0:d}", value);
if (stringValue.Contains(','))
return $"\"{stringValue}\"";
else
return stringValue;
}
}
The following is the process logic:
CultureInfo ci = new CultureInfo("de-DE"); // necessary only when running the code from other cultures.
// I'll leave this in, but don't call your list "liste"; instead give it some context or meaning, like "records" or "transactions"
List<BankRecord> liste = new List<BankRecord>();
SaveFileDialog dialog = new SaveFileDialog();
dialog.Filter = "CSV (*.csv)|*.csv|All files (*.*)|*.*";
if (dialog.ShowDialog() == true)
{
string line;
// Read the file line by line.
try
{
#region Parse the input File
System.IO.StreamReader file = new System.IO.StreamReader(path);
while ((line = file.ReadLine()) != null)
{
try
{
liste.Add(BankRecord.FromCSV(line, ci));
}
catch
{
// TODO: re-raise or otherwise handle this error if you want.
// today we will simply ignore erroneous entries and will suppress this error
}
}
#endregion Parse the input File
#region Evaluate your business rules
// Evaluate your business rules here, natively in C#, no arrays or indexes, just manipulate the business domain object.
// assuming that Belegnummer is a sequencing number, not sure if it is from the start of this file or a different context...
// This just demonstrates a potential reason for NOT encapsulating the processing logic inside the BankRecord class.
int previousLineNumber = 47; // arbitrary start...
foreach (var transaction in liste)
{
// Check value of Betrag, only operate on it if there is a decimal value there
if (transaction.Betrag >= 0)
{
transaction.Soll = "42590";
transaction.Haben = "441206";
}
else
{
transaction.Soll = "441206";
transaction.Haben = "42590";
}
transaction.Belegnummer = $"#{++previousLineNumber}";
}
#endregion Evaluate your business rules
#region Now write to the output
List<string> outputLines = new List<string>();
outputLines.Add(BankRecord.Columns.BuildCsvHeader());
outputLines.AddRange(liste.Select(x => x.ToCSV(ci)));
File.WriteAllLines(dialog.FileName, outputLines);
file.Close();
#endregion Now write to the output
}
catch
{
MessageBox.Show("Der Gewählte Prozess wird bereits von einem anderen verwendet,\n " +
" bitte versuchen sie es erneut");
}
}
Final Output:
Belegdatum,Buchungsdatum,Belegnummer,Buchungstext,Verwendungszweck,Soll,Haben,Betrag,Währung
31.10.2019,01.11.2019,#48,Some Text,,42590,441206,"50,43",EUR
31.10.2019,01.11.2019,#49,Some Text,,441206,42590,"-239,98",EUR
31.10.2019,31.10.2019,#50,Some Text,,441206,42590,-500,EUR
Belegdatum | Buchungsdatum | Belegnummer | Buchungstext | Verwendungszweck | Soll   | Haben  | Betrag  | Währung
31.10.2019 | 01.11.2019    | #48         | Some Text    |                  | 42590  | 441206 | 50,43   | EUR
31.10.2019 | 01.11.2019    | #49         | Some Text    |                  | 441206 | 42590  | -239,98 | EUR
31.10.2019 | 31.10.2019    | #50         | Some Text    |                  | 441206 | 42590  | -500    | EUR
I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.
Some of the ideas we are looking at are: quoted Identifiers (value "," values ","etc) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.
There's actually a spec for CSV format, RFC 4180 and how to handle commas:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
http://tools.ietf.org/html/rfc4180
So, to have values foo and bar,baz, you do this:
foo,"bar,baz"
Another important requirement to consider (also from the spec):
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
As others have said, you need to escape values that include quotes. Here’s a little CSV reader in C♯ that supports quoted values, including embedded quotes and carriage returns.
By the way, this is unit-tested code. I’m posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.
You can use it as follows:
using System;
public class test
{
public static void Main()
{
using ( CsvReader reader = new CsvReader( "data.csv" ) )
{
foreach( string[] values in reader.RowEnumerator )
{
Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
}
}
Console.ReadLine();
}
}
Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.
using System.IO;
using System.Text.RegularExpressions;
public sealed class CsvReader : System.IDisposable
{
public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
{
}
public CsvReader( Stream stream )
{
__reader = new StreamReader( stream );
}
public System.Collections.IEnumerable RowEnumerator
{
get {
if ( null == __reader )
throw new System.ApplicationException( "I can't start reading without CSV input." );
__rowno = 0;
string sLine;
string sNextLine;
while ( null != ( sLine = __reader.ReadLine() ) )
{
while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
sLine += "\n" + sNextLine;
__rowno++;
string[] values = rexCsvSplitter.Split( sLine );
for ( int i = 0; i < values.Length; i++ )
values[i] = Csv.Unescape( values[i] );
yield return values;
}
__reader.Close();
}
}
public long RowIndex { get { return __rowno; } }
public void Dispose()
{
if ( null != __reader ) __reader.Dispose();
}
//============================================
private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
}
public static class Csv
{
public static string Escape( string s )
{
if ( s.Contains( QUOTE ) )
s = s.Replace( QUOTE, ESCAPED_QUOTE );
if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
s = QUOTE + s + QUOTE;
return s;
}
public static string Unescape( string s )
{
if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
{
s = s.Substring( 1, s.Length - 2 );
if ( s.Contains( ESCAPED_QUOTE ) )
s = s.Replace( ESCAPED_QUOTE, QUOTE );
}
return s;
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
The CSV format uses commas to separate values. Values that contain carriage returns, linefeeds, commas, or double quotes are surrounded by double quotes. Values that contain double quotes are quoted, and each literal quote is escaped by an immediately preceding quote. For example, the 3 values:
test
list, of, items
"go" he said
would be encoded as:
test
"list, of, items"
"""go"" he said"
Any field can be quoted but only fields that contain commas, CR/NL, or quotes must be quoted.
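The encoding rule can be sketched in a few lines. The class and method names here (CsvEncoder, EncodeField, EncodeRecord) are illustrative only, and the sketch assumes the fields are non-null strings.

```csharp
using System;

public static class CsvEncoder
{
    // Characters that force a field to be wrapped in double quotes.
    private static readonly char[] MustQuote = { ',', '"', '\r', '\n' };

    // Leaves plain fields untouched; otherwise doubles every embedded
    // quote and wraps the whole field in double quotes.
    public static string EncodeField(string value)
    {
        if (value.IndexOfAny(MustQuote) < 0)
            return value;
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Joins the encoded fields into one record (no trailing line break).
    public static string EncodeRecord(params string[] values)
    {
        return string.Join(",", Array.ConvertAll(values, v => EncodeField(v)));
    }
}
```

Applied to the three example values above, EncodeField produces exactly the three encoded forms shown.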
There is no real standard for the CSV format, but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV, it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.
A gotcha that many CSV modules I have seen don't accommodate is the fact that multiple lines can be encoded in a single field which means you can't assume that each line is a separate record, you either need to not allow newlines in your data or be prepared to handle this.
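To handle that case, a reader has to track whether it is inside a quoted field instead of splitting on line breaks first. The following is a minimal sketch of such a reader, not a complete CSV implementation; the name CsvRecords.Parse is hypothetical, and for brevity it works on a string that is already in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvRecords
{
    // Parses CSV text into records, treating commas and line breaks
    // inside quoted fields as data rather than separators.
    public static List<string[]> Parse(string text)
    {
        var records = new List<string[]>();
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        bool recordStarted = false; // true once the current record has any content

        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < text.Length && text[i + 1] == '"')
                {
                    field.Append('"'); // doubled quote is a literal quote
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;  // closing quote
                }
                else
                {
                    field.Append(c);   // may be a comma, CR or LF
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
                recordStarted = true;
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());
                field.Clear();
                recordStarted = true;
            }
            else if (c == '\r' || c == '\n')
            {
                // Treat CRLF as a single record terminator.
                if (c == '\r' && i + 1 < text.Length && text[i + 1] == '\n') i++;
                fields.Add(field.ToString());
                field.Clear();
                records.Add(fields.ToArray());
                fields.Clear();
                recordStarted = false;
            }
            else
            {
                field.Append(c);
                recordStarted = true;
            }
        }

        // Flush the last record when the text does not end with a line break.
        if (recordStarted)
        {
            fields.Add(field.ToString());
            records.Add(fields.ToArray());
        }
        return records;
    }
}
```

Note that in this sketch an empty line produces a record with a single empty field, and an unterminated quote simply swallows the rest of the input; a production reader would want to report both.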
Put double quotes around strings. That is generally what Excel does.
As Eli said, you escape a double quote as two double quotes. E.g.
"test1","foo""bar","test2"
You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:
data,more data,more data\, even,yet more
You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.
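A sketch of that scheme, assuming backslash is the escape character and that a backslash always makes the following character literal. The names BackslashCsv.Split and BackslashCsv.Escape are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BackslashCsv
{
    // Splits a line on unescaped commas. A backslash makes the next
    // character literal, so "\," is a comma inside a value and "\\" is
    // a literal backslash.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '\\' && i + 1 < line.Length)
            {
                current.Append(line[++i]); // take the escaped character as-is
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }
        fields.Add(current.ToString());
        return fields;
    }

    // Escapes backslashes first, then commas, so that Split returns the
    // original value unchanged.
    public static string Escape(string value)
    {
        return value.Replace("\\", "\\\\").Replace(",", "\\,");
    }
}
```

The trade-off is that this format is your own convention: spreadsheet applications that expect quoted fields will not understand the backslashes, and values containing line breaks still need separate handling.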
There is a library available through nuget for dealing with pretty much any well formed CSV (.net) - CsvHelper
Example to map to a class:
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>();
Example to read individual fields:
var csv = new CsvReader( textReader );
while( csv.Read() )
{
var intField = csv.GetField<int>( 0 );
var stringField = csv.GetField<string>( 1 );
var boolField = csv.GetField<bool>( "HeaderName" );
}
Letting the client drive the file format:
A comma (,) is the standard field delimiter, and a double quote (") is the standard character used to quote fields that contain a delimiter, quote, or line ending.
To use (for example) # for fields and ' for escaping:
var csv = new CsvReader( textReader );
csv.Configuration.Delimiter = "#";
csv.Configuration.Quote = '\'';
// read the file however meets your needs
More Documentation
In case you're on a *nix-system, have access to sed and there can be one or more unwanted commas only in a specific field of your CSV, you can use the following one-liner in order to enclose them in " as RFC4180 Section 2 proposes:
sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile
Depending on which field the unwanted comma(s) may be in you have to alter/extend the capturing groups of the regex (and the substitution).
The example above will enclose the fourth field (out of six) in quotation marks.
In combination with the --in-place-option you can apply these changes directly to the file.
In order to "build" the right regex, there's a simple principle to follow:
For every field in your CSV that comes before the field with the unwanted comma(s) you write one [^,]*, and put them all together in a capturing group.
For the field that contains the unwanted comma(s) you write (.*).
For every field after the field with the unwanted comma(s) you write one ,.* and put them all together in a capturing group.
Here is a short overview of different possible regexes/substitutions depending on the specific field. If not given, the substitution is \1"\2"\3.
([^,]*)(,.*) #first field, regex
"\1"\2 #first field, substitution
(.*,)([^,]*) #last field, regex
\1"\2" #last field, substitution
([^,]*,)(.*)(,.*,.*,.*) #second field (out of five fields)
([^,]*,[^,]*,)(.*)(,.*) #third field (out of four fields)
([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*) #fourth field (out of six fields)
If you want to remove the unwanted comma(s) with sed instead of enclosing them with quotation marks refer to this answer.
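If sed isn't available, the same capturing-group idea carries over to any regex engine. A minimal sketch in Python, reusing the "fourth field out of six" pattern from above (the names `FOURTH_OF_SIX` and `quote_fourth_field` are made up for illustration):

```python
import re

# Three fields before the problem field, then the problem field itself,
# then two fields after it: the same layout as the sed example.
FOURTH_OF_SIX = re.compile(r'([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)')

def quote_fourth_field(line):
    """Enclose the fourth field (out of six) in double quotes."""
    return FOURTH_OF_SIX.sub(r'\1"\2"\3', line, count=1)

print(quote_fourth_field("a,b,c,d1,d2,e,f"))
```

As with the sed version, this only works when the unwanted commas are confined to that one field.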
As mentioned in my comment to harpo's answer, his solution is good and works in most cases; however, in some scenarios, when commas are directly adjacent to each other, it fails to split on the commas.
This is because the regex string behaves unexpectedly as a verbatim string.
To get this to behave correctly, all " characters in the regex string need to be escaped manually, without using the verbatim escape.
I.e., the regex should be this, using manual escapes:
",(?=(?:[^\"\"]*\"\"[^\"\"]*\"\")*(?![^\"\"]*\"\"))"
which translates into ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
When using a verbatim string @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" it behaves as the following, as you can see if you debug the regex:
",(?=(?:[^"]*"[^"]*")*(?![^"]*"))"
So in summary, I recommend harpo's solution, but watch out for this little gotcha!
I've included into the CsvReader a little optional failsafe to notify you if this error occurs (if you have a pre-known number of columns):
if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));
This can be injected via the constructor:
public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
_expectedDataLength = expectedDataLength;
}
Add a reference to the Microsoft.VisualBasic (yes, it says VisualBasic but it works in C# just as well - remember that at the end it is all just IL).
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file. Here is the sample code:
Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
parser.TextFieldType = FieldType.Delimited
parser.SetDelimiters(",")
While Not parser.EndOfData
'Processing row
Dim fields() As String = parser.ReadFields
For Each field As String In fields
'TODO: Process field
Next
End While
'Close the parser only after all rows have been read
parser.Close()
You can use alternative "delimiters" like ";" or "|" but simplest might just be quoting which is supported by most (decent) CSV libraries and most decent spreadsheets.
For more on CSV delimiters and a spec for a standard format for describing delimiters and quoting see this webpage
If you're interested in a more educational exercise on how to parse files in general (using CSV as an example), you may check out this article by Julian Bucknall. I like the article because it breaks things down into much smaller problems that are much less insurmountable. You first create a grammar, and once you have a good grammar, it's a relatively easy and methodical process to convert the grammar into code.
The article uses C# and has a link at the bottom to download the code.
If you feel like reinventing the wheel, the following may work for you:
public static IEnumerable<string> SplitCSV(string line)
{
var s = new StringBuilder();
bool escaped = false, inQuotes = false;
foreach (char c in line)
{
if (c == ',' && !inQuotes)
{
yield return s.ToString();
s.Clear();
}
else if (c == '\\' && !escaped)
{
escaped = true;
}
else if (c == '"' && !escaped)
{
inQuotes = !inQuotes;
}
else
{
escaped = false;
s.Append(c);
}
}
yield return s.ToString();
}
In Europe we ran into this problem much earlier than this question. In Europe we all use a comma as the decimal point. See the numbers below:
| American | Europe |
| ------------- | ------------- |
| 0.5 | 0,5 |
| 3.14159265359 | 3,14159265359 |
| 17.54 | 17,54 |
| 175,186.15 | 175.186,15 |
So it isn't possible to use the comma as the separator for CSV files. For that reason, CSV files in Europe are separated by semicolons (;).
Programs like Microsoft Excel can read files separated by semicolons, and it's possible to switch the separator. You could even use a tab (\t) as the separator. See this answer on Super User.
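Most CSV libraries let you pick the separator as well. A minimal sketch, assuming Python's standard csv module and some made-up sample data (`read_semicolon_csv` is a hypothetical helper name):

```python
import csv
import io

def read_semicolon_csv(text):
    """Parse semicolon-separated text so decimal commas stay inside their fields."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))

sample = "name;value\npi;3,14159265359\nprice;175.186,15\n"
rows = read_semicolon_csv(sample)
print(rows)
```

Passing `delimiter="\t"` instead would read tab-separated data the same way.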
Here's a neat little workaround:
You can use a Greek Lower Numeral Sign instead (U+0375)
It looks like this ͵
Using this method saves you a lot of resources too...
I know it's almost 13 years later, but we came across a similar situation where the client sends us a CSV that has values with commas. There are 2 use cases:
If the client uses Excel on Windows to write the CSV (usually the case in a Windows environment), then quotes are automatically added around the value.
The actual text value of the CSV:
3786962,1st Meridian Care Services,John,"Person A,Person B, Person C, Person D",Voyager
If the client is generating the CSV programmatically, then they should adhere to RFC 4180 and enclose the value in "quotes". Example:
Col1, Col2, "a, b, c", Col4
Just use SoftCircuits.CsvParser on NuGet. It will handle all those details for you and efficiently handles very large files. And, if needed, it can even import/export objects by mapping columns to object properties. In addition, my testing showed it averages nearly 4 times faster than the popular CsvHelper.
You can read the CSV file like this.
This makes use of splits and takes care of blank leading columns.
ArrayList List = new ArrayList();
static ServerSocket Server;
static Socket socket;
static ArrayList<Object> list = new ArrayList<Object>();
public static void ReadFromXcel() throws FileNotFoundException
{
File f = new File("Book.csv");
Scanner in = new Scanner(f);
int count =0;
String[] date;
String[] name;
String[] Temp = new String[10];
String[] Temp2 = new String[10];
String[] numbers;
ArrayList<String[]> List = new ArrayList<String[]>();
HashMap m = new HashMap();
in.nextLine();
date = in.nextLine().split(",");
name = in.nextLine().split(",");
numbers = in.nextLine().split(",");
while(in.hasNext())
{
String[] one = in.nextLine().split(",");
List.add(one);
}
int xount = 0;
//Making sure the lines don't start with a blank
for(int y = 0; y<= date.length-1; y++)
{
if(!date[y].equals(""))
{
Temp[xount] = date[y];
Temp2[xount] = name[y];
xount++;
}
}
date = Temp;
name =Temp2;
int counter = 0;
while(counter < List.size())
{
String[] list = List.get(counter);
String sNo = list[0];
String Surname = list[1];
String Name = list[2];
for(int x = 3; x < list.length; x++)
{
m.put(numbers[x], list[x]);
}
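// Student is assumed to be a user-defined class (not shown here), and
// StudentList a collection of Student objects declared elsewhere.
Student s = new Student(sNo, Name, Surname, m, false);
StudentList.add(s);
System.out.println(s.sNo);
counter++;
}
}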
I generally URL-encode the fields which can have any commas or any special chars. And then decode it when it is being used/displayed in any visual medium.
(commas becomes %2C)
Every language should have methods to URL-encode and decode strings.
e.g., in java
URLEncoder.encode(myString,"UTF-8"); //to encode
URLDecoder.decode(myEncodedstring, "UTF-8"); //to decode
I know this is a very general solution, and it might not be ideal for situations where the user wants to view the contents of the CSV file manually.
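The same idea in Python, as a rough sketch using the standard urllib.parse module (the helper names `encode_field` and `decode_field` and the sample row are made up):

```python
from urllib.parse import quote, unquote

def encode_field(value):
    """Percent-encode a field so it can never contain a literal comma or quote."""
    # safe="" makes quote() encode every reserved character, including "/".
    return quote(value, safe="")

def decode_field(value):
    """Reverse encode_field()."""
    return unquote(value)

row = ["Smith, John", 'said "hi"', "plain"]
line = ",".join(encode_field(v) for v in row)
restored = [decode_field(f) for f in line.split(",")]

print(line)
print(restored)
```

Because every comma inside a value becomes %2C, a plain split on "," is safe when reading the line back.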
I usually do this in my CSV files parsing routines. Assume that 'line' variable is one line within a CSV file and all of the columns' values are enclosed in double quotes. After the below two lines execute, you will get CSV columns in the 'values' collection.
// The two lines below split the columns and trim the DOUBLE QUOTES around values but NOT within them
string trimmedLine = line.Trim(new char[] { '\"' });
List<string> values = trimmedLine.Split(new string[] { "\",\"" }, StringSplitOptions.None).ToList();
The simplest solution I've found is the one LibreOffice uses:
Replace all literal " by ”
Put double quotes around your string
You can also use the one that Excel uses:
Replace all literal " by ""
Put double quotes around your string
Note that other people recommended doing only step 2 above, but that doesn't work with lines where a " is followed by a ,, like in a CSV where you want to have a single column with the string hello",world, as the CSV would read:
"hello",world"
Which is interpreted as a row with two columns: hello and world"
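Here is a small sketch of the Excel-style variant, in Python for brevity. The function name `quote_field` is made up, and Python's standard csv module is used only to check how each version gets read back:

```python
import csv

def quote_field(value):
    """Step 1: double every literal quote. Step 2: wrap the result in quotes."""
    return '"' + value.replace('"', '""') + '"'

original = 'hello",world'

# Doing both steps round-trips to a single column.
quoted = quote_field(original)
parsed = next(csv.reader([quoted]))

# Doing only step 2 reproduces the broken row shown above.
naive = '"' + original + '"'
naive_parsed = next(csv.reader([naive]))

print(quoted, parsed)
print(naive, naive_parsed)
```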
public static IEnumerable<string> LineSplitter(this string line, char separator, char skip = '"')
{
var fieldStart = 0;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == skip)
{
// Jump to the closing quote so separators inside quotes are ignored.
for (i++; i < line.Length && line[i] != skip; i++) { }
}
else if (line[i] == separator)
{
yield return line.Substring(fieldStart, i - fieldStart);
fieldStart = i + 1;
}
}
// Emit the final field; it is empty when the line ends with a separator.
// (The original loop dropped the last field when it ended with a quote.)
yield return line.Substring(fieldStart);
}
I used the Csvreader library, but with it the data was split at every comma (,) inside the column values.
So if you want to insert CSV file data that contains commas (,) in most of the column values, you can use the function below.
Author link => https://gist.github.com/jaywilliams/385876
function csv_to_array($filename='', $delimiter=',')
{
if(!file_exists($filename) || !is_readable($filename))
return FALSE;
$header = NULL;
$data = array();
if (($handle = fopen($filename, 'r')) !== FALSE)
{
while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
{
if(!$header)
$header = $row;
else
$data[] = array_combine($header, $row);
}
fclose($handle);
}
return $data;
}
I used the Papa Parse library to parse the CSV file into key-value pairs (the keys come from the header, i.e. the first row of the CSV file).
Here is the example that I use:
https://codesandbox.io/embed/llqmrp96pm
It has a dummy.csv file in there for the CSV parsing demo.
I've used it within ReactJS, though it is easy and simple to replicate in an app written in any language.
An example might help to show how commas can be displayed in a .csv file. Create a simple text file with the lines below, save it with the suffix ".csv", and open it with Excel 2000 on Windows 10:
aa,bb,cc,d;d
"In the spreadsheet presentation, the below line should look like the above line except the below shows a displayed comma instead of a semicolon between the d's."
aa,bb,cc,"d,d", This works even in Excel
aa,bb,cc,"d,d", This works even in Excel 2000
aa,bb,cc,"d ,d", This works even in Excel 2000
aa,bb,cc,"d , d", This works even in Excel 2000
aa,bb,cc, " d,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d ,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d , d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc,"d,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d ,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d , d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
Rule: If you want to display a comma in a cell (field) of a .csv file:
"Start and end the field with a double quotes, but avoid white space before the 1st quote"
As this is about general practices, let's start from rules of thumb:
Don't use CSV; use XML with a library to read and write the XML file instead.
If you must use CSV, do it properly and use a free library to parse and store the CSV files.
To justify 1), most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble.
For example, Excel 2002 stores CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted :(.
On the other hand, the XML standard is well adopted and handles encodings pretty well.
To justify 2), there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel, even if the solution looks pretty simple.
To name a few:
for Python, use the built-in csv module
for Perl, check CPAN and Text::CSV
for PHP, use the built-in fgetcsv/fputcsv functions
for Java, check the SuperCSV library
Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"
For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)
Depending on your situation, semi colons may also be used as CSV field separators.
Given my requirements, I can use a character, e.g., single low-9 quotation mark, that looks like a comma.
So, here's how you can do it in Go:
// Replace special CSV characters with single low-9 quotation mark
func Scrub(a interface{}) string {
s := fmt.Sprint(a)
s = strings.Replace(s, ",", "‚", -1)
s = strings.Replace(s, ";", "‚", -1)
return s
}
The second comma-looking character in the Replace function is decimal 8218.
Be aware that if you have clients that may have ASCII-only text readers, this decimal 8218 character will not look like a comma. If this is your case, then I'd recommend surrounding the field containing the comma (or semicolon) with double quotes per RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Thank you others in this post.
I used the information here to create a function in JavaScript that will get csv output for an array of objects which may have property values containing commas.
like
rowsArray = [{obj1prop1: "foo", obj1prop2: "bar,baz"}, {obj2prop1: "qux", obj2prop2: "quux,corge,thud"}]
into
csvRowsArray = [{obj1prop1: "foo", obj1prop2: "\"bar,baz\""}, {...} ]
To use the commas in the values in a csv, the value needs to be wrapped in double quotes. And in order to have double quotes in the value in the json object, they just need to be escaped, i.e., \", backslash double quote. The escape is made here by subbing in a template literal and including the necessary quotes `"${row[key]}"`. The quotes are escaped when put in the object.
Here is my function:
const calculateTheCSVExport = (props) => {
if (props.rows === undefined) return;
let jsonRowsArray = props.rows;
// console.log(jsonRowsArray);
let csvRowsArrayNoCommasInObjectValues = [];
let csvCurrRowObject = {}
jsonRowsArray.forEach(row => {
Object.keys(row).forEach(key => {
// console.log(key, row[key])
if (row[key].indexOf(',') > -1) {
csvCurrRowObject = {...csvCurrRowObject, [key]: `"${row[key]}"`} // enclose value in escaped double quotes in JSON in order to export commas to csv correctly. see more: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file
} else {
csvCurrRowObject = {...csvCurrRowObject, [key]: row[key]}
}
});
csvRowsArrayNoCommasInObjectValues.push(csvCurrRowObject);
csvCurrRowObject = {};
})
// console.log(csvRowsArrayNoCommasInObjectValues)
return csvRowsArrayNoCommasInObjectValues;
}
I think the easiest solution to this problem is to have the customer open the CSV in Excel and then press Ctrl+H to replace all commas with whatever identifier you want. This is very easy for the customer and requires only one change in your code to read the delimiter of your choice.
Use a tab character (\t) to separate the fields.
As part of a recent project I had to read and write a CSV file and display it in a grid view in C#. In the end I decided to use a ready-built parser to do the work for me.
Because I like to do that kind of stuff, I wondered how to go about writing my own.
So far all I've managed to do is this:
//Read the header
StreamReader reader = new StreamReader(dialog.FileName);
string row = reader.ReadLine();
string[] cells = row.Split(',');
//Create the columns of the dataGridView
for (int i = 0; i < cells.Count() - 1; i++)
{
DataGridViewTextBoxColumn column = new DataGridViewTextBoxColumn();
column.Name = cells[i];
column.HeaderText = cells[i];
dataGridView1.Columns.Add(column);
}
//Display the contents of the file
while (reader.Peek() != -1)
{
row = reader.ReadLine();
cells = row.Split(',');
dataGridView1.Rows.Add(cells);
}
My question: is carrying on like this a wise idea, and if it is (or isn't) how would I test it properly?
As a programming exercise (for learning and gaining experience) it is probably a very reasonable thing to do. For production code, it may be better to use an existing library mainly because the work is already done. There are quite a few things to address with a CSV parser. For example (randomly off the top of my head):
Quoted values (strings)
Embedded quotes in quoted strings
Empty values (NULL ... or maybe even NULL vs. empty).
Lines without the correct number of entries
Headers vs. no headers.
Recognizing different data types (e.g., different date formats).
If you have a very specific input format in a very controlled environment, though, you may not need to deal with all of those.
... is carrying on like this a wise idea ...?
Since you're doing this as a learning exercise, you may want to dig deeper into lexing and parsing theory. Your current approach will show its shortcomings fairly quickly as described in Stop Rolling Your Own CSV Parser!. It's not that parsing CSV data is difficult. (It's not.) It's just that most CSV parser projects treat the problem as a text splitting problem versus a parsing problem. If you take the time to define the CSV "language", the parser almost writes itself.
RFC 4180 defines a grammar for CSV data in ABNF form:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234
DQUOTE = %x22 ;as per section 6.1 of RFC 2234
LF = %x0A ;as per section 6.1 of RFC 2234
CRLF = CR LF ;as per section 6.1 of RFC 2234
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
This grammar shows how single characters are built up to create more and more complex language elements. (As written, definitions go the opposite direction from complex to simple.)
If you start with a grammar, you can write parsing functions that mirror non-terminal grammar elements (the lowercase items). Julian M Bucknall describes the process in Writing a parser for CSV data. Take a look at Test-Driven Development with ANTLR for an example of the same process using a parser generator.
Keep in mind, there is no one accepted CSV definition. CSV data in the wild is not guaranteed to implement all of the RFC 4180 suggestions.
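To make "parsing functions that mirror the grammar" concrete, here is a rough sketch of a recursive-descent parser, written in Python for brevity rather than C#. It is not Bucknall's code. The class and method names are my own, the header rule is omitted, and TEXTDATA is relaxed to "any character other than a comma, quote, CR, or LF".

```python
class CsvParser:
    """Recursive-descent CSV parser; each method mirrors one RFC 4180 rule."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self, literal):
        return self.text.startswith(literal, self.pos)

    def at_end(self):
        return self.pos >= len(self.text)

    def parse_file(self):
        # file = record *(CRLF record) [CRLF]
        records = [self.parse_record()]
        while self.peek("\r\n"):
            self.pos += 2
            if self.at_end():  # optional trailing CRLF
                break
            records.append(self.parse_record())
        if not self.at_end():
            raise ValueError(f"unexpected character at offset {self.pos}")
        return records

    def parse_record(self):
        # record = field *(COMMA field)
        fields = [self.parse_field()]
        while self.peek(","):
            self.pos += 1
            fields.append(self.parse_field())
        return fields

    def parse_field(self):
        # field = (escaped / non-escaped)
        if self.peek('"'):
            return self.parse_escaped()
        return self.parse_non_escaped()

    def parse_escaped(self):
        # escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
        self.pos += 1  # consume the opening DQUOTE
        chars = []
        while not self.at_end():
            if self.peek('""'):  # 2DQUOTE stands for one literal quote
                chars.append('"')
                self.pos += 2
            elif self.peek('"'):  # closing DQUOTE
                self.pos += 1
                return "".join(chars)
            else:
                chars.append(self.text[self.pos])
                self.pos += 1
        raise ValueError("unterminated quoted field")

    def parse_non_escaped(self):
        # non-escaped = *TEXTDATA (relaxed, see the note above)
        start = self.pos
        while not self.at_end() and self.text[self.pos] not in ',"\r\n':
            self.pos += 1
        return self.text[start:self.pos]


records = CsvParser('a,"b,c"\r\n"x ""y""","multi\r\nline"\r\n').parse_file()
print(records)
```

Each grammar rule maps onto one small method, so adding the header rule or stricter TEXTDATA checks is a local change.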
Get (or make) some CSV data and write Unit Tests using NUnit or Visual Studio Testing Tools.
Be sure to test edge cases like
"csv","Data","with","a","trailing","comma",
and
"csv","Data","with,","commas","and","""quotes""","in","it"
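Those two edge cases might be written as unit tests like this. This is a sketch using Python's unittest instead of NUnit, and `parse_line` is a hypothetical stand-in for whatever parser you are testing (here it just delegates to the standard csv module):

```python
import csv
import unittest

def parse_line(line):
    """Parser under test; swap in your own implementation here."""
    return next(csv.reader([line]))

class CsvEdgeCases(unittest.TestCase):
    def test_trailing_comma_yields_empty_last_field(self):
        line = '"csv","Data","with","a","trailing","comma",'
        expected = ["csv", "Data", "with", "a", "trailing", "comma", ""]
        self.assertEqual(parse_line(line), expected)

    def test_embedded_commas_and_quotes(self):
        line = '"csv","Data","with,","commas","and","""quotes""","in","it"'
        expected = ["csv", "Data", "with,", "commas", "and", '"quotes"', "in", "it"]
        self.assertEqual(parse_line(line), expected)
```

Save it in a file and run it with python -m unittest. The same two assertions translate directly to NUnit's Assert.AreEqual.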
This comes from
http://www.gigawebsolution.com/Posts/Details/61/Building-a-Simple-CSV-Parser-in-C#
public interface ICsvReaderWriter
{
List<string[]> Read(string filePath, char delimiter);
void Write(string filePath, List<string[]> lines, char delimiter);
}
public class CsvReaderWriter : ICsvReaderWriter
{
public List<string[]> Read(string filePath, char delimiter)
{
var fileContent = new List<string[]>();
using (var reader = new StreamReader(filePath, Encoding.Unicode))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (!string.IsNullOrEmpty(line))
{
fileContent.Add(line.Split(delimiter));
}
}
}
return fileContent;
}
public void Write(string filePath, List<string[]> lines, char delimiter)
{
using (var writer = new StreamWriter(filePath, true, Encoding.Unicode))
{
foreach (var line in lines)
{
var data = line.Aggregate(string.Empty,
(current, column) => current +
string.Format("{0}{1}", column,delimiter))
.TrimEnd(delimiter);
writer.WriteLine(data);
}
}
}
}
Parsing a CSV file isn't difficult, but it involves more than simply calling String.Split().
You are breaking the lines at each comma. But it's possible for fields to contain embedded commas. In these cases, CSV wraps the field in double quotes. So you must also look for double quotes and ignore commas within those quotes. In addition, it's even possible for fields to contain embedded double quotes. Double quotes must appear within double quotes and be "doubled up" to indicate the quote is a literal character.
If you'd like to see how I did it, you can check out this article.