How do I handle line breaks in a CSV file using C#? - c#

I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary","21","555-5555"
When I read the CSV file, if a record does not start with a double quote ("), then a line break was inserted by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here is what I've done so far. My records have a fixed format and all start with JTW:
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I check for the ; at position [3] of each line. If it is there, I write the line as a new record; if not, I append it to the previous record (removing the line break).
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
using System;
using System.IO;
namespace EditorCSV
{
class Program
{
static void Main(string[] args)
{
ReadFromFile("c:\\source.csv");
}
static void ReadFromFile(string filename)
{
StreamReader SR;
StreamWriter SW;
SW = File.CreateText("c:\\target.csv");
string S;
char C='a';
int i=0;
SR=File.OpenText(filename);
S=SR.ReadLine();
SW.Write(S);
S = SR.ReadLine();
while(S!=null)
{
try { C = S[3]; }
catch (IndexOutOfRangeException exception){
bool t = false;
while (t == false)
{
t = true;
S = SR.ReadLine();
try { C = S[3]; }
catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
}
}
if( C.Equals(';'))
{
SW.Write("\r\n" + S);
i = i + 1;
}
else
{
SW.Write(S);
}
S=SR.ReadLine();
}
SR.Close();
SW.Close();
Console.WriteLine("Records Processed: " + i.ToString() + " .");
Console.WriteLine("File Created Successfully");
Console.ReadKey();
}
}
}

CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.

Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!

There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
parser.SetDelimiters(separators);
while (!parser.EndOfData)
yield return parser.ReadFields();
}
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
var row = new List<string>();
var isStringBlock = false;
var sb = new StringBuilder();
long charIndex = 0;
int currentLineCount = 0;
while (reader.Peek() != -1)
{
charIndex++;
char c = (char)reader.Read();
if (c == '"')
isStringBlock = !isStringBlock;
if (c == separator && !isStringBlock) //end of word
{
row.Add(sb.ToString().Trim()); //add word
sb.Length = 0;
}
else if (c == '\n' && !isStringBlock) //end of line
{
row.Add(sb.ToString().Trim()); //add last word in line
sb.Length = 0;
//DO SOMETHING WITH row HERE!
currentLineCount++;
row = new List<string>();
}
else
{
if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
}
}
row.Add(sb.ToString().Trim()); //add last word
//DO SOMETHING WITH LAST row HERE!
}

Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.

Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.

Maybe you could count the (") characters during ReadLine(). If the count is odd, that raises the flag. You could either ignore those lines, or read the next one and eliminate the first "\n" occurrence in the merged lines.

What I usually do is read the text in character by character opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, and also the difference between a line break in a row and one in a cell. If I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines in cells are only \n.

There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.

Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
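The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (ColumnCountJoiner, ReadRecords) are made up, the separator never appears inside a field, and the expected column count is known in advance. If the stray line break falls inside the last field, the record already has enough columns and the leftover text will be glued onto the next record, so this is not a general CSV parser.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ColumnCountJoiner
{
    // Hypothetical helper: yields one record per logical row by appending
    // physical lines until the expected number of columns is reached.
    public static IEnumerable<string> ReadRecords(
        TextReader reader, char separator, int expectedColumns)
    {
        string line;
        string pending = null;
        while ((line = reader.ReadLine()) != null)
        {
            // Append without re-inserting the line break, which removes it.
            pending = pending == null ? line : pending + line;
            if (pending.Split(separator).Length >= expectedColumns)
            {
                yield return pending;
                pending = null;
            }
        }
        // Input ended in the middle of a record: hand back what we have.
        if (pending != null)
            yield return pending;
    }
}
```

For the semicolon-separated records in the question, a call might look like ColumnCountJoiner.ReadRecords(File.OpenText("c:\\source.csv"), ';', 4), with 4 replaced by the real number of columns.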

A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
foreach (Capture capture in match.Groups["field"].Captures)
{
string fieldValue = capture.Value;
// Use the value.
}
}

Have a look at FileHelpers Library
It supports reading\writing CSV with line breaks as well as reading\writing to excel

The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);

You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.

For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSVs cells must open and close a double quote, just check if there's an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have the even number.
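A minimal sketch of that idea, assuming quotes are only used to delimit fields or appear doubled ("") inside them, so that a complete record always contains an even number of quote characters. The names CsvRecordJoiner and ReadRecords are invented for this example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvRecordJoiner
{
    // Hypothetical helper: yields one logical record per CSV row, joining
    // physical lines while the number of double quotes seen so far is odd
    // (meaning a quoted field is still open).
    public static IEnumerable<string> ReadRecords(TextReader reader)
    {
        string line;
        string pending = null;
        while ((line = reader.ReadLine()) != null)
        {
            // Re-insert the line break that ReadLine() stripped.
            pending = pending == null ? line : pending + "\n" + line;
            if (pending.Count(c => c == '"') % 2 == 0)
            {
                yield return pending;
                pending = null;
            }
        }
        // Unbalanced quotes at end of input: return what is left as-is.
        if (pending != null)
            yield return pending;
    }
}
```

To drop the embedded line break instead of keeping it, join with an empty string rather than "\n". Each yielded record still has to be split into fields by a quote-aware splitter.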

Related

c# Read/ Write CSV - excluding Comma in field Value [duplicate]

I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.
Some of the ideas we are looking at are: quoted Identifiers (value "," values ","etc) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.
There's actually a spec for CSV format, RFC 4180 and how to handle commas:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
http://tools.ietf.org/html/rfc4180
So, to have values foo and bar,baz, you do this:
foo,"bar,baz"
Another important requirement to consider (also from the spec):
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
As others have said, you need to escape values that include quotes. Here’s a little CSV reader in C♯ that supports quoted values, including embedded quotes and carriage returns.
By the way, this is unit-tested code. I’m posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.
You can use it as follows:
using System;
public class test
{
public static void Main()
{
using ( CsvReader reader = new CsvReader( "data.csv" ) )
{
foreach( string[] values in reader.RowEnumerator )
{
Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
}
}
Console.ReadLine();
}
}
Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.
using System.IO;
using System.Text.RegularExpressions;
public sealed class CsvReader : System.IDisposable
{
public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
{
}
public CsvReader( Stream stream )
{
__reader = new StreamReader( stream );
}
public System.Collections.IEnumerable RowEnumerator
{
get {
if ( null == __reader )
throw new System.ApplicationException( "I can't start reading without CSV input." );
__rowno = 0;
string sLine;
string sNextLine;
while ( null != ( sLine = __reader.ReadLine() ) )
{
while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
sLine += "\n" + sNextLine;
__rowno++;
string[] values = rexCsvSplitter.Split( sLine );
for ( int i = 0; i < values.Length; i++ )
values[i] = Csv.Unescape( values[i] );
yield return values;
}
__reader.Close();
}
}
public long RowIndex { get { return __rowno; } }
public void Dispose()
{
if ( null != __reader ) __reader.Dispose();
}
//============================================
private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
}
public static class Csv
{
public static string Escape( string s )
{
if ( s.Contains( QUOTE ) )
s = s.Replace( QUOTE, ESCAPED_QUOTE );
if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
s = QUOTE + s + QUOTE;
return s;
}
public static string Unescape( string s )
{
if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
{
s = s.Substring( 1, s.Length - 2 );
if ( s.Contains( ESCAPED_QUOTE ) )
s = s.Replace( ESCAPED_QUOTE, QUOTE );
}
return s;
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
The CSV format uses commas to separate values. Values that contain carriage returns, linefeeds, commas, or double quotes are surrounded by double quotes. Values that contain double quotes are quoted, and each literal quote is escaped by an immediately preceding quote. For example, the 3 values:
test
list, of, items
"go" he said
would be encoded as:
test
"list, of, items"
"""go"" he said"
Any field can be quoted but only fields that contain commas, CR/NL, or quotes must be quoted.
There is no real standard for the CSV format, but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV, it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.
A gotcha that many CSV modules I have seen don't accommodate is that multiple lines can be encoded in a single field, which means you can't assume that each line is a separate record. You either need to disallow newlines in your data or be prepared to handle them.
Put double quotes around strings. That is generally what Excel does.
Ala Eli,
you escape a double quote as two
double quotes. E.g.
"test1","foo""bar","test2"
You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:
data,more data,more data\, even,yet more
You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.
There is a library available through nuget for dealing with pretty much any well formed CSV (.net) - CsvHelper
Example to map to a class:
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>();
Example to read individual fields:
var csv = new CsvReader( textReader );
while( csv.Read() )
{
var intField = csv.GetField<int>( 0 );
var stringField = csv.GetField<string>( 1 );
var boolField = csv.GetField<bool>( "HeaderName" );
}
Letting the client drive the file format:
, is the standard field delimiter, " is the standard value used to escape fields that contain a delimiter, quote, or line ending.
To use (for example) # for fields and ' for escaping:
var csv = new CsvReader( textReader );
csv.Configuration.Delimiter = "#";
csv.Configuration.Quote = '\'';
// read the file however meets your needs
More Documentation
In case you're on a *nix-system, have access to sed and there can be one or more unwanted commas only in a specific field of your CSV, you can use the following one-liner in order to enclose them in " as RFC4180 Section 2 proposes:
sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile
Depending on which field the unwanted comma(s) may be in you have to alter/extend the capturing groups of the regex (and the substitution).
The example above will enclose the fourth field (out of six) in quotation marks.
In combination with the --in-place-option you can apply these changes directly to the file.
In order to "build" the right regex, there's a simple principle to follow:
For every field in your CSV that comes before the field with the unwanted comma(s) you write one [^,]*, and put them all together in a capturing group.
For the field that contains the unwanted comma(s) you write (.*).
For every field after the field with the unwanted comma(s) you write one ,.* and put them all together in a capturing group.
Here is a short overview of different possible regexes/substitutions depending on the specific field. If not given, the substitution is \1"\2"\3.
([^,]*)(,.*) #first field, regex
"\1"\2 #first field, substitution
(.*,)([^,]*) #last field, regex
\1"\2" #last field, substitution
([^,]*,)(.*)(,.*,.*,.*) #second field (out of five fields)
([^,]*,[^,]*,)(.*)(,.*) #third field (out of four fields)
([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*) #fourth field (out of six fields)
If you want to remove the unwanted comma(s) with sed instead of enclosing them with quotation marks refer to this answer.
As mentioned in my comment on harpo's answer, his solution is good and works in most cases; however, in some scenarios, when commas are directly adjacent to each other, it fails to split on the commas.
This is because the Regex string behaves unexpectedly as a verbatim string.
To get it to behave correctly, all " characters in the regex string need to be escaped manually, without using the verbatim escape.
I.e., the regex should be this, using manual escapes:
",(?=(?:[^\"\"]*\"\"[^\"\"]*\"\")*(?![^\"\"]*\"\"))"
which translates into ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
When using a verbatim string @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" it behaves as follows, as you can see if you debug the regex:
",(?=(?:[^"]*"[^"]*")*(?![^"]*"))"
So in summary, I recommend harpo's solution, but watch out for this little gotcha!
I've included into the CsvReader a little optional failsafe to notify you if this error occurs (if you have a pre-known number of columns):
if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));
This can be injected via the constructor:
public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
_expectedDataLength = expectedDataLength;
}
Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file. Here is the sample code:
Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
parser.TextFieldType = FieldType.Delimited
parser.SetDelimiters(",")
While Not parser.EndOfData
'Processing row
Dim fields() As String = parser.ReadFields
For Each field As String In fields
'TODO: Process field
Next
End While
parser.Close()
You can use alternative "delimiters" like ";" or "|" but simplest might just be quoting which is supported by most (decent) CSV libraries and most decent spreadsheets.
For more on CSV delimiters and a spec for a standard format for describing delimiters and quoting see this webpage
If you're interested in a more educational exercise on how to parse files in general (using CSV as an example), you may check out this article by Julian Bucknall. I like the article because it breaks things down into much smaller problems that are much less insurmountable. You first create a grammar, and once you have a good grammar, it's a relatively easy and methodical process to convert the grammar into code.
The article uses C# and has a link at the bottom to download the code.
If you feel like reinventing the wheel, the following may work for you:
public static IEnumerable<string> SplitCSV(string line)
{
var s = new StringBuilder();
bool escaped = false, inQuotes = false;
foreach (char c in line)
{
if (c == ',' && !inQuotes)
{
yield return s.ToString();
s.Clear();
}
else if (c == '\\' && !escaped)
{
escaped = true;
}
else if (c == '"' && !escaped)
{
inQuotes = !inQuotes;
}
else
{
escaped = false;
s.Append(c);
}
}
yield return s.ToString();
}
In Europe we ran into this problem much earlier than this question. In Europe we use a comma as the decimal separator. See the numbers below:
| American | Europe |
| ------------- | ------------- |
| 0.5 | 0,5 |
| 3.14159265359 | 3,14159265359 |
| 17.54 | 17,54 |
| 175,186.15 | 175.186,15 |
So it isn't possible to use the comma as the separator in CSV files. For that reason, CSV files in Europe are separated by a semicolon (;).
Programs like Microsoft Excel can read files with a semicolon, and it's possible to change the separator. You could even use a tab (\t) as the separator. See this answer from Super User.
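As a rough illustration of reading such a file in C#, the sketch below splits one semicolon-separated line and parses its numbers with a decimal comma. The names EuropeanCsv and ParseNumbers are hypothetical, and the sketch assumes every field is a plain number with no quoting.

```csharp
using System.Globalization;

public static class EuropeanCsv
{
    // Number format with a decimal comma and a dot as group separator,
    // built by hand so it does not depend on installed culture data.
    private static readonly NumberFormatInfo DecimalComma = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Hypothetical helper: splits on ';' and parses every field as a decimal.
    public static decimal[] ParseNumbers(string line)
    {
        string[] parts = line.Split(';');
        var values = new decimal[parts.Length];
        for (int i = 0; i < parts.Length; i++)
            values[i] = decimal.Parse(parts[i], NumberStyles.Number, DecimalComma);
        return values;
    }
}
```

For example, EuropeanCsv.ParseNumbers("0,5;175.186,15") yields the decimals 0.5 and 175186.15.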
Here's a neat little workaround:
You can use a Greek Lower Numeral Sign instead (U+0375)
It looks like this ͵
Using this method saves you a lot of resources too...
I know it's almost 13 years later, but we came across a similar situation where the client sends us a CSV whose values contain commas. There are 2 use cases:
If the client uses Excel on Windows to write the CSV (usually the case in a Windows environment), then quotes are automatically added around the value.
The actual text value of the CSV:
3786962,1st Meridian Care Services,John,"Person A,Person B, Person C, Person D",Voyager
If the client is generating the CSV programmatically, then they should adhere to RFC 4180 and enclose the value in "quotes". Example:
Col1, Col2, "a, b, c", Col4
Just use SoftCircuits.CsvParser on NuGet. It will handle all those details for you and efficiently handles very large files. And, if needed, it can even import/export objects by mapping columns to object properties. In addition, my testing showed it averages nearly 4 times faster than the popular CsvHelper.
You can read the csv file like this.
this makes use of splits and takes care of spaces.
ArrayList List = new ArrayList();
static ServerSocket Server;
static Socket socket;
static ArrayList<Object> list = new ArrayList<Object>();
public static void ReadFromXcel() throws FileNotFoundException
{
File f = new File("Book.csv");
Scanner in = new Scanner(f);
int count =0;
String[] date;
String[] name;
String[] Temp = new String[10];
String[] Temp2 = new String[10];
String[] numbers;
ArrayList<String[]> List = new ArrayList<String[]>();
HashMap m = new HashMap();
in.nextLine();
date = in.nextLine().split(",");
name = in.nextLine().split(",");
numbers = in.nextLine().split(",");
while(in.hasNext())
{
String[] one = in.nextLine().split(",");
List.add(one);
}
int xount = 0;
//Making sure the lines don't start with a blank
for(int y = 0; y<= date.length-1; y++)
{
if(!date[y].equals(""))
{
Temp[xount] = date[y];
Temp2[xount] = name[y];
xount++;
}
}
date = Temp;
name =Temp2;
int counter = 0;
while(counter < List.size())
{
String[] list = List.get(counter);
String sNo = list[0];
String Surname = list[1];
String Name = list[2];
for(int x = 3; x < list.length; x++)
{
m.put(numbers[x], list[x]);
}
Object newOne = new newOne(sNo, Name, Surname, m, false);
StudentList.add(s);
System.out.println(s.sNo);
counter++;
}
I generally URL-encode the fields which can have any commas or any special chars. And then decode it when it is being used/displayed in any visual medium.
(commas becomes %2C)
Every language should have methods to URL-encode and decode strings.
e.g., in java
URLEncoder.encode(myString,"UTF-8"); //to encode
URLDecoder.decode(myEncodedstring, "UTF-8"); //to decode
I know this is a very general solution and it might not be ideal for situation where user wants to view content of csv file, manually.
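In C#, the same approach can be sketched with Uri.EscapeDataString and Uri.UnescapeDataString. The wrapper names (UrlEncodedCsv, EncodeField, DecodeField) are made up for this example; the assumption is that every field is encoded before the line is joined with commas, so a plain Split(',') is safe when reading.

```csharp
using System;

public static class UrlEncodedCsv
{
    // Percent-encodes a field so it contains no commas, quotes or line breaks.
    public static string EncodeField(string value)
    {
        return Uri.EscapeDataString(value);
    }

    // Reverses EncodeField when the value is read back for display.
    public static string DecodeField(string value)
    {
        return Uri.UnescapeDataString(value);
    }
}
```

A row could then be written as string.Join(",", fields.Select(UrlEncodedCsv.EncodeField)) and read back by splitting on ',' and decoding each part. As noted above, the file is no longer pleasant to read by hand.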
I usually do this in my CSV files parsing routines. Assume that 'line' variable is one line within a CSV file and all of the columns' values are enclosed in double quotes. After the below two lines execute, you will get CSV columns in the 'values' collection.
// The two lines below split the columns and trim the DOUBLE QUOTES around the values, but NOT those within them
string trimmedLine = line.Trim(new char[] { '\"' });
List<string> values = trimmedLine.Split(new string[] { "\",\"" }, StringSplitOptions.None).ToList();
The simplest solution I've found is the one LibreOffice uses:
Replace all literal " by ”
Put double quotes around your string
You can also use the one that Excel uses:
Replace all literal " by ""
Put double quotes around your string
Notice other people recommended to do only step 2 above, but that doesn't work with lines where a " is followed by a ,, like in a CSV where you want to have a single column with the string hello",world, as the CSV would read:
"hello",world"
Which is interpreted as a row with two columns: hello and world"
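The Excel-style rule (double every literal quote, then wrap the field in quotes) can be sketched in C# as below. The names CsvFieldWriter, EscapeField and ToCsvLine are hypothetical, and this sketch only quotes a field when it contains a comma, a quote, or a line break, which is an assumption about what the consumer accepts.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CsvFieldWriter
{
    private static readonly char[] NeedsQuoting = { ',', '"', '\r', '\n' };

    // Step 1: double every literal quote. Step 2: wrap the field in quotes.
    public static string EscapeField(string field)
    {
        if (field.IndexOfAny(NeedsQuoting) < 0)
            return field; // nothing special inside, write as-is
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-escaped fields into one CSV line.
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => EscapeField(f)));
    }
}
```

With this, the single value hello",world is written as "hello"",world", which a conforming reader interprets as one column again.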
public static IEnumerable<string> LineSplitter(this string line, char
separator, char skip = '"')
{
var fieldStart = 0;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == separator)
{
yield return line.Substring(fieldStart, i - fieldStart);
fieldStart = i + 1;
}
else if (i == line.Length - 1)
{
yield return line.Substring(fieldStart, i - fieldStart + 1);
fieldStart = i + 1;
}
if (line[i] == '"')
for (i++; i < line.Length && line[i] != skip; i++) { }
}
if (line[line.Length - 1] == separator)
{
yield return string.Empty;
}
}
I used the Csvreader library, but with it the data was split on the commas (,) inside column values.
So if you want to import CSV data that contains commas (,) in most of the column values, you can use the function below.
Author link => https://gist.github.com/jaywilliams/385876
function csv_to_array($filename='', $delimiter=',')
{
if(!file_exists($filename) || !is_readable($filename))
return FALSE;
$header = NULL;
$data = array();
if (($handle = fopen($filename, 'r')) !== FALSE)
{
while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
{
if(!$header)
$header = $row;
else
$data[] = array_combine($header, $row);
}
fclose($handle);
}
return $data;
}
I used the papaParse library to parse the CSV file into key-value pairs (the keys come from the header, i.e. the first row of the CSV file).
Here is the example that I use:
https://codesandbox.io/embed/llqmrp96pm
it has dummy.csv file in there to have the CSV parsing demo.
I've used it within reactJS though it is easy and simple to replicate in app written with any language.
An example might help to show how commas can be displayed in a .csv file. Create a simple text file as follows:
Save this text file as a text file with suffix ".csv" and open it with Excel 2000 from Windows 10.
aa,bb,cc,d;d
"In the spreadsheet presentation, the below line should look like the above line except the below shows a displayed comma instead of a semicolon between the d's."
aa,bb,cc,"d,d", This works even in Excel
aa,bb,cc,"d,d", This works even in Excel 2000
aa,bb,cc,"d ,d", This works even in Excel 2000
aa,bb,cc,"d , d", This works even in Excel 2000
aa,bb,cc, " d,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d ,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d , d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc,"d,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d ,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d , d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
Rule: If you want to display a comma in a cell (field) of a .csv file:
"Start and end the field with a double quote, but avoid white space before the 1st quote"
As this is about general practices, let's start with some rules of thumb:
Don't use CSV; use XML with a library to read and write the XML file instead.
If you must use CSV, do it properly and use a free library to parse and store the CSV files.
To justify 1), most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble.
For example, Excel 2002 stores the CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted :(.
On the other hand, the XML standard is well adopted and handles encodings pretty well.
To justify 2), there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel even if the solution looks pretty simple.
To name a few:
for Python, use the built-in csv module
for Perl, check CPAN and Text::CSV
for PHP, use the built-in fgetcsv/fputcsv functions
for Java, check the SuperCSV library
Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"
For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)
Depending on your situation, semi colons may also be used as CSV field separators.
Given my requirements, I can use a character, e.g., single low-9 quotation mark, that looks like a comma.
So, here's how you can do it in Go:
// Replace special CSV characters with single low-9 quotation mark
func Scrub(a interface{}) string {
s := fmt.Sprint(a)
s = strings.Replace(s, ",", "‚", -1)
s = strings.Replace(s, ";", "‚", -1)
return s
}
The second comma looking character in the Replace function is decimal 8218.
Be aware that if you have clients with ASCII-only text readers, this decimal 8218 character will not look like a comma. If this is your case, then I'd recommend surrounding the field that contains the comma (or semicolon) with double quotes per RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Thank you others in this post.
I used the information here to create a function in JavaScript that will get csv output for an array of objects which may have property values containing commas.
like
rowsArray = [{obj1prop1: "foo", obj1prop2: "bar,baz"}, {obj2prop1: "qux", obj2prop2: "quux,corge,thud"}]
into
csvRowsArray = [{obj1prop1: "foo", obj1prop2: "\"bar,baz\""}, {...} ]
To use the commas in the values in a csv, the value needs to be wrapped in double quotes. And in order to have double quotes in the value in the json object, they just need to be escaped, i.e., \", backslash double quote. The escape is made here by subbing in a template literal and including the necessary quotes `"${row[key]}"`. The quotes are escaped when put in the object.
Here is my function:
const calculateTheCSVExport = (props) => {
if (props.rows === undefined) return;
let jsonRowsArray = props.rows;
// console.log(jsonRowsArray);
let csvRowsArrayNoCommasInObjectValues = [];
let csvCurrRowObject = {}
jsonRowsArray.forEach(row => {
Object.keys(row).forEach(key => {
// console.log(key, row[key])
if (row[key].indexOf(',') > -1) {
csvCurrRowObject = {...csvCurrRowObject, [key]: `"${row[key]}"`} // enclose value in escaped double quotes in JSON in order to export commas to csv correctly. see more: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file
} else {
csvCurrRowObject = {...csvCurrRowObject, [key]: row[key]}
}
});
csvRowsArrayNoCommasInObjectValues.push(csvCurrRowObject);
csvCurrRowObject = {};
})
// console.log(csvRowsArrayNoCommasInObjectValues)
return csvRowsArrayNoCommasInObjectValues;
}
I think the easiest solution to this problem is to have the customer open the CSV in Excel and use Ctrl + H to replace all commas with whatever identifier you want. This is very easy for the customer and requires only one change in your code to read the delimiter of your choice.
Use a tab character (\t) to separate the fields.

How can I deal with parsing bad csv data?

I know that the data should be correct. I have no control over the data and my boss is just going to tell me that I need to figure out a way to deal with someone else's mistake. So please don't tell me it's not my problem that the data is bad, because it is.
Anywho, this is what I'm looking at:
"Words","email@email.com","","4253","57574","FirstName","","LastName, MD","","","576JFJD","","1971","","Words","Address","SUITE "A"","City","State","Zip","Phone","",""
Data has been scrubbed for confidentiality reasons.
So as you see, the data contains quotation marks and there are commas inside some of these quoted fields. So I cannot remove them. But the "Suite A""" is throwing off the parser. There are too many quotation marks. >.<
I'm using the TextFieldParser in the Microsoft.VisualBasic.FileIO namespace with these settings:
parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");
parser.TextFieldType = FieldType.Delimited;
The error is
MalformedLineException: Line 9871 cannot be parsed using the current
delimiters.
I would like to scrub the data somehow to account for this but I'm not sure how to do it. Or maybe there's a way to just skip this line? Although I suspect my higher ups will not approve of me just skipping data that we might need.
If you are only trying to get rid of the stray " marks in your csv, you can use the following regex to find them and replace them with '
String sourcestring = "source string to match with pattern";
String matchpattern = @"(?<!^|,)""(?!(,|$))";
String replacementpattern = @"$1'";
Console.WriteLine(Regex.Replace(sourcestring,matchpattern,replacementpattern,RegexOptions.Multiline));
Explanation:
@"(?<!^|,)""(?!(,|$))" will find any " that is not preceded by the beginning of the string or a , and that is not followed by the end of the string or a ,
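To see what the pattern does on a row like the one in the question, here is a small sketch. The class and method names (StrayQuoteScrubber, Scrub) are invented for illustration, and the replacement is written as a plain "'" rather than "$1'". One assumption worth flagging: with RegexOptions.Multiline, $ matches before "\n" but not before "\r", so input with "\r\n" line endings would probably need to be normalized first.

```csharp
using System.Text.RegularExpressions;

public static class StrayQuoteScrubber
{
    // Same pattern as above: a quote that is neither at the start of a line
    // or after a comma, nor at the end of a line or before a comma.
    private static readonly Regex StrayQuote =
        new Regex(@"(?<!^|,)""(?!(,|$))", RegexOptions.Multiline);

    // Replaces every stray double quote with a single quote.
    public static string Scrub(string csvText)
    {
        return StrayQuote.Replace(csvText, "'");
    }
}
```

Applied to "Address","SUITE "A"","City", the two quotes around A are replaced and the field becomes "SUITE 'A'", which a quote-aware parser can read. Note that properly escaped quotes ("") inside a field are also rewritten to '' by this pattern.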
I am not familiar with TextFieldParser. However with CsvHelper, you can add a custom handler for invalid data:
var config = new CsvConfiguration();
config.IgnoreReadingExceptions = true;
config.ReadingExceptionCallback += (e, row) =>
{
// you can add some custom patching here if possible
// or, save the line numbers and add/edit them manually later.
};
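using (var file = File.OpenText(".csv"))
using (var reader = new CsvReader(file, config))
{
// GetRecords is lazy, so enumerate it while the reader is still open (needs System.Linq)
var records = reader.GetRecords<YourDtoClass>().ToList();
}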
My only addition to what everyone is saying (because we've all been there) is to try to rectify each new issue you encounter with code. There are some decent regex strings out there https://www.google.com/?ion=1&espv=2#q=c-sharp+regex+csv+clean or you could manually fix things using String.Replace (String.Replace("\"\"\"","").Replace("\"\","").Replace("\",,","\",") or such). Eventually, as you detect and find ways of correcting more and more mistakes, the amount of manual recovery you need will shrink substantially (most of your bad data will likely come from similar mistakes). Cheers!
PS - Idea-ish (it's been a while, and the logic may need some tweaking as I'm writing from memory), but you'll get the gist:
public string[] parseCSVWithQuotes(string csvLine,int expectedNumberOfDataPoints)
{
string ret = "";
string thisChar = "";
string lastChar = "";
bool needleDown = true;
for(int i = 0; i < csvLine.Length; i++)
{
thisChar = csvLine.Substring(i, 1);
if (thisChar == "'"&&lastChar!="'")
needleDown = !needleDown;//when needleDown = true, characters are treated literally
if (thisChar == ","&&lastChar!=",") {
if (needleDown)
{
ret += "|";//convert literal comma to pipe so it doesn't cause another break on split
}else
{
ret += ",";//break on split is intended because the comma is outside the single quote
}
}
if (!needleDown && (thisChar == "\"" || thisChar == "*")) {//repeat for any undesired character or use RegEx
//do not add -- this eliminates any undesired characters outside single quotes
}
else
{
if ((lastChar == "'" || lastChar == "\"" || lastChar == ",") && thisChar == lastChar)
{
//do not add - this eliminates double characters
}else
{
ret += thisChar;
lastChar = thisChar;
//this character is not an undesired character, is not a double, and is valid.
}
}
}
//we've cleaned as best we can
string[] parts = ret.Split(',');
if(parts.Length==expectedNumberOfDataPoints){
for(int i = 0; i < parts.Length; i++)
{
//go back and replace the temporary pipe with the literal comma AFTER split
parts[i] = parts[i].Replace("|", ",");
}
return parts;
}else{
//save ret to bad CSV log
return null;
}
}
I've had to do this before.
The first step is to parse the data using string.Split(',')
The next step is to combine the segments that belong together.
What I essentially did was
make a new list representing the combined strings
if a string begins with a quote, push it onto your new list
if it does not begin with a quote, append it to the last string in your list
Bonus: throw exceptions when a string ends with a quote but the next one does not begin with a quote
Depending on what the rules are regarding what can actually appear in your data, you might have to change your code to account for that.
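The split-then-recombine steps above could be sketched roughly as follows. The names (QuotedFieldMerger, SplitAndMerge) are hypothetical, and the sketch assumes every field in the file is wrapped in double quotes, as in the sample line from the question. The "bonus" validation step is left out.

```csharp
using System;
using System.Collections.Generic;

public static class QuotedFieldMerger
{
    // Splits on every comma, then glues back together the pieces that
    // do not start a new quoted field.
    public static List<string> SplitAndMerge(string line)
    {
        var fields = new List<string>();
        foreach (string segment in line.Split(','))
        {
            bool startsNewField = segment.StartsWith("\"", StringComparison.Ordinal);
            if (startsNewField || fields.Count == 0)
            {
                fields.Add(segment);
            }
            else
            {
                // The comma that Split removed was part of the data, so restore it.
                fields[fields.Count - 1] += "," + segment;
            }
        }
        return fields;
    }
}
```

The surrounding quotes are kept on each field so they can be trimmed afterwards. This still misfires when a stray quote directly follows a comma inside the data, which is the kind of rule the last paragraph above warns about.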
At the core of CSV's file format, each line is a row, each cell in that row is separated by a comma. In your case, your format also contains the (very unfortunate) stipulation that commas inside a pair of quotation marks do not count as separators and are instead part of the data. I say very unfortunate because a misplaced quotation mark affects the entire rest of the line, and since quotation marks in standard ASCII do not distinguish between open and closed, there really is nothing you can do to recover from this without knowing the original intent.
That is when you log a message in a way that the person who does know the original intent (the person that provided the data) can look at the file and correct the error:
if (parse_line(line, &data)) {
// save the data
} else {
// log the error
fprintf(stderr, "Bad line: %s\n", line);
}
And since your quotation marks aren't escaping newlines, you can keep on going with the next line after running into this error.
ADDENDUM: And if your company has a choice (i.e. your data is being serialized by a company tool) don't use CSV. Use something like XML or JSON with a much more clearly defined parsing mechanism.
I had to do this once as well. My approach was to go through a line and keep track of what I was reading.
Basically, I coded my own scanner that chops tokens off the input line, which gave me full control over my faulty .csv data.
This is what I did:
For each character on a line of input:
1. when outside of a string and meeting a comma => all of the previous string (which can be empty) is a valid token.
2. when outside of a string and meeting anything but a comma or a quote => now you have a real problem, unquoted text => handle as you see fit.
3. when outside of a string and meeting a quote => found the start of a string.
4. when inside of a string and meeting a comma => accept the comma as part of the string.
5. when inside of a string and meeting a quote => trouble starts here, mark this point.
6. continue, and when meeting a comma (skipping white space if desired) close the string, 'unread' the comma and continue (that will bring you to point 1).
7. or continue, and when meeting a quote => obviously, what was read must be part of the string; add it to the string, 'unread' the quote and continue (that will bring you to point 5).
8. or continue and find whitespace, then End Of Line ('\n') => the last quote must be the closing quote; accept the string as a value.
9. or continue and find non-whitespace, then End Of Line => now you have a real problem: you have the start of a string but it is not closed => handle the error as you see fit.
If the number of fields in your .csv file is fixed, you can count the commas you recognise as field separators, and when you see an End Of Line you know whether you have another problem or not.
With the stream of strings received from the input line you can build a 'clean' .csv line and this way build a buffer of accepted and cleaned input that you can use in your already existing code.
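A simplified sketch of such a scanner might look like the following. It is not the original poster's code: the names (LenientCsvScanner, ParseLine) are invented, and it folds points 5 to 9 into one look-ahead heuristic. A quote inside a string only closes the string when the next non-whitespace character is a comma or the end of the line; otherwise it is kept as data.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LenientCsvScanner
{
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool insideString = false;
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (!insideString)
            {
                if (c == ',')
                {
                    // Point 1: everything collected so far is one token.
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else if (c == '"')
                {
                    insideString = true; // Point 3: start of a string.
                }
                else if (!char.IsWhiteSpace(c))
                {
                    // Point 2: unquoted text; this sketch simply rejects it.
                    throw new FormatException("Unquoted text at column " + i);
                }
                i++;
            }
            else if (c != '"')
            {
                current.Append(c); // Point 4: commas inside a string are data.
                i++;
            }
            else
            {
                // Points 5 to 8: look past whitespace to decide what this quote means.
                int next = i + 1;
                while (next < line.Length && char.IsWhiteSpace(line[next])) next++;
                if (next == line.Length || line[next] == ',')
                {
                    insideString = false; // closing quote
                    i = next;             // 'unread' the comma
                }
                else
                {
                    current.Append('"');  // stray quote, keep it as data
                    i++;
                }
            }
        }
        // Point 9: the line ended while a string was still open.
        if (insideString) throw new FormatException("Unterminated string");
        fields.Add(current.ToString());
        return fields;
    }
}
```

Because every quote not followed by a comma or the end of the line is treated as data, properly doubled quotes ("") inside a field come through as two quote characters and are not collapsed to one. If the field count is fixed, comparing the returned Count against the expected number gives the extra check described above.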

Writing and polishing a CSV parser

As part of a recent project I had to read and write a CSV file and display it in a grid view in C#. In the end I decided to use a ready-built parser to do the work for me.
Because I like to do that kind of stuff, I wondered how to go about writing my own.
So far all I've managed to do is this:
//Read the header
StreamReader reader = new StreamReader(dialog.FileName);
string row = reader.ReadLine();
string[] cells = row.Split(',');
//Create the columns of the dataGridView
for (int i = 0; i < cells.Count() - 1; i++)
{
DataGridViewTextBoxColumn column = new DataGridViewTextBoxColumn();
column.Name = cells[i];
column.HeaderText = cells[i];
dataGridView1.Columns.Add(column);
}
//Display the contents of the file
while (reader.Peek() != -1)
{
row = reader.ReadLine();
cells = row.Split(',');
dataGridView1.Rows.Add(cells);
}
My question: is carrying on like this a wise idea, and if it is (or isn't) how would I test it properly?
As a programming exercise (for learning and gaining experience) it is probably a very reasonable thing to do. For production code, it may be better to use an existing library mainly because the work is already done. There are quite a few things to address with a CSV parser. For example (randomly off the top of my head):
Quoted values (strings)
Embedded quotes in quoted strings
Empty values (NULL ... or maybe even NULL vs. empty).
Lines without the correct number of entries
Headers vs. no headers.
Recognizing different data types (e.g., different date formats).
If you have a very specific input format in a very controlled environment, though, you may not need to deal with all of those.
... is carrying on like this a wise idea ...?
Since you're doing this as a learning exercise, you may want to dig deeper into lexing and parsing theory. Your current approach will show its shortcomings fairly quickly as described in Stop Rolling Your Own CSV Parser!. It's not that parsing CSV data is difficult. (It's not.) It's just that most CSV parser projects treat the problem as a text splitting problem versus a parsing problem. If you take the time to define the CSV "language", the parser almost writes itself.
RFC 4180 defines a grammar for CSV data in ABNF form:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234
DQUOTE = %x22 ;as per section 6.1 of RFC 2234
LF = %x0A ;as per section 6.1 of RFC 2234
CRLF = CR LF ;as per section 6.1 of RFC 2234
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
This grammar shows how single characters are built up to create more and more complex language elements. (As written, definitions go the opposite direction from complex to simple.)
If you start with a grammar, you can write parsing functions that mirror non-terminal grammar elements (the lowercase items). Julian M Bucknall describes the process in Writing a parser for CSV data. Take a look at Test-Driven Development with ANTLR for an example of the same process using a parser generator.
Keep in mind, there is no one accepted CSV definition. CSV data in the wild is not guaranteed to implement all of the RFC 4180 suggestions.
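To illustrate how parsing functions can mirror the grammar, here is a minimal recursive-descent sketch with one method per non-terminal. It is not taken from the linked articles, and the class and member names (Rfc4180Parser, Parse, and so on) are made up. It makes some simplifying assumptions: the header is treated as an ordinary record, a lone LF is accepted as a line break in addition to CRLF, and the TEXTDATA character range is not enforced.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public sealed class Rfc4180Parser
{
    private readonly string _text;
    private int _pos;

    private Rfc4180Parser(string text) { _text = text; }

    public static List<List<string>> Parse(string text)
    {
        return new Rfc4180Parser(text).ParseFile();
    }

    // file = record *(CRLF record) [CRLF]
    private List<List<string>> ParseFile()
    {
        var records = new List<List<string>>();
        while (_pos < _text.Length)
        {
            records.Add(ParseRecord());
            if (!TryConsumeLineBreak() && _pos < _text.Length)
                throw new FormatException("Unexpected character at offset " + _pos);
        }
        return records;
    }

    // record = field *(COMMA field)
    private List<string> ParseRecord()
    {
        var fields = new List<string> { ParseField() };
        while (_pos < _text.Length && _text[_pos] == ',')
        {
            _pos++; // COMMA
            fields.Add(ParseField());
        }
        return fields;
    }

    // field = (escaped / non-escaped)
    private string ParseField()
    {
        bool quoted = _pos < _text.Length && _text[_pos] == '"';
        return quoted ? ParseEscaped() : ParseNonEscaped();
    }

    // escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
    private string ParseEscaped()
    {
        var sb = new StringBuilder();
        _pos++; // opening DQUOTE
        while (_pos < _text.Length)
        {
            char c = _text[_pos++];
            if (c != '"') { sb.Append(c); continue; }
            if (_pos < _text.Length && _text[_pos] == '"')
            {
                sb.Append('"'); // 2DQUOTE stands for one literal quote
                _pos++;
                continue;
            }
            return sb.ToString(); // closing DQUOTE
        }
        throw new FormatException("Unterminated quoted field");
    }

    // non-escaped = *TEXTDATA (stops at a comma, a line break or a quote)
    private string ParseNonEscaped()
    {
        int start = _pos;
        while (_pos < _text.Length && ",\r\n\"".IndexOf(_text[_pos]) < 0) _pos++;
        return _text.Substring(start, _pos - start);
    }

    private bool TryConsumeLineBreak()
    {
        if (_pos + 1 < _text.Length && _text[_pos] == '\r' && _text[_pos + 1] == '\n')
        {
            _pos += 2;
            return true;
        }
        if (_pos < _text.Length && _text[_pos] == '\n') { _pos++; return true; }
        return false;
    }
}
```

This sketch is strict on purpose: a stray quote in the middle of a field raises a FormatException and is not guessed at, which is usually what you want when testing a parser against the grammar.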
Get (or make) some CSV data and write Unit Tests using NUnit or Visual Studio Testing Tools.
Be sure to test edge cases like
"csv","Data","with","a","trailing","comma",
and
"csv","Data","with,","commas","and","""quotes""","in","it"
This comes from
http://www.gigawebsolution.com/Posts/Details/61/Building-a-Simple-CSV-Parser-in-C#
public interface ICsvReaderWriter
{
List<string[]> Read(string filePath, char delimiter);
void Write(string filePath, List<string[]> lines, char delimiter);
}
public class CsvReaderWriter : ICsvReaderWriter
{
public List<string[]> Read(string filePath, char delimiter)
{
var fileContent = new List<string[]>();
using (var reader = new StreamReader(filePath, Encoding.Unicode))
{
string line;
while ((line = reader.ReadLine()) != null)
{
if (!string.IsNullOrEmpty(line))
{
fileContent.Add(line.Split(delimiter));
}
}
}
return fileContent;
}
public void Write(string filePath, List<string[]> lines, char delimiter)
{
using (var writer = new StreamWriter(filePath, true, Encoding.Unicode))
{
foreach (var line in lines)
{
var data = line.Aggregate(string.Empty,
(current, column) => current +
string.Format("{0}{1}", column,delimiter))
.TrimEnd(delimiter);
writer.WriteLine(data);
}
}
}
}
Parsing a CSV file isn't difficult, but it involves more than simply calling String.Split().
You are breaking the lines at each comma. But it's possible for fields to contain embedded commas. In these cases, CSV wraps the field in double quotes. So you must also look for double quotes and ignore commas within those quotes. In addition, it's even possible for fields to contain embedded double quotes. Double quotes must appear within double quotes and be "doubled up" to indicate the quote is a literal character.
If you'd like to see how I did it, you can check out this article.
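The quoting rules described above are easiest to see from the writing side. The following is a small sketch, not the code from the linked article, and the names (CsvEscaper, Escape, JoinRecord) are hypothetical. It wraps a field in double quotes only when needed and doubles any embedded quotes.

```csharp
using System;
using System.Linq;

public static class CsvEscaper
{
    private static readonly char[] Special = { ',', '"', '\r', '\n' };

    // Quotes a field when it contains a comma, a quote or a line break,
    // doubling every embedded quote so a reader can tell it from the closing one.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(Special) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Builds one CSV line from already-separated field values.
    public static string JoinRecord(params string[] fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

For example, JoinRecord("Data", "with,", "\"quotes\"") produces Data,"with,","""quotes""", which a reader that follows the same rules can split back into the three original values.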

C# file input from text file

I have a function like this:
List<float> myList = new List<float>();
public void numbers(string filename)
{
string input;
float number;
if (System.IO.File.Exists(filename) == true)
{
System.IO.StreamReader objectReader;
objectReader = new System.IO.StreamReader(filename);
while ((input = objectReader.ReadLine()) != null)
{
number = Convert.ToSingle(input);
myList.Add(number);
}
objectReader.Close();
}
else
{
MessageBox.Show("No Such File" + filename);
}
}
I'm trying to add numbers (floats) from a text file into a List, but I keep getting errors saying the input is in the wrong format. The numbers in the text file are one number per line. Any help?
I would suggest you do a Trim call like this
number = Convert.ToSingle(input.Trim());
However, a better code would be using a TryParse call
float tmp;
if (float.TryParse(input.Trim(), out tmp))
{
myList.Add(tmp);
}
Your code worked fine for me except for the case of a newline (and of course for entries that were not numbers at all)
Here is a version that should work for you, using TryParse to check whether each line can be converted to a Single:
public void Numbers(string filename)
{
List<float> myList = new List<float>();
string input;
if (System.IO.File.Exists(filename) == true)
{
System.IO.StreamReader objectReader;
objectReader = new System.IO.StreamReader(filename);
while ((input = objectReader.ReadLine()) != null)
{
Single output;
if (Single.TryParse(input, out output ))
{
myList.Add(output);
}
else
{
// Huh? Should this happen, maybe some logging can go here to track down why you couldn't just use the .Convert()
}
}
objectReader.Close();
}
else
{
MessageBox.Show("No Such File" + filename);
}
}
As Mike C rightly points out, this could be potentially risky - swallowing good data that has been corrupted by the output process. The tryParse method returns false when it fails so you could add in an else branch and some logging to check just what is causing the failures and see if there is another bug floating around that can be corrected.
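One way to add that logging is to record the line numbers that fail to parse. The sketch below is only an illustration with hypothetical names (FloatFileReader, ReadFloats). It also assumes the file uses a period as the decimal separator and therefore parses with the invariant culture, because a culture mismatch is a common cause of format errors; drop that argument if the file matches the machine's regional settings.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class FloatFileReader
{
    // Reads one float per line, skipping blank lines and collecting the
    // 1-based numbers of lines that could not be parsed into badLines.
    public static List<float> ReadFloats(TextReader reader, List<int> badLines)
    {
        var values = new List<float>();
        string line;
        int lineNumber = 0;
        while ((line = reader.ReadLine()) != null)
        {
            lineNumber++;
            if (string.IsNullOrWhiteSpace(line)) continue;

            float value;
            // Assumption: '.' is the decimal separator regardless of the current culture.
            if (float.TryParse(line.Trim(), NumberStyles.Float,
                               CultureInfo.InvariantCulture, out value))
            {
                values.Add(value);
            }
            else
            {
                badLines.Add(lineNumber);
            }
        }
        return values;
    }
}
```

Passing a TextReader keeps the sketch testable; with a real file you would wrap it as using (var reader = File.OpenText(filename)) and inspect badLines afterwards.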
Do you have any blank lines in the file, or failures to convert the number? My guess is that you have a line which is not castable to float from its current format. You should make sure you sanitize the lines before reading them in (strip off everything that is not a number using a regex) and throw the line out if it fails the check.
One thing you might do is use double instead and do a Convert.ToDouble().
Are there spaces or commas or anything? The best thing to do would be to set a breakpoint on
number = Convert.ToSingle(input);
to see what input is actually before you try to convert it.
There's a wonderful free package called FileHelpers which helps with importing data from all sorts of text files. The advantage with this is that a lot of the deeper error handling is already in place.
By the way,
if (System.IO.File.Exists(filename) == true)
can be shortened to
if (System.IO.File.Exists(filename))

Looking for Regex to find quoted newlines in a big string (for C#)

I have a big string (let's call it a CSV file, though it isn't actually one, it'll just be easier for now) that I have to parse in C# code.
The first step of the parsing process splits the file into individual lines by just using a StreamReader object and calling ReadLine until it's through the file. However, any given line might contain a quoted (in single quotes) literal with embedded newlines. I need to find those newlines and convert them temporarily into some other kind of token or escape sequence until I've split the file into an array of lines..then I can change them back.
Example input data:
1,2,10,99,'Some text without a newline', true, false, 90
2,1,11,98,'This text has an embedded newline
and continues here', true, true, 90
I could write all of the C# code needed to do this by using string.IndexOf to find the quoted sections and look within them for newlines, but I'm thinking a Regex might be a better choice (i.e. now I have two problems)
Since this isn't a true CSV file, does it have any sort of schema?
From your example, it looks like you have:
int, int, int, int, string , bool, bool, int
With that making up your record / object.
Assuming that your data is well formed (I don't know enough about your source to know how valid this assumption is); you could:
Read your line.
Use a state machine to parse your data.
If your line ends, and you're parsing a string, read the next line..and keep parsing.
I'd avoid using a regex if possible.
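A minimal sketch of that idea, using a two-state machine (inside or outside a single-quoted literal) while reading line by line, could look like this. The names (LogicalLineReader, ReadLogicalLines) are invented, and the sketch assumes that a single quote only ever opens or closes a literal. A doubled '' inside a literal toggles the state twice and so is harmless.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class LogicalLineReader
{
    // Yields one string per record, joining physical lines whenever a line
    // ends while a single-quoted literal is still open.
    public static IEnumerable<string> ReadLogicalLines(TextReader reader)
    {
        var record = new StringBuilder();
        bool insideQuotes = false;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (char c in line)
            {
                if (c == '\'') insideQuotes = !insideQuotes;
            }
            record.Append(line);
            if (insideQuotes)
            {
                record.Append('\n'); // keep the embedded newline as data
            }
            else
            {
                yield return record.ToString();
                record.Clear();
            }
        }
        // Input ended inside an open literal: return what was collected.
        if (record.Length > 0) yield return record.ToString();
    }
}
```

With the example input from the question this yields two records, the second of which still contains its embedded newline, so no temporary token has to be substituted and restored.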
State machines for doing such a job are made easy using C# 2.0 iterators. Here's hopefully the last CSV parser I'll ever write. The whole file is treated as an enumerable bunch of enumerable strings, i.e. rows/columns. IEnumerable is great because it can then be processed by LINQ operators.
public class CsvParser
{
public char FieldDelimiter { get; set; }
public CsvParser()
: this(',')
{
}
public CsvParser(char fieldDelimiter)
{
FieldDelimiter = fieldDelimiter;
}
public IEnumerable<IEnumerable<string>> Parse(string text)
{
return Parse(new StringReader(text));
}
public IEnumerable<IEnumerable<string>> Parse(TextReader reader)
{
while (reader.Peek() != -1)
yield return parseLine(reader);
}
IEnumerable<string> parseLine(TextReader reader)
{
bool insideQuotes = false;
StringBuilder item = new StringBuilder();
while (reader.Peek() != -1)
{
char ch = (char)reader.Read();
char? nextCh = reader.Peek() > -1 ? (char)reader.Peek() : (char?)null;
if (!insideQuotes && ch == FieldDelimiter)
{
yield return item.ToString();
item.Length = 0;
}
else if (!insideQuotes && ch == '\r' && nextCh == '\n') //CRLF
{
reader.Read(); // skip LF
break;
}
else if (!insideQuotes && ch == '\n') //LF for *nix-style line endings
break;
else if (ch == '"' && nextCh == '"') // escaped quotes ""
{
item.Append('"');
reader.Read(); // skip next "
}
else if (ch == '"')
insideQuotes = !insideQuotes;
else
item.Append(ch);
}
// last one
yield return item.ToString();
}
}
Note that the file is read character by character with the code deciding when newlines are to be treated as row delimiters or part of a quoted string.
What if you got the whole file into a variable then split that based on non-quoted newlines?
EDIT: Sorry, I've misinterpreted your post. If you're looking for a regex, then here is one:
content = Regex.Replace(content, "'([^']*)\n([^']*)'", "'$1TOKEN$2'");
There might be edge cases (and that "now I have two problems" issue), but I think it should be OK most of the time. The regex finds any pair of single quotes that has a \n between them and replaces that \n with TOKEN, preserving the text on either side.
But still, I'd go with a state machine like @bryansh explained.
