Parquet.Net is generating huge Parquet files in comparison with pyarrow - C#

My application takes data from Azure Event Hubs (which has a maximum message size of 1 MB), transforms it into a DataTable, and then saves it as a Parquet file.
The Parquet file generated by Parquet.Net is huge: it is always over 50 MB, even with the best compression method. When I read this 50 MB file using pandas and then re-write it into another file, it becomes less than 500 KB.
See below the comparison between Parquet.Net (RED) and pyarrow (BLUE):
As we can see, the number of columns and rows is the same.
I checked the content and it all seems okay.
Note: there is one varchar(8000) column that holds a lot of data.
Here is how I got the Parquet metadata:
import pandas as pd
import pyarrow.parquet as pq
# pd.set_option("max_colwidth", None)
# pd.set_option("max_seq_item", None)
# pd.set_option("min_rows", 2)
# pd.set_option("max_rows", None)
# pd.set_option('max_columns', None)
parquet_file_net = pq.ParquetFile("parquetnetFile.parquet")
print(parquet_file_net.metadata)
print()
parquet_file_py = pq.ParquetFile("pyarrowFile.parquet")
print(parquet_file_py.metadata)
print()
print()
print(parquet_file_net.metadata.row_group(0))
print()
print(parquet_file_py.metadata.row_group(0))
My C# code is based on the following project, but I made some changes:
https://github.com/dazfuller/datatable-to-parquet
So here is my C# code.
public static async Task<MemoryStream> ToParquetStream(DataTable dt)
{
    var fields = GenerateSchema(dt);
    var parquetStream = new MemoryStream();
    using (var writer = new ParquetWriter(new Schema(fields), parquetStream))
    {
        writer.CompressionMethod = CompressionMethod.Gzip;
        writer.CompressionLevel = 2;
        var range = Enumerable.Range(0, dt.Columns.Count);
        var result = await range.ForEachAsyncInParallel(async c =>
        {
            return await Task.Run(() =>
            {
                // Determine the target data type for the column
                var targetType = dt.Columns[c].DataType;
                if (targetType == typeof(DateTime))
                {
                    targetType = typeof(DateTimeOffset);
                }
                // Generate the value type; this ensures it can handle null values
                var valueType = targetType.IsClass ? targetType : typeof(Nullable<>).MakeGenericType(targetType);
                // Create an array to hold values of the required type for the column
                var valuesArray = Array.CreateInstance(valueType, dt.Rows.Count);
                // Get the data to be written to the parquet stream
                for (int r = 0; r < dt.Rows.Count; r++)
                {
                    DataRow row = dt.Rows[r];
                    // Check if the value is null; if so then add a null value
                    if (row[c] == null || row[c] == DBNull.Value)
                    {
                        valuesArray.SetValue(null, r);
                    }
                    else
                    {
                        // Add the value, but if it's a DateTime then create it as a DateTimeOffset first
                        if (dt.Columns[c].DataType == typeof(DateTime))
                        {
                            valuesArray.SetValue(new DateTimeOffset((DateTime)row[c]), r);
                        }
                        else
                        {
                            valuesArray.SetValue(row[c], r);
                        }
                    }
                }
                return valuesArray;
            });
        });
        using (var rgw = writer.CreateRowGroup())
        {
            for (int c = 0; c < dt.Columns.Count; c++)
            {
                rgw.WriteColumn(new Parquet.Data.DataColumn(fields[c], result[c]));
            }
        }
    }
    return parquetStream;
}
private static List<DataField> GenerateSchema(DataTable dt)
{
    var fields = new List<DataField>(dt.Columns.Count);
    foreach (DataColumn column in dt.Columns)
    {
        // Attempt to parse the type of the column to a parquet data type
        var success = Enum.TryParse<DataType>(column.DataType.Name, true, out var type);
        // If the parse was not successful and its source is a DateTime, then use a DateTimeOffset
        if (!success && column.DataType == typeof(DateTime))
        {
            type = DataType.DateTimeOffset;
        }
        // In C#, float is System.Single; that is why the parse fails
        else if (!success && column.DataType == typeof(float))
        {
            type = DataType.Float;
        }
        // Otherwise default to a string
        else if (!success)
        {
            type = DataType.String;
        }
        fields.Add(new DataField(column.ColumnName, type));
    }
    return fields;
}
public static async Task<R[]> ForEachAsyncInParallel<T, R>(this IEnumerable<T> list, Func<T, Task<R>> func)
{
    var tasks = new List<Task<R>>();
    foreach (var value in list)
    {
        tasks.Add(func(value));
    }
    return await Task.WhenAll<R>(tasks);
}
So why is the file size so large?
Here are the files generated by parquet.net and pyarrow: https://easyupload.io/m/28jo48

Related

Parsing CSV File with double quotes [duplicate]

Is there a default/official/recommended way to parse CSV files in C#? I don't want to roll my own parser.
Also, I've seen instances of people using ODBC/OLE DB to read CSV via the Text driver, and a lot of people discourage this due to its "drawbacks." What are these drawbacks?
Ideally, I'm looking for a way through which I can read the CSV by column name, using the first record as the header / field names. Some of the answers given are correct, but effectively deserialize the file into classes.
A CSV parser is now part of the .NET Framework.
Add a reference to Microsoft.VisualBasic.dll (works fine in C#, don't mind the name)
using (TextFieldParser parser = new TextFieldParser(@"c:\temp\test.csv"))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    while (!parser.EndOfData)
    {
        //Process row
        string[] fields = parser.ReadFields();
        foreach (string field in fields)
        {
            //TODO: Process field
        }
    }
}
The docs are here - TextFieldParser Class
P.S. If you need a CSV exporter, try CsvExport (discl: I'm one of the contributors)
CsvHelper (a library I maintain) will read a CSV file into custom objects.
using (var reader = new StreamReader("path\\to\\file.csv"))
using (var csv = new CsvReader(reader, CultureInfo.InvariantCulture))
{
    var records = csv.GetRecords<Foo>();
}
Sometimes you don't own the objects you're trying to read into. In this case, you can use fluent mapping because you can't put attributes on the class.
public sealed class MyCustomObjectMap : CsvClassMap<MyCustomObject>
{
    public MyCustomObjectMap()
    {
        Map(m => m.Property1).Name("Column Name");
        Map(m => m.Property2).Index(4);
        Map(m => m.Property3).Ignore();
        Map(m => m.Property4).TypeConverter<MySpecialTypeConverter>();
    }
}
Let a library handle all the nitty-gritty details for you! :-)
Check out FileHelpers and stay DRY - Don't Repeat Yourself - there's no need to re-invent the wheel for the gazillionth time....
You basically just define the shape of your data - the fields in an individual line of the CSV - by means of a public class (with well-thought-out attributes like default values, replacements for NULL values, and so forth), point the FileHelpers engine at a file, and bingo - you get back all the entries from that file. One simple operation - great performance! A short sketch follows below.
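A minimal sketch of that pattern (record layout, field names, and file name are assumed for illustration; see the FileHelpers docs for the full attribute set):

[DelimitedRecord(",")]
public class Order
{
    // FileHelpers maps the delimited fields to these members in declaration order
    public int Id;
    public string CustomerName;
    public decimal Amount;
}

var engine = new FileHelperEngine<Order>();
Order[] orders = engine.ReadFile("orders.csv"); // one Order per CSV line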
In a business application, I use the open-source CSVReader project on codeproject.com.
It works well and has good performance; there is some benchmarking at the link I provided.
A simple example, copied from the project page:
using (CsvReader csv = new CsvReader(new StreamReader("data.csv"), true))
{
    int fieldCount = csv.FieldCount;
    string[] headers = csv.GetFieldHeaders();
    while (csv.ReadNextRecord())
    {
        for (int i = 0; i < fieldCount; i++)
            Console.Write(string.Format("{0} = {1};", headers[i], csv[i]));
        Console.WriteLine();
    }
}
As you can see, it's very easy to work with.
I know it's a bit late, but I just found the Microsoft.VisualBasic.FileIO library, which has a TextFieldParser class to process CSV files.
Here is a helper class I use often, in case anyone ever comes back to this thread (I wanted to share it).
I use it for the simplicity of porting it into projects, ready to use:
public class CSVHelper : List<string[]>
{
    protected string csv = string.Empty;
    protected string separator = ",";

    public CSVHelper(string csv, string separator = "\",\"")
    {
        this.csv = csv;
        this.separator = separator;
        foreach (string line in Regex.Split(csv, System.Environment.NewLine).ToList().Where(s => !string.IsNullOrEmpty(s)))
        {
            string[] values = Regex.Split(line, separator);
            for (int i = 0; i < values.Length; i++)
            {
                //Trim values
                values[i] = values[i].Trim('\"');
            }
            this.Add(values);
        }
    }
}
And use it like:
public List<Person> GetPeople(string csvContent)
{
    List<Person> people = new List<Person>();
    CSVHelper csv = new CSVHelper(csvContent);
    foreach (string[] line in csv)
    {
        Person person = new Person();
        person.Name = line[0];
        person.TelephoneNo = line[1];
        people.Add(person);
    }
    return people;
}
[Updated csv helper: bug fixed where the last new line character created a new line]
If you only need to read CSV files, then I recommend this library: A Fast CSV Reader
If you also need to generate CSV files, then use this one: FileHelpers
Both of them are free and open source.
This solution uses the official Microsoft.VisualBasic assembly to parse CSV.
Advantages:
delimiter escaping
skips the header
trims spaces
ignores comments
Code:
using Microsoft.VisualBasic.FileIO;
public static List<List<string>> ParseCSV(string csv)
{
    List<List<string>> result = new List<List<string>>();
    // To use the TextFieldParser a reference to the Microsoft.VisualBasic assembly has to be added to the project.
    using (TextFieldParser parser = new TextFieldParser(new StringReader(csv)))
    {
        parser.CommentTokens = new string[] { "#" };
        parser.SetDelimiters(new string[] { ";" });
        parser.HasFieldsEnclosedInQuotes = true;
        // Skip over header line.
        //parser.ReadLine();
        while (!parser.EndOfData)
        {
            var values = new List<string>();
            var readFields = parser.ReadFields();
            if (readFields != null)
                values.AddRange(readFields);
            result.Add(values);
        }
    }
    return result;
}
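A quick usage sketch (input string assumed, just to show the semicolon delimiter and the comment handling):

var rows = ParseCSV("name;age\n# this comment line is skipped\nAlice;30");
// rows[0] -> ["name", "age"], rows[1] -> ["Alice", "30"]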
I have written TinyCsvParser for .NET, which is one of the fastest .NET parsers around and highly configurable to parse almost any CSV format.
It is released under MIT License:
https://github.com/bytefish/TinyCsvParser
You can use NuGet to install it. Run the following command in the Package Manager Console.
PM> Install-Package TinyCsvParser
Usage
Imagine we have a list of persons in a CSV file persons.csv, with their first name, last name, and birth date.
FirstName;LastName;BirthDate
Philipp;Wagner;1986/05/12
Max;Musterman;2014/01/02
The corresponding domain model in our system might look like this.
private class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public DateTime BirthDate { get; set; }
}
When using TinyCsvParser, you have to define the mapping between the columns in the CSV data and the properties in your domain model.
private class CsvPersonMapping : CsvMapping<Person>
{
    public CsvPersonMapping()
        : base()
    {
        MapProperty(0, x => x.FirstName);
        MapProperty(1, x => x.LastName);
        MapProperty(2, x => x.BirthDate);
    }
}
And then we can use the mapping to parse the CSV data with a CsvParser.
namespace TinyCsvParser.Test
{
    [TestFixture]
    public class TinyCsvParserTest
    {
        [Test]
        public void TinyCsvTest()
        {
            CsvParserOptions csvParserOptions = new CsvParserOptions(true, new[] { ';' });
            CsvPersonMapping csvMapper = new CsvPersonMapping();
            CsvParser<Person> csvParser = new CsvParser<Person>(csvParserOptions, csvMapper);

            var result = csvParser
                .ReadFromFile(@"persons.csv", Encoding.ASCII)
                .ToList();

            Assert.AreEqual(2, result.Count);
            Assert.IsTrue(result.All(x => x.IsValid));

            Assert.AreEqual("Philipp", result[0].Result.FirstName);
            Assert.AreEqual("Wagner", result[0].Result.LastName);
            Assert.AreEqual(1986, result[0].Result.BirthDate.Year);
            Assert.AreEqual(5, result[0].Result.BirthDate.Month);
            Assert.AreEqual(12, result[0].Result.BirthDate.Day);

            Assert.AreEqual("Max", result[1].Result.FirstName);
            Assert.AreEqual("Mustermann", result[1].Result.LastName);
            Assert.AreEqual(2014, result[1].Result.BirthDate.Year);
            Assert.AreEqual(1, result[1].Result.BirthDate.Month);
            Assert.AreEqual(1, result[1].Result.BirthDate.Day);
        }
    }
}
User Guide
A full User Guide is available at:
http://bytefish.github.io/TinyCsvParser/
Here is a short and simple solution.
using (TextFieldParser parser = new TextFieldParser(outputLocation))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    string[] headers = parser.ReadLine().Split(',');
    foreach (string header in headers)
    {
        dataTable.Columns.Add(header);
    }
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields();
        dataTable.Rows.Add(fields);
    }
}
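One caveat (my note, not part of the original answer): ReadLine().Split(',') will mis-split quoted header names that contain commas; letting the parser read the header row avoids that:

string[] headers = parser.ReadFields(); // honors quoting in header names too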
Here is my KISS implementation...
using System;
using System.Collections.Generic;
using System.Text;

class CsvParser
{
    public static List<string> Parse(string line)
    {
        const char escapeChar = '"';
        const char splitChar = ',';
        bool inEscape = false;
        bool priorEscape = false;
        List<string> result = new List<string>();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            switch (c)
            {
                case escapeChar:
                    if (!inEscape)
                        inEscape = true;
                    else
                    {
                        if (!priorEscape)
                        {
                            if (i + 1 < line.Length && line[i + 1] == escapeChar)
                                priorEscape = true;
                            else
                                inEscape = false;
                        }
                        else
                        {
                            sb.Append(c);
                            priorEscape = false;
                        }
                    }
                    break;
                case splitChar:
                    if (inEscape) //if in escape
                        sb.Append(c);
                    else
                    {
                        result.Add(sb.ToString());
                        sb.Length = 0;
                    }
                    break;
                default:
                    sb.Append(c);
                    break;
            }
        }
        if (sb.Length > 0)
            result.Add(sb.ToString());
        return result;
    }
}
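A usage sketch (input assumed):

var fields = CsvParser.Parse("a,\"b,c\",d");
// fields -> ["a", "b,c", "d"]; the comma inside quotes stays in the field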
Some time ago I wrote a simple class for CSV read/write based on the Microsoft.VisualBasic library. Using this simple class, you can work with CSV like a two-dimensional array. You can find my class at the following link: https://github.com/ukushu/DataExporter
A simple example of usage:
Csv csv = new Csv("\t"); // delimiter symbol
csv.FileOpen("c:\\file1.csv");
var row1Cell6Value = csv.Rows[0][5];
csv.AddRow("asdf", "asdffffff", "5");
csv.FileSave("c:\\file2.csv");
For reading the header only, all you need is to read the csv.Rows[0] cells :)
This code reads a CSV into a DataTable:
public static DataTable ReadCsv(string path)
{
    DataTable result = new DataTable("SomeData");
    using (TextFieldParser parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        bool isFirstRow = true;
        //IList<string> headers = new List<string>();
        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            if (isFirstRow)
            {
                foreach (string field in fields)
                {
                    result.Columns.Add(new DataColumn(field, typeof(string)));
                }
                isFirstRow = false;
            }
            else
            {
                int i = 0;
                DataRow row = result.NewRow();
                foreach (string field in fields)
                {
                    row[i++] = field;
                }
                result.Rows.Add(row);
            }
        }
    }
    return result;
}
A useful single-source-file solution for straightforward parsing needs. It deals with all the nasty edge cases, such as newline normalization and newlines inside quoted string literals. You're welcome!
If your CSV file has a header, you just read the column names (and compute column indexes) from the first row - simple as that (see the header-lookup sketch after the parser below).
Note that Dump is a LINQPad method; you may want to remove it if you are not using LINQPad.
void Main()
{
    var file1 = "a,b,c\r\nx,y,z";
    CSV.ParseText(file1).Dump();

    var file2 = "a,\"b\",c\r\nx,\"y,z\"";
    CSV.ParseText(file2).Dump();

    var file3 = "a,\"b\",c\r\nx,\"y\r\nz\"";
    CSV.ParseText(file3).Dump();

    var file4 = "\"\"\"\"";
    CSV.ParseText(file4).Dump();
}
static class CSV
{
    public struct Record
    {
        public readonly string[] Row;

        public string this[int index] => Row[index];

        public Record(string[] row)
        {
            Row = row;
        }
    }

    public static List<Record> ParseText(string text)
    {
        return Parse(new StringReader(text));
    }

    public static List<Record> ParseFile(string fn)
    {
        using (var reader = File.OpenText(fn))
        {
            return Parse(reader);
        }
    }

    public static List<Record> Parse(TextReader reader)
    {
        var data = new List<Record>();
        var col = new StringBuilder();
        var row = new List<string>();
        for (; ; )
        {
            var ln = reader.ReadLine();
            if (ln == null) break;
            if (Tokenize(ln, col, row))
            {
                data.Add(new Record(row.ToArray()));
                row.Clear();
            }
        }
        return data;
    }

    public static bool Tokenize(string s, StringBuilder col, List<string> row)
    {
        int i = 0;
        if (col.Length > 0)
        {
            col.AppendLine(); // continuation
            if (!TokenizeQuote(s, ref i, col, row))
            {
                return false;
            }
        }
        while (i < s.Length)
        {
            var ch = s[i];
            if (ch == ',')
            {
                row.Add(col.ToString().Trim());
                col.Length = 0;
                i++;
            }
            else if (ch == '"')
            {
                i++;
                if (!TokenizeQuote(s, ref i, col, row))
                {
                    return false;
                }
            }
            else
            {
                col.Append(ch);
                i++;
            }
        }
        if (col.Length > 0)
        {
            row.Add(col.ToString().Trim());
            col.Length = 0;
        }
        return true;
    }

    public static bool TokenizeQuote(string s, ref int i, StringBuilder col, List<string> row)
    {
        while (i < s.Length)
        {
            var ch = s[i];
            if (ch == '"')
            {
                // escape sequence
                if (i + 1 < s.Length && s[i + 1] == '"')
                {
                    col.Append('"');
                    i++;
                    i++;
                    continue;
                }
                i++;
                return true;
            }
            else
            {
                col.Append(ch);
                i++;
            }
        }
        return false;
    }
}
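For the header lookup mentioned above, a minimal sketch (file name and column name assumed):

var records = CSV.ParseFile("data.csv");
string[] header = records[0].Row;

// Compute column indexes from the header row once...
var index = new Dictionary<string, int>();
for (int i = 0; i < header.Length; i++)
    index[header[i]] = i;

// ...then read data rows by column name
string value = records[1][index["ColumnName"]];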
Another one for this list: Cinchoo ETL - an open-source library to read and write multiple file formats (CSV, flat file, XML, JSON, etc.).
The sample below shows how to read a CSV file quickly (no POCO object required):
string csv = @"Id, Name
1, Carl
2, Tom
3, Mark";

using (var p = ChoCSVReader.LoadText(csv)
    .WithFirstLineHeader()
    )
{
    foreach (var rec in p)
    {
        Console.WriteLine($"Id: {rec.Id}");
        Console.WriteLine($"Name: {rec.Name}");
    }
}
The sample below shows how to read a CSV file using a POCO object:
public partial class EmployeeRec
{
    public int Id { get; set; }
    public string Name { get; set; }
}

static void CSVTest()
{
    string csv = @"Id, Name
1, Carl
2, Tom
3, Mark";

    using (var p = ChoCSVReader<EmployeeRec>.LoadText(csv)
        .WithFirstLineHeader()
        )
    {
        foreach (var rec in p)
        {
            Console.WriteLine($"Id: {rec.Id}");
            Console.WriteLine($"Name: {rec.Name}");
        }
    }
}
Please check out the articles at CodeProject on how to use it.
This parser supports nested commas and quotes in a column:
static class CSVParser
{
    public static string[] ParseLine(string line)
    {
        List<string> cols = new List<string>();
        string value = null;
        for (int i = 0; i < line.Length; i++)
        {
            switch (line[i])
            {
                case ',':
                    cols.Add(value);
                    value = null;
                    if (i == line.Length - 1)
                    { // It ends with comma
                        cols.Add(null);
                    }
                    break;
                case '"':
                    cols.Add(ParseEnclosedColumn(line, ref i));
                    i++;
                    break;
                default:
                    value += line[i];
                    if (i == line.Length - 1)
                    { // Last character
                        cols.Add(value);
                    }
                    break;
            }
        }
        return cols.ToArray();
    } //ParseLine

    static string ParseEnclosedColumn(string line, ref int index)
    { // Example: "b"",bb"
        string value = null;
        int numberQuotes = 1;
        int index2 = index;
        for (int i = index + 1; i < line.Length; i++)
        {
            index2 = i;
            switch (line[i])
            {
                case '"':
                    numberQuotes++;
                    if (numberQuotes % 2 == 0)
                    {
                        if (i < line.Length - 1 && line[i + 1] == ',')
                        {
                            index = i;
                            return value;
                        }
                    }
                    else if (i > index + 1 && line[i - 1] == '"')
                    {
                        value += '"';
                    }
                    break;
                default:
                    value += line[i];
                    break;
            }
        }
        index = index2;
        return value;
    } //ParseEnclosedColumn
} //class CSVParser
Based on unlimit's post on "How to properly split a CSV using the C# split() function?":
string[] tokens = System.Text.RegularExpressions.Regex.Split(paramString, ",");
NOTE: this doesn't handle escaped/nested commas, etc., and is therefore only suitable for simple CSV lists (see the illustration below).
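An illustration of where the plain split breaks (my example, not from the original post):

string[] ok = System.Text.RegularExpressions.Regex.Split("a,b,c", ",");
// ok -> ["a", "b", "c"]
string[] bad = System.Text.RegularExpressions.Regex.Split("a,\"b,c\",d", ",");
// bad -> ["a", "\"b", "c\"", "d"] - the quoted field is torn apart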
If anyone wants a snippet they can drop into their code without having to reference a library or download a package, here is a version I wrote:
public static string FormatCSV(List<string> parts)
{
    string result = "";
    for (int i = 0; i < parts.Count; i++)
    {
        // Separate fields with a comma (none before the first field)
        if (i > 0)
        {
            result += ",";
        }
        // Quote non-empty values and double any embedded quotes;
        // empty values stay unquoted, since "" would read as an escaped quote
        if (parts[i].Length > 0)
        {
            result += "\"" + parts[i].Replace("\"", "\"\"") + "\"";
        }
    }
    return result;
}
enum CSVMode
{
    CLOSED = 0,
    OPENED_RAW = 1,
    OPENED_QUOTE = 2
}
public static List<string> ParseCSV(string input)
{
    List<string> results;
    CSVMode mode;
    char[] letters;
    string content;

    mode = CSVMode.CLOSED;
    content = "";
    results = new List<string>();
    letters = input.ToCharArray();

    for (int i = 0; i < letters.Length; i++)
    {
        char letter = letters[i];
        char nextLetter = '\0';
        if (i < letters.Length - 1)
            nextLetter = letters[i + 1];

        // If it's a quote character
        if (letter == '"')
        {
            // If the next letter is a quote
            if (nextLetter == '"' && mode == CSVMode.OPENED_QUOTE)
            {
                // Then this quote is escaped and should be added to the content
                content += letter;
                // Skip the escape character
                i++;
                continue;
            }
            else
            {
                // Otherwise it's not an escaped quote and is an opening or closing one
                // Character is skipped
                // If it was open, then close it
                if (mode == CSVMode.OPENED_QUOTE)
                {
                    results.Add(content);
                    // Reset the content
                    content = "";
                    mode = CSVMode.CLOSED;
                    // If there is a next letter available
                    if (nextLetter != '\0')
                    {
                        // If it is a comma
                        if (nextLetter == ',')
                        {
                            i++;
                            continue;
                        }
                        else
                        {
                            throw new Exception("Expected comma. Found: " + nextLetter);
                        }
                    }
                }
                else if (mode == CSVMode.OPENED_RAW)
                {
                    // If it was opened raw, then just add the quote
                    content += letter;
                }
                else if (mode == CSVMode.CLOSED)
                {
                    // Otherwise open it as a quote
                    mode = CSVMode.OPENED_QUOTE;
                }
            }
        }
        // If it's a comma separator
        else if (letter == ',')
        {
            // If in quote mode
            if (mode == CSVMode.OPENED_QUOTE)
            {
                // Just read it
                content += letter;
            }
            // If raw, then close the content
            else if (mode == CSVMode.OPENED_RAW)
            {
                results.Add(content);
                content = "";
                mode = CSVMode.CLOSED;
            }
            // If it was closed, then open it raw
            else if (mode == CSVMode.CLOSED)
            {
                mode = CSVMode.OPENED_RAW;
                results.Add(content);
                content = "";
            }
        }
        else
        {
            // If opened quote, just read it
            if (mode == CSVMode.OPENED_QUOTE)
            {
                content += letter;
            }
            // If opened raw, then read it
            else if (mode == CSVMode.OPENED_RAW)
            {
                content += letter;
            }
            // If closed, then open raw
            else if (mode == CSVMode.CLOSED)
            {
                mode = CSVMode.OPENED_RAW;
                content += letter;
            }
        }
    }
    // If it was still reading when the buffer finished
    if (mode != CSVMode.CLOSED)
    {
        results.Add(content);
    }
    return results;
}
For smaller CSV inputs, LINQ is entirely sufficient.
For example, for the following CSV file content:
schema_name,description,utype
"IX_HE","High-Energy data","x"
"III_spectro","Spectrosopic data","d"
"VI_misc","Miscellaneous","f"
"vcds1","Catalogs only available in CDS","d"
"J_other","Publications from other journals","b"
When we read the whole content into a single string called data, then:
using System;
using System.IO;
using System.Linq;

var data = File.ReadAllText(Path2CSV);

// helper split characters
var newline = Environment.NewLine.ToCharArray();
var comma = ",".ToCharArray();
var quote = "\"".ToCharArray();

// split the input string data into lines
var lines = data.Split(newline);

// the first line is the header; take the header fields
foreach (var col in lines.First().Split(comma)) {
    // do something with "col"
}

// we skip the first line; all the rest are real data lines/fields
foreach (var line in lines.Skip(1)) {
    // first we split the data line by the comma character,
    // next we remove double quotes from each split element using Trim(),
    // finally we make an array
    var fields = line.Split(comma)
                     .Select(_ => { _ = _.Trim(quote); return _; })
                     .ToArray();
    // do something with the "fields" array
}

Using lambda expression to check filepath is valid c#

I am trying to check whether a file path is valid using the following code:
foreach (int i in UniqueRandom(0, 4))
{
    var wbImage = getCharBitmap(c, rndFolder, i);
}
The UniqueRandom method generates non-repeating random numbers between 0 and 4. Each number i represents a file name, which may or may not exist. If the file exists, the getCharBitmap method returns a WritableBitmap object; otherwise, it returns null.
I want to use a lambda expression to check whether the method returns null; if it is not null, I want to remember the i value and exit the foreach loop right away.
How can I do this efficiently, with the least amount of code?
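(For context: UniqueRandom is not shown in the question. A minimal sketch of such a generator, purely hypothetical, could be:)

// Yields each integer in [min, max] exactly once, in random order
static IEnumerable<int> UniqueRandom(int min, int max)
{
    var candidates = Enumerable.Range(min, max - min + 1).ToList();
    var rng = new Random();
    while (candidates.Count > 0)
    {
        int pick = rng.Next(candidates.Count);
        yield return candidates[pick];
        candidates.RemoveAt(pick);
    }
}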
Try
var firstExisting = UniqueRandom(0, 4)
    .Select(i => new
    {
        Bitmap = GetCharBitmap(c, rndFolder, i),
        Number = i
    })
    .FirstOrDefault(x => x.Bitmap != null);

if (firstExisting != null)
{
    int j = firstExisting.Number;
}
Or the same without LINQ:
private static int FirstExisting()
{
    foreach (int i in UniqueRandom(0, 4))
    {
        var wbImage = GetCharBitmap(c, rndFolder, i);
        if (wbImage != null)
        {
            return i;
        }
    }
    throw new Exception("No existing found"); // or return, say, -1
}

MVCCrud Using LinqToEntities

There is a sample application called MVCCrud. This example is quite good, and I would like to use it as the framework for a project I am working on.
The problem is that MVCCrud uses LinqToSQL, and I would like to use LinqToEntities. I got almost everything to work correctly once I converted over to LinqToEntities, except in one place.
In the following code, on the lines i = typeof(TModel).GetProperty(primaryKey).GetValue(p, null) and cell = getCells(p), it gives a "LINQ to Entities does not recognize GetValue" error.
Can someone help me refactor the following code?
items = items.OrderBy(string.Format("{0} {1}", sidx, sord)).Skip(pageIndex * pageSize).Take(pageSize).AsQueryable();

// Generate JSON
var jsonData =
    new
    {
        total = totalPages,
        page,
        records = totalRecords,
        rows = items.Select(
            p => new
            {
                // id column from repository
                i = typeof(TModel).GetProperty(primaryKey).GetValue(p, null),
                cell = getCells(p)
            }).ToArray()
    };
return Json(jsonData);
And here is the getCells method:
private string[] getCells(TModel p)
{
    List<string> result = new List<string>();
    string a = actionCell(p);
    if (a != null)
    {
        result.Add(a);
    }
    foreach (string column in data_rows.Select(r => r.value))
    {
        try
        {
            // hack for tblcategory.name
            string[] parts = column.Split('.');
            // Set first part
            PropertyInfo c = typeof(TModel).GetProperty(parts[0]);
            object tmp = c.GetValue(p, null);
            // Loop through if there is more than one level of depth, e.g. tblCategory.name
            for (int j = 1; j < parts.Length; j++)
            {
                c = tmp.GetType().GetProperty(parts[j]);
                tmp = c.GetValue(tmp, null);
            }
            if (tmp.GetType() == typeof(DateTime))
            {
                result.Add(((DateTime)tmp).ToString(dateTimeFormat));
            }
            else if (tmp.GetType() == typeof(float))
            {
                result.Add(((float)tmp).ToString(decimalFormat));
            }
            else if (tmp.GetType() == typeof(double))
            {
                result.Add(((double)tmp).ToString(decimalFormat));
            }
            else if (tmp.GetType() == typeof(decimal))
            {
                result.Add(((decimal)tmp).ToString(decimalFormat));
            }
            else
            {
                result.Add(tmp.ToString());
            }
        }
        catch (Exception)
        {
            result.Add(string.Empty);
        }
    }
    return result.ToArray();
}
Use ToList() instead of AsQueryable():
items = items.OrderBy(string.Format("{0} {1}", sidx, sord)).Skip(pageIndex * pageSize).Take(pageSize).ToList();
You can't execute an arbitrary external method "within" a LINQ to Entities query.
And if you say this worked in LINQ to SQL, you should know that when you call an external method (like ToString()), LINQ to SQL fetches all the data from the database and then evaluates your query in memory, which can seriously hurt performance if you have a lot of records.
For more information look at this
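In other words, materialize the page of rows first, then run the reflection in memory. A minimal sketch of that pattern, reusing the names from the question (simplified, not the full MVCCrud code):

// The SQL query runs here and returns plain objects
var pageOfRows = items.Skip(pageIndex * pageSize).Take(pageSize).ToList();

// Reflection is now evaluated by LINQ to Objects, not translated to SQL
var prop = typeof(TModel).GetProperty(primaryKey);
var rows = pageOfRows
    .Select(p => new { i = prop.GetValue(p, null), cell = getCells(p) })
    .ToArray();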
