How to format and read CSV file? - c#

Here is just an example of the data I need to format.
The first column is simple; the problem is the second column.
What would be the best approach to format multiple data fields in one column?
How to parse this data?
Important: the second column needs to contain multiple values, as in the example below:
Name  Details
Alex  Age:25
      Height:6
      Hair:Brown
      Eyes:Hazel

A csv should probably look like this:
Name,Age,Height,Hair,Eyes
Alex,25,6,Brown,Hazel
Each cell should be separated by exactly one comma from its neighbor.
You can reformat it as such by using a simple regex which replaces certain newline and non-newline whitespace with commas (you can easily find each block because it has values in both columns).

A CSV file is normally defined using commas as field separators and a line break (CR/LF) as the row separator. You are using line breaks within your second column, which will cause problems. You'll need to reformat your second column to use some other separator between multiple values. A common alternative separator is the | (pipe) character.
Your format would then look like:
Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel
In your parsing, you would first parse the comma separated fields (which would return two values), and then parse the second field as pipe separated.
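The two-stage parse described above can be sketched as follows. This is a minimal illustration, not part of the original answer: the class and method names (`PipeDetailsParser`, `ParseLine`) are hypothetical, and it assumes that names contain no commas and that values contain no pipes.

```csharp
using System;
using System.Collections.Generic;

public static class PipeDetailsParser
{
    // Parses a line such as "Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel".
    // Returns the name together with the key/value pairs from the second field.
    public static KeyValuePair<string, Dictionary<string, string>> ParseLine(string line)
    {
        // Stage 1: split on the first comma only, giving name and details.
        string[] fields = line.Split(new[] { ',' }, 2);
        if (fields.Length != 2)
            throw new FormatException("Expected two comma-separated fields: " + line);

        // Stage 2: split the details on pipes, then each item on its first colon.
        var details = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string pair in fields[1].Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] keyValue = pair.Split(new[] { ':' }, 2);
            if (keyValue.Length == 2)
                details[keyValue[0].Trim()] = keyValue[1].Trim();
        }

        return new KeyValuePair<string, Dictionary<string, string>>(fields[0].Trim(), details);
    }
}
```

Calling `PipeDetailsParser.ParseLine("Alex,Age:25|Height:6|Hair:Brown|Eyes:Hazel")` yields the key `Alex` and a dictionary with four entries (`Age`, `Height`, `Hair`, `Eyes`).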

This is an interesting one. It can be quite difficult to parse files in a custom format, which is why people often write dedicated classes to deal with them. More conventional file formats like CSV, or other delimited formats, are easier to read because they all follow a similar structure.
A problem like the above can be addressed in the following way:
1) What should the output look like?
In your instance, and this is just a guess, but I believe you are aiming for the following:
Name, Age, Height, Hair, Eyes
Alex, 25, 6, Brown, Hazel
In which case, you have to parse out this information based on the structure above. If it's repeated blocks of text like the above then we can say the following:
a. Every person is in a block starting with Name Details
b. The name value is the first text after Details, with the other columns being delimited in the format Column:Value
However, you might also have sections with additional attributes, or attributes that are missing if the original input was optional, so tracking the column and ordinal would be useful too.
So one approach might look like the following:
public void ParseFile(){
String currentLine;
bool newSection = false;
//Store the column names and ordinal position here.
List<String> nameOrdinals = new List<String>();
nameOrdinals.Add("Name"); //IndexOf == 0
Dictionary<Int32, List<String>> nameValues = new Dictionary<Int32 ,List<string>>(); //Use this to store each person's details
Int32 rowNumber = 0;
using (TextReader reader = File.OpenText("D:\\temp\\test.txt"))
{
while ((currentLine = reader.ReadLine()) != null) //This will read the file one row at a time until there are no more rows to read
{
string[] lineSegments = currentLine.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
if (lineSegments.Length == 2 && String.Compare(lineSegments[0], "Name", StringComparison.InvariantCultureIgnoreCase) == 0
&& String.Compare(lineSegments[1], "Details", StringComparison.InvariantCultureIgnoreCase) == 0) //Looking for a Name Details Line - Start of a new section
{
rowNumber++;
newSection = true;
continue;
}
if (newSection && lineSegments.Length > 1) //We can start adding a new person's details - we know that
{
nameValues.Add(rowNumber, new List<String>());
nameValues[rowNumber].Insert(nameOrdinals.IndexOf("Name"), lineSegments[0]);
//Get the first column:value item
ParseColonSeparatedItem(lineSegments[1], nameOrdinals, nameValues, rowNumber);
newSection = false;
continue;
}
if (lineSegments.Length > 0 && lineSegments[0] != String.Empty) //Ignore empty lines
{
ParseColonSeparatedItem(lineSegments[0], nameOrdinals, nameValues, rowNumber);
}
}
}
//At this point we should have collected a big list of items. We can then write out the CSV. We can use a StringBuilder for now, although your requirements will
//be dependent upon how big the source files are.
//Write out the columns
StringBuilder builder = new StringBuilder();
for (int i = 0; i < nameOrdinals.Count; i++)
{
if(i == nameOrdinals.Count - 1)
{
builder.Append(nameOrdinals[i]);
}
else
{
builder.AppendFormat("{0},", nameOrdinals[i]);
}
}
builder.Append(Environment.NewLine);
foreach (int key in nameValues.Keys)
{
List<String> values = nameValues[key];
for (int i = 0; i < values.Count; i++)
{
if (i == values.Count - 1)
{
builder.Append(values[i]);
}
else
{
builder.AppendFormat("{0},", values[i]);
}
}
builder.Append(Environment.NewLine);
}
//At this point you now have a StringBuilder containing the CSV data you can write to a file or similar
}
private void ParseColonSeparatedItem(string textToSeparate, List<String> columns, Dictionary<Int32, List<String>> outputStorage, int outputKey)
{
if (String.IsNullOrWhiteSpace(textToSeparate)) { return; }
string[] colVals = textToSeparate.Split(new[] { ":" }, StringSplitOptions.RemoveEmptyEntries);
List<String> outputValues = outputStorage[outputKey];
if (!columns.Contains(colVals[0]))
{
//Add the column to the list of expected columns. The index of the column determines its index in the output
columns.Add(colVals[0]);
}
if (outputValues.Count < columns.Count)
{
outputValues.Add(colVals[1]);
}
else
{
outputStorage[outputKey].Insert(columns.IndexOf(colVals[0]), colVals[1]); //We append the value to the list at the place where the column index expects it to be. That way we can miss values in certain sections yet still have the expected output
}
}
After running this against your file, the string builder contains:
"Name,Age,Height,Hair,Eyes\r\nAlex,25,6,Brown,Hazel\r\n"
Which matches the above (\r\n is effectively the Windows new line marker)
This approach demonstrates how a custom parser might work - it's purposefully over verbose as there is plenty of refactoring that could take place here, and is just an example.
Improvements would include:
1) This function assumes there are no spaces in the actual text items themselves. This is a pretty big assumption and, if wrong, would require a different approach to parsing out the line segments. However, this only needs to change in one place: as you read a line at a time, you could apply a regex, or just read in characters and assume that everything after the first "column:" section is a value, for example.
2) No exception handling
3) Text output is not quoted. You could test each value to see if it's a date or number - if not, wrap it in quotes as then other programs (like Excel) will attempt to preserve the underlying datatypes more effectively.
4) Assumes no column names are repeated. If they are, then you have to check whether a column item has already been added, and then create a ColName2 column in the parsing section.
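As a sketch of improvement 1, a Column:Value line can be cut at its first colon so that the value may contain spaces (or further colons). The helper name `ColonItem.TrySplit` is hypothetical and not part of the code above. It only covers lines that hold a single Column:Value item; a first line such as "Alex Age:25" would still need the name separated first.

```csharp
using System;

public static class ColonItem
{
    // Splits "Column:Value" at the FIRST colon only, so "Hair:Dark Brown"
    // and "Time:10:30" keep their full values.
    // Returns false when the line has no usable column name.
    public static bool TrySplit(string line, out string column, out string value)
    {
        column = string.Empty;
        value = string.Empty;
        if (string.IsNullOrWhiteSpace(line)) return false;

        int colon = line.IndexOf(':');
        if (colon <= 0) return false; // no colon, or nothing before it

        column = line.Substring(0, colon).Trim();
        value = line.Substring(colon + 1).Trim();
        return column.Length > 0;
    }
}
```

With this helper, `TrySplit("Hair: Dark Brown", out c, out v)` gives the column `Hair` and the value `Dark Brown`, while a line without a colon returns `false`.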

Related

Check if list contains a string that matches closely

I'm trying to figure out the most efficient way to implement the following scenario:
I have a list like this:
public static IEnumerable<string> ValidTags = new List<string> {
"ABC.XYZ",
"PQR.SUB.UID",
"PQR.ALI.OBD",
};
I have a huge CSV with multiple columns. One of the columns is tags. This column either contains blank values, or one of the above values. The problem is, the tag column may contain values like "ABC.XYZ?#", i.e. the valid tags plus some extraneous characters. I need to update such columns with the valid tag, since they "closely match" one of our valid tags.
Example:
if the CSV contains PQR.ALI.OBD? update it with the valid tag PQR.ALI.OBD
if the CSV contains PQR.ALI.OBA, this is invalid, just add suffix invalid and update it PQR.ALI.OBA-invalid.
I'm trying to figure out the best possible way to do this.
My current approach is:
Iterate through each column in CSV, get the tagValue
Now check if our tagValue contains any of the string from list
If it contains but is not exactly the same, update it with the value it contains.
If it doesn't "contain" any value from the list, add the suffix -invalid.
Is there any better/more efficient way to do this?
Update:
The list has only 5 items, I have shown three here.
The extra chars are only at the end, and that's happening because people are editing those CSVs in excel web version and that messes up some entries.
My current code: (I'm sure there is a better way to do this, also new at C# so please tell me how I can improve this). I'm using CSVHelper to get CSV cells.
var record = csv.GetRecord<Record>();
string tag = csv.GetField(10); //tag column number in CSV is 10
/* Criteria for validation:
* tag matches our list, but has extraneous chars - strip extraneous chars and update csv
* tag doesn't match our list - add suffix invalid.*/
int listIndex = 0;
bool valid;
foreach (var validTags in ValidTags) //ValidTags is the list above
{
if (validTags.Contains(tag.ToUpper()) && !string.Equals(validTags, subjectIdentifier.ToUpper()))
{
valid = true;
continue; //move on to next csv row.
//this means that tag is valid but has some extra characters appended to it because of web excel, strip extra charts
}
listIndex++;
if(listIndex == 3 && !valid) {
//means we have reached the end of the list but not found valid tag
//add suffix invalid and move on to next csv row
}
}
Since you say that the extra characters are only at the end, and assuming that the original tag is still present before the extra characters, you could just search the list for each tag to see if the tag contains an entry from the list. If it does, then update it to the correct entry if it's not an exact match, and if it doesn't, append the "-invalid" tag to it.
Before doing this, we may need to first sort the list Descending so that when we're searching we find the closest (longest) match (in a case where one item in the list begins with another item in the list).
var csvPath = @"f:\public\temp\temp.csv";
var entriesUpdated = 0;
// Order the list so we match on the most similar match (ABC.DEF before ABC)
var orderedTags = ValidTags.OrderByDescending(t => t);
var newFileLines = new List<string>();
// Read each line in the file
foreach (var csvLine in File.ReadLines(csvPath))
{
// Get the columns
var columns = csvLine.Split(',');
// Process each column
for (int index = 0; index < columns.Length; index++)
{
var column = columns[index];
switch (index)
{
case 0: // tag column
var correctTag = orderedTags.FirstOrDefault(tag =>
column.IndexOf(tag, StringComparison.OrdinalIgnoreCase) > -1);
if (correctTag != null)
{
// This item contains a correct tag, so
// update it if it's not an exact match
if (column != correctTag)
{
columns[index] = correctTag;
entriesUpdated++;
}
}
else
{
// This column does not contain a correct tag, so mark it as invalid
columns[index] += "-invalid";
entriesUpdated++;
}
break;
// Other cases for other columns follow if needed
}
}
newFileLines.Add(string.Join(",", columns));
}
// Write the new lines if any were changed
if (entriesUpdated > 0) File.WriteAllLines(csvPath, newFileLines);
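Because the question states that the extra characters only ever appear at the end, a prefix test is a slightly stricter alternative to the substring search used above. The sketch below is a variation, not the answer's own method: `TagNormalizer.Normalize` is a hypothetical helper, it orders candidates by length rather than alphabetically, and it leaves blank cells untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagNormalizer
{
    // Returns the valid tag that rawTag starts with, or rawTag + "-invalid"
    // when no valid tag matches. Blank values are returned unchanged.
    public static string Normalize(string rawTag, IEnumerable<string> validTags)
    {
        if (string.IsNullOrWhiteSpace(rawTag)) return rawTag;

        string trimmed = rawTag.Trim();

        // Try the longest tags first so "ABC.DEF" wins over "ABC".
        string match = validTags
            .OrderByDescending(t => t.Length)
            .FirstOrDefault(t => trimmed.StartsWith(t, StringComparison.OrdinalIgnoreCase));

        return match ?? trimmed + "-invalid";
    }
}
```

For the examples in the question, `Normalize("PQR.ALI.OBD?", ValidTags)` returns `PQR.ALI.OBD` and `Normalize("PQR.ALI.OBA", ValidTags)` returns `PQR.ALI.OBA-invalid`.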

c# Read/ Write CSV - excluding Comma in field Value [duplicate]

I am looking for suggestions on how to handle a csv file that is being created, then uploaded by our customers, and that may have a comma in a value, like a company name.
Some of the ideas we are looking at are: quoted identifiers ("value","value", etc.) or using a | instead of a comma. The biggest problem is that we have to make it easy, or the customer won't do it.
There's actually a spec for CSV format, RFC 4180 and how to handle commas:
Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
http://tools.ietf.org/html/rfc4180
So, to have values foo and bar,baz, you do this:
foo,"bar,baz"
Another important requirement to consider (also from the spec):
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
As others have said, you need to escape values that include quotes. Here’s a little CSV reader in C♯ that supports quoted values, including embedded quotes and carriage returns.
By the way, this is unit-tested code. I’m posting it now because this question seems to come up a lot and others may not want an entire library when simple CSV support will do.
You can use it as follows:
using System;
public class test
{
public static void Main()
{
using ( CsvReader reader = new CsvReader( "data.csv" ) )
{
foreach( string[] values in reader.RowEnumerator )
{
Console.WriteLine( "Row {0} has {1} values.", reader.RowIndex, values.Length );
}
}
Console.ReadLine();
}
}
Here are the classes. Note that you can use the Csv.Escape function to write valid CSV as well.
using System.IO;
using System.Text.RegularExpressions;
public sealed class CsvReader : System.IDisposable
{
public CsvReader( string fileName ) : this( new FileStream( fileName, FileMode.Open, FileAccess.Read ) )
{
}
public CsvReader( Stream stream )
{
__reader = new StreamReader( stream );
}
public System.Collections.IEnumerable RowEnumerator
{
get {
if ( null == __reader )
throw new System.ApplicationException( "I can't start reading without CSV input." );
__rowno = 0;
string sLine;
string sNextLine;
while ( null != ( sLine = __reader.ReadLine() ) )
{
while ( rexRunOnLine.IsMatch( sLine ) && null != ( sNextLine = __reader.ReadLine() ) )
sLine += "\n" + sNextLine;
__rowno++;
string[] values = rexCsvSplitter.Split( sLine );
for ( int i = 0; i < values.Length; i++ )
values[i] = Csv.Unescape( values[i] );
yield return values;
}
__reader.Close();
}
}
public long RowIndex { get { return __rowno; } }
public void Dispose()
{
if ( null != __reader ) __reader.Dispose();
}
//============================================
private long __rowno = 0;
private TextReader __reader;
private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );
private static Regex rexRunOnLine = new Regex( @"^[^""]*(?:""[^""]*""[^""]*)*""[^""]*$" );
}
public static class Csv
{
public static string Escape( string s )
{
if ( s.Contains( QUOTE ) )
s = s.Replace( QUOTE, ESCAPED_QUOTE );
if ( s.IndexOfAny( CHARACTERS_THAT_MUST_BE_QUOTED ) > -1 )
s = QUOTE + s + QUOTE;
return s;
}
public static string Unescape( string s )
{
if ( s.StartsWith( QUOTE ) && s.EndsWith( QUOTE ) )
{
s = s.Substring( 1, s.Length - 2 );
if ( s.Contains( ESCAPED_QUOTE ) )
s = s.Replace( ESCAPED_QUOTE, QUOTE );
}
return s;
}
private const string QUOTE = "\"";
private const string ESCAPED_QUOTE = "\"\"";
private static char[] CHARACTERS_THAT_MUST_BE_QUOTED = { ',', '"', '\n' };
}
The CSV format uses commas to separate values. Values that contain carriage returns, linefeeds, commas, or double quotes are surrounded by double quotes. Within a quoted value, each literal double quote is escaped by an immediately preceding double quote. For example, the 3 values:
test
list, of, items
"go" he said
would be encoded as:
test
"list, of, items"
"""go"" he said"
Any field can be quoted but only fields that contain commas, CR/NL, or quotes must be quoted.
There is no real standard for the CSV format, but almost all applications follow the conventions documented here. The RFC that was mentioned elsewhere is not a standard for CSV, it is an RFC for using CSV within MIME and contains some unconventional and unnecessary limitations that make it useless outside of MIME.
A gotcha that many CSV modules I have seen don't accommodate is that multiple lines can be encoded in a single field, which means you can't assume that each line is a separate record. You either need to disallow newlines in your data or be prepared to handle this.
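One way to handle the multi-line gotcha is to read character by character and track whether the reader is currently inside a quoted field, rather than splitting the input into lines first. The following is a minimal sketch under the quoting conventions described above. `QuotedCsv.ReadRecords` is a hypothetical name, blank lines outside quotes are skipped, and `Peek` is assumed to work on the supplied reader (it does for `StringReader`).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class QuotedCsv
{
    // Yields one list of fields per record. Commas, doubled quotes ("")
    // and line breaks inside a quoted field are kept as field content.
    public static IEnumerable<List<string>> ReadRecords(TextReader reader)
    {
        var record = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        bool sawAnything = false; // true once the current record has content
        int next;

        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (inQuotes)
            {
                if (c == '"')
                {
                    if (reader.Peek() == '"') { field.Append('"'); reader.Read(); } // escaped quote
                    else inQuotes = false;                                           // closing quote
                }
                else field.Append(c); // commas and newlines are literal here
            }
            else if (c == '"') { inQuotes = true; sawAnything = true; }
            else if (c == ',') { record.Add(field.ToString()); field.Clear(); sawAnything = true; }
            else if (c == '\r' || c == '\n')
            {
                if (c == '\r' && reader.Peek() == '\n') reader.Read(); // CRLF counts once
                if (sawAnything)
                {
                    record.Add(field.ToString());
                    yield return record;
                    record = new List<string>();
                }
                field.Clear();
                sawAnything = false;
            }
            else { field.Append(c); sawAnything = true; }
        }

        // Final record when the input does not end with a line break.
        if (sawAnything) { record.Add(field.ToString()); yield return record; }
    }
}
```

Feeding it a `StringReader` over text where a quoted field spans two lines produces a single record for that row, with the line break preserved inside the field.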
Put double quotes around strings. That is generally what Excel does.
À la Eli, you escape a double quote as two double quotes. E.g.
"test1","foo""bar","test2"
You can put double quotes around the fields. I don't like this approach, as it adds another special character (the double quote). Just define an escape character (usually backslash) and use it wherever you need to escape something:
data,more data,more data\, even,yet more
You don't have to try to match quotes, and you have fewer exceptions to parse. This simplifies your code, too.
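A backslash-escape scheme like the one described above could be sketched as below. This is an illustration only: `BackslashCsv` is a hypothetical class, it is not compatible with standard quoted CSV, and it does not address line breaks inside values.

```csharp
using System.Collections.Generic;
using System.Text;

public static class BackslashCsv
{
    // Prefixes every backslash and comma in the value with a backslash.
    public static string Escape(string value)
    {
        return value.Replace("\\", "\\\\").Replace(",", "\\,");
    }

    // Splits a line on unescaped commas and removes the escape characters.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool escaped = false;

        foreach (char c in line)
        {
            if (escaped) { current.Append(c); escaped = false; }  // take the character literally
            else if (c == '\\') escaped = true;                    // next character is escaped
            else if (c == ',') { fields.Add(current.ToString()); current.Clear(); }
            else current.Append(c);
        }

        fields.Add(current.ToString());
        return fields;
    }
}
```

Splitting the sample line `data,more data,more data\, even,yet more` gives four fields, the third being `more data, even`.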
There is a library available through nuget for dealing with pretty much any well formed CSV (.net) - CsvHelper
Example to map to a class:
var csv = new CsvReader( textReader );
var records = csv.GetRecords<MyClass>();
Example to read individual fields:
var csv = new CsvReader( textReader );
while( csv.Read() )
{
var intField = csv.GetField<int>( 0 );
var stringField = csv.GetField<string>( 1 );
var boolField = csv.GetField<bool>( "HeaderName" );
}
Letting the client drive the file format:
, is the standard field delimiter, " is the standard value used to escape fields that contain a delimiter, quote, or line ending.
To use (for example) # for fields and ' for escaping:
var csv = new CsvReader( textReader );
csv.Configuration.Delimiter = "#";
csv.Configuration.Quote = '\'';
// read the file however meets your needs
In case you're on a *nix-system, have access to sed and there can be one or more unwanted commas only in a specific field of your CSV, you can use the following one-liner in order to enclose them in " as RFC4180 Section 2 proposes:
sed -r 's/([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*)/\1"\2"\3/' inputfile
Depending on which field the unwanted comma(s) may be in you have to alter/extend the capturing groups of the regex (and the substitution).
The example above will enclose the fourth field (out of six) in quotation marks.
In combination with the --in-place-option you can apply these changes directly to the file.
In order to "build" the right regex, there's a simple principle to follow:
For every field in your CSV that comes before the field with the unwanted comma(s) you write one [^,]*, and put them all together in a capturing group.
For the field that contains the unwanted comma(s) you write (.*).
For every field after the field with the unwanted comma(s) you write one ,.* and put them all together in a capturing group.
Here is a short overview of different possible regexes/substitutions depending on the specific field. If not given, the substitution is \1"\2"\3.
([^,]*)(,.*) #first field, regex
"\1"\2 #first field, substitution
(.*,)([^,]*) #last field, regex
\1"\2" #last field, substitution
([^,]*,)(.*)(,.*,.*,.*) #second field (out of five fields)
([^,]*,[^,]*,)(.*)(,.*) #third field (out of four fields)
([^,]*,[^,]*,[^,]*,)(.*)(,.*,.*) #fourth field (out of six fields)
If you want to remove the unwanted comma(s) with sed instead of enclosing them with quotation marks refer to this answer.
As mentioned in my comment to harpo's answer, his solution is good and works in most cases; however, in some scenarios where commas are directly adjacent to each other, it fails to split on the commas.
This is because of the regex string behaving unexpectedly as a verbatim string.
To get it to behave correctly, all " characters in the regex string need to be escaped manually without using the verbatim escape.
I.e., the regex should be this using manual escapes:
",(?=(?:[^\"\"]*\"\"[^\"\"]*\"\")*(?![^\"\"]*\"\"))"
which translates into ",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))"
When using a verbatim string @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" it behaves as follows, as you can see if you debug the regex:
",(?=(?:[^"]*"[^"]*")*(?![^"]*"))"
So in summary, I recommend harpo's solution, but watch out for this little gotcha!
I've included into the CsvReader a little optional failsafe to notify you if this error occurs (if you have a pre-known number of columns):
if (_expectedDataLength > 0 && values.Length != _expectedDataLength)
throw new DataLengthException(string.Format("Expected {0} columns when splitting csv, got {1}", _expectedDataLength, values.Length));
This can be injected via the constructor:
public CsvReader(string fileName, int expectedDataLength = 0) : this(new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
_expectedDataLength = expectedDataLength;
}
Add a reference to Microsoft.VisualBasic (yes, it says VisualBasic, but it works in C# just as well; remember that in the end it is all just IL).
Use the Microsoft.VisualBasic.FileIO.TextFieldParser class to parse the CSV file. Here is the sample code (VB.NET):
Dim parser As TextFieldParser = New TextFieldParser("C:\mar0112.csv")
parser.TextFieldType = FieldType.Delimited
parser.SetDelimiters(",")
While Not parser.EndOfData
'Processing row
Dim fields() As String = parser.ReadFields
For Each field As String In fields
'TODO: Process field
Next
End While
parser.Close()
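Since the thread is about C#, here is a rough C# counterpart of the VB.NET sample. It is a sketch that assumes the Microsoft.VisualBasic assembly is referenced; the wrapper class `TextFieldParserExample` and its `ReadAll` method are hypothetical names, and a `TextReader` is used instead of a file path so the input source is easy to swap.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExample
{
    // Reads every row of comma-delimited input into a list of field arrays.
    public static List<string[]> ReadAll(TextReader input)
    {
        var rows = new List<string[]>();

        // The using block closes the parser once all rows have been read.
        using (var parser = new TextFieldParser(input))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true; // "bar,baz" stays one field

            while (!parser.EndOfData)
            {
                rows.Add(parser.ReadFields());
            }
        }

        return rows;
    }
}
```

To read a file, pass `File.OpenText(path)` as the reader; for the input `foo,"bar,baz"` the second field of the first row comes back as `bar,baz`.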
You can use alternative "delimiters" like ";" or "|" but simplest might just be quoting which is supported by most (decent) CSV libraries and most decent spreadsheets.
For more on CSV delimiters and a spec for a standard format for describing delimiters and quoting see this webpage
If you're interested in a more educational exercise on how to parse files in general (using CSV as an example), you may check out this article by Julian Bucknall. I like the article because it breaks things down into much smaller problems that are much less insurmountable. You first create a grammar, and once you have a good grammar, it's a relatively easy and methodical process to convert the grammar into code.
The article uses C# and has a link at the bottom to download the code.
If you feel like reinventing the wheel, the following may work for you:
public static IEnumerable<string> SplitCSV(string line)
{
var s = new StringBuilder();
bool escaped = false, inQuotes = false;
foreach (char c in line)
{
if (c == ',' && !inQuotes)
{
yield return s.ToString();
s.Clear();
}
else if (c == '\\' && !escaped)
{
escaped = true;
}
else if (c == '"' && !escaped)
{
inQuotes = !inQuotes;
}
else
{
escaped = false;
s.Append(c);
}
}
yield return s.ToString();
}
In Europe we ran into this problem long before this question came up: most European countries use a comma as the decimal separator. See the numbers below:
| American | Europe |
| ------------- | ------------- |
| 0.5 | 0,5 |
| 3.14159265359 | 3,14159265359 |
| 17.54 | 17,54 |
| 175,186.15 | 175.186,15 |
So a comma can't safely be used as the separator for CSV files there. For that reason, CSV files in Europe are usually separated by a semicolon (;).
Programs like Microsoft Excel can read files with a semicolon, and it's possible to change the separator. You could even use a tab (\t) as the separator. See this answer from Super User.
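As an illustration of reading such a file in C#, the sketch below splits on semicolons and parses numbers that use a decimal comma. The class name `SemicolonCsv` is hypothetical, and the number format is built explicitly so that the example does not depend on the culture settings of the machine it runs on.

```csharp
using System.Globalization;

public static class SemicolonCsv
{
    // Decimal comma and dot as group separator, as in the table above.
    private static readonly NumberFormatInfo DecimalComma = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    // Parses one semicolon-separated line of numbers such as "0,5;175.186,15".
    public static decimal[] ParseLine(string line)
    {
        string[] fields = line.Split(';');
        var values = new decimal[fields.Length];

        for (int i = 0; i < fields.Length; i++)
        {
            // NumberStyles.Number allows both group and decimal separators.
            values[i] = decimal.Parse(fields[i].Trim(), NumberStyles.Number, DecimalComma);
        }

        return values;
    }
}
```

For example, `ParseLine("0,5;17,54;175.186,15")` returns the decimals 0.5, 17.54 and 175186.15.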
Here's a neat little workaround:
You can use a Greek Lower Numeral Sign instead (U+0375)
It looks like this ͵
Using this method saves you a lot of resources too...
I know it's almost 13 years later, but we came across a similar situation where the client inputs us a CSV and has values with commas, there are 2 use cases:
If the client uses Excel on Windows to write the CSV (usually the case in a Windows environment), then quotes are automatically added around a value that contains commas.
The actual text value of the CSV:
3786962,1st Meridian Care Services,John,"Person A,Person B, Person C, Person D",Voyager
If the client is generating the CSV programmatically, then they should adhere to RFC 4180 and enclose the value in double quotes. Example:
Col1, Col2, "a, b, c", Col4
Just use SoftCircuits.CsvParser on NuGet. It will handle all those details for you and efficiently handles very large files. And, if needed, it can even import/export objects by mapping columns to object properties. In addition, my testing showed it averages nearly 4 times faster than the popular CsvHelper.
You can read the csv file like this.
this makes use of splits and takes care of spaces.
ArrayList List = new ArrayList();
static ServerSocket Server;
static Socket socket;
static ArrayList<Object> list = new ArrayList<Object>();
public static void ReadFromXcel() throws FileNotFoundException
{
File f = new File("Book.csv");
Scanner in = new Scanner(f);
int count =0;
String[] date;
String[] name;
String[] Temp = new String[10];
String[] Temp2 = new String[10];
String[] numbers;
ArrayList<String[]> List = new ArrayList<String[]>();
HashMap m = new HashMap();
in.nextLine();
date = in.nextLine().split(",");
name = in.nextLine().split(",");
numbers = in.nextLine().split(",");
while(in.hasNext())
{
String[] one = in.nextLine().split(",");
List.add(one);
}
int xount = 0;
//Making sure the lines don't start with a blank
for(int y = 0; y<= date.length-1; y++)
{
if(!date[y].equals(""))
{
Temp[xount] = date[y];
Temp2[xount] = name[y];
xount++;
}
}
date = Temp;
name =Temp2;
int counter = 0;
while(counter < List.size())
{
String[] list = List.get(counter);
String sNo = list[0];
String Surname = list[1];
String Name = list[2];
for(int x = 3; x < list.length; x++)
{
m.put(numbers[x], list[x]);
}
Object newOne = new newOne(sNo, Name, Surname, m, false);
StudentList.add(s);
System.out.println(s.sNo);
counter++;
}
I generally URL-encode the fields which can have any commas or any special chars. And then decode it when it is being used/displayed in any visual medium.
(commas becomes %2C)
Every language should have methods to URL-encode and decode strings.
e.g., in java
URLEncoder.encode(myString,"UTF-8"); //to encode
URLDecoder.decode(myEncodedstring, "UTF-8"); //to decode
I know this is a very general solution and it might not be ideal for situation where user wants to view content of csv file, manually.
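In C#, the same idea can be sketched with `Uri.EscapeDataString` and `Uri.UnescapeDataString`. The wrapper class `UrlEncodedCsv` and its method names are hypothetical; the point is only that an encoded field can no longer contain a literal comma, so a plain split is safe.

```csharp
using System;
using System.Linq;

public static class UrlEncodedCsv
{
    // URL-encodes every field and joins them with commas.
    public static string EncodeRow(params string[] fields)
    {
        return string.Join(",", fields.Select(f => Uri.EscapeDataString(f)));
    }

    // Splits on commas (safe, since encoded fields contain none) and decodes.
    public static string[] DecodeRow(string line)
    {
        return line.Split(',').Select(f => Uri.UnescapeDataString(f)).ToArray();
    }
}
```

`EncodeRow("foo", "bar,baz")` produces `foo,bar%2Cbaz`, and `DecodeRow` turns that line back into the two original values.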
I usually do this in my CSV files parsing routines. Assume that 'line' variable is one line within a CSV file and all of the columns' values are enclosed in double quotes. After the below two lines execute, you will get CSV columns in the 'values' collection.
// The below two lines will split the columns as well as trim the DOUBLE QUOTES around values but NOT within them
string trimmedLine = line.Trim(new char[] { '\"' });
List<string> values = trimmedLine.Split(new string[] { "\",\"" }, StringSplitOptions.None).ToList();
The simplest solution I've found is the one LibreOffice uses:
Replace all literal " by ”
Put double quotes around your string
You can also use the one that Excel uses:
Replace all literal " by ""
Put double quotes around your string
Notice other people recommended to do only step 2 above, but that doesn't work with lines where a " is followed by a ,, like in a CSV where you want to have a single column with the string hello",world, as the CSV would read:
"hello",world"
Which is interpreted as a row with two columns: hello and world"
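The Excel-style variant (double every literal quote, then wrap the value) fits in one small helper. This is a sketch with a hypothetical name, `ExcelStyleQuoting.Quote`; it always quotes the value, which the convention permits even when quoting is not strictly required.

```csharp
public static class ExcelStyleQuoting
{
    // Step 1: replace each literal " with "".
    // Step 2: put double quotes around the result.
    public static string Quote(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

For the problem value `hello",world`, `Quote` returns `"hello"",world"`, which a conforming reader interprets as a single column again.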
public static IEnumerable<string> LineSplitter(this string line, char
separator, char skip = '"')
{
var fieldStart = 0;
for (var i = 0; i < line.Length; i++)
{
if (line[i] == separator)
{
yield return line.Substring(fieldStart, i - fieldStart);
fieldStart = i + 1;
}
else if (i == line.Length - 1)
{
yield return line.Substring(fieldStart, i - fieldStart + 1);
fieldStart = i + 1;
}
if (line[i] == skip)
for (i++; i < line.Length && line[i] != skip; i++) { }
}
if (line[line.Length - 1] == separator)
{
yield return string.Empty;
}
}
I used the Csvreader library, but it split column values at the commas they contained.
So if you want to load CSV data in which many column values contain commas, you can use the function below (PHP).
Author link => https://gist.github.com/jaywilliams/385876
function csv_to_array($filename='', $delimiter=',')
{
if(!file_exists($filename) || !is_readable($filename))
return FALSE;
$header = NULL;
$data = array();
if (($handle = fopen($filename, 'r')) !== FALSE)
{
while (($row = fgetcsv($handle, 1000, $delimiter)) !== FALSE)
{
if(!$header)
$header = $row;
else
$data[] = array_combine($header, $row);
}
fclose($handle);
}
return $data;
}
I used the PapaParse library to parse the CSV file into key-value pairs (the keys come from the header row of the CSV file).
Here is the example that I use:
https://codesandbox.io/embed/llqmrp96pm
It includes a dummy.csv file for the CSV parsing demo.
I used it within ReactJS, though it is easy to replicate in an app written in any language.
An example might help to show how commas can be displayed in a .csv file. Create a simple text file as follows:
Save this text file as a text file with suffix ".csv" and open it with Excel 2000 from Windows 10.
aa,bb,cc,d;d
"In the spreadsheet presentation, the below line should look like the above line except the below shows a displayed comma instead of a semicolon between the d's."
aa,bb,cc,"d,d", This works even in Excel
aa,bb,cc,"d,d", This works even in Excel 2000
aa,bb,cc,"d ,d", This works even in Excel 2000
aa,bb,cc,"d , d", This works even in Excel 2000
aa,bb,cc, " d,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d ,d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc, " d , d", This fails in Excel 2000 due to the space before the 1st quote
aa,bb,cc,"d,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d ,d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
aa,bb,cc,"d , d " , This works even in Excel 2000 even with spaces before and after the 2nd quote.
Rule: If you want to display a comma in a cell (field) of a .csv file:
"Start and end the field with a double quotes, but avoid white space before the 1st quote"
As this is about general practices, let's start with rules of thumb:
1) Don't use CSV; use XML with a library to read and write the XML file instead.
2) If you must use CSV, do it properly and use a free library to parse and store the CSV files.
To justify 1), most CSV parsers aren't encoding-aware, so if you aren't dealing with US-ASCII you are asking for trouble.
For example, Excel 2002 stores the CSV in the local encoding without any note about the encoding. The CSV standard isn't widely adopted :(.
On the other hand, the XML standard is well adopted and it handles encodings pretty well.
To justify 2), there are tons of CSV parsers around for almost every language, so there is no need to reinvent the wheel even if the solution looks pretty simple.
To name a few:
for Python, use the built-in csv module
for Perl, check CPAN and Text::CSV
for PHP, use the built-in fgetcsv/fputcsv functions
for Java, check the Super CSV library
Really, there is no need to implement this by hand unless you are going to parse it on an embedded device.
First, let's ask ourselves, "Why do we feel the need to handle commas differently for CSV files?"
For me, the answer is, "Because when I export data into a CSV file, the commas in a field disappear and my field gets separated into multiple fields where the commas appear in the original data." (That is because the comma is the CSV field separator character.)
Depending on your situation, semicolons may also be used as CSV field separators.
Given my requirements, I can use a character, e.g., single low-9 quotation mark, that looks like a comma.
So, here's how you can do it in Go:
// Replace special CSV characters with single low-9 quotation mark
func Scrub(a interface{}) string {
s := fmt.Sprint(a)
s = strings.Replace(s, ",", "‚", -1)
s = strings.Replace(s, ";", "‚", -1)
return s
}
The second comma looking character in the Replace function is decimal 8218.
Be aware that if you have clients with ASCII-only text readers, this decimal 8218 character will not look like a comma. If this is your case, then I'd recommend surrounding the field that contains the comma (or semicolon) with double quotes per RFC 4180: https://www.rfc-editor.org/rfc/rfc4180
Thank you others in this post.
I used the information here to create a function in JavaScript that will get csv output for an array of objects which may have property values containing commas.
like
rowsArray = [{obj1prop1: "foo", obj1prop2: "bar,baz"}, {obj2prop1: "qux", obj2prop2: "quux,corge,thud"}]
into
csvRowsArray = [{obj1prop1: "foo", obj1prop2: "\"bar,baz\""}, {...} ]
To use the commas in the values in a csv, the value needs to be wrapped in double quotes. And in order to have double quotes in the value in the json object, they just need to be escaped, i.e., \", backslash double quote. The escape is made here by subbing in a template literal and including the necessary quotes `"${row[key]}"`. The quotes are escaped when put in the object.
Here is my function:
const calculateTheCSVExport = (props) => {
if (props.rows === undefined) return;
let jsonRowsArray = props.rows;
// console.log(jsonRowsArray);
let csvRowsArrayNoCommasInObjectValues = [];
let csvCurrRowObject = {}
jsonRowsArray.forEach(row => {
Object.keys(row).forEach(key => {
// console.log(key, row[key])
if (String(row[key]).indexOf(',') > -1) { // String() so that numbers and other non-string values don't throw
csvCurrRowObject = {...csvCurrRowObject, [key]: `"${row[key]}"`} // enclose value in escaped double quotes in JSON in order to export commas to csv correctly. see more: https://stackoverflow.com/questions/769621/dealing-with-commas-in-a-csv-file
} else {
csvCurrRowObject = {...csvCurrRowObject, [key]: row[key]}
}
});
csvRowsArrayNoCommasInObjectValues.push(csvCurrRowObject);
csvCurrRowObject = {};
})
// console.log(csvRowsArrayNoCommasInObjectValues)
return csvRowsArrayNoCommasInObjectValues;
}
I think the easiest solution to this problem is to have the customer open the CSV in Excel and press Ctrl+H to replace every comma with whatever identifier you want. This is very easy for the customer and requires only one change in your code: reading the delimiter of your choice.
Use a tab character (\t) to separate the fields.

Parse Text File Into Dictionary

I have a text file that has several hundred configuration values. The general format of the configuration data is "Label:Value". Using C# .net, I would like to read these configurations, and use the Values in other portions of the code. My first thought is that I would use a string search to look for the Labels then parse out the values following the labels and add them to a dictionary, but this seems rather tedious considering the number of labels/values that I would have to search for. I am interested to hear some thoughts on a possible architecture to perform this task. I have included a small section of a sample text file that contains some of the labels and values (below). A couple of notes: The Values are not always numeric (as seen in the AUX Serial Number); For whatever reason, the text files were formatted using spaces (\s) rather than tabs (\t). Thanks in advance for any time you spend thinking about this.
Sample Text:
AUX Serial Number: 445P000023 AUX Hardware Rev: 1
Barometric Pressure Slope: -1.452153E-02
Barometric Pressure Intercept: 9.524336E+02
This is a nice little brain tickler. I think this code might be able to point you in the right direction. Keep in mind, this fills a Dictionary<string, string>, so there are no conversions of values into ints or the like. Also, please excuse the mess (and the poor naming conventions). It was a quick write-up based on my train of thought.
Dictionary<string, string> allTheThings = new Dictionary<string, string>();
public void ReadIt()
{
// Open the file into a streamreader
using (System.IO.StreamReader sr = new System.IO.StreamReader("text_path_here.txt"))
{
while (!sr.EndOfStream) // Keep reading until we get to the end
{
string splitMe = sr.ReadLine();
string[] bananaSplits = splitMe.Split(new char[] { ':' }); //Split at the colons
if (bananaSplits.Length < 2) // If we get less than 2 results, discard them
continue;
else if (bananaSplits.Length == 2) // Easy part. If there are 2 results, add them to the dictionary
allTheThings.Add(bananaSplits[0].Trim(), bananaSplits[1].Trim());
else if (bananaSplits.Length > 2)
SplitItGood(splitMe, allTheThings); // Hard part. If there are more than 2 results, use the method below.
}
}
}
public void SplitItGood(string stringInput, Dictionary<string, string> dictInput)
{
StringBuilder sb = new StringBuilder();
List<string> fish = new List<string>(); // This list will hold the keys and values as we find them
bool hasFirstValue = false;
foreach (char c in stringInput) // Iterate through each character in the input
{
if (c != ':') // Keep building the string until we reach a colon
sb.Append(c);
else if (c == ':' && !hasFirstValue)
{
fish.Add(sb.ToString().Trim());
sb.Clear();
hasFirstValue = true;
}
else if (c == ':' && hasFirstValue)
{
// Below, the StringBuilder currently has something like this:
// "  235235  Some Text Here"
// We trim the leading whitespace, then split at the first sign of a double space
string[] bananaSplit = sb.ToString()
.Trim()
.Split(new string[] { "  " },
StringSplitOptions.RemoveEmptyEntries);
// Add both results to the list
fish.Add(bananaSplit[0].Trim());
fish.Add(bananaSplit[1].Trim());
sb.Clear();
}
}
fish.Add(sb.ToString().Trim()); // Add the last result to the list
for (int i = 0; i < fish.Count; i += 2)
{
// This for loop assumes that the amount of keys and values added together
// is an even number. If it comes out odd, then one of the lines on the input
// text file wasn't parsed correctly or wasn't generated correctly.
dictInput.Add(fish[i], fish[i + 1]);
}
}
So the only general approach that I can think of, given the format that you're limited to, is to first find the first colon on the line and take everything before it as the label. Skip all whitespace characters until you get to the first non-whitespace character. Take all non-whitespace characters as the value of the label. If there is a colon after the end of that value, take everything from the end of the previous value up to that colon as the next label and repeat. You'll also probably need to trim whitespace around the labels.
You might be able to capture that meaning with a regex, but it wouldn't likely be a pretty one if you could; I'd avoid it for something this complex unless your entire development team is very proficient with them.
I would try something like this:
While the string contains a triple space, replace it with a double space.
Replace all ": " (colon with a single space) and ":  " (colon with a double space) with ":".
Replace all "  " (double space) with '\n' (new line).
If a line doesn't contain ':', skip it. Otherwise, use string.Split(':'). This way you receive arrays of 2 strings (key and value). Some of them may contain whitespace at the beginning or at the end.
Use string.Trim() to get rid of that whitespace.
Add the resulting key and value to the Dictionary.
I am not sure if it solves all your cases but it's a general clue how I would try to do it.
If it works you could think about performance (use StringBuilder instead of string wherever it is possible etc.).
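A rough C# sketch of the same idea follows. It is a variation on the steps above, not a literal transcription: instead of repeated string replaces it uses two regular expressions, and it assumes that two or more spaces always separate one Label:Value pair from the next. The name LabelValueParser is made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LabelValueParser
{
    public static Dictionary<string, string> Parse(IEnumerable<string> lines)
    {
        var result = new Dictionary<string, string>();
        foreach (string line in lines)
        {
            // Remove any whitespace after a colon so a value stays glued to its label.
            string normalized = Regex.Replace(line.Trim(), @":\s+", ":");

            // Assumption: runs of two or more spaces separate the pairs on a line.
            foreach (string chunk in Regex.Split(normalized, @"\s{2,}"))
            {
                int colon = chunk.IndexOf(':');
                if (colon < 0) continue; // no label in this chunk, skip it

                string key = chunk.Substring(0, colon).Trim();
                string value = chunk.Substring(colon + 1).Trim();
                if (key.Length > 0)
                    result[key] = value; // a repeated label overwrites the earlier value
            }
        }
        return result;
    }
}
```

Passing File.ReadLines("text_path_here.txt") as the argument would fill the dictionary one line at a time. Values that themselves contain a colon or a double space are not handled by this sketch.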
This is probably the dirtiest function I've ever written, but it works.
StreamReader reader = new StreamReader("c:/yourFile.txt");
Dictionary<string, string> yourDic = new Dictionary<string, string>();
while (reader.Peek() >= 0)
{
string line = reader.ReadLine();
string[] data = line.Split(':');
if (line != String.Empty)
{
for (int i = 0; i < data.Length - 1; i++)
{
if (i != 0)
{
bool isPair;
if (i % 2 == 0)
{
isPair = true;
}
else
{
isPair = false;
}
if (isPair)
{
string keyOdd = data[i].Trim();
try { keyOdd = keyOdd.Substring(keyOdd.IndexOf(' ')).TrimStart(); }
catch { }
string valueOdd = data[i + 1].TrimStart();
try { valueOdd = valueOdd.Remove(valueOdd.IndexOf(' ')); } catch{}
yourDic.Add(keyOdd, valueOdd);
}
else
{
string keyPair = data[i].TrimStart();
keyPair = keyPair.Substring(keyPair.IndexOf(' ')).Trim();
string valuePair = data[i + 1].TrimStart();
try { valuePair = valuePair.Remove(valuePair.IndexOf(' ')); } catch { }
yourDic.Add(keyPair, valuePair);
}
}
else
{
string key = data[i].Trim();
string value = data[i + 1].TrimStart();
try { value = value.Remove(value.IndexOf(' ')); } catch{}
yourDic.Add(key, value);
}
}
}
}
How does it work? Well, after splitting the line you know what to expect in every position of the array, so I just play with the even and odd indexes.
You will understand me when you debug this function :D. It fills the Dictionary that you need.
I have another idea. Do the values contain spaces? If not, you could do it like this:
1) Ignore whitespace until you read some other char (the first char of the key).
2) Read the string until ':' occurs.
3) Trim the key that you get.
4) Ignore whitespace until you read some other char (the first char of the value).
5) Read until you get a whitespace char.
6) Trim the value that you get.
7) If it is the end, then stop. Otherwise, go back to step 1.
Good luck.
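The scanning loop described in this answer could look roughly like the C# sketch below. It relies on the answer's own assumption that values never contain whitespace (labels may); PairScanner and Scan are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PairScanner
{
    // Scans one line and returns the Label:Value pairs found on it, in order.
    public static List<KeyValuePair<string, string>> Scan(string line)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        int i = 0;
        while (i < line.Length)
        {
            // Skip whitespace in front of the label.
            while (i < line.Length && char.IsWhiteSpace(line[i])) i++;

            // The label runs up to the next colon; stop if there is none.
            int colon = line.IndexOf(':', i);
            if (colon < 0) break;
            string key = line.Substring(i, colon - i).Trim();

            // Skip whitespace between the colon and the value.
            i = colon + 1;
            while (i < line.Length && char.IsWhiteSpace(line[i])) i++;

            // The value runs up to the next whitespace character (or end of line).
            int start = i;
            while (i < line.Length && !char.IsWhiteSpace(line[i])) i++;

            pairs.Add(new KeyValuePair<string, string>(key, line.Substring(start, i - start)));
        }
        return pairs;
    }
}
```

Because the value ends at the first whitespace character, this works even when only a single space separates two pairs on the same line, as in the sample "AUX Serial Number: 445P000023 AUX Hardware Rev: 1".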
Maybe something like this would work, be careful with the ':' character
StreamReader reader = new StreamReader("c:/yourFile.txt");
Dictionary<string, string> yourDic = new Dictionary<string, string>();
while (reader.Peek() >= 0)
{
string line = reader.ReadLine();
yourDic.Add(line.Split(':')[0], line.Split(':')[1]);
}
Anyway, I recommend organizing that file in some way so that you'll always know what format it comes in.

C#: Checking That ArrayList Elements have specific type

ArrayList fileList = new ArrayList();
private void button2_Click(object sender, EventArgs e)
{
if (openFileDialog1.ShowDialog() == DialogResult.OK)
{
string line;
// Read the file and display it line by line.
System.IO.StreamReader file = new System.IO.StreamReader(openFileDialog1.FileName);
while ((line = file.ReadLine()) != null)
{
// Puts elements in table
fileList.Add(line.Split(';'));
}
file.Close();
}
for (int i = 0; i < fileList.Count; i++)
{
for (int x = 0; x < (fileList[i] as string[]).Length; x++)
{
// if (x ==0)
// {
//fileList[0] must Be int
// }
// if (x==1)
//fileList[1] must be string
this.textBox2.Text += ((fileList[i] as string[])[x] + " ");
}
this.textBox2.Text += Environment.NewLine;
}
}
I am so far here.
I take the elements from a CSV file.
I now need to be sure that the first column has only integers (1,2,3,4,5), the second column has only names (so it will have the type string or character), the third surnames, and so on.
The rows are presented like this : 1;George;Mano;
How can I be sure that the CSV file has the correct types?
I think that any more code about this problem will be placed inside the 2 for statements.
Thank you very much,
George.
I think your question needs more work.
You don't show your declaration for fileList. Whatever it is, there is no reason to convert it to string[] just to get the length. The length will be the same no matter what type it is. You cannot use this method to determine which items are strings.
You'll need to loop through the items and see if they contain only digits or whatever.
Also, your code to read CSV files is not quite right. CSV files are comma-separated. And it's possible that they could contain commas within double quotes. These commas should be ignored. A better way to read CSV files can be seen here.
An Arraylist contains object.
System.IO.StreamReader.ReadLine returns a String.
Checking the value of the first line read and trying to convert the string into an integer would be a valid approach.
Your current approach is adding the String that is returned by System.IO.StreamReader.ReadLine into your collection which you later turn into a String[] by using the String.Split method.
Your other requirements will be a great deal more difficult because every line you are reading is a String already. So you would have to look at each character within the string to determine if it appears to be a name.
In other words you might want to find a different way to provide an input. I would agree that a regular expression might be the best way to get rid of junk data.
Edit: Now that we know it's really CSV, here's a columnar answer ;-)
Your ArrayList contains string[], so you need to verify that each array has the appropriate type of string.
for (int i = 0; i < fileList.Count; i++)
{
string[] lineItems = (string[])fileList[i];
if (!Regex.IsMatch (lineItems[0], @"^\d+$")) // ids - digits only
throw new ArgumentException ("invalid id at row " + i);
if (!Regex.IsMatch (lineItems[1], "^[a-zA-Z]+$")) // names - letters only
throw new ArgumentException ("invalid name at row " + i);
if (!Regex.IsMatch (lineItems[2], "^[a-zA-Z]+$")) // surnames - letters only
throw new ArgumentException ("invalid surname at row " + i);
}
You can use Regex class.
The first column must be an int (each fileList[i] is a string[], so cast it before indexing):
int x;
if (int.TryParse(((string[])fileList[i])[0], out x)) { /* do whatever here; x holds the integer value. TryParse returns false if the text is not an integer, so this block is skipped. */ }
The second column must be a string:
Iterate over the string and check that each character is a letter. Look at the static char methods, such as char.IsLetter, for the appropriate one.
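Putting the two checks together, a per-row validator might look like the sketch below. The column order (id, name, surname) is taken from the question's sample row "1;George;Mano;"; RowValidator and Validate are hypothetical names, and the letters-only rule is an assumption that would reject names containing spaces or hyphens.

```csharp
using System;
using System.Linq;

public static class RowValidator
{
    // Returns null when the row looks valid, otherwise a short description
    // of the first problem found.
    public static string Validate(string[] fields)
    {
        if (fields == null || fields.Length < 3)
            return "expected at least 3 columns";

        int id;
        if (!int.TryParse(fields[0], out id))
            return "column 1 is not an integer";
        if (!IsLettersOnly(fields[1]))
            return "column 2 is not a letters-only name";
        if (!IsLettersOnly(fields[2]))
            return "column 3 is not a letters-only surname";
        return null;
    }

    // True when the string is non-empty and every character is a letter.
    private static bool IsLettersOnly(string s)
    {
        return s.Length > 0 && s.All(char.IsLetter);
    }
}
```

Inside the existing loop this could be called as RowValidator.Validate((string[])fileList[i]), reporting the row index whenever the result is not null.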

How do I handle line breaks in a CSV file using C#?

I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary","21","555-5555"
When I read the CSV file, if a record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here is what I've done so far. My records have a fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I checked for the ; in the [3] position of each line. If true, I write; if false, I'll append on the last (removing the line-break)
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
namespace EditorCSV
{
class Program
{
static void Main(string[] args)
{
ReadFromFile("c:\\source.csv");
}
static void ReadFromFile(string filename)
{
StreamReader SR;
StreamWriter SW;
SW = File.CreateText("c:\\target.csv");
string S;
char C='a';
int i=0;
SR=File.OpenText(filename);
S=SR.ReadLine();
SW.Write(S);
S = SR.ReadLine();
while(S!=null)
{
try { C = S[3]; }
catch (IndexOutOfRangeException exception){
bool t = false;
while (t == false)
{
t = true;
S = SR.ReadLine();
try { C = S[3]; }
catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
}
}
if( C.Equals(';'))
{
SW.Write("\r\n" + S);
i = i + 1;
}
else
{
SW.Write(S);
}
S=SR.ReadLine();
}
SR.Close();
SW.Close();
Console.WriteLine("Records Processed: " + i.ToString() + " .");
Console.WriteLine("File Created Successfully");
Console.ReadKey();
}
}
}
CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason to not use a solid, open source library for reading and writing CSV files to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
parser.SetDelimiters(separators);
while (!parser.EndOfData)
yield return parser.ReadFields();
}
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
var row = new List<string>();
var isStringBlock = false;
var sb = new StringBuilder();
long charIndex = 0;
int currentLineCount = 0;
while (reader.Peek() != -1)
{
charIndex++;
char c = (char)reader.Read();
if (c == '"')
isStringBlock = !isStringBlock;
if (c == separator && !isStringBlock) //end of word
{
row.Add(sb.ToString().Trim()); //add word
sb.Length = 0;
}
else if (c == '\n' && !isStringBlock) //end of line
{
row.Add(sb.ToString().Trim()); //add last word in line
sb.Length = 0;
//DO SOMETHING WITH row HERE!
currentLineCount++;
row = new List<string>();
}
else
{
if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
}
}
row.Add(sb.ToString().Trim()); //add last word
//DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count the double quotes (") in each line during the ReadLine(). If the count is odd, that will raise the flag. You could either ignore those lines, or get the next one and eliminate the first "\n" occurrence in the merged lines.
What I usually do is read the text in character by character opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, and also tell a line break between rows from one inside a cell: if I remember correctly, for Excel-generated files anyway, rows end with \r\n, and newlines in cells are only \n.
There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
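The steps above can be sketched as follows, using a naive split on the separator. This assumes every record has a known, fixed number of columns and that fields never contain the separator itself; ColumnCountMerger and ReadRecords are made-up names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ColumnCountMerger
{
    // Reads records, gluing physical lines together until a record has
    // at least the expected number of columns.
    public static List<string[]> ReadRecords(TextReader reader, char separator, int expectedColumns)
    {
        var records = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string record = line;
            while (record.Split(separator).Length < expectedColumns)
            {
                string next = reader.ReadLine();
                if (next == null) break;   // end of input: keep what we have
                record += next;            // the stray line break is dropped here
            }
            records.Add(record.Split(separator));
        }
        return records;
    }
}
```

One limitation worth noting: a stray line break inside the last field of a record cannot be detected this way, because the column count is already complete when the break occurs.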
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k'q'|(?<field>[^,]*)))+$");
if (match.Success)
{
foreach (Capture capture in match.Groups["field"].Captures)
{
string fieldValue = capture.Value;
// Use the value.
}
}
Have a look at FileHelpers Library
It supports reading\writing CSV with line breaks as well as reading\writing to excel
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSV cells must open and close a double quote, just check whether there's an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have the even number.
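As a sketch of that approach, the reader below keeps appending physical lines to the current record while the running count of double quotes is odd. It assumes, as stated above, that quotes are always balanced within a complete record (escaped quotes written as "" keep the count even), and it drops the stray line break rather than preserving it. QuoteAwareLineReader and ReadLogicalLines are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class QuoteAwareLineReader
{
    // Yields one string per logical CSV record, merging lines that were
    // split by a line break inside a quoted field.
    public static IEnumerable<string> ReadLogicalLines(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            var record = new StringBuilder(line);
            int quotes = line.Count(c => c == '"');

            // An odd count means a quoted field is still open.
            while (quotes % 2 == 1)
            {
                string next = reader.ReadLine();
                if (next == null) break;   // unterminated quote at end of input
                record.Append(next);       // line break removed on purpose
                quotes += next.Count(c => c == '"');
            }
            yield return record.ToString();
        }
    }
}
```

With the question's sample, the two physical lines of the "Peter" record come back as a single logical line, which can then be handed to whatever field splitter or CSV library is already in use.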
