See the code below. I'm opening a .CSV file and reading it into a listbox, but rather than coming across as
X
Y
Z
it is
"X"
"Y"
"Z"
Relevant code is:
if (ofdCSV.ShowDialog() == DialogResult.OK)
{
    list.Visible = true;
    StreamReader sr = new StreamReader(ofdCSV.FileName);
    string currentLine;
    while ((currentLine = sr.ReadLine()) != null)
    {
        list.Items.Add(currentLine);
    }
}
Any ideas? I've looked around for a while, but I'm still a novice with this, so I'm not entirely sure what to even look for.
CSV files are a common example of something that is invariably more complex than it first appears. The process usually starts with you thinking that you know how CSV files work and writing some simple code. You then gradually accumulate more and more code as you discover more and more edge cases. The final stage is when all your code is eventually discarded in favor of a third-party CSV parser, such as FileHelpers.
The .NET Framework has TextFieldParser, which does a pretty good job if you set
Delimiters = new string[] { "," };
HasFieldsEnclosedInQuotes = true;
A more detailed explanation can be found here.
It's less overhead than using a whole other library, but it may fall over if you feed it a line like
caltrops,"10' pole, 1" diameter",lunch
But you probably don't want to regex that either.
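For reference, here is a minimal sketch of what that looks like wired into the code from the question (ofdCSV and list are the OpenFileDialog and ListBox from the question; you need a project reference to the Microsoft.VisualBasic assembly):
using Microsoft.VisualBasic.FileIO;

if (ofdCSV.ShowDialog() == DialogResult.OK)
{
    list.Visible = true;
    using (var parser = new TextFieldParser(ofdCSV.FileName))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.Delimiters = new string[] { "," };
        parser.HasFieldsEnclosedInQuotes = true;
        while (!parser.EndOfData)
        {
            // ReadFields strips the enclosing quotes for you.
            foreach (string field in parser.ReadFields())
            {
                list.Items.Add(field);
            }
        }
    }
}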
You can try using Replace to remove the quotes.
list.Items.Add(currentLine.Replace("\"", "").Trim());
or
list.Items.Add(currentLine.Replace("\"", string.Empty));
I have written the following method, which to my knowledge escapes values properly when creating CSV data. The major point here is that CSV data can span multiple lines. Hope this helps.
public static string Escape(string val)
{
    if (val == null)
        return "";
    // Trim trailing line breaks.
    while (val.EndsWith("\r\n"))
        val = val.Remove(val.Length - 2, 2);
    // Replace tabs with spaces and normalize line breaks to \n.
    if (val.Contains("\t"))
        val = val.Replace("\t", " ");
    if (val.Contains("\r\n"))
        val = val.Replace("\r\n", "\n");
    if (val.Contains("\r"))
        val = val.Replace("\r", "\n");
    // Double up any embedded quotes.
    if (val.Contains("\""))
        val = val.Replace("\"", "\"\"");
    // Wrap in quotes if the value contains a comma, quote, line break,
    // or a leading/trailing space.
    if (val.Contains(",") ||
        val.Contains("\"") ||
        val.Contains("\n") ||
        val.StartsWith(" ") ||
        val.EndsWith(" "))
    {
        val = "\"" + val + "\"";
    }
    return val;
}
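For example, a hypothetical caller writing one row with it (the file name and field values are placeholders; needs using System.Linq and System.IO):
string[] fields = { "caltrops", "10' pole, 1\" diameter", "lunch" };
using (StreamWriter writer = File.CreateText("out.csv"))
{
    // Escape each field, then join with commas to form the CSV line.
    writer.WriteLine(string.Join(",", fields.Select(Escape)));
}
// Writes: caltrops,"10' pole, 1"" diameter",lunch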
I am making a simple compiler, and am working on string parsing. At the moment, my code is:
while (stringToParse.Contains(" + ") || stringToParse.Contains("+ ") || stringToParse.Contains(" +")) {
stringToParse = stringToParse.Replace(" +", "+").Replace("+ ", "+").Replace(" + ", "+");
}
string[] splitString = stringToParse.Split("+");
But something like:
"\"hello \" + \"world \" + \" + \" + \"hello\""
(that is, without the backslash escaping: "hello " + "world " + " + " + "hello")
would return:
[""hello "", ""world "", """, """, ]
So how can I tell whether a " + " is inside a string or acting as a separator? Is there maybe a way to detect something like the following?
...(any number of non " or + characters)...+...(any number of " or + characters)
My expected output would be:
[""hello "", ""world "", ""+""]
Explicit State Machine
To do this without using any dedicated library, I suggest building a state machine.
You will iterate over the characters of the string and, depending on which character you encounter, update the state of the machine. Optimizations are possible; however, let us begin with conventional clarity.
var characters = input.ToCharArray();
var results = new List<string>();
var current = string.Empty;
// 0 = not inside quotes, we expect +
// 1 = not inside quotes, we expect "
// 2 = inside quotes
var state = 1;
foreach (var character in characters)
{
    switch (state)
    {
        case 0:
            // We are not inside quotes, we expect +
            if (character == '+')
            {
                state = 1;
                continue;
            }
            if (char.IsWhiteSpace(character))
            {
                continue;
            }
            // error?
            break;
        case 1:
            // We are not inside quotes, we expect "
            if (character == '\"')
            {
                state = 2;
                continue;
            }
            if (char.IsWhiteSpace(character))
            {
                continue;
            }
            // error?
            break;
        case 2:
            // We are inside quotes, we expect "
            if (character == '\"')
            {
                state = 0;
                results.Add(current);
                current = string.Empty;
                continue;
            }
            current += character;
            break;
        default:
            // error?
            break;
    }
}
if (state != 0)
{
    // error
}
// You can use results.ToArray();
Possible optimizations:
We can use a StringBuilder instead of concatenations.
Also, we can use IndexOf to find the next relevant character.
We can check if a string (a chunk of characters) is empty or white space (perhaps using IsNullOrWhiteSpace).
We can use AsSpan so we can work with ReadOnlySpan instead.
You can also see how you can add support for your own escape sequences, or any other stuff.
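As a small illustration of the first point, here is a compact variant of the same loop that accumulates the current value in a StringBuilder (the error checks and whitespace handling from above are omitted; this is just a sketch):
// using System.Collections.Generic; using System.Text;
static List<string> ParseQuotedStrings(string input)
{
    var results = new List<string>();
    var current = new StringBuilder(); // avoids allocating a new string per character
    var state = 1;                     // same state meanings as above
    foreach (var character in input)
    {
        switch (state)
        {
            case 0: // expecting +
                if (character == '+') state = 1;
                break;
            case 1: // expecting an opening "
                if (character == '"') state = 2;
                break;
            case 2: // inside quotes
                if (character == '"')
                {
                    results.Add(current.ToString());
                    current.Clear();
                    state = 0;
                }
                else
                {
                    current.Append(character);
                }
                break;
        }
    }
    return results;
}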
Implicit State Machine (with helper class)
I want to point out that this is not the only way to organize this code. I would, if I were you, create a pseudo-iterator class that had two methods:
A method that returns the next character... or better yet, that returns true if the next character matches a parameter (and advances), or false (and does not advance).
A method that returns all the characters until the next instance of a particular character (and advances to there).
The main advantage of such an approach is that I would no longer have to step character by character; thus, I would not need a state variable. Instead I could allow the code structure to resemble the shape of my grammar.
Wait, I have written such a class: StringProcessor. It is part of the Theraot.Core NuGet package, where it is used to parse strings into BigInteger.
var processor = new Theraot.Core.StringProcessor(input);
var results = new List<string>();
while (!processor.EndOfString)
{
    // SkipWhile skips all the characters that match
    processor.SkipWhile(char.IsWhiteSpace);
    // Read returns true (and advances past it) if what is next matches the parameter
    if (processor.Read('"'))
    {
        // ReadUntil returns everything found before the parameter and advances up to it
        // Note: it does not advance past the parameter.
        results.Add(processor.ReadUntil('"'));
        processor.Read('"');
    }
    processor.SkipWhile(char.IsWhiteSpace);
    if (!processor.Read('+'))
    {
        // error?
    }
}
Please notice that a class such as the StringProcessor used above cuts a lot of fluff, which makes it viable for simple languages.
Custom Tokenizer
Of course, for something more complex you might want to look for a tokenizer.
To give you an example, consider that this is the "grammar" we have:
Document: Many
{
    Whitespace
    String:
    {
        QuoteSymbol
        NonQuoteSymbol
        QuoteSymbol
    }
    Whitespace
    PlusSymbol
}
No, this is not any of the usual metalanguages. However, written this way it is easier to see how the code we had above resembles the language.
Would it not be nice to write as follows?
var QuoteSymbol = Pattern.Literal("QuoteSymbol", '"');
var NonQuoteSymbol = Pattern.Custom("NonQuoteSymbol", s => s.ReadUntil('"'));
var String = Pattern.Conjunction("String", QuoteSymbol, NonQuoteSymbol, QuoteSymbol);
var WhiteSpace = Pattern.Custom("WhiteSpace", s => s.ReadWhile(char.IsWhiteSpace));
var PlusSymbol = Pattern.Literal("PlusSymbol", '+');
var Document = Pattern.Repetition(
    Pattern.Conjunction(WhiteSpace, String, WhiteSpace, PlusSymbol)
);
var results = from TerminalSymbol symbol
              in Document.Parse(input)
              where symbol.Pattern == String
              select symbol.ToString();
Writing code like that would make it easier to modify the language. Well, we are still writing code; however, you could imagine parsing a file that has the grammar of the language you want to parse... Fancy!
As you might expect, it requires extra work to build the necessary code to make it work. Or, you know, get some code that already works (the linked code is built around StringProcessor).
Language Toolkits
The code presented earlier is not suitable for use in a pretty-printer, is not capable of recovering from a syntax error, and will not integrate with code editors at any level. It can be modified to do such things.
If you want a fully fledged solution, I have two suggestions:
Irony
Nitra
These are the kind of things you would use if you wanted to create a programming language on top.
And of course, I should link you to "Compilers: Principles, Techniques, and Tools" usually just known as "The Dragon Book".
I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which is taking a lot of time. I did try with HashSet and ReadAllLines.
HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I try to search for the string, it doesn't match, as it is looking for a match of the entire row. I just want to check if the string appears in the row.
I had tried using this:
using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}

bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
        if (LineText.Contains(publisher))
        {
            tvals = true;
        }
        else
            tvals = false;
    else
        tvals = false;
    return tvals;
}
But this is consuming too much time. Any help on this would be good.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.
Taking a really naive approach you could just do this.
var isItThere = File.ReadAllLines(@"d:\docs\st.txt")
    .Any(x => x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be superfast to do anyway.
You could replace Any with First to find the first matching line, or Where to get an IEnumerable<string> containing all results.
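For example, to get all matching lines rather than a yes/no answer (same variable names as above; needs using System.Linq):
var matches = File.ReadAllLines(@"d:\docs\st.txt")
    .Where(x => x.Contains(date_to_chk) && x.Contains(publisher))
    .ToList();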
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);
foreach (string line in File.ReadLines(#"D:\Doc\Tst.txt"))
{
if (regex.IsMatch(line)) break;
}
This also shows a convenient standard library function for reading a file line by line.
Or, depending on what you want to do...
var isItThere = File.ReadLines(#"D:\Doc\Tst.txt").Any(regex.IsMatch);
I've got a CSV string and I want to separate it into an array. However, the CSV is a mix of strings and numbers, where the strings are enclosed in quotes and may contain commas.
For example, I might have a CSV as follows:
1,"Hello",2,"World",3,"Hello, World"
I would like it so the string is split into:
1
"Hello"
2
"World"
3
"Hello, World"
If I use String.Split(','); I get:
1
"Hello"
2
"World"
3
"Hello
World"
Is there an easy way of doing this? A library that is already written or do I have to parse the string character by character?
The "A Fast CSV Reader" article on Code Project. I've used it happily many times.
String.Split() is icky for this. Not only does it have nasty corner cases where it doesn't work, like the one you just found (and others you haven't seen yet), but its performance is less than ideal as well. The FastCSVReader posted by others will work, there's a decent CSV parser built into the framework (Microsoft.VisualBasic.TextFieldParser), and I have a simple parser that behaves correctly posted to this question.
I would suggest using one of the following solutions; I was just testing a few of them (hence the delay):
Regex matching commas not found within enclosing double quotes (see the sketch below)
A Fast CSV Reader - for reading CSV only
FileHelpers Library 2.0 - for reading/writing CSV
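For the first option, the usual trick is a lookahead that only splits on commas followed by an even number of quotes, i.e. commas that sit outside the quoted fields (a sketch that assumes well-formed, non-escaped quoting; needs using System.Text.RegularExpressions):
string line = "1,\"Hello\",2,\"World\",3,\"Hello, World\"";
// Split on commas that are not inside double quotes.
string[] fields = Regex.Split(line, ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
// fields: 1 | "Hello" | 2 | "World" | 3 | "Hello, World"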
Hope this helps.
It's not the most elegant solution, but it's the quickest if you just want to copy and paste code (avoiding having to import DLLs or other code libraries):
private string[] splitQuoted(string line, char delimiter)
{
    List<string> list = new List<string>();
    do
    {
        if (line.StartsWith("\""))
        {
            // Quoted field: find the closing quote, skipping doubled ("") quotes.
            line = line.Substring(1);
            int idx = line.IndexOf("\"");
            while (line.IndexOf("\"", idx) == line.IndexOf("\"\"", idx))
            {
                idx = line.IndexOf("\"\"", idx) + 2;
            }
            idx = line.IndexOf("\"", idx);
            list.Add(line.Substring(0, idx));
            if (idx + 1 >= line.Length)
            {
                // The quoted field was the last field on the line.
                return list.ToArray();
            }
            // Skip the closing quote and the delimiter that follows it.
            line = line.Substring(idx + 2);
        }
        else
        {
            int idx = line.IndexOf(delimiter);
            if (idx < 0)
            {
                // No more delimiters: whatever is left is the final field.
                break;
            }
            list.Add(line.Substring(0, idx));
            line = line.Substring(idx + 1);
        }
    }
    while (line.Length > 0);
    list.Add(line);
    return list.ToArray();
}
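For example, with the CSV line from the question:
string[] parts = splitQuoted("1,\"Hello\",2,\"World\",3,\"Hello, World\"", ',');
// parts: 1 | Hello | 2 | World | 3 | Hello, World (enclosing quotes stripped)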
I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary,"21","555-5555"
When I read the CSV file, if the record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here's is what I've done so far. My records have fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I check for the ; in position [3] of each line. If it's there, I write the line; if not, I append it to the previous one (removing the line break).
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to csv by saving as csv in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to csv, and I would really like to do it in the program. Does anybody know how?
Here is my code:
namespace EditorCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            ReadFromFile("c:\\source.csv");
        }

        static void ReadFromFile(string filename)
        {
            StreamReader SR;
            StreamWriter SW;
            SW = File.CreateText("c:\\target.csv");
            string S;
            char C = 'a';
            int i = 0;
            SR = File.OpenText(filename);
            S = SR.ReadLine();
            SW.Write(S);
            S = SR.ReadLine();
            while (S != null)
            {
                try { C = S[3]; }
                catch (IndexOutOfRangeException exception)
                {
                    bool t = false;
                    while (t == false)
                    {
                        t = true;
                        S = SR.ReadLine();
                        try { C = S[3]; }
                        catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
                    }
                }
                if (C.Equals(';'))
                {
                    SW.Write("\r\n" + S);
                    i = i + 1;
                }
                else
                {
                    SW.Write(S);
                }
                S = SR.ReadLine();
            }
            SR.Close();
            SW.Close();
            Console.WriteLine("Records Processed: " + i.ToString() + ".");
            Console.WriteLine("File Created Successfully");
            Console.ReadKey();
        }
    }
}
CSV has predefined ways of handling that. This site provides an easy to read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason not to use a solid, open source library for reading and writing CSV files, to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
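A rough sketch of that idea (it assumes, as above, that every record ends with a quoted field, so a complete record always ends with a closing quote; the file name is a placeholder):
using (var reader = new StreamReader("source.csv"))
{
    string record;
    while ((record = reader.ReadLine()) != null)
    {
        // Keep appending lines until the record ends with a closing quote.
        while (!record.EndsWith("\"") && reader.Peek() != -1)
        {
            record += reader.ReadLine();
        }
        // record now holds one logical row; process it here.
    }
}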
There is a built-in method for reading CSV files in .NET (requires Microsoft.VisualBasic assembly reference added):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
    var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
    parser.SetDelimiters(separators);
    while (!parser.EndOfData)
        yield return parser.ReadFields();
}
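Usage is straightforward; for example (the file name is just a placeholder):
using (var reader = new StreamReader(@"C:\data\source.csv"))
{
    foreach (string[] fields in ReadSV(reader, ","))
    {
        // Each fields array is one parsed record.
        Console.WriteLine(string.Join(" | ", fields));
    }
}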
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
    var row = new List<string>();
    var isStringBlock = false;
    var sb = new StringBuilder();
    long charIndex = 0;
    int currentLineCount = 0;

    while (reader.Peek() != -1)
    {
        charIndex++;
        char c = (char)reader.Read();

        if (c == '"')
            isStringBlock = !isStringBlock;

        if (c == separator && !isStringBlock) //end of word
        {
            row.Add(sb.ToString().Trim()); //add word
            sb.Length = 0;
        }
        else if (c == '\n' && !isStringBlock) //end of line
        {
            row.Add(sb.ToString().Trim()); //add last word in line
            sb.Length = 0;
            //DO SOMETHING WITH row HERE!
            currentLineCount++;
            row = new List<string>();
        }
        else
        {
            if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
        }
    }
    row.Add(sb.ToString().Trim()); //add last word
    //DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count the (") characters during ReadLine(). If the count is odd, that will raise the flag. You could either ignore those lines, or get the next two and eliminate the first "\n" occurrence of the merged lines.
What I usually do is read the text in character by character, as opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, but also the difference between a line break in a row and in a cell: if I remember correctly, for Excel-generated files anyway, rows start with \r\n, and newlines in cells are only \r.
There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns(fields).
If you have enough columns expected for each line, then process.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat.
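A sketch of those steps (the expected column count and the file name are assumptions for the example):
const int expectedColumns = 3; // assumption for the example
using (var reader = new StreamReader(@"C:\source.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string record = line;
        // Capture following lines until the record has enough columns.
        while (record.Split(',').Length < expectedColumns && reader.Peek() != -1)
        {
            record += reader.ReadLine();
        }
        string[] columns = record.Split(',');
        // Process the columns here.
    }
}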
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>[""'])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
foreach (var capture in match.Groups["field"].Captures)
{
string fieldValue = capture.Value;
// Use the value.
}
}
Have a look at the FileHelpers Library.
It supports reading/writing CSV with line breaks as well as reading/writing Excel.
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
.Replace(Environment.NewLine, string.Empty)
.Replace("\"\"", "\",\"").Split(',')
.Select((i, n) => new { i, n }).GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but, if wanted, can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSV cells must open and close with a double quote, just check if there's an odd number of quotation marks
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have the even number.
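Put together as a sketch (same LINQ count as above; needs using System.Linq and System.IO):
using (var reader = new StreamReader("source.csv"))
{
    string record;
    while ((record = reader.ReadLine()) != null)
    {
        // An odd number of quotes means the record continues on the next line.
        while (record.Count(c => c == '"') % 2 == 1 && reader.Peek() != -1)
        {
            record += "\n" + reader.ReadLine();
        }
        // record is now one complete row.
    }
}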
I have a function like this:
List<float> myList = new List<float>();
public void numbers(string filename)
{
    string input;
    float number;
    if (System.IO.File.Exists(filename) == true)
    {
        System.IO.StreamReader objectReader;
        objectReader = new System.IO.StreamReader(filename);
        while ((input = objectReader.ReadLine()) != null)
        {
            number = Convert.ToSingle(input);
            myList.Add(number);
        }
        objectReader.Close();
    }
    else
    {
        MessageBox.Show("No Such File" + filename);
    }
}
Where I'm trying to add numbers (floats) from a text file into a List, but I keep getting errors saying the format is wrong. The numbers in the text file are one number per line... any help?
I would suggest you do a Trim call like this
number = Convert.ToSingle(input.Trim());
However, better code would use a TryParse call:
float tmp;
if (float.TryParse(input.Trim(), out tmp))
{
    myList.Add(tmp);
}
Your code worked fine for me except for the case of a newline (and of course for entries that were not numbers at all).
Here is a version that should work for you, using TryParse to check whether each line can be converted to a Single:
public void Numbers(string filename)
{
    List<float> myList = new List<float>();
    string input;
    if (System.IO.File.Exists(filename) == true)
    {
        System.IO.StreamReader objectReader;
        objectReader = new System.IO.StreamReader(filename);
        while ((input = objectReader.ReadLine()) != null)
        {
            Single output;
            if (Single.TryParse(input, out output))
            {
                myList.Add(output);
            }
            else
            {
                // Huh? Should this happen, maybe some logging can go here to track down why you couldn't just use the .Convert()
            }
        }
        objectReader.Close();
    }
    else
    {
        MessageBox.Show("No Such File" + filename);
    }
}
As Mike C rightly points out, this could be risky - swallowing good data that has been corrupted by the output process. The TryParse method returns false when it fails, so you could add an else branch and some logging to check just what is causing the failures and see if there is another bug floating around that can be corrected.
Do you have any blank lines in the file, or failures to convert the number? My guess is that you have a line which is not castable to float from its current format. You should make sure you sanitize the lines before reading them in (strip off everything that is not a number using a regex) and throw the line out if it fails the check.
One thing you might do is use double instead and do a Convert.ToDouble().
Are there spaces or commas or anything? The best thing to do would be to set a breakpoint on
number = Convert.ToSingle(input);
to see what input is actually before you try to convert it.
There's a wonderful free package called FileHelpers which helps with importing data from all sorts of text files. The advantage with this is that a lot of the deeper error handling is already in place.
By the way,
if (System.IO.File.Exists(filename) == true)
can be shortened to
if (System.IO.File.Exists(filename))