Read a CSV file and write into a file without " " using C#

I am trying to read a CSV file and store all the values in a single list. The CSV file contains credentials as uid (user id) and pass (password), separated by ','. I have successfully read all the lines and written them to a file, but when it writes to the file, it writes the values between " " (double quotes), like ("abcdefgh3 12345678"). What I actually want is to remove these "" double quote signs when I write to the file. I am pasting my code here:
static void Main(string[] args)
{
    var reader = new StreamReader(File.OpenRead(@"C:\Desktop\userid1.csv"));
    List<string> listA = new List<string>();
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        var values = line.Split(',');
        listA.Add(values[0]);
        listA.Add(values[1]);
    }
    foreach (string a in listA)
    {
        TextWriter tr = new StreamWriter(@"E:\newfiless", true);
        tr.Write(a);
        tr.Write(tr.NewLine);
        tr.Close();
    }
}
and the resulting output is like this:
"uid
pass"
"Martin123
123456789"
"Damian
91644"
but I want it in this form:
uid
pass
Martin123
123456789
Damian
91644
Thanking you all in advance.

The original file clearly has quotes, which makes it a CSV file with only one column, and in that column there are two values. Not usual, but it happens.
To actually remove quotes you can use Trim, TrimEnd or TrimStart.
You can remove the quotes while reading, or while writing, in this case it doesn't really matter.
var line = reader.ReadLine().Trim('"');
This will remove the quotes while reading. Note that this assumes the CSV is of this "broken" variant.
tr.WriteLine(a.Trim('"'));
This will handle it on write. This will work even if the file is "correct" CSV having two columns and values in quotes.
Note that you can use WriteLine to add the newline, no need for two Write calls.
Also as others have commented, don't create a TextWriter in the loop for every value, create it once.
using (TextWriter tr = new StreamWriter(@"E:\newfiless"))
{
    foreach (string a in listA)
    {
        tr.WriteLine(a.Trim('"'));
    }
}
The using will take care of closing the file and other possible resources even if there is an exception.
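Putting the pieces together, the whole program might look like this (a sketch that reuses the question's paths and assumes the same "broken" quoting):
static void Main(string[] args)
{
    var listA = new List<string>();
    using (var reader = new StreamReader(File.OpenRead(@"C:\Desktop\userid1.csv")))
    {
        while (!reader.EndOfStream)
        {
            // strip the wrapping quotes, then split the two values apart
            var values = reader.ReadLine().Trim('"').Split(',');
            listA.Add(values[0]);
            listA.Add(values[1]);
        }
    }
    // one writer for the whole loop; using disposes it even on exceptions
    using (TextWriter tr = new StreamWriter(@"E:\newfiless"))
    {
        foreach (string a in listA)
        {
            tr.WriteLine(a);
        }
    }
}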

I assume that all you need is to read the input file, strip out all starting/ending quotation marks, then split by comma and write it all to another file. You can actually accomplish this in a one-liner using SelectMany, which will produce a "flat" collection:
File.WriteAllLines(
    @"c:\temp\output.txt",
    File
        .ReadAllLines(@"c:\temp\input.csv")
        .SelectMany(line => line.Trim('"').Split(','))
);
It's not quite clear from your example where quotation marks are located in the file. For a typical .CSV file some comma-separated field might be wrapped in quotation marks to allow commas to be a part of the content. If it's the case, then parsing will be more complex.
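If that is the case, one option (a sketch, not tied to the question's file) is to let a quote-aware parser split the fields, for example the TextFieldParser class from the Microsoft.VisualBasic assembly:
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(@"c:\temp\input.csv"))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true; // commas inside quoted fields are preserved
    var fields = new List<string>();
    while (!parser.EndOfData)
        fields.AddRange(parser.ReadFields()); // quotes are stripped automatically
    File.WriteAllLines(@"c:\temp\output.txt", fields);
}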

You can use
tr.Write(a.Substring(1, a.Length - 2));
Edit: better, use Trim:
tr.Write(a.TrimEnd('"').TrimStart('"'));

Related

How can I find and replace text in a larger file (150MB-250MB) with regular expressions in C#?

I am working with files that range between 150MB and 250MB, and I need to append a form feed (\f) character to each match found in a match collection. Currently, my regular expression for each match is this:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)MORE DATA(.*?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
and I'd like to modify each match in the file (and then overwrite the file) to become something that could be later found with a shorter regular expression:
Regex myreg = new Regex("ABC: DEF11-1111(.*?)\f\f, RegexOptions.Singleline);
Put another way, I want to simply append a form feed character (\f) to each match that is found in my file and save it.
I see a ton of examples on Stack Overflow for replacing text, but not so much for larger files. Typical examples of what to do would include:
Using StreamReader to store the entire file in a string, then doing a find and replace in that string.
Using MatchCollection in combination with File.ReadAllText().
Reading the file line by line and looking for matches there.
The problem with the first two is that it just eats up a ton of memory, and I worry about the program being able to handle all of that. The problem with the 3rd option is that my regular expression spans many rows, and thus will not be found in a single line. I see other posts out there as well, but they cover replacing specific strings of text rather than working with regular expressions.
What would be a good approach for me to append a form feed character to each match found in a file, and then save that file?
Edit:
Per some suggestions, I tried playing around with StreamReader.ReadLine(). Specifically, I would read a line, see if it matched my expression, and then based on that result I would write to a file. If it matched the expression, I would write to the file. If it didn't match the expression, I would just append it to a string until it did match the expression. Like this:
Regex myreg = new Regex("ABC: DEF11-1111(.?)MORE DATA(.?)EVEN MORE DATA(.*?)\f", RegexOptions.Singleline);
//For storing/comparing our match.
string line, buildingmatch, match, whatremains;
buildingmatch = "";
match = "";
whatremains = "";
//For keep track of trailing bits after our match.
int matchlength = 0;
using (StreamWriter sw = new StreamWriter(destFile))
using (StreamReader sr = new StreamReader(srcFile))
{
//While we are still reading lines in the file...
while ((line = sr.ReadLine()) != null)
{
//Keep adding lines to buildingmatch until we can match the regular expression.
buildingmatch = buildingmatch + line + "\r\n";
if (myreg.IsMatch(buildingmatch)
{
match = myreg.Match(buildingmatch).Value;
matchlength = match.Lengh;
//Make sure we are not at the end of the file.
if (matchlength < buildingmatch.Length)
{
whatremains = buildingmatch.SubString(matchlength, buildingmatch.Length - matchlength);
}
sw.Write(match, + "\f\f");
buildingmatch = whatremains;
whatremains = "";
}
}
}
The problem is that this took about 55 minutes to run on a roughly 150MB file. There HAS to be a better way to do this...
If you can load the whole string data into a single string variable, there is no need to first match and then append text to matches in a loop. You can use a single Regex.Replace operation:
string text = File.ReadAllText(srcFile);
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
    sw.Write(myregex.Replace(text, "$&\f\f"));
}
Details:
string text = File.ReadAllText(srcFile); - reads the srcFile file into the text variable (the question's name match would be confusing here)
myregex.Replace(text, "$&\f\f") - replaces all occurrences of myregex matches with themselves ($& is a backreference to the whole match value) while appending two \f chars right after each match.
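A tiny illustration of the $& backreference, on made-up input:
var rx = new Regex("cat");
string result = rx.Replace("cat dog catfish", "$&!");
// result is "cat! dog cat!fish": each match is kept and '!' is appended,
// just as "$&\f\f" keeps each match and appends two form feeds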
I was able to find a solution that works in a reasonable time; it can process my entire 150MB file in under 5 minutes.
First, as mentioned in the comments, it's a waste to compare the string to the Regex after every iteration. Rather, I started with this:
string match = File.ReadAllText(srcFile);
MatchCollection mymatches = myregex.Matches(match);
Strings can hold up to 2GB of data, so while not ideal, I figured roughly 150MB worth wouldn't hurt to be stored in a string. Then, as opposed to checking for a match every x lines read in from the file, I can check the whole file for matches all at once!
Next, I used this:
StringBuilder matchsb = new StringBuilder(134217728);
foreach (Match m in mymatches)
{
    matchsb.Append(m.Value + "\f\f");
}
Since I already know (roughly) the size of my file, I can go ahead and initialize my StringBuilder. Not to mention, it's a lot more efficient to use a StringBuilder if you are doing multiple operations on a string (which I was). From there, it's just a matter of appending the form feed to each of my matches.
Finally, the part that cost the most in performance:
using (StreamWriter sw = new StreamWriter(destfile, false, Encoding.UTF8, 5242880))
{
    sw.Write(matchsb.ToString());
}
The way that you initialize StreamWriter is critical. Normally, you just declare it as:
StreamWriter sw = new StreamWriter(destfile);
This is fine for most use cases, but the problem becomes apparent when you are dealing with larger files. When declared like this, you are writing to the file with a default buffer of 4KB. For a smaller file, this is fine. But for 150MB files? This will end up taking a long time. So I corrected the issue by changing the buffer to approximately 5MB.
I found this resource really helped me to understand how to write to files more efficiently: https://www.jeremyshanks.com/fastest-way-to-write-text-files-to-disk-in-c/
Hopefully this will help the next person along as well.

C# so I need to split out a string, I think

So I have this application that I inherited from someone who is long gone. The gist of the application is that it reads in a .csv file that has about 5800 lines in it and copies it over to another .csv, which it creates new each time, after stripping out a few things: #, ', &. Well, everything worked great, or it had until about a month ago. So I started checking into it, and what I have found so far is that there are about 131 items missing from the spreadsheet. Now I read someplace that the maximum amount of data a string can hold is over 1,000,000,000 chars, and my spreadsheet is way under that, around 800,000 chars, but the only thing I can think of that could be doing it is the string object.
So anyway, here is the code in question; this piece appears to both read in from the existing file and output to the new file:
StreamReader s = new StreamReader(File);
//Read the rest of the data in the file.
string AllData = s.ReadToEnd();
//Split off each row at the Carriage Return/Line Feed
//Default line ending in most windows exports.
//You may have to edit this to match your particular file.
//This will work for Excel, Access, etc. default exports.
string[] rows = AllData.Split("\r\n".ToCharArray(), System.StringSplitOptions.RemoveEmptyEntries);
//Now add each row to the DataSet
foreach (string r in rows)
{
    //Split the row at the delimiter.
    string[] items = r.Split(delimiter.ToCharArray());
    //Add the item
    result.Rows.Add(items);
}
If anyone can help me I would really appreciate it. I either need to figure out how to split the data better, or I need to figure out why it is cutting out the last 131 lines going from the existing Excel file to the new Excel file.
One easier way to do this, since you're using "\r\n" for lines, would be to just use the built-in line reading method: File.ReadLines(path)
foreach (var line in File.ReadLines(path))
{
    var items = line.Split(',');
    result.Rows.Add(items);
}
You may want to check out the TextFieldParser class, which is part of the Microsoft.VisualBasic.FileIO namespace (yes, you can use this with C# code)
Something along the lines of:
using (var reader = new TextFieldParser("c:\\path\\to\\file"))
{
    //configure for a delimited file
    reader.TextFieldType = FieldType.Delimited;
    //configure the delimiter character (comma)
    reader.Delimiters = new[] { "," };
    while (!reader.EndOfData)
    {
        string[] row = reader.ReadFields();
        //do stuff
    }
}
This class can help with some of the issues of splitting a line into its fields, when the field may contain the delimiter.
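For instance, a quick demonstration with an in-memory line (hypothetical sample data, using the TextReader overload of the constructor):
using (var reader = new TextFieldParser(new StringReader("\"Smith, John\",42")))
{
    reader.TextFieldType = FieldType.Delimited;
    reader.Delimiters = new[] { "," };
    string[] row = reader.ReadFields();
    // row is { "Smith, John", "42" } -- the embedded comma is preserved
}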

Splitting strings using Environment.Newline leaves \n in most array items?

I used MyString.Split(Environment.NewLine.ToCharArray()[0]) to split my string from a file into different pieces. But every item in the array except the first one starts with \n after I did that. I know the way I'm splitting by newlines is kind of "cheaty", for lack of a better word, so if there is a better way of doing this, please tell me...
Here is the file...
If you want to keep using .Split() instead of reading the file in a line at a time, you can do...
var splitResult = MyString.Split(new string[] { System.Environment.NewLine },
    System.StringSplitOptions.RemoveEmptyEntries);
/* or System.StringSplitOptions.None if you want empty results as well */
EDIT:
The problem you were having is that in a non-Unix environment the new-line "character" is actually two characters. So when you grabbed the zero index you were actually splitting on a carriage return... not the new-line character (\n).
Windows = "\r\n"
Unix = "\n"
Per http://msdn.microsoft.com/en-us/library/system.environment.newline.aspx
A newline in Windows is two characters (\r and \n). The Environment.NewLine.ToCharArray()[0] expression specifies only one of those characters: \r. Therefore, the other character (\n) remains as a portion of the split string.
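A quick way to see this for yourself (assuming you are running on Windows, where NewLine is "\r\n"):
Console.WriteLine(Environment.NewLine.Length);  // 2
Console.WriteLine((int)Environment.NewLine[0]); // 13, '\r' -- what [0] selected
Console.WriteLine((int)Environment.NewLine[1]); // 10, '\n' -- what was left behind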
May I suggest you read your file using something like this:
public IEnumerable<string> ReadFile(string filePath)
{
    using (StreamReader reader = new StreamReader(filePath))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
You might need more error handling, or to specify different file open options, or to pass a stream to the method rather than the path, but the idea of using an iterator over the ReadLine() method is sound. The result is you can just use code like this:
foreach (string line in ReadFile(" ... my file path ... "))
{
}

How to read a file into a string with CR/LF preserved?

If I asked the question "how to read a file into a string" the answer would be obvious. However -- here is the catch with CR/LF preserved.
The problem is, File.ReadAllText strips those characters. StreamReader.ReadToEnd just converted LF into CR for me, which led to a long investigation into where the bug was in my pretty obvious code ;-)
So, in short, if I have a file containing foo\n\r\nbar I would like to get foo\n\r\nbar (i.e. exactly the same content), not foo bar, foobar, or foo\n\n\nbar. Is there some ready-to-use way in .NET space?
The outcome should be always single string, containing entire file.
Are you sure that those methods are the culprits that are stripping out your characters?
I tried to write up a quick test; StreamReader.ReadToEnd preserves all newline characters.
string str = "foo\n\r\nbar";
using (Stream ms = new MemoryStream(Encoding.ASCII.GetBytes(str)))
using (StreamReader sr = new StreamReader(ms, Encoding.UTF8))
{
string str2 = sr.ReadToEnd();
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
}
// Output: 102,111,111,10,13,10,98,97,114
// f o o \n \r \n b a r
An identical result is achieved when writing to and reading from a temporary file:
string str = "foo\n\r\nbar";
string temp = Path.GetTempFileName();
File.WriteAllText(temp, str);
string str2 = File.ReadAllText(temp);
Console.WriteLine(string.Join(",", str2.Select(c => ((int)c))));
It appears that your newlines are getting lost elsewhere.
This piece of code will preserve LF and CR:
string r = File.ReadAllText(@".\TestData\TR120119.TRX", Encoding.ASCII);
The outcome should be always single string, containing entire file.
It takes two hops. First one is File.ReadAllBytes() to get all the bytes in the file. Which doesn't try to translate anything, you get the raw data in the file so the weirdo line-endings are preserved as-is.
But that's bytes, you asked for a string. So second hop is to apply Encoding.GetString() to convert the bytes to a string. The one thing you have to do is pick the right Encoding class, the one that matches the encoding used by the program that wrote the file. Given that the file is pretty messed up if it contains \n\r\n sequences, and you didn't document anything else about the file, your best bet is to use Encoding.Default. Tweak as necessary.
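A minimal sketch of those two hops (the path and the encoding choice are assumptions to adjust):
byte[] raw = File.ReadAllBytes(@"C:\temp\data.txt"); // raw bytes, line endings untouched
string text = Encoding.Default.GetString(raw);       // pick the encoding the writer used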
You can read the contents of a file using File.ReadAllLines, which will return an array of the lines. Then use String.Join to merge the lines together using a separator.
string[] lines = File.ReadAllLines(#"C:\Users\User\file.txt");
string allLines = String.Join("\r\n", lines);
Note that this will lose the precision of the actual line terminator characters. For example, if the lines end in only \n or \r, the resulting string allLines will have replaced them with \r\n line terminators.
There are of course other ways of achieving this without losing the true EOL terminator; however, ReadAllLines is handy in that it can detect many types of text encoding by itself, and it also takes up very few lines of code.
ReadAllText doesn't return carriage returns.
This method opens a file, reads each line of the file, and then adds each line as an element of a string. It then closes the file. A line is defined as a sequence of characters followed by a carriage return ('\r'), a line feed ('\n'), or a carriage return immediately followed by a line feed. The resulting string does not contain the terminating carriage return and/or line feed.
From MSDN - https://msdn.microsoft.com/en-us/library/ms143368(v=vs.110).aspx
This is similar to the accepted answer, but wanted to be more to the point. sr.ReadToEnd() will read the bytes as desired:
string myFilePath = #"C:\temp\somefile.txt";
string myEvents = String.Empty;
FileStream fs = new FileStream(myFilePath, FileMode.Open);
StreamReader sr = new StreamReader(fs);
myEvents = sr.ReadToEnd();
sr.Close();
fs.Close();
You could also do those in cascaded using statements. But I wanted to describe how the way you write to that file in the first place determines how to read the content from the myEvents string, and might really be where the problem lies. I wrote to my file like this:
using System.Reflection;
using System.IO;
private static void RecordEvents(string someEvent)
{
    string folderLoc = Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
    if (!folderLoc.EndsWith(@"\")) folderLoc += @"\";
    folderLoc = folderLoc.Replace(@"\\", @"\"); // replace double-slashes with single slashes
    string myFilePath = folderLoc + "myEventFile.txt";
    if (!File.Exists(myFilePath))
        File.Create(myFilePath).Close(); // must .Close() since it will conflict with opening the FileStream, below
    FileStream fs = new FileStream(myFilePath, FileMode.Append);
    StreamWriter sr = new StreamWriter(fs);
    sr.Write(someEvent + Environment.NewLine);
    sr.Close();
    fs.Close();
}
Then I could use the code farther above to get the string of the contents. Because I was going further and looking for the individual strings, I put this code after THAT code, up there:
if (myEvents != String.Empty) // we have something
{
    // (char)2660 is an obscure character -- I could have chosen any
    // delimiter I did not expect to find in my text
    myEvents = myEvents.Replace(Environment.NewLine, ((char)2660).ToString());
    string[] eventArray = myEvents.Split((char)2660);
    foreach (string s in eventArray)
    {
        if (!String.IsNullOrEmpty(s))
        {
            // do whatever with the individual strings from your file
        }
    }
}
And this worked fine. So I know that myEvents had to have the Environment.NewLine characters preserved, because I was able to replace them with (char)2660 and do a .Split() on that string using that character to divide it into the individual segments.

How do I handle line breaks in a CSV file using C#?

I have an Excel spreadsheet being converted into a CSV file in C#, but am having a problem dealing with line breaks. For instance:
"John","23","555-5555"
"Peter","24","555-5
555"
"Mary,"21","555-5555"
When I read the CSV file, if the record does not start with a double quote (") then a line break is there by mistake and I have to remove it. I have some CSV reader classes from the internet, but I am concerned that they will fail on the line breaks.
How should I handle these line breaks?
Thanks everybody very much for your help.
Here's what I've done so far. My records have a fixed format and all start with
JTW;...;....;...;
JTW;...;...;....
JTW;....;...;..
..;...;... (wrong record, line break inserted)
JTW;...;...
So I checked for the ; in the [3] position of each line. If true, I write the line; if false, I append it to the previous one (removing the line break).
I'm having problems now because I'm saving the file as a txt.
By the way, I am converting the Excel spreadsheet to CSV by saving as CSV in Excel. But I'm not sure if the client is doing that.
So the file as a TXT is perfect. I've checked the records and totals. But now I have to convert it back to CSV, and I would really like to do it in the program. Does anybody know how?
Here is my code:
namespace EditorCSV
{
    class Program
    {
        static void Main(string[] args)
        {
            ReadFromFile("c:\\source.csv");
        }

        static void ReadFromFile(string filename)
        {
            StreamReader SR;
            StreamWriter SW;
            SW = File.CreateText("c:\\target.csv");
            string S;
            char C = 'a';
            int i = 0;
            SR = File.OpenText(filename);
            S = SR.ReadLine();
            SW.Write(S);
            S = SR.ReadLine();
            while (S != null)
            {
                try { C = S[3]; }
                catch (IndexOutOfRangeException exception)
                {
                    bool t = false;
                    while (t == false)
                    {
                        t = true;
                        S = SR.ReadLine();
                        try { C = S[3]; }
                        catch (IndexOutOfRangeException ex) { S = SR.ReadLine(); t = false; }
                    }
                }
                if (C.Equals(';'))
                {
                    SW.Write("\r\n" + S);
                    i = i + 1;
                }
                else
                {
                    SW.Write(S);
                }
                S = SR.ReadLine();
            }
            SR.Close();
            SW.Close();
            Console.WriteLine("Records Processed: " + i.ToString() + ".");
            Console.WriteLine("File Created Successfully");
            Console.ReadKey();
        }
    }
}
CSV has predefined ways of handling that. This site provides an easy-to-read explanation of the standard way to handle all the caveats of CSV.
Nevertheless, there is really no reason not to use a solid, open-source library for reading and writing CSV files, to avoid making non-standard mistakes. LINQtoCSV is my favorite library for this. It supports reading and writing in a clean and simple way.
Alternatively, this SO question on CSV libraries will give you the list of the most popular choices.
Rather than check if the current line is missing the (") as the first character, check instead to see if the last character is a ("). If it is not, you know you have a line break, and you can read the next line and merge it together.
I am assuming your example data was accurate - fields were wrapped in quotes. If quotes might not delimit a text field (or new-lines are somehow found in non-text data), then all bets are off!
There is a built-in method for reading CSV files in .NET (it requires a reference to the Microsoft.VisualBasic assembly):
public static IEnumerable<string[]> ReadSV(TextReader reader, params string[] separators)
{
    var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader);
    parser.SetDelimiters(separators);
    while (!parser.EndOfData)
        yield return parser.ReadFields();
}
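Hypothetical usage of that helper:
using (var reader = File.OpenText(@"c:\temp\input.csv")) // placeholder path
{
    foreach (string[] fields in ReadSV(reader, ","))
    {
        Console.WriteLine(string.Join(" | ", fields));
    }
}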
If you're dealing with really large files this CSV reader claims to be the fastest one you'll find: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader
I've used this piece of code recently to parse rows from a CSV file (this is a simplified version):
private void Parse(TextReader reader)
{
    const char separator = ','; // assumed: a field of the original class, added here so the snippet compiles
    var row = new List<string>();
    var isStringBlock = false;
    var sb = new StringBuilder();
    long charIndex = 0;
    int currentLineCount = 0;
    while (reader.Peek() != -1)
    {
        charIndex++;
        char c = (char)reader.Read();
        if (c == '"')
            isStringBlock = !isStringBlock;
        if (c == separator && !isStringBlock) //end of word
        {
            row.Add(sb.ToString().Trim()); //add word
            sb.Length = 0;
        }
        else if (c == '\n' && !isStringBlock) //end of line
        {
            row.Add(sb.ToString().Trim()); //add last word in line
            sb.Length = 0;
            //DO SOMETHING WITH row HERE!
            currentLineCount++;
            row = new List<string>();
        }
        else
        {
            if (c != '"' && c != '\r') sb.Append(c == '\n' ? ' ' : c);
        }
    }
    row.Add(sb.ToString().Trim()); //add last word
    //DO SOMETHING WITH LAST row HERE!
}
Try CsvHelper (a library I maintain). It ignores empty rows. I believe there is a flag you can set in FastCsvReader to have it handle empty rows also.
Heed the advice from the experts and Don't roll your own CSV parser.
Your first thought is, "How do I handle new line breaks?"
Your next thought is, "I need to handle commas inside of quotes."
Your next thought will be, "Oh, crap, I need to handle quotes inside of quotes. Escaped quotes. Double quotes. Single quotes..."
It's a road to madness. Don't write your own. Find a library with an extensive unit test coverage that hits all the hard parts and has gone through hell for you. For .NET, use the free CsvHelper library.
Maybe you could count the (") characters during ReadLine(). If the count is odd, that raises the flag. You could either ignore those lines, or get the next two lines and eliminate the first "\n" occurrence in the merged lines.
What I usually do is read the text in character by character, as opposed to line by line, due to this very problem.
As you're reading each character, you should be able to figure out where each cell starts and stops, but also the difference between a linebreak in a row and in a cell: If I remember correctly, for Excel generated files anyway, rows start with \r\n, and newlines in cells are only \r.
There is an example parser in C# that seems to handle your case correctly. Then you can read your data in and purge the line breaks out of it post-read.
Part 2 is the parser, and there is a Part 1 that covers the writer portion.
Read the line.
Split into columns (fields).
If you have the expected number of columns for the line, then process it.
If not, read the next line, and capture the remaining columns until you get what you need.
Repeat (a sketch of this loop follows below).
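A minimal sketch of that loop, assuming three expected columns, the ';' delimiter from the question, and a hypothetical Process handler:
const int expectedColumns = 3; // assumption: how many fields a full record has
string pending = "";
foreach (var line in File.ReadLines(@"c:\source.csv"))
{
    // merge this physical line into the pending logical record
    pending = pending.Length == 0 ? line : pending + line;
    string[] columns = pending.Split(';');
    if (columns.Length >= expectedColumns)
    {
        Process(columns); // hypothetical: handle one complete record
        pending = "";
    }
}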
A somewhat simple regular expression could be used on each line. When it matches, you process each field from the match. When it doesn't find a match, you skip that line.
The regular expression could look something like this.
Match match = Regex.Match(line, @"^(?:,?(?:(?<q>['""])(?<field>.*?)\k<q>|(?<field>[^,]*)))+$");
if (match.Success)
{
    foreach (Capture capture in match.Groups["field"].Captures)
    {
        string fieldValue = capture.Value;
        // Use the value.
    }
}
Have a look at FileHelpers Library
It supports reading/writing CSV with line breaks as well as reading/writing to Excel.
The LINQy solution:
string csvText = File.ReadAllText("C:\\Test.txt");
var query = csvText
    .Replace(Environment.NewLine, string.Empty)
    .Replace("\"\"", "\",\"")
    .Split(',')
    .Select((i, n) => new { i, n })
    .GroupBy(a => a.n / 3);
You might also check out my CSV parser SoftCircuits.CsvParser on NuGet. It will not only parse a CSV file but--if wanted--can also automatically map column values to your class properties. And it runs nearly four times faster than CsvHelper.
For a line break to exist in a CSV, there must be an open double quote that's not closed.
Assuming that all CSV cells must open and close a double quote, just check if there's an odd number of quotation marks:
my_string.Count(c => c == '"') % 2 == 1
and if that's the case, continue reading until you have an even number.
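A sketch of that idea (it assumes quotes only ever appear as field delimiters or as doubled "" escapes, and a hypothetical HandleRecord method):
var record = new StringBuilder();
foreach (string line in File.ReadLines(@"c:\temp\input.csv")) // placeholder path
{
    record.Append(record.Length == 0 ? line : "\n" + line);
    // an even quote count means the logical record is complete
    if (record.Length > 0 && record.ToString().Count(c => c == '"') % 2 == 0)
    {
        HandleRecord(record.ToString()); // hypothetical: process one logical row
        record.Clear();
    }
}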
