C# - Split CSV File by Removing Bad Rows

I have a CSV file with 2 million rows and a file size of 2 GB. A couple of free-text columns contain stray CRLF characters, which cause the file to fail to load into the SQL Server table. I get an error that the last column does not end with ".
I have the following code, but it gives an OutOfMemoryException when reading from fileName. The line is:
var lines = File.ReadAllLines(fileName);
How can I fix it? Ideally, I would like to split the file into two files, one with the good rows and one with the bad rows, or delete the rows that do not end with " followed by CRLF.
int goodRow = 0;
int badRow = 0;
String badRowFileName = fileName.Substring(0, fileName.Length - 4) + "BadRow.csv";
String goodRowFileName = fileName.Substring(0, fileName.Length - 4) + "GoodRow.csv";
var charGood = "\"\"";
String lineOut = string.Empty;
String str = string.Empty;
var lines = File.ReadAllLines(fileName);
StringBuilder sbGood = new StringBuilder();
StringBuilder sbBad = new StringBuilder();
foreach (string line in lines)
{
    if (line.Contains(charGood))
    {
        goodRow++;
        sbGood.AppendLine(line);
    }
    else
    {
        badRow++;
        sbBad.AppendLine(line);
    }
}
if (badRow > 0)
{
    File.WriteAllText(badRowFileName, sbBad.ToString());
}
if (goodRow > 0)
{
    File.WriteAllText(goodRowFileName, sbGood.ToString());
}
sbGood.Clear();
sbBad.Clear();
msg = msg + "Good Rows - " + goodRow.ToString() + " Bad Rows - " + badRow.ToString() + " Done.";

You can translate that code like this to be much more efficient:
int goodRow = 0, badRow = 0;
String badRowFileName = fileName.Substring(0, fileName.Length - 4) + "BadRow.csv";
String goodRowFileName = fileName.Substring(0, fileName.Length - 4) + "GoodRow.csv";
var charGood = "\"\"";
var lines = File.ReadLines(fileName); // lazy enumeration; IEnumerable<string> is not IDisposable, so no using here
using (var swGood = new StreamWriter(goodRowFileName))
using (var swBad = new StreamWriter(badRowFileName))
{
    foreach (string line in lines)
    {
        if (line.Contains(charGood))
        {
            goodRow++;
            swGood.WriteLine(line);
        }
        else
        {
            badRow++;
            swBad.WriteLine(line);
        }
    }
}
msg += $"Good Rows: {goodRow,9} Bad Rows: {badRow,9} Done.";
But I'd also look at using a real csv parser for this. There are plenty on NuGet. That might even let you clean up the data on the fly.
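For instance, here is a rough sketch of that approach using TextFieldParser from the Microsoft.VisualBasic.FileIO namespace (one such parser that ships with .NET; the expectedColumns count and the CRLF flattening are assumptions you would adapt to your data):
using Microsoft.VisualBasic.FileIO; // add a reference to Microsoft.VisualBasic

using (var parser = new TextFieldParser(fileName))
using (var swGood = new StreamWriter(goodRowFileName))
using (var swBad = new StreamWriter(badRowFileName))
{
    parser.TextFieldType = FieldType.Delimited;
    parser.SetDelimiters(",");
    parser.HasFieldsEnclosedInQuotes = true; // quoted fields may span lines (embedded CRLF)

    const int expectedColumns = 10; // assumption: the real column count of your table

    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields(); // one logical record, even if it spanned several lines
            if (fields.Length == expectedColumns)
            {
                // flatten embedded line breaks and re-quote before writing the clean row
                for (int i = 0; i < fields.Length; i++)
                    fields[i] = "\"" + fields[i].Replace("\r\n", " ").Replace("\"", "\"\"") + "\"";
                swGood.WriteLine(string.Join(",", fields));
            }
            else
            {
                swBad.WriteLine(string.Join(",", fields));
            }
        }
        catch (MalformedLineException)
        {
            swBad.WriteLine(parser.ErrorLine); // keep the raw line that could not be parsed
        }
    }
}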

I would not suggest reading the entire file into memory, then processing the file, then writing all modified contents out to the new file.
Instead, use file streams:
using (var rdr = new StreamReader(fileName))
using (var wrtrGood = new StreamWriter(goodRowFileName))
using (var wrtrBad = new StreamWriter(badRowFileName))
{
    string line = null;
    while ((line = rdr.ReadLine()) != null)
    {
        if (line.Contains(charGood))
        {
            goodRow++;
            wrtrGood.WriteLine(line);
        }
        else
        {
            badRow++;
            wrtrBad.WriteLine(line);
        }
    }
}
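As a side note, both snippets keep the question's charGood heuristic (line.Contains("\"\"")). If the rule you actually want is the one described in the question, i.e. a good row ends with a closing quote before the CRLF, the check could be swapped for something like this (an assumption about the data, not a tested rule):
// assumption: a "good" row is one whose last column ends with its closing quote
bool isGood = line.EndsWith("\"");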

Related

C# parsing and arrays

I'm making a program that parses some data, and somehow I'm not receiving what I need.
I have data in a file in the following order:
1111
username
email@email.com
IMAGE01: http://www.1234567890.net/image/cc_141019050341.png
So I made an array named "lines" with one entry per line of text in the file, and then:
this.videoId = lines[0];
this.clientUser = lines[1];
this.clientEmail = lines[2];
this.textLines = new List<string>();
this.imageLines = new Dictionary<int,string>();
for (int i = 3; i < lines.Length; i++)
{
    if (lines[i].Contains("IMAGE"))
    {
        int imgNumber = Int32.Parse(
            lines[i].Substring(Math.Max(0, lines[i].Length - 10), 2)
        );
        this.imageLines.Add(imgNumber, lines[i].Substring(Math.Max(0, lines[i].Length - 7)));
    }
    else
    {
        this.textLines.Add(lines[i]);
    }
}
Then I put each parsed data into a different .txt file:
using (StreamWriter emailTxt = new StreamWriter(@"txt/" + "user_email.txt"))
{
    emailTxt.Write(nek.clientEmail);
}
using (StreamWriter userTxt = new StreamWriter(@"txt/" + "user_data.txt"))
{
    userTxt.Write(nek.clientUser + Environment.NewLine + unixTime);
}
using (StreamWriter imageTxt = new StreamWriter(@"txt/" + "user_images.txt"))
{
    foreach (KeyValuePair<int, string> kp in nek.imageLines)
    {
        imageTxt.WriteLine(string.Format("{0:00}: {1}", kp.Key, kp.Value));
    }
}
But somehow I'm retrieving all the data correctly, except for imageTxt, which should be:
http://www.1234567890.net/image/cc_141019050341.png
I'm receiving:
05: 341.png
Any ideas why? Thank you for your time.
Your substrings are slicing into cc_141019050341.png: the first one extracts 05, and the second one extracts 341.png.
I would suggest you use a regex to extract the parts you want, something like:
IMAGE(?<num>\d+).*?:\s(?<url>.*)
You can use it in your code like:
var match = new Regex(@"IMAGE(?<num>\d+).*?:\s(?<url>.*)").Match(lines[i]);
if (match.Success)
{
    var url = match.Groups["url"];
    var strNum = match.Groups["num"];
}
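A sketch of how that might be wired back into the original loop, reusing the lines array, textLines, and imageLines from the question (requires using System.Text.RegularExpressions;):
var imageRegex = new Regex(@"IMAGE(?<num>\d+).*?:\s(?<url>.*)");
for (int i = 3; i < lines.Length; i++)
{
    var match = imageRegex.Match(lines[i]);
    if (match.Success)
    {
        // e.g. "IMAGE01: http://..." -> key 1, value "http://..."
        int imgNumber = Int32.Parse(match.Groups["num"].Value);
        this.imageLines.Add(imgNumber, match.Groups["url"].Value);
    }
    else
    {
        this.textLines.Add(lines[i]);
    }
}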

Convert .XYZ to .csv using c#

Hi, I am using this method to replace " " with "," but it fails when I try to use it on data that has 32 million lines. Does anyone know how to modify it to make it run?
List<String> lines = new List<String>();
//loop through each line of file and replace " " sign with ","
using (StreamReader sr = new StreamReader(inputfile))
{
    int id = 1;
    int i = File.ReadAllLines(inputfile).Count();
    while (sr.Peek() >= 0)
    {
        //Out of memory issue
        string fileLine = sr.ReadLine();
        //do something with line
        string ttt = fileLine.Replace(" ", ", ");
        //Debug.WriteLine(ttt);
        lines.Add(ttt);
        //lines.Add(id++, 'ID');
    }
    using (StreamWriter writer = new StreamWriter(outputfile, false))
    {
        foreach (String line in lines)
        {
            writer.WriteLine(line + "," + id);
            id++;
        }
    }
}
//change extension to .csv
FileInfo f = new FileInfo(outputfile);
f.MoveTo(Path.ChangeExtension(outputfile, ".csv"));
In general, I am trying to convert a big .XYZ file to .csv format and add an incremental field at the end. I am using C# for the first time in my life, to be honest. :) Can you help me?
See my comment above - you could modify your reading/writing as follows:
using (StreamReader sr = new StreamReader(inputfile))
{
    using (StreamWriter writer = new StreamWriter(outputfile, false))
    {
        int id = 1;
        while (sr.Peek() >= 0)
        {
            string fileLine = sr.ReadLine();
            //do something with line
            string ttt = fileLine.Replace(" ", ", ");
            writer.WriteLine(ttt + "," + id);
            id++;
        }
    }
}
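As a small follow-up to the extension change at the end of the question: if the output name can simply be derived from the input name (an assumption about your naming scheme), you can write straight to the .csv path and skip the FileInfo.MoveTo step:
// derive the output path up front instead of renaming afterwards
string outputfile = Path.ChangeExtension(inputfile, ".csv");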

Is there a more efficient way of reading and writing a text file at the same time?

I'm back at it again with another question, this time with regard to editing text files. My homework is as follows:
Write a program that reads the contents of a text file and inserts the line numbers at the beginning of each line, then rewrites the file contents.
This is what I have so far, though I am not sure this is the most efficient way of doing it. I've only just started learning about handling text files.
static void Main(string[] args)
{
    string fileName = @"C:\Users\Nate\Documents\Visual Studio 2015\Projects\Chapter 15\Chapter 15 Question 3\Chapter 15 Question 3\TextFile1.txt";
    StreamReader reader = new StreamReader(fileName);
    int lineCounter = 0;
    List<string> list = new List<string>();
    using (reader)
    {
        string line = reader.ReadLine();
        while (line != null)
        {
            list.Add("line " + (lineCounter + 1) + ": " + line);
            line = reader.ReadLine();
            lineCounter++;
        }
    }
    StreamWriter writer = new StreamWriter(fileName);
    using (writer)
    {
        foreach (string line in list)
        {
            writer.WriteLine(line);
        }
    }
}
Your help would be appreciated!
Thanks once again. :]
This should be enough (in case the file is relatively small):
using System.IO;
(...)
static void Main(string[] args)
{
    string fileName = @"C:\Users\Nate\Documents\Visual Studio 2015\Projects\Chapter 15\Chapter 15 Question 3\Chapter 15 Question 3\TextFile1.txt";
    string[] lines = File.ReadAllLines(fileName);
    for (int i = 0; i < lines.Length; i++)
    {
        lines[i] = string.Format("{0} {1}", i + 1, lines[i]);
    }
    File.WriteAllLines(fileName, lines);
}
I suggest using LINQ; use File.ReadLines to read the content.
// Read all lines and apply the format
var formattedLines = File
    .ReadLines("filepath") // read lines lazily
    .Select((line, i) => string.Format("line {0}: {1}", i + 1, line)); // format each line
// write the formatted lines either to a new file or overwrite the previous file
File.WriteAllLines("outputfilepath", formattedLines);
Just one loop here. I think it will be efficient.
class Program
{
    public static void Main()
    {
        string path = Directory.GetCurrentDirectory() + @"\MyText.txt";
        StreamReader sr1 = File.OpenText(path);
        string s = "";
        int counter = 1;
        StringBuilder sb = new StringBuilder();
        while ((s = sr1.ReadLine()) != null)
        {
            var lineOutput = counter++ + " " + s;
            Console.WriteLine(lineOutput);
            sb.AppendLine(lineOutput); // AppendLine, otherwise all lines run together
        }
        sr1.Close();
        Console.WriteLine();
        StreamWriter sw1 = File.CreateText(path); // overwrite so the file is rewritten with the numbered lines
        sw1.Write(sb);
        sw1.Close();
    }
}

Copying CSV file while reordering/adding empty columns

For example, if every line of the incoming file has values for 3 out of 10 columns, in an order different from the output (except the first line, which is the header with column names):
col2,col6,col4 // first line - column names
2, 5, 8 // subsequent lines - values for 3 columns
and the output is expected to have
col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
then the output should be "" for col0,col1,col3,col5,col7,col8,col9, and the values from col2,col6,col4 in the input file. So for the second line shown (2,5,8) the expected output is ",,2,,8,,5,,,".
Below is the code I've tried; it is slower than I want.
I have two lists.
The first list, filecolumnnames, is created by splitting a delimited string (line), and this list gets recreated for every line in the file.
The second list, list, has the order in which the first list needs to be rearranged and re-concatenated.
This works
string fileName = "F:\\temp.csv";
//file data has first row col3,col2,col1,col0;
//second row: 4,3,2,1
//so on
string fileName_recreated = "F:\\temp_1.csv";
int count = 0;
const Int32 BufferSize = 1028;
using (var fileStream = File.OpenRead(fileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8, true, BufferSize))
{
String line;
List<int> list = new List<int>();
string orderedcolumns = "\"\"";
string tableheader = "col0,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10";
List<string> tablecolumnnames = new List<string>();
List<string> filecolumnnames = new List<string>();
while ((line = streamReader.ReadLine()) != null)
{
count = count + 1;
StringBuilder sb = new StringBuilder("");
tablecolumnnames = tableheader.Split(',').ToList();
if (count == 1)
{
string fileheader = line;
//fileheader=""col2,col1,col0"
filecolumnnames = fileheader.Split(',').ToList();
foreach (string col in tablecolumnnames)
{
int index = filecolumnnames.IndexOf(col);
if (index == -1)
{
sb.Append(",");
// orderedcolumns=orderedcolumns+"+\",\"";
list.Add(-1);
}
else
{
sb.Append(filecolumnnames[index] + ",");
//orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\"";
list.Add(index);
}
// MessageBox.Show(orderedcolumns);
}
}
else
{
filecolumnnames = line.Split(',').ToList();
foreach (int items in list)
{
//MessageBox.Show(items.ToString());
if (items == -1)
{
sb.Append(",");
}
else
{
sb.Append(filecolumnnames[items] + ",");
}
}
//expected format sb.Append(filecolumnnames[3] + "," + filecolumnnames[2] + "," + filecolumnnames[2] + ",");
//sb.Append(orderedcolumns);
var result = String.Join (", ", list.Select(index => filecolumnnames[index]));
}
using (FileStream fs = new FileStream(fileName_recreated, FileMode.Append, FileAccess.Write))
using (StreamWriter sw = new StreamWriter(fs))
{
sw.WriteLine(sb.ToString());
}
}
I am trying to make it faster by constructing a string, orderedcolumns, and removing the second foreach loop (which runs for every row), replacing it with the constructed string.
So if you uncomment the orderedcolumns construction orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\""; and uncomment the append sb.Append(orderedcolumns);, I am expecting the values inside the constructed string, but when I append orderedcolumns it appends the literal text, i.e.
""+","+filecolumnnames[3]+","+filecolumnnames[2]+","+filecolumnnames[1]+","+filecolumnnames[0]+","+","+","+","+","+","+","
I instead want it to take the value inside filecolumnnames[3], not the name filecolumnnames[3] itself.
Expected value: if that line has 1,2,3,4, I want the output to be 4,3,2,1, as filecolumnnames[3] will have 4, filecolumnnames[2] will have 3, and so on.
String.Join is the way to construct comma/space-delimited strings from a sequence:
var result = String.Join(", ", list.Select(index => filecolumnnames[index]));
Since you are reading only a subset of the columns, and the column orders in input and output don't match, I'd use a dictionary to hold each row of input, keyed by the input file's column names:
var row = filecolumnnames
    .Zip(line.Split(','), (Name, Value) => new { Name, Value })
    .ToDictionary(x => x.Name, x => x.Value);
For output I'd fill the sequence from defaults or the input row, iterating the output column order:
var outputLine = String.Join(",",
    tablecolumnnames
        .Select(name => row.ContainsKey(name) ? row[name] : ""));
Note code is typed in and not compiled.
orderedcolumns = orderedcolumns+ "+filecolumnnames["+index+"]" + "+\",\""; "
should be
orderedcolumns = orderedcolumns+ filecolumnnames[index] + ",";
you should however use join as others have pointed out. Or
orderedcolumns.AppendFormat("{0},", filecolumnnames[index]);
you will have to deal with the extra ',' on the end
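For example, one way of dealing with that trailing separator (a sketch; pick whichever matches whether you build a string or a StringBuilder):
// string version: drop the last ',' once the row is assembled
string row = orderedcolumns.TrimEnd(',');

// StringBuilder version: shorten by one character, guarding the empty case
if (sb.Length > 0)
    sb.Length--;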

how to format data in a text file

I have a StringBuilder that contains email IDs (it contains thousands of them):
StringBuilder sb = new StringBuilder();
foreach (DataRow dr2 in dtResult.Rows)
{
    strtxt = dr2[strMailID].ToString() + ";";
    sb.Append(strtxt);
}
string filepathEmail = Server.MapPath("Email");
using (StreamWriter outfile = new StreamWriter(filepathEmail + "\\" + "Email.txt"))
{
    outfile.Write(sb.ToString());
}
Now the data is getting stored in the text file like this:
abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;
abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;abc@gmail.com;ab@gmail.com;
But I need to store them so that every row has only 10 email IDs, so that it looks good.
Any idea how to format the data like this in a .txt file? Any help would be great.
Just add a counter in your loop and append a line break every 10 entries.
int counter = 0;
StringBuilder sb = new StringBuilder();
foreach (DataRow dr2 in dtResult.Rows)
{
    counter++;
    strtxt = dr2[strMailID].ToString() + ";";
    sb.Append(strtxt);
    if (counter % 10 == 0)
    {
        sb.Append(Environment.NewLine);
    }
}
Use a counter and add a line break each tenth item:
StringBuilder sb = new StringBuilder();
int cnt = 0;
foreach (DataRow dr2 in dtResult.Rows) {
    sb.Append(dr2[strMailID]).Append(';');
    if (++cnt == 10) {
        cnt = 0;
        sb.AppendLine();
    }
}
string filepathEmail = Path.Combine(Server.MapPath("Email"), "Email.txt");
File.WriteAllText(filepathEmail, sb.ToString());
Notes:
Concatenate strings using the StringBuilder instead of first concatenating and then appending.
Use Path.Combine to combine the path and file name, this works on any platform.
You can use the File.WriteAllText method to save the string in a single call instead of writing to a StreamWriter.
As said, you may add a line break; I would suggest adding a '\t' tab after each address instead, so your file is in a delimited format you can import into Excel, for instance.
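A minimal sketch of that variant, reusing the loop from the question and just swapping the separator:
sb.Append(dr2[strMailID]).Append('\t'); // tab instead of ';' so Excel can split it into columns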
Use a counter to keep track of the number of mails already written, like this:
int i = 0;
foreach (string mail in mails) {
    var strtxt = mail + ";";
    sb.Append(strtxt);
    i++;
    if (i % 10 == 0)
        sb.AppendLine();
}
Every 10 mails written, i modulo 10 equals 0, so you put a line break in the string builder.
Hope this can help.
Here's an alternate method using LINQ if you don't mind any overheads.
string filepathEmail = Server.MapPath("Email");
using (StreamWriter outfile = new StreamWriter(filepathEmail + "\\" + "Email.txt"))
{
    var rows = dtResult.Rows.Cast<DataRow>(); // make the rows enumerable
    var lines = from ivp in rows.Select((dr2, i) => new { i, dr2 })
                group ivp.dr2[strMailID] by ivp.i / 10 into line // group every 10 emails
                select String.Join(";", line); // put them into a string
    foreach (string line in lines)
        outfile.WriteLine(line);
}
