Merging CSV lines in huge file - c#

I have a CSV that looks like this
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
although there are 5 billion records. If you look at the first column and the date part of the 2nd column, three of the records are 'grouped' together and are just a breakdown of 15-minute intervals for the first 30 minutes of that day.
I want the output to look like
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
Where the first 4 columns of the repeating rows are omitted and the rest of the columns are combined with the first record of its kind. Basically I am converting the data from one line per 15 minutes to one line per day.
Since I will be processing 5 billion records, I think the best approach is to use regular expressions (and EmEditor) or some tool that is made for this (multithreaded, optimized), rather than a custom programmed solution. Although I am open to ideas in Node.js or C# that are relatively simple and super quick.
How can this be done?

If there's always a set number of records and they're in order, it'd be fairly easy to just read a few lines at a time, parse them, and output them. Trying to run a regex over billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write these large files, since they read and write one line at a time.
using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    int counter = 0;
    var lineCountToGroup = 3; //change to 96
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) //less 1 because we already added line1
            lines.Add(sr.ReadLine());
        var groupedLine = lines.SomeLinqIfNecessary(); //whatever your grouping logic is
        sw.WriteLine(groupedLine);
    }
}
Disclaimer- untested code with no error handling and assuming that there are indeed the correct number of lines repeated, etc. You'd obviously need to do some tweaks for your exact scenario.
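For the grouping step itself (in place of SomeLinqIfNecessary() above), something along these lines might work. This is only a sketch under the assumption that the fields are comma-separated and that the first four columns of the repeated rows should be dropped:

// Assumes: using System.Collections.Generic; using System.Linq;
// Combine a group of lines into one: keep the first line whole,
// then append everything after the first 4 columns of each following line.
static string CombineGroup(List<string> lines)
{
    var parts = new List<string> { lines[0] };
    foreach (var line in lines.Skip(1))
        parts.Add(string.Join(",", line.Split(',').Skip(4)));
    return string.Join(",", parts);
}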

You could do something like this (untested code without any error handling - but should give you the general gist of it):
// Assumes: using System.Linq; using System.Text;
using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine(); // note: should add error handling for empty files
    var cells = line.Split(','); // note: you should probably check the length too!
    var key = cells[0]; // use this to match other rows
    StringBuilder output = new StringBuilder(line); // this is the output line we build
    while ((line = sin.ReadLine()) != null) // while we have more lines
    {
        cells = line.Split(','); // split so we can get the first column
        if (cells[0] == key) // if the first column matches the current key
        {
            output.Append(',').Append(String.Join(",", cells.Skip(4))); // add this row to our output line
            continue;
        }
        // once the key changes
        sout.WriteLine(output.ToString()); // write out the line we've built up
        output.Clear();
        output.Append(line); // start the next output line
        key = cells[0]; // and update the key
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString()); // we'll have just the last line to write out
}
The idea is to loop through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you've been building up and update the key. This way you don't have to worry about exactly how many matches you have or if you might be missing a few points.
One note: it is more efficient to use a StringBuilder to build the output (as above) rather than plain string concatenation if you are going to concatenate 96 rows per line.

Define ProcessOutputLine to store merged lines.
Call ProcessInputLine for each line read, and flush the last output line at end of file.
string curKey = "";
int keyLength = ... ; // set to the total length of the first 4 columns
string outputLine = "";

private void ProcessInputLine(string line)
{
    string newKey = line.Substring(0, keyLength);
    if (newKey == curKey)
    {
        outputLine += line.Substring(keyLength);
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);
        curKey = newKey;
        outputLine = line;
    }
}
EDIT: this solution is very similar to that of Matt Burland, the only noticeable difference is that I don't use the Split function.
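For completeness, a minimal driver loop could look like this (a sketch only; the file paths are assumptions, and ProcessOutputLine would be where you write the merged line out):

// Hypothetical driver: read the input line by line, feed ProcessInputLine,
// then flush whatever is left in outputLine at end of file.
using (var sr = new StreamReader("inputFile.csv"))
{
    string line;
    while ((line = sr.ReadLine()) != null)
        ProcessInputLine(line);
    if (outputLine != "") ProcessOutputLine(outputLine); // flush the last group
}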

Related

Reading Dynamic Sets from File using C#

Here's my file. I want to use C# code to make each set consisting of the lines from A to E (from the input file):
A,1234,58978,...
B,55785,..
C,5788,...
C,446687,..
E,5456,...
E,4578,..
A,47,78,
B,5,..
C,7,..
C,66,..
E,56,..
E,48,
A,87,48,
B,8,..
C,74,..
C,64,..
E,57,..
E,48,
To be very clear, my first set will be:
A,1234,58978,...
B,55785,..
C,5788,...
C,446687,..
E,5456,...
E,4578,..
my second set will be:
A,47,78,
B,5,..
C,7,..
C,66,..
E,56,..
E,48,
my third set will be:
A,87,48,
B,8,..
C,74,..
C,64,..
E,57,..
E,48,
My file consists of n sets, so a possible way is to store them in an array, so that I can also process those multiple sets within a foreach loop.
Yet I couldn't find a way to build and store this dynamic number of sets in C#.
The results will be in the dictionary sets in the code below. Basically it reads each line and compares the line's first character to the first character of the very first line ("A" here). If it is greater, the line is added to the current set; otherwise a new set is created and the line is added to that. It keeps doing that until it has read the whole file.
I am using the ReadLines method so it does not load the whole file into memory, but reads it line by line as we request it.
string last = "";
var sets = new Dictionary<int, List<string>>();
File.ReadLines("YourFilePath.txt").Select(x =>
{
    string first = x.Substring(0, 1);
    last = last == "" ? first : last;
    if (first.CompareTo(last) > 0)
    {
        sets[sets.Count - 1].Add(x);
    }
    else
    {
        sets.Add(sets.Count, new List<string>());
        sets[sets.Count - 1].Add(x);
    }
    return x;
}).ToList();
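Since you asked about processing the sets in a foreach loop afterwards, a possible way looks like this (a sketch; the output formatting is just for illustration):

// Iterate the sets in insertion order (keys are 0, 1, 2, ...).
foreach (var kvp in sets)
{
    Console.WriteLine("Set " + kvp.Key + ":");
    foreach (var row in kvp.Value)
        Console.WriteLine("  " + row);
}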

C#: How to sort a large CSV file with 10 columns, based on the 5th column (Period column), without loading it into memory

How do I sort a large CSV file with 10 columns?
The sorting should be based on data type, for example string, Date, integer, etc.
Assume we need to sort based on the 5th column (the Period column).
As it is a large CSV file, we have to do this without loading the whole file into memory.
I tried using logparser, but beyond a certain size it throws an error saying
"log parser tool has stopped working"
So please suggest any algorithm which I can implement in C#, or any other component or code which can help me.
Thanks in advance
Do know that running a program without memory is hard, especially if you have an algorithm that by its nature requires memory allocation.
I've looked at the External sort method mentioned by Jim Menschel and this is my implementation.
I didn't implement sorting on the fifth field but left some hints in the code so you can add that yourself.
This code reads a file line by line and creates a new file in a temporary directory for each line. Then we open two of those files and create a new target file. After reading a line from each of the two open files, we can compare them (or their fields). Based on that comparison we write the smaller one to the target file and read the next line from the file it came from.
Although this doesn't keep many strings in memory, it is hard on the disk drive. I checked the NTFS limits and 50,000,000 files is within the specs.
Here are the main methods of the class:
Main entry point
This takes the file to be sorted:
public void Sort(string file)
{
    Directory.CreateDirectory(sortdir);
    Split(file);
    var sortedFile = SortAndCombine();
    // if you feel confident you can overwrite the original file
    File.Move(sortedFile, file + ".sorted");
    Directory.Delete(sortdir);
}
Split the file
Split the file into a new file for each line.
Yes, that will be a lot of files, but it guarantees the least amount of memory used. It is easy to optimize though: read a couple of lines, sort those, and write them to a single file (see the sketch after the code below).
void Split(string file)
{
    using (var sr = new StreamReader(file, Encoding.UTF8))
    {
        var line = sr.ReadLine();
        while (!String.IsNullOrEmpty(line))
        {
            // whatever you do, make sure the file you write
            // is ordered; just writing a single line is the easiest
            using (var sw = new StreamWriter(CreateUniqueFilename()))
            {
                sw.WriteLine(line);
            }
            line = sr.ReadLine();
        }
    }
}
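As mentioned above, a simple optimization is to read a batch of lines, sort that batch in memory, and write it out as one pre-sorted chunk. A rough sketch of that variant follows; the batch size of 1000 is an arbitrary assumption, and it still compares whole lines rather than the parsed 5th field:

// Hypothetical batched variant of Split: fewer, larger, pre-sorted chunk files.
// Assumes the same usings as the rest of the class (System.Collections.Generic, System.IO, System.Text).
void SplitBatched(string file, int batchSize = 1000)
{
    using (var sr = new StreamReader(file, Encoding.UTF8))
    {
        var batch = new List<string>(batchSize);
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            batch.Add(line);
            if (batch.Count == batchSize)
            {
                WriteSortedBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0) WriteSortedBatch(batch); // flush the last partial batch
    }
}

void WriteSortedBatch(List<string> batch)
{
    batch.Sort(StringComparer.Ordinal); // or compare the parsed 5th field here
    using (var sw = new StreamWriter(CreateUniqueFilename()))
        foreach (var l in batch)
            sw.WriteLine(l);
}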
Combine the files
Iterate over all the files, take one and the next one, and merge those two files:
string SortAndCombine()
{
    long processed; // keep track of how much we processed
    do
    {
        // iterate the folder
        var files = Directory.EnumerateFiles(sortdir).GetEnumerator();
        bool hasnext = files.MoveNext();
        processed = 0;
        while (hasnext)
        {
            processed++;
            // we have one
            string fileOne = files.Current;
            hasnext = files.MoveNext();
            if (hasnext)
            {
                // we have number two
                string fileTwo = files.Current;
                // do the work
                MergeSort(fileOne, fileTwo);
                hasnext = files.MoveNext();
            }
        }
    } while (processed > 1);
    var lastfile = Directory.EnumerateFiles(sortdir).GetEnumerator();
    lastfile.MoveNext();
    return lastfile.Current; // the single remaining file is the fully merged result
}
Merge and sort
Open two files and create one target file. Read a line from both of them and write the smaller of the two to the target file.
Keep doing that until both lines are null.
void MergeSort(string fileOne, string fileTwo)
{
    string result = CreateUniqueFilename();
    using (var srOne = new StreamReader(fileOne, Encoding.UTF8))
    {
        using (var srTwo = new StreamReader(fileTwo, Encoding.UTF8))
        {
            // I left the actual field parsing as an exercise for the reader
            string lineOne, lineTwo; // fieldOne, fieldTwo;
            using (var target = new StreamWriter(result))
            {
                lineOne = srOne.ReadLine();
                lineTwo = srTwo.ReadLine();
                // naive field parsing
                // fieldOne = lineOne.Split(';')[4];
                // fieldTwo = lineTwo.Split(';')[4];
                while (
                    !String.IsNullOrEmpty(lineOne) ||
                    !String.IsNullOrEmpty(lineTwo))
                {
                    // use your parsed field values here
                    if (lineOne != null && (lineTwo == null || lineOne.CompareTo(lineTwo) < 0))
                    {
                        target.WriteLine(lineOne);
                        lineOne = srOne.ReadLine();
                        // fieldOne = lineOne.Split(';')[4];
                    }
                    else
                    {
                        if (lineTwo != null)
                        {
                            target.WriteLine(lineTwo);
                            lineTwo = srTwo.ReadLine();
                            // fieldTwo = lineTwo.Split(';')[4];
                        }
                    }
                }
            }
        }
    }
    // all is processed, remove the input files
    File.Delete(fileOne);
    File.Delete(fileTwo);
}
Helper variable and method
There is one shared member for the temporary directory and a method for generating temporary unique filenames.
private string sortdir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));

string CreateUniqueFilename()
{
    return Path.Combine(sortdir, Guid.NewGuid().ToString("N"));
}
Memory analysis
I've created a small file with 5000 lines in it with the following code:
using (var sw = new StreamWriter("c:\\temp\\test1.txt"))
{
    for (int line = 0; line < 5000; line++)
    {
        sw.WriteLine(Guid.NewGuid().ToString());
    }
}
I then ran the sorting code under the memory profiler on my box with Windows 10, 4 GB RAM and a spinning disk. The object lifetime view shows, as expected, a lot of String, char[] and byte[] allocations, but none of those survived a Gen 0 collection, which means they are all short-lived, and I don't expect this to be a problem if the number of lines to sort increases.
This is the simplest solution that works for me. From here, easy alterations and improvements are possible, leading to even less memory consumption, fewer allocations, or higher speed. Make sure to measure, select the area where you can make the biggest impact, and compare successive results. That should give you the optimum between memory usage and performance.
Instead of reading the CSV completely you can simply index it:
Read the unsorted CSV line by line and remember the 5th element (column) value plus something to identify the line later: the line number, or the offset of the line from the beginning of the file and its size.
You will have some kind of List<Tuple<string, ...>>. Sort that:
var sortedList = unsortedList.OrderBy(item => item.Item1);
Now you can create the sorted CSV by enumerating the sorted list, reading each line from the source file, and appending it to the new CSV:
using (var sortedCSV = File.AppendText(newCSVFileName))
foreach(var item in sortedList)
{
... // read line from unsorted csv using item.Item2, etc.
sortedCSV.WriteLine(...);
}
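A fuller sketch of that indexing approach, under the assumption that the keys fit in memory even though the rows don't. Line numbers are used as the identifier here for simplicity; all file names are illustrative:

// Pass 1: remember (5th column value, line number) for every row.
var index = new List<Tuple<string, int>>();
int lineNo = 0;
foreach (var line in File.ReadLines("unsorted.csv"))
{
    index.Add(Tuple.Create(line.Split(',')[4], lineNo));
    lineNo++;
}

// Sort only the index - the full rows stay on disk.
var sorted = index.OrderBy(t => t.Item1).ToList();

// Pass 2 (naive): for each index entry, re-read the file up to that line.
// This is O(n^2) reads; storing byte offsets instead of line numbers
// and seeking with a FileStream would avoid the repeated scans.
using (var output = new StreamWriter("sorted.csv"))
{
    foreach (var entry in sorted)
    {
        var row = File.ReadLines("unsorted.csv").Skip(entry.Item2).First();
        output.WriteLine(row);
    }
}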

Get line which starts with some number

I have a file that I have to process, but I have to pick just the last line of the file and check if this line begins with the number 9. How can I do this using LINQ?
This record, which begins with the number 9, can sometimes not be the last line of the file, because the last line can be a \r\n.
I made one simple attempt at this:
var lines = File.ReadAllLines(file);
for (int i = 0; i < lines.Length; i++)
{
    if (lines[i].StartsWith("9"))
    {
        //...
    }
}
But I want to know if it is possible to make something faster... or better, using LINQ... :)
string output = File.ReadAllLines(path)
                    .Last(x => !Regex.IsMatch(x, @"^[\r\n]*$"));
if (output.StartsWith("9")) // found
The other answers are fine, but the following is more intuitive to me (I love self-documenting code):
Edit: misinterpreted your question, updating my example code to be more appropriate
var nonEmptyLines =
    from line in File.ReadAllLines(path)
    where !String.IsNullOrEmpty(line.Trim())
    select line;

if (nonEmptyLines.Any())
{
    var lastLine = nonEmptyLines.Last();
    if (lastLine.StartsWith("9")) // or char.IsDigit(lastLine.First()) for 'any number'
    {
        // Your logic here
    }
}
You don't need LINQ; something like the following should work:
var fileLines = File.ReadAllLines("yourpath");
if (char.IsDigit(fileLines[fileLines.Length - 1][0]))
{
    // last line starts with a digit.
}
Or for checking against specific digit 9 you can do:
if(fileLines.Last().StartsWith("9"))
if(list.Last(x =>!string.IsNullOrWhiteSpace(x)).StartsWith("9"))
{
}
Since you need to check the last two lines (in case the last line is a newline), you can do this. You can change lines to however many last lines you want to check.
int lines = 2;
if(File.ReadLines(file).Reverse().Take(lines).Any(x => x.StartsWith("9")))
{
//one of the last X lines starts with 9
}
else
{
//none of the last X lines start with 9
}

reading a string every 2 lines c#

suppose this is my txt file:
line1
line2
line3
line4
line5
I'm reading the content of this file with:
string line;
List<string> stdList = new List<string>();
StreamReader file = new StreamReader(myfile);
try
{
    while ((line = file.ReadLine()) != null)
    {
        stdList.Add(line);
    }
}
finally
{
    // need help here
}
Now I want to read the data in stdList, but only read the value on every 2nd line (in this case I have to read "line2" and "line4").
Can anyone put me on the right track?
Even shorter than Yuck's approach and it doesn't need to read the whole file into memory in one go :)
var list = File.ReadLines(filename)
               .Where((ignored, index) => index % 2 == 1)
               .ToList();
Admittedly it does require .NET 4. The key part is the overload of Where which provides the index as well as the value for the predicate to act on. We don't really care about the value (which is why I've named the parameter ignored) - we just want odd indexes. Obviously we care about the value when we build the list, but that's fine - it's only ignored for the predicate.
You can simplify your file read logic into one line, and just loop through every other line this way:
var lines = File.ReadAllLines(myFile);
for (var i = 1; i < lines.Length; i += 2) {
    // do something
}
EDIT: Starting at i = 1 which is line2 in your example.
Add a conditional block and a tracking counter around your loop. Declare the counter before the loop; the body of the loop is as follows:
int linesProcessed = 0; // declared before the loop

// inside the loop:
if (linesProcessed % 2 == 1)
{
    // Read the line.
    stdList.Add(line);
}
else
{
    // Don't read the line (do nothing).
}
linesProcessed++;
The line linesProcessed % 2 == 1 says: take the number of lines we have processed already, and find the mod 2 of this number. (The remainder when you divide that integer by 2.) That will check to see if the number of lines processed is even or odd.
If you have processed no lines, it will be skipped (such as line 1, your first line.) If you have processed one line or any odd number of lines already, go ahead and process this current line (such as line 2.)
If modular math gives you any trouble, see the question: https://stackoverflow.com/a/90247/758446
try this:
string line;
List<string> stdList = new List<string>();
StreamReader file = new StreamReader(myfile);
try
{
    while ((line = file.ReadLine()) != null)
    {
        stdList.Add(line);
        var trash = file.ReadLine(); // this advances to the next line, and doesn't do anything with the result
    }
}
finally
{
    file.Dispose();
}

String Builder vs Lists

I am reading in multiple files with millions of lines each, and I am creating a list of all line numbers that have a specific issue. For example, if a specific field is left blank or contains an invalid value.
So my question is: what would be the most efficient data type to keep track of a list of numbers that could run to upwards of a million entries? Would using StringBuilder, List, or something else be more efficient?
My end goal is to output a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51", etc. So in the case of a StringBuilder, I would check the previous value and if it is only 1 more I would change it from 1 to 1-2, and if it was more than one I would separate it with a comma. With the List, I would just add each number to the list and then combine them once the file has been completely read. However, in this case I could have multiple lists containing millions of numbers.
Here is the current code I am using to combine a list of numbers using StringBuilder:
string currentLine = sbCurrentLineNumbers.ToString();
string currentLineSub;
StringBuilder subCurrentLine = new StringBuilder();
StringBuilder subCurrentLineSub = new StringBuilder();
int indexLastSpace = currentLine.LastIndexOf(' ');
int indexLastDash = currentLine.LastIndexOf('-');
int currentStringInt = 0;

if (sbCurrentLineNumbers.Length == 0)
{
    sbCurrentLineNumbers.Append(lineCount);
}
else if (indexLastSpace == -1 && indexLastDash == -1)
{
    currentStringInt = Convert.ToInt32(currentLine);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace > indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastSpace);
    currentStringInt = Convert.ToInt32(currentLineSub);
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Append("-" + lineCount);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
else if (indexLastSpace < indexLastDash)
{
    currentLineSub = currentLine.Substring(indexLastDash + 1);
    currentStringInt = Convert.ToInt32(currentLineSub);
    string charOld = currentLineSub;
    string charNew = lineCount.ToString();
    if (currentStringInt == lineCount - 1)
        sbCurrentLineNumbers.Replace(charOld, charNew);
    else
    {
        sbCurrentLineNumbers.Append(", " + lineCount);
        commaCounter++;
    }
}
My end goal is to out put a message like "Specific field is blank on 1-32, 40, 45, 47, 49-51
If that's the end goal, no point in going through an intermediary representation such as a List<int> - just go with a StringBuilder. You will save on memory and CPU that way.
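As an illustration, one way to build the ranges directly with a StringBuilder is to track the previous matching line number as an int instead of re-parsing the builder's contents each time. This is only a sketch; the class and variable names are made up:

// A sketch: collapses consecutive line numbers into "a-b" ranges as they arrive,
// so only the formatted text is kept in memory, never a full List<int>.
// Requires: using System.Text;
class LineRangeBuilder
{
    private readonly StringBuilder sb = new StringBuilder();
    private int rangeStart = -1, previous = -1;

    public void Add(int lineNumber)
    {
        if (previous == -1)
        {
            rangeStart = lineNumber;               // first number seen
        }
        else if (lineNumber != previous + 1)
        {
            AppendRange(sb, rangeStart, previous); // close the finished run
            rangeStart = lineNumber;
        }
        previous = lineNumber;
    }

    public override string ToString()
    {
        var result = new StringBuilder(sb.ToString());
        if (previous != -1)
            AppendRange(result, rangeStart, previous); // include the open run
        return result.ToString();
    }

    private static void AppendRange(StringBuilder target, int start, int end)
    {
        if (target.Length > 0) target.Append(", ");
        target.Append(start == end ? start.ToString() : start + "-" + end);
    }
}

For example, calling Add(1) through Add(32), then Add(40), then Add(45), and finally ToString() yields "1-32, 40, 45".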
StringBuilder serves your purpose so stick with that, if you ever need the line numbers you can easily change the code then.
It depends on how you can or want to break the code up.
Given you are reading it in line order, I'm not sure you need a list at all.
Your current desired output implies that you can't output anything until the file is completely scanned. The size of the file suggests a one-pass analysis phase would be a good idea as well, given you are going to use buffered input as opposed to reading the entire thing into memory.
I'd be tempted to use an enum to describe the issue (e.g. Field??? is blank) and then use that as the key of a dictionary of StringBuilders, as sketched below.
As a first thought, anyway.
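A rough sketch of that idea; the enum members, the field index being checked, and the file name are all assumptions for illustration, not anything from the original answer:

// Hypothetical issue categories, each mapped to its own builder of line numbers.
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

enum Issue { FieldBlank, InvalidValue }

static class IssueScan
{
    static void Main()
    {
        var issues = new Dictionary<Issue, StringBuilder>
        {
            { Issue.FieldBlank, new StringBuilder() },
            { Issue.InvalidValue, new StringBuilder() }
        };

        int lineNumber = 0;
        foreach (var line in File.ReadLines("input.csv"))
        {
            lineNumber++;
            var fields = line.Split(',');
            if (fields.Length > 4 && string.IsNullOrEmpty(fields[4]))
            {
                var sb = issues[Issue.FieldBlank];
                if (sb.Length > 0) sb.Append(", ");
                sb.Append(lineNumber); // range collapsing (1-32, ...) could be layered on here
            }
        }

        Console.WriteLine("Specific field is blank on " + issues[Issue.FieldBlank]);
    }
}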
Is your output supposed to be human readable? If so, you'll hit the limit of what is reasonable to read, long before you have any performance/memory issues from your data structure. Use whatever is easiest for you to work with.
If the output is supposed to be machine readable, then that output might suggest an appropriate data structure.
As others have pointed out, I would probably use StringBuilder. The List may have to resize many times; the new implementation of StringBuilder does not have to resize.
