I have a CSV that looks like this
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
although there are 5 billion records. Looking at the first column and the date part of the second column, you can see that three of the records are 'grouped' together; they are just a breakdown into 15-minute intervals for the first 30 minutes of that day.
I want the output to look like
783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
Here the first 4 columns of the repeating rows are omitted and the remaining columns are combined with the first record of its kind. Basically I am converting the data from one line per 15 minutes to one line per day.
Since I will be processing 5 billion records, I think the best approach is regular expressions (and EmEditor) or some tool that is made for this (multithreaded, optimized), rather than a custom programmed solution. Although I am open to ideas in Node.js or C# that are relatively simple and super quick.
How can this be done?
If there's always a set number of records and they're in order, it'd be fairly easy to just read a few lines at a time, parse them, and output them. Trying to run regexes over billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write these large files, since they read and write one line at a time.
using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    var lineCountToGroup = 3; // change to 96 for a full day of 15-minute intervals
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
            lines.Add(sr.ReadLine());

        // grouping: keep the first line whole, then append columns 5 onward of each following line
        var groupedLine = lines[0] + "," + string.Join(",",
            lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4))));
        sw.WriteLine(groupedLine);
    }
}
Disclaimer: untested code with no error handling, and assuming that there are indeed the correct number of lines repeated, etc. You'd obviously need to make some tweaks for your exact scenario.
You could do something like this (untested code without any error handling - but should give you the general gist of it):
using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine(); // note: should add error handling for empty files
    var cells = line.Split(','); // note: you should probably check the length too!
    var key = cells[0]; // use this to match other rows (with your sample data you may want to include the date part of the 2nd column as well)
    StringBuilder output = new StringBuilder(line); // this is the output line we build
    while ((line = sin.ReadLine()) != null) // while we have more lines
    {
        cells = line.Split(','); // split so we can get the first column
        if (cells[0] == key) // if the first column matches the current key
        {
            output.Append(',').Append(String.Join(",", cells.Skip(4))); // add this row to our output line
        }
        else
        {
            // once the key changes
            sout.WriteLine(output.ToString()); // write out the line we've built up
            output.Clear();
            output.Append(line); // start building the next output line
            key = cells[0]; // and update the key
        }
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString()); // we'll have just the last line to write out
}
The idea is to loop through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you've been building up and update the key. This way you don't have to worry about exactly how many matches you have or if you might be missing a few points.
One note: it might be more efficient to use a StringBuilder for the output rather than a String if you are going to concatenate 96 rows.
Define a ProcessOutputLine method to store each merged line.
Call ProcessInputLine after each ReadLine; at end of file, call ProcessOutputLine once more to flush the last merged line.
string curKey = "";
int keyLength = ...; // set to the total length of the first 4 columns
string outputLine = "";

private void ProcessInputLine(string line)
{
    string newKey = line.Substring(0, keyLength);
    if (newKey == curKey)
    {
        outputLine += line.Substring(keyLength);
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);
        curKey = newKey;
        outputLine = line;
    }
}
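For completeness, a minimal sketch of the calling loop described above (the file name here is just a placeholder):

using (var reader = new StreamReader("inputFile.csv"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
        ProcessInputLine(line);
}
if (outputLine != "") ProcessOutputLine(outputLine); // flush the last merged line at end of file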
EDIT: this solution is very similar to that of Matt Burland; the only noticeable difference is that I don't use the Split function.
Forgive my noobiness, but I just need some guidance and I can't find another question that answers this. I have a fairly large CSV file (~300k rows) and I need to determine, for a given input, whether any line in the CSV begins with that input. I have sorted the CSV alphabetically, but I don't know:
1) how to process the rows in the CSV: should I read it in as a list/collection, or use OLEDB, or an embedded database, or something else?
2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics for a concrete answer, but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = #"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
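{
    // Sketch (not from the original answer): open the connection, run the query above,
    // and see whether anything matched. Assumes a reference to System.Data and that
    // "Column1" matches the actual column/header name in your file.
    connection.Open();
    using (OleDbCommand command = new OleDbCommand(sql, connection))
    using (OleDbDataReader reader = command.ExecuteReader())
    {
        bool anyMatch = reader.Read(); // true if at least one row started with "blah"
    }
}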
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionary<long, string>();
...
if(Rows.ContainsKey(search)) ...
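A possible way to populate that dictionary from the file (a sketch; it assumes the key column is the first one, that it parses as a long, that there is no header row, and the file name is a placeholder):

foreach (string line in File.ReadLines("data.csv"))
{
    // key = first column parsed as a long, value = the whole row
    long key = long.Parse(line.Substring(0, line.IndexOf(',')));
    Rows[key] = line;
}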
IF you want your search to be a partial match like StartsWith, then have one array containing your searchable data (i.e. the first column) and another list or array containing your row data. Then use C#'s built-in binary search: http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables;
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if(foundIdx < 0) {
foundIdx = ~foundIdx;
if(foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
result = SortedRows[foundIdx];
}
} else {
result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
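For completeness, one way the two parallel collections might be built up front (a sketch, not part of the original answer; it assumes the file is already sorted on its first column, has no header row, and the file name is a placeholder):

string[] allLines = File.ReadAllLines("data.csv"); // already sorted on the first column
SortedRows = new List<string>(allLines);
SortedSearchables = new string[allLines.Length];
for (int i = 0; i < allLines.Length; i++)
    SortedSearchables[i] = allLines[i].Substring(0, allLines[i].IndexOf(',')); // searchable key = first column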
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser(@"C:\test.csv")) // requires a reference to Microsoft.VisualBasic (Microsoft.VisualBasic.FileIO)
{
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
while (!parser.EndOfData)
{
try
{
string[] fields = parser.ReadFields();
data[fields[0]] = fields;
}
catch (MalformedLineException ex)
{
// ...
}
}
}
And then you could get the data for any item, like this:
string[] fields = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
using (StreamReader fs = new StreamReader(fileName))
{
long min = 0; // TODO: What about header row?
long max = fs.BaseStream.Length;
while (min <= max)
{
long mid = (min + max) / 2;
fs.BaseStream.Position = mid;
fs.DiscardBufferedData();
if (mid != 0) fs.ReadLine();
string line = fs.ReadLine();
if (line == null) { min = mid+1; continue; }
int compareResult;
if (line.Length > search.Length)
compareResult = String.Compare(
line, 0, search, 0, search.Length, false );
else
compareResult = String.Compare(line, search);
if (0 == compareResult) return line;
else if (compareResult > 0) max = mid-1;
else min = mid+1;
}
}
return null;
}
This runs in 0.007 seconds for a 600,000-record test file that's 50 megs. In comparison, a file scan averages over half a second, depending on where the record is located (a 100-fold difference).
Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
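As a rough sketch of that partial-caching idea (the class and member names here are illustrative, not part of the original answer): hold one StreamReader open and rerun the same binary search against it, resetting min and max on each call.

class CachedRecordFinder : IDisposable
{
    private readonly StreamReader _reader; // kept open between searches

    public CachedRecordFinder(string fileName)
    {
        _reader = new StreamReader(fileName);
    }

    public string Find(string search)
    {
        // Same binary search as FindRecordBinary above, but reusing _reader;
        // min and max are reset on every call instead of reopening the file.
        long min = 0;
        long max = _reader.BaseStream.Length;
        while (min <= max)
        {
            long mid = (min + max) / 2;
            _reader.BaseStream.Position = mid;
            _reader.DiscardBufferedData();
            if (mid != 0) _reader.ReadLine(); // skip the partial line we landed in
            string line = _reader.ReadLine();
            if (line == null) { min = mid + 1; continue; }

            int compareResult = line.Length > search.Length
                ? string.Compare(line, 0, search, 0, search.Length, false)
                : string.Compare(line, search);

            if (compareResult == 0) return line;
            if (compareResult > 0) max = mid - 1;
            else min = mid + 1;
        }
        return null;
    }

    public void Dispose() { _reader.Dispose(); }
}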
Given that the CSV is sorted, if you can load the entire thing into memory (and the only processing you need is a .StartsWith() on each line), you can use a binary search to get exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(@"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
...
public class StartsWithComparer: IComparer<string>
{
public int Compare(string x, string y)
{
if(x.StartsWith(y))
return 0;
else
return x.CompareTo(y);
}
}
I wrote this quickly for work, could be improved on...
Define the column numbers:
private enum CsvCols
{
PupilReference = 0,
PupilName = 1,
PupilSurname = 2,
PupilHouse = 3,
PupilYear = 4,
}
Define the Model
public class ImportModel
{
public string PupilReference { get; set; }
public string PupilName { get; set; }
public string PupilSurname { get; set; }
public string PupilHouse { get; set; }
public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();
var pupils = rows.Select(x => new ImportModel
{
PupilReference = x[(int) CsvCols.PupilReference],
PupilName = x[(int) CsvCols.PupilName],
PupilSurname = x[(int) CsvCols.PupilSurname],
PupilHouse = x[(int) CsvCols.PupilHouse],
PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();
This returns a list of strongly typed objects.
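For the lookup asked about in the question, the list could then be queried like this (a sketch; it assumes the search is against the pupil reference column and that System.Linq is imported):

bool exists = pupils.Any(p => p.PupilReference.StartsWith(input)); // true if any row begins with the input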
If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code on this question on CodeReview, just change the comparer to work with string instead of int and to check only the beginning of each line.
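As an illustration of that approach, a minimal sketch over an in-memory array of sorted lines, where the comparison only ever looks at the beginning of each line (the names are illustrative and an ordinal sort is assumed):

// Returns true if any sorted line starts with the given prefix.
static bool ContainsLineStartingWith(string[] sortedLines, string prefix)
{
    int lo = 0, hi = sortedLines.Length - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        string line = sortedLines[mid];
        if (line.StartsWith(prefix, StringComparison.Ordinal))
            return true;                               // a prefix match counts as "found"
        if (string.CompareOrdinal(line, prefix) < 0)
            lo = mid + 1;                              // look in the upper half
        else
            hi = mid - 1;                              // look in the lower half
    }
    return false;
}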
If you have to re-read the file each time because it may be changed or it's saved/sorted by another program then the most simple algorithm is the best one:
using (var stream = File.OpenText(path))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        // Replace this with your comparison / CSV splitting
        if (line.StartsWith("..."))
        {
            // The file contains the line with required input
            break;
        }
    }
}
Of course you may read the entire file into memory (to use LINQ or List<T>.BinarySearch()) each time, but this is far from optimal (you'll read everything even if you may only need to examine a few lines) and the file itself could even be too large.
If you really need something more and your file isn't already in memory because of the sorting step (but you should profile your actual performance against your requirements), you'll have to implement a better search algorithm, for example the Boyer-Moore algorithm.
The OP stated they really just need to search based on the line.
The question is then whether to hold the lines in memory or not.
If a line is 1 KB, that's about 300 MB of memory.
If a line is 1 MB, that's about 300 GB of memory.
StreamReader.ReadLine will have a low memory profile, and since the file is sorted you can stop looking once a line is greater than the search term.
If you hold it in memory, then a simple List<String> with LINQ will work (see the sketch below).
LINQ is not smart enough to take advantage of the sort, but against 300K lines it would still be pretty fast.
BinarySearch will take advantage of the sort.
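A sketch of both options against an in-memory, sorted List<string> (illustrative names; the early exit assumes the file was sorted ordinally and System.Linq is imported):

// Plain LINQ scan: correct, but does not exploit the sort.
bool existsLinq = lines.Any(line => line.StartsWith(input));

// Early-exit scan: stop as soon as the lines sort past the search term.
bool existsEarlyExit = false;
foreach (string line in lines)
{
    if (line.StartsWith(input)) { existsEarlyExit = true; break; }
    if (string.CompareOrdinal(line, input) > 0) break; // sorted: nothing later can match
}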
Try the free CSV Reader. No need to reinvent the wheel over and over again ;)
1) If you do not need to store the results, just iterate through the CSV: handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key, of course)
2) Try generic extension methods like these
var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWith("a"));
Here is my VB.net Code. It is for a Quote Qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})
Dim path as String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P in pat _
Let n = P.Split(New Char() {""","""}) _
Order by n(5) _
Select New With {
.Doc =n(1), _
.Loc = n(3), _
.Chart = n(5), _
.PatientID= n(31), _
.Title = n(13), _
.FirstName = n(9), _
.MiddleName = n(11), _
.LastName = n(7),
.StatusID = n(41) _
}
Patz.Dump()
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spent parsing CSV data before this is determined is time wasted. You just need code that simply matches text against text, and you can do that with a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you to speed things up tremendously... but you need to be aware that to take advantage of it you will need to write your own code to make seek calls on low-level file streams. This will be by far your best-performing option, but it will also require by far the most initial work and maintenance.
I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
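That adaptation might look roughly like this (a sketch with made-up names, not the linked reader's actual API; the point is only that the raw line is compared before any CSV parsing is paid for):

using (var reader = new StreamReader(csvPath))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (!line.StartsWith(searchText)) // cheap string test first
            continue;

        // only now do the (more expensive) CSV work for this line;
        // a naive Split is shown for brevity, a real CSV parser would go here
        string[] fields = line.Split(',');
        // ... use fields ...
    }
}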
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.