Select rows from CSV based on a list of IDs - c#

I have a task of extracting a few hundred thousand rows from CSV files where the row contains a specified ID. So I have about 300,000 IDs stored in a string List and need to extract any row in the CSV that contains any of these IDs.
At the minute I am using a Linq statement to see if each row contains any of the IDs in the List:
using (StreamReader sr = new StreamReader(csvFile))
{
string inLine = sr.ReadLine();
if(searchStrings.Any(sr.ReadLine().Contains))
{
stremWriter.Write(inLine);
}
}
This kind of works ok but it is Very slow since there are 300,000 values in the searchStrings List and a few million rows in the CSVs that I need to search.
Does anyone know how to make this search more efficient to speed it up?
Or an alternative method for extracting the required rows?
Thanks

I've faced a similarish problem before, I had to iterate through a several hundred thousand line .csv and parse each row.
I went with a threaded approach where I tried to do the reading and parsing simultaneously in batches.
Here's roughly how I did it;
using System.Collections.Concurrent; using System.Threading;
private static ConcurrentBag<String> items = new ConcurrentBag<String>();
private static List<String> searchStrings;
static void Main(string[] args)
{
using (StreamReader sr = new StreamReader(csvFile))
{
const int buffer_size = 10000;
string[] buffer = new string[buffer_size];
int count = 0;
String line = null;
while ((line = sr.ReadLine()) != null)
{
buffer[count] = line;
count++;
if (count == buffer_size)
{
new Thread(() =>
{
find(buffer);
}).Start();
buffer = new String[buffer_size];
count = 0;
}
}
if (count > 0)
{
find(buffer);
}
//some kind of sync here, can be done with a bool - make sure all the threads have finished executing
foreach (var str in searchStrings)
streamWriter.write(str);
}
}
private static void find(string[] buffer)
{
//do your search algorithm on the array of strings
//add to the concurrentbag if they match
}
I just quickly threw this code together from what I remember doing before so it might not be entirely correct. Doing it like this certainly speeds things up though (with very large files at least).
The idea is to always be reading from the hdd as string parsing can be pretty expensive, and thus batching the work on multiple cores can make it significantly faster.
With this, I was able to parse (splitting each line into about 50 items and parsing the key/value pairs and building objects in memory from them - by far the most time consuming part) around 250k lines in just over 7s.

Just throwing this out there, it's not specifically relevant to any of the tags on your question but the *nix "grep -f" functionality would work here. Essentially, you'd have a file with the list of strings you want to match (e.g., StringsToFind.txt) and you'd have your csv input file (e.g., input.csv) and the following command would output the matching lines to output.csv
grep -f StringsToFind.txt input.csv > output.csv
See grep man page for more details.

Related

C# :- How to sort a large csv file with 10 columns. Based on 5th column (Period Column). Without memory

How to sort a large csv file with 10 columns?
The sorting should be based on data type for example, string, Date, integer etc
Assuming Based on 5th column (Period Column) we need to sort.
As it is large CSV file, Without loading the same in memory we have to do.
I tried using logparser, but beyond certain size it throws error saying
"log parser tool has stopped working"
So please suggest any algorithm which i can implement in c#. Or if there is any other component or code which can help me.
Thanks in advance
Do know that running a program without memory is hard, specially if you have an algorithm that by its nature requires memory allocation.
I've looked at the External sort method mentioned by Jim Menschel and this is my implementation.
I didn't implement sorting on the fifth field but left some hints in the code so you can add that yourself.
This code reads a file, line by line and creates, in a temporary directory for each line a new file. Then we open two of those files and create a new target file. After reading a line from the two open files, we can compare them (or their fields). Based on their comparison we write the smallest one to the target file and read the next line from the file we used.
Although this doesn't keep much strings in memory it is hard on the diskdrive. I checked the NTFS limits and 50,000,000 files is within the specs.
Here are the main methods of the class:
main entry point
This take the file to be sorted
public void Sort(string file)
{
Directory.CreateDirectory(sortdir);
Split(file);
var sortedFile = SortAndCombine();
// if you feel confident you can override the original file
File.Move(sortedFile, file + ".sorted");
Directory.Delete(sortdir);
}
Split file
Split the file in a new file for each line
Yes, that will be a lot of files but it guarantees the least amount of memory used. It is easy to optimize though, read a couple of lines, sort those and write to a file.
void Split(string file)
{
using (var sr = new StreamReader(file, Encoding.UTF8))
{
var line = sr.ReadLine();
while (!String.IsNullOrEmpty(line))
{
// whatever you do, make sure this file your writed
// is ordered, just writing a single line is the easiest
using (var sw = new StreamWriter(CreateUniqueFilename()))
{
sw.WriteLine(line);
}
line = sr.ReadLine();
}
}
}
Combine the files
Iterate over all files and take one and the next one, merge those files
string SortAndCombine()
{
long processed; // keep track of how much we processed
do
{
// iterate the folder
var files = Directory.EnumerateFiles(sortdir).GetEnumerator();
bool hasnext = files.MoveNext();
processed = 0;
while (hasnext)
{
processed++;
// have one
string fileOne = files.Current;
hasnext = files.MoveNext();
if (hasnext)
{
// we have number two
string fileTwo = files.Current;
// do the work
MergeSort(fileOne, fileTwo);
hasnext = files.MoveNext();
}
}
} while (processed > 1);
var lastfile = Directory.EnumerateFiles(sortdir).GetEnumerator();
lastfile.MoveNext();
return lastfile.Current; // by magic is the name of the last file
}
Merge and Sort
Open two files and create one target file. Read a line from both of these and write sthe mallest of the two to the target file.
Keep doing that until both lines are null
void MergeSort(string fileOne, string fileTwo)
{
string result = CreateUniqueFilename();
using(var srOne = new StreamReader(fileOne, Encoding.UTF8))
{
using(var srTwo = new StreamReader(fileTwo, Encoding.UTF8))
{
// I left the actual field parsing as an excersise for the reader
string lineOne, lineTwo; // fieldOne, fieldTwo;
using(var target = new StreamWriter(result))
{
lineOne = srOne.ReadLine();
lineTwo = srTwo.ReadLine();
// naive field parsing
// fieldOne = lineOne.Split(';')[4];
// fieldTwo = lineTwo.Split(';')[4];
while(
!String.IsNullOrEmpty(lineOne) ||
!String.IsNullOrEmpty(lineTwo))
{
// use your parsed fieldValues here
if (lineOne != null && (lineOne.CompareTo(lineTwo) < 0 || lineTwo==null))
{
target.WriteLine(lineOne);
lineOne = srOne.ReadLine();
// fieldOne = lineOne.Split(';')[4];
}
else
{
if (lineTwo!=null)
{
target.WriteLine(lineTwo);
lineTwo = srTwo.ReadLine();
// fieldTwo = lineTwo.Split(';')[4];
}
}
}
}
}
}
// all is perocessed, remove the input files.
File.Delete(fileOne);
File.Delete(fileTwo);
}
Helper variable and method
There is one shared member for the temporary directory and a method for generating temporary unique filenames.
private string sortdir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
string CreateUniqueFilename()
{
return Path.Combine(sortdir, Guid.NewGuid().ToString("N"));
}
Memory analysis
I've created a small file with 5000 lines in it with the following code:
using(var sw= new StreamWriter("c:\\temp\\test1.txt"))
{
for(int line=0; line<5000; line++)
{
sw.WriteLine(Guid.NewGuid().ToString());
}
}
I then ran the sorting code with the memory profiler. This is what the summary looked like on my box with Windows 10, 4GB RAM and a spinning disk:
The object lifetime shows as expected a lot of String, char[] and byte[] allocations, but none of those have survived a Gen 0 collection, which means they are all short lived and I don't expect this to be a problem if the number of lines to sort increases.
This is the simplest solution that works for me. From here easy alterations and improvements are possible, either leading to even less memory consumption, reduce allocations or a higher speed. Make sure to measure, select the area where you can make the biggest impact and compare successive results. That should give you the optimum between memory usage and performance.
Instead of reading CSV completely you can simply index it:
Read unsorted CSV line by line and remember 5th element (column) value and something to identify this line later: line number or offset of this line from beginning of the file and size.
You will have some kind of List<Tuple<string, ...>>. Sort that
var sortedList = unsortedList.OrderBy(item => item.Item1);
Now you can create sorted CSV by enumerating sorted list, reading line from source file and appending it to new CSV:
using (var sortedCSV = File.AppendText(newCSVFileName))
foreach(var item in sortedList)
{
... // read line from unsorted csv using item.Item2, etc.
sortedCSV.WriteLine(...);
}

Read last 30,000 lines of a file [duplicate]

This question already has answers here:
How to read last "n" lines of log file [duplicate]
(9 answers)
Closed 9 years ago.
If has a csv file whose data will increase by time to time. Now what i need to do is to read the last 30,000 lines.
Code :
string[] lines = File.ReadAllLines(Filename).Where(r => r.ToString() != "").ToArray();
int count = lines.Count();
int loopCount = count > 30000 ? count - 30000 : 0;
for (int i = loopCount; i < lines.Count(); i++)
{
string[] columns = lines[i].Split(',');
orderList.Add(columns[2]);
}
It is working fine but the problem is
File.ReadAllLines(Filename)
Read a complete file which causes performance lack. I want something like it only reads the last 30,000 lines which iteration through the complete file.
PS : i am using .Net 3.5 . Files.ReadLines() not exists in .Net 3.5
You can Use File.ReadLines() Method instead of using File.ReadAllLines()
From MSDN:File.ReadLines()
The ReadLines and ReadAllLines methods differ as follows:
When you use ReadLines, you can start enumerating the collection of strings before
the whole collection is returned; when you use ReadAllLines, you must
wait for the whole array of strings be returned before you can access
the array.
Therefore, when you are working with very large files,
ReadLines can be more efficient.
Solution 1 :
string[] lines = File.ReadAllLines(FileName).Where(r => r.ToString() != "").ToArray();
int count = lines.Count();
List<String> orderList = new List<String>();
int loopCount = count > 30000 ? 30000 : 0;
for (int i = count-1; i > loopCount; i--)
{
string[] columns = lines[i].Split(',');
orderList.Add(columns[2]);
}
Solution 2: if you are using .NET Framework 3.5 as you said in comments below , you can not use File.ReadLines() method as it is avaialble since .NET 4.0 .
You can use StreamReader as below:
List<string> lines = new List<string>();
List<String> orderList = new List<String>();
String line;
int count=0;
using (StreamReader reader = new StreamReader("c:\\Bethlehem-Deployment.txt"))
{
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
count++;
}
}
int loopCount = (count > 30000) ? 30000 : 0;
for (int i = count-1; i > loopCount; i--)
{
string[] columns = lines[i].Split(',');
orderList.Add(columns[0]);
}
You can use File.ReadLines by you can start enumerating the collection of strings before the whole collection is returned.
After that you can use the linq to make things lot more easier. Reverse will reverse the order of collection and Take will take the n number of items. Now put again Reverse to get the last n lines in original format.
var lines = File.ReadLines(Filename).Reverse().Take(30000).Reverse();
If you are using the .NET 3.5 or earlier you can create your own method which works same as File.ReadLines like this. Here is the code for the method originally written by #Jon
public IEnumerable<string> ReadLines(string file)
{
using (TextReader reader = File.OpenText(file))
{
string line;
while ((line = reader.ReadLine()) != null)
{
yield return line;
}
}
}
Now you can use linq over this function as well like the above statement.
var lines = ReadLines(Filename).Reverse().Take(30000).Reverse();
The problem is that you do not know where to start reading the file to get the last 30,000 lines. Unless you want to maintain a separate index of line offsets you can either read the file from the start counting lines only retaining the last 30,000 lines or you can start from the end counting lines backwards. The last approach can be efficient if the file is very large and you only want a few lines. However, 30,000 does not seem like "a few lines" so here is an approach that reads the file from the start and uses a queue to keep the last 30,000 lines:
var filename = #" ... ";
var linesToRead = 30000;
var queue = new Queue<String>();
using (var streamReader = File.OpenText(fileName)) {
while (!streamReader.EndOfStream) {
queue.Enqueue(streamReader.ReadLine());
if (queue.Count > linesToRead)
queue.Dequeue();
}
}
Now you can access the lines that are stored in queue. This class implements IEnumerable<String> allowing you to use foreach to iterate the lines. However, if you want random access you will have to use the ToArray method to convert the queue into an array which adds some overhead to the computation.
This solution is efficient in terms memory because at most 30,000 lines has to be kept in memory and the garbage collector can free any extra lines when required. Using File.ReadAllLines will pull all the lines into memory at once possibly increasing the memory required by the process.
Or I have a diffrent ideo for this.
Try splitting the csv to categories like A-D , E-G ....
and acces what first character you need .
Or you can split data with count of entites. Every file will contain 15.000 entites for example. And a text file which will contain tiny data about entits and location Like :
Txt File:
entitesID | inWhich.Csv
....

Read large txt file multithreaded?

I have large txt file with 100000 lines.
I need to start n-count of threads and give every thread unique line from this file.
What is the best way to do this? I think I need to read file line by line and iterator must be global to lock it. Loading the text file to list will be time-consuming and I can receive OutofMemory exception. Any ideas?
You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
// your code here
});
After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary() the results seem to support #dtb's answer above that using the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Looking at my CPU activity, it appears they both seem to use two out of my 8 cores?
Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
I also tried a producer / consumer or MapReduce style pattern where one thread was used to read the data and a second thread was used to process it. This also did not seem to outperform the simple pattern above.
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();
var readLines = Task.Factory.StartNew(() =>
{
foreach (var line in File.ReadLines(catalogPath))
inputLines.Add(line);
inputLines.CompleteAdding();
});
var processLines = Task.Factory.StartNew(() =>
{
Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
{
string[] lineFields = line.Split('\t');
int genomicId = int.Parse(lineFields[3]);
int taxId = int.Parse(lineFields[0]);
catalog.TryAdd(genomicId, taxId);
});
});
Task.WaitAll(readLines, processLines);
Here are my benchmarks:
I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.
Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set max size of the queue to prevent out of memory errors.
Something like:
public class ParallelReadExample
{
public static IEnumerable LineGenerator(StreamReader sr)
{
while ((line = sr.ReadLine()) != null)
{
yield return line;
}
}
static void Main()
{
// Display powers of 2 up to the exponent 8:
StreamReader sr = new StreamReader("yourfile.txt")
Parallel.ForEach(LineGenerator(sr), currentLine =>
{
// Do your thing with currentLine here...
} //close lambda expression
);
sr.Close();
}
}
Think it would work. (No C# compiler/IDE here)
If you want to limit the number of threads to n, the easiest way is to use AsParallel() along with WithDegreeOfParallelism(n) to limit the thread count:
string filename = "C:\\TEST\\TEST.DATA";
int n = 5;
foreach (var line in File.ReadLines(filename).AsParallel().WithDegreeOfParallelism(n))
{
// Process line.
}
As #dtb mentioned above, the fastest way to read a file and then process the individual lines in a file is to:
1) do a File.ReadAllLines() into an array
2) Use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);
Parallel.For(0, AllLines.Length, x =>
{
DoStuff(AllLines[x]);
//whatever you need to do
});
With the introduction of bigger array sizes in .Net4, as long as you have plenty of memory, this shouldn't be an issue.

Most efficient way to process a large csv in .NET

Forgive my noobiness but I just need some guidance and I can't find another question that answers this. I have a fairly large csv file (~300k rows) and I need to determine for a given input, whether any line in the csv begins with that input. I have sorted the csv alphabetically, but I don't know:
1) how to process the rows in the csv- should I read it in as a list/collection, or use OLEDB, or an embedded database or something else?
2) how to find something efficiently from an alphabetical list (using the fact that it's sorted to speed things up, rather than searching the whole list)
You don't give enough specifics to give you a concrete answer but...
IF the CSV file changes often then use OLEDB and just change the SQL query based on your input.
string sql = #"SELECT * FROM [" + fileName + "] WHERE Column1 LIKE 'blah%'";
using(OleDbConnection connection = new OleDbConnection(
#"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + fileDirectoryPath +
";Extended Properties=\"Text;HDR=" + hasHeaderRow + "\""))
IF the CSV file doesn't change often and you run a lot of "queries" against it, load it once into memory and quickly search it each time.
IF you want your search to be an exact match on a column use a Dictionary where the key is the column you want to match on and the value is the row data.
Dictionary<long, string> Rows = new Dictionar<long, string>();
...
if(Rows.ContainsKey(search)) ...
IF you want your search to be a partial match like StartsWith then have 1 array containing your searchable data (ie: first column) and another list or array containing your row data. Then use C#'s built in binary search http://msdn.microsoft.com/en-us/library/2cy9f6wb.aspx
string[] SortedSearchables = new string[];
List<string> SortedRows = new List<string>();
...
string result = null;
int foundIdx = Array.BinarySearch<string>(SortedSearchables, searchTerm);
if(foundIdx < 0) {
foundIdx = ~foundIdx;
if(foundIdx < SortedRows.Count && SortedSearchables[foundIdx].StartsWith(searchTerm)) {
result = SortedRows[foundIdx];
}
} else {
result = SortedRows[foundIdx];
}
NOTE code was written inside the browser window and may contain syntax errors as it wasn't tested.
If you can cache the data in memory, and you only need to search the list on one primary key column, I would recommend storing the data in memory as a Dictionary object. The Dictionary class stores the data as key/value pairs in a hash table. You could use the primary key column as the key in the dictionary, and then use the rest of the columns as the value in the dictionary. Looking up items by key in a hash table is typically very fast.
For instance, you could load the data into a dictionary, like this:
Dictionary<string, string[]> data = new Dictionary<string, string[]>();
using (TextFieldParser parser = new TextFieldParser("C:\test.csv"))
{
parser.TextFieldType = FieldType.Delimited;
parser.SetDelimiters(",");
while (!parser.EndOfData)
{
try
{
string[] fields = parser.ReadFields();
data[fields[0]] = fields;
}
catch (MalformedLineException ex)
{
// ...
}
}
}
And then you could get the data for any item, like this:
string fields[] = data["key I'm looking for"];
If you're only doing it once per program run, this seems pretty fast. (Updated to use StreamReader instead of FileStream based on comments below)
static string FindRecordBinary(string search, string fileName)
{
using (StreamReader fs = new StreamReader(fileName))
{
long min = 0; // TODO: What about header row?
long max = fs.BaseStream.Length;
while (min <= max)
{
long mid = (min + max) / 2;
fs.BaseStream.Position = mid;
fs.DiscardBufferedData();
if (mid != 0) fs.ReadLine();
string line = fs.ReadLine();
if (line == null) { min = mid+1; continue; }
int compareResult;
if (line.Length > search.Length)
compareResult = String.Compare(
line, 0, search, 0, search.Length, false );
else
compareResult = String.Compare(line, search);
if (0 == compareResult) return line;
else if (compareResult > 0) max = mid-1;
else min = mid+1;
}
}
return null;
}
This runs in 0.007 seconds for a 600,000 record test file that's 50 megs. In comparison a file-scan averages over half a second depending where the record is located. (a 100 fold difference)
Obviously if you do it more than once, caching is going to speed things up. One simple way to do partial caching would be to keep the StreamReader open and re-use it, just reset min and max each time through. This would save you storing 50 megs in memory all the time.
EDIT: Added knaki02's suggested fix.
Given the CSV is sorted - if you can load the entire thing into memory (If the only processing you need to do is a .StartsWith() on each line) - you can use a Binary search to have exceptionally fast searching.
Maybe something like this (NOT TESTED!):
var csv = File.ReadAllLines(#"c:\file.csv").ToList();
var exists = csv.BinarySearch("StringToFind", new StartsWithComparer());
...
public class StartsWithComparer: IComparer<string>
{
public int Compare(string x, string y)
{
if(x.StartsWith(y))
return 0;
else
return x.CompareTo(y);
}
}
I wrote this quickly for work, could be improved on...
Define the column numbers:
private enum CsvCols
{
PupilReference = 0,
PupilName = 1,
PupilSurname = 2,
PupilHouse = 3,
PupilYear = 4,
}
Define the Model
public class ImportModel
{
public string PupilReference { get; set; }
public string PupilName { get; set; }
public string PupilSurname { get; set; }
public string PupilHouse { get; set; }
public string PupilYear { get; set; }
}
Import and populate a list of models:
var rows = File.ReadLines(csvfilePath).Select(p => p.Split(',')).Skip(1).ToArray();
var pupils = rows.Select(x => new ImportModel
{
PupilReference = x[(int) CsvCols.PupilReference],
PupilName = x[(int) CsvCols.PupilName],
PupilSurname = x[(int) CsvCols.PupilSurname],
PupilHouse = x[(int) CsvCols.PupilHouse],
PupilYear = x[(int) CsvCols.PupilYear],
}).ToList();
Returns you a list of strongly typed objects
If your file is in memory (for example because you did sorting) and you keep it as an array of strings (lines) then you can use a simple bisection search method. You can start with the code on this question on CodeReview, just change the comparer to work with string instead of int and to check only the beginning of each line.
If you have to re-read the file each time because it may be changed or it's saved/sorted by another program then the most simple algorithm is the best one:
using (var stream = File.OpenText(path))
{
// Replace this with you comparison, CSV splitting
if (stream.ReadLine().StartsWith("..."))
{
// The file contains the line with required input
}
}
Of course you may read the entire file in memory (to use LINQ or List<T>.BinarySearch()) each time but this is far from optimal (you'll read everything even if you may need to examine just few lines) and the file itself could even be too large.
If you really need something more and you do not have your file in memory because of sorting (but you should profile your actual performance compared to your requirements) you have to implement a better search algorithm, for example the Boyer-Moore algorithm.
OP stated really just needs to search based on line.
The questions is then to hold the lines in memory or not.
If the line 1 k then 300 mb of memory.
If a line is 1 meg then 300 gb of memory.
Stream.Readline will have a low memory profile
Since it is sorted you can stop looking once it is greater than.
If you hold it in memory then a simple
List<String>
With LINQ will work.
LINQ is not smart enough to take advantage of the sort but against 300K would still be pretty fast.
BinarySearch will take advantage of the sort.
Try the free CSV Reader. No Need to invent the wheel over and over again ;)
1) If you do not need to store the results, just iterate though the CSV - handle each line and forget it. If you need to process all lines again and again, store them in a List or Dictionary (with a good key of course)
2) Try the generic extension methods like this
var list = new List<string>() { "a", "b", "c" };
string oneA = list.FirstOrDefault(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWidth("a"));
IEnumerable<string> allAs = list.Where(entry => !string.IsNullOrEmpty(entry) && entry.ToLowerInvariant().StartsWidth("a"));
Here is my VB.net Code. It is for a Quote Qualified CSV, so for a regular CSV, change Let n = P.Split(New Char() {""","""}) to Let n = P.Split(New Char() {","})
Dim path as String = "C:\linqpad\Patient.txt"
Dim pat = System.IO.File.ReadAllLines(path)
Dim Patz = From P in pat _
Let n = P.Split(New Char() {""","""}) _
Order by n(5) _
Select New With {
.Doc =n(1), _
.Loc = n(3), _
.Chart = n(5), _
.PatientID= n(31), _
.Title = n(13), _
.FirstName = n(9), _
.MiddleName = n(11), _
.LastName = n(7),
.StatusID = n(41) _
}
Patz.dump
Normally I would recommend finding a dedicated CSV parser (like this or this). However, I noticed this line in your question:
I need to determine for a given input, whether any line in the csv begins with that input.
That tells me that computer time spend parsing CSV data before this is determined is time wasted. You just need code to simply match text for text, and you can do that via a string comparison as easily as anything else.
Additionally, you mention that the data is sorted. This should allow you speed things up tremendously... but you need to be aware that to take advantage of this you will need to write your own code to make seek calls on low-level file streams. This will be by far your best performing result, but it will also by far require the most initial work and maintenance.
I recommend an engineering based approach, where you set a performance goal, build something relatively simple, and measure the results against that goal. In particular, start with the 2nd link I posted above. The CSV reader there will only load one record into memory at a time, so it should perform reasonably well, and it's easy to get started with. Build something that uses that reader, and measure the results. If they meet your goal, then stop there.
If they don't meet your goal, adapt the code from the link so that as you read each line you first do a string comparison (before bothering to parse the csv data), and only do the work to parse csv for the lines that match. This should perform better, but only do the work if the first option does not meet your goals. When this is ready, measure the performance again.
Finally, if you still don't meet the performance goal, we're into the territory of writing low-level code to do a binary search on your file stream using seek calls. This is likely the best you'll be able to do, performance-wise, but it will be very messy and bug-prone code to write, and so you only want to go here if you absolutely do not meet your goals from earlier steps.
Remember, performance is a feature, and just like any other feature you need to evaluate how you build for that feature relative to real design goals. "As fast as possible" is not a reasonable design goal. Something like "respond to a user search within .25 seconds" is a real design goal, and if the simpler but slower code still meets that goal, you need to stop there.

Replace Long list Words in a big Text File

i need a fast method to work with big text file
i have 2 files,
a big text file (~20Gb)
and an another text file that contain ~12 million list of Combo words
i want find all combo words in the first text file and replace it with an another Combo word (combo word with underline)
example "Computer Information" >Replace With> "Computer_Information"
i use this code, but performance is very poor (i test in Hp G7 Server With 16Gb Ram and 16 Core)
public partial class Form1 : Form
{
HashSet<string> wordlist = new HashSet<string>();
private void loadComboWords()
{
using (StreamReader ff = new StreamReader(txtComboWords.Text))
{
string line;
while ((line = ff.ReadLine()) != null)
{
wordlist.Add(line);
}
}
}
private void replacewords(ref string str)
{
foreach (string wd in wordlist)
{
// ReplaceEx(ref str,wd,wd.Replace(" ","_"));
if (str.IndexOf(wd) > -1)
str.Replace(wd, wd.Replace(" ", "_"));
}
}
private void button3_Click(object sender, EventArgs e)
{
string line;
using (StreamReader fread = new StreamReader(txtFirstFile.Text))
{
string writefile = Path.GetFullPath(txtFirstFile.Text) + Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt";
StreamWriter sw = new StreamWriter(writefile);
long intPercent;
label3.Text = "initialing";
loadComboWords();
while ((line = fread.ReadLine()) != null)
{
replacewords(ref line);
sw.WriteLine(line);
intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
Application.DoEvents();
label3.Text = intPercent.ToString();
}
sw.Close();
fread.Close();
label3.Text = "Finished";
}
}
}
any idea to do this job in reasonable time
Thanks
At first glance the approach you've taken looks fine - it should work OK, and there's nothing obvious that will cause e.g. lots of garbage collection.
The main thing I think is that you'll only be using one of those sixteen cores: there's nothing in place to share the load across the other fifteen.
I think the easiest way to do this is to split the large 20Gb file into sixteen chunks, then analyse each of the chunks together, then merge the chunks back together again. The extra time taken splitting and reassembling the file should be minimal compared to the ~16 times gain involved in scanning these sixteen chunks together.
In outline, one way to do this might be:
private List<string> SplitFileIntoChunks(string baseFile)
{
// Split the file into chunks, and return a list of the filenames.
}
private void AnalyseChunk(string filename)
{
// Analyses the file and performs replacements,
// perhaps writing to the same filename with a different
// file extension
}
private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
{
// Combines the rewritten chunks created by AnalyseChunk back into
// one large file, outputFile.
}
public void AnalyseFile(string inputFile, string outputFile)
{
List<string> splitFileNames = SplitFileIntoChunks(inputFile);
var tasks = new List<Task>();
foreach (string chunkName in splitFileNames)
{
var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
tasks.Add(task);
}
Task.WaitAll(tasks.ToArray());
CreateOutputFileFromChunks(outputFile, splitFileNames);
}
One tiny nit: move the calculation of the length of the stream out of the loop, you only need to get that once.
EDIT: also, include #Pavel Gatilov's idea to invert the logic of the inner loop and search for each word in the line in the 12 million list.
Several ideas:
I think it will be more efficient to split each line into words and look if each of several words appears in your word list. 10 lookups in a hashset is better than millions of searches of a substring. If you have composite keywords, make appropriate indexes: one that contains all single words that occur in the real keywords and another that contains all the real keywords.
Perhaps, loading strings into StringBuilder is better for replacing.
Update progress after, say 10000 lines processed, not after each one.
Process in background threads. It won't make it much faster, but the app will be responsible.
Parallelize the code, as Jeremy has suggested.
UPDATE
Here is a sample code that demonstrates the by-word index idea:
static void ReplaceWords()
{
string inputFileName = null;
string outputFileName = null;
// this dictionary maps each single word that can be found
// in any keyphrase to a list of the keyphrases that contain it.
IDictionary<string, IList<string>> singleWordMap = null;
using (var source = new StreamReader(inputFileName))
{
using (var target = new StreamWriter(outputFileName))
{
string line;
while ((line = source.ReadLine()) != null)
{
// first, we split each line into a single word - a unit of search
var singleWords = SplitIntoWords(line);
var result = new StringBuilder(line);
// for each single word in the line
foreach (var singleWord in singleWords)
{
// check if the word exists in any keyphrase we should replace
// and if so, get the list of the related original keyphrases
IList<string> interestingKeyPhrases;
if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
continue;
Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);
// then process each of the keyphrases
foreach (var interestingKeyphrase in interestingKeyPhrases)
{
// and replace it in the processed line if it exists
result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
}
}
// now, save the processed line
target.WriteLine(result);
}
}
}
}
private static string GetTargetValue(string interestingKeyword)
{
throw new NotImplementedException();
}
static IEnumerable<string> SplitIntoWords(string keyphrase)
{
throw new NotImplementedException();
}
The code shows the basic ideas:
We split both keyphrases and processed lines into equivalent units which may be efficiently compared: the words.
We store a dictionary that for any word quickly gives us references to all keyphrases that contain the word.
Then we apply your original logic. However, we do not do it for all 12 mln keyphrases, but rather for a very small subset of keyphrases that have at least a single-word intersection with the processed line.
I'll leave the rest of the implementation to you.
The code however has several issues:
The SplitIntoWords must actually normalize the words to some canonical form. It depends on the required logic. In the simplest case you'll probably be fine with whitespace-character splitting and lowercasing. But it may happen that you'll need a morphological matching - that would be harder (it's very close to full-text search tasks).
For the sake of the speed, it's likely to be better if the GetTargetValue method was called once for each keyphrase before processing the input.
If a lot of your keyphrases have coinciding words, you'll still have a signigicant amount of extra work. In that case you'll need to keep the positions of keywords in the keyphrases in order to use word distance calculation to exclude irrelevant keyphrases while processing an input line.
Also, I'm not sure if StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out the truth.
It's a sample after all. The design is not very good. I'd consider extracting some classes with consistent interfaces (e.g. KeywordsIndex).

Categories