I'm trying to read a huge log file in C#: approximately 300 MB of raw text data. I've been testing my program on smaller files of around 1 MB, storing all log messages in a string[] array and searching with Contains.
However, that is too slow and takes up too much memory; I will never be able to process the 300 MB log file that way. I need a way to grep the file: quickly filter through it, find the useful data, and print the line of log information corresponding to the search.
The big question is scale. I think 300 MB will be my maximum, but my program needs to handle it. What functions, data structures, and search approaches can I use that will scale well in speed and efficiency for reading a log file that big?
File.ReadLines is probably your best bet as it gives you an IEnumerable of lines of the text file and reads them lazily as you iterate over the IEnumerable. You can then use whatever method for searching the line you'd like to use (Regex, Contains, etc) and do something with it. My example below spawns a thread to search the line and output it to the console, but you can do just about anything. Of course, TEST, TEST, TEST on large files to see your performance mileage. I imagine if each individual thread spawned below takes too long, you can run into a thread limit.
IEnumerable<string> lines = File.ReadLines("myLargeFile.txt");
foreach (string line in lines)
{
    string lineInt = line;   // copy the loop variable for the closure
    (new Thread(() =>
    {
        if (lineInt.Contains(keyword))
        {
            Console.WriteLine(lineInt);
        }
    })).Start();
}
EDIT: Through my own testing, this is obviously faster:
foreach (string lineInt in File.ReadLines("myLargeFile.txt").Where(l => l.Contains(keyword)))
{
    Console.WriteLine(lineInt);
}
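If you need a pattern rather than a plain substring, the same streaming approach works with a compiled Regex. A minimal sketch, assuming the pattern ERROR \d+ is just a stand-in for whatever you actually search for (requires System.Text.RegularExpressions):

Regex pattern = new Regex(@"ERROR \d+", RegexOptions.Compiled);

foreach (string line in File.ReadLines("myLargeFile.txt"))
{
    // Only one line is held in memory at a time; matching lines are printed as found.
    if (pattern.IsMatch(line))
    {
        Console.WriteLine(line);
    }
}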
So I made a math-problem program that basically reads one number from a text file (the only number in that file) and replaces it with number+1 if the number is not a solution.
Now the issue is: if I just append the number on a new row using
sw.WriteLine(text);
the calculations are really fast, doing 100k+ numbers in a few seconds, but it keeps adding numbers to the text file without deleting the previous ones.
Alternatively I used
string[] lines = File.ReadAllLines("numbers.txt");
foreach (string line in lines)
{
lines[0] = Convert.ToString(biginta);
}
File.WriteAllLines("numbers.txt", lines);
but that made my program run considerably slower.
Is there a way I can replace text in a .txt file using an already open FileStream?
I'm new to C#, so my whole program is basically a Frankenstein of code.
I'm using a file to store the next number to run because I turn off my PC overnight.
Honestly, the quickest solution to this is the following: read the file once, do several (say 100) calculations without saving, and then store the current number back into the file.
Tune the interval so that you store the current state once every 5 seconds or so.
That still gives you a good restart point (at most 5 seconds of lost work) while reducing disk IO to the point where it no longer slows down the calculation.
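A minimal sketch of that pattern, assuming the number fits in a BigInteger and IsSolution() stands in for your own check (both names are placeholders; you'd need System.Numerics, System.Diagnostics and System.IO):

// Load the starting number once at program start.
BigInteger biginta = BigInteger.Parse(File.ReadAllText("numbers.txt"));

var saveTimer = Stopwatch.StartNew();
while (!IsSolution(biginta))               // IsSolution is a stand-in for your own check
{
    biginta += 1;

    // Checkpoint roughly every 5 seconds instead of after every number,
    // so disk IO no longer dominates the run time.
    if (saveTimer.ElapsedMilliseconds > 5000)
    {
        File.WriteAllText("numbers.txt", biginta.ToString());   // overwrite, don't append
        saveTimer.Restart();
    }
}
File.WriteAllText("numbers.txt", biginta.ToString());           // final save when done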
I'm trying to process large concatenated files and split them back into individual files. For anyone who wants to know, searchString contains the magic numbers that make up the file headers and is used to tell where each file begins.
This line
string dataString = String.Concat(dataByte.Select(b => b.ToString("x2")));
...results in an
OutOfMemoryException.
I have seen a few different ways that large files can be processed, but none of those approaches seem to handle the data the way I need this program to. Is there any way to correct the Out of Memory exception without changing anything inside the foreach loop?
byte[] dataByte = File.ReadAllBytes(pathString);
string dataString = String.Concat(dataByte.Select(b => b.ToString("x2")));
string[] LineArray = Regex.Split(dataString, searchString);
foreach (string LineResult in LineArray)
{
//The string processing operations go here.
//The individual files are created from the output.
}
Strings are an awful way to handle large quantities of data. Since they are immutable, the code as you have it will create a ton of copies of the data as it processes each piece.
Try to work with bytes for as long as you can, and deal in chunks rather than the whole thing at once.
For example, read bytes from the file until you see the magic bytes (built from the magic string), then process that piece, save it, and continue on. Arrays must be contiguous, so limit their size to avoid running out of large blocks of memory even when plenty is available overall.
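A rough sketch along those lines, staying in bytes the whole time. It assumes searchString is the hex form of the header (an assumption based on how it was used against the hex string), converts it back to raw bytes, scans for each occurrence, and writes every segment to its own file. The output naming is a placeholder, and for files too big to hold in one array you would read and scan in chunks instead:

// Turn the hex search string back into raw bytes, e.g. "FFD8FF" -> { 0xFF, 0xD8, 0xFF }.
byte[] magicBytes = Enumerable.Range(0, searchString.Length / 2)
                              .Select(i => Convert.ToByte(searchString.Substring(i * 2, 2), 16))
                              .ToArray();

byte[] data = File.ReadAllBytes(pathString);   // one copy of the raw bytes, no hex string

// Find the offset of every file header.
var starts = new List<int>();
for (int i = 0; i <= data.Length - magicBytes.Length; i++)
{
    bool match = true;
    for (int j = 0; j < magicBytes.Length; j++)
    {
        if (data[i + j] != magicBytes[j]) { match = false; break; }
    }
    if (match) starts.Add(i);
}

// Each segment runs from one header to the next (or to the end of the data).
for (int k = 0; k < starts.Count; k++)
{
    int begin = starts[k];
    int end = (k + 1 < starts.Count) ? starts[k + 1] : data.Length;
    byte[] segment = new byte[end - begin];
    Array.Copy(data, begin, segment, 0, segment.Length);
    File.WriteAllBytes("part" + k + ".bin", segment);   // placeholder naming scheme
}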
I have a console app that reads in a large text file with 40k+ lines; each line is a key that I use in a search, and the results are written to an output file. The issue is that I leave this console app running for a while until it suddenly closes, and the process memory usage is really high: it was sitting at 1.6 GB when I last saw it crash.
I looked around and didn't find many answers. I did try gcAllowVeryLargeObjects, but that feels like I'm just dodging the problem.
Below is a snippet from my Main() where I write out to the file. I can't seem to understand why the memory usage gets so high. I flush the writer after every write (could it be because I'm keeping the file open for such a long period of time?).
TextWriter writer = new StreamWriter("output.csv", false);
foreach (var item in list)
{
    Console.WriteLine("{0}/{1}", count, numofitem);
    var result = TableServiceContext.Read(item.id);
    if (result != null)
    {
        writer.WriteLine(String.Join(",", result.id,
                                          result.code,
                                          result.hash));
    }
    count++;
    writer.Flush();
}
writer.Close();
Edit: I have 32 GB of RAM on my computer, so I'm sure the machine itself isn't running out of memory.
Edit 2: I changed the name of the repository class, as it was misleading.
If the average line length is 1 KB, then 40K lines is 40 MB, which is nothing. That's why I'm pretty sure the problem is in your repository class. If it is an EF repository, try recreating the DbContext for each line.
If you want to tune the program, you can use the following method: put timestamps into the console output (the Stopwatch class works well for this) and try recreating your repository every 10, 100, or N lines. Then, looking at the timestamps, you can find the optimal N to use.
var timer = Stopwatch.StartNew();
...
Console.WriteLine(timer.ElapsedMilliseconds);
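Combining both ideas might look roughly like this. CreateRepository() is a hypothetical factory for whatever your repository class actually is, the Dispose call assumes the repository is disposable (an EF DbContext is), and N = 100 is just a starting point to tune against the timestamps:

var timer = Stopwatch.StartNew();
const int N = 100;                              // recreate the repository every N lines; tune this
int processed = 0;
var repository = CreateRepository();            // hypothetical factory for your repository class

foreach (var item in list)
{
    if (processed > 0 && processed % N == 0)
    {
        repository.Dispose();                   // assumption: the repository is disposable
        repository = CreateRepository();        // fresh instance, nothing accumulated
        Console.WriteLine("{0} lines in {1} ms", processed, timer.ElapsedMilliseconds);
    }

    var result = repository.Read(item.id);      // same Read call as before, now on the fresh instance
    // ... write result to the CSV exactly as in the original loop ...
    processed++;
}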
From looking at the code, I think the problem isn't the StreamWriter but some memory leak in your repository. Suggestions to check:
Replace the repository with a dummy, e.g. a DummyRepository class with just the three properties id, code, hash (a minimal sketch follows after this list).
Likewise create a long "list", e.g. 40k small entries.
Run your program and see if it still consumes memory (I am pretty sure it will not).
Then, step by step, add back your original parts and see which step causes the memory leak.
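A minimal sketch of such a dummy, mirroring the three fields the question's loop uses (all names here are placeholders):

// Stand-in result type with just the fields the writer needs.
class Result
{
    public string id { get; set; }
    public string code { get; set; }
    public string hash { get; set; }
}

// Dummy repository: same Read shape, no database, no retained state.
class DummyRepository
{
    public Result Read(string id)
    {
        return new Result { id = id, code = "dummy-code", hash = "dummy-hash" };
    }
}

// Feed it a long fake list, e.g. 40k small keys, and run the original loop unchanged.
var list = Enumerable.Range(0, 40000).Select(i => new { id = "key" + i }).ToList();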
I'm trying to process a large number of text files via Parallel.ForEach, adding the processed data to a BlockingCollection.
The problem is that I want the Task taskWriteMergedFile to consume the collection and write the lines to the result file at least every 800000 lines.
I guess I can't test the collection size within the iteration because it is parallelized, so I created the Task.
Can I convert the while(true) loop in the task to an EventWaitHandle in this case?
const int MAX_SIZE = 1000000;
static BlockingCollection<string> mergeData;
mergeData = new BlockingCollection<string>(new ConcurrentBag<string>(), MAX_SIZE);
string[] FilePaths = Directory.GetFiles("somepath");
var taskWriteMergedFile = new Task(() =>
{
while ( true )
{
if ( mergeData.Count > 800000)
{
String.Join(System.Environment.NewLine, mergeData.GetConsumingEnumerable());
//Write to file
}
Thread.Sleep(10000);
}
}, TaskCreationOptions.LongRunning);
taskWriteMergedFile.Start();
Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
mergeData.CompleteAdding();
You probably don't want to do it that way. Instead, have your task write each line to the file as it's received. If you want to limit each file to 800,000 lines, then after the 800,000th line is written, close the current file and open a new one.
Come to think of it, what you have can't work, because GetConsumingEnumerable() won't stop until the collection is marked as complete for adding. What would happen is that the task would go through the sleep loop until there were 800,000 items in the queue, then block on the String.Join until the main thread calls CompleteAdding. With enough data, you'd run out of memory.
Also, unless you have a very good reason, you shouldn't use ConcurrentBag here. Just use the default for BlockingCollection, which is ConcurrentQueue. ConcurrentBag is a rather special purpose data structure that won't perform as well as ConcurrentQueue.
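In practice that just means dropping the ConcurrentBag from the declaration in the question:

// With no backing collection specified, BlockingCollection<T> uses a
// ConcurrentQueue<T>, which preserves FIFO order and performs better here.
mergeData = new BlockingCollection<string>(MAX_SIZE);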
So your task becomes:
var taskWriteMergedFile = new Task(() =>
{
    int recordCount = 0;
    foreach (var line in mergeData.GetConsumingEnumerable())
    {
        outputFile.WriteLine(line);
        ++recordCount;
        if (recordCount == 800000)
        {
            // If you want to do something after 800,000 lines, do it here
            // and then reset the record count.
            recordCount = 0;
        }
    }
}, TaskCreationOptions.LongRunning);
That assumes, of course, that you've opened the output file somewhere else. It's probably better to open the output at the start of the task, and close it after the foreach has exited.
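For example, wrapping the loop in a using block inside the task keeps the writer's lifetime tied to the consuming loop (the file name here is a placeholder):

var taskWriteMergedFile = new Task(() =>
{
    // The writer is opened when the task starts and disposed as soon as
    // GetConsumingEnumerable finishes, i.e. after CompleteAdding is called.
    using (var outputFile = new StreamWriter("merged.txt"))
    {
        foreach (var line in mergeData.GetConsumingEnumerable())
        {
            outputFile.WriteLine(line);
        }
    }
}, TaskCreationOptions.LongRunning);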
On another note, you probably don't want your producer loop to be parallel. You have:
Parallel.ForEach(FilePaths, FilePath => AddToDataPool(FilePath));
I don't know for sure what AddToDataPool is doing, but if it's reading a file and writing the data to the collection, you have a couple of problems. First, the disk drive can only do one thing at a time, so it ends up reading part of one file, then part of another, then part of another, and so on. In order to read each chunk of the next file, it has to seek the head to the proper position. Disk head seeks are incredibly expensive: 5 milliseconds or more, an eternity in CPU time. Unless you're doing heavy-duty processing that takes much longer than reading the file, you're almost always better off processing one file at a time. Unless you can guarantee that the input files are on separate physical disks...
The second potential problem is that with multiple threads running, you can't guarantee the order in which things are written to the collection. That might not be a problem, of course, but if you expect all of the data from a single file to be grouped together in the output, that's not going to happen with multiple threads each writing multiple lines to the collection.
Just something to keep in mind.
I want to write a fast multi-threaded program in C# that reads a file.
So the file must be split into parts, and each part processed in a different thread. For example:
Line1
Line2
Line3
Line4
must be split into 4 lines like this:
Line1 => thread 1
Line2 => thread 2
Line3 => thread 3
Line4 = > thread 4
I used StreamReader.ReadLine(), but it can't read a specific line.
Comment: it's necessary to speed up the program, so I want to read the file in separate threads.
Unless you're using fixed-length lines, this isn't possible.
Why? Because in order to determine where the "lines" split, you need to find the newline characters... which means you need to read the file first.
Now, if you simply want to perform some extra "processing" after you read in each line - that is possible and relatively straight-forward using a ThreadPool.
You should read the file in a single thread - but then spawn the processing of each line to a different thread, e.g. by adding it to a producer/consumer queue.
Even if you could seek to a specific line in a text file (which in general you can't) you really don't want the disk thrashing around - that'll only slow things down. The fastest way to get the data off the disk is to read it sequentially. By all means defer everything about handling the line beyond "decoding the binary data to text" to other threads, but you really don't want the IO to be in multiple threads.
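A sketch of that shape: one thread reads the file sequentially and feeds a bounded producer/consumer queue, while a handful of workers do the per-line work (ProcessLine, the file name, and the bound of 10000 are placeholders):

var lines = new BlockingCollection<string>(boundedCapacity: 10000);

// Workers: pull lines off the queue and do the CPU-bound part.
Task[] workers = Enumerable.Range(0, Environment.ProcessorCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var line in lines.GetConsumingEnumerable())
        {
            ProcessLine(line);              // your per-line processing goes here
        }
    }))
    .ToArray();

// Single reader: the disk is read sequentially, exactly once.
foreach (var line in File.ReadLines("input.txt"))
{
    lines.Add(line);                        // blocks if the workers fall behind
}
lines.CompleteAdding();                     // lets GetConsumingEnumerable finish

Task.WaitAll(workers);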
AFAIK .NET doesn't support parallel stream reading. If you want to process every line, you can use File.ReadAllLines, which returns an array of strings, and then use PLINQ.
var result = File.ReadAllLines("path")
                 .AsParallel()
                 .Select(s => DoSthWithString(s))
                 .ToList();
You're not going to be able to speed up the actual reading because you're going to have tremendous locking issues keeping everything straight.
Since a text file is unstructured, i.e. each line can be a different length, you have no choice but to read the lines one after another.
Now, what you can do is process those lines on different threads, but the actual reading, keep that to one thread.
But, before you do that, are you sure you even have to do this? Is this a bottleneck? If not, fix the bottleneck first and see how far you get.
Your StreamReader is connected to an underlying Stream. Using that stream you can Seek to a particular byte location.
Like others have said, this probably isn't a good idea, but it can be done.
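For completeness, a minimal sketch of seeking; note that Seek positions you at a byte offset, not at a line, so you will usually land in the middle of a line (the offset and file name are arbitrary):

using (var stream = File.OpenRead("input.txt"))
using (var reader = new StreamReader(stream))
{
    stream.Seek(1024, SeekOrigin.Begin);    // jump to byte 1024, wherever that falls
    reader.DiscardBufferedData();           // resync the reader with the moved stream
    string partial = reader.ReadLine();     // almost certainly a partial line
}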
I would split the file beforehand. Say the file is 1000 lines: split it into 10 files of 100 lines each, and have a thread process each file.
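A simple pre-splitting sketch along those lines (the chunk size and file names are placeholders):

const int linesPerChunk = 100;
int chunkIndex = 0, lineCount = 0;
StreamWriter writer = null;

foreach (var line in File.ReadLines("big.txt"))
{
    // Start a new chunk file every linesPerChunk lines.
    if (lineCount % linesPerChunk == 0)
    {
        writer?.Dispose();
        writer = new StreamWriter("chunk" + chunkIndex++ + ".txt");
    }
    writer.WriteLine(line);
    lineCount++;
}
writer?.Dispose();

// Afterwards, start one thread/Task per chunk file and process them independently.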