I'm using C# and I get a System.OutOfMemoryException after I read in 50,000 records. What is best practice for handling such large datasets? Will paging help?
I might recommend creating the MDB file and using a DataReader to stream the records into the MDB rather than trying to read in and cache the entire set of data locally. With a DataReader, the process is more manual, but you only get one record at a time so you won't fill up your memory.
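Something along these lines, as a rough sketch (the connection strings, provider, and table/column names are placeholders, so adjust them for your schema):

using System.Data.OleDb;
using System.Data.SqlClient;

// Stream rows from the source into an existing MDB table one record at a time,
// so only the current record is ever held in memory.
using (var source = new SqlConnection("Data Source=.;Initial Catalog=MyDb;Integrated Security=True"))
using (var target = new OleDbConnection(@"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Export\output.mdb"))
{
    source.Open();
    target.Open();

    var select = new SqlCommand("SELECT Id, Name FROM SourceTable", source);
    var insert = new OleDbCommand("INSERT INTO TargetTable (Id, Name) VALUES (?, ?)", target);
    insert.Parameters.Add("@Id", OleDbType.Integer);
    insert.Parameters.Add("@Name", OleDbType.VarWChar, 255);

    using (SqlDataReader reader = select.ExecuteReader())
    {
        while (reader.Read())
        {
            insert.Parameters[0].Value = reader.GetInt32(0);
            insert.Parameters[1].Value = reader.GetString(1);
            insert.ExecuteNonQuery();
        }
    }
}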
You still shouldn't read everything in at once. Read in chunks, then write the chunk out to the mdb file, then read another chunk and add that to the file. Reading in 50,000 records at once is just asking for trouble.
Obviously, you can't read all the data into memory before creating the MDB file, otherwise you wouldn't be getting an out-of-memory exception. :-)
You have two options:
- partitioning - read the data in smaller chunks using filtering
- virtualizing - split the data in pages and load only the current page
In any case, you have to create the MDB file first and then transfer the data into it in chunks.
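A rough sketch of the partitioning option, assuming the source is SQL Server 2012 or later so OFFSET/FETCH paging is available; the connection string and WriteRowToMdb (whatever you use to append a row to the MDB) are placeholders:

using System.Data.SqlClient;

const int pageSize = 1000;
int offset = 0;

using (var source = new SqlConnection(sourceConnectionString))
{
    source.Open();
    while (true)
    {
        var cmd = new SqlCommand(
            "SELECT Id, Name FROM SourceTable " +
            "ORDER BY Id " +
            "OFFSET @offset ROWS FETCH NEXT @pageSize ROWS ONLY", source);
        cmd.Parameters.AddWithValue("@offset", offset);
        cmd.Parameters.AddWithValue("@pageSize", pageSize);

        int rowsInPage = 0;
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                rowsInPage++;
                WriteRowToMdb(reader);   // placeholder: append the current row to the MDB
            }
        }

        if (rowsInPage < pageSize)
            break;   // last page reached
        offset += pageSize;
    }
}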
If you're using XML, just read a few nodes at a time. If you're using some other format, just read a few lines (or whatever) at a time. Don't load the entire thing into memory before you start working on it.
I would suggest using a generator:
"...instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator."
The Wikipedia article also has a few good examples.
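In C# the equivalent is an iterator method built around yield return. A minimal sketch, assuming the records come one per line from a text file (the path and the Process handler are made up):

using System.Collections.Generic;
using System.IO;

// Yields one record at a time instead of building a full list in memory.
static IEnumerable<string> ReadRecords(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;   // the caller gets each record as it is read
        }
    }
}

// Usage: only one record is held in memory at a time.
foreach (string record in ReadRecords(@"C:\Data\records.txt"))
{
    Process(record);   // hypothetical per-record handler
}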
I am trying to download tables from SQL Server, write each downloaded table to a CSV file and then gzip it.
My problem now is that the table is so large (over 1 million lines; I was using a Python pandas DataFrame to do it) that it gives a memory error.
Is there a way to do this lazily in C# so that the memory usage is minimized, and then I can run 2-3 processes in parallel for this task?
Yes and yes.
You have to retrieve the data in a loop to ensure that you are not holding all of those million records in memory. Use a StreamWriter to write lines to the file instead of holding them in memory: OutOfMemory exception thrown while writing large text file
Write your software so that the writing method takes the table name as a parameter. Then you can run all the tables in parallel if you want to. Use a separate file per database table for faster performance. If you want to run each table's export as its own thread, use Thread.Start: https://msdn.microsoft.com/en-us/library/6x4c42hc(v=vs.110).aspx
Or make the writing method async and call it with the await keyword. https://msdn.microsoft.com/en-us/library/hh193364(v=vs.110).aspx
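Putting those pieces together, a rough sketch could look like this (the connection string, output folder, and the naive CSV formatting are placeholders; real data may need proper quoting/escaping):

using System.Data.SqlClient;
using System.IO;
using System.IO.Compression;

// Streams one table straight into a gzipped CSV without holding the rows in memory.
static void ExportTableToGzipCsv(string connectionString, string tableName, string outputFolder)
{
    string outputPath = Path.Combine(outputFolder, tableName + ".csv.gz");

    using (var connection = new SqlConnection(connectionString))
    using (var fileStream = File.Create(outputPath))
    using (var gzip = new GZipStream(fileStream, CompressionMode.Compress))
    using (var writer = new StreamWriter(gzip))
    {
        connection.Open();
        var command = new SqlCommand("SELECT * FROM [" + tableName + "]", connection);

        using (var reader = command.ExecuteReader())
        {
            // header row
            var columns = new string[reader.FieldCount];
            for (int i = 0; i < reader.FieldCount; i++)
                columns[i] = reader.GetName(i);
            writer.WriteLine(string.Join(",", columns));

            // data rows, one at a time
            var values = new object[reader.FieldCount];
            while (reader.Read())
            {
                reader.GetValues(values);
                writer.WriteLine(string.Join(",", values));
            }
        }
    }
}

Several tables can then run side by side, e.g. Parallel.ForEach(tableNames, t => ExportTableToGzipCsv(connStr, t, @"C:\Exports"));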
I am in the design phase of a simple tool I want to write where I need to read large log files. To give you some context I will first explain a bit about it.
The log files I need to read consist of log entries which always follow this 3-line format:
statistics : <some data which is more or less of the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600MB in size, which means a lot of log entries. Now these log entries can have a relation to each other; because of this I need to start reading the file from the end towards the beginning. These relationships can be deduced from the statistics line.
I want to use the info in the statistics line to build up a datagrid which the users can use to search through the data and do some filtering operations. Now I don't want to load the request / response lines into memory until the user actually needs them. In addition I want to keep the memory load small by limiting the maximum number of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when I parse the file for the first time, creating an index of statistics. Then, when the user clicks on some statistic (which is an element of a log entry), I read the request / response from the file using this offset. I can then hold it in some memory pool which makes sure there are not too many loaded request / response entries (see the earlier requirement).
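Roughly what I have in mind for that first pass, just as a sketch (names are made up, and it assumes UTF-8 without a BOM and consistent line endings):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Only the statistics lines and the byte offsets of their entries are kept in memory;
// the request and response stay on disk until the user asks for them.
class LogEntryIndex
{
    public string Statistics;   // the parsed statistics line
    public long EntryOffset;    // byte offset where this 3-line entry starts
}

static List<LogEntryIndex> BuildIndex(string logPath)
{
    var index = new List<LogEntryIndex>();
    long offset = 0;

    // StreamReader buffers internally, so the offset is tracked from the encoded
    // byte length of each line plus the newline, not from stream.Position.
    using (var reader = new StreamReader(logPath, Encoding.UTF8))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.StartsWith("statistics :"))
                index.Add(new LogEntryIndex { Statistics = line, EntryOffset = offset });

            offset += Encoding.UTF8.GetByteCount(line) + Environment.NewLine.Length;
        }
    }
    return index;
}

Reading one request/response on demand would then be a FileStream.Seek to EntryOffset followed by reading the next two lines.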
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot, or it could be just a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario where you should use a memory-mapped file because there could be a lot of read operations? Or is it better to use a plain FileStream? BTW, I don't need write operations on the log file at this stage, but there could be in the future!
If you have other tips or see flaws in my thinking so far, please let me know as well. I am open to any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries they have made. At this moment the format of this export is unknown (binary, file db, textfile). This export can be imported by the application itself, which then only shows the selection made by the user.
You're talking about some stored data that has some defined relationships between actual entries... Maybe it's just me, but this scenario just calls for some kind of relational database. I'd suggest considering some portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use a db instead, you can query exactly the data you need, without ever needing to handle large files like this.
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zero-copy, depending on your web server). If you have many small "request" chunks and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go db or all memory.
Why are you so concerned about saving memory, when 600 MB is not that much? Are you going to be running on machines with less than 2 GB of memory?
Load everything into a dictionary with the statistics line as the key and, as the value, a class with two properties: request and response. Dictionary lookups are fast, and LINQ is powerful and fast.
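A rough sketch of what I mean (the names are just illustrative):

using System.Collections.Generic;
using System.Linq;

class Entry
{
    public string Request;
    public string Response;
}

class LogStore
{
    // Keyed on the statistics line; lookups are O(1).
    private readonly Dictionary<string, Entry> entries = new Dictionary<string, Entry>();

    public void Add(string statistics, string request, string response)
    {
        entries[statistics] = new Entry { Request = request, Response = response };
    }

    // Filtering for the grid is then a LINQ query over the keys.
    public List<string> FindStatistics(string term)
    {
        return entries.Keys.Where(k => k.Contains(term)).ToList();
    }

    public Entry Get(string statistics)
    {
        return entries[statistics];
    }
}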
My application currently implements a FileSystemWatcher object to monitor a directory (C:\Incoming) for the creation of a file (Input.xml).
I'm currently using a Streamreader object to read the file into my application, however I'm concerned about performance considering that the data will be used to perform operations in a SQL Server database.
What would be the FASTEST way to read the file into memory (or am I already using it)?
You are using the standard approach to the task.
If you need to load the data (large data, that is) into a database, the bottleneck will be between your code and the database, unless you use a BULK INSERT technique, as opposed to inserting rows one by one. The details of that depend on the particular database server.
This is however no concern if the files are relatively small in which case the load will be more evenly distributed. Even then I would not care much about the speed of disk access.
Make sure that the file is completely written before you start reading it. For example, try opening it for exclusive reading first. It also helps a bit if the file is made accessible to you through a rename operation as opposed to a create operation, especially if you expect many files to be arriving, because your server side then does not have to busy-wait until the file is completely written. This basically means that the file should first be written under a file name for which you are NOT watching.
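Something along these lines works for that check, as a sketch (the retry count and delay are arbitrary):

using System.IO;
using System.Threading;

// Waits until the file can be opened exclusively, i.e. the writer has finished and released it.
static FileStream OpenWhenReady(string path, int maxAttempts)
{
    for (int attempt = 0; attempt < maxAttempts; attempt++)
    {
        try
        {
            // FileShare.None fails while another process still has the file open for writing.
            return new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.None);
        }
        catch (IOException)
        {
            Thread.Sleep(500);   // the writer is not done yet, try again shortly
        }
    }
    throw new IOException("File was not released in time: " + path);
}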
.NET has the SqlBulkCopy class too
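A minimal sketch of how that could look (the destination table name and batch size are placeholders; add column mappings if the source and destination columns don't line up):

using System.Data;
using System.Data.SqlClient;

// Pushes the rows to SQL Server in one bulk operation instead of row-by-row inserts.
static void BulkLoad(DataTable rows, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var bulk = new SqlBulkCopy(connection))
        {
            bulk.DestinationTableName = "dbo.IncomingData";   // placeholder table name
            bulk.BatchSize = 5000;                            // commit in batches rather than one huge transaction
            bulk.WriteToServer(rows);                         // also accepts an IDataReader for streaming
        }
    }
}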
I have a file which contains a certain number of fixed length rows having some numbers. I need to read each row in order to get that number and process them and write to a file.
Since I need to read each row, as the number of rows increases it becomes time consuming.
Is there an efficient way of reading each row of the file? I'm using C#.
File.ReadLines (.NET 4.0+) is probably the most memory efficient way to do this.
It returns an IEnumerable<string> meaning that lines will get read lazily in a streaming fashion.
Previous versions do not have the streaming option available in this manner, but using StreamReader to read line by line would achieve the same.
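For example (the path and the Process handler are made up):

using System.IO;

// .NET 4.0+: File.ReadLines streams lazily, one line at a time.
foreach (string line in File.ReadLines(@"C:\Data\numbers.txt"))
{
    Process(int.Parse(line.Trim()));   // Process is a hypothetical per-row handler
}

// Pre-.NET 4.0 equivalent using StreamReader:
using (var reader = new StreamReader(@"C:\Data\numbers.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        Process(int.Parse(line.Trim()));
    }
}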
Reading all rows from a file is always at least O(n). When file size starts becoming an issue, it's probably a good time to look at creating a database for the information instead of flat files.
Not sure this is the most efficient, but it works well for me:
http://msdn.microsoft.com/en-us/library/system.io.fileinfo.aspx
//Declare a new file and give it the path to your file
FileInfo fi1 = new FileInfo(path);

//Open the file and read the text
using (StreamReader sr = fi1.OpenText())
{
    string s;
    // Loop through each line
    while ((s = sr.ReadLine()) != null)
    {
        // Here is where you handle your row in the file
        Console.WriteLine(s);
    }
}
No matter which operating system you're using, there will be several layers between your code and the actual storage mechanism. Hard drives and tape drives store files in blocks, which these days are usually around 4K each. If you want to read one byte, the device will still read the entire block into memory -- it's just faster that way. The device and the OS also may each keep a cache of blocks. So there's not much you can do to change the standard (highly optimized) file reading behavior; just read the file as you need it and let the system take care of the rest.
If the time to process the file is becoming a problem, two options that might help are:
Try to arrange to use shorter files. It sounds like you're processing log files or something -- running your program more frequently might help to at least give the appearance of better performance.
Change the way the data is stored. Again, I understand that the file comes from some external source, but perhaps you can arrange for a job to run that periodically converts the raw file to something that you can read more quickly.
Good luck.
Was wondering if anyone had any favourite methods/useful libraries for processing a tab-delimited text file? This file is going to have on average 30,000 - 50,000 rows in it. I just need to read through each row and throw it into a database. However, I'd need to temporarily store all the data, the reason being that if the table holding the data gets to more than 1,000,00 rows, I'll need to create a new table and put the data in there. The code will be run in a Windows service so I'm not worried about processing time.
Was thinking about just doing a standard while(sr.ReadLine()) ... any suggestions?
Cheers,
Sean.
FileHelpers
This library is very flexible and fast. I never get tired of recommending it. It defaults to ',' as a delimiter, but you can change it to '\t' easily.
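A rough example of what that looks like for a tab-delimited file (the field names and types are made up, so double-check against the FileHelpers docs):

using FileHelpers;

// Record layout for the tab-delimited file; adjust the fields to match your columns.
[DelimitedRecord("\t")]
public class ImportRow
{
    public int Id;
    public string Name;
    public decimal Amount;
}

// Reads the whole file into typed records.
var engine = new FileHelperEngine<ImportRow>();
ImportRow[] rows = engine.ReadFile(@"C:\Imports\data.txt");

If I remember right, the library also has a FileHelperAsyncEngine for reading record by record instead of pulling the whole array in at once.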
I suspect "throwing it into a database" will take at least 1 order of magnitude longer than reading a line into a buffer, so you could pre-scan the data just to count the number of rows (without parsing them). Then make your database decisions. Then re-read the data doing the real work. With luck, the OS will have cached the file so it reads even quicker.
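As a sketch of that two-pass idea (the path and the row limit are placeholders):

using System.IO;
using System.Linq;

const int maxRowsPerTable = 1000000;   // placeholder for your table-size limit
string path = @"C:\Imports\data.txt";

// Pass 1: count rows without parsing them.
int incomingRows = File.ReadLines(path).Count();
// ... decide here whether a new table is needed based on maxRowsPerTable ...

// Pass 2: re-read and do the real work, line by line.
foreach (string line in File.ReadLines(path))
{
    string[] fields = line.Split('\t');
    // insert the fields into the chosen table
}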