Was wondering if anyone had any favourite methods / useful libraries for processing a tab-delimited text file? This file is going to have on average 30,000 - 50,000 rows in it. I just need to read through each row and throw it into a database. However, I'd need to temporarily store all the data, the reason being that if the table holding the data gets to more than 1,000,000 rows, I'll need to create a new table and put the data in there. The code will be run in a Windows service, so I'm not worried about processing time.
Was thinking about just doing a standard while(sr.ReadLine()) ... any suggestions?
Cheers,
Sean.
FileHelpers
This library is very flexible and fast. I never get tired of recommending it. It defaults to ',' as the delimiter, but you can change it to '\t' easily.
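For a tab-delimited file it boils down to a record class plus the generic engine. A minimal sketch, assuming FileHelpers 2.x; the record fields below are placeholders for whatever columns your file actually has:

using FileHelpers;

// One public field per column in the tab-delimited file; names are placeholders.
[DelimitedRecord("\t")]
public class ImportRow
{
    public string Field1;
    public string Field2;
    public string Field3;
}

// Somewhere in the service:
var engine = new FileHelperEngine<ImportRow>();
ImportRow[] rows = engine.ReadFile(@"C:\data\import.txt");
// rows.Length tells you up front whether you need to roll over to a new table.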
I suspect "throwing it into a database" will take at least 1 order of magnitude longer than reading a line into a buffer, so you could pre-scan the data just to count the number of rows (without parsing them). Then make your database decisions. Then re-read the data doing the real work. With luck, the OS will have cached the file so it reads even quicker.
Related
I am in the design phase of a simple tool I want to write where I need to read large log files. To give you some context, I will first explain a bit about it.
The log files I need to read consist of log entries which always follow this 3-line format:
statistics : <some data which is more or less the same length, about 100 chars>
request : <some xml string which can be small (10KB) or big (25MB) and anything in between>
response : <ditto>
The log files can be about 100-600 MB in size, which means a lot of log entries. Now these log entries can have a relation with each other; because of this I need to start reading the file from the end to the beginning. These relationships can be deduced from the statistics line.
I want to use the info in the statistics line to build up some datagrid which the users can use to search through the data and do some filtering operations. Now I don't want to load the request / response lines into memory until the user actually needs them. In addition, I want to keep the memory load small by limiting the maximum number of loaded request/response entries.
So I think I need to save the offsets of the statistics lines when I am parsing the file for the first time, creating an index of statistics. Then when the user clicks on some statistic, which is an element of a log entry, I read the request / response from the file using this offset. I can then hold it in some memory pool which makes sure there are not too many loaded request / response entries (see the earlier requirement).
The problem is that I don't know how often the user is going to need the request/response data. It could be a lot, it could be a few times. In addition, the log file could be loaded from a network share.
The question I have is:
Is this a scenario where you should use a memory-mapped file, given that there could be a lot of read operations? Or is it better to use a plain FileStream? BTW, I don't need write operations on the log file at this stage, but that could change in the future!
If you have other tips or see flaws in my thinking so far, please let me know as well. I am open to any approach.
Update:
To clarify some more:
The tool itself has to do the parsing when the user loads a log file from a drive or network share.
The tool will be written as WinForms application.
The user can export a selection of log entries. At this moment the format of this export is undecided (binary, file db, text file). This export can be imported by the application itself, which then only shows the selection made by the user.
You're talking about some stored data that has some defined relationships between actual entries... Maybe it's just me, but this scenario just calls for some kind of relational database. I'd suggest considering a portable db, like SQL Server CE for instance. It'll make your life much easier and provide exactly the functionality you need. If you use a db instead, you can query exactly the data you need, without ever needing to handle large files like this.
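A sketch of what the SQL Server CE route could look like; the file name, table and column names here are made up for illustration:

using System.Data.SqlServerCe;

static void CreateIndexDatabase()
{
    // Create the database file once, then a table for the statistics index.
    string connString = "Data Source=logindex.sdf";
    using (var engine = new SqlCeEngine(connString))
    {
        engine.CreateDatabase();
    }

    using (var conn = new SqlCeConnection(connString))
    {
        conn.Open();
        using (var cmd = new SqlCeCommand(
            "CREATE TABLE LogEntry (Id INT IDENTITY(1,1) PRIMARY KEY, " +
            "StatsLine NVARCHAR(200), RequestOffset BIGINT, ResponseOffset BIGINT)", conn))
        {
            cmd.ExecuteNonQuery();
        }
        // ...insert one row per log entry while parsing, then query/filter it with SQL...
    }
}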
If you're sending the request/response chunk over the network, the network send() time is likely to be so much greater than the difference between seek()/read() and using memmap that it won't matter. To really make this scale, a simple solution is to just break up the file into many files, one for each chunk you want to serve (since the "request" can be up to 25 MB). Then your HTTP server will send that chunk as efficiently as possible (perhaps even using zero-copy, depending on your web server). If you have many small "request" chunks, and only a few giant ones, you could break out only the ones past a certain threshold.
I don't disagree with the answer from walther. I would go db or all memory.
Why are you so concerned about saving memory, when 600 MB is not that much? Are you going to be running on machines with less than 2 GB of memory?
Load it into a dictionary with the statistics line as the key and, as the value, a class with two properties: request and response. Dictionary lookups are fast, and LINQ is powerful and fast.
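A minimal sketch of that all-in-memory approach (class and member names are just illustrative):

using System.Collections.Generic;
using System.Linq;

public class LogEntry
{
    public string Request;
    public string Response;
}

public class LogIndex
{
    // Keyed by the statistics line parsed from the file.
    private readonly Dictionary<string, LogEntry> entries =
        new Dictionary<string, LogEntry>();

    public void Add(string statisticsLine, string request, string response)
    {
        entries[statisticsLine] = new LogEntry { Request = request, Response = response };
    }

    // The datagrid / filtering side can then just be LINQ over the keys.
    public IEnumerable<string> Filter(string term)
    {
        return entries.Keys.Where(k => k.Contains(term));
    }
}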
Possible Duplicate:
Removing the first line of a text file in C#
What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?
I think that you probably can't avoid rewriting the whole file chunk by chunk, but I might be wrong.
Could using memory-mapped files somehow help to solve this issue?
Is it possible to achieve this behavior by operating directly on the file system (NTFS, for example) - say, update the corresponding inode data and change the file's starting sector, so that the first line is ignored? If yes, would this approach be really fragile, or are there other applications, besides the OS itself, that do something similar?
NTFS by default on most volumes (but importantly not all!) stores data in 4096-byte chunks. These are referenced by the $MFT record, which you cannot edit directly because it's disallowed by the operating system (for reasons of sanity). As a result, there is no trick available to operate on the filesystem to do something approaching what you want (in other words, you cannot directly reverse-truncate a file on NTFS, even in filesystem-chunk-sized amounts).
Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file. Or figure out a different way to store your data. A 2-3 GB file is massive and unwieldy, especially considering you referred to lines, meaning that this data is at least in part text information.
Perhaps you should look into putting this data into a database? Or, at the very least, organizing it a bit more efficiently.
You can overwrite every character that you want to erase with '\x7f'. Then, when reading in the file, your reader ignores that character. This assumes you have a text file that doesn't ever use the DEL character, of course.
#include <istream>
#include <string>

// Reads one line and strips any runs of the DEL ('\x7f') marker that were
// used to overwrite "erased" characters in place.
std::istream &
my_getline (std::istream &in, std::string &s,
            char del = '\x7f', char delim = '\n') {
    std::getline(in, s, delim);
    std::size_t beg = s.find(del);
    while (beg != s.npos) {
        // Erase the contiguous run of marker characters starting at beg.
        std::size_t end = s.find_first_not_of(del, beg+1);
        s.erase(beg, end-beg);
        beg = s.find(del, beg+1);
    }
    return in;
}
As Henk points out, you could choose a different character to act as your DELETE. But, the advantage is that the technique works no matter which line you want to remove (it is not limited to the first line), and doesn't require futzing with the file system.
Using the modified reader, you can periodically "defragment" the file. Or, the defragmentation may occur naturally as the contents are streamed/merged into a different file or archived to a different machine.
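Since the original question is about C#, here is a rough equivalent of that modified reader, assuming (as above) that the DEL character never appears in legitimate data:

using System.Collections.Generic;
using System.IO;

// Yields each line with any DEL ('\x7f') marker characters stripped out.
// Lines that were fully overwritten come back empty; skip them if that matters.
static IEnumerable<string> ReadLinesIgnoringDeleted(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line.Replace("\u007f", string.Empty);
        }
    }
}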
Edit: You don't explicitly say it, but I am guessing this is for some kind of logging application, where the goal is to put an upper bound on the size of the log file. However, if that is the goal, it is much easier to just use a collection of smaller log files. Let's say you maintained roughly 10 MB log files, with total logs bounded to 4 GB. So that would be about 400 files. If the 401st file is started, for each line written there, you could use the DELETE marker on successive lines in the first file. When all lines have been marked for deletion, the file itself can be deleted, leaving you with about 400 files again. There is no hidden O(n²) behavior so long as the first file is not closed while the lines are being deleted.
But easier still is to allow your logging system to keep the 1st and 401st files as is, and remove the 1st file when moving to the 402nd file.
Even if you could remove a leading block it would at least be a sector (512 bytes), probably not a match to the size of your line.
Consider a wrapper (maybe even a helper file) to just start reading from a certain offset.
Idea (no magic dust, only hard work below):
Use a user-mode file system such as http://www.eldos.com/cbfs/ or http://dokan-dev.net/en/ to WRAP around your real filesystem, and create a small book-keeping system to track how much of the file has been 'eaten' at the front. At a certain point, when the file grows too big, rewrite it into another file and start over.
How about that?
EDIT:
If you go with a virtual file system, then you can use smaller (256 MB) file fragments that you can glue into one 'virtual' file with the desired offset. That way you won't ever need to rewrite the file.
MORE:
Reflecting on the idea of 'overwriting' the first few lines with 'nothing': don't do that. Instead, add one 64-bit integer to the FRONT of the file, and use any method you like to skip that many bytes, for example a Stream derivation that wraps the original stream and offsets the reading in it.
I guess that might be better if you choose to use wrappers on the 'client' side.
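A rough C# sketch of that header idea: the first 8 bytes of the file hold how many data bytes have been logically removed from the front, and readers just seek past them. The class and method names are illustrative, not an existing API:

using System;
using System.IO;

static class FrontTruncatedFile
{
    const int HeaderSize = sizeof(long); // 8-byte "bytes eaten at the front" counter

    // Opens the file for reading, positioned after the header plus the eaten bytes.
    public static Stream OpenForReading(string path)
    {
        var fs = new FileStream(path, FileMode.Open, FileAccess.Read);
        var header = new byte[HeaderSize];
        fs.Read(header, 0, HeaderSize);
        long eaten = BitConverter.ToInt64(header, 0);
        fs.Seek(HeaderSize + eaten, SeekOrigin.Begin);
        return fs;
    }

    // "Removes" bytes from the front by bumping the counter in the header.
    public static void AdvanceFront(string path, long additionalBytes)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            var header = new byte[HeaderSize];
            fs.Read(header, 0, HeaderSize);
            long eaten = BitConverter.ToInt64(header, 0) + additionalBytes;
            fs.Seek(0, SeekOrigin.Begin);
            fs.Write(BitConverter.GetBytes(eaten), 0, HeaderSize);
        }
    }
}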
Break the file in two, the first part being the smaller chunk.
Remove the first line from it and then join it back with the other part.
I have a file which contains a certain number of fixed-length rows holding some numbers. I need to read each row in order to get that number, process it, and write the result to a file.
Since I need to read each row, it becomes time-consuming as the number of rows increases.
Is there an efficient way of reading each row of the file? I'm using C#.
File.ReadLines (.NET 4.0+) is probably the most memory-efficient way to do this.
It returns an IEnumerable<string> meaning that lines will get read lazily in a streaming fashion.
Previous versions do not have the streaming option available in this manner, but using StreamReader to read line by line would achieve the same.
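For example, a minimal sketch (the actual parsing depends on your row layout):

using System.IO;

static void ProcessRows(string path)
{
    // The file is enumerated lazily, one line at a time; nothing is buffered up front.
    foreach (string line in File.ReadLines(path))
    {
        // parse the fixed-length row and process it here
    }
}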
Reading all rows from a file is always at least O(n). When file size starts becoming an issue, it's probably a good time to look at creating a database for the information instead of flat files.
Not sure this is the most efficient, but it works well for me:
http://msdn.microsoft.com/en-us/library/system.io.fileinfo.aspx
// Declare a new FileInfo and give it the path to your file
FileInfo fi1 = new FileInfo(path);

// Open the file and read the text
using (StreamReader sr = fi1.OpenText())
{
    string s;

    // Loop through each line
    while ((s = sr.ReadLine()) != null)
    {
        // Here is where you handle your row in the file
        Console.WriteLine(s);
    }
}
No matter which operating system you're using, there will be several layers between your code and the actual storage mechanism. Hard drives and tape drives store files in blocks, which these days are usually around 4K each. If you want to read one byte, the device will still read the entire block into memory -- it's just faster that way. The device and the OS also may each keep a cache of blocks. So there's not much you can do to change the standard (highly optimized) file reading behavior; just read the file as you need it and let the system take care of the rest.
If the time to process the file is becoming a problem, two options that might help are:
Try to arrange to use shorter files. It sounds like you're processing log files or something -- running your program more frequently might help to at least give the appearance of better performance.
Change the way the data is stored. Again, I understand that the file comes from some external source, but perhaps you can arrange for a job to run that periodically converts the raw file to something that you can read more quickly.
Good luck.
What is the best way for me to check for new files added to a directory? I don't think FileSystemWatcher would be suitable, as this is not an always-on service but a method that runs when my program starts up.
There are over 20,000 files in the folder structure I am monitoring. At present I am checking each file individually to see if the file path is in my database table; however, this is taking around ten minutes and I would like to speed it up if possible.
I can store the date the folder was last checked - is it easy to get all files with created date > last checked date?
Anyone got any ideas?
Thanks
Mark
Your approach is the only feasible one (i.e. a file system watcher lets you see changes as they happen, not check on start-up).
Find out what takes so long. 20,000 checks should not take 10 minutes - maybe 1 minute at most. Your program is doing it inefficiently. How do you test it?
Hint: do not ask the database for each file. Get a list of all files on disk into memory and a list of all files in the database, then check in memory. 20,000 SQL statements to the database are too slow; this way you need ONE query to get the list.
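Roughly like this (a sketch; LoadKnownPathsFromDatabase is a stand-in for whatever single query pulls the known file paths):

using System;
using System.Collections.Generic;
using System.IO;

static void FindNewFiles(string rootFolder)
{
    // One query for the database side, one directory walk for the disk side.
    var known = new HashSet<string>(
        LoadKnownPathsFromDatabase(),   // hypothetical helper returning IEnumerable<string>
        StringComparer.OrdinalIgnoreCase);

    foreach (var file in Directory.EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories))
    {
        if (!known.Contains(file))
        {
            // new file - queue it for insertion (batched, not one INSERT per file)
        }
    }
}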
10 minutes seems awfully long for 20,000 files. How are you going about doing the comparison? Your suggestion doesn't account for deleted files either. If you want to remove those from the database, you will have to do a full comparison.
Perhaps the problem is the database round trips. You can retrieve a known file list from the database in large chunks (or all at once), sorted alphabetically. Sort the local file list as well and walk the two lists, processing missing or new entries as you go along.
FileSystemWatcher is not reliable, so even if you could use a service, it would not necessarily work for you.
The two options I can see are:
Keep a list of files you know about and keep comparing to this list. This will allow you to see if files were added, deleted etc. Keep this list in memory, instead of querying the database for each file.
As you suggest, store a timestamp and compare to that.
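A sketch of that timestamp route; note that creation times can be unreliable for files copied into the folder, so treat it as a first-pass filter rather than the whole answer:

using System;
using System.IO;

static void CheckSince(string rootFolder, DateTime lastCheckedUtc)
{
    foreach (var file in Directory.EnumerateFiles(rootFolder, "*", SearchOption.AllDirectories))
    {
        if (File.GetCreationTimeUtc(file) > lastCheckedUtc)
        {
            // treat as a new file
        }
    }
}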
You can record somewhere the last timestamp at which a file was created; it is simple and can work for you.
Can you write a service that runs on that machine? The service can then use FileSystemWatcher.
Having a FileSystemWatcher service like Kevin Jones suggests is probably the most pragmatic answer, but there are some other options.
You can watch the directory with inotify if you mount it with Samba on a Linux box. That of course assumes you don't mind fragmenting your platform, but that's what inotify is there for.
And then, more correctly but with correspondingly less chance of you getting a go-ahead: if you're sitting monitoring a directory with 20K files in it, it is probably time to evolve your system architecture. Not knowing much more about your application, it sounds like a message queue might be worth looking at.
I'm using C# and I get a System.OutOfMemoryException after I read in 50,000 records. What is best practice for handling such large datasets? Will paging help?
I might recommend creating the MDB file and using a DataReader to stream the records into the MDB rather than trying to read in and cache the entire set of data locally. With a DataReader, the process is more manual, but you only get one record at a time so you won't fill up your memory.
You still shouldn't read everything in at once. Read in chunks, then write the chunk out to the mdb file, then read another chunk and add that to the file. Reading in 50,000 records at once is just asking for trouble.
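A hedged sketch of that chunked approach, assuming the records arrive through a source IDataReader and the MDB already exists with a matching table; the table, columns and connection string are illustrative only:

using System.Data;
using System.Data.OleDb;

static void CopyInChunks(IDataReader source, string mdbPath, int chunkSize)
{
    string connString = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + mdbPath;
    using (var conn = new OleDbConnection(connString))
    {
        conn.Open();
        var tx = conn.BeginTransaction();
        var cmd = new OleDbCommand(
            "INSERT INTO Records (Field1, Field2) VALUES (?, ?)", conn, tx);
        cmd.Parameters.Add("@p1", OleDbType.VarWChar, 255);
        cmd.Parameters.Add("@p2", OleDbType.VarWChar, 255);

        int count = 0;
        while (source.Read())                 // one record in memory at a time
        {
            cmd.Parameters[0].Value = source.GetValue(0);
            cmd.Parameters[1].Value = source.GetValue(1);
            cmd.ExecuteNonQuery();

            if (++count % chunkSize == 0)     // commit a chunk, start a new one
            {
                tx.Commit();
                tx = conn.BeginTransaction();
                cmd.Transaction = tx;
            }
        }
        tx.Commit();
    }
}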
Obviously, you can't read all the data in the memory before creating the MDB file, otherwise you wouldn't be getting out of memory exception. :-)
You have two options:
- partitioning - read the data in smaller chunks using filtering
- virtualizing - split the data in pages and load only the current page
In any case, you have to create the MDB file and transfer the data after that in chunks.
If you're using XML, just read a few nodes at a time. If you're using some other format, just read a few lines (or whatever) at a time. Don't load the entire thing into memory before you start working on it.
I would suggest using a generator:
"...instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator."
The Wikipedia article also has a few good examples.
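In C# the equivalent is an iterator method using yield return; a minimal sketch of reading records lazily (the parsing is a placeholder):

using System.Collections.Generic;
using System.IO;

static class RecordReader
{
    // Behaves like the generator described above: each record is produced on demand,
    // so the full set of 50,000+ records is never held in memory at once.
    public static IEnumerable<string> ReadRecords(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;   // replace with real record parsing as needed
            }
        }
    }
}

// Usage: foreach (var record in RecordReader.ReadRecords(path)) { /* write it to the MDB */ }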