I have a file containing a number of fixed-length rows, each holding a number. I need to read each row, extract the number, process it, and write the result to a file.
Since I need to read each row, as the number of rows increases it becomes time consuming.
Is there an efficient way of reading each row of the file? I'm using C#.
File.ReadLines (.NET 4.0+) is probably the most memory efficient way to do this.
It returns an IEnumerable<string> meaning that lines will get read lazily in a streaming fashion.
Previous versions do not have the streaming option available in this manner, but using StreamReader to read line by line would achieve the same.
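A minimal sketch of the streaming approach, assuming each fixed-length row holds one integer (possibly zero- or space-padded). The class name RowProcessor and the Process step are hypothetical placeholders for the real per-row work, which the question does not describe:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class RowProcessor
{
    // Hypothetical transformation; replace with the real per-row work.
    public static long Process(long value)
    {
        return value * 2;
    }

    // Streams inputPath row by row and writes one processed number per line.
    // inputPath and outputPath must be different files.
    public static void Run(string inputPath, string outputPath)
    {
        IEnumerable<string> results = File.ReadLines(inputPath)
            .Where(line => line.Trim().Length > 0)   // skip blank rows
            .Select(line => long.Parse(line.Trim(), CultureInfo.InvariantCulture))
            .Select(n => Process(n).ToString(CultureInfo.InvariantCulture));

        // WriteAllLines pulls from the lazy sequence as it writes,
        // so the rows are never all held in memory together.
        File.WriteAllLines(outputPath, results);
    }
}
```

Calling RowProcessor.Run("input.txt", "output.txt") would read, process and write in a single pass.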
Reading all rows from a file is always at least O(n). When file size starts becoming an issue, it's probably a good time to look at creating a database for the information instead of using flat files.
Not sure this is the most efficient, but it works well for me:
http://msdn.microsoft.com/en-us/library/system.io.fileinfo.aspx
// Declare a new file and give it the path to your file
FileInfo fi1 = new FileInfo(path);
// Open the file and read the text
using (StreamReader sr = fi1.OpenText())
{
    string s = "";
    // Loop through each line
    while ((s = sr.ReadLine()) != null)
    {
        // Here is where you handle your row in the file
        Console.WriteLine(s);
    }
}
No matter which operating system you're using, there will be several layers between your code and the actual storage mechanism. Hard drives and tape drives store files in blocks, which these days are usually around 4K each. If you want to read one byte, the device will still read the entire block into memory -- it's just faster that way. The device and the OS also may each keep a cache of blocks. So there's not much you can do to change the standard (highly optimized) file reading behavior; just read the file as you need it and let the system take care of the rest.
If the time to process the file is becoming a problem, two options that might help are:
Try to arrange to use shorter files. It sounds like you're processing log files or something -- running your program more frequently might help to at least give the appearance of better performance.
Change the way the data is stored. Again, I understand that the file comes from some external source, but perhaps you can arrange for a job to run that periodically converts the raw file to something that you can read more quickly.
Good luck.
What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?
I think, that you probably can't avoid rewriting the whole file chunk-by-chunk, but I might be wrong.
Could using memory-mapped files somehow help to solve this issue?
Is it possible to achieve this behavior by operating directly on the file system (NTFS, for example), say, by updating the corresponding inode data and changing the file's starting sector, so that the first line is ignored? If yes, would this approach be really fragile, or are there many other applications besides the OS itself that do something similar?
NTFS by default on most volumes (but importantly not all!) stores data in 4096 byte chunks. These are referenced by the $MFT record, which you cannot edit directly because it's disallowed by the Operating System (for reasons of sanity). As a result, there is no trick available to operate on the filesystem to do something approaching what you want (in other words, you cannot directly reverse truncate a file on NTFS, even in filesystem chunk sized amounts.)
Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file directly. Or figure out a different way to store your data. A 2-3 GB file is massive and crazy, especially considering you referred to lines, meaning that this data is at least in part text information.
You should look into putting this data into a database perhaps? Or organizing it a bit more efficiently at the very least.
You can overwrite every character that you want to erase with '\x7f'. Then, when reading in the file, your reader ignores that character. This assumes you have a text file that doesn't ever use the DEL character, of course.
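#include <istream>
#include <string>

// Reads one line and strips every run of the DEL marker from it.
std::istream &
my_getline (std::istream &in, std::string &s,
            char del = '\x7f', char delim = '\n') {
    std::getline(in, s, delim);
    std::size_t beg = s.find(del);
    while (beg != s.npos) {
        // end is npos when the markers run to the end of the line;
        // erase() then removes everything from beg onward.
        std::size_t end = s.find_first_not_of(del, beg+1);
        s.erase(beg, end-beg);
        beg = s.find(del, beg+1);
    }
    return in;
}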
As Henk points out, you could choose a different character to act as your DELETE. But, the advantage is that the technique works no matter which line you want to remove (it is not limited to the first line), and doesn't require futzing with the file system.
Using the modified reader, you can periodically "defragment" the file. Or, the defragmentation may occur naturally as the contents are streamed/merged into a different file or archived to a different machine.
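For the C# side of the question, here is a rough sketch of the write half of this technique: it overwrites the first line in place, newline included, so the file length never changes. It assumes an ASCII or UTF-8 file in which the byte 0x7F never occurs as real data. The names LineEraser, BlankFirstLine and ReadLinesSkippingDel are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class LineEraser
{
    public const byte Del = 0x7F;

    // Overwrites the first line, including its '\n', with DEL bytes.
    // Returns the number of bytes overwritten; the file size is unchanged.
    public static long BlankFirstLine(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite, FileShare.None))
        {
            // Measure the first line (up to and including the newline).
            long count = 0;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                count++;
                if (b == '\n') break;
            }

            // Go back to the start and overwrite exactly that many bytes.
            fs.Seek(0, SeekOrigin.Begin);
            var filler = new byte[4096];
            for (int i = 0; i < filler.Length; i++) filler[i] = Del;

            long remaining = count;
            while (remaining > 0)
            {
                int n = (int)Math.Min(remaining, filler.Length);
                fs.Write(filler, 0, n);
                remaining -= n;
            }
            return count;
        }
    }

    // Reader counterpart: streams the lines and drops every DEL marker.
    public static IEnumerable<string> ReadLinesSkippingDel(string path)
    {
        return File.ReadLines(path).Select(line => line.Replace("\u007f", ""));
    }
}
```

Because the newline is overwritten too, the markers merge into the start of the next line and the erased line disappears entirely for any reader that strips DEL.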
Edit: You don't explicitly say it, but I am guessing this is for some kind of logging application, where the goal is to put an upper bound on the size of the log file. However, if that is the goal, it is much easier to just use a collection of smaller log files. Let's say you maintained roughly 10MB log files, with total logs bounded to 4GB. So that would be about 400 files. If the 401st file is started, for each line written there, you could use the DELETE marker on successive lines in the first file. When all lines have been marked for deletion, the file itself can be deleted, leaving you with about 400 files again. There is no hidden O(n^2) behavior so long as the first file is not closed while the lines are being deleted.
But easier still is to allow your logging system to keep the 1st and 401st file as is, and remove the 1st file when moving to the 402nd file.
Even if you could remove a leading block it would at least be a sector (512 bytes), probably not a match to the size of your line.
Consider a wrapper (maybe even a helper file) to just start reading from a certain offset.
Idea (no magic dust, only hard work below):
Use a user-mode file system such as http://www.eldos.com/cbfs/ or http://dokan-dev.net/en/ to WRAP around your real filesystem, and create a small book-keeping system to track how much of the file has been 'eaten' at the front. At some point, when the file grows too big, rewrite the file into another and start over.
How about that?
EDIT:
If you go with a virtual file system, then you can use smaller (256 MB) file fragments that you glue into one 'virtual' file with the desired offset. That way you won't ever need to rewrite the file.
MORE:
Reflecting on the idea of 'overwriting' the first few lines with 'nothing': don't do that. Instead, add one 64-bit integer to the FRONT of the file, and use any method you like to skip that many bytes, for example a Stream-derived class that wraps the original stream and offsets reads into it.
I guess that might be better if you choose to use wrappers on the 'client' side.
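A rough sketch of the 64-bit header idea, under these assumptions: the file is one you create yourself, its first 8 bytes store how many bytes of the following text count as removed, and the text is UTF-8. Instead of deriving from Stream it simply repositions the stream, which gives the same effect for sequential reading. OffsetFile and its methods are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class OffsetFile
{
    private const int HeaderSize = sizeof(long);

    // Creates the file with a zero "bytes removed" header followed by the text.
    public static void Create(string path, string text)
    {
        using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
        using (var writer = new BinaryWriter(fs, Encoding.UTF8))
        {
            writer.Write(0L);
            writer.Write(Encoding.UTF8.GetBytes(text));
        }
    }

    // "Removes" the first remaining line by advancing the stored offset.
    // Only the 8-byte header is rewritten, never the data.
    public static void DropFirstLine(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        using (var reader = new BinaryReader(fs, Encoding.UTF8, true))
        using (var writer = new BinaryWriter(fs, Encoding.UTF8, true))
        {
            long skipped = reader.ReadInt64();
            fs.Position = HeaderSize + skipped;
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                skipped++;
                if (b == '\n') break;
            }
            fs.Position = 0;
            writer.Write(skipped);
        }
    }

    // Streams the lines that have not been removed yet.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            long skipped;
            using (var header = new BinaryReader(fs, Encoding.UTF8, true))
            {
                skipped = header.ReadInt64();
            }
            fs.Position = HeaderSize + skipped;
            using (var text = new StreamReader(fs, Encoding.UTF8))
            {
                string line;
                while ((line = text.ReadLine()) != null)
                {
                    yield return line;
                }
            }
        }
    }
}
```

The dead bytes still occupy disk space, so an occasional full rewrite is still needed to reclaim it.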
Break the file in two, the first being the smaller chunk.
Remove the first line from the smaller chunk and then join it back with the other.
I have an XML file that needs to be read many, many times. I am trying to use Parallel.ForEach to speed this process up, since the order in which the data is read does not matter. The data is just being used to populate objects. My problem is that even though I am opening the file as read-only each time in the thread, it complains that the file is open in another program. (I don't have it opened in a text editor or anything :))
How can I accomplish multi reads from the same file?
EDIT: The file is ~18KB pretty small. It is read from about 1,800 times.
Thanks
If you want multiple threads to read from the same file, you need to specify FileShare.Read:
using (var stream = File.Open("theFile.xml", FileMode.Open, FileAccess.Read, FileShare.Read))
{
...
}
However, you will not achieve any speedup from this, for multiple reasons:
Your hard disk can only read one thing at a time. Although you have multiple threads running at the same time, these threads will all end up waiting for each other.
You cannot easily parse a part of an XML file. You will usually have to parse the entire XML file every time. Since you have multiple threads reading it all the time, it seems that you are not expecting the file to change. If that is the case, then why do you need to read it multiple times?
Depending on the size of the file and the type of reads you are doing it might be faster to load the file into memory first, and then provide access to it directly to your threads.
You didn't provide any specifics on the file, the reads, etc., so I can't say for sure if it would address your specific needs.
The general premise would be to load the file once in a single thread, and then either directly (via the Xml structure) or indirectly (via XmlNodes, etc) provide access to the file to each of your threads. I envision something similar to:
Load the file
For each Xpath query dispatch the matching nodes to your threads.
If the threads don't modify the XML directly, this might be a viable alternative.
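One conservative way to sketch the load-once idea: read the file into a string a single time, then let each parallel worker parse its own XDocument from that string. Strings are immutable, so sharing one is safe, and no worker ever touches the file. The XmlFanOut class and the element-counting work are stand-ins for whatever the threads really do:

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class XmlFanOut
{
    // Hits the disk once, then counts the elements with each given name in parallel.
    public static int[] CountElements(string path, string[] elementNames)
    {
        string xml = File.ReadAllText(path);   // the only file access
        var counts = new int[elementNames.Length];

        Parallel.For(0, elementNames.Length, i =>
        {
            // Each worker parses a private document from the shared string.
            XDocument doc = XDocument.Parse(xml);
            // Every iteration writes to its own slot, so no locking is needed.
            counts[i] = doc.Descendants(elementNames[i]).Count();
        });

        return counts;
    }
}
```

Parsing once and sharing a single XDocument would save more work, but the documentation does not guarantee that instance members are thread safe, so this sketch keeps one document per worker.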
When you open the file, you need to specify FileShare.Read :
using (var stream = new FileStream("theFile.xml", FileMode.Open, FileAccess.Read, FileShare.Read))
{
...
}
That way the file can be opened multiple times for reading.
While an old post, it seems to be a popular one so I thought I would add a solution that I have used to good effect for multi-threaded environments that need read access to a file. The file must however be small enough to hold in memory at least for the duration of your processing, and the file must only be read and not written to during the period of shared access.
string FileName = "TextFile.txt";
string[] FileContents = File.ReadAllLines(FileName);
foreach (string strOneLine in FileContents)
{
// Do work on each line of the file here
}
So long as the file is only being read, multiple threads or programs can access and process it at the same time without treading on one another's toes.
I have a program that opens a large binary file, appends a small amount of data to it, and closes the file.
FileStream fs = File.Open( "\\\\s1\\temp\\test.tmp", FileMode.Append, FileAccess.Write, FileShare.None );
fs.Write( data, 0, data.Length );
fs.Close();
If test.tmp is 5MB before this program is run and the data array is 100 bytes, this program will cause over 5MB of data to be transmitted across the network. I would have expected that the data already in the file would not be transmitted across the network since I'm not reading it or writing it. Is there any way to avoid this behavior? This makes it agonizingly slow to append to very large files.
0xA3 provided the answer in a comment above. The poor performance was due to an on-access virus scan. Each time my program opened the file, the virus scanner read the entire contents of the file to check for viruses even though my program didn't read any of the existing content. Disabling the on-access virus scan eliminated the excessive network I/O and the poor performance.
Thanks to everyone for your suggestions.
I found this on MSDN (CreateFile is called internally):
When an application creates a file across a network, it is better to use GENERIC_READ | GENERIC_WRITE for dwDesiredAccess than to use GENERIC_WRITE alone. The resulting code is faster, because the redirector can use the cache manager and send fewer SMBs with more data. This combination also avoids an issue where writing to a file across a network can occasionally return ERROR_ACCESS_DENIED.
Using Reflector, FileAccess maps to dwDesiredAccess, so it would seem to suggest using FileAccess.ReadWrite instead of just FileAccess.Write.
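If you want to try that suggestion, note that FileMode.Append can only be combined with FileAccess.Write, so the file has to be opened normally and positioned at the end by hand. A minimal sketch (NetworkAppend is a made-up name, and whether this actually reduces network traffic is untested):

```csharp
using System;
using System.IO;

public static class NetworkAppend
{
    // Appends data using read/write access instead of write-only access.
    public static void Append(string path, byte[] data)
    {
        // FileMode.Append would throw with FileAccess.ReadWrite,
        // so open the file normally and seek to the end ourselves.
        using (var fs = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite, FileShare.None))
        {
            fs.Seek(0, SeekOrigin.End);
            fs.Write(data, 0, data.Length);
        }
    }
}
```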
I have no idea if this will help :)
You could cache your data into a local buffer and periodically (much less often than now) append to the large file. This would save on a bunch of network transfers but... This would also increase the risk of losing that cache (and your data) in case your app crashes.
Logging (if that's what it is) of this type is often stored in a db. Using a decent RDBMS would allow you to post that 100 bytes of data very frequently with minimal overhead. The caveat there is the maintenance of an RDBMS.
If you have system access or perhaps a friendly admin for the machine actually hosting the file you could make a small listener program that sits on the other end.
You make a call to it passing just the data to be written and it does the write locally, avoiding the extra network traffic.
The File object in .NET has quite a few static methods to handle this type of thing. I would suggest trying:
File.AppendAllText("FilePath", "What to append", Encoding.UTF8);
When you reflect this method it turns out that it's using:
using (StreamWriter writer = new StreamWriter(path, true, encoding))
{
writer.Write(contents);
}
This StreamWriter method should allow you to simply append something to the end (at least this is the method I've seen used in every instance of logging that I've encountered so far).
Write the data to separate files, then join them (do it on the hosting machine if possible) only when necessary.
I did some googling and was looking more at how to read excessively large files quickly and found this link https://web.archive.org/web/20190906152821/http://www.4guysfromrolla.com/webtech/010401-1.shtml
The most interesting part there would be the part about byte reading:
Besides the more commonly used ReadAll and ReadLine methods, the TextStream object also supports a Read(n) method, where n is the number of bytes in the file/textstream in question. By instantiating an additional object (a file object), we can obtain the size of the file to be read, and then use the Read(n) method to race through our file. As it turns out, the "read bytes" method is extremely fast by comparison:
const ForReading = 1
const TristateFalse = 0
dim strSearchThis
dim objFS
dim objFile
dim objTS
set objFS = Server.CreateObject("Scripting.FileSystemObject")
set objFile = objFS.GetFile(Server.MapPath("myfile.txt"))
set objTS = objFile.OpenAsTextStream(ForReading, TristateFalse)
strSearchThis = objTS.Read(objFile.Size)
if instr(strSearchThis, "keyword") > 0 then
Response.Write "Found it!"
end if
You could then use this approach to go to the end of the file and append to it manually, instead of opening the entire file in append mode with a FileStream.
Was wondering if anyone had any favourite methods or useful libraries for processing a tab-delimited text file? This file is going to have on average 30,000 - 50,000 rows in it. I just need to read through each row and throw it into a database. However, I'd need to temporarily store all the data, the reason being that if the table holding the data gets to more than 1,000,00 rows, I'll need to create a new table and put the data in there. The code will be run in a Windows service, so I'm not worried about processing time.
Was thinking about just doing a standard while(sr.ReadLine()) ... any suggestions?
Cheers,
Sean.
FileHelpers
This library is very flexible and fast. I never get tired of recommending it. It defaults to ',' as a delimiter, but you can change it to '\t' easily.
I suspect "throwing it into a database" will take at least 1 order of magnitude longer than reading a line into a buffer, so you could pre-scan the data just to count the number of rows (without parsing them). Then make your database decisions. Then re-read the data doing the real work. With luck, the OS will have cached the file so it reads even quicker.
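The two-pass idea could look something like this. It is only a sketch: TabFile and FitsInTable are invented names, and the row limit is passed in as a parameter rather than hard-coded:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class TabFile
{
    // Pass 1: count the rows without splitting or parsing them.
    public static long CountRows(string path)
    {
        return File.ReadLines(path).LongCount();
    }

    // Pass 2: stream the rows again, this time split on tabs.
    public static IEnumerable<string[]> ReadRows(string path)
    {
        return File.ReadLines(path).Select(line => line.Split('\t'));
    }

    // Hypothetical decision rule: will the incoming rows keep the table within its limit?
    public static bool FitsInTable(long existingRows, long incomingRows, long maxRows)
    {
        return existingRows + incomingRows <= maxRows;
    }
}
```

You would call CountRows first, pick (or create) the target table based on FitsInTable, and only then loop over ReadRows to do the inserts, so nothing has to be buffered in memory.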
I'm using C# and I get a System.OutOfMemoryException after I read in 50,000 records. What is the best practice for handling such large datasets? Will paging help?
I might recommend creating the MDB file and using a DataReader to stream the records into the MDB rather than trying to read in and cache the entire set of data locally. With a DataReader, the process is more manual, but you only get one record at a time so you won't fill up your memory.
You still shouldn't read everything in at once. Read in chunks, then write the chunk out to the mdb file, then read another chunk and add that to the file. Reading in 50,000 records at once is just asking for trouble.
Obviously, you can't read all the data in the memory before creating the MDB file, otherwise you wouldn't be getting out of memory exception. :-)
You have two options:
- partitioning - read the data in smaller chunks using filtering
- virtualizing - split the data in pages and load only the current page
In any case, you have to create the MDB file and transfer the data after that in chunks.
If you're using XML, just read a few nodes at a time. If you're using some other format, just read a few lines (or whatever) at a time. Don't load the entire thing into memory before you start working on it.
I would suggest using a generator:
"...instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator."
The Wikipedia article also has a few good examples.
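In C#, a generator is written as an iterator method using yield return. As an illustration only (Batching and InBatches are made-up names), this one hands out records in fixed-size chunks so that only one chunk is in memory at a time:

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Iterator method: yields one batch at a time instead of building the full list.
    // Because it is an iterator, the argument check runs when enumeration starts.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");

        var batch = new List<T>(batchSize);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);   // start a fresh chunk
            }
        }

        // Hand out the final, possibly smaller, chunk.
        if (batch.Count > 0) yield return batch;
    }
}
```

Feeding it a lazy source such as File.ReadLines or a DataReader loop, and writing each batch to the MDB before requesting the next, keeps memory use bounded by the batch size.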