Remove first line from a file [duplicate] - c#

Possible Duplicate:
Removing the first line of a text file in C#
What would be the fastest and smartest way to remove the first line from a huge (think 2-3 GB) file?
I think that you probably can't avoid rewriting the whole file chunk by chunk, but I might be wrong.
Could using memory-mapped files somehow help to solve this issue?
Is it possible to achieve this by operating directly on the file system (NTFS, for example) - say, update the corresponding inode data and change the file's starting sector so that the first line is ignored? If yes, would this approach be really fragile, or are there other applications, besides the OS itself, that do something similar?

NTFS by default on most volumes (but importantly not all!) stores data in 4096-byte chunks. These are referenced by the $MFT record, which you cannot edit directly because the operating system disallows it (for reasons of sanity). As a result, there is no trick available at the filesystem level to do something approaching what you want (in other words, you cannot directly reverse-truncate a file on NTFS, even in filesystem-chunk-sized amounts).
Because of the way files are stored in a filesystem, the only answer is that you must rewrite the entire file. Or figure out a different way to store your data. A 2-3 GB file is massive, especially considering that you referred to lines, which means this data is at least in part text.
Perhaps you should look into putting this data into a database? Or at the very least organize it a bit more efficiently.
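For what it's worth, a minimal sketch of the rewrite-the-whole-file approach in C# (the class name, file naming, and buffer size are illustrative, not part of the original answer): it skips everything up to and including the first newline, copies the remainder to a temporary file in large chunks, and then swaps the temporary file into place.

using System.IO;

class FirstLineRemover
{
    // Copies everything after the first '\n' of 'path' into a temporary file,
    // then swaps the temporary file into place.
    static void RemoveFirstLine(string path)
    {
        string tempPath = path + ".tmp";

        using (var input = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var output = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            // Skip bytes up to and including the first newline.
            int b;
            while ((b = input.ReadByte()) != -1 && b != '\n') { }

            // Copy the remainder in 1 MB chunks.
            var buffer = new byte[1 << 20];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                output.Write(buffer, 0, read);
        }

        // Replace the original; a backup path could be passed instead of null.
        File.Replace(tempPath, path, null);
    }
}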

You can overwrite every character that you want to erase with '\x7f'. Then, when reading in the file, your reader ignores that character. This assumes you have a text file that doesn't ever use the DEL character, of course.
std::istream &
my_getline (std::istream &in, std::string &s,
            char del = '\x7f', char delim = '\n') {
    std::getline(in, s, delim);
    // Strip every run of the DEL placeholder from the line just read.
    std::size_t beg = s.find(del);
    while (beg != s.npos) {
        std::size_t end = s.find_first_not_of(del, beg + 1);
        s.erase(beg, end - beg);
        beg = s.find(del, beg + 1);
    }
    return in;
}
As Henk points out, you could choose a different character to act as your DELETE. But, the advantage is that the technique works no matter which line you want to remove (it is not limited to the first line), and doesn't require futzing with the file system.
Using the modified reader, you can periodically "defragment" the file. Or, the defragmentation may occur naturally as the contents are streamed/merged into a different file or archived to a different machine.
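Since the question is about C#, a rough equivalent of that reader might look like the sketch below (the helper name is made up; it assumes a plain TextReader and the same '\x7f' placeholder).

using System.IO;

static class FilteredReader
{
    // Reads the next line and strips any '\x7f' (DEL) placeholders,
    // mirroring the C++ my_getline above. Returns null at end of stream.
    public static string ReadLineSkippingDeleted(TextReader reader)
    {
        string line = reader.ReadLine();
        return line == null ? null : line.Replace("\x7f", string.Empty);
    }
}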
Edit: You don't explicitly say it, but I am guessing this is for some kind of logging application, where the goal is to put an upper bound on the size of the log file. However, if that is the goal, it is much easier to just use a collection of smaller log files. Let's say you maintained roughly 10 MB log files, with total logs bounded to 4 GB. So that would be about 400 files. If the 401st file is started, for each line written there, you could use the DELETE marker on successive lines in the first file. When all lines have been marked for deletion, the file itself can be deleted, leaving you with about 400 files again. There is no hidden O(n²) behavior so long as the first file is not closed while the lines are being deleted.
But easier still is to allow your logging system to keep the 1st and 401st files as-is, and remove the 1st file when moving to the 402nd file.

Even if you could remove a leading block, it would be at least a sector (512 bytes), which probably doesn't match the size of your line.
Consider a wrapper (maybe even a helper file) to just start reading from a certain offset.
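A minimal sketch of that wrapper idea, assuming the current start offset is kept in a sidecar file next to the data (the helper name and ".offset" convention are illustrative):

using System.IO;

static class OffsetFile
{
    // Opens dataPath for reading, seeking past the byte count recorded in a
    // sidecar ".offset" file; "removing" the first line just means advancing
    // that stored offset, so the data file itself is never rewritten.
    public static StreamReader OpenFromOffset(string dataPath)
    {
        string offsetPath = dataPath + ".offset";
        long offset = File.Exists(offsetPath)
            ? long.Parse(File.ReadAllText(offsetPath))
            : 0;

        FileStream stream = File.OpenRead(dataPath);
        stream.Seek(offset, SeekOrigin.Begin);
        return new StreamReader(stream);
    }
}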

Idea (no magic dust, only hard work below):
Use a user-mode file system such as http://www.eldos.com/cbfs/ or http://dokan-dev.net/en/ to WRAP around your real filesystem, and create a small book-keeping system to track how much of the file has been 'eaten' at the front. At some point, when the file grows too big, rewrite it into another file and start over.
How about that?
EDIT:
If you go with a virtual file system, then you can use smaller (256 MB) file fragments that you can glue into one 'virtual' file with the desired offset. That way you won't ever need to rewrite the file.
MORE:
Reflecting on the idea of 'overwriting' the first few lines with 'nothing': don't do that. Instead, add one 64-bit integer to the FRONT of the file, and use any method you like to skip that many bytes, for example a Stream derivation that wraps the original stream and offsets reads into it.
I guess that might be better if you choose to use wrappers on the 'client' side.
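A rough sketch of that header idea, assuming the first 8 bytes of the file hold a little-endian count of payload bytes already consumed (the class and method names are made up for illustration):

using System.IO;

static class PrefixedFile
{
    // Opens the payload of a file whose first 8 bytes hold the number of
    // payload bytes already consumed (a little-endian Int64, by assumption).
    public static StreamReader OpenPayload(string path)
    {
        FileStream stream = File.OpenRead(path);
        byte[] header = new byte[8];
        stream.Read(header, 0, 8);
        long consumed = System.BitConverter.ToInt64(header, 0);
        stream.Seek(8 + consumed, SeekOrigin.Begin);
        return new StreamReader(stream);
    }

    // Logically "removes" bytes from the front by advancing the header value;
    // the payload itself is never touched.
    public static void MarkConsumed(string path, long bytes)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            byte[] header = new byte[8];
            stream.Read(header, 0, 8);
            long consumed = System.BitConverter.ToInt64(header, 0) + bytes;
            stream.Seek(0, SeekOrigin.Begin);
            stream.Write(System.BitConverter.GetBytes(consumed), 0, 8);
        }
    }
}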

Break the file in two, the first part being the smaller chunk.
Remove the first line from it and then join it back onto the other part.


Is it possible to read from and write to the same text file in C#

Here is the situation. I have over 1500 SQL (text) files that contain a "GRANT some privilege ON some table TO some user" statement after the CREATE TABLE statement. I need to remove them from the original file and put the GRANT statements in their own file. Sometimes the GRANTs are on a single line and sometimes they are split across multiple lines. For example:
GRANT SELECT ON XYZ.TABLE1 TO MYROLE1 ;
GRANT
SELECT ON XYZ.TABLE1 TO MYROLE2 ;
GRANT
DELETE,
INSERT,
SELECT,
UPDATE ON XYZ.TABLE1 TO MYROLE3;
I am reading through the file until I get to the GRANT and then building a string containing the text from the GRANT to the semicolon, which I then write out to another file. I have an app I wrote in Delphi (Pascal) and this part works great. What I would like to do is, after I have read and processed the line I want, delete that line from the original text file. I can't do this in Delphi. The only solution there is to read the file line by line and write it back out to another file, excluding the lines I don't want, while also writing the GRANTs to yet another file, then delete the original and rename the new one. Way too much processing and risk.
I looked at using StreamReader and StreamWriter in C# but it appears to be a similar situation to Delphi: I can read or I can write, but I can't do both to the same file at the same time.
I would appreciate any suggestions or recommendations.
Thanks
If you think there's "way too much processing and risk" in generating a new temporary file without the lines you don't want and replacing the original, then consider the alternative you're hoping to achieve.
Line 1
Line 2
Delete this line
+-->Line 4
| Line 5
|
+- Read position marker after reading line to be deleted
If you immediately delete the line while reading, the later lines have to be moved back into the "empty space" left behind after the 3rd line is deleted. In order to ensure you next read "Line 4", you'd have to backtrack your read-position-marker. What's the correct amount to backtrack? A "line" of variable length, or the number of characters of the deleted line?
What you perceive to be the "risky" option is actually the safe option!
If you want to delete while processing you can use an abstraction that gives you that impression. But you lose the benefits of stream processing and don't really eliminate any of the risk you were worried about in the first place.
E.g. load your entire file into a list of strings, such as an array, vector, or TStringList (in Delphi). Iterate the list and delete the items you don't want. Finally, save the list back to the file.
This approach has the following disadvantages:
Potential high memory overhead because you load the entire file instead of small buffer for the stream.
You're at risk of mid-process failure with no recovery, because your job is all-or-nothing.
You have to deal with the nuances of the particular container you choose to hold your list of strings.
In some cases (e.g. TStringList) you might still need to backtrack your position marker in a similar fashion to the earlier description.
For arrays you'd have to copy all lines back 1 position every time you delete something with a huge performance cost. (The same happens in TStringList though it's hidden from you.)
Iterators for some containers are invalidated whenever you modify the list. This means you'd have to copy to a new list without the 'deleted lines' in any case. More memory overhead.
In conclusion, take the safe option. Use a separate read and write stream; write to a temporary file, and rename when done. It will save you headaches.
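Applied to the GRANT example, a sketch of that pattern in C# might look like this (the GRANT detection, class name, and file names are simplified assumptions, not a drop-in solution):

using System.IO;

class GrantSplitter
{
    // Streams the source, writing GRANT...; statements to one file and
    // everything else to a filtered copy, then swaps the copy into place.
    static void SplitGrants(string sourcePath, string grantsPath)
    {
        string tempPath = sourcePath + ".tmp";

        using (var reader = new StreamReader(sourcePath))
        using (var filtered = new StreamWriter(tempPath))
        using (var grants = new StreamWriter(grantsPath, append: true))
        {
            string line;
            bool inGrant = false;
            while ((line = reader.ReadLine()) != null)
            {
                if (!inGrant && line.TrimStart().StartsWith("GRANT"))
                    inGrant = true;

                if (inGrant)
                {
                    grants.WriteLine(line);
                    if (line.Contains(";"))
                        inGrant = false;   // statement ends at the semicolon
                }
                else
                {
                    filtered.WriteLine(line);
                }
            }
        }

        File.Delete(sourcePath);
        File.Move(tempPath, sourcePath);
    }
}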

FIFO log file in C#

I am making an evolution for a log file system I have had in place for a few builds of a service I develop on. I had previously been opening the file, appending data, and prior to writing checking to see if the log file had grown over a predetermined size, if so starting a new log.
So say the log size limit was 100 MB: at that size I delete the file and start a new one, but I lose the history. Functional, but not the best model.
What I want to do is a FIFO model that would chop off the top and add to the end, while keeping the file consistently no larger than 100 MB and retaining at least as much history as that represents.
The data comes in at high speed in a failure-prone industrial environment, so keeping it all in memory and writing the whole file at an interval has proven unreliable. (An SSD is fast enough to do it reasonably most of the time; spinning disks fail too often to tolerate.)
Likewise, the records are of greatly variable length (formatted as XML nodes, so parsing them back out accounts for this easily).
So the only workable model I have come up with thus far is to keep smaller slices (say 10 MB chunks), create new ones, and delete the oldest 10 MB slice once the count reaches 10.
What I would prefer to do is be able to keep the file on disk and work with the tag ends.
Open to suggestions on how this might be best achieved in a reasonable manner, or is there no reasonable manner and the layered multi log approach will be the best option?
The biggest issue with expiring old log entries in a single file is that you have to rewrite the file's content in order to expire older entries. This isn't too bad for small files (up to a few MB in size), but once you get to the point where rewriting takes a significant period of time it becomes problematic.
One of the more common ways to retire logs is to rename the existing log file and/or start a new file. Lots of programs do it that way, with either dated log file names or by using a sequential numbering system - logfile, logfile.1, logfile.2, etc. with higher-numbered files being older. You can add compression to the process to further reduce the storage requirements for expired files, etc.
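A minimal sketch of that numbered-rotation idea (the 100 MB cap, the file count, and the class name are assumptions chosen to match the question):

using System.IO;

static class LogRotator
{
    const long MaxSizeBytes = 100 * 1024 * 1024; // per-file cap
    const int MaxFiles = 10;                     // logfile.1 .. logfile.9 kept

    // Call before appending; shifts logfile -> logfile.1 -> logfile.2 ... when full.
    public static void RotateIfNeeded(string path)
    {
        if (!File.Exists(path) || new FileInfo(path).Length < MaxSizeBytes)
            return;

        // Drop the oldest, then shift everything up by one.
        string oldest = path + "." + (MaxFiles - 1);
        if (File.Exists(oldest))
            File.Delete(oldest);

        for (int i = MaxFiles - 2; i >= 1; i--)
        {
            string from = path + "." + i;
            if (File.Exists(from))
                File.Move(from, path + "." + (i + 1));
        }

        File.Move(path, path + ".1");
    }
}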
Another option is to use a more database-like format, or an out-and-out database like SQLite to store your log entries. The primary downside of this of course is that your log files become more difficult to read, since they're not just in plain text form. It's simple enough to write a dump-to-text program whose output can be piped to a log parser... but even this will probably require a change in the way your consumers are interfacing with the log file.
The problem as stated is unlikely to be realistically solvable, I suspect. On the one hand you have the limitations of file manipulation, and on the other the fact that your log consumers are many and varied and therefore changes to the logging structure will be an involved process.
About all I can suggest is that you trial a log aging process similar to this:
Rename current log file
Walk renamed file and copy desired contents to new log file
Discard or archive renamed log
Beware of duplication or data loss.
I don't know why you need this "chop off the top and add to the end while keeping it consistently no larger than 100 MB" behaviour.
The general design approach is archiving: simply rename the oversized file to another name, or move it somewhere else, then reuse the same filename for the new file.
It's as simple as that.

Is there any alternative to "truncate" which sounds safer?

I have an application that reads a linked list1 from a file when it starts, and writes it back to the file when it ends. I chose truncate as the file mode when writing back. However, truncate sounds a little bit dangerous to me, as it clears the whole content first. Thus, if something goes wrong, I cannot get my old data back. Is there any better alternative?
1: I use a linked list because the order of items may change. Thus I later use truncate to update the whole file.
The credit for the right answer goes to Hans, as he first pointed out File.Replace(), though it is not available in Silverlight for now.
Write to a new temporary file. When finished and satisfied with the result, delete the old file and rename/copy the new temporary file into the original file's location. This way, should anything go wrong, you are not losing data.
As pointed out in Hans Passant's answer, you should use File.Replace for maximum robustness when replacing the original file.
This is covered well by the .NET framework. Use the File.Replace() method. It securely replaces the content of your original file with the content of another file, leaving the original intact if there's any problem with the file system. It is a better mousetrap than the upvoted answers; they'll fail when there's a pending delete on the original file.
There's an overload that lets you control whether the original file is preserved as a backup file. It is best if you let the function create the backup; it significantly increases the odds that the function will succeed when another process has a lock on your file, the most typical failure mode, because that process gets to keep its lock on the backup file. The method also works best when you create the intermediate file on the same drive as the original, so you'll want to avoid GetTempFileName(). A good way to generate a filename is Guid.NewGuid().ToString().
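For example, a minimal sketch along those lines (the class name and paths are illustrative; note that File.Replace requires the destination file to already exist):

using System;
using System.IO;

class SafeSave
{
    // Writes the new content to a temp file on the same drive as the target,
    // then swaps it in with File.Replace, keeping the old version as a .bak.
    static void Save(string path, string contents)
    {
        string dir = Path.GetDirectoryName(Path.GetFullPath(path));
        string tempPath = Path.Combine(dir, Guid.NewGuid().ToString());

        File.WriteAllText(tempPath, contents);
        File.Replace(tempPath, path, path + ".bak");
    }
}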
The "best" alternative for robustness would be to do the following:
Create a new file for the data you're persisting to disk
Write the data out to the new file
Perform any necessary data verification
Delete the original file
Move the new file to the original file location
You can use System.IO.Path.GetTempFileName to provide you with a uniquely named temporary file to use for step 1.
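Sketched out, those steps might look something like this (the verification in step 3 is just a placeholder check, and the class name is made up):

using System.IO;

class ManualSwap
{
    // The create / write / verify / delete / move sequence described above.
    static void Save(string path, string contents)
    {
        string tempPath = Path.GetTempFileName();   // step 1
        File.WriteAllText(tempPath, contents);      // step 2

        // Step 3: verification is just a placeholder re-read here.
        if (File.ReadAllText(tempPath) != contents)
            throw new IOException("Verification of the temporary file failed.");

        File.Delete(path);                          // step 4
        File.Move(tempPath, path);                  // step 5
    }
}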
You have thought of using truncate, so I assume your input data is always written anew; therefore:
Use a try...catch to rename your original file to something like 'originalname_day_month_year.bak'.
Write your file from scratch with the new data.
In this way you don't have to worry about losing anything and, as a side effect, you have a backup copy of your previous data. If that backup is not needed, you can always delete the backup file.

Most efficient way of reading file

I have a file which contains a certain number of fixed-length rows holding some numbers. I need to read each row in order to get that number, process it, and write the result to a file.
Since I need to read each row, this becomes time-consuming as the number of rows increases.
Is there an efficient way of reading each row of the file? I'm using C#.
File.ReadLines (.NET 4.0+) is probably the most memory efficient way to do this.
It returns an IEnumerable<string> meaning that lines will get read lazily in a streaming fashion.
Previous versions do not have the streaming option available in this manner, but using StreamReader to read line by line would achieve the same.
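For example, a minimal sketch using File.ReadLines (the file name and the doubling step are placeholders for your own processing):

using System;
using System.IO;

class LineProcessor
{
    static void Main()
    {
        // Lines are yielded one at a time, so the whole file is never in memory.
        foreach (string line in File.ReadLines("numbers.txt"))
        {
            int number = int.Parse(line);   // placeholder processing
            Console.WriteLine(number * 2);
        }
    }
}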
Reading all rows from a file is always at least O(n). When file size starts becoming an issue, it's probably a good time to look at creating a database for the information instead of flat files.
Not sure this is the most efficient, but it works well for me:
http://msdn.microsoft.com/en-us/library/system.io.fileinfo.aspx
// Declare a new file and give it the path to your file
FileInfo fi1 = new FileInfo(path);
// Open the file and read the text
using (StreamReader sr = fi1.OpenText())
{
    string s = "";
    // Loop through each line
    while ((s = sr.ReadLine()) != null)
    {
        // Here is where you handle your row in the file
        Console.WriteLine(s);
    }
}
No matter which operating system you're using, there will be several layers between your code and the actual storage mechanism. Hard drives and tape drives store files in blocks, which these days are usually around 4K each. If you want to read one byte, the device will still read the entire block into memory -- it's just faster that way. The device and the OS also may each keep a cache of blocks. So there's not much you can do to change the standard (highly optimized) file reading behavior; just read the file as you need it and let the system take care of the rest.
If the time to process the file is becoming a problem, two options that might help are:
Try to arrange to use shorter files. It sounds like you're processing log files or something -- running your program more frequently might help to at least give the appearance of better performance.
Change the way the data is stored. Again, I understand that the file comes from some external source, but perhaps you can arrange for a job to run that periodically converts the raw file to something that you can read more quickly.
Good luck.

Best way of reading & processing a text file

Was wondering if anyone had any favourite methods/useful libraries for processing a tab-delimited text file? This file is going to have on average 30,000-50,000 rows in it. I just need to read through each row and throw it into a database. However, I'd need to temporarily store all the data, the reason being that if the table holding the data gets to more than 1,000,000 rows, I'll need to create a new table and put the data in there. The code will be run in a Windows service, so I'm not worried about processing time.
Was thinking about just doing a standard while(sr.ReadLine()) ... any suggestions?
Cheers,
Sean.
filehelpers
This library is very flexible and fast. I never get tired of recommending it. It defaults to ',' as the delimiter, but you can change it to '\t' easily.
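A sketch of what that might look like with FileHelpers (the record fields are assumptions about the file's columns, and the class names are made up; check the library's documentation for your version):

using FileHelpers;

// Assumed two-column layout; adjust the fields to match the real file.
[DelimitedRecord("\t")]
public class ImportRow
{
    public string Name;
    public int Value;
}

class Importer
{
    static void Main()
    {
        var engine = new FileHelperEngine<ImportRow>();
        ImportRow[] rows = engine.ReadFile("data.txt");
        // 30,000-50,000 rows fit comfortably in memory, so the row count can
        // drive the "new table at 1,000,000 rows" decision before inserting.
    }
}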
I suspect "throwing it into a database" will take at least 1 order of magnitude longer than reading a line into a buffer, so you could pre-scan the data just to count the number of rows (without parsing them). Then make your database decisions. Then re-read the data doing the real work. With luck, the OS will have cached the file so it reads even quicker.
