I have a large text file and want to search through it and output certain strings. For example, let's say two of its lines are:
oNetwork.MapNetworkDrive "Q:", xyz & "\one\two\three\four"
oNetwork.MapNetworkDrive "G:", zzz
From this I'd like to copy and output the Q:, G:, and the "\one\two\three\four" to another file.
What's the most efficient way of doing this?
There is ultimately only one way to read a text file. You're going to have to go line-by-line and parse the entire file to pick out the pieces you care about.
Your best bet is to read the file using a StreamReader (File.OpenText is a good way to get one). From there, just keep calling ReadLine and picking out the bits you care about.
The main way to increase efficiency is to make sure you only have to parse the file once. Save everything you care about, and only what you care about. As much as you can, act on the information in the file right away then throw it away - the less you have to store, the better. Do not use File.ReadAllText since it will read the entirety of the file into memory all at once.
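As a minimal sketch of the single-pass approach described above: the class name DriveMapScanner and the regular expression are assumptions made for illustration, based only on the two sample lines in the question. Each line is read once, the drive letter and any quoted path beginning with a backslash are written out, and nothing else is kept in memory.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DriveMapScanner
{
    // Captures the quoted drive letter (e.g. "Q:") and, if present, a later
    // quoted string that starts with a backslash (e.g. "\one\two\three\four").
    private static readonly Regex MapLine = new Regex(
        "MapNetworkDrive\\s+\"(?<drive>[A-Za-z]:)\"(?:.*?\"(?<path>\\\\[^\"]*)\")?",
        RegexOptions.Compiled);

    // Reads the input line by line and writes each captured piece on its own line.
    public static void Extract(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            Match m = MapLine.Match(line);
            if (!m.Success) continue;            // not a line we care about

            output.WriteLine(m.Groups["drive"].Value);
            if (m.Groups["path"].Success)
                output.WriteLine(m.Groups["path"].Value);
        }
    }

    // Convenience wrapper for files; File.OpenText returns a StreamReader.
    public static void ExtractFile(string inputPath, string outputPath)
    {
        using (StreamReader reader = File.OpenText(inputPath))
        using (StreamWriter writer = File.CreateText(outputPath))
        {
            Extract(reader, writer);
        }
    }
}
```

Calling DriveMapScanner.ExtractFile with an input and an output path processes the whole file in one pass. The pattern would need adjusting if the real file has other line shapes.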
Related
I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then, the deserialized part should be removed from the file, making the next object the "first" one.
From the standpoint of the file system, that would just mean copying the file header several bytes further on the disk, and then moving the "beginning of the file" pointer. The question is how to implement this in C#. Is it possible at all?
Easiest that I can see
1) stream out (like a log, dump it into file),
(note: you'd need some delimiters and a 'consistent format' of your 'file' - based on what your data is)
2) and later stream in (just read file from start, in one go, and process w/o removing anything)
and that'd work fine, FIFO (first in first out).
So, my suggestion: don't try to optimize that by removing, skipping, etc. Rather, regroup and use more files.
3) If you worry about the scale of things - then just 'partition' that into small enough files, e.g. each 100 or 1,000 records (depends, do some calculations).
You may need to make some sort of 'virtualizer' here, which maps the files and keeps track of your 'database' when it's spread over multiple files. The simplest is to just use the file system and check file times etc. Or add some basic code to improve that.
However, I think you may have problems if you have to ensure 'transactions', i.e. what if things fail and you need to keep track of where the file left off, retrace, etc.
That might be an issue, but you know best if it's really necessary to have that (how critical). You can always work 'per file' and per smaller files. If it fails, rollback and do the file again (or log problems). If it succeeds you can delete file (after success) and go on like that.
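A rough sketch of the "partition into small files" idea might look like the following. Everything here is hypothetical: the class SegmentedFileQueue, the .seg naming scheme, and the choice of one record per line (a binary object would first have to be encoded as a single line, for example as Base64). A segment is deleted only after every record in it has been handled, so a failure leaves that file in place to be retried.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: a FIFO queue spread over numbered segment files in one directory.
public sealed class SegmentedFileQueue
{
    private readonly string _dir;
    private readonly int _recordsPerSegment;
    private long _writeSegment;      // number of the segment currently appended to
    private int _recordsInSegment;   // records written to that segment so far

    public SegmentedFileQueue(string dir, int recordsPerSegment = 1000)
    {
        _dir = dir;
        _recordsPerSegment = recordsPerSegment;
        Directory.CreateDirectory(dir);
        // Start a fresh segment after the highest one already on disk.
        _writeSegment = SegmentNumbers().DefaultIfEmpty(-1L).Max() + 1;
    }

    private string PathFor(long n) => Path.Combine(_dir, n.ToString("D10") + ".seg");

    // Segment numbers present on disk, oldest first.
    private long[] SegmentNumbers() =>
        Directory.GetFiles(_dir, "*.seg")
                 .Select(p => long.Parse(Path.GetFileNameWithoutExtension(p)))
                 .OrderBy(n => n)
                 .ToArray();

    // Push: append one single-line record, rotating to a new file when the segment is full.
    public void Push(string record)
    {
        if (_recordsInSegment >= _recordsPerSegment)
        {
            _writeSegment++;
            _recordsInSegment = 0;
        }
        File.AppendAllText(PathFor(_writeSegment), record + "\n");
        _recordsInSegment++;
    }

    // Reads the oldest segment from start to end, then deletes it.
    // If handler throws, the file is left untouched. Returns false when the queue is empty.
    public bool ProcessOldestSegment(Action<string> handler)
    {
        long[] segments = SegmentNumbers();
        if (segments.Length == 0) return false;

        string path = PathFor(segments[0]);
        foreach (string record in File.ReadLines(path))
            handler(record);

        File.Delete(path);   // reached only when every record was handled
        if (segments[0] == _writeSegment)
        {
            // The segment being written was just consumed; continue in a new one.
            _writeSegment++;
            _recordsInSegment = 0;
        }
        return true;
    }
}
```

Nothing is ever removed from the front of a file here: records are only appended, and whole files are dropped once consumed.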
This is very 'hand made' approach but should get you going with a simple and not too demanding solution (like you're describing). Or something along those lines.
I should probably add...
You could also save yourself some trouble and use some portable database for that, or something similar. This was purely based on the idea of hand-coding the simplest solution (and we could probably come up with something smarter, but it's late and this is what I have :).
Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem. But a linear file is totally inappropriate for representing a FIFO queue.
Suppose I have the following XML File:
<book>
<name>sometext</name>
<name>sometext</name>
<name>sometext</name>
<name>Dometext</name>
<name>sometext</name>
</book>
If I wanted to modify the content by changing D to s (As shown in the fourth "name" node) without having to read/write the entire file, would this be possible?
A 10 MB file is not a problem. Slurp it up. Modify the DOM. Write it back to the filesystem. 10 GB is more of a problem. In that case:
Assumption: You are not changing the length of the file. Think of the file as an array of characters and not a (linked) list of characters: You cannot add characters in the middle, only change them.
You need to seek the position in the file to change and then write that character to disk.
In the .NET world, with a FileStream object, you want to set the Position property to the index of the D character and then write a single s character. Check out this question on random access of text files.
Also read this question: How to insert characters to a file using C#. It looks like you can't really insert through the FileStream object; you can only overwrite, which means resorting to writing individual bytes.
Good luck. But really, if we are only talking 10 MB, then just slurp it up. The computer should be doing your work.
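For the large-file case, a minimal sketch of the seek-and-overwrite idea could look like this. The InPlacePatcher class is hypothetical, and it assumes the search text and its replacement are plain ASCII and exactly the same length, so the file size never changes.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper, not part of any library.
public static class InPlacePatcher
{
    // Overwrites the first occurrence of `find` with `replacement` directly in the file.
    // Returns the byte offset that was patched, or -1 if `find` does not occur.
    public static long OverwriteFirst(string path, string find, string replacement)
    {
        byte[] needle = Encoding.ASCII.GetBytes(find);
        byte[] patch = Encoding.ASCII.GetBytes(replacement);
        if (needle.Length == 0 || needle.Length != patch.Length)
            throw new ArgumentException("find and replacement must be non-empty and equally long");

        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            int matched = 0;   // how many bytes of the needle match so far
            int b;
            while ((b = fs.ReadByte()) != -1)
            {
                if (b == needle[matched])
                {
                    matched++;
                    if (matched == needle.Length)
                    {
                        long offset = fs.Position - needle.Length;
                        fs.Position = offset;               // seek back to the match
                        fs.Write(patch, 0, patch.Length);   // overwrite, same length
                        return offset;
                    }
                }
                else if (matched > 0)
                {
                    // Rewind to one byte after the start of the failed partial match.
                    fs.Position -= matched;
                    matched = 0;
                }
            }
        }
        return -1;
    }
}
```

With the sample file, OverwriteFirst(path, "Dometext", "sometext") changes the fourth name in place. Only the patched bytes are written; the rest of the file is read but never rewritten.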
I would just read in the file, process, and spit it back out.
This can be done in a streaming fashion with XmlReader -- it's more manual work than XmlDocument or XDocument, but it does avoid creating an in-memory DOM (XmlDocument/XDocument can be used with this same read/write pattern, but generally require the full reconstruction in-memory):
Open file input file stream (XmlReader)
Open output file stream (XmlWriter, to a different file)
Read from XmlReader and write to XmlWriter performing any transformations as necessary.
Close streams
Move new file to old file (overwrite, an atomic action)
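The steps above can be sketched roughly as follows. This is an illustration under assumptions rather than a complete copier: the StreamingXmlRewrite class and the ".tmp" suffix are made up, only the common node types are handled, and File.Replace is used for the final swap.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
public static class StreamingXmlRewrite
{
    // Copies the XML at `path` node by node, passing each text node through
    // `transformText`, then swaps the rewritten copy in for the original.
    public static void Rewrite(string path, Func<string, string> transformText)
    {
        string tempPath = path + ".tmp";   // assumed to be on the same volume

        using (XmlReader reader = XmlReader.Create(path))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();   // <x/> has no EndElement node
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(transformText(reader.Value));
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    // Other node types (e.g. the XML declaration and processing
                    // instructions) are not copied in this sketch.
                }
            }
        }

        // Both streams are closed here; replace the old file with the new one.
        File.Replace(tempPath, path, null);
    }
}
```

A call such as StreamingXmlRewrite.Rewrite(path, t => t == "Dometext" ? "sometext" : t) would apply the change from the question. Memory use stays roughly constant regardless of file size, because no DOM is built.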
While this can be set up to process input and output on the same open file with a bunch of really clever work, nothing will be saved and there are many edge cases, including increasing or decreasing file lengths. In fact, it might be slower to try to simply shift the contents of a file backwards to fill in gaps or shift the file contents forward to make new room. The filesystem cache will likely make any "gains" minimal/moot for anything but the most basic length-preserving operation. In addition, modifying a file in place is not an atomic action and is generally non-recoverable in case of an error: at the expense of a temporary file, the read/write/move approach is atomic with respect to the final file contents.
Or, consider XSLT -- it was designed for this ;-)
Happy coding.
The cleanest (and best) way would be to use the XmlDocument object to manipulate, but a quick and dirty solution is to just read the XML to a string and then:
xmlText = xmlText.Replace("Dometext", "sometext");
An XML file is a text file and does not allow for insertion/deletions. The only mutations supported are OverWrite and Append. Not a good match for XML.
So, first make very sure you really need this. It's a complicated operation, only worth it on very large files.
Since there could be a change in length you will at least have to move everything after the first replacement. The possibility of multiple replacements means you may need a big buffer to accommodate the changes.
It's easier to copy the whole file. That is expensive in I/O but you save on memory use.
I'm busy with a little project which has a lot of data, like images, text files and other things, and I'm trying to pack it all up in one big file or multiple big files so the program folder doesn't look messy.
But the problem is how can I edit these files. I've thought about the file structure and it's going to be something like this:
[DWORD] Number of files
[DWORD]FileId
[STRING]FileName
[DWORD]FileSize
[DWORD]FileIndex
[BYTES]All the files
So the first part is there to quickly get a list of all the files, and the FileIndex is the position in the binary file, so I can set the pointer to, for example, 300 and read the file.
But if I want to create a patch and edit it, I would have to read all the bytes after the file I'm editing and copy them all back, which could take ages with a couple of files.
The binary file could be a few hundred MB when all the files are inserted.
So how do other programs do this? For example, games use these big files and also patch them a lot. Is there some kind of trick to insert extra bytes more quickly?
There is no "trick" to inserting bytes in the middle of a file.
Usually solutions involve adding files to the end of the file, then switching their position in the index. Then you run into the problem of having to defragment the file. You can break files into large chunks which can mitigate some of the defragmentation woes, but then the files are not contiguous.
If you are dealing with non-static data, I would not recommend doing this unless you absolutely have to. I've seen absolutely brilliant software engineers take a considerable amount of time to write a reasonable implementation of this.
Using sqlite as a virtual file system can be a viable solution to this. But then again, so is putting the data files in another folder so it doesn't look "messy".
If at all possible, I'd probably package the data up into a zip file. This will not only clean up your directory, but (especially for the text files you mention) throw in some compression essentially for free. There are also, of course, quite a few existing tools and libraries for creating, examining, modifying, etc., a zip file.
Using zlib (for one example), most of the work is handled for you (e.g., as demonstrated in minizip).
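As one possible illustration in C#, swapping in .NET's System.IO.Compression for zlib/minizip: the ZipPackage class below is hypothetical, but ZipFile, ZipArchive and ZipArchiveEntry are the framework's own types. Replacing an entry is done by deleting it and creating it again. Note that the library may rewrite the archive when it is closed in update mode, so this is convenient rather than free.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical helper around the framework's zip classes.
public static class ZipPackage
{
    // Adds the entry, or replaces it if an entry with that name already exists.
    // The archive file is created when it does not exist yet.
    public static void Put(string zipPath, string entryName, byte[] content)
    {
        using (ZipArchive zip = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            ZipArchiveEntry existing = zip.GetEntry(entryName);
            if (existing != null) existing.Delete();

            ZipArchiveEntry entry = zip.CreateEntry(entryName);
            using (Stream s = entry.Open())
                s.Write(content, 0, content.Length);
        }
    }

    // Returns the entry's bytes, or null if the archive has no such entry.
    public static byte[] Get(string zipPath, string entryName)
    {
        using (ZipArchive zip = ZipFile.OpenRead(zipPath))
        {
            ZipArchiveEntry entry = zip.GetEntry(entryName);
            if (entry == null) return null;

            using (Stream s = entry.Open())
            using (var buffer = new MemoryStream())
            {
                s.CopyTo(buffer);
                return buffer.ToArray();
            }
        }
    }
}
```

A patch then becomes a series of Put calls for the changed files, with the zip library keeping track of offsets and sizes.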
The trick is to make patches by overwriting the data. Otherwise, there are systems available to manage large volumes of data, for example databases.
You can create a database file that will accompany your program, and hold all your data there, and not in files. You can even embed the database code in your application, with SQLite, for example, or use external DB's like Sql Server, Oracle SQL, or MySql.
What you're describing is basically implementing your own file system. It's a tricky and very difficult task to make that effective.
You could treat the packing and editing program sort of like a custom memory allocator:
Use a minimum block size - When you add a file, use enough whole blocks to fit the file. This automatically gives the files some room to grow without affecting the others.
When a file gets too big for its current allocation, move it to the end of the package.
Mark the free blocks as free, and keep the offset to the head of the free list in the package header. When adding other files, first check to see if there is a free block big enough for them.
When extending files past their current block, check to see if the following block is on the free list.
If the free list gets too long (too much fragmentation), consolidate the package. Move each file forward to start in the first free block. This will have to re-write the whole file, but it would happen rarely.
Alternately, instead of the simple directory you have, use something like a FAT. For each file, store a list of chunks and sizes. When you extend a file past its current allocation, add another chunk with the remainder. Defragment occasionally as needed.
Both of these would add a little overhead to the package, but leaving gaps is really the only alternative to rewriting the whole thing on every insert.
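To make the block bookkeeping concrete, here is a small in-memory model of the first scheme. It is only a sketch: the BlockAllocator class is hypothetical, it tracks block ranges and does no file I/O, it uses a first-fit search, and it does not merge adjacent free runs. A real package would persist the free list in its header as described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the package's block bookkeeping; no file I/O is done here.
public sealed class BlockAllocator
{
    private readonly int _blockSize;
    // Free runs: start block -> length in blocks, kept sorted by position.
    private readonly SortedDictionary<long, long> _free = new SortedDictionary<long, long>();
    private long _endBlock;   // first block past the end of the package

    public BlockAllocator(int blockSize) { _blockSize = blockSize; }

    // Number of whole blocks needed to hold byteCount bytes (rounded up).
    public long BlocksFor(long byteCount) => (byteCount + _blockSize - 1) / _blockSize;

    // Byte position in the package file where a block starts.
    public long ByteOffset(long block) => block * _blockSize;

    // First fit: reuse a free run that is big enough, otherwise append at the end.
    public long Allocate(long blocks)
    {
        long found = -1, foundLength = 0;
        foreach (KeyValuePair<long, long> run in _free)
        {
            if (run.Value >= blocks) { found = run.Key; foundLength = run.Value; break; }
        }
        if (found >= 0)
        {
            _free.Remove(found);
            if (foundLength > blocks)
                _free[found + blocks] = foundLength - blocks;   // keep the unused tail free
            return found;
        }
        long start = _endBlock;
        _endBlock += blocks;
        return start;
    }

    // Marks a run as free (adjacent runs are not merged in this sketch).
    public void Free(long start, long blocks) => _free[start] = blocks;

    // Tries to grow an allocation in place, using the free run right after it
    // or the end of the package. Returns false when the file has to be moved.
    public bool TryExtend(long start, long currentBlocks, long extraBlocks)
    {
        long next = start + currentBlocks;
        if (next == _endBlock) { _endBlock += extraBlocks; return true; }

        long length;
        if (_free.TryGetValue(next, out length) && length >= extraBlocks)
        {
            _free.Remove(next);
            if (length > extraBlocks)
                _free[next + extraBlocks] = length - extraBlocks;
            return true;
        }
        return false;
    }
}
```

When TryExtend returns false, the caller would allocate a new, larger run, copy the file there, and free the old run, which is the "move it to the end of the package" step.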
There is no way to insert bytes into a file other than the one you described. This is independent of the programming language. It's just how file systems work...
You can overwrite parts of the file, but only as long as you respect the byte count.
Have you thought about using a .zip file? I keep seeing formats out there where multiple files are stored as one, and the underlying file is really a zip file. The nice thing about this is that the zip library handles the low-level bit-tracking stuff for you.
A couple examples that come to mind:
A Word .docx file is really a zip (rename one to .zip, and you can open it -- it has whole folders in it)
The .xap file that Silverlight packages use is another one.
You can use managed shared memory backed by a memory-mapped file. You still have to have sufficient address space for the whole file, but you don't need to copy the whole file into memory. You can use most standard facilities with a shared memory allocator, though you can quickly find that specifying a custom allocator everywhere is a chore. But the good news is that you don't need to implement it all yourself: you can take Boost.Interprocess, which already has all the necessary facilities for both Unix and Windows.
I'm working in C#/.NET and I'm parsing a file to check if one line matches a particular regex. Actually, I want to find the last line that matches.
To get the lines of my file, I'm currently using the System.IO.StreamReader.ReadLine() method, but as my files are very large, I would like to optimize the code a bit and start from the end of the file.
Does anyone know whether C#/.NET has a function similar to ReadLine() that starts from the end of the stream? And if not, what would be, in your opinion, the easiest and most optimized way to do the job described above?
Funny you should mention it - yes I have. I wrote a ReverseLineReader a while ago, and put it in MiscUtil.
It was in answer to this question on Stack Overflow - the answer contains the code, although it uses other bits of MiscUtil too.
It will only cope with some encodings, but hopefully all the ones you need. Note that this will be less efficient than reading from the start of the file, if you ever have to read the whole file - all kinds of things may assume a forward motion through the file, so they're optimised for that. But if you're actually just reading lines near the end of the file, this could be a big win :)
(Not sure whether this should have just been a close vote or not...)
Since you are using a regular expression I think your best option is going to be to read the entire line into memory and then attempt to match it.
Perhaps if you provide us with the regular expression and a sample of the file contents we could find a better way to solve your problem.
"Easiest" -vs- "Most optimized"... I don't think you're going to get both
You could open the file and read each line. Each time you find one that fits your criteria, store it in a variable (replacing any earlier instance). When you finish, you will have the last line that matches.
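A minimal sketch of this "remember the latest match" loop, with the hypothetical helper name LastMatchFinder:

```csharp
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LastMatchFinder
{
    // Reads every line once and returns the last one matching the pattern,
    // or null if no line matches.
    public static string FindLastMatch(TextReader reader, Regex pattern)
    {
        string last = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (pattern.IsMatch(line))
                last = line;   // replaces any earlier match
        }
        return last;
    }
}
```

Passing File.OpenText(path) as the reader keeps only one line plus the current best match in memory, at the cost of always reading the whole file.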
You could also use a FileStream to set the position near the end of your file. Go through the steps above, and if no match is found, set your FileStream position earlier in your file, until you DO find a match.
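A variation on that idea is to read fixed-size blocks from the end of the file and hand back lines in reverse order, stopping at the first match. The sketch below is hypothetical (BackwardScan is not a library class) and assumes UTF-8 or ASCII text without a byte-order mark, using "\n" or "\r\n" line endings. It relies on the fact that the newline byte never appears inside a multi-byte UTF-8 character.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper; assumes UTF-8/ASCII text without a BOM.
public static class BackwardScan
{
    // Yields the file's lines from last to first, reading blockSize bytes at a time.
    public static IEnumerable<string> ReadLinesBackward(string path, int blockSize = 4096)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var reversed = new List<byte>();   // bytes of the current line, back to front
            var buffer = new byte[blockSize];
            long pos = fs.Length;
            bool atFileEnd = true;             // true until the last byte has been examined

            while (pos > 0)
            {
                int count = (int)Math.Min(blockSize, pos);
                pos -= count;
                fs.Position = pos;
                int read = 0;
                while (read < count)
                {
                    int n = fs.Read(buffer, read, count - read);
                    if (n == 0) throw new EndOfStreamException();
                    read += n;
                }

                for (int i = count - 1; i >= 0; i--)
                {
                    if (buffer[i] == (byte)'\n')
                    {
                        // A newline as the very last byte ends the final line;
                        // it does not start an extra empty one.
                        if (!atFileEnd) yield return Decode(reversed);
                        reversed.Clear();
                    }
                    else
                    {
                        reversed.Add(buffer[i]);
                    }
                    atFileEnd = false;
                }
            }
            if (fs.Length > 0) yield return Decode(reversed);   // first line of the file
        }
    }

    private static string Decode(List<byte> reversed)
    {
        byte[] bytes = reversed.ToArray();
        Array.Reverse(bytes);
        int length = bytes.Length;
        if (length > 0 && bytes[length - 1] == (byte)'\r') length--;   // drop CR of CRLF
        return Encoding.UTF8.GetString(bytes, 0, length);
    }

    // Returns the last matching line, reading only as far back as necessary.
    public static string FindLastMatch(string path, Regex pattern)
    {
        foreach (string line in ReadLinesBackward(path))
            if (pattern.IsMatch(line)) return line;
        return null;
    }
}
```

If the match is near the end, only the final block or two is read. If nothing matches, the whole file is still read, and backward reading is likely slower than a plain forward pass.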
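This ought to do what you're looking for. It might be memory-heavy for what you need, but I don't know what your needs are in that area:
// Requires: using System.IO; using System.Linq;
string[] lines = File.ReadAllLines("C:\\somefilehere.txt");
// Enumerable.Reverse leaves the array untouched and yields the lines last to first.
IEnumerable<string> revLines = Enumerable.Reverse(lines);
foreach (string line in revLines) {
    /* test the line against the regex; break on the first match */
}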
It would still require reading every line at the outset, but it might be faster than doing a check on each one as you do so.
I know it might seem ridiculous that you would purposely want to corrupt a file, but I assure you it's for a good reason.
In my app, I have a lot of xml serialization going on. This in turn also means, I have a lot of deserialization.
Today I tried some disaster scenarios. I reset the server during a serialization operation, as expected it corrupted the xml file.
The problem is that trying to "shut down" the server at exactly the right time to corrupt the file is not really optimal. Firstly, it's a matter of luck to catch the operation during its .0001 ms write time, and secondly, the server then needs to reboot. Also, it's just a bad idea, period, to be pulling the plug on the server, for other reasons.
Is there an app that can effectively corrupt a file, so that this file can be used for testing in my app?
Open it up in a hex editor and have fun twiddling bits?
This is kind of the approach behind Fuzz Testing, i.e. introduce random variations and see how your application copes. You might look at some of the fuzz testing frameworks mentioned in the cited link. But in your case, it would be just as easy to use a random generator and insert bits in those positions to corrupt it. If you have a known case, then you can just use an existing corrupt file, of course.
Are you attempting to test for a partially degraded file?
If you want to test how your program reacts to bad data, why not just use any random text file as input?
There are several ways of corrupting an XML file. Two that come to mind:
- Incomplete XML tags (truncated XML).
- Unexpected content in the data (binary data or extra text).
For the first, I would copy a "correct/complete" XML file and would modify it by hand. For the second one I would concatenate a partial XML file with any binary file on the filesystem.
A hex editor seems a little too much to me ;)
I would highly recommend you don't do 'random byte' corruption for testing. Not only do you not know exactly what state you're testing, but if you do find a bug you'll be hard pressed to guarantee that the next test run will verify the fix.
My recommendation is to corrupt the file in a predictable way, either manually or programmatically, so that you know what you're testing and how to reproduce the test if you must. (Of course, you'll probably want multiple predictable ways to ensure protection against corruption anywhere in the file.)
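One way to make the corruption repeatable is to overwrite specific, chosen offsets on a copy of a known-good file. The FileCorruptor class below is a hypothetical sketch. The offsets and byte values are whatever a given test case decides on, so the same damaged file can be regenerated at will.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical test helper, not part of any library.
public static class FileCorruptor
{
    // Copies source to dest, then overwrites the byte at each given offset.
    // The source file is never modified.
    public static void CorruptCopy(string source, string dest, IDictionary<long, byte> patches)
    {
        File.Copy(source, dest, true);
        using (var fs = new FileStream(dest, FileMode.Open, FileAccess.Write))
        {
            foreach (KeyValuePair<long, byte> patch in patches)
            {
                if (patch.Key < 0 || patch.Key >= fs.Length)
                    throw new ArgumentOutOfRangeException(nameof(patches), "offset outside the file");

                fs.Position = patch.Key;     // seek to the chosen offset
                fs.WriteByte(patch.Value);   // overwrite exactly one byte
            }
        }
    }
}
```

For example, a patch list that writes 0x00 into the middle of a tag name gives one named, reproducible test case, and a second list can target an attribute value or the closing tag.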
Agree with the Hex editor option, as this will allow you to introduce non-text values into the file, such as nulls (0x00), etc.
If you're trying to simulate an interrupted write, you might want to just truncate the string representing the serialized data. This would be especially easy if you're using unit tests, but still quite feasible with Notepad.
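A sketch of that truncation approach, done on files rather than strings: the InterruptedWrite class is hypothetical, and the cut-off point is chosen by the test. The well-formedness check simply reports whether XmlDocument can load the result.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical test helper, not part of any library.
public static class InterruptedWrite
{
    // Writes to dest a copy of source cut off after keepBytes bytes,
    // imitating a write that stopped part-way through.
    public static void TruncateCopy(string source, string dest, long keepBytes)
    {
        File.Copy(source, dest, true);
        using (var fs = new FileStream(dest, FileMode.Open, FileAccess.Write))
        {
            fs.SetLength(Math.Min(keepBytes, fs.Length));
        }
    }

    // True if the file parses as XML, false if the parser rejects it.
    public static bool IsWellFormed(string path)
    {
        try
        {
            var doc = new XmlDocument();
            doc.Load(path);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Running TruncateCopy with a handful of different keepBytes values (inside a tag, inside text, just before the closing root tag) gives a small, reproducible set of "interrupted" files to feed to the deserialization code.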
Of course, that's just one kind of bad data, but it's worth noting that XML that's malformed in any way is essentially no longer XML, and most parsers will reject it out-of-hand at the first sign of a syntax error.