I'm working in C#/.NET and I'm parsing a file to check if one line matches a particular regex. Actually, I want to find the last line that matches.
To get the lines of my file, I'm currently using the System.IO.StreamReader.ReadLine() method, but as my files are very large, I would like to optimize the code a bit and start from the end of the file.
Does anyone know if C#/.NET has a function similar to ReadLine() that starts from the end of the stream? And if not, what would you consider the easiest and most efficient way to do the job described above?
Funny you should mention it - yes I have. I wrote a ReverseLineReader a while ago, and put it in MiscUtil.
It was in answer to this question on Stack Overflow - the answer contains the code, although it uses other bits of MiscUtil too.
It will only cope with some encodings, but hopefully all the ones you need. Note that this will be less efficient than reading from the start of the file, if you ever have to read the whole file - all kinds of things may assume a forward motion through the file, so they're optimised for that. But if you're actually just reading lines near the end of the file, this could be a big win :)
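To sketch the basic idea behind such a reader (this is a minimal illustration, not the MiscUtil implementation; it assumes a single-byte encoding like ASCII, which is exactly the encoding handling the real ReverseLineReader does properly):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ReverseReader
{
    // Yields lines from the end of the file back to the start.
    // Simplified sketch: reads one byte at a time and assumes a
    // single-byte encoding; a real implementation would buffer
    // and handle multi-byte encodings.
    public static IEnumerable<string> LinesFromEnd(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new List<byte>();
            for (long pos = fs.Length - 1; pos >= 0; pos--)
            {
                fs.Seek(pos, SeekOrigin.Begin);
                int b = fs.ReadByte();
                if (b == '\n')
                {
                    yield return Decode(buffer); // line boundary found
                    buffer.Clear();
                }
                else if (b != '\r')
                {
                    buffer.Add((byte)b); // bytes arrive in reverse order
                }
            }
            yield return Decode(buffer); // first line of the file
        }
    }

    static string Decode(List<byte> reversed)
    {
        var bytes = reversed.ToArray();
        Array.Reverse(bytes); // restore original byte order
        return Encoding.ASCII.GetString(bytes);
    }
}
```

With this, you can enumerate lines starting from the last one and stop as soon as the regex matches, without touching the rest of the file.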
(Not sure whether this should have just been a close vote or not...)
Since you are using a regular expression I think your best option is going to be to read the entire line into memory and then attempt to match it.
Perhaps if you provide us with the regular expression and a sample of the file contents we could find a better way to solve your problem.
"Easiest" -vs- "Most optimized"... I don't think you're going to get both
You could open the file and read each line. Each time you find one that fits your criteria, store it in a variable (replacing any earlier instance). When you finish, you will have the last line that matches.
You could also use a FileStream to set the position near the end of your file. Go through the steps above, and if no match is found, set your FileStream position earlier in your file, until you DO find a match.
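The first approach (scan forward once, keep the most recent match) can be sketched like this; the path and pattern are placeholders for your own:

```csharp
using System.IO;
using System.Text.RegularExpressions;

static class LastMatchFinder
{
    // Scans the file once, line by line, and keeps the most recent
    // line matching the pattern. Returns null if nothing matched.
    public static string FindLastMatch(string path, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.Compiled);
        string lastMatch = null;
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (regex.IsMatch(line))
                    lastMatch = line; // overwrite any earlier match
            }
        }
        return lastMatch;
    }
}
```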
This ought to do what you're looking for. It might be memory-heavy for what you need, but I don't know what your constraints are in that area:
string[] lines = File.ReadAllLines("C:\\somefilehere.txt");
IEnumerable<string> revLines = lines.Reverse(); // requires "using System.Linq;"
foreach (string line in revLines)
{
    /* do whatever */
}
It would still require reading every line at the outset, but it might be faster than doing a check on each one as you do so.
Related
I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then, the deserialized part should be removed from the file, making the next object to be "first".
From the standpoint of file system, that would just mean copying file header several bytes further on the disk, and then moving the "beginning of the file" pointer. The question is how to implement this in C#? Is it at all possible?
Easiest that I can see:
1) Stream out (like a log: dump records into the file). Note: you'd need some delimiters and a 'consistent format' for your 'file', based on what your data is.
2) Later, stream in (just read the file from the start, in one go, and process without removing anything).
That'd work fine as FIFO (first in, first out). So, my suggestion: don't try to optimize by removing, skipping, etc.; rather, regroup and use more files.
3) If you worry about the scale of things, just 'partition' the queue into small enough files, e.g. one per 100 or 1,000 records (it depends; do some calculations). You may need to make some sort of 'virtualizer' here, which maps the files and keeps track of your 'database' as it spans multiple files. The simplest option is to just use the file system and check file times, etc., or add some basic code to improve on that.
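As a minimal sketch of that "use more files" idea, here's the extreme case of partitioning: one file per record, named by a zero-padded sequence number so that the file system's own ordering gives you FIFO. The names and layout here are illustrative assumptions, not a fixed design:

```csharp
using System;
using System.IO;
using System.Linq;

class FileQueue
{
    private readonly string _dir;

    public FileQueue(string dir)
    {
        _dir = dir;
        Directory.CreateDirectory(dir);
    }

    // Appending to the queue is just writing a new, higher-numbered file.
    public void Push(byte[] record)
    {
        long next = NextSequence();
        File.WriteAllBytes(Path.Combine(_dir, next.ToString("D10") + ".rec"), record);
    }

    // "Removing" the head of the queue is just deleting a small file;
    // no rewriting of a big file is ever needed.
    public byte[] Pop()
    {
        string first = Directory.GetFiles(_dir, "*.rec").OrderBy(f => f).FirstOrDefault();
        if (first == null) return null; // queue is empty
        byte[] data = File.ReadAllBytes(first);
        File.Delete(first);
        return data;
    }

    private long NextSequence()
    {
        string last = Directory.GetFiles(_dir, "*.rec").OrderBy(f => f).LastOrDefault();
        return last == null ? 0 : long.Parse(Path.GetFileNameWithoutExtension(last)) + 1;
    }
}
```

Grouping 100 or 1,000 records per file works the same way, with a small in-file offset tracked alongside.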
However, I think you may have problems if you have to ensure 'transactions' - i.e. what if things fail and you need to keep track of where the file left off, retrace, etc. That might be an issue, but you know best whether that's really necessary (how critical it is). You can always work per file, with smaller files: if an operation fails, roll back and do the file again (or log the problem); if it succeeds, delete the file and go on like that.
This is a very 'hand-made' approach, but it should get you going with a simple and not too demanding solution (like the one you're describing), or something along those lines.
I should probably add...
You could also save yourself some trouble and use some portable database for this, or something similar. This was purely based on the idea of hand-coding the simplest solution (and we could probably come up with something smarter, but it being late, this is what I have :).
Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem, because a single linear file is simply inappropriate for representing a FIFO queue.
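To make the cost concrete: "removing" the consumed record means copying everything after it into a new file and swapping the files. A sketch (headerLength is assumed to be the byte length of the record you just deserialized):

```csharp
using System.IO;

static class QueueFile
{
    // Rewrites the file without its first headerLength bytes.
    // This copies the entire remainder of the file - which is why
    // doing it on every Pop() of a large file is expensive.
    public static void RemoveHead(string path, long headerLength)
    {
        string temp = path + ".tmp";
        using (var input = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var output = new FileStream(temp, FileMode.Create, FileAccess.Write))
        {
            input.Seek(headerLength, SeekOrigin.Begin);
            input.CopyTo(output); // copy everything after the consumed record
        }
        File.Delete(path);
        File.Move(temp, path);
    }
}
```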
I have a large txt file and want to search through it and output certain strings, for example, let's say two lines are:
oNetwork.MapNetworkDrive "Q:", xyz & "\one\two\three\four"
oNetwork.MapNetworkDrive "G:", zzz
From this I'd like to copy and output the Q:, G:, and the "\one\two\three\four" to another file.
What's the most efficient way of doing this?
There is ultimately only one way to read a text file. You're going to have to go line-by-line and parse the entire file to pick out the pieces you care about.
Your best bet is to read the file using a StreamReader (File.OpenText is a good way to get one). From there, just keep calling ReadLine and picking out the bits you care about.
The main way to increase efficiency is to make sure you only have to parse the file once. Save everything you care about, and only what you care about. As much as you can, act on the information in the file right away then throw it away - the less you have to store, the better. Do not use File.ReadAllText since it will read the entirety of the file into memory all at once.
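A sketch of that single pass, pulling the quoted strings out of MapNetworkDrive lines and writing them straight to the output file as they're found. The quoting pattern here is a guess based on the two sample lines above; adjust it to your real data:

```csharp
using System.IO;
using System.Text.RegularExpressions;

static class DriveExtractor
{
    public static void ExtractDrives(string inputPath, string outputPath)
    {
        // Matches any double-quoted string on the line.
        var quoted = new Regex("\"([^\"]*)\"", RegexOptions.Compiled);
        using (var reader = File.OpenText(inputPath))
        using (var writer = File.CreateText(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!line.Contains("MapNetworkDrive")) continue;
                // Write each quoted value (e.g. Q: and \one\two\three\four)
                // immediately, so nothing accumulates in memory.
                foreach (Match m in quoted.Matches(line))
                    writer.WriteLine(m.Groups[1].Value);
            }
        }
    }
}
```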
Before you all start answering with variations of \d+: I'm extremely familiar with regex.
I want to know if there are alternatives to regex for parsing numbers out of a large text file.
I'm parsing through tons of huge files and need to do some group/location analysis on the positions of keywords. I'm now at the point where I need to start finding groups of numbers nested close to my content of interest as well. I want to avoid regex if at all possible because this needs to be a speedy process.
It is possible to inspect only chunks of a file for the numbers of interest. However, that would require more work and add hard-coded limits for searching (I'd like to avoid this).
I'm open to any suggestions.
UPDATE
Sorry for the lack of sample data. For HIPAA reasons I'd rather not even consider scrambling the text and posting it.
A great substitute would be the HTML source of any stackoverflow.com question page. Imagine I needed to grab the reputation (score) of all people that posted an answer to a question. This also means that the comma (,) is needed as well. I can't remove the html to simplify the content because I'm using some density analysis to weed out unrelated content. Removing the HTML would mix content too close together.
Unless the file is some sort of SGML, I don't know of any built-in method (which is not to say there isn't one; I just don't know of one).
However, it's not to say that you can't create your own parser; you could eliminate some of the overheads of the .Net regex library by writing something that only finds ranges of numbers.
Fundamentally, I guess that's all any library would do, at the most basic level.
Might help if you can post a sample of the sort of data you'll be processing?
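Lacking a sample, here's the kind of thing I mean: a hand-rolled scanner that finds runs of digits, optionally with embedded commas (for scores like "12,345"), and records their positions for your density analysis, without touching the regex engine. The comma handling is an assumption based on your reputation-score example:

```csharp
using System.Collections.Generic;

static class NumberScanner
{
    // Returns each number found in the text along with its start position.
    public static List<(int Position, string Number)> FindNumbers(string text)
    {
        var results = new List<(int, string)>();
        int i = 0;
        while (i < text.Length)
        {
            if (char.IsDigit(text[i]))
            {
                int start = i;
                // Consume digits; allow a comma only when another digit follows,
                // so "12,345" is one number but a trailing comma is not swallowed.
                while (i < text.Length && (char.IsDigit(text[i]) ||
                       (text[i] == ',' && i + 1 < text.Length && char.IsDigit(text[i + 1]))))
                    i++;
                results.Add((start, text.Substring(start, i - start)));
            }
            else
            {
                i++;
            }
        }
        return results;
    }
}
```

For a scanner this simple, the per-character loop avoids the regex engine's setup and backtracking machinery entirely, which is usually where the speed difference shows up on huge inputs.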
I have a large text file (~10 MB) that contains more or less every dictionary word in a specific language, one word per line.
I want to do a really fast lookup to see if a word exists in a file -
What is the fastest way to do this without looping through each line?
It is sorted, and I can do all the pre-processing I want.
I considered doing some sort of binary search, but I didn't know how I could do this, since my lines are not a fixed number of bytes (and thus I wouldn't know where to jump the stream to). And surprisingly, I couldn't find a tool to do the fixed-width conversion for me.
Any suggestions?
Thanks!
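Since you said pre-processing is fine: if you can afford one pass at startup, the variable-length-line problem disappears entirely. Load the sorted file into a string array once, and every lookup afterwards is a binary search over the array rather than a jump into the stream. A sketch (this assumes the file is sorted in ordinal order, matching the comparer used here):

```csharp
using System;
using System.IO;

class WordList
{
    private readonly string[] _words;

    public WordList(string path)
    {
        // ~10 MB of words fits comfortably in memory.
        _words = File.ReadAllLines(path);
    }

    public bool Contains(string word)
    {
        // Binary search: O(log n) comparisons, no scanning.
        return Array.BinarySearch(_words, word, StringComparer.Ordinal) >= 0;
    }
}
```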
I'd suggest building a Trie from the dictionary. That gives you very quick lookups to see whether a word is in there.
A trie is a good bet if you don't mind using some more storage: http://en.wikipedia.org/wiki/Trie
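A bare-bones trie can be sketched in a few lines: add each dictionary word once, and Contains then costs O(length of the word) regardless of dictionary size:

```csharp
using System.Collections.Generic;

class Trie
{
    private readonly Dictionary<char, Trie> _children = new Dictionary<char, Trie>();
    private bool _isWord; // marks that a word ends at this node

    public void Add(string word)
    {
        var node = this;
        foreach (char c in word)
        {
            if (!node._children.TryGetValue(c, out Trie child))
                node._children[c] = child = new Trie();
            node = child;
        }
        node._isWord = true;
    }

    public bool Contains(string word)
    {
        var node = this;
        foreach (char c in word)
        {
            if (!node._children.TryGetValue(c, out node))
                return false; // no path for this character
        }
        return node._isWord; // reached the end: is it a full word?
    }
}
```

The _isWord flag is what distinguishes a stored word ("cat") from a mere prefix of one ("ca").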
I know it might seem ridiculous that you would purposely want to corrupt a file, but I assure you its for a good reason.
In my app, I have a lot of xml serialization going on. This in turn also means, I have a lot of deserialization.
Today I tried some disaster scenarios. I reset the server during a serialization operation, as expected it corrupted the xml file.
The problem is that trying to "shut down" the server at exactly the right time to corrupt the file is not really practical: firstly, it's pure luck to catch the operation during its .0001 ms write window, and secondly, the server then needs to reboot. It's also just a bad idea, period, to be pulling the plug on the server, for other reasons too.
Is there an app that can effectively corrupt a file, so that this file can be used for testing in my app?
Open it up in a hex editor and have fun twiddling bits?
This is kind of the approach behind Fuzz Testing, i.e. introduce random variations and see how your application copes. You might look at some of the fuzz testing frameworks mentioned in the cited link. But in your case, it would be just as easy to use a random generator and insert bits in those positions to corrupt it. If you have a known case, then you can just use an existing corrupt file, of course.
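A minimal version of that random-variation idea is to flip a handful of random bits in a copy of a known-good file. Seeding the Random makes each corrupted file reproducible, which matters when you later need to re-run a failing test:

```csharp
using System;
using System.IO;

static class Fuzzer
{
    // Writes a copy of source to dest with bitsToFlip random bits inverted.
    // The same seed always produces the same corrupted output.
    public static void CorruptFile(string source, string dest, int bitsToFlip, int seed)
    {
        byte[] data = File.ReadAllBytes(source);
        var rng = new Random(seed);
        for (int i = 0; i < bitsToFlip; i++)
        {
            int byteIndex = rng.Next(data.Length);
            data[byteIndex] ^= (byte)(1 << rng.Next(8)); // flip one random bit
        }
        File.WriteAllBytes(dest, data);
    }
}
```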
Are you attempting to test for a partially degraded file?
If you want to test how your program reacts to bad data, why not just use any random text file as input?
There are several ways of corrupting an XML file. A couple that come to mind:
- Incomplete XML tags (truncated XML).
- Unexpected content in the data (binary data / extra text).
For the first, I would copy a "correct/complete" XML file and modify it by hand. For the second one, I would concatenate a partial XML file with any binary file on the filesystem.
Hex editor seems a little too-much for me ;)
I would highly recommend you don't do 'random byte' corruption for testing. Not only do you not know exactly what state you're testing, but if you do find a bug, you'll be hard pressed to guarantee that the next test run will verify the fix.
My recommendation is to manually (or programmatically) corrupt the file in a predictable way, so that you know what you're testing and how to reproduce the test if you must. (Of course, you'll probably want multiple predictable ways to corrupt it, to ensure protection against corruption anywhere in the file.)
Agree with the Hex editor option, as this will allow you to introduce non-text values into the file, such as nulls (0x00), etc.
If you're trying to simulate an interrupted write, you might want to just truncate the string representing the serialized data. This would be especially easy if you're using unit tests, but still quite feasible with Notepad.
Of course, that's just one kind of bad data, but it's worth noting that XML that's malformed in any way is essentially no longer XML, and most parsers will reject it out-of-hand at the first sign of a syntax error.
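A sketch of that truncation approach: serialize normally, then keep only a prefix of the bytes. Every truncation point gives you a distinct, perfectly reproducible "interrupted write" input for your deserializer:

```csharp
using System;
using System.IO;

static class CorruptionHelper
{
    // Writes a truncated copy of goodFile, keeping only the first
    // (fraction * length) bytes - simulating a write that was cut off.
    public static void WriteTruncated(string goodFile, string corruptFile, double fraction)
    {
        byte[] data = File.ReadAllBytes(goodFile);
        int keep = (int)(data.Length * fraction); // e.g. 0.5 keeps the first half
        byte[] truncated = new byte[keep];
        Array.Copy(data, truncated, keep);
        File.WriteAllBytes(corruptFile, truncated);
    }
}
```

In a unit test you can loop fraction over several values (0.1, 0.5, 0.9, ...) and assert that deserialization throws rather than silently returning garbage.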