Copy Large Text File into Arrays by Matching Regex


Occasionally I need to look through a roughly 25 MB Oracle Datapump SQLFILE (plain text) for a few key strings of text. I currently use some handy features in UltraEdit that make this not so bad. However, some of my other users do not have UltraEdit and aren't familiar enough with regular expressions to find the right values.
If I wanted to create two Collections and add only lines matching a certain RegEx to each, where should I start? Should I use the plain StreamReader and StreamReader.ReadLine() to move through the file? Or would the size of the file suggest a different option?
The end result would be to output the contents of the Collections to the screen or a new text file, but I'm not too worried about that detail yet.
Please be as general or specific as you can be; I'm happy to fill in what details I can for myself.

Starting with .NET Framework 4, you can use the File.ReadLines method, which returns a lazily evaluated IEnumerable<string>, so the whole file is never held in memory at once. (The Where call below needs System.Linq.)
var lines = File.ReadLines(path).Where(s => myRegex.IsMatch(s));

Should I use the plain StreamReader and StreamReader.ReadLine() to move through the file? Or would the size of the file suggest a different option?
That's the approach I would take. Using a stream does not load the entire file into memory and so seems perfect for large files.
For each line, you can test if it matches and copy that line to the corresponding list. Or, if you are concerned about too much data, copy each line to one of two output files (also using streams).
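A minimal sketch of that approach, in case it helps. The class name LineSorter, the method name Collect, and the two patterns passed in are hypothetical placeholders; substitute whatever expressions identify the lines of interest in the SQLFILE.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class LineSorter
{
    // Reads the file one line at a time and copies each line into the list
    // whose pattern it matches. A line matching both patterns lands in both lists.
    public static void Collect(string path, Regex first, Regex second,
                               List<string> firstMatches, List<string> secondMatches)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (first.IsMatch(line)) firstMatches.Add(line);
                if (second.IsMatch(line)) secondMatches.Add(line);
            }
        }
    }
}
```

Only the matching lines are kept, so memory use depends on the number of hits rather than on the 25 MB file size. If even the hits are too many, the two Add calls could be swapped for writes to two StreamWriter instances.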

Related

Need to parse C# textbox for objectionable words

I'm part of a small "message board" type project being built in a C# Web Form. I need to parse the user-entered text for objectionable words. This is my first C# project and I'm not sure how to split the words in the textbox.
It's been requested that I make an XML config file to contain the words to be screened for. Ideally, I would like to do a fark.com style replace. I've never made an XML config file and I really just need a place to start. All the config file information I've found has not been particularly applicable to this scenario.
Edit:
I ended up using a .txt file and splitting it on whitespace, then parsing the textbox on whitespace and comparing words. The project leader wanted a config file, but I pitched him on the simple solution and we went for it. Thanks for the replies.

An XML file won't scale well, especially if accessed concurrently. You would be better off using a database engine for such a task.

Making an XML config file just to filter a bunch of words probably isn't the best way to go there, considering it's most-likely just going to be a giant list of strings...
If it's not, have a look at the XmlDocument class and the System.Xml namespace. I assume you're aware of the format for XML documents but, if not, here is a simple example. The format is pretty much open to whatever XML tags you want, but the XmlDocument class I linked you to does have some fairly annoying catches that you'll come across while implementing it.
In terms of splitting the user text, it's fairly easy to hide "bad" words in another string so I'm not sure String.Split() is even what you want either. You will probably want to Regex it.
With that said, I came across this blog post a while ago that offers a simple profanity filter for .NET using Regex. Perhaps it will suit your needs.

Depends on how large this "bad words list" will be, and whether you expect it to change.
If it's pretty static, I would load the list from your XML file into some kind of in-memory collection. Then for each line of text you receive, parse the line into words, and then check each word for its existence in the collection.
If it's going to change frequently, and you need to pick up on those changes quickly, then you want more random access... that means a database. Hitting an XML file repeatedly would be a performance drag.
Either way, split the string and react to each hit.
The string can be split up using something like:
myLineOfText.Split(new String[] { " " }, StringSplitOptions.RemoveEmptyEntries);
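A rough sketch of the load-once, check-each-word idea. It assumes the simpler setup the asker ended up with (a plain text file of whitespace-separated words) rather than XML; the names WordScreen, LoadWords, and FindHits are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordScreen
{
    // Loads a whitespace-separated word list into a case-insensitive set.
    public static HashSet<string> LoadWords(string path)
    {
        // Passing a null separator array makes Split use whitespace.
        string[] words = File.ReadAllText(path)
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }

    // Returns every word of the text that appears in the blocked set.
    public static List<string> FindHits(string text, HashSet<string> blocked)
    {
        var hits = new List<string>();
        foreach (string word in text.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (blocked.Contains(word)) hits.Add(word);
        }
        return hits;
    }
}
```

A HashSet keeps each lookup cheap however long the list grows. Note that this sketch splits on spaces only, so a word with punctuation attached ("word!") would slip through.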

How to resize a file, "trimming" its beginning?

I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then, the deserialized part should be removed from the file, making the next object to be "first".
From the standpoint of file system, that would just mean copying file header several bytes further on the disk, and then moving the "beginning of the file" pointer. The question is how to implement this in C#? Is it at all possible?

The easiest approach that I can see:
1) Stream out (like a log, dump it into the file).
(Note: you'd need some delimiters and a consistent format for your file, based on what your data is.)
2) Later, stream in (just read the file from the start, in one go, and process it without removing anything).
That would work fine as a FIFO (first in, first out).
So my suggestion is: don't try to optimize by removing, skipping, etc. Rather, regroup and use more files.
3) If you worry about the scale of things, just partition the data into small enough files, e.g. 100 or 1,000 records each (it depends; do some calculations).
You may need to make some sort of 'virtualizer' here, which maps files and keeps track of your 'database' when it is spread over multiple files. The simplest is to just use the file system and check file times etc. Or add some basic code to improve that.
However, I think you may have problems if you have to ensure 'transactions', i.e. what if things fail and you need to keep track of where the file left off, retrace, etc.
That might be an issue, but you know best whether it's really necessary (how critical it is). You can always work per file, with smaller files. If it fails, roll back and do the file again (or log the problems). If it succeeds, you can delete the file and go on like that.
This is a very 'hand-made' approach, but it should get you going with a simple and not too demanding solution (like you're describing). Or something along those lines.
I should probably add...
You could also save yourself some trouble and use a portable database or something similar. This was purely based on the idea of hand-coding the simplest solution (and we could probably come up with something smarter, but it being late, this is what I have :).

Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem. But a linear file is totally inappropriate for representing a FIFO queue.
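For illustration only, here is roughly what "re-write the entire file" looks like. FileTrimmer, TrimStart, and the ".tmp" suffix are hypothetical names, and the sketch assumes nothing else has the file open.

```csharp
using System.IO;

public static class FileTrimmer
{
    // Drops the first `count` bytes by copying the remainder to a temporary
    // file and then putting that file in place of the original.
    public static void TrimStart(string path, long count)
    {
        string tempPath = path + ".tmp"; // hypothetical temp-file naming

        using (var source = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var target = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            source.Seek(count, SeekOrigin.Begin); // skip the part being removed
            source.CopyTo(target);                // copy everything after it
        }

        File.Delete(path);
        File.Move(tempPath, path);
    }
}
```

Every Pop() would cost a copy of the whole remaining file, which is why a single linear file is a poor fit for a queue.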

Search in a file and write the matched content to another file

I have a large txt file and want to search through it and output certain strings, for example, let's say two lines are:
oNetwork.MapNetworkDrive "Q:", xyz & "\one\two\three\four"
oNetwork.MapNetworkDrive "G:", zzz
From this I'd like to copy and output the Q:, G:, and the "\one\two\three\four" to another file.
What's the most efficient way of doing this?

There is ultimately only one way to read a text file. You're going to have to go line-by-line and parse the entire file to pick out the pieces you care about.
Your best bet is to read the file using a StreamReader (File.OpenText is a good way to get one). From there, just keep calling ReadLine and picking out the bits you care about.
The main way to increase efficiency is to make sure you only have to parse the file once. Save everything you care about, and only what you care about. As much as you can, act on the information in the file right away then throw it away - the less you have to store, the better. Do not use File.ReadAllText since it will read the entirety of the file into memory all at once.
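A sketch of that single pass, assuming every line of interest looks like the two samples in the question. The regular expression and the names DriveMapExtractor and Extract are my own guesses at what fits, not something taken from the original script.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class DriveMapExtractor
{
    // Group 1: the quoted drive letter. Group 2 (optional): a second quoted
    // string later on the same line, such as the path.
    private static readonly Regex MapLine = new Regex(
        @"MapNetworkDrive\s+""([^""]+)""(?:.*?""([^""]+)"")?");

    // Reads the input once, line by line, and writes each extracted piece
    // to the output file on its own line.
    public static void Extract(string inputPath, string outputPath)
    {
        using (StreamReader reader = File.OpenText(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                Match m = MapLine.Match(line);
                if (!m.Success) continue;

                writer.WriteLine(m.Groups[1].Value);
                if (m.Groups[2].Success) writer.WriteLine(m.Groups[2].Value);
            }
        }
    }
}
```

Each match is written out immediately and then forgotten, so nothing accumulates in memory while the file is read.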

Modifying XML file in-place?

Suppose I have the following XML File:
<book>
<name>sometext</name>
<name>sometext</name>
<name>sometext</name>
<name>Dometext</name>
<name>sometext</name>
</book>
If I wanted to modify the content by changing D to s (As shown in the fourth "name" node) without having to read/write the entire file, would this be possible?

A 10 MB file is not a problem. Slurp it up. Modify the DOM. Write it back to the filesystem. 10 GB is more of a problem. In that case:
Assumption: You are not changing the length of the file. Think of the file as an array of characters and not a (linked) list of characters: You cannot add characters in the middle, only change them.
You need to seek the position in the file to change and then write that character to disk.
In the .NET world, with a FileStream object, you want to set the Position property to the index of the D character and then write a single s character. Check out this question on random access of text files.
Also read this question: How to insert characters to a file using C#. It looks like you can't really use the FileStream object, but instead will have to resort to writing individual bytes.
Good luck. But really, if we are only talking 10 MB, then just slurp it up. The computer should be doing your work.
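Something like the following would do the length-preserving overwrite. It assumes you already know the byte offset of the D and that the character occupies exactly one byte in the file's encoding (true for ASCII, and for these letters in UTF-8); InPlacePatcher and OverwriteByte are hypothetical names.

```csharp
using System.IO;

public static class InPlacePatcher
{
    // Overwrites exactly one byte at the given offset. The file length does
    // not change and the rest of the file is never read or rewritten.
    public static void OverwriteByte(string path, long offset, byte value)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        {
            stream.Position = offset;
            stream.WriteByte(value);
        }
    }
}
```

A call might look like InPlacePatcher.OverwriteByte(path, offset, (byte)'s'). Finding the offset is the hard part: unless it is known in advance, the file has to be scanned up to that point anyway.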

I would just read in the file, process, and spit it back out.
This can be done in a streaming fashion with XmlReader -- it's more manual work than XmlDocument or XDocument, but it does avoid creating an in-memory DOM (XmlDocument/XDocument can be used with this same read/write pattern, but generally require the full reconstruction in-memory):
Open file input file stream (XmlReader)
Open output file stream (XmlWriter, to a different file)
Read from XmlReader and write to XmlWriter, performing any transformations as necessary.
Close streams
Move new file to old file (overwrite, an atomic action)
While this can be set up to process input and output on the same open file with a bunch of really clever work, nothing will be saved and there are many edge cases, including increasing or decreasing file lengths. In fact, it might be slower to try to shift the contents of a file backwards to fill in gaps, or forwards to make new room. The filesystem cache will likely make any "gains" minimal/moot for anything but the most basic length-preserving operation. In addition, modifying a file in place is not an atomic action and is generally non-recoverable in case of an error: at the expense of a temporary file, the read/write/move approach is atomic with respect to the final file contents.
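The read/write/move steps above could be sketched roughly as follows. XmlStreamRewriter, RewriteText, and the ".tmp" suffix are hypothetical names. The sketch copies only elements, attributes, text, CDATA, whitespace, and comments, and lets the writer emit its own XML declaration, so treat it as a starting point rather than a complete copier.

```csharp
using System;
using System.IO;
using System.Xml;

public static class XmlStreamRewriter
{
    // Copies the XML at `path` to a temporary file node by node, passing every
    // text node through `transform`, then replaces the original file.
    public static void RewriteText(string path, Func<string, string> transform)
    {
        string tempPath = path + ".tmp"; // hypothetical temp-file naming

        using (XmlReader reader = XmlReader.Create(path))
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement;
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(transform(reader.Value));
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    // Other node types (declaration, processing instructions,
                    // DOCTYPE) are skipped in this sketch.
                }
            }
        }

        // Swap the finished temporary file in for the original.
        File.Replace(tempPath, path, null);
    }
}
```

For the example in the question, the call would be something like XmlStreamRewriter.RewriteText(path, s => s.Replace("Dometext", "sometext")). Only one node is held in memory at a time, so the approach should scale to files far larger than a DOM could hold.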
Or, consider XSLT -- it was designed for this ;-)
Happy coding.

The cleanest (and best) way would be to use the XmlDocument object to manipulate, but a quick and dirty solution is to just read the XML to a string and then:
xmlText = xmlText.Replace("Dometext", "sometext");

An XML file is a text file and does not allow for insertions or deletions. The only mutations supported are overwrite and append. Not a good match for XML.
So, first make very sure you really need this. It's a complicated operation, only worth it on very large files.
Since there could be a change in length you will at least have to move everything after the first replacement. The possibility of multiple replacements means you may need a big buffer to accommodate the changes.
It's easier to copy the whole file. That is expensive in I/O but you save on memory use.

StreamReader.ReadLine() starting from the end of the stream

I'm working in C#/.NET and I'm parsing a file to check if one line matches a particular regex. Actually, I want to find the last line that matches.
To get the lines of my file, I'm currently using the System.IO.StreamReader.ReadLine() method, but as my files are huge, I would like to optimize the code a bit and start from the end of the file.
Does anyone know whether C#/.NET has a function similar to ReadLine() that starts from the end of the stream? And if not, what would be, to your mind, the easiest and most optimized way to do the job described above?

Funny you should mention it - yes I have. I wrote a ReverseLineReader a while ago, and put it in MiscUtil.
It was in answer to this question on Stack Overflow - the answer contains the code, although it uses other bits of MiscUtil too.
It will only cope with some encodings, but hopefully all the ones you need. Note that this will be less efficient than reading from the start of the file, if you ever have to read the whole file - all kinds of things may assume a forward motion through the file, so they're optimised for that. But if you're actually just reading lines near the end of the file, this could be a big win :)
(Not sure whether this should have just been a close vote or not...)

Since you are using a regular expression I think your best option is going to be to read the entire line into memory and then attempt to match it.
Perhaps if you provide us with the regular expression and a sample of the file contents we could find a better way to solve your problem.

"Easiest" -vs- "Most optimized"... I don't think you're going to get both
You could open the file and read each line. Each time you find one that fits your criteria, store it in a variable (replacing any earlier instance). When you finish, you will have the last line that matches.
You could also use a FileStream to set the position near the end of your file. Go through the steps above, and if no match is found, set your FileStream position earlier in your file, until you DO find a match.
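The first of those two options (the "easiest" one) might look like this. LastMatchFinder and FindLast are hypothetical names, and the pattern is whatever expression you are already matching with.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class LastMatchFinder
{
    // Scans the file from the start and remembers the most recent matching
    // line. Returns null when no line matches.
    public static string FindLast(string path, Regex pattern)
    {
        string last = null;
        foreach (string line in File.ReadLines(path))
        {
            if (pattern.IsMatch(line)) last = line; // replaces any earlier match
        }
        return last;
    }
}
```

This still reads the whole file, but it holds only one line plus the current best match in memory. The FileStream variant avoids the full read at the cost of handling a partial first line and the file's encoding yourself.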

This ought to do what you're looking for. It might be memory heavy for what you need, but I don't know what your needs are in that area (Reverse() needs System.Linq):
string[] lines = File.ReadAllLines("C:\\somefilehere.txt");
IEnumerable<string> revLines = lines.Reverse();
foreach(string line in revLines) {
/*do whatever*/
}
It would still require reading every line at the outset, but it might be faster than doing a check on each one as you do so.
