Modifying XML file in-place? - c#

Suppose I have the following XML file:
<book>
  <name>sometext</name>
  <name>sometext</name>
  <name>sometext</name>
  <name>Dometext</name>
  <name>sometext</name>
</book>
If I wanted to modify the content by changing the D to an s (as shown in the fourth "name" node) without having to read and write the entire file, would this be possible?

A 10 MB file is not a problem. Slurp it up. Modify the DOM. Write it back to the filesystem. 10 GB is more of a problem. In that case:
Assumption: you are not changing the length of the file. Think of the file as an array of characters and not a (linked) list of characters: you cannot insert characters in the middle, only overwrite them.
You need to seek to the position in the file you want to change and then write that character to disk.
In the .NET world, with a FileStream object, you want to set the Position property to the index of the D character and then write a single s character. Check out this question on random access of text files.
Also read this question: How to insert characters to a file using C#. It looks like you can't really use the FileStream object, but instead will have to resort to writing individual bytes.
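A sketch of that seek-and-overwrite, assuming a made-up file name and offset, and a single-byte encoding such as ASCII or UTF-8 for these characters (with multi-byte characters you would have to compute the byte offset first):

using System.IO;

class InPlacePatch
{
    static void Main()
    {
        // Open read/write; FileMode.Open leaves the existing contents intact.
        using (var fs = new FileStream("books.xml", FileMode.Open, FileAccess.ReadWrite))
        {
            long offsetOfD = 58;      // hypothetical byte offset of the 'D'
            fs.Position = offsetOfD;  // seek to the character to change
            fs.WriteByte((byte)'s');  // overwrite in place -- same length, nothing shifts
        }
    }
}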
Good luck. But really, if we are only talking 10 MB, then just slurp it up. The computer should be doing your work.

I would just read in the file, process, and spit it back out.
This can be done in a streaming fashion with XmlReader. It's more manual work than XmlDocument or XDocument, but it avoids building an in-memory DOM (XmlDocument/XDocument can be used with this same read/write pattern, but they generally require reconstructing the full document in memory):
Open an input file stream (XmlReader)
Open an output file stream (XmlWriter, to a different file)
Read from the XmlReader and write to the XmlWriter, performing any transformations as necessary (a sketch follows these steps)
Close the streams
Move the new file over the old file (an atomic overwrite)
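A minimal sketch of that pattern against the question's file (file names are illustrative, and only the node types that appear in the sample document are passed through):

using System.IO;
using System.Xml;

class StreamingRewrite
{
    static void Main()
    {
        using (var reader = XmlReader.Create("books.xml"))
        using (var writer = XmlWriter.Create("books.tmp"))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        bool isEmpty = reader.IsEmptyElement; // check before attributes are consumed
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, true);
                        if (isEmpty) writer.WriteEndElement();
                        break;
                    case XmlNodeType.Text:
                        // The transformation happens here, e.g. fixing "Dometext".
                        writer.WriteString(reader.Value.Replace("Dometext", "sometext"));
                        break;
                    case XmlNodeType.Whitespace:
                    case XmlNodeType.SignificantWhitespace:
                        writer.WriteWhitespace(reader.Value);
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    // A real version would also pass through comments, CDATA,
                    // processing instructions, etc.
                }
            }
        }
        // Swap the new file in only after it is complete.
        File.Delete("books.xml");
        File.Move("books.tmp", "books.xml");
    }
}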
While this could be set up to process input and output on the same open file with a bunch of really clever work, nothing would be saved, and there are many edge cases, including increasing or decreasing file lengths. In fact, it might be slower to shift the contents of a file backwards to fill in gaps, or forwards to make new room; the filesystem cache will likely make any "gains" minimal or moot for anything but the most basic length-preserving operation. In addition, modifying a file in place is not an atomic action and is generally non-recoverable if an error occurs, whereas the read/write/move approach, at the expense of a temporary file, is atomic with respect to the final file contents.
Or, consider XSLT -- it was designed for this ;-)
Happy coding.

The cleanest (and best) way would be to use the XmlDocument object to manipulate the document, but a quick and dirty solution is to just read the XML into a string and then:
xmlText = xmlText.Replace("Dometext", "sometext");
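Filled out, that quick and dirty version is only a few lines (assuming the file fits comfortably in memory; the file name is illustrative):

string xmlText = File.ReadAllText("books.xml");
xmlText = xmlText.Replace("Dometext", "sometext");
File.WriteAllText("books.xml", xmlText);

Beware that Replace hits every occurrence in the document, including any that happen to appear in tag names or attribute values, which is exactly why the XmlDocument route is cleaner.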

An XML file is a text file, and text files do not allow insertions or deletions; the only mutations the filesystem supports are overwrite and append. That is not a good match for XML.
So, first make very sure you really need in-place modification. It's a complicated operation, only worth it on very large files.
Since a replacement can change the length, you will at least have to move everything after the first replacement, and the possibility of multiple replacements means you may need a big buffer to accommodate the changes.
It's easier to copy the whole file. That is expensive in I/O but you save on memory use.

Related

How to resize a file, "trimming" its beginning?

I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then the deserialized part should be removed from the file, making the next object the "first".
From the standpoint of the file system, that would just mean moving the file header several bytes further along the disk and then moving the "beginning of the file" pointer. The question is how to implement this in C#. Is it at all possible?
The easiest approach that I can see:
1) Stream out (like a log: just dump records into the file). Note: you'd need some delimiters and a 'consistent format' for your 'file', based on what your data is.
2) Later, stream in: just read the file from the start, in one go, and process it without removing anything.
That'd work fine as a FIFO (first in, first out); a sketch of those two steps appears at the end of this answer. So, my suggestion: don't try to optimize by removing, skipping, etc.; rather, regroup and use more files.
3) If you worry about the scale of things, just 'partition' the queue into small enough files, e.g. one per 100 or 1,000 records (it depends; do some calculations). You may need to make some sort of 'virtualizer' here, which maps the files and keeps track of your 'database' as it spans multiple files. The simplest is to just use the file system and check file times, etc., or add some basic code to improve on that.
However, I think you may have problems if you have to ensure 'transactions', i.e. if things fail you need to keep track of where the file left off, retrace, etc. That might be an issue, but you know best whether you really need that (how critical it is). You can always work 'per file' and with smaller files: if processing fails, roll back and do the file again (or log the problem); if it succeeds, you can delete the file (after success) and go on like that.
This is a very 'hand-made' approach, but it should get you going with a simple and not too demanding solution like the one you're describing.
I should probably add: you could also save yourself some trouble and use some portable database for this, or something similar. This was purely based on the idea of hand-coding the simplest solution.
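As a sketch of steps 1) and 2), using a 4-byte length prefix as the 'delimiter' (the names and file handling are illustrative, and serializing the objects themselves is left out):

using System.Collections.Generic;
using System.IO;

static class FileQueue
{
    // Push: append one record (a 4-byte length followed by the payload).
    public static void Enqueue(string path, byte[] payload)
    {
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var w = new BinaryWriter(fs))
        {
            w.Write(payload.Length);
            w.Write(payload);
        }
    }

    // Pop, in bulk: stream the records front to back without removing anything.
    public static IEnumerable<byte[]> ReadAll(string path)
    {
        using (var fs = File.OpenRead(path))
        using (var r = new BinaryReader(fs))
        {
            while (fs.Position < fs.Length)
            {
                int len = r.ReadInt32();
                yield return r.ReadBytes(len);
            }
        }
    }
}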
Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem; a linear file is totally inappropriate for representing a FIFO queue.

Copy Large Text File into Arrays by Matching Regex

Occasionally I need to look through a roughly 25 MB Oracle Datapump SQLFILE (plain text) for a few key strings of text. I currently use some handy features in UltraEdit that make this not so bad. However, I have some other users who do not have UltraEdit and aren't familiar enough with regular expressions to find the right values.
If I wanted to create two Collections and add only lines matching a certain RegEx to each, where should I start? Should I use the plain StreamReader and StreamReader.ReadLine() to move through the file? Or would the size of the file suggest a different option?
The end result would be to output the contents of the Collections to the screen or a new text file, but I'm not too worried about that detail yet.
Please be as general or specific as you can be, I'm not immune to filling in what details I can for myself.
Starting with .NET Framework 4, you can use the File.ReadLines method, which returns an IEnumerable<string> and thus does not hold the whole file in memory.
var lines = File.ReadLines(path).Where(s => myRegex.IsMatch(s));
Should I use the plain StreamReader and StreamReader.ReadLine() to move through the file? Or would the size of the file suggest a different option?
That's the approach I would take. Reading through a stream does not load the entire file into memory, and so it seems perfect for large files.
For each line, you can test whether it matches and copy that line to the corresponding list, as sketched below. Or, if you are concerned about holding too much data, copy each line to one of two output files (also using streams).
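A sketch of that line-by-line approach with two collections (the patterns and the file name are placeholders):

using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class SqlFileScan
{
    static void Main()
    {
        var regexA = new Regex(@"CREATE TABLE");   // placeholder patterns
        var regexB = new Regex(@"CREATE INDEX");
        var matchesA = new List<string>();
        var matchesB = new List<string>();

        using (var reader = File.OpenText("dump.sql"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)   // one line in memory at a time
            {
                if (regexA.IsMatch(line)) matchesA.Add(line);
                if (regexB.IsMatch(line)) matchesB.Add(line);
            }
        }
    }
}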

Search in a file and write the matched content to another file

I have a large txt file and want to search through it and output certain strings; for example, let's say two lines are:
oNetwork.MapNetworkDrive "Q:", xyz & "\one\two\three\four"
oNetwork.MapNetworkDrive "G:", zzz
From this I'd like to copy and output the Q:, G:, and the "\one\two\three\four" to another file.
What's the most efficient way of doing this?
There is ultimately only one way to read a text file. You're going to have to go line-by-line and parse the entire file to pick out the pieces you care about.
Your best bet is to read the file using a StreamReader (File.OpenText is a good way to get one). From there, just keep calling ReadLine and picking out the bits you care about.
The main way to increase efficiency is to make sure you only have to parse the file once. Save everything you care about, and only what you care about. As much as you can, act on the information in the file right away and then throw it away: the less you have to store, the better. Do not use File.ReadAllText, since it reads the entire file into memory at once.
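As a sketch (the file names are made up, and the regex assumes lines shaped like the two in the question):

using System.IO;
using System.Text.RegularExpressions;

class DriveExtract
{
    static void Main()
    {
        // Capture the drive letter, plus a trailing quoted path if present.
        var regex = new Regex(@"MapNetworkDrive\s+""([A-Z]:)""(?:.*?""(\\[^""]*)"")?");
        using (var reader = File.OpenText("input.txt"))
        using (var writer = File.CreateText("output.txt"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var m = regex.Match(line);
                if (!m.Success) continue;
                writer.WriteLine(m.Groups[1].Value);       // e.g. Q: or G:
                if (m.Groups[2].Success)
                    writer.WriteLine(m.Groups[2].Value);   // e.g. \one\two\three\four
            }
        }
    }
}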

Editing large binary files

I'm busy with a little project which has a lot of data, like images, text files and other things, and I'm trying to pack it all up into one big file (or multiple big files) so the program folder doesn't look messy.
But the problem is how to edit these files. I've thought about the file structure, and it's going to be something like this:
[DWORD]  Number of files
[DWORD]  FileId
[STRING] FileName
[DWORD]  FileSize
[DWORD]  FileIndex
[BYTES]  All the files
So the first part is to quickly get a list of all the files, and FileIndex is the position in the binary file, so I can set the pointer to, for example, 300 and read the file.
But if I want to create a patch and edit one of the files, I would have to read all the bytes after the file I'm editing and copy them all back, which could take ages with more than a couple of files.
The binary file could be a few hundred MBs once all the files are inserted.
So how do other programs do this? Games, for example, use these big files and also patch a lot; is there some kind of trick to insert extra bytes more quickly?
There is no "trick" to inserting bytes in the middle of a file.
Usually solutions involve adding files to the end of the file, then switching their position in the index. Then you run into the problem of having to defragment the file. You can break files into large chunks, which can mitigate some of the defragmentation woes, but then the files are not contiguous.
If you are dealing with non-static data, I would not recommend doing this unless you absolutely have to. I've seen absolutely brilliant software engineers take a considerable amount of time to write a reasonable implementation of this.
Using SQLite as a virtual file system can be a viable solution to this. But then again, so is putting the data files in another folder so it doesn't look "messy".
If at all possible, I'd probably package the data up into a zip file. This will not only clean up your directory, but (especially for the text files you mention) throw in some compression essentially for free. There are also, of course, quite a few existing tools and libraries for creating, examining, modifying, etc., a zip file.
Using zlib (for one example), most of the work is handled for you (e.g., as demonstrated in minizip).
The trick is to make patches by overwriting the data in place. Otherwise, there are systems designed to manage large volumes of data, for example databases.
You can create a database file to accompany your program and hold all your data there, rather than in loose files. You can even embed the database engine in your application, with SQLite for example, or use external DBs like SQL Server, Oracle, or MySQL.
What you're describing is basically implementing your own file system. It's a tricky and very difficult task to make that effective.
You could treat the packing and editing program sort of like a custom memory allocator:
Use a minimum block size. When you add a file, use enough whole blocks to fit it. This automatically gives the files some room to grow without affecting the others.
When a file gets too big for its current allocation, move it to the end of the package.
Mark the freed blocks as free, and keep the offset to the head of the free list in the package header. When adding other files, first check whether there is a free block big enough for them.
When extending a file past its current block, check whether the following block is on the free list.
If the free list gets too long (too much fragmentation), consolidate the package: move each file forward to start in the first free block. This has to re-write the whole file, but it would happen rarely.
Alternately, instead of the simple directory you have, use something like a FAT: for each file, store a list of chunks and sizes. When you extend a file past its current allocation, add another chunk with the remainder. Defragment occasionally as needed.
Both of these would add a little overhead to the package, but leaving gaps is really the only alternative to rewriting the whole thing on every insert. A sketch of the bookkeeping follows.
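This is what the block-allocation idea implies in code; the names and the block size are illustrative, not a finished format:

class PackageDirectory
{
    const int BlockSize = 4096;          // minimum allocation unit

    // One directory entry per packed file.
    class Entry
    {
        public uint FileId;
        public string FileName;
        public long FileSize;            // actual byte length
        public long FirstBlockOffset;    // where the file's blocks begin
        public int BlockCount;           // whole blocks reserved, giving room to grow
    }

    long freeListHead;                   // offset of the first free block; kept in the header

    // Whole blocks needed for a file of a given size, rounding up.
    static int BlocksFor(long size)
    {
        return (int)((size + BlockSize - 1) / BlockSize);
    }
}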
There is no way to insert bytes into a file other than the one you described. This is independent of the programming language; it's just how file systems work.
You can overwrite parts of the file, but only as long as you respect the byte count.
Have you thought about using a .zip file? I keep seeing formats out there where multiple files are stored as one, and the underlying file is really a zip file. The nice thing about this is that the zip library handles the low-level bit-tracking stuff for you.
A couple examples that come to mind:
A Word .docx file is really a zip (rename one to .zip, and you can open it -- it has whole folders in it)
The .xap file that Silverlight packages use is another one.
You can use managed shared memory, backed by a memory-mapped file. You still have to have sufficient address space for the whole file, but you don't need to copy the whole file into memory. You can use most standard facilities with a shared-memory allocator, though you may quickly find that specifying a custom allocator everywhere is a chore. But the good news is that you don't need to implement it all yourself: you can take Boost.Interprocess, which already has all the necessary facilities for both Unix and Windows.
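Boost.Interprocess is a C++ library; the closest built-in facility in .NET is System.IO.MemoryMappedFiles. A minimal sketch of a length-preserving in-place patch through a mapped view (the file name and offset are made up):

using System.IO.MemoryMappedFiles;

class MappedPatch
{
    static void Main()
    {
        // Map the package into the address space; the file is not copied wholesale.
        using (var mmf = MemoryMappedFile.CreateFromFile("package.bin"))
        using (var accessor = mmf.CreateViewAccessor())
        {
            // Overwrite one byte at a known offset -- length-preserving edits only.
            accessor.Write(300, (byte)'s');
        }
    }
}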

How to build xml element in memory and then save to file?

I want to make an XML file but add certain elements selectively. I would like to build those elements in memory and be able to choose whether or not each one is written to the XML file. Is this possible? I'm using C# WinForms. I already looked at XmlDocument, which would let me build the entire document in memory, but I don't want that since it's not good for large data and takes too much memory.
If large data volume is your main concern, XmlWriter may be ideal. The API is perhaps not as elegant as XElement etc, but it is the most direct and efficient mechanism, and is designed for firehosing a one-way stream of data.
It has a twin, XmlReader, but that is much harder to get right - due to the complexities of processing incoming xml and accounting for child trees appropriately.
If you have a stream, named say stream, then you can create an XmlWriter with:
XmlWriter.Create(stream);
Then you can create your elements (which are of type XmlElement) and call the WriteTo method on each element you want to add, passing as an argument the XmlWriter created above.
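A sketch of that pattern, using an XmlDocument as a factory for the in-memory elements (the element names and file name are illustrative):

using System.IO;
using System.Xml;

class SelectiveWrite
{
    static void Main()
    {
        var doc = new XmlDocument();
        XmlElement keep = doc.CreateElement("name");
        keep.InnerText = "sometext";
        XmlElement skip = doc.CreateElement("name");
        skip.InnerText = "Dometext";

        using (var stream = File.Create("out.xml"))
        using (var writer = XmlWriter.Create(stream))
        {
            writer.WriteStartElement("book");
            keep.WriteTo(writer);     // this element is written
            // skip.WriteTo(writer);  // this one is deliberately left out
            writer.WriteEndElement();
        }
    }
}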
