Editing large binary files - C#

I'm busy with a little project that has a lot of data, such as images, text files and other things, and I'm trying to pack it all into one big file (or a few big files) so the program folder doesn't look messy.
The problem is how to edit these files. I've thought about the file structure and it's going to be something like this:
[DWORD] Number of files
[DWORD]FileId
[STRING]FileName
[DWORD]FileSize
[DWORD]FileIndex
[BYTES]All the files
So the first part lets me quickly get a list of all the files, and FileIndex is the position in the binary file, so I can set the pointer to, for example, 300 and read the file.
But if I want to create a patch and edit it, I would have to read all the bytes after the file I'm editing and copy them all back, which could take ages with a couple of files.
The binary file could be a few hundred MB when all the files are inserted.
So how do other programs do this? Games, for example, use these big files and also patch a lot. Is there some kind of trick to insert extra bytes more quickly?

There is no "trick" to inserting bytes in the middle of a file.
Usually solutions involve adding files to the end of the file, then switching their position in the index. Then you run into the problem of having to defragment the file. You can break files into large chunks which can mitigate some of the defragmentation woes, but then the files are not contiguous.
If you are dealing with non-static data, I would not recommend doing this unless you absolutely have to. I've seen absolutely brilliant software engineers take a considerable amount of time to write a reasonable implementation of this.
Using SQLite as a virtual file system can be a viable solution to this. But then again, so is putting the data files in another folder so it doesn't look "messy".
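The "append, then switch the position in the index" idea can be sketched as follows. This is a minimal illustration in Python (chosen for brevity; the same layout maps onto C#'s BinaryReader/BinaryWriter), not the format any particular game uses. The header layout, the JSON index and the function names are all assumptions made for the example. Old data and old index copies are left behind as dead space, which is exactly the fragmentation problem mentioned above.

```python
import json
import os
import struct
import tempfile

# Hypothetical header: offset and length of the current index, both 64-bit.
HEADER = struct.Struct("<QQ")

def _write_index(path, index):
    """Append a fresh copy of the index, then repoint the header at it."""
    blob = json.dumps(index).encode("utf-8")
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(blob)
        f.seek(0)
        f.write(HEADER.pack(offset, len(blob)))

def _read_index(f):
    f.seek(0)
    offset, length = HEADER.unpack(f.read(HEADER.size))
    f.seek(offset)
    return json.loads(f.read(length).decode("utf-8"))

def create_pack(path):
    """Create a pack holding only a header and an empty index."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(0, 0))
    _write_index(path, {})

def put(path, name, data):
    """Add or replace an entry by appending its bytes; nothing is shifted."""
    with open(path, "r+b") as f:
        index = _read_index(f)
        f.seek(0, os.SEEK_END)
        index[name] = [f.tell(), len(data)]
        f.write(data)
    _write_index(path, index)

def get(path, name):
    with open(path, "rb") as f:
        offset, size = _read_index(f)[name]
        f.seek(offset)
        return f.read(size)

with tempfile.TemporaryDirectory() as tmp:
    pack = os.path.join(tmp, "data.pak")
    create_pack(pack)
    put(pack, "a.txt", b"first version")
    put(pack, "b.txt", b"other file")
    put(pack, "a.txt", b"second, longer version")  # appended; index now points here
    print(get(pack, "a.txt"))
```

A patch only costs the size of the new data plus a new index copy. Reclaiming the dead space requires rewriting the pack, which can be done rarely.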

If at all possible, I'd probably package the data up into a zip file. This will not only clean up your directory, but (especially for the text files you mention) throw in some compression essentially for free. There are also, of course, quite a few existing tools and libraries for creating, examining, modifying, etc., a zip file.
Using zlib (for one example), most of the work is handled for you (e.g., as demonstrated in minizip).
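As a rough illustration of how little code a zip-based package needs, here is a sketch using Python's standard zipfile module (not zlib/minizip; in .NET, System.IO.Compression.ZipArchive plays a similar role). The function and file names are made up for the example.

```python
import os
import tempfile
import zipfile

def build_archive(path, files):
    """Create a compressed archive from a dict of entry name -> bytes."""
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)

def add_file(path, name, data):
    """Append a new entry to an existing archive without rewriting the others."""
    with zipfile.ZipFile(path, "a", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)

def read_file(path, name):
    with zipfile.ZipFile(path) as zf:
        return zf.read(name)

with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, "assets.zip")
    build_archive(archive, {"readme.txt": b"hello " * 100})
    add_file(archive, "level1.map", b"\x00\x01\x02")
    print(len(read_file(archive, "readme.txt")), os.path.getsize(archive))
```

Replacing an existing entry is not shown. With this module that generally means writing a new archive, so the patching problem does not disappear entirely.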

The trick is to make patches by overwriting the data. Otherwise, there are systems available to manage large volumes of data, for example databases.
You can create a database file that accompanies your program and hold all your data there instead of in separate files. You can even embed the database engine in your application, with SQLite for example, or use an external database such as SQL Server, Oracle or MySQL.
What you're describing is basically implementing your own file system. It's a tricky and very difficult task to make that effective.
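A minimal sketch of the embedded-database approach, using Python's built-in sqlite3 module. The table and column names are assumptions for illustration. The database engine handles free space and in-place updates, so replacing one file never requires moving the others by hand.

```python
import sqlite3

def open_store(path):
    """Open (or create) a single-file store; ':memory:' is handy for trying it out."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return con

def put(con, name, data):
    """Insert a file, or replace it if the name already exists."""
    with con:  # commits on success, rolls back on error
        con.execute(
            "INSERT OR REPLACE INTO files (name, data) VALUES (?, ?)",
            (name, data),
        )

def get(con, name):
    row = con.execute(
        "SELECT data FROM files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])

store = open_store(":memory:")
put(store, "logo.png", b"\x89PNG...")
put(store, "logo.png", b"\x89PNG, patched")  # in-place replacement
print(get(store, "logo.png"))
```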

You could treat the packing and editing program sort of like a custom memory allocator:
Use a minimum block size. When you add a file, use enough whole blocks to fit the file. This automatically gives the files some room to grow without affecting the others.
When a file gets too big for its current allocation, move it to the end of the package.
Mark the free blocks as free, and keep the offset to the head of the free list in the package header. When adding other files, first check to see if there is a free block big enough for them.
When extending files past their current block, check to see if the following block is on the free list.
If the free list gets too long (too much fragmentation), consolidate the package: move each file forward to start in the first free block. This has to rewrite the whole file, but it would happen rarely.
Alternatively, instead of the simple directory you have, use something like a FAT. For each file, store a list of chunks and sizes. When you extend a file past its current allocation, add another chunk with the remainder. Defragment occasionally as needed.
Both of these would add a little overhead to the package, but leaving gaps is really the only alternative to rewriting the whole thing on every insert.
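The first scheme (whole blocks plus a free list) could look roughly like the sketch below. It is a simplified in-memory model: a bytearray stands in for the package file, the directory and free list live in Python objects rather than in a header, and the 4096-byte block size is an arbitrary assumption. Checking whether the following block is free and consolidating the package are left out.

```python
BLOCK = 4096  # assumed minimum block size

class Package:
    def __init__(self):
        self.data = bytearray()  # stands in for the package file
        self.entries = {}        # name -> (first_block, block_count, size)
        self.free = []           # free runs as (first_block, block_count)

    @staticmethod
    def _blocks(size):
        return max(1, -(-size // BLOCK))  # ceiling division, at least one block

    def _allocate(self, count):
        """Reuse the first free run that is big enough, else grow the package."""
        for i, (start, length) in enumerate(self.free):
            if length >= count:
                if length == count:
                    del self.free[i]
                else:
                    self.free[i] = (start + count, length - count)
                return start
        start = len(self.data) // BLOCK
        self.data.extend(bytes(count * BLOCK))
        return start

    def _store(self, name, start, count, payload):
        offset = start * BLOCK
        self.data[offset:offset + len(payload)] = payload
        self.entries[name] = (start, count, len(payload))

    def write(self, name, payload):
        need = self._blocks(len(payload))
        if name in self.entries:
            start, count, _ = self.entries[name]
            if need <= count:  # still fits in its blocks: overwrite in place
                self._store(name, start, count, payload)
                return
            self.free.append((start, count))  # outgrown: release the old blocks
        self._store(name, self._allocate(need), need, payload)

    def read(self, name):
        start, _, size = self.entries[name]
        return bytes(self.data[start * BLOCK:start * BLOCK + size])

pkg = Package()
pkg.write("a.txt", b"x" * 100)
pkg.write("b.txt", b"y" * 100)
pkg.write("a.txt", b"z" * 5000)  # outgrows one block, moves to the end
pkg.write("c.txt", b"c" * 10)    # lands in the block a.txt released
print(pkg.entries)
```

Small edits stay inside the file's own blocks, and only a file that outgrows its allocation is moved. The cost is the slack at the end of each file's last block.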

There is no way to insert bytes into a file other than the one you described. This is independent of the programming language. It's just how file systems work...
You can overwrite parts of the file, but only as long as you respect the byte count.

Have you thought about using a .zip file? I keep seeing formats out there where multiple files are stored as one, and the underlying file is really a zip file. The nice thing about this is that the zip library handles the low-level bit-tracking stuff for you.
A couple examples that come to mind:
A Word .docx file is really a zip (rename one to .zip, and you can open it -- it has whole folders in it)
The .xap file that Silverlight packages use is another one.

You can use managed shared memory backed by a memory-mapped file. You still need sufficient address space for the whole file, but you don't need to copy the whole file into memory. You can use most standard facilities with a shared memory allocator, though you will quickly find that specifying a custom allocator everywhere is a chore. The good news is that you don't need to implement it all yourself: Boost.Interprocess already has all the necessary facilities for both Unix and Windows.
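For illustration only, here is what editing a file through a plain memory mapping looks like with Python's standard mmap module. This is the operating-system facility that such libraries build on, not Boost.Interprocess itself, and it includes no allocator. The helper name is made up. Note that a mapping lets you change bytes in place but not insert them: the replacement must have the same length as the span it overwrites.

```python
import mmap
import os
import tempfile

def patch_mapped(path, offset, new_bytes):
    """Overwrite len(new_bytes) bytes at offset via a memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:  # length 0 maps the whole file
            view[offset:offset + len(new_bytes)] = new_bytes
            view.flush()

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789")
patch_mapped(path, 3, b"abc")
with open(path, "rb") as f:
    print(f.read())
os.remove(path)
```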

Related

File.Delete or File.Encrypt to wipe files?

Is it possible to use either File.Delete or File.Encrypt to shred files? Or do both functions not overwrite the actual content on disk?
And if they do, does this also work with the wear leveling of SSDs and similar techniques on other storage? Or is there another function that I should use instead?
I'm trying to improve an open source project which currently stores credentials in plaintext within a file. For some reason they are always written to that file (I don't know why Ansible does this, but for now I don't want to touch that part of the code; there may be a valid reason for it), and all I can do is delete that file afterwards. So is using File.Delete or File.Encrypt the right approach to purge that information off the disk?
Edit: If it is only possible using a native API and P/Invoke, I'm also fine with that. I'm not limited to .NET only, but I am limited to C#.
Edit2: To provide some context: The plaintext credentials are saved by the ansible internals as they are passed as a variable for the modules that get executed on the target windows host. This file is responsible for retrieving the variables again: https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/powershell/Ansible.ModuleUtils.Legacy.psm1#L287
https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/csharp/Ansible.Basic.cs#L373

There's a possibility that File.Encrypt would do more to help shred data than File.Delete (which definitely does nothing in that regard), but it won't be a reliable approach.
There's a lot going on at both the Operating System and Hardware level that's a couple of abstraction layers separated from the .NET code. For example, your file system may randomly decide to move the location where it's storing your file physically on the disk, so overwriting the place where you currently think the file is might not actually remove traces from where the file was stored previously. Even if you succeed in overwriting the right parts of the file, there's often residual signal on the disk itself that could be picked up by someone with the right equipment. Some file systems don't truly overwrite anything: they just add information every time a change happens, so you can always find out what the disk's contents were at any given point in time.
So if you legitimately cannot prevent a file getting saved, any attempt to truly erase it is going to be imperfect. If you're willing to accept imperfection and only want to mitigate the potential for problems somewhat, you can use a strategy like the ones you've found to try to overwrite the file with garbage data several times and hope for the best.
But I wouldn't be too quick to give up on solving the problem at its source. For example, Ansible's docs mention:
A great alternative to the password lookup plugin, if you don’t need to generate random passwords on a per-host basis, would be to use Vault in playbooks. Read the documentation there and consider using it first, it will be more desirable for most applications.
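If the file really cannot be avoided, the best-effort "overwrite with garbage, then delete" strategy mentioned above can be sketched like this (in Python for brevity; the function name and the default of three passes are assumptions). As explained above, this gives no guarantee on SSDs with wear leveling, on journaling or copy-on-write file systems, or when the file system has relocated the data.

```python
import os
import tempfile

def overwrite_and_delete(path, passes=3):
    """Best effort only: overwrite the file's bytes in place, then remove it."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))  # same length, so the file does not grow
            f.flush()
            os.fsync(f.fileno())       # ask the OS to push the data to the device
    os.remove(path)

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"user=demo password=not-a-real-secret")
overwrite_and_delete(path)
print(os.path.exists(path))
```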

should you include file in xml or have it in a two step process?

I have to implement a way to transfer some information (name, address, etc.) between many organizations (an unknown number), along with an unknown number of files associated with that information.
When I say unknown files, it could be an XML file of over 100 MB if they are embedded.
The transfer will be done over XML, so the question is:
Should I allow embedded files using base64 in elements, or have a two-step process, which would be:
1) Send me the XML file with a kind of pointer in an element, let's say file names.
2) Send the files with the specific file names given in the XML.
Or is there a third solution?
I have to deserialize the XML into an object, do some manipulation, then save it in a database.
(I currently have a throwaway prototype using the two-step process.)

Don't put the files in the XML; this would make it unwieldy. Instead, reference the file names from the XML and then zip the XML and files up into one bundle and send that.
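A sketch of such a bundle, in Python with only the standard library. The element names, the manifest.xml entry name and the helper functions are invented for illustration. The XML carries only the file names, and the zip keeps the XML and the files together in one transfer.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def build_bundle(record, attachments):
    """record: dict of text fields; attachments: dict of file name -> bytes."""
    root = ET.Element("record")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    files = ET.SubElement(root, "files")
    for name in attachments:
        ET.SubElement(files, "file", name=name)  # a reference, not the content
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.xml", ET.tostring(root, encoding="utf-8"))
        for name, data in attachments.items():
            zf.writestr(name, data)
    return buffer.getvalue()

def read_bundle(blob):
    """Return the parsed manifest and the referenced files as name -> bytes."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        root = ET.fromstring(zf.read("manifest.xml"))
        names = [f.get("name") for f in root.find("files")]
        return root, {name: zf.read(name) for name in names}

bundle = build_bundle(
    {"name": "Example Org", "address": "1 Example Street"},
    {"contract.pdf": b"%PDF-placeholder"},
)
manifest, files = read_bundle(bundle)
print(manifest.findtext("name"), sorted(files))
```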

Be sure to consider the expected evolution of the data, how change occurs across the parts of the document, and how many parties have an interest in the updates.
At the one end of the spectrum, the data will never change, the parts are all static, and updates aren't an issue to anyone. A one-shot broadcast of a single large file (or zipped set of files) is good enough. I'd lean toward a zipped archive with linked components over an embedding/encoding solution here.
The other end of the spectrum calls for a "third solution," as you say. The data changes frequently and independently, some parts of the massive document change while others remain constant, and many parties are interested in having access to the current version of the evolving data. Here, a linked representation of the various parts of the resource as references to network-shared parts, possibly independently version controlled, would have a major advantage. Linked data is a robust solution worth considering over monolithic distribution of a massive file.

How to resize a file, "trimming" its beginning?

I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then the deserialized part should be removed from the file, making the next object the "first" one.
From the standpoint of the file system, that would just mean copying the file header several bytes further on the disk, and then moving the "beginning of the file" pointer. The question is how to implement this in C#. Is it possible at all?

The easiest approach that I can see:
1) Stream out (like a log, dump it into a file). Note: you'd need some delimiters and a consistent format for your file, based on what your data is.
2) Later, stream in (just read the file from the start, in one go, and process it without removing anything).
That would work fine as FIFO (first in, first out).
So my suggestion is: don't try to optimize that by removing, skipping, etc.; rather, regroup and use more files.
3) If you worry about the scale of things, just partition the data into small enough files, e.g. 100 or 1,000 records each (it depends; do some calculations).
You may need some sort of "virtualizer" here, which maps files and keeps track of your "database" when it is spread over multiple files. The simplest option is to just use the file system and check file times etc., or add some basic code to improve on that.
However, I think you may have problems if you have to ensure "transactions", i.e. if things fail you need to keep track of where the file left off, retrace, etc.
That might be an issue, but you know best how critical it is. You can always work per file, with smaller files. If it fails, roll back and do the file again (or log the problem). If it succeeds, you can delete the file and go on like that.
This is a very "hand-made" approach, but it should get you going with a simple and not too demanding solution like the one you're describing.
I should probably add: you could also save yourself some trouble and use a portable database or something similar for that. This was purely based on the idea of hand-coding the simplest solution (we could probably come up with something smarter, but it being late, this is what I have :).
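The partitioned approach could be sketched as below (Python for brevity). The class name, the ".seg" naming scheme and the JSON-lines record format are assumptions for the example, and it handles neither concurrency nor crash recovery. Records are appended to the newest segment, and consuming works one whole segment at a time, after which that file is deleted, so nothing ever has to be trimmed from the front of a file.

```python
import json
import os
import tempfile

class SegmentedQueue:
    def __init__(self, directory, records_per_segment=1000):
        self.directory = directory
        self.limit = records_per_segment
        os.makedirs(directory, exist_ok=True)

    def _segments(self):
        # Zero-padded names sort in creation order.
        return sorted(n for n in os.listdir(self.directory) if n.endswith(".seg"))

    def _count(self, name):
        # Counting lines on every push is slow; a real version would cache this.
        with open(os.path.join(self.directory, name), "r", encoding="utf-8") as f:
            return sum(1 for _ in f)

    def push(self, record):
        """Append one record, starting a new segment when the current one is full."""
        segments = self._segments()
        if segments and self._count(segments[-1]) < self.limit:
            name = segments[-1]
        else:
            next_id = int(segments[-1][:-4]) + 1 if segments else 1
            name = f"{next_id:08d}.seg"
        with open(os.path.join(self.directory, name), "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def pop_segment(self):
        """Return every record of the oldest segment, then delete that file."""
        segments = self._segments()
        if not segments:
            return []
        path = os.path.join(self.directory, segments[0])
        with open(path, "r", encoding="utf-8") as f:
            records = [json.loads(line) for line in f]
        os.remove(path)
        return records

with tempfile.TemporaryDirectory() as tmp:
    queue = SegmentedQueue(tmp, records_per_segment=2)
    for item in ["a", "b", "c"]:
        queue.push(item)
    print(queue.pop_segment(), queue.pop_segment())
```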

Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem. But a linear file is totally inappropriate for representing a FIFO queue.

What is faster Renaming files, Changing an attribute in them or Moving them between folders

I'm developing a file system manager module, and wondering what will be a more efficient approach.
This will be on a Windows machine with NTFS.
The module will need to notify a different module about new files created in a specific directory and also maintain some kind of state for these files, so that already processed files can be deleted and, in case of failure, the unprocessed files can be processed again.
I thought of either moving files between directories as their state changes, or renaming files according to their state or changing the files attributes as a sign of their state.
I'm wondering what would be the most efficient approach, considering the possibility of a large quantity of files being created over a short time span.

I can't fully answer your question, but give some general hints. Most important of all, the answer to your question might largely depend on the underlying file system (NTFS, FAT32, etc.).
Renaming or moving a file on the same partition generally means that directory entries are changed. The actual file contents need not be touched. Once you move a file to a different partition or hard disk drive, the actual file contents must be copied, too, which takes far more time.
That all being said, I would generally assume a rename to be slightly quicker than moving a file to another directory (on the same partition), since only one directory is affected instead of two. I'm also not quite sure what you mean by changing a file "attribute" -- however, if you're talking about e.g. setting the "archive" flag of a file, or making the file "read-only", that might again be slightly faster than a rename, if the directory entry can be changed in-place instead of being replaced with a new one of a different size.
Again: Do take my assumptions with caution, since this all depends on the particular file system. (For example, hiding a file on a UNIX file system usually means renaming it -- prefixing the name with a . --, but the same is not true for typical DOS/Windows file systems.)

Renaming took: 1498.8166
ApplyAttribute took: 340.5407
Transfer took: 2527.6837
Transfer took: 3933.4944
ApplyAttribute took: 419.635
Renaming took: 1384.0079
Tested with 1000 files.
Ran the tests twice in order to ensure no caching is in place.
EDITED: a nasty bug was fixed, sorry.
Go with attributes.
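The code behind these numbers is not shown. A comparable measurement might look like the following Python sketch, which times the three approaches on temporary files and reports milliseconds. The function name, file names and the read-only flag used as the "attribute" are assumptions. Results will differ by machine and file system, so it is not expected to reproduce the figures above.

```python
import os
import stat
import tempfile
import time

def time_state_changes(count=1000):
    """Return elapsed milliseconds for three ways of marking files as processed."""
    timings = {}
    with tempfile.TemporaryDirectory() as root:
        inbox = os.path.join(root, "inbox")
        done = os.path.join(root, "done")
        os.mkdir(inbox)
        os.mkdir(done)
        paths = []
        for i in range(count):
            path = os.path.join(inbox, f"file{i}.dat")
            with open(path, "wb") as f:
                f.write(b"payload")
            paths.append(path)

        # 1) Rename within the same directory.
        start = time.perf_counter()
        renamed = []
        for path in paths:
            os.rename(path, path + ".processed")
            renamed.append(path + ".processed")
        timings["rename"] = (time.perf_counter() - start) * 1000

        # 2) Change an attribute (here: mark the file read-only).
        start = time.perf_counter()
        for path in renamed:
            os.chmod(path, stat.S_IREAD)
        timings["attribute"] = (time.perf_counter() - start) * 1000

        # Make the files writable again so they can be moved and cleaned up.
        for path in renamed:
            os.chmod(path, stat.S_IREAD | stat.S_IWRITE)

        # 3) Move to another directory on the same volume.
        start = time.perf_counter()
        for path in renamed:
            os.rename(path, os.path.join(done, os.path.basename(path)))
        timings["move"] = (time.perf_counter() - start) * 1000
    return timings

print(time_state_changes(200))
```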

Why do you want to store this information directly in the filesystem? I would recommend using a SQL database to keep track of the files. That way, you avoid modifying the filesystem, it's probably going to be faster, and you can easily have more information about the files if you need them.
Also, having one folder with a large number of files might be slow by itself, so you might consider spreading the files over more folders, if that makes sense for you.
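Tracking state in a database instead of in the file system might look like this minimal sketch, shown with Python's built-in sqlite3 module. The table layout, state names and function names are assumptions. Files that were registered but never marked as processed are the ones to retry after a failure.

```python
import sqlite3

def open_tracker(path=":memory:"):
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, state TEXT NOT NULL DEFAULT 'new')"
    )
    return con

def register(con, file_path):
    """Record a newly created file; registering it twice is harmless."""
    with con:
        con.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (file_path,))

def mark_processed(con, file_path):
    with con:
        con.execute(
            "UPDATE files SET state = 'processed' WHERE path = ?", (file_path,)
        )

def pending(con):
    """Files still waiting to be processed, e.g. after a crash."""
    rows = con.execute(
        "SELECT path FROM files WHERE state = 'new' ORDER BY path"
    )
    return [row[0] for row in rows]

tracker = open_tracker()
for name in ["a.dat", "b.dat", "c.dat"]:
    register(tracker, name)
mark_processed(tracker, "b.dat")
print(pending(tracker))
```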

how to read files from uncompressed zip in c#?

I'm creating a PDA app and I need to upload/download a lot of small files, and my idea is to gather them in an uncompressed zip file.
The question is: is it a good idea to read those files from the zip without separating them? How can I do so? Or is it better to unzip them? Since the files are not compressed, my simple mind suggests that reading them from the zip is more or less as efficient as reading them directly from the file system...
Thanks for your time!

Since there are two different Open-source libraries (SharpZipLib and DotNetZip Library) to handle writing & extracting files from a zip file, why worry about doing it yourself?

ewww - don't use J#.
The DotNetZip library, as of v1.7, runs on the .NET Compact Framework 2.0 and above. It can handle reading or writing compressed or uncompressed files within a ZIP archive. The source distribution includes a CF example app. It's really simple.

Sounds as if you want to use the archive to group your files.
From the point of view of reading the files, it makes very little difference whether the files are handled one way or the other. You would need to implement the ability to read zip files, though. Even if you use a lib like James Curran suggested, it means additional work, which can mean additional sources of error.
From the point of view of uploading the files, it makes more sense: the uploader could gather all the files needed and would have to take care of only one single upload. This reduces overhead as well as error handling (if one upload fails, do you have to delete all files of this group already uploaded?).
As for the efficiency of reading them from the archive vs. reading them directly from the disk: the difference should be minimal. You (or your zip library) need to parse the zip directory structure once, which is pretty straightforward. The rest is reading part of a file into memory vs. reading a file into memory.
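To make the "reading part of a file" point concrete, here is a small Python sketch that packs files without compression and reads one member straight out of the archive. It is an illustration of the idea rather than DotNetZip or SharpZipLib code, and the helper names are invented. Because the entries are stored rather than deflated, the bytes in the archive are the file's bytes, and the library only has to look up the member and read it.

```python
import io
import zipfile

def pack_stored(files):
    """Build an in-memory zip whose entries are stored without compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buffer.getvalue()

def read_member(blob, name):
    """Read one entry directly from the archive, without extracting to disk."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        info = zf.getinfo(name)  # looked up in the archive's central directory
        with zf.open(info) as member:
            return member.read()

archive = pack_stored({"a.txt": b"hello", "b.bin": b"\x00\x01\x02"})
print(read_member(archive, "a.txt"))
```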
