I want a fast way in C# to remove blocks of bytes at different places from a binary file of between 500 MB and 1 GB in size. The start offsets and the lengths of the bytes to be removed are stored in arrays:
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
This is a piece of my code, and it will not work correctly unless I set the buffer size to 1:
while (true)
{
    if (rdiDataOffset.Contains((int)fsr.Position))
    {
        int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
        int oldRFSRPosition = (int)fsr.Position;
        size = rdiDataSize[idxval];
        fsr.Seek(size, SeekOrigin.Current);
    }
    int bufferSize = size == 0 ? 2048 : size;
    if ((size > 0) && (bufferSize > size)) bufferSize = size;
    if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
    byte[] buffer = new byte[bufferSize];
    int nofbytes = fsr.Read(buffer, 0, buffer.Length);
    fsr.Flush();
    if (nofbytes < 1)
    {
        break;
    }
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
A simple algorithm for doing this uses a temp file (it could be done in place as well, but that is riskier if something goes wrong):
1. Create a new file and call SetLength to set the stream size (if this is too slow you can Interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
2. Sort your removal list in ascending order.
3. Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
4. Write that content to the new file (you will likely need to do this in chunks).
5. Skip over the data not being copied.
6. Repeat from #3 until done.
You now have two files - the old one and the new one ... replace as necessary. If this is really critical data you might want to look at a transactional approach (either one you implement or something like NTFS transactions).
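A minimal sketch of steps 3-6, assuming the rdiDataOffset/rdiDataSize arrays from the question are parallel and already sorted by offset (sourcePath and tempPath are placeholder names):

static void RemoveBlocks(string sourcePath, string tempPath, int[] rdiDataOffset, int[] rdiDataSize)
{
    using (var src = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (var dst = new FileStream(tempPath, FileMode.CreateNew, FileAccess.Write, FileShare.None))
    {
        long removed = 0;
        foreach (int s in rdiDataSize) removed += s;
        dst.SetLength(src.Length - removed);        // reserve the final size up front (step 1)

        byte[] buffer = new byte[1024 * 1024];      // 1 MB copy buffer
        for (int i = 0; i <= rdiDataOffset.Length; i++)
        {
            // copy everything up to the next removal point (or to end of file)
            long end = i < rdiDataOffset.Length ? rdiDataOffset[i] : src.Length;
            long remaining = end - src.Position;
            while (remaining > 0)
            {
                int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, remaining));
                if (read == 0) throw new EndOfStreamException();
                dst.Write(buffer, 0, read);
                remaining -= read;
            }
            // skip the block that is being removed
            if (i < rdiDataOffset.Length)
                src.Seek(rdiDataSize[i], SeekOrigin.Current);
        }
    }
    // then swap the files, e.g. with File.Replace(tempPath, sourcePath, backupPath)
}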
Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks - then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, zeroed-out extents are transparently skipped.
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero out sections of that file is to use the DeviceIoControl Windows API. It is arcane and requires p/invoke, but if you go this route, you'll surely hide the ugliness behind nice pretty function calls.
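A rough sketch of that plumbing, as an illustration only: the control codes are the documented FSCTL_SET_SPARSE and FSCTL_SET_ZERO_DATA values, but the wrapper class and method names are my own, and the file must be opened with write access.

using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class SparseFileHelper
{
    const uint FSCTL_SET_SPARSE = 0x000900C4;
    const uint FSCTL_SET_ZERO_DATA = 0x000980C8;

    [StructLayout(LayoutKind.Sequential)]
    struct FILE_ZERO_DATA_INFORMATION
    {
        public long FileOffset;       // first byte of the range to zero
        public long BeyondFinalZero;  // first byte after the range
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint code,
        IntPtr inBuffer, uint inSize, IntPtr outBuffer, uint outSize,
        out uint bytesReturned, IntPtr overlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint code,
        ref FILE_ZERO_DATA_INFORMATION inBuffer, uint inSize, IntPtr outBuffer, uint outSize,
        out uint bytesReturned, IntPtr overlapped);

    // Marks the file as sparse; the stream must be opened with write access.
    public static void MarkSparse(FileStream fs)
    {
        uint returned;
        if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_SPARSE,
                IntPtr.Zero, 0, IntPtr.Zero, 0, out returned, IntPtr.Zero))
            throw new Win32Exception();
    }

    // Deallocates (zeroes) the given byte range without rewriting the file.
    public static void ZeroRange(FileStream fs, long offset, long length)
    {
        var info = new FILE_ZERO_DATA_INFORMATION { FileOffset = offset, BeyondFinalZero = offset + length };
        uint returned;
        if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_ZERO_DATA,
                ref info, (uint)Marshal.SizeOf(typeof(FILE_ZERO_DATA_INFORMATION)),
                IntPtr.Zero, 0, out returned, IntPtr.Zero))
            throw new Win32Exception();
    }
}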
There are some issues to be aware of. For example, if the file is moved to a non-ntfs volume and then back, the sparseness of the file can disappear - so you should program defensively.
Also, a sparse file can appear to be larger than it really is - complicating tasks involving disk provisioning. A 5 GB sparse file that has been completely zeroed out still counts 5 GB towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.
Related
I am writing a console application which iterates through a binary tree and searches for new or changed files based on their MD5 checksums.
The whole process is acceptably fast (14 seconds for ~70,000 files), but generating the checksums takes about 5 minutes, which is far too slow...
Any suggestions for improving this process? My hash function is the following:
private string getMD5(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    {
        if (File.Exists(filename))
        {
            try
            {
                var buffer = md5.ComputeHash(File.ReadAllBytes(filename));
                var sb = new StringBuilder();
                for (var i = 0; i < buffer.Length; i++)
                {
                    sb.Append(buffer[i].ToString("x2"));
                }
                return sb.ToString();
            }
            catch (Exception)
            {
                Program.logger.log("Error while creating checksum!", Program.logger.LOG_ERROR);
                return "";
            }
        }
        else
        {
            return "";
        }
    }
}
Well, the accepted answer is not valid, because there are, of course, ways to improve your code's performance. It is valid for some of its other points, however.
The main bottleneck here, apart from disk I/O, is memory allocation. Here are some thoughts that should improve speed:
Do not read the entire file into memory for the calculation; it is slow, and it produces a lot of memory pressure via LOH objects. Instead, open the file as a stream and calculate the hash in chunks.
The reason you see a slowdown when using the ComputeHash stream override is that internally it uses a very small buffer (4 KB), so choose an appropriate buffer size (256 KB or more; the optimal value is to be found by experimenting).
Use the TransformBlock and TransformFinalBlock functions to calculate the hash value. You can pass null for the outputBuffer parameter.
Reuse that buffer for the following files' hash calculations, so there is no need for additional allocations.
Additionally, you can reuse the MD5CryptoServiceProvider, but the benefits are questionable.
And lastly, you can apply an async pattern to reading chunks from the stream, so the OS reads the next chunk from disk while you are calculating the partial hash of the previous chunk. Of course such code is more difficult to write, and you'll need at least two buffers (reuse them as well), but it can have a great impact on speed.
As a minor improvement, do not check for file existence. I believe your function is called from some enumeration, and there is very little chance that the file has been deleted in the meantime.
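A minimal sketch of the first four points above (the 256 KB buffer size and the method name are my own choices; the shared buffer also makes this version non-thread-safe):

private static readonly byte[] hashBuffer = new byte[256 * 1024]; // reused across files

private static string GetMD5Chunked(string filename)
{
    using (var md5 = new MD5CryptoServiceProvider())
    using (var fs = new FileStream(filename, FileMode.Open, FileAccess.Read,
                                   FileShare.Read, 4096, FileOptions.SequentialScan))
    {
        int read;
        while ((read = fs.Read(hashBuffer, 0, hashBuffer.Length)) > 0)
            md5.TransformBlock(hashBuffer, 0, read, null, 0); // null output buffer is allowed
        md5.TransformFinalBlock(hashBuffer, 0, 0);

        var sb = new StringBuilder(32);
        foreach (byte b in md5.Hash)
            sb.Append(b.ToString("x2"));
        return sb.ToString();
    }
}

If you later parallelize over small files, give each worker its own buffer instead of the shared static one.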
All of the above is valid for medium to large files. If you instead have a lot of very small files, you can speed up the calculation by processing files in parallel. Actually, parallelization can also help with large files, but that is something to be measured.
And finally, if collisions don't bother you too much, you can choose a less expensive hash algorithm, CRC for example.
In order to create the hash, you have to read every last byte of the file. So this operation is disk-bound, not CPU-bound, and scales proportionally to the size of the files. Multithreading will not help.
Unless the FS can somehow calculate and store the hash for you, there is just no way to speed this up. You are dependent on what the FS does for you to track changes.
Generally, programs that check for "changed files" (like backup routines) do not calculate the hash value for exactly that reason. They may still calculate and store it for validation purposes, but that is it.
Unless the user does some serious (NTFS-driver-loading-level) sabotage, the "last changed" date together with the file size is enough to detect changes. Maybe also check the archive bit, but that one is rarely used nowadays.
A minor improvement for these kinds of scenarios (list files and process them) is using EnumerateFiles rather than listing files. But at 14 seconds of listing vs. 5 minutes of processing, that will just not have any relevant effect.
In Shredding files in .NET it is recommended to use Eraser or this code here on CodeProject to securely erase a file in .NET.
I was trying to make my own method of doing so, as the code from CodeProject had some problems for me. Here's what I came up with:
public static void secureDelete(string file, bool deleteFile = true)
{
    // rnd is assumed to be a Random instance declared elsewhere in the class
    string nfName = "deleted" + rnd.Next(1000000000, 2147483647) + ".del";
    string fName = Path.GetFileName(file);
    System.IO.File.Move(file, file.Replace(fName, nfName));
    file = file.Replace(fName, nfName);
    int overWritten = 0;
    while (overWritten <= 7)
    {
        byte[] data = new byte[1 * 1024 * 1024];
        rnd.NextBytes(data);
        File.WriteAllBytes(file, data);
        overWritten += 1;
    }
    if (deleteFile) { File.Delete(file); }
}
It seems to work fine. It renames the file randomly and then overwrites it with 1 MB of random data 7 times. However, I was wondering how safe it actually is, and if there was any way I could make it safer?
A file system, especially when accessed through a higher-level API such as the ones found in System.IO, is so many levels of abstraction above the actual storage implementation that this approach makes little sense for modern drives.
To be clear: the CodeProject article, which promotes overwriting a file by name multiple times, is absolute nonsense - for SSDs at least. There is no guarantee whatsoever that writing to a file at some path multiple times writes to the same physical location on disk every time.
Of course, opening a file with read-write access and overwriting it from the beginning, conceptually writes to the same "location". But that location is pretty abstract.
See it like this: hard disks, but especially solid state drives, might take a write, such as "set byte N of cluster M to O", and actually write an entire new cluster to an entirely different location on the drive, to prolong the drive's lifetime (as repeated writes to the same memory cells may damage the drive).
From Coding for SSDs – Part 3: Pages, Blocks, and the Flash Translation Layer | Code Capsule:
Pages cannot be overwritten
A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.
This means that somewhere on the drive, the original data is still readable, namely in the cluster M to which a write was requested. That is, until it is overwritten. The cluster is now marked as "free", but you'll need very low-level access to the disk to access that cluster in order to overwrite it, and I'm not sure that's possible with SSDs.
Even if you would overwrite the entire SSD or hard drive multiple times, chances are that some of your very private data is hidden in a now defunct sector or page on the disk or SSD, because at the moment of overwriting or clearing it the drive determined that location to be defective. A forensics team will be able to read this data (albeit damaged). So, if you have data on a hard drive that can be used against you: toss the drive into a fire.
See also Get file offset on disk/cluster number for some more (links to) information about lower-level file system APIs.
But all of this is to be taken with quite a grain of salt, as all of this is hearsay and I have no actual experience with this level of disk access.
EDIT 1:
I'm building a torrent application, downloading from different clients simultaneously. Each download represents a portion of my file, and different clients have different portions.
After a download completes, I need to know which portion to fetch next by finding the "empty" portions in my file.
One way to create a file with a fixed size:
File.WriteAllBytes(@"C:\upload\BigFile.rar", new byte[Big Size]);
My portions array that represents my file as portions:
BitArray TorrentPartsState = new BitArray(10);
For example:
File size is 100.
TorrentPartsState[0] = true;  // means that in my file, positions 0 through 9 do **not** need to be filled in.
TorrentPartsState[1] = false; // means that in my file, positions 10 through 19 still **need** to be filled in.
I'm searching for an effective way to persist what the BitArray contains, even if the computer/application is shut down. One way I thought of is an XML file, updated each time a portion completes.
I don't think that is a smart and effective solution. Any idea for another one?
It sounds like you know the following when you start a transfer:
The size of the final file.
The (maximum) number of streams you intend to use for the file.
Create the output file and allocate the required space.
Create a second "control" file with a related filename, e.g. add your own extension. In that file, maintain an array of stream status structures corresponding to the network streams. Each status consists of the starting offset and the number of bytes transferred. Periodically flush the stream buffers and then update the control file to reflect the progress made and committed.
Variations on the theme:
The control file can define segments to be transferred, e.g. 16MB chunks, and treated as a work queue by threads that look for an incomplete segment and a suitable server from which to retrieve it.
The control file could be a separate fork within the result file. (Who am I kidding?)
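A rough sketch of what such a control file record might look like, purely illustrative (the field and method names are my own assumptions):

struct StreamStatus
{
    public long StartOffset;        // where this stream writes into the output file
    public long BytesTransferred;   // how much has been committed so far
}

static void SaveControlFile(string controlPath, StreamStatus[] statuses)
{
    // A real implementation would write to a temp file and rename it,
    // so a crash mid-write cannot corrupt the existing control file.
    using (var fs = new FileStream(controlPath, FileMode.Create, FileAccess.Write))
    using (var bw = new BinaryWriter(fs))
    {
        foreach (var s in statuses)
        {
            bw.Write(s.StartOffset);
            bw.Write(s.BytesTransferred);
        }
    }
}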
You could use a BitArray (in System.Collections).
Then, when you visit an offset in the file, you can set the BitArray at that offset to true.
So for your 10,000 byte file:
BitArray ba = new BitArray(10000);
// Visited offset, mark in the BitArray
ba[4] = true;
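For the persistence part of the question, a minimal sketch (the state-file path is a placeholder) of saving and restoring the BitArray:

static void SaveState(BitArray parts, string statePath)
{
    byte[] bytes = new byte[(parts.Length + 7) / 8];
    parts.CopyTo(bytes, 0);            // pack the bits into bytes
    File.WriteAllBytes(statePath, bytes);
}

static BitArray LoadState(string statePath, int partCount)
{
    var parts = new BitArray(File.ReadAllBytes(statePath));
    parts.Length = partCount;          // drop the padding bits in the last byte
    return parts;
}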
Implement a file system (like on a disk) in your file - just use something simple; there should be something available in the FOSS arena.
Motivated by this answer, I was wondering what's going on under the hood if one uses lots of FileStream.Seek(-1) calls.
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
    fs.Seek(0, SeekOrigin.End);
    int newLines = 0;
    while (newLines < 3)
    {
        fs.Seek(-1, SeekOrigin.Current);
        newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
        fs.Seek(-1, SeekOrigin.Current);
    }
    byte[] data = new byte[fs.Length - fs.Position];
    fs.Read(data, 0, data.Length);
}
Personally I would have read like 2048 bytes into a buffer and searched that buffer for the char.
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that on the one hand most people agree with Windows doing good caching, but on the other hand every answer to "reading file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs. backward doesn't usually make much difference. The file data is read into the file system cache after the first read, and you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
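For comparison, a minimal sketch of the buffered approach the question alludes to (the 64 KB chunk size, and the assumption that the last three lines fit in one chunk, are mine):

using (var fs = File.OpenRead(filePath))
{
    int chunk = (int)Math.Min(64 * 1024, fs.Length);
    fs.Seek(-chunk, SeekOrigin.End);
    byte[] tail = new byte[chunk];
    int read = fs.Read(tail, 0, tail.Length);

    // scan the in-memory buffer backwards for the third '\r'
    int newLines = 0, start = 0;
    for (int i = read - 1; i >= 0; i--)
    {
        if (tail[i] == 13 && ++newLines == 3) { start = i + 1; break; }
    }
    // tail[start .. read-1] now holds roughly the last three lines,
    // found with a single Read instead of thousands of Seek/ReadByte pairs
}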
Here is a pointer on File Caching in Windows
The behavior may also depend on where the file physically resides (hard disk, network, etc.) as well as on local configuration/optimization.
An also important source of information is the CreateFile API documentation: CreateFile Function
There is a good section named "Caching Behavior" that tells us how you can influence file caching, at least in the unmanaged world.
I'm working on a system that requires high file I/O performance (with C#).
Basically, I'm filling up large files (~100 MB) from the start of the file to the end.
Every ~5 seconds I'm adding ~5 MB to the file (sequentially from the start of the file), and after every bulk write I'm flushing the stream.
Every few minutes I need to update a structure which I write at the end of the file (some kind of metadata).
When flushing each of the bulk writes I have no performance issue.
However, when updating the metadata at the end of the file I get really low performance.
My guess is that when creating the file (which also needs to be done extra fast), the file doesn't really allocate the entire 100 MB on disk, and when I flush the metadata it has to allocate all the space up to the end of the file.
Any idea how I can overcome this problem?
Thanks a lot!
From comment:
Generally speaking, the code is as follows. First the file is opened:
m_Stream = new FileStream(filename,
FileMode.CreateNew,
FileAccess.Write,
FileShare.Write, 8192, false);
m_Stream.SetLength(100*1024*1024);
Every few seconds I'm writing ~5MB.
m_Stream.Seek(m_LastPosition, SeekOrigin.Begin);
m_Stream.Write(buffer, 0, buffer.Length);
m_Stream.Flush();
m_LastPosition += buffer.Length; // HH: guessed the +=
m_Stream.Seek(m_MetaDataSize, SeekOrigin.End);
m_Stream.Write(metadata, 0, metadata.Length);
m_Stream.Flush(); // Takes too long on the first time(~1 sec).
As suggested above, would it not make sense (assuming you must have the metadata at the end of the file) to write that first?
That would do two things (assuming a non-sparse file)...
1. Allocate the total space for the entire file.
2. Make any following write operations a little faster, as the space is ready and waiting.
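A minimal sketch of that idea, reusing the names from the question (note: I've placed the metadata inside the preallocated 100 MB via a negative seek, which is an assumption about the intended layout):

m_Stream = new FileStream(filename, FileMode.CreateNew,
                          FileAccess.Write, FileShare.Write, 8192, false);
m_Stream.SetLength(100 * 1024 * 1024);

// Write the metadata once, immediately after creating the file, so NTFS pays
// the zero-fill / valid-data-length cost up front instead of on the first
// metadata flush minutes into the run.
m_Stream.Seek(-metadata.Length, SeekOrigin.End);
m_Stream.Write(metadata, 0, metadata.Length);
m_Stream.Flush();

// Then carry on with the sequential ~5 MB bulk writes from the start.
m_Stream.Seek(0, SeekOrigin.Begin);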
Can you not do this asynchronously?
At least the application can then move on to other things.
Have you tried the AppendAllText method?
Your question isn't totally clear, but my guess is you create a file, write 5MB, then seek to 100MB and write the metadata, then seek back to 5MB and write another 5MB and so on.
If that is the case, this is a filesystem problem. When you extend the file, NTFS has to fill the gap in with something. As you say, the file is not allocated until you write to it. The first time you write the metadata the file is only 5MB long, so when you write the metadata NTFS has to allocate and write out 95MB of zeros before it writes the metadata. Upsettingly I think it also does this synchronously, so you don't even win using overlapped IO.
How about using the BufferedStream?
http://msdn.microsoft.com/en-us/library/system.io.bufferedstream(v=VS.100).aspx