Is overwriting a file multiple times enough to erase its data? - c#

In Shredding files in .NET it is recommended to use Eraser or this code here on CodeProject to securely erase a file in .NET.
I was trying to make my own method of doing so, as the code from CodeProject had some problems for me. Here's what I came up with:
// rnd is assumed to be a static Random field, e.g.:
// private static readonly Random rnd = new Random();
public static void secureDelete(string file, bool deleteFile = true)
{
    // rename the file randomly first, so the original name is not recoverable from the directory entry
    string nfName = "deleted" + rnd.Next(1000000000, 2147483647) + ".del";
    string fName = Path.GetFileName(file);
    System.IO.File.Move(file, file.Replace(fName, nfName));
    file = file.Replace(fName, nfName);

    // overwrite the renamed file with 1 MB of random data, 7 times
    int overWritten = 0;
    while (overWritten < 7)
    {
        byte[] data = new byte[1 * 1024 * 1024];
        rnd.NextBytes(data);
        File.WriteAllBytes(file, data);
        overWritten += 1;
    }
    if (deleteFile) { File.Delete(file); }
}
It seems to work fine. It renames the file randomly and then overwrites it with 1 MB of random data 7 times. However, I was wondering how safe it actually is, and whether there is any way I could make it safer?

A file system, especially when accessed through a higher-level API such as the ones found in System.IO, is so many levels of abstraction above the actual storage implementation that this approach makes little sense for modern drives.
To be clear: the CodeProject article, which promotes overwriting a file by name multiple times, is absolute nonsense - for SSDs at least. There is no guarantee whatsoever that writing to a file at some path multiple times writes to the same physical location on disk every time.
Of course, opening a file with read-write access and overwriting it from the beginning conceptually writes to the same "location". But that location is pretty abstract.
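For illustration only, here is a minimal sketch of that "overwrite in place" idea: open the existing file for writing and overwrite exactly its current length, instead of recreating it with File.WriteAllBytes. The 64 KB buffer and the use of RNGCryptoServiceProvider (System.Security.Cryptography) are my choices, and as explained below, even this gives no guarantee about physical locations on an SSD:

using (var fs = new FileStream(file, FileMode.Open, FileAccess.Write, FileShare.None,
                               4096, FileOptions.WriteThrough))
{
    var rng = new RNGCryptoServiceProvider();   // cryptographic RNG instead of System.Random
    byte[] block = new byte[64 * 1024];
    long remaining = fs.Length;                 // overwrite the file's actual length, not a fixed 1 MB
    while (remaining > 0)
    {
        rng.GetBytes(block);
        int count = (int)Math.Min(block.Length, remaining);
        fs.Write(block, 0, count);
        remaining -= count;
    }
    fs.Flush(true);                             // flushToDisk: ask the OS to push the data to the device
}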
Look at it this way: hard disks, but especially solid-state drives, might take a write such as "set byte N of cluster M to O" and actually write an entire new cluster to an entirely different location on the drive, to prolong the drive's lifetime (as repeated writes to the same memory cells may damage the drive).
From Coding for SSDs – Part 3: Pages, Blocks, and the Flash Translation Layer | Code Capsule:
Pages cannot be overwritten
A NAND-flash page can be written to only if it is in the “free” state. When data is changed, the content of the page is copied into an internal register, the data is updated, and the new version is stored in a “free” page, an operation called “read-modify-write”. The data is not updated in-place, as the “free” page is a different page than the page that originally contained the data. Once the data is persisted to the drive, the original page is marked as being “stale”, and will remain as such until it is erased.
This means that somewhere on the drive, the original data is still readable, namely in the cluster M to which a write was requested. That is, until it is overwritten. The cluster is now marked as "free", but you'll need very low-level access to the disk to access that cluster in order to overwrite it, and I'm not sure that's possible with SSDs.
Even if you were to overwrite the entire SSD or hard drive multiple times, chances are that some of your very private data is hidden in a now defunct sector or page on the disk or SSD, because at the moment of overwriting or clearing it the drive determined that location to be defective. A forensics team will be able to read this data (albeit damaged). So, if you have data on a hard drive that can be used against you: toss the drive into a fire.
See also Get file offset on disk/cluster number for some more (links to) information about lower-level file system APIs.
But take all of this with quite a grain of salt: it is largely second-hand knowledge, and I have no actual experience with this level of disk access.

Related

C# .NET troubles reading all of a process's memory

I'm messing around with a scanning engine I'm working on and I'm trying to read the memory of a process. My code is below (it's a little messy) but for some reason if I read the memory of an application in different states, or after it has a lot of things loaded into memory, I get the same memory size no matter what. Are my entry point addresses and length incorrect?
If I use a memory editor I don't get the same results I do with this.
// ReadProcessMemory is assumed to be declared via [DllImport("kernel32.dll")]
Process process = Process.GetProcessesByName(processName)[0];
List<byte[]> moduleMemory = new List<byte[]>();

foreach (ProcessModule pm in process.Modules)
{
    //MessageBox.Show(pm.FileName);
    byte[] temp = new byte[pm.ModuleMemorySize];
    int read;
    if (ReadProcessMemory(process.Handle, pm.BaseAddress, temp, temp.Length, out read))
    {
        moduleMemory.Add(temp);
    }
}

//string d = Encoding.Default.GetString(moduleMemory[0]);
MessageBox.Show("Size: " + moduleMemory[0].Length);
//string d = Encoding.Default.GetString(moduleMemory[0]);
MessageBox.Show("Size: " + moduleMemory[0].Length);
Your problem is probably caused by the fact that the Process class caches values:

The process component obtains information about a group of properties all at once. After the Process component has obtained information about one member of any group, it will cache the values for the other properties in that group and not obtain new information about the other members of the group until you call the Refresh method. Therefore, a property value is not guaranteed to be any newer than the last call to the Refresh method. The group breakdowns are operating-system dependent.

Therefore, after the target process loads some additional modules, the process instance will still return the old values. Calling process.Refresh() should update all cached values and fix the issue.
As far as I can see, this code does nothing more than read the memory layout of the executable module (the .exe file) the process was created from, so it is no wonder you get the same size every time.
I assume you actually want to read the "operational" memory of the process. If so, you should have a look at this discussion.
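Putting both suggestions together, a rough sketch might look like the following (the class and method names are mine, and the ReadProcessMemory declaration mirrors the signature the question already uses; you still need sufficient rights to open the target process):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class ModuleDumper
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool ReadProcessMemory(IntPtr hProcess, IntPtr lpBaseAddress,
        byte[] lpBuffer, int dwSize, out int lpNumberOfBytesRead);

    public static List<byte[]> DumpModules(string processName)
    {
        Process process = Process.GetProcessesByName(processName)[0];
        process.Refresh();                                  // discard cached module information

        var moduleMemory = new List<byte[]>();
        foreach (ProcessModule pm in process.Modules)       // every loaded module, not just the .exe
        {
            byte[] temp = new byte[pm.ModuleMemorySize];
            int read;
            if (ReadProcessMemory(process.Handle, pm.BaseAddress, temp, temp.Length, out read))
                moduleMemory.Add(temp);
        }
        return moduleMemory;
    }
}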

C# - remove blocks of bytes in large binary files

I want a fast way in C# to remove blocks of bytes at different places in a binary file of between 500 MB and 1 GB in size. The start offsets and lengths of the bytes to be removed are stored in arrays:
int[] rdiDataOffset= {511,15423,21047};
int[] rdiDataSize={102400,7168,512};
EDIT:
this is a piece of my code, and it will not work correctly unless I set the buffer size to 1:
// fsr is a FileStream opened over the source file; size is declared (starting at 0) outside the loop
while (true)
{
    if (rdiDataOffset.Contains((int)fsr.Position))
    {
        int idxval = Array.IndexOf(rdiDataOffset, (int)fsr.Position, 0, rdiDataOffset.Length);
        int oldRFSRPosition = (int)fsr.Position;
        size = rdiDataSize[idxval];
        fsr.Seek(size, SeekOrigin.Current);
    }
    int bufferSize = size == 0 ? 2048 : size;
    if ((size > 0) && (bufferSize > size)) bufferSize = size;
    if (bufferSize > (fsr.Length - fsr.Position)) bufferSize = (int)(fsr.Length - fsr.Position);
    byte[] buffer = new byte[bufferSize];
    int nofbytes = fsr.Read(buffer, 0, buffer.Length);
    fsr.Flush();
    if (nofbytes < 1)
    {
        break;
    }
}
No common file system provides an efficient way to remove chunks from the middle of an existing file (only truncate from the end). You'll have to copy all the data after the removal back to the appropriate new location.
Here is a simple algorithm for doing this using a temp file (it could be done in place as well, but that is riskier if something goes wrong); a sketch of these steps follows the list.
1. Create a new file and call SetLength to set the stream size (if this is too slow you can Interop to SetFileValidData). This ensures that you have room for your temp file while you are doing the copy.
2. Sort your removal list in ascending order.
3. Read from the current location (starting at 0) to the first removal point. The source file should be opened without granting Write share permissions (you don't want someone mucking with it while you are editing it).
4. Write that content to the new file (you will likely need to do this in chunks).
5. Skip over the data not being copied.
6. Repeat from step 3 until done.
You now have two files - the old one and the new one ... replace as necessary. If this is really critical data you might want to look at a transactional approach (either one you implement yourself or something like NTFS transactions).
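A minimal sketch of those steps, reusing the rdiDataOffset/rdiDataSize arrays from the question; sourcePath, tempPath, and the 1 MB buffer size are placeholders, and the removal list is assumed to be sorted ascending and non-overlapping (step 2):

using (var src = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.Read))
using (var dst = new FileStream(tempPath, FileMode.Create, FileAccess.Write, FileShare.None))
{
    dst.SetLength(src.Length);                            // step 1: reserve room up front
    byte[] buffer = new byte[1024 * 1024];
    long pos = 0;
    for (int i = 0; i <= rdiDataOffset.Length; i++)
    {
        // steps 3/4: copy everything up to the next removal point (or to the end of the file)
        long next = (i < rdiDataOffset.Length) ? rdiDataOffset[i] : src.Length;
        long toCopy = next - pos;
        while (toCopy > 0)
        {
            int read = src.Read(buffer, 0, (int)Math.Min(buffer.Length, toCopy));
            if (read == 0) break;
            dst.Write(buffer, 0, read);
            toCopy -= read;
        }
        if (i < rdiDataOffset.Length)
        {
            // step 5: skip the block being removed
            src.Seek(rdiDataSize[i], SeekOrigin.Current);
            pos = rdiDataOffset[i] + rdiDataSize[i];
        }
    }
    dst.SetLength(dst.Position);                          // trim to the actual copied size
}
// afterwards: replace sourcePath with tempPath (File.Replace, or Move/Delete)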
Consider a new design. If this is something you need to do frequently then it might make more sense to have an index in the file (or near the file) which contains a list of inactive blocks - then when necessary you can compress the file by actually removing blocks ... or maybe this IS that process.
If you're on the NTFS file system (most Windows deployments are) and you don't mind doing p/invoke methods, then there is a way, way faster way of deleting chunks from a file. You can make the file sparse. With sparse files, you can eliminate a large chunk of the file with a single call.
When you do this, the file is not rewritten. Instead, NTFS updates metadata about the extents of zeroed-out data. The beauty of sparse files is that consumers of your file don't have to be aware of the file's sparseness. That is, when you read from a FileStream over a sparse file, zeroed-out extents are transparently skipped.
NTFS uses such files for its own bookkeeping. The USN journal, for example, is a very large sparse memory-mapped file.
The way you make a file sparse and zero out sections of it is to use the DeviceIoControl Windows API. It is arcane and requires P/Invoke, but if you go this route, you'll surely want to hide the ugly parts behind nice, pretty function calls.
There are some issues to be aware of. For example, if the file is moved to a non-NTFS volume and then back, the sparseness of the file can disappear - so you should program defensively.
Also, a sparse file can appear to be larger than it really is - complicating tasks involving disk provisioning. A 5 GB sparse file that has been completely zeroed out still counts 5 GB towards a user's disk quota.
If a sparse file accumulates a lot of holes, you might want to occasionally rewrite the file in a maintenance window. I haven't seen any real performance troubles occur, but I can at least imagine that the metadata for a swiss-cheesy sparse file might accrue some performance degradation.
Here's a link to some doc if you're into the idea.
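If you want to experiment with this, a rough sketch of the two DeviceIoControl calls involved might look like the following. The control codes and the FILE_ZERO_DATA_INFORMATION layout are standard Win32, but treat the snippet as an untested outline rather than production code; the FileStream must be opened with write access on an NTFS volume:

using System;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class SparseFile
{
    const uint FSCTL_SET_SPARSE    = 0x000900C4;
    const uint FSCTL_SET_ZERO_DATA = 0x000980C8;

    [StructLayout(LayoutKind.Sequential)]
    struct FILE_ZERO_DATA_INFORMATION
    {
        public long FileOffset;        // start of the range to zero
        public long BeyondFinalZero;   // first byte *after* the range
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        IntPtr lpInBuffer, int nInBufferSize, IntPtr lpOutBuffer, int nOutBufferSize,
        out int lpBytesReturned, IntPtr lpOverlapped);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(SafeFileHandle hDevice, uint dwIoControlCode,
        ref FILE_ZERO_DATA_INFORMATION lpInBuffer, int nInBufferSize,
        IntPtr lpOutBuffer, int nOutBufferSize, out int lpBytesReturned, IntPtr lpOverlapped);

    public static void ZeroRange(FileStream fs, long offset, long length)
    {
        int returned;

        // 1. mark the file as sparse (only needs to happen once per file)
        DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_SPARSE,
            IntPtr.Zero, 0, IntPtr.Zero, 0, out returned, IntPtr.Zero);

        // 2. "punch a hole": the range reads back as zeroes and no longer occupies clusters
        var zero = new FILE_ZERO_DATA_INFORMATION
        {
            FileOffset = offset,
            BeyondFinalZero = offset + length
        };
        DeviceIoControl(fs.SafeFileHandle, FSCTL_SET_ZERO_DATA,
            ref zero, Marshal.SizeOf(typeof(FILE_ZERO_DATA_INFORMATION)),
            IntPtr.Zero, 0, out returned, IntPtr.Zero);
    }
}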

Set limit of text file

I am writing lines to a text file. Is there a way to limit the maximum number of lines in the text file, so that I am not allowed to write past that limit?
Or
if I continue to write after the max line limit, the oldest written lines are deleted to accommodate the newly added ones.
There is ... but you shouldn't be hitting it ... And if you ARE ... well, maybe a text file isn't what you're looking for.
Size-wise, a file has different limitations depending on your file system ... NTFS (almost 16 TB), FAT32 (almost 4 GB), Unix file systems have their own limits, and so on ...
here you have answers about the size: one answer, and another
As they suggest, your limit will be the size of the file.
As for your comment:
You can set the limit to whatever you wish.
What you do then is up to you ... if you decide to overwrite the file, it'll be deleted and started afresh. If you decide to append, it'll append to the end.
I would suggest creating a queue of 100 strings, and when you push a new one past that limit, drop the oldest one in the queue. Then you can just have that class save the log whenever, wherever and however you want.
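A rough sketch of that queue idea (the class and member names are mine, purely illustrative):

using System.Collections.Generic;
using System.IO;

public class BoundedLog
{
    private readonly Queue<string> _lines = new Queue<string>();
    private readonly int _limit;

    public BoundedLog(int limit) { _limit = limit; }

    public void Add(string line)
    {
        _lines.Enqueue(line);
        while (_lines.Count > _limit)
            _lines.Dequeue();                  // drop the oldest entry
    }

    public void Save(string path)
    {
        File.WriteAllLines(path, _lines.ToArray());
    }
}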
Create your own method like this
public void writeLines(string filePath, string[] lines, int limit)
{
    // write the new lines first, then top up with the previous content until the limit is reached
    string[] buffer = File.Exists(filePath) ? File.ReadAllLines(filePath) : new string[0];
    File.WriteAllLines(filePath, lines);
    int range = limit - lines.Length;
    if (range > 0)
        File.AppendAllLines(filePath, buffer.Take(range));
}
The answer to the first question is very simple:
Know your storage limit.
Know your current file size.
If the new line's length plus current file size is more than the storage limit, don't append it.
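In code, that check could look something like this (logPath, line, and storageLimit in bytes are assumed to already exist in your program; Encoding comes from System.Text):

long current = File.Exists(logPath) ? new FileInfo(logPath).Length : 0;
long incoming = Encoding.UTF8.GetByteCount(line + Environment.NewLine);
if (current + incoming <= storageLimit)
    File.AppendAllText(logPath, line + Environment.NewLine);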
Now, the second is kind of tricky. As pointed out by several participants on this thread, line-by-line threshold manipulation can be very, very costly.
Let's do some napkin math and assume you're inserting 1024 bytes (1 KB) at each Append, and your storage limit is 1 GB. Once you insert the last line (no. 1,048,576), you decide you need to remove the first line. There are a few ways to accomplish this, but most of them involve loading the whole collection minus the initial line somewhere else (memory, disk, you name it) and appending the new one. Not exactly the most practical approach - you'd be manipulating a body of data a million times larger than the content you want to add, just for the sake of adding it.
Solution 1
Cursor buffer
In our example you have 1,048,576 possible entries (1 KB records in a 1 GB file).
Start filling it up; save the current position (cursor) elsewhere.
Once you reach the limit, your cursor resets; you overwrite position 0, then 1, and so forth.
Advantages: Very low disk cost.
Disadvantages: You'll need to keep track of your current cursor somewhere.
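A bare-bones sketch of the cursor-buffer idea, persisting the cursor in a side file (all names and the fixed 1 KB record size are assumptions for illustration; each record is expected to be exactly RecordSize bytes):

const int RecordSize = 1024;
const long MaxRecords = 1024L * 1024;                     // 1 GB / 1 KB

static void WriteRecord(string dataFile, string cursorFile, byte[] record)
{
    long cursor = File.Exists(cursorFile) ? long.Parse(File.ReadAllText(cursorFile)) : 0;
    using (var fs = new FileStream(dataFile, FileMode.OpenOrCreate, FileAccess.Write))
    {
        fs.Seek(cursor * RecordSize, SeekOrigin.Begin);   // once wrapped, this overwrites the oldest slot
        fs.Write(record, 0, Math.Min(record.Length, RecordSize));
    }
    File.WriteAllText(cursorFile, ((cursor + 1) % MaxRecords).ToString());
}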
Solution 2
Text blocks
Assume 1 GB of storage and a maximum file size of 1 MB for this example.
Start filling up File #0.
Once it reaches 1 MB, close it. Open File #1. Rinse and repeat.
Once you fill up file #1023 (thus reaching the 1 GB maximum), delete the oldest file (#0). Create file #1024. Continue your logging.
Advantages: Low manipulation cost - you only run one delete operation.
Disadvantages: You don't delete only one entry - you delete a whole block.
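A rough sketch of the text-block approach (the "log_<ticks>.txt" naming scheme and the constants are illustrative assumptions; OrderBy/LastOrDefault need using System.Linq):

const long BlockSize = 1024 * 1024;   // 1 MB per block
const int MaxBlocks = 1024;           // 1024 blocks = 1 GB total

static void Append(string dir, string line)
{
    var files = Directory.GetFiles(dir, "log_*.txt").OrderBy(f => f).ToList();
    string current = files.LastOrDefault();
    if (current == null || new FileInfo(current).Length >= BlockSize)
    {
        if (files.Count >= MaxBlocks)
            File.Delete(files[0]);                        // drop the oldest block
        current = Path.Combine(dir, "log_" + DateTime.UtcNow.Ticks + ".txt");
    }
    File.AppendAllText(current, line + Environment.NewLine);
}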

Passthru reading of all files in folder

I've got a pretty unusual request:
I would like to load all files from a specific folder (so far, easy). I need something with a very small memory footprint.
Now it gets complicated (at least for me). I DON'T need to store or use the content of the files - I just need to force the block-level caching mechanism to cache all the blocks that are used by that specific folder.
I know there are many different methods (BinaryReader, StreamReader etc.), but my case is quite special, since I don't care about the content...
Any idea what would be the best way how to achieve this?
Should I use a small buffer? But since it would fill up quickly, wouldn't flushing the buffer actually slow down the operation?
Thanks,
Martin
I would perhaps memory-map the files and then loop through each one, touching an element at regular (block-spaced) intervals.
Assuming of course that you are able to use .Net 4.0.
In pseudo-code you'd do something like:
// block_size would typically be the file system's cluster size, e.g. 4096 bytes
using (var mmf = MemoryMappedFile.CreateFromFile(path))
{
    long length = new FileInfo(path).Length;
    for (long offset = 0; offset < length; offset += block_size)
    {
        using (var acc = mmf.CreateViewAccessor(offset, 1))
        {
            acc.ReadByte(0);   // position is relative to the start of this one-byte view
        }
    }
}
But at the end of the day, each method will have different performance characteristics so you might have to use a bit of trial and error to find out which is the most performant.
I would simply read those files. When you do that, CacheManager in NTFS caches these files automatically, and you don't have to care about anything else - that's exactly the role of CacheManager, and by reading these files, you give it a hint that these files should be cached.
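In other words, something as simple as the following should be enough to pull the folder's blocks into the cache (the 64 KB buffer and FileOptions.SequentialScan are just reasonable defaults, not requirements):

// stream every file in the folder once, discarding the content - the point is only
// to make the OS read (and cache) the underlying blocks
foreach (string path in Directory.EnumerateFiles(folder))
{
    byte[] buffer = new byte[64 * 1024];
    using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read,
                                   buffer.Length, FileOptions.SequentialScan))
    {
        while (fs.Read(buffer, 0, buffer.Length) > 0)
        {
            // content intentionally ignored
        }
    }
}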

FileStream.Seek vs. Buffered Reading

Motivated by this answer, I was wondering what's going on under the hood if one uses lots of FileStream.Seek(-1) calls.
For clarity I'll repost the answer:
using (var fs = File.OpenRead(filePath))
{
    fs.Seek(0, SeekOrigin.End);
    int newLines = 0;
    while (newLines < 3)
    {
        fs.Seek(-1, SeekOrigin.Current);
        newLines += fs.ReadByte() == 13 ? 1 : 0; // look for \r
        fs.Seek(-1, SeekOrigin.Current);
    }
    byte[] data = new byte[fs.Length - fs.Position];
    fs.Read(data, 0, data.Length);
}
Personally, I would have read something like 2048 bytes into a buffer and searched that buffer for the character.
Using Reflector I found out that internally the method is using SetFilePointer.
Is there any documentation about windows caching and reading a file backwards? Does Windows buffer "backwards" and consult the buffer when using consecutive Seek(-1) or will it read ahead starting from the current position?
It's interesting that, on the one hand, most people agree that Windows does good caching, but on the other hand, every answer to "reading a file backwards" involves reading chunks of bytes and operating on that chunk.
Going forward vs backward doesn't usually make much difference. The file data is read into the file system cache after the first read, you get a memory-to-memory copy on ReadByte(). That copy isn't sensitive to the file pointer value as long as the data is in the cache. The caching algorithm does however work from the assumption that you'd normally read sequentially. It tries to read ahead, as long as the file sectors are still on the same track. They usually are, unless the disk is heavily fragmented.
But yes, it is inefficient. You'll get hit with two pinvoke and API calls for each individual byte. There's a fair amount of overhead in that, those same two calls could also read, say, 65 kilobytes with the same amount of overhead. As usual, fix this only when you find it to be a perf bottleneck.
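For comparison, here is a sketch of the buffered alternative mentioned in the question: read a single chunk from the tail and scan it backwards in memory. The 64 KB chunk size is arbitrary, and a full implementation would need to keep reading earlier chunks if fewer than three \r bytes are found:

using (var fs = File.OpenRead(filePath))
{
    int chunk = (int)Math.Min(64 * 1024, fs.Length);
    fs.Seek(-chunk, SeekOrigin.End);
    byte[] buffer = new byte[chunk];
    int read = fs.Read(buffer, 0, chunk);

    int newLines = 0, start = 0;
    for (int i = read - 1; i >= 0; i--)           // scan the in-memory buffer backwards for '\r'
    {
        if (buffer[i] == 13 && ++newLines == 3)
        {
            start = i + 1;
            break;
        }
    }
    string lastLines = Encoding.Default.GetString(buffer, start, read - start);
}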
Here is a pointer on File Caching in Windows
The behavior may also depend on where the file physically resides (hard disk, network, etc.), as well as on local configuration/optimization.
Another important source of information is the CreateFile API documentation: CreateFile Function.
There is a good section named "Caching Behavior" that tells us how you can influence file caching, at least in the unmanaged world.
