Intercepting a filesystem write to a stream or byte array - C#

I have a C library with a .NET wrapper (it's Shapelib in this case) that writes files (Shapefiles) to the filesystem using a path like C:\Path\To\Things\Filename.shp. However, writing the files to the filesystem isn't actually what I need. Once they're written, I have to read them back into streams anyway to either deliver them via the web, add them to a zip file, or perform some other task. Writing them to the filesystem means I just have to track the clutter and inevitably clean them up somehow.
I'm not sure if there's anything like PHP's stream wrapper registration, where the path could be something like stream://output.shp...
Is it possible to intercept the filesystem writing and handle this entire task in memory? Even if this can be done, is it horrible practice? Thanks!

The consensus is that this is "virtually impossible." If you really need to ensure that this is done in RAM, your best bet is to install a RAM disk driver. Do a Google search for [windows intercept file output]. Or check out Intercept outputs from a Program in Windows 7.
That said, it's quite possible that much, perhaps most, of the data that you write to disk will be buffered in memory, so turning right around and reading the data from disk won't be all that expensive. You still have the cleanup problem, but it's really not that tough to solve: just use a try/finally block:
try
{
// do everything
}
finally
{
// clean up
}
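A slightly more concrete sketch of that pattern for the shapefile case (WriteShapefile stands in for whichever Shapelib wrapper call you use, and the temp-directory layout is just an assumption):
string tempDir = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(tempDir);
try
{
    string shpPath = Path.Combine(tempDir, "output.shp");
    WriteShapefile(shpPath); // hypothetical call into the Shapelib wrapper

    // Read whatever the library produced (.shp/.shx/.dbf) back into memory
    var buffers = new Dictionary<string, byte[]>();
    foreach (string file in Directory.GetFiles(tempDir))
    {
        buffers[Path.GetFileName(file)] = File.ReadAllBytes(file);
    }
    // ... hand the byte arrays to the web response, zip library, etc.
}
finally
{
    // Clean up the clutter no matter what happened above
    Directory.Delete(tempDir, recursive: true);
}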

Related

Writing continuously to file vs saving data in an array and writing to file all at once

I am working in C# and in my program I am currently opening a file stream at launch and writing data to it in CSV format about every second. I am wondering, would it be more efficient to store this data in an ArrayList and write it all at once at the end, or to keep an open FileStream and just write the data every second?
If the amount of data is "reasonably manageable" in memory, then write the data at the end.
If this is continuous, I wonder if an option could be to use something like NLog to write your CSV (create a specific log format), as that manages writes pretty efficiently. You would also need to set it to raise exceptions if there was an error.
You should consider using a BufferedStream instead. Write to the stream and allow the framework to flush to file as necessary. Just make sure to flush the stream before closing it.
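A minimal sketch of that suggestion (the file name and buffer size are arbitrary):
// Wrap the FileStream in a BufferedStream so small, frequent writes get batched
using (var file = new FileStream("data.csv", FileMode.Append, FileAccess.Write))
using (var buffered = new BufferedStream(file, 64 * 1024)) // 64 KB buffer, arbitrary
using (var writer = new StreamWriter(buffered))
{
    writer.WriteLine("timestamp,value"); // one CSV row per write
    writer.Flush(); // flush explicitly if you need the data on disk right now
}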
From what I learned in Operating Systems, writing to a file is a lot more expensive than writing to memory. However, your stream is most likely going to be cached, which means that under the hood all that file writing you are doing is happening in memory. The operating system handles the actual writing to disk asynchronously when it's the right time. For most applications there is no need to worry about such micro-optimizations.
You can read more about why most languages take this approach under the hood here https://unix.stackexchange.com/questions/224415/whats-the-philosophy-behind-delaying-writing-data-to-disk
This kind of depends on your specific case. If you're writing data about once per second it seems likely that you're not going to see much of an impact from writing directly.
In general writing to a FileStream in small pieces is quite performant because the .NET Framework and the OS handle buffering for you. You won't see the file itself being updated until the buffer fills up or you explicitly flush the stream.
Buffering in memory isn't a terrible idea for smallish data and shortish periods. Of course if your program throws an exception or someone kills it before it writes to the disk then you lose all that information, which is probably not your favourite thing.
If you're worried about performance then use a logging thread. Post objects to it through a ConcurrentQueue<> or similar and have it do all the writes on a separate thread. Obviously threaded logging is more complex. It's not something I'd advise unless you really, really need the extra performance.
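A rough sketch of that approach, using a BlockingCollection (which wraps a ConcurrentQueue by default); the file name and logged values are placeholders:
var queue = new BlockingCollection<string>();

// Background task owns the file and does all the writes
var writerTask = Task.Run(() =>
{
    using (var writer = new StreamWriter("log.csv", append: true))
    {
        foreach (string line in queue.GetConsumingEnumerable())
        {
            writer.WriteLine(line);
        }
    }
});

// Elsewhere, about once per second:
queue.Add(DateTime.UtcNow.ToString("O") + ",42"); // replace "42" with whatever you're logging

// On shutdown: stop accepting items and let the writer drain the queue
queue.CompleteAdding();
writerTask.Wait();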
For quick-and-dirty logging I generally just use File.AppendAllText() or File.AppendAllLines() to push the data out. It takes a bit longer, but it's pretty reliable. And I can read the output while the program is still running, which is often useful.

When does it become worthwhile to spend the execution time to zip files?

We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I haven't written the actual file transfer logic yet; I expect to use System.Net.WebClient to do this, but am open to alternatives to save on execution time as well.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.
To determine whether it's useful to compress a file, you have to read the file anyway. While you're at it, you might as well zip it.
If you want to prevent useless zipping without reading the files, you could try to decide it on beforehand, based on other properties.
You could create an 'algorithm' that decides whether it's useful, for example based on file extension and size. So, a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't be, regardless of the file size. But it's a lot of work to create such a list (you could also create a blacklist or whitelist and allow or deny all files not on the list).
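A sketch of such a rule (the extension list and size threshold are arbitrary choices):
// Skip compression for already-compressed formats and for tiny files
static readonly HashSet<string> AlreadyCompressed =
    new HashSet<string>(StringComparer.OrdinalIgnoreCase)
    { ".jpg", ".jpeg", ".png", ".gif", ".zip", ".mp3", ".mp4" };

static bool ShouldZip(string path)
{
    var info = new FileInfo(path);
    if (info.Length < 1024) return false; // not worth it below ~1 KB
    return !AlreadyCompressed.Contains(info.Extension);
}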
You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can decrease the file you will save on (Disk and Network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.
You can write your own pretty simple heuristic analyzer and then reuse it when processing each subsequent file. The collected statistics should be persisted so the heuristic stays effective between restarts.
A basic interface:
enum FileContentType
{
    PlainText,
    OfficeDoc,
    OfficeXlsx
}

// Name is ugly, so find a better one
public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}
Then you can collect statistics by adding information about each file you have just zipped using AddInfo(...), and based on that you can determine whether it is worth zipping the next file by calling IsWorthToZip(...).
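One possible (deliberately simplistic) implementation keeps a running compression ratio per content type; persisting the statistics between restarts is left out:
public class HeuristicZipAnalyzer : IHeuristicZipAnalyzer
{
    // Running totals per content type: (original bytes, zipped bytes)
    private readonly Dictionary<FileContentType, (long original, long zipped)> stats =
        new Dictionary<FileContentType, (long, long)>();

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        stats.TryGetValue(contentType, out var totals);
        stats[contentType] = (totals.original + fileSizeInBytes, totals.zipped + finalZipSize);
    }

    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        // No history for this type yet: zip it so we start collecting data
        if (!stats.TryGetValue(contentType, out var totals) || totals.original == 0)
            return true;

        double ratio = (double)totals.zipped / totals.original;
        return ratio < 0.9; // arbitrary threshold: only zip if we typically save more than 10%
    }
}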

File close issue on Windows

We are using File.WriteAllBytes to write data to the disk. But if a reboot happens just about the time when we close the file, Windows adds nulls to the file. This seems to be happening on Windows 7, so when we come back to the file we see nulls in it. Is there a way to prevent this? Is Windows closing its internal handle after a certain time, and can this be forced to close immediately?
Depending on what behavior you want, you can put it on a UPS as 0A0D suggested, but in addition you can use Windows Vista+'s Transactional NTFS functionality. This allows you to write to the file system atomically, so in your case nothing would be written rather than improper data. It isn't directly part of the .NET Framework yet, but there are plenty of managed wrappers to be found online.
Sometimes no data is better than wrong data. When your application starts up again, it can see that the file is missing and "continue" from where it left off, depending on what your application does.
Based on your comments, there are no guarantees when writing a file - especially if you lose power during a file write. Your best bet is to put the PC on an Uninterruptible Power Supply. If you are able to create an auto-restore mechanism, like Microsoft Office products have, then that would prevent complete loss of data, but it won't fix the missing data upon power loss.
I would consider this a case of a fatal exception (sudden loss of power). There isn't anything you can do about it, and generally, trying to handle such failures only makes matters worse.
I have had to deal with something similar; essentially an embedded system running on Windows, where the expectation is that the power might be shut off at any time.
In practice, I work with the understanding that a file written to disk less than 10 seconds before loss-of-power means that the file will be corrupted. (I use 30 seconds in my code to play it safe).
I am not aware of any way of guaranteeing from code that a file has been fully closed, flushed to disk, and that the disk hardware has finalized its writes. Except to know that 10 (or 30) seconds has elapsed. It's not a very satisfying situation, but there it is.
Here are some pointers I have used in a real-life embedded project...
Use a system of checksums and backup files.
Checksums: at the end of any file you write, include a checksum (if it's a custom XML file then perhaps include a <checksum .../> tag of some sort). Then upon reading, if the checksum tag isn't there, or doesn't match the data, then you must reject the file as corrupt.
Backups: every time you write a file, save a copy to one of two backups; say A and B. If A exists on disk but is less than 30 seconds old, then copy to B instead. Then upon reading, read the original file first. If corrupt, then read A, if corrupt then read B.
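A minimal sketch of the checksum part, assuming the payload is text and the last line holds a hex SHA-256 of everything before it:
static void WriteWithChecksum(string path, string payload)
{
    using (var sha = SHA256.Create())
    {
        string hash = BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(payload))).Replace("-", "");
        File.WriteAllText(path, payload + "\n" + hash);
    }
}

static string ReadWithChecksum(string path)
{
    string text = File.ReadAllText(path);
    int split = text.LastIndexOf('\n');
    if (split < 0) throw new InvalidDataException("No checksum found - treat the file as corrupt");
    string payload = text.Substring(0, split);
    string expected = text.Substring(split + 1);
    using (var sha = SHA256.Create())
    {
        string actual = BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(payload))).Replace("-", "");
        if (!string.Equals(actual, expected, StringComparison.OrdinalIgnoreCase))
            throw new InvalidDataException("Checksum mismatch - treat the file as corrupt");
    }
    return payload;
}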
Also
If it is an embedded system, you need to run the DOS command "chkdsk /F" on the drive you do your writes to, upon boot. Because if you are getting corrupted files, then you are also going to be getting a corrupted file system.
NTFS disk systems are meant to be more robust against errors than FAT32. But I believe that NTFS disks can also require more time to fully flush their data. I use FAT32 when I can.
Final thought: if you are really using an embedded system, under windows, you would do well to learn more about Windows Embedded, and the Enhanced Write Filter system.

Efficient log backup program in C#

I am writing a log backup program in C#. The main objective is to take logs from multiple servers, copy and compress the files and then move them to a central data storage server. I will have to move about 270 GB of data every 24 hours. I have a dedicated server to run this job and a 1 Gbps LAN. Currently I am reading lines from a (text) file, copying them into a buffer stream and writing them to the destination.
My last test copied about 2.5 GB of data in 28 minutes. This will not do. I will probably thread the program for efficiency, but I am looking for a better method to copy the files.
I was also playing with the idea of compressing everything first and then using a stream buffer a bit to copy. Really, I am just looking for a little advice from someone with more experience than me.
Any help is appreciated, thanks.
You first need to profile as Umair said so that you can figure out how much of the 28 minutes is spent compressing vs. transmitting. Also measure the compression rate (bytes/sec) with different compression libraries, and compare your transfer rate against other programs such as Filezilla to see if you're close to your system's maximum bandwidth.
One good library to consider is DotNetZip, which allows you to zip to a stream, which can be handy for large files.
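If you end up using the framework's built-in System.IO.Compression instead, the same zip-to-a-stream idea looks roughly like this (the paths are placeholders, and the output could just as well be a network stream you already have open):
using (Stream output = File.Create(@"C:\temp\logs.zip")) // or an already-open NetworkStream
using (var archive = new ZipArchive(output, ZipArchiveMode.Create))
{
    // Streams and compresses the file into the archive entry
    archive.CreateEntryFromFile(@"C:\logs\app.log", "app.log", CompressionLevel.Optimal);
}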
Once you get it fine-tuned for one thread, experiment with several threads and watch your processor utilization to see where the sweet spot is.
One of the solutions is what you mentioned: compress the files into one zip file and then transfer them via the network. This will be much faster as you are transferring one file, and often one of the principal bottlenecks during file transfers is the destination's security checks.
So if you use one zip file, there should be one check.
In short:
Compress
Transfer
Decompress (if you need)
This alone should bring you big benefits in terms of performance.
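A rough sketch of those three steps using the built-in System.IO.Compression.ZipFile (all paths and the server share are placeholders):
// 1. Compress: pack the whole log directory into a single archive
ZipFile.CreateFromDirectory(@"C:\logs", @"C:\temp\logs.zip");

// 2. Transfer: one file means one copy and one security check at the destination
File.Copy(@"C:\temp\logs.zip", @"\\storage-server\backups\logs.zip", overwrite: true);

// 3. Decompress on the receiving side, only if you need the raw files there
ZipFile.ExtractToDirectory(@"\\storage-server\backups\logs.zip", @"\\storage-server\backups\logs");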
Compress the logs at source and use TransmitFile (that's a native API - not sure if there's a framework equivalent, or how easy it is to P/Invoke this) to send them to the destination. (Possibly HttpResponse.TransmitFile does the same in .Net?)
In any event, do not read your files line by line - read them in blocks (loop doing FileStream.Read for, say, 4K bytes until the read count == 0) and send that directly to the network pipe.
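Something along these lines; the 4 KB block size follows the suggestion above, and networkStream stands for whatever outgoing Stream you are writing to (placeholder):
const int BlockSize = 4096;
byte[] buffer = new byte[BlockSize];

using (var source = new FileStream(@"C:\logs\app.log", FileMode.Open, FileAccess.Read))
{
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Send each block straight out; no per-line parsing or string allocation
        networkStream.Write(buffer, 0, read);
    }
}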
Try profiling your program... the bottleneck is often where you least expect it to be. As some clever guy said, "Premature optimisation is the root of all evil".
Once, in a similar scenario at work, I was given the task to optimise the process. After profiling, the bottleneck was found to be a call to a sleep function (which was used for synchronisation between threads!).

Logic in Disk Defragmentation & Disk Check

What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# coding?
For completeness sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.
Look at Defragmenting Files on MSDN for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.
If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?
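For example, a sketch of shelling out to the built-in tools (the drive letter is a placeholder, and both commands need to run elevated):
// Run the built-in defragmenter against C: and wait for it to finish
var defrag = Process.Start(new ProcessStartInfo("defrag.exe", "C:")
{
    UseShellExecute = true,
    Verb = "runas" // request elevation
});
defrag.WaitForExit();

// chkdsk works the same way; on a volume that is in use, /F will offer to schedule the check at the next boot
var chkdsk = Process.Start(new ProcessStartInfo("chkdsk.exe", "C: /F")
{
    UseShellExecute = true,
    Verb = "runas"
});
chkdsk.WaitForExit();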
Mark Russinovich wrote an article Inside Windows NT Disk Defragmentation a while ago which gives in-depth details. If you really want to do this I would really advise you to use the built-in facilities for defragmenting. What's more, on recent OSes I have never seen a need as a user to even care about defragmenting; it is done automatically on a schedule, and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they have been doing this for some time now, you haven't).
Despite its importance, the file system is no more than a data structure that maps file names into lists of disk blocks, and keeps track of meta-information such as the actual length of the file and special files that keep lists of files (e.g., directories). A disk checker verifies that the data structure is consistent: every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but is not for some reason.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than ones scattered all over the disk. And generally the entire file system will perform best if all the disk blocks in use confine themselves to a single contiguous range of the disk. Thus the trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these application while a disk is in use. It is possible but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way for dealing with concurrency problems (best handled by working on the file system when it is not in use) you can do this in C#, perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.
