Logic in Disk Defragmentation & Disk Check - c#

What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# coding?

For completeness' sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.

Look at Defragmenting Files on MSDN for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.

If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?
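A minimal sketch of that approach, written as top-level C# functions (C# 9 or later). The helper names BuildToolStartInfo and RunTool are made up for illustration; only Process and ProcessStartInfo come from the framework. Both tools typically need an elevated process, and their switches differ between Windows versions, so treat the arguments as placeholders.

```csharp
using System;
using System.Diagnostics;

// Builds the start info for a built-in Windows tool such as "chkdsk" or "defrag".
// (Hypothetical helper; nothing is started here.)
static ProcessStartInfo BuildToolStartInfo(string tool, string arguments)
{
    return new ProcessStartInfo
    {
        FileName = tool,
        Arguments = arguments,
        UseShellExecute = false,        // required in order to redirect output
        RedirectStandardOutput = true,
        CreateNoWindow = true
    };
}

// Runs the tool, captures everything it prints, and returns its exit code.
static int RunTool(string tool, string arguments, out string output)
{
    using (Process process = Process.Start(BuildToolStartInfo(tool, arguments)))
    {
        output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        return process.ExitCode;
    }
}
```

A call such as RunTool("chkdsk", "C:", out string report) would run a check without the /F switch, so it only reports and does not attempt repairs.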

Mark Russinovich wrote an article, Inside Windows NT Disk Defragmentation, a while ago which gives in-depth details. If you really want to do this, I would strongly advise you to use the built-in facilities for defragmenting. Moreover, on recent OSes I have never seen a need as a user to even care about defragmenting; it is done automatically on a schedule, and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they have been doing this for some time now; you haven't).

Despite its importance, the file system is no more than a data structure that maps file names into lists of disk blocks and keeps track of meta-information such as the actual length of the file and special files that keep lists of files (e.g., directories). A disk checker verifies that the data structure is consistent. That is, every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but is not for some reason.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than ones scattered all over the disk. And generally the entire file system will perform best if all the disk blocks in use confine themselves to a single contiguous range of the disk. Thus the trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these applications while a disk is in use. It is possible, but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way for dealing with concurrency problems (best handled by working on the file system when it is not in use) you can do this in C#, perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.
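To make the two ideas above concrete, here is a toy in-memory model, written as top-level C# functions (C# 9 or later). It is only a sketch: the "file system" is a plain dictionary from file name to block list, which is an assumption for illustration and looks nothing like the on-disk layout of FAT32 or NTFS. CheckConsistency and CountFragments are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Consistency check: every block in [0, totalBlocks) must be either free or
// owned by exactly one file. Returns a list of human-readable problems.
static List<string> CheckConsistency(
    Dictionary<string, List<int>> files, HashSet<int> freeBlocks, int totalBlocks)
{
    var problems = new List<string>();
    var owner = new Dictionary<int, string>();

    foreach (var file in files)
    {
        foreach (int block in file.Value)
        {
            if (block < 0 || block >= totalBlocks)
                problems.Add($"{file.Key}: block {block} is out of range");
            else if (freeBlocks.Contains(block))
                problems.Add($"{file.Key}: block {block} is also marked free");
            else if (owner.TryGetValue(block, out string other))
                problems.Add($"block {block} is claimed by both {other} and {file.Key}");
            else
                owner[block] = file.Key;
        }
    }

    // Blocks that are neither free nor owned are "lost".
    for (int block = 0; block < totalBlocks; block++)
        if (!freeBlocks.Contains(block) && !owner.ContainsKey(block))
            problems.Add($"block {block} is neither free nor part of any file");

    return problems;
}

// Fragmentation measure: the number of contiguous runs in a file's block list.
// A result of 1 means the file is fully contiguous.
static int CountFragments(List<int> blocks)
{
    if (blocks.Count == 0) return 0;
    int fragments = 1;
    for (int i = 1; i < blocks.Count; i++)
        if (blocks[i] != blocks[i - 1] + 1) fragments++;
    return fragments;
}
```

A defragmenter would then try to move blocks so that CountFragments drops to 1 for each file, without ever leaving the structure in a state where CheckConsistency reports a problem.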

Related

How do database engines guarantee no data loss while still using basic I/O (Windows OS)?

This question arises because when someone wants to use a flat file as a database, most people will say "is a database not an option?" and things like that. This makes me think that most people believe that popular database software is reliable in handling data storage.
However, since database engines also write their data stores to files (or allow me to say "flat files"), I am confused as to why most people believe that protection from data loss is almost completely guaranteed in database engines.
I suppose that database software uses features like the Windows CreateFile() function with the FILE_FLAG_WRITE_THROUGH option set; yet Microsoft specifies in its documentation that "Not all hard disk hardware supports this write-through capability."
Then why can a database engine be more reliable than my C# code that also uses the unmanaged CreateFile() function to write to disk directly, using some algorithms (like this SO way) to prevent damage to data? Especially when writing small files and appending a few bytes to them at some future time? (Note: I am not comparing in terms of robustness, features, etc., just reliability of data integrity.)
The key to most database systems' integrity is the log file.
As well as updating the various tables/data stores/documents, they also write all operations and associated data to a log file.
In most cases, when the program "commits()" it waits until all operations are written (really written!) to the log file. If anything happens after that, the database can be rebuilt using the log file data.
Note -- you could get something similar using standard disk I/O and calling flush() at the appropriate points. However you could never guarantee the status of the file (many I/Os could have taken place before you called flush() ) and you could never recover to a point in time as you have no copy of deleted records or a copy of a previous version of an updated record.
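A very reduced sketch of the log-first idea, written as top-level C# functions (C# 9 or later). AppendToLog, Commit and Recover are hypothetical names, and the one-record-per-line format is an assumption; a real engine logs far more (for example old and new versions of each record) so that it can also undo work. FileStream.Flush(true) asks the OS to flush its buffers as well, but as the question notes, the disk hardware may still cache the write.

```csharp
using System;
using System.IO;
using System.Text;

// Appends one record to the log and returns only after the bytes have been
// pushed through the .NET and OS buffers (Flush(true), available since .NET 4).
static void AppendToLog(string logPath, string record)
{
    byte[] bytes = Encoding.UTF8.GetBytes(record + "\n");
    using (var log = new FileStream(logPath, FileMode.Append, FileAccess.Write, FileShare.Read))
    {
        log.Write(bytes, 0, bytes.Length);
        log.Flush(true);
    }
}

// "Commit": write the log first, then update the real data file. If the
// process dies between the two steps, the log still holds the record.
static void Commit(string logPath, string dataPath, string record)
{
    AppendToLog(logPath, record);
    File.AppendAllText(dataPath, record + "\n");
}

// Recovery after a crash: rebuild the data file from the log.
static void Recover(string logPath, string dataPath)
{
    File.WriteAllLines(dataPath, File.ReadAllLines(logPath));
}
```

The ordering is the point of the sketch: the data file is never the only place a committed record lives.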
Of course you can write a very secure piece of C# code that handles all possible exceptions and faults, that calculates hash codes and verifies them for everything it writes to disk, and that manages all the quirks of every operating system it's deployed on with respect to file caching, disk write buffering and so forth.
The question is: why should you?
Admittedly, a DB is not always the right choice if you just want to write data on the disk. But if you want to store data consistently, safely and most importantly, without losing too much of your time in nitty-gritty IO-operation details, then you should use some kind of well established and tested piece of code that someone else wrote and took the time to debug (hint: a database is a good choice).
See?
Also, there are databases, like sqlite, that are perfect for fast, installation-less use in a program. Use them or not, it's your choice, but I wouldn't spend my time to reinvent the wheel, if I were you.

File close issue windows

We are using File.WriteAllBytes to write data to the disk. But if a reboot happens just about the time when we close the file, Windows adds nulls to the file. This seems to be happening on Windows 7. So once we come back to the file, we see nulls in it. Is there a way to prevent this? Is Windows closing its internal handle after a certain time, and can this be forced to close immediately?
Depending on what behavior you want, you can either put the machine on a UPS as 0A0D suggested, or, in addition, use the Transactional NTFS functionality available in Windows Vista and later. This allows you to write to the file system atomically, so in your case nothing would be written rather than improper data. It isn't directly part of the .NET Framework yet, but there are plenty of managed wrappers to be found online.
Sometimes no data is better than wrong data. When your application starts up again, it can see that the file is missing and "continue" from where it left off, depending on what your application does.
Based on your comments, there are no guarantees when writing a file, especially if you lose power during a file write. Your best bet is to put the PC on an Uninterruptible Power Supply. If you are able to create an auto-restore mechanism, like Microsoft Office products, then that would prevent complete loss of data, but it won't fix the missing data upon power loss.
I would consider this a case of a fatal exception (sudden loss of power). There isn't anything you can do about it, and generally, trying to handle them only makes matters worse.
I have had to deal with something similar; essentially an embedded system running on Windows, where the expectation is that the power might be shut off at any time.
In practice, I work with the understanding that a file written to disk less than 10 seconds before loss-of-power means that the file will be corrupted. (I use 30 seconds in my code to play it safe).
I am not aware of any way of guaranteeing from code that a file has been fully closed, flushed to disk, and that the disk hardware has finalized its writes. Except to know that 10 (or 30) seconds has elapsed. It's not a very satisfying situation, but there it is.
Here are some pointers I have used in a real-life embedded project...
Use a system of checksums and backup files.
Checksums: at the end of any file you write, include a checksum (if it's a custom XML file then perhaps include a <checksum .../> tag of some sort). Then upon reading, if the checksum tag isn't there, or doesn't match the data, then you must reject the file as corrupt.
Backups: every time you write a file, save a copy to one of two backups; say A and B. If A exists on disk but is less than 30 seconds old, then copy to B instead. Then upon reading, read the original file first. If corrupt, then read A, if corrupt then read B.
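The checksum and fallback-read parts of the scheme above could look roughly like this, written as top-level C# functions (C# 9 or later). The function names are hypothetical, and appending a raw 32-byte SHA-256 digest is just one possible layout (the XML tag mentioned above would work equally well). The 30-second rule for choosing between backups A and B on write is not shown.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Writes the payload followed by its 32-byte SHA-256 digest.
static void WriteWithChecksum(string path, byte[] payload)
{
    using (var sha = SHA256.Create())
    using (var stream = new FileStream(path, FileMode.Create, FileAccess.Write))
    {
        byte[] digest = sha.ComputeHash(payload);
        stream.Write(payload, 0, payload.Length);
        stream.Write(digest, 0, digest.Length);
        stream.Flush(true);
    }
}

// Returns the payload, or null if the file is missing, truncated or corrupt.
static byte[] TryReadWithChecksum(string path)
{
    if (!File.Exists(path)) return null;
    byte[] all = File.ReadAllBytes(path);
    if (all.Length < 32) return null;

    byte[] payload = all.Take(all.Length - 32).ToArray();
    byte[] stored = all.Skip(all.Length - 32).ToArray();
    using (var sha = SHA256.Create())
        return sha.ComputeHash(payload).SequenceEqual(stored) ? payload : null;
}

// Tries the original first, then backup A, then backup B (in the given order).
static byte[] ReadFirstValid(string[] candidates)
{
    foreach (string path in candidates)
    {
        byte[] payload = TryReadWithChecksum(path);
        if (payload != null) return payload;
    }
    return null;
}
```

A file that was filled with nulls by an interrupted write fails the digest comparison, so the reader falls through to the next candidate.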
Also
If it is an embedded system, you need to run the DOS command "chkdsk /F" on the drive you do your writes to, upon boot. Because if you are getting corrupted files, then you are also going to be getting a corrupted file system.
NTFS disk systems are meant to be more robust against errors than FAT32. But I believe that NTFS disks can also require more time to fully flush their data. I use FAT32 when I can.
Final thought: if you are really using an embedded system, under windows, you would do well to learn more about Windows Embedded, and the Enhanced Write Filter system.

When to use memory-mapped files?

I have an application that receives chunks of data over the network, and writes these to disk.
Once all chunks have been received, they can be decoded/recombined into the single file they actually represent.
I'm wondering if it's useful to use memory-mapped files or not - first for writing the single chunks to disk, second for the single file into which all of them are decoded.
My own feeling is that it might be useful for the second case only, anyone got some ideas on this?
Edit:
It's a C# app, and I'm only planning an x64 version.
(So running into the 'largest contiguous free space' problem shouldn't be relevant)
Memory-mapped files are beneficial for scenarios where a relatively small portion (view) of a considerably larger file needs to be accessed repeatedly.
In this scenario, the operating system can help optimize the overall memory usage and paging behavior of the application by paging in and out only the most recently used portions of the mapped file.
In addition, memory-mapped files can expose interesting features such as copy-on-write or serve as the basis of shared-memory.
For your scenario, memory-mapped files can help you assemble the file if the chunks arrive out of order. However, you would still need to know the final file size in advance.
Also, you should be accessing the files only once, for writing a chunk. Thus, a performance advantage over explicitly implemented asynchronous I/O is unlikely, but it may be easier and quicker to implement your file writer correctly.
In .NET 4, Microsoft added support for memory-mapped files and there are some comprehensive articles with sample code, e.g. http://blogs.msdn.com/salvapatuel/archive/2009/06/08/working-with-memory-mapped-files-in-net-4.aspx.
Memory-mapped files are primarily used for Inter-Process Communication or I/O performance improvement.
In your case, are you trying to get better I/O performance?
Hate to point out the obvious, but Wikipedia gives a good rundown of the situation...
http://en.wikipedia.org/wiki/Memory-mapped_file
Specifically...
The memory mapped approach has its cost in minor page faults - when a block of data is loaded in page cache, but not yet mapped in to the process's virtual memory space. Depending on the circumstances, memory mapped file I/O can actually be substantially slower than standard file I/O.
It sounds like you're about to prematurely optimize for speed. Why not a regular file approach, and then refactor for MM files later if needed?
I'd say both cases are relevant. Simply write the single chunks to their proper place in the memory mapped file, out of order, as they come in. This of course is only useful if you know where each chunk should go, like in a bittorrent downloader. If you have to perform some extra analysis to know where the chunk should go, the benefit of a memory mapped file might not be as large.
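A minimal sketch of writing chunks to their place as they arrive, using the System.IO.MemoryMappedFiles classes from .NET 4, written as top-level C# functions (C# 9 or later). CreateTarget and WriteChunk are hypothetical names, and the sketch assumes that both the final file size and each chunk's offset are known when the chunk arrives.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

// Creates the output file at its final size and returns the mapping over it.
static MemoryMappedFile CreateTarget(string path, long finalSize)
{
    return MemoryMappedFile.CreateFromFile(path, FileMode.Create, null, finalSize);
}

// Copies one chunk to its offset in the file; chunks may arrive in any order.
static void WriteChunk(MemoryMappedFile target, long offset, byte[] chunk)
{
    using (MemoryMappedViewAccessor view = target.CreateViewAccessor(offset, chunk.Length))
    {
        view.WriteArray(0, chunk, 0, chunk.Length);
    }
}
```

Each chunk gets its own short-lived view here, which keeps the sketch simple; whether that is faster than seeking in a plain FileStream is something to measure rather than assume.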

FileStream very slow on application-cold start

A very similar question has also been asked here on SO in case you are interested, but as we will see the accepted answer of that question is not always the case (and it's never the case for my application use-pattern).
The performance determining code consists of FileStream constructor (to open a file) and a SHA1 hash (the .Net framework implementation). The code is pretty much C# version of what was asked in the question I've linked to above.
Case 1: The Application is started either for the first time or Nth time, but with different target file set. The application is now told to compute the hash values on the files that were never accessed before.
~50ms
80% FileStream constructor
18% hash computation
Case 2: Application is now fully terminated, and started again, asked to compute hash on the same files:
~8ms
90% hash computation
8% FileStream constructor
Problem
My application is always in Case 1. It will never be asked to re-compute a hash on a file that was already visited once.
So my rate-determining step is the FileStream constructor! Is there anything I can do to speed up this use case?
Thank you.
P.S. Stats were gathered using JetBrains profiler.
... but with different target file set.
That is the key phrase: your app will not be able to take advantage of the file system cache, as it did in the second measurement. The directory info can't come from RAM because it hasn't been read yet; the OS always has to fall back to the disk drive, and that is slow.
Only better hardware can speed it up. 50 msec is about the standard amount of time needed for a spindle drive; 20 msec is about as low as such drives can go. Reader head seek time is the hard mechanical limit. That's easy to beat today: SSDs are widely available and reasonably affordable. The only problem is that once you get used to one, you never go back :)
The file system and/or disk controller will cache recently accessed files / sectors.
The rate-determining step is reading the file, not constructing a FileStream object, and it's completely normal that it will be significantly faster on the second run when data is in the cache.
Off track suggestion, but this is something that I have done a lot and got our analyses 30% - 70% faster:
Caching
Write another piece of code that will:
iterate over all the files;
compute the hash; and,
store it in another index file.
Now, don't call a FileStream constructor to compute the hash when your application starts. Instead, open the (expectedly much) smaller index file and read the precomputed hash off it.
Further, if these files are log etc. files which are freshly created every time before your application starts, add code in the file creator to also update the index file with the hash of the newly created file.
This way your application can always read the hash from the index file only.
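As a rough sketch of that index, written as top-level C# functions (C# 9 or later): one offline step hashes every file and stores the results, and the application later loads only the index. The function names and the "hash, tab, path" line format are assumptions for illustration, and the format would break for paths that contain line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Computes the SHA-1 hash of a file as an uppercase hex string.
static string HashFile(string path)
{
    using (var stream = File.OpenRead(path))
    using (var sha1 = SHA1.Create())
        return BitConverter.ToString(sha1.ComputeHash(stream)).Replace("-", "");
}

// Offline step: hash every file once and store "hash<TAB>path" lines.
static void BuildIndex(IEnumerable<string> files, string indexPath)
{
    File.WriteAllLines(indexPath, files.Select(f => HashFile(f) + "\t" + f));
}

// Application start: read the small index instead of opening every file.
// The result maps each path to its precomputed hash.
static Dictionary<string, string> LoadIndex(string indexPath)
{
    return File.ReadAllLines(indexPath)
               .Select(line => line.Split(new[] { '\t' }, 2))
               .ToDictionary(parts => parts[1], parts => parts[0]);
}
```

This only moves the cost to the indexing step; it pays off when that step can run before the application needs the answer.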
I concur with @HansPassant's suggestion of using SSDs to make your disk reads faster. This answer and his answer are complementary. You can implement both to maximize the performance.
As stated earlier, the file system has its own caching mechanism, which perturbs your measurement.
However, the FileStream constructor performs several tasks which are expensive the first time and require accessing the file system (therefore touching something which might not be in the data cache). For illustration, you can take a look at the code and see that the CompatibilitySwitches class is used to detect sub-feature usage. Together with this class, Reflection is heavily used both directly (to access the current assembly) and indirectly (for CAS-protected sections and security link demands). The Reflection engine has its own cache, and requires accessing the file system when its own cache is empty.
It feels a little bit odd that the two measurements are so different. We currently see something similar on our machines equipped with antivirus software configured with real-time protection. In this case, the antivirus software sits in the middle, and whether the cache is hit or missed the first time depends on the implementation of that software.
The antivirus software might decide to aggressively check certain image files, like PNGs, due to known decoder vulnerabilities. Such checks introduce additional slowdown, and the time is accounted to the outermost .NET class, i.e. the FileStream class.
Profiling using native symbols and/or with kernel debugging should give you more insight.
Based on my experience, what you describe cannot be mitigated, as there are multiple hidden layers out of our control. Depending on your usage, which is not perfectly clear to me right now, you might turn the application into a service, so that you could serve all subsequent requests faster. Alternatively, you could batch multiple requests into one single call to achieve a reduced amortized cost.
You should try to use the native FILE_FLAG_SEQUENTIAL_SCAN flag, either by P/Invoking CreateFile to get a handle and passing it to FileStream, or through FileOptions.SequentialScan, which the FileStream constructor accepts.
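A sketch of the managed route, written as a top-level C# function (C# 9 or later): FileOptions.SequentialScan corresponds to FILE_FLAG_SEQUENTIAL_SCAN, so the hint can be passed without P/Invoke. HashFileSequential is a hypothetical name and the 64 KB buffer size is an arbitrary choice. Whether the hint helps on a cold cache is not something this sketch can promise; it would need to be measured.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hashes a file with SHA-1, opening it with a sequential-access hint.
// Returns the hash as a lowercase hex string.
static string HashFileSequential(string path)
{
    using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                       FileShare.Read, 64 * 1024, FileOptions.SequentialScan))
    using (var sha1 = SHA1.Create())
    {
        byte[] hash = sha1.ComputeHash(stream);
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }
}
```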

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large copying it locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do I speed up the file transfer? (Compression? Is transferring locally even necessary? Can I read an in-use file some other way?)
2) How do I deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.
If you are reading a sequential file you want to read it in line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
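A minimal sketch of the compress, copy, decompress approach using GZipStream from the framework, written as top-level C# functions (C# 9 or later). CompressFile and DecompressFile are hypothetical names. The source is opened with FileShare.ReadWrite on the assumption that the log may still be open for writing. Note that the compression step runs on the remote machine and so costs some CPU there.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Remote side: compress the log into a .gz file before it is copied.
static void CompressFile(string sourcePath, string gzipPath)
{
    using (var source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (FileStream target = File.Create(gzipPath))
    using (var gzip = new GZipStream(target, CompressionMode.Compress))
    {
        source.CopyTo(gzip);
    }
}

// Local side: restore the original text so the existing parser can run on it.
static void DecompressFile(string gzipPath, string targetPath)
{
    using (FileStream source = File.OpenRead(gzipPath))
    using (var gzip = new GZipStream(source, CompressionMode.Decompress))
    using (FileStream target = File.Create(targetPath))
    {
        gzip.CopyTo(target);
    }
}
```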
I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up, you just have to do it in a smart way -- have a deterministic algorithm for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of a middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.
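The rule described above (read on past your segment to finish a record, skip a record you don't have the beginning of) can be sketched as follows, written as top-level C# functions (C# 9 or later). It assumes the lines are already in memory and that a caller-supplied isRecordStart predicate can tell the first line of a record from a continuation line; both are assumptions, as are the function names. Here a "parsed record" is simply its lines joined together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Parses the records that START inside [start, end). A record is a start line
// plus any continuation lines that follow it.
static List<string> ParseSegment(string[] lines, int start, int end, Func<string, bool> isRecordStart)
{
    var records = new List<string>();
    int i = start;

    // Leading lines without a beginning belong to the previous segment's worker.
    while (i < end && !isRecordStart(lines[i])) i++;

    while (i < end)
    {
        int first = i++;
        // Read on, past the segment boundary if needed, to the end of the record.
        while (i < lines.Length && !isRecordStart(lines[i])) i++;
        records.Add(string.Join("\n", lines, first, i - first));
    }
    return records;
}

// Splits the lines into fixed-size segments, parses them in parallel, and
// concatenates the per-segment results in file order.
static List<string> ParseParallel(string[] lines, int linesPerSegment, Func<string, bool> isRecordStart)
{
    int segments = (lines.Length + linesPerSegment - 1) / linesPerSegment;
    var results = new List<string>[segments];

    Parallel.For(0, segments, s =>
    {
        int start = s * linesPerSegment;
        int end = Math.Min(start + linesPerSegment, lines.Length);
        results[s] = ParseSegment(lines, start, end, isRecordStart);
    });

    return results.SelectMany(r => r).ToList();
}
```

Because every record is owned by the segment in which it starts, no record is parsed twice or dropped, whatever segment size is chosen.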
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
I think using compression (deflate/gzip) would help.
The given answers do not satisfy me, and maybe my answer will help others not to think that this is super complicated or that multithreading wouldn't be of benefit in such a scenario. It may not make the transfer faster, but depending on the complexity of your parsing, it may make the parsing or analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Is this information like statistics, or does it depend on multiple log messages?
You have several options:
parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
another option, as mentioned before, is to use compression for the network communication
you could also use a helper that, as a first step, splits the log file into groups of lines that belong together, and then process these blocks of lines with multiple threads; parsing these dependent lines should be quite easy and fast.
Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit much from optimizing the parser. If your parser creates a lot of objects of the same kind, you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too many new strings. Parsers are often written using a lot of string.Split and so forth, which is not really as fast as it could be. Instead of reading the complete string and splitting it again, you could navigate the Stream by checking the incoming values and directly fill the objects you will need after parsing is done.
Optimization is almost always possible; the question is how much you get out for how much input, and how critical your scenario is.
