Concatenate large files using Win NT kernel API

Concatenate large files using Win NT kernel API - c#

I've been looking around for a way to concatenate large files (a few gigabytes) together without having to rewrite one of the files. I am sure the OS does this internally when manipulating the master file table. This is purely for an internal application where speed is critical even at the cost of data integrity (in case of risking undocumented APIs). The app processes a large amount of high-bandwidth, multi-channel ethernet data where a corrupt unit of work (file in this case) will not have a large impact on overall processing results.
At the moment when combining files A and B, the effort involved is equal to: A[Read] + B[Read] +C[Write]`. Would any of you NT gurus shed some light on how to work around this to get to the MFT directly?
I have not been able to gain any clues as to which API to explore and would appreciate some pointers. Although the app in managed, I would gladly explore native APIs and even setup light-weight VMs for testing.
Thanks in advance.

If you are appending File B to File A, all you have to do is open File A for write+append , seek to end of file, then read from B and write to A.
If you want to create File C as the concatenation of File A and File B, then you are going to have to create File C and copy A to C, then B to C.
There aren't any shortcuts.

That's not really something a file system would do. File systems allocate space for files in terms of clusters and blocks of data, not in terms of bytes. Concatenating two files like this would only work if they were both multiples of the cluster size, and the FS might have other assumptions about how blocks are allocated to files under the covers. You might be able to do this yourself to the file system if you dismounted it and wrote a tool to directly manipulate all the file system structures. But you're risking corrupting the whole disk if you do that, not just a single file.

I don't know your exact situation but would it be possible to not append the files together at all? Just keep throwing files into some directory as you receive data, and keep an index
Then as the data is needed use the index to piece together the data to create one new file?
So you only ever do the expensive file merging on demand?

Related

When does it become worthwhile to spend the execution time to zip files?

We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I havent written the actual file transfer logic yet, but expect to use System.Net.WebClient to do this, but am open to alternatives to save on execution time as well.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.

To determine whether it's useful to compress a file, you have to read the file anyway. When on it, you might as well zip it then.
If you want to prevent useless zipping without reading the files, you could try to decide it on beforehand, based on other properties.
You could create an 'algorithm' that decides whether it's useful, for example based on file extention and size. So, a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't, regardless of the file size. But it's a lot of work to create such a list (you could also create a black- or whitelist and allow c.q. deny all files not on the list).

You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can decrease the file you will save on (Disk and Network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.

You can write your own pretty simple heuristic analysis and then reuse it whilst each next file processing. Collected statistics should be saved to keep efficiency between restarts.
Basically interface:
enum FileContentType
{
PlainText,
OfficeDoc,
OffixeXlsx
}
// Name is ugly so find out better
public interface IHeuristicZipAnalyzer
{
bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
void AddInfo(FileContentType, fileSizeInBytes, int finalZipSize);
}
Then you can collect statistic by adding information regarding just zipped file using AddInfo(...) and based on it can determine whether it worth to zip a next file by calling IsWorthToZip(...)

Windows File system API to query large files

I have HDD (say 1TB) with FAT32 and NTFS partitions and I dont have information on which all files are stored on it, but when needed I want to quickly access large files say over 500 MB. I dont want to scan my whole HDD since it is very time consuming. I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can directly call - i mean if they have some metadata about the files that are stored then it will be quicker. I want to write my program in C++ and C#.
EDIT
If scanning the HDD is the only option then what all can I do to ensure best performance. Like - I could skip scanning system folders, since I am only interested in user data.

If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that tells which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas for the large files, you can read the FAT in as a number of raw sectors, and scan through that data to find linked lists of more than X clusters (where X defines the lower limit for what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain are a fixed-size records of a specified format. You have to start from the root directory, and read through the directory tree to find file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) to contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult -- and writing data much more difficult. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on what partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\.\D:" (where D=letter of the disk you care about). You can then read raw sectors from that disk drive (assuming that opening it worked). This will let you see the raw data for the entire disk (or partition, as the case may be) starting from the boot sector, and going through everything else that's there (FAT, root directory, subdirectories, etc. -- all as sectors of raw data). The system will let you read the raw data, but all the work to make any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility -- but it'll take a fair amount of work for FAT volumes, and considerably more than that for NTFS. Unless you really need extra speed like you've said, it's probably not even worth considering trying to do this.

If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here you can find information about the search indexer. The search indexer does index the file size so it may do what you want.

Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.

Your only prayer is to latch onto a file indexer, otherwise you will have to iterate through all files. Depending on your computer you might be able to latch onto the native Microsoft indexer (searchindexer.exe) or if you have Google Desktop search you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer

When to use memory-mapped files?

I have an application that receives chunks of data over the network, and writes these to disk.
Once all chunks have been received, they can be decoded/recombined into the single file they actually represent.
I'm wondering if it's useful to use memory-mapped files or not - first for writing the single chunks to disk, second for the single file into which all of them are decoded.
My own feeling is that it might be useful for the second case only, anyone got some ideas on this?
Edit:
It's a C# app, and I'm only planning an x64 version.
(So running into the 'largest contigious free space' problem shouldn't be relevant)

Memory-mapped files are beneficial for scenarios where a relatively small portion (view) of a considerably larger file needs to be accessed repeatedly.
In this scenario, the operating system can help optimize the overall memory usage and paging behavior of the application by paging in and out only the most recently used portions of the mapped file.
In addition, memory-mapped files can expose interesting features such as copy-on-write or serve as the basis of shared-memory.
For your scenario, memory-mapped files can help you assemble the file if the chunks arrive out of order. However, you would still need to know the final file size in advance.
Also, you should be accessing the files only once, for writing a chunk. Thus, a performance advantage over explicitly implemented asynchronous I/O is unlikely, but it may be easier and quicker to implement your file writer correctly.
In .NET 4, Microsoft added support for memory-mapped files and there are some comprehensive articles with sample code, e.g. http://blogs.msdn.com/salvapatuel/archive/2009/06/08/working-with-memory-mapped-files-in-net-4.aspx.

Memory-mapped files are primarily used for Inter-Process Communication or I/O performance improvement.
In your case, are you trying to get better I/O performance?
Hate to point out the obivious, but Wikipedia gives a good rundown of the situation...
http://en.wikipedia.org/wiki/Memory-mapped_file
Specifically...
The memory mapped approach has its cost in minor page faults - when a block of data is loaded in page cache, but not yet mapped in to the process's virtual memory space. Depending on the circumstances, memory mapped file I/O can actually be substantially slower than standard file I/O.
It sounds like you're about to prematurely optimize for speed. Why not a regular file approach, and then refactor for MM files later if needed?

I'd say both cases are relevant. Simply write the single chunks to their proper place in the memory mapped file, out of order, as they come in. This of course is only useful if you know where each chunk should go, like in a bittorrent downloader. If you have to perform some extra analysis to know where the chunk should go, the benefit of a memory mapped file might not be as large.

Logic in Disk Defragmantation & Disk Check

What is the logic behind disk defragmentation and Disk Check in Windows? Can I do it using C# coding?

For completeness sake, here's a C# API wrapper for defragmentation:
http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13/229137.aspx
Defragmentation with these APIs is (supposed to be) very safe nowadays. You shouldn't be able to corrupt the file system even if you wanted to.
Commercial defragmentation programs use the same APIs.

Look at Defragmenting Files at msdn for possible API helpers.
You should carefully think about using C# for this task, as it may introduce some undesired overhead for marshaling into native Win32.

If you don't know the logic for defragmentation, and if you didn't write the file system yourself so you can't authoritatively check it for errors, why not just start new processes running 'defrag' and 'chkdsk'?

Mark Russinovich wrote an article Inside Windows NT Disk Defragmentation a while ago which gives in-depth details. If you really want to do this I would really advise you to use the built-in facilities for defragmenting. More so, on recent OSes I have never seen a need as a user to even care about defragmenting; it will be done automatically on a schedule and the NTFS folks at MS are definitely smarter at that stuff than you (sorry, but they do this for some time now, you don't).

Despite its importance, the file system is no more than a data structure that maps file names into lists of disk blocks. And keeps track of meta-information such as the actual length of the file and special files that keep lists of files (e.g., directories). A disk checker verifies that the data structure is consistent. That is, every disk block must either be free for allocation to a file or belong to a single file. It can also check for certain cases where a set of disk blocks appears to be a file that should be in a directory but is not for some reason.
Defragmentation is about looking at the lists of disk blocks assigned to each file. Files will generally load faster if they use a contiguous set of blocks rather than ones scattered all over the disk. And generally the entire file system will perform best if all the disk blocks in use confine themselves to a single congtiguous range of the disk. Thus the trick is moving disk blocks around safely to achieve this end while not destroying the file system.
The major difficulty here is running these application while a disk is in use. It is possible but one has to be very, very, very careful not to make some kind of obvious or extremely subtle error and destroy most or all of the files. It is easier to work on a file system offline.
The other difficulty is dealing with the complexities of the file system. For example, you'd be much better off building something that supports FAT32 rather than NTFS because the former is a much, much simpler file system.
As long as you have low-level block access and some sensible way for dealing with concurrency problems (best handled by working on the file system when it is not in use) you can do this in C#, perl or any language you like.
BUT BE VERY CAREFUL. Early versions of the program will destroy entire file systems. Later versions will do so but only under obscure circumstances. And users get extremely angry and litigious if you destroy their data.

What's the best way to read and parse a large text file over the network?

I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100mb+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large copying it locally can take quite a while.
To enhance parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-lined entries a bit trickier.
The two main issues are:
1) How do i speed up the file transfer (Compression?, Is transferring locally even neccessary?, Can I read an in use file some other way?)
2) How do i deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didnt do the obvious parse on the server reason is that I want to have as little cpu impact as possible. I don't want to affect the performance of the system im testing.

If you are reading a sequential file you want to read it in line by line over the network. You need a transfer method capable of streaming. You'll need to review your IO streaming technology to figure this out.
Large IO operations like this won't benefit much by multithreading since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.

The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.

The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.

I guess it depends on how "remote" it is. 100MB on a 100Mb LAN would be about 8 secs...up it to gigabit, and you'd have it in around 1 second. $50 * 2 for the cards, and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.

Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up, you just have to do it in a smart way -- have a deterministic algorithm for which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of a middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its record is not finished, it just has to read on until it reaches the end of the record. When one of the jobs determines that it's reading a record for which it doesn't have a beginning, it should skip the record.

If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.

I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.

i think using compression (deflate/gzip) would help

The given answer do not satisfy me and maybe my answer will help others to not think it is super complicated or multithreading wouldn't benefit in such a scenario. Maybe it will not make the transfer faster but depending on the complexity of your parsing it may make the parsing/or analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Are these information like statistics or are they dependent on multiple log message?
You have several options:
parse multiple files at the same would be the easiest I guess, you have the file as context and can create one thread per file
another option as mentioned before is use compression for the network communication
you could also use a helper that splits the log file into lines that belong together as a first step and then with multiple threads process these blocks of lines; the parsing of this depend lines should be quite easy and fast.
Very important in such a scenario is to measure were your actual bottleneck is. If your bottleneck is the network you wont benefit of optimizing the parser too much. If your parser creates a lot of objects of the same kind you could use the ObjectPool pattern and create objects with multiple threads. Try to process the input without allocating too much new strings. Often parsers are written by using a lot of string.Split and so forth, that is not really as fast as it could be. You could navigate the Stream by checking the coming values without reading the complete string and splitting it again but directly fill the objects you will need after parsing is done.
Optimization is almost always possible, the question is how much you get out for how much input and how critical your scenario is.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.