I have an HDD (say 1 TB) with FAT32 and NTFS partitions, and I don't have any information about which files are stored on it, but when needed I want to quickly find large files, say over 500 MB. I don't want to scan the whole HDD since that is very time consuming; I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can call directly - I mean, if the file systems keep metadata about the files they store, then using it would be quicker. I want to write my program in C++ and C#.
EDIT
If scanning the HDD is the only option, what can I do to ensure the best performance? For example, I could skip scanning system folders, since I am only interested in user data.
If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that tells which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas of the large files, you can read the FAT in as a number of raw sectors and scan through that data to find linked lists of more than X clusters (where X defines the lower limit for what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain are fixed-size records in a specified format. You have to start from the root directory and read through the directory tree to find file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) that contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult -- and writing data much more difficult. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on what partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\\.\D:" (where D is the letter of the drive you care about). You can then read raw sectors from that drive (assuming that opening it worked). This will let you see the raw data for the entire disk (or partition, as the case may be) starting from the boot sector and going through everything else that's there (FAT, root directory, subdirectories, etc. -- all as sectors of raw data). The system will let you read the raw data, but all the work needed to make any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility -- but it'll take a fair amount of work for FAT volumes, and considerably more than that for NTFS. Unless you really do need the extra speed you've described, it's probably not even worth considering.
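For illustration, here is a minimal C# sketch of opening a volume raw and peeking at its boot sector. It assumes drive C:, administrator rights, and FAT-style boot sector offsets; everything beyond this point (reading the FAT itself, following cluster chains) is still entirely up to you.

using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class RawVolumeReader
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x00000001;
    const uint FILE_SHARE_WRITE = 0x00000002;
    const uint OPEN_EXISTING = 3;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string fileName, uint desiredAccess, uint shareMode, IntPtr securityAttributes,
        uint creationDisposition, uint flagsAndAttributes, IntPtr templateFile);

    static void Main()
    {
        // Open the volume itself ("\\.\C:"), not a file on it. Requires admin rights,
        // and all reads must be done in whole, sector-aligned chunks.
        SafeFileHandle handle = CreateFile(@"\\.\C:", GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE, IntPtr.Zero, OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid)
            throw new Win32Exception(Marshal.GetLastWin32Error());

        using (var volume = new FileStream(handle, FileAccess.Read))
        {
            byte[] bootSector = new byte[512];               // first sector of the volume
            volume.Read(bootSector, 0, bootSector.Length);

            // For a FAT volume, bytes-per-sector lives at offset 11 of the boot sector
            // and sectors-per-cluster at offset 13.
            ushort bytesPerSector = BitConverter.ToUInt16(bootSector, 11);
            byte sectorsPerCluster = bootSector[13];
            Console.WriteLine("Bytes/sector: {0}, sectors/cluster: {1}",
                bytesPerSector, sectorsPerCluster);
        }
    }
}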
If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here, you can find information about the search indexer. The search indexer does index the file size, so it may do what you want.
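For example, the index can be queried over OLE DB. A minimal C# sketch follows; it only sees files in locations the indexer actually covers, and 524288000 bytes here stands in for the 500 MB threshold:

using System;
using System.Data.OleDb;

class LargeFileSearch
{
    static void Main()
    {
        // Windows Search (Vista and later) exposes its index through an OLE DB provider.
        // System.ItemPathDisplay and System.Size are Windows property system names.
        string connectionString =
            "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
        string query =
            "SELECT System.ItemPathDisplay, System.Size FROM SystemIndex " +
            "WHERE System.Size > 524288000";

        using (var connection = new OleDbConnection(connectionString))
        using (var command = new OleDbCommand(query, connection))
        {
            connection.Open();
            using (OleDbDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine("{0}  ({1} bytes)", reader[0], reader[1]);
            }
        }
    }
}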
Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.
Your only prayer is to latch onto a file indexer; otherwise you will have to iterate through all the files. Depending on your computer, you might be able to latch onto the native Microsoft indexer (searchindexer.exe), or if you have Google Desktop Search you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer
Related
I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file over a network. The folder this file is in is on a network share, and the folder can hold 1 to 2 million images. Some of these images are 10 megabytes big. I have no control over this folder, so I can't restructure it. I am just providing the customer with an application that looks for image files in this big folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and of course the network. Any reasonable file system will implement it in such a way that size won't matter.
However, the total number of files may affect performance, because of the indexing of a large number of entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so it should be negligible (even with 5 million files and a logarithmic lookup, the FS has to scan at most about 23 entries, which is nothing). The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test you really should just benchmark it yourself. Creating a simple console application that can connect to the network folder and check for known existing files, and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
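Something along these lines would do; the UNC paths are just placeholders for a file you know exists and one you know is missing on the share:

using System;
using System.Diagnostics;
using System.IO;

class ExistsBenchmark
{
    static void Main()
    {
        // Hypothetical probe paths: replace with real files on the big share.
        string[] probes =
        {
            @"\\server\images\known-existing.jpg",
            @"\\server\images\definitely-missing.jpg"
        };

        foreach (string path in probes)
        {
            Stopwatch sw = Stopwatch.StartNew();
            bool exists = File.Exists(path);
            sw.Stop();
            Console.WriteLine("{0}: exists={1}, {2} ms", path, exists, sw.ElapsedMilliseconds);
        }
    }
}

Run each probe a few times: the first call pays for name resolution and connection setup, while later calls show you the steady-state cost per lookup.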
I've been looking around for a way to concatenate large files (a few gigabytes) together without having to rewrite one of the files. I am sure the OS does this internally when manipulating the master file table. This is purely for an internal application where speed is critical even at the cost of data integrity (in case of risking undocumented APIs). The app processes a large amount of high-bandwidth, multi-channel ethernet data where a corrupt unit of work (file in this case) will not have a large impact on overall processing results.
At the moment, when combining files A and B, the effort involved is equal to A[Read] + B[Read] + C[Write]. Would any of you NT gurus shed some light on how to work around this by getting at the MFT directly?
I have not been able to gain any clues as to which API to explore and would appreciate some pointers. Although the app is managed, I would gladly explore native APIs and even set up lightweight VMs for testing.
Thanks in advance.
If you are appending File B to File A, all you have to do is open File A for write+append, seek to the end of the file, then read from B and write to A.
If you want to create File C as the concatenation of File A and File B, then you are going to have to create File C and copy A to C, then B to C.
There aren't any shortcuts.
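A minimal C# sketch of the first case, appending B onto the end of A so that only B's bytes are read and written:

using System.IO;

static class FileAppend
{
    // Appends the contents of sourcePath to the end of targetPath without
    // rewriting the data that is already in targetPath.
    public static void AppendTo(string targetPath, string sourcePath)
    {
        using (var target = new FileStream(targetPath, FileMode.Append, FileAccess.Write))
        using (var source = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
        {
            source.CopyTo(target);   // only sourcePath's bytes cross the bus
        }
    }
}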
That's not really something a file system would do. File systems allocate space for files in terms of clusters and blocks of data, not in terms of bytes. Concatenating two files like this would only work if they were both multiples of the cluster size, and the FS might have other assumptions about how blocks are allocated to files under the covers. You might be able to do this to the file system yourself if you dismounted it and wrote a tool to directly manipulate all the file system structures. But you're risking corrupting the whole disk if you do that, not just a single file.
I don't know your exact situation, but would it be possible not to append the files together at all? Just keep throwing files into some directory as you receive data, and keep an index.
Then, as the data is needed, use the index to piece the data together into one new file?
That way you only ever do the expensive file merging on demand.
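A rough sketch of that idea; the "index" here is just an in-memory list of segment paths, so persisting it between runs is left to you:

using System.Collections.Generic;
using System.IO;

class SegmentIndex
{
    // Segment files that make up one logical unit of work, in arrival order.
    readonly List<string> segments = new List<string>();

    public void Add(string segmentPath)
    {
        segments.Add(segmentPath);
    }

    // Only pay the copy cost when somebody actually asks for the combined file.
    public void MergeTo(string outputPath)
    {
        using (var output = new FileStream(outputPath, FileMode.Create, FileAccess.Write))
        {
            foreach (string segment in segments)
                using (var input = File.OpenRead(segment))
                    input.CopyTo(output);
        }
    }
}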
We are using the #ziplib (found here) in an application that synchronizes files from a server for an occasionally connected client application.
My question is, with this algorithm, when does it become worthwhile to spend the execution time to do the actual zipping of files? Presumably, if only one small text file is being synchronized, the time to zip would not sufficiently reduce the size of the transfer and would actually slow down the entire process.
Since the zip time profile is going to change based on the number of files, the types of files and the size of those files, is there a good way to discover programmatically when I should zip the files and when I should just pass them as is? In our application, files will almost always be photos though the type of photo and size may well change.
I haven't written the actual file transfer logic yet, but I expect to use System.Net.WebClient to do this, though I am open to alternatives that save on execution time as well.
UPDATE: As this discussion develops, is "to zip, or not to zip" the wrong question? Should the focus be on replacing the older System.Net.WebClient method with compressed WCF traffic or something similar? The database synchronization portion of this utility already uses Microsoft Synchronization Framework and WCF, so I am certainly open to that. Anything we can do now to limit network traffic is going to be huge for our clients.
To determine whether it's useful to compress a file, you have to read the file anyway. While you're at it, you might as well just zip it.
If you want to prevent useless zipping without reading the files, you could try to decide beforehand, based on other properties.
You could create an 'algorithm' that decides whether it's useful, for example based on file extension and size. So a .txt file of more than 1 KB can be zipped, but a .jpg file shouldn't be, regardless of its size. But it's a lot of work to create such a list (you could also create a blacklist or whitelist and allow or deny, respectively, all files not on the list).
You probably have plenty of CPU time, so the only issue is: does it shrink?
If you can shrink the file, you will save on (disk and network) I/O. That becomes profitable very quickly.
Alas, photos (jpeg) are already compressed so you probably won't see much gain.
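If you do want a quick "does it shrink?" check before committing to a full zip, one hedged approach is to deflate a small sample of the file; already-compressed formats like JPEG will barely budge. The 64 KB sample size and 10% threshold below are arbitrary assumptions:

using System.IO;
using System.IO.Compression;

static class CompressibilityCheck
{
    // Rough check: deflate the first 64 KB of the file and see whether it shrinks noticeably.
    public static bool LooksCompressible(string path, double minimumSaving = 0.1)
    {
        byte[] sample = new byte[64 * 1024];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(sample, 0, sample.Length);

        using (var compressed = new MemoryStream())
        {
            using (var deflate = new DeflateStream(compressed, CompressionMode.Compress, leaveOpen: true))
                deflate.Write(sample, 0, read);
            return compressed.Length < read * (1.0 - minimumSaving);
        }
    }
}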
You can write your own pretty simple heuristic analysis and then reuse it as each subsequent file is processed. The collected statistics should be saved to keep the efficiency between restarts.
Basic interface:
public enum FileContentType
{
    PlainText,
    OfficeDoc,
    OfficeXlsx
}

// Name is ugly, so find a better one
public interface IHeuristicZipAnalyzer
{
    bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType);
    void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize);
}
Then you can collect statistics by adding information about each file you have just zipped using AddInfo(...), and based on that you can determine whether it is worth zipping the next file by calling IsWorthToZip(...).
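A minimal sketch of one possible implementation: it tracks the cumulative compression ratio per content type and only keeps zipping while that has been paying off. The 1 KB and 10% thresholds are arbitrary, and persisting the statistics between restarts is omitted:

using System.Collections.Generic;

public class HeuristicZipAnalyzer : IHeuristicZipAnalyzer
{
    private class Stats { public long OriginalBytes; public long ZippedBytes; }

    private readonly Dictionary<FileContentType, Stats> history =
        new Dictionary<FileContentType, Stats>();

    public bool IsWorthToZip(int fileSizeInBytes, FileContentType contentType)
    {
        if (fileSizeInBytes < 1024)
            return false;                              // tiny files: zip overhead dominates

        Stats stats;
        if (!history.TryGetValue(contentType, out stats) || stats.OriginalBytes == 0)
            return true;                               // no history yet, so try zipping once

        double ratio = (double)stats.ZippedBytes / stats.OriginalBytes;
        return ratio < 0.9;                            // keep zipping only while we save >10%
    }

    public void AddInfo(FileContentType contentType, int fileSizeInBytes, int finalZipSize)
    {
        Stats stats;
        if (!history.TryGetValue(contentType, out stats))
            history[contentType] = stats = new Stats();

        stats.OriginalBytes += fileSizeInBytes;
        stats.ZippedBytes += finalZipSize;
    }
}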
I'd like to have a go at writing something which shows the state of a hard drive in terms of how fragmented it is. Maybe even have a go at defragmenting it.
But I've realised that I don't fully understand how this works.
Can anyone explain this to me and perhaps offer some suggestions of where I might start?
I mainly use C# - would this be a suitable language for putting something together?
Thanks in advance
Please begin with the Wikipedia Article on Disk Fragmentation
Then after that, it depends on how low-level you want to go.
First, for the official how-to, see Defragmenting Files on MSDN.
From the article....
1. Use the FSCTL_GET_VOLUME_BITMAP control code to find a place on the volume that is large enough to accept an entire file.
Note: If necessary, move other files to make a place that is large enough. Ideally, there are enough unallocated clusters after the first extent of the file that you can move subsequent extents into the space after the first extent.
2. Use the FSCTL_GET_RETRIEVAL_POINTERS control code to get a map of the current layout of the file on the disk.
3. Walk the RETRIEVAL_POINTERS_BUFFER structure returned by FSCTL_GET_RETRIEVAL_POINTERS.
4. Use the FSCTL_MOVE_FILE control code to move each cluster as you walk the structure.
Note: You may need to renew either the bitmap or the retrieval structure, or both, at various times as other processes write to the disk.
For a C# wrapper of the above, check out this blog post.
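As a taste of what the retrieval-pointers call looks like from C#, here is a rough P/Invoke sketch that just counts how many extents (fragments) a single file occupies. It assumes a 1 MB output buffer is large enough and skips the ERROR_MORE_DATA retry loop (and the resident-file case) that a real tool would need:

using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class FragmentCounter
{
    const uint FSCTL_GET_RETRIEVAL_POINTERS = 0x00090073;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool DeviceIoControl(
        SafeFileHandle hDevice, uint dwIoControlCode,
        ref long inBuffer, int inBufferSize,
        IntPtr outBuffer, int outBufferSize,
        out int bytesReturned, IntPtr overlapped);

    // Returns the number of extents the file occupies; 1 means it is not fragmented.
    public static int CountFragments(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            long startingVcn = 0;                        // STARTING_VCN_INPUT_BUFFER
            int bufferSize = 1024 * 1024;                // assume this holds the whole extent list
            IntPtr buffer = Marshal.AllocHGlobal(bufferSize);
            try
            {
                int bytesReturned;
                if (!DeviceIoControl(fs.SafeFileHandle, FSCTL_GET_RETRIEVAL_POINTERS,
                        ref startingVcn, sizeof(long),
                        buffer, bufferSize, out bytesReturned, IntPtr.Zero))
                    throw new Win32Exception(Marshal.GetLastWin32Error());

                // RETRIEVAL_POINTERS_BUFFER begins with a DWORD ExtentCount.
                return Marshal.ReadInt32(buffer);
            }
            finally
            {
                Marshal.FreeHGlobal(buffer);
            }
        }
    }
}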
Finally, depending on your situation, you can use the WMI Defrag method on the Win32_Volume class.
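A hedged sketch of the WMI route, assuming drive C:, a reference to System.Management, and the Win32_Volume Defrag method's Force parameter; it may refuse to run if free space on the volume is too low:

using System;
using System.Management;

class WmiDefrag
{
    static void Main()
    {
        var searcher = new ManagementObjectSearcher(
            "SELECT * FROM Win32_Volume WHERE DriveLetter = 'C:'");
        foreach (ManagementObject volume in searcher.Get())
        {
            ManagementBaseObject inParams = volume.GetMethodParameters("Defrag");
            inParams["Force"] = false;                   // don't force when free space is low
            ManagementBaseObject outParams = volume.InvokeMethod("Defrag", inParams, null);
            Console.WriteLine("Defrag returned: " + outParams["ReturnValue"]);
        }
    }
}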
Hope this helps.
To show the fragmentation state of a filesystem, you would have to find out which blocks of the disk belong to which files. All files that do not solely consist of consecutive blocks are fragmented; they contain holes and/or the blocks are scattered over the disk.
To defragment a filesystem you would have to move around the blocks so that all files are consecutive and rewrite the metadata to have the filesystem in a consistent state in the end.
When a file is saved, its bytes are put into allocated blocks. If the file grows and the next consecutive block is not available, the OS starts writing to the next available block elsewhere, splitting the file into two fragments.
Defragmentation collects files into consecutive blocks by moving other blocks out of the way (into free space) so that the file being defragmented can occupy consecutive blocks. For non-solid-state hard drives this improves performance (as there is no seek time when reading consecutive blocks).
Some defragmenters move commonly read files to the outside of the disk (since the linear velocity, and therefore the data rate, is higher the further the data is from the spindle).
I have a problem which requires me to parse several log files from a remote machine.
There are a few complications:
1) The file may be in use
2) The files can be quite large (100 MB+)
3) Each entry may be multi-line
To solve the in-use issue, I need to copy it first. I'm currently copying it directly from the remote machine to the local machine, and parsing it there. That leads to issue 2. Since the files are quite large copying it locally can take quite a while.
To reduce parsing time, I'd like to make the parser multi-threaded, but that makes dealing with multi-line entries a bit trickier.
The two main issues are:
1) How do I speed up the file transfer? (Compression? Is transferring it locally even necessary? Can I read an in-use file some other way?)
2) How do I deal with multi-line entries when splitting up the lines among threads?
UPDATE: The reason I didn't do the obvious thing and parse on the server is that I want to have as little CPU impact as possible. I don't want to affect the performance of the system I'm testing.
If you are reading a sequential file, you want to read it line by line over the network. You need a transfer method capable of streaming; you'll need to review your I/O streaming options to figure this out.
Large I/O operations like this won't benefit much from multithreading, since you can probably process the items as fast as you can read them over the network.
Your other great option is to put the log parser on the server, and download the results.
The better option, from the perspective of performance, is going to be to perform your parsing at the remote server. Apart from exceptional circumstances the speed of your network is always going to be the bottleneck, so limiting the amount of data that you send over your network is going to greatly improve performance.
This is one of the reasons that so many databases use stored procedures that are run at the server end.
Improvements in parsing speed (if any) through the use of multithreading are going to be swamped by the comparative speed of your network transfer.
If you're committed to transferring your files before parsing them, an option that you could consider is the use of on-the-fly compression while doing your file transfer.
There are, for example, sftp servers available that will perform compression on the fly.
At the local end you could use something like libcurl to do the client side of the transfer, which also supports on-the-fly decompression.
The easiest way considering you are already copying the file would be to compress it before copying, and decompress once copying is complete. You will get huge gains compressing text files because zip algorithms generally work very well on them. Also your existing parsing logic could be kept intact rather than having to hook it up to a remote network text reader.
The disadvantage of this method is that you won't be able to get line by line updates very efficiently, which are a good thing to have for a log parser.
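For instance, gzip via the built-in System.IO.Compression.GZipStream is enough for this (SharpZipLib works just as well): compress on the remote side, copy the .gz, decompress locally. Plain-text logs typically shrink to a fraction of their original size.

using System.IO;
using System.IO.Compression;

static class LogTransfer
{
    // Compress the log before copying it off the remote machine.
    public static void Compress(string sourcePath, string gzPath)
    {
        using (var input = File.OpenRead(sourcePath))
        using (var output = File.Create(gzPath))
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
            input.CopyTo(gzip);
    }

    // Decompress it again once the copy has arrived locally.
    public static void Decompress(string gzPath, string targetPath)
    {
        using (var input = File.OpenRead(gzPath))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = File.Create(targetPath))
            gzip.CopyTo(output);
    }
}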
I guess it depends on how "remote" it is. 100 MB on a 100 Mb LAN would take about 8 seconds; up it to gigabit and you'd have it in around 1 second. $50 * 2 for the cards and $100 for a switch would be a very cheap upgrade you could do.
But, assuming it's further away than that, you should be able to open it with just read mode (as you're reading it when you're copying it). SMB/CIFS supports file block reading, so you should be streaming the file at that point (of course, you didn't actually say how you were accessing the file - I'm just assuming SMB).
Multithreading won't help, as you'll be disk or network bound anyway.
Use compression for transfer.
If your parsing is really slowing you down, and you have multiple processors, you can break the parsing job up; you just have to do it in a smart way -- have a deterministic algorithm for deciding which workers are responsible for dealing with incomplete records. Assuming you can determine that a line is part of the middle of a record, for example, you could break the file into N/M segments, each responsible for M lines; when one of the jobs determines that its last record is not finished, it just has to read on until it reaches the end of that record. When one of the jobs determines that it's reading a record whose beginning it doesn't have, it should skip that record.
If you can copy the file, you can read it. So there's no need to copy it in the first place.
EDIT: use the FileStream class to have more control over the access and sharing modes.
new FileStream("logfile", FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
should do the trick.
I've used SharpZipLib to compress large files before transferring them over the Internet. So that's one option.
Another idea for 1) would be to create an assembly that runs on the remote machine and does the parsing there. You could access the assembly from the local machine using .NET remoting. The remote assembly would need to be a Windows service or be hosted in IIS. That would allow you to keep your copies of the log files on the same machine, and in theory it would take less time to process them.
I think using compression (deflate/gzip) would help.
The given answers do not satisfy me, and maybe my answer will help others not to think this is super complicated or that multithreading won't benefit such a scenario. Maybe it will not make the transfer faster, but depending on the complexity of your parsing it may make the parsing and/or analysis of the parsed data faster.
It really depends upon the details of your parsing. What kind of information do you need to get from the log files? Is the information statistical in nature, or does it depend on multiple log messages?
You have several options:
parsing multiple files at the same time would be the easiest, I guess; you have the file as context and can create one thread per file
another option, as mentioned before, is to use compression for the network communication
you could also use a helper that, as a first step, splits the log file into blocks of lines that belong together and then processes these blocks with multiple threads; splitting out the dependent lines should be quite easy and fast (see the sketch after this list)
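As a rough illustration of that last option, here is a sketch with a single reader grouping lines into entries and a pool of workers parsing whole entries, so multi-line entries never get split across threads. The rule that an entry starts with a digit (a timestamp, say) is an assumption you would replace with your real log format:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class BlockedLogParser
{
    // Assumed grouping rule: a line starting with a digit begins a new entry;
    // any other line is a continuation of the previous entry.
    static bool StartsNewEntry(string line)
    {
        return line.Length > 0 && char.IsDigit(line[0]);
    }

    public static void Parse(string path, Action<IReadOnlyList<string>> parseEntry)
    {
        var entries = new BlockingCollection<List<string>>(boundedCapacity: 1000);

        // Stage 1: a single reader groups raw lines into multi-line entries.
        Task reader = Task.Run(() =>
        {
            List<string> current = null;
            foreach (string line in File.ReadLines(path))
            {
                if (StartsNewEntry(line))
                {
                    if (current != null) entries.Add(current);
                    current = new List<string>();
                }
                if (current != null) current.Add(line);   // drop continuation lines with no header
            }
            if (current != null) entries.Add(current);
            entries.CompleteAdding();
        });

        // Stage 2: several workers parse whole entries in parallel.
        Parallel.ForEach(entries.GetConsumingEnumerable(),
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            entry => parseEntry(entry));

        reader.Wait();
    }
}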
Very important in such a scenario is to measure where your actual bottleneck is. If your bottleneck is the network, you won't benefit much from optimizing the parser. If your parser creates a lot of objects of the same kind, you could use the ObjectPool pattern and create the objects from multiple threads. Try to process the input without allocating too many new strings. Parsers are often written using a lot of string.Split and so forth, which is not really as fast as it could be. You could navigate the stream by checking the upcoming values without reading the complete string and splitting it again, and instead directly fill the objects you will need once parsing is done.
Optimization is almost always possible; the question is how much you get out for how much you put in, and how critical your scenario is.