I have a problem with processing a number of binary files. I have many folders, each containing about 200 .bin files.
I pick 2 of these directories, save the paths of all the .bin files from those 2 directories into a List, and do some filtering on that list. At the end of this there are about 200 .bin files left in the list.
Then I iterate over all the filtered files and read the first 4x8 bytes from each (I tried both FileStream and BinaryReader). All of this takes about 2-6 seconds, but only the first time; after that it is fast enough. If nothing touches the files for a long time (about 30 minutes), the problem appears again.
So is it probably something to do with caching?
Can someone help me please? Thanks
It is quite possible that the handles to the files are disposed, so after a while the GC collects them and reopening takes longer; or simply that the OS loads the files into RAM and then serves them from there, which is why it is faster afterwards. But that is not the real issue: the process runs slowly because it is slow, and the fact that it is faster the second time isn't relevant, because you mustn't rely on that.
What I suggest is to parallelize the processing of those files as much as possible, to harness the full power of the hardware at hand.
Start by isolating the code that handles a single file, then run that code within a Parallel.ForEach and see if that helps.
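A minimal sketch of what I mean, assuming filteredFiles is your filtered list of paths and that the 32 bytes are the 4x8 bytes you mentioned:

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;

    List<string> filteredFiles = /* your filtered list of .bin paths */ new();
    var headers = new ConcurrentDictionary<string, byte[]>();

    Parallel.ForEach(filteredFiles, path =>
    {
        var buffer = new byte[32];   // first 4 x 8 bytes
        using var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read);
        fs.Read(buffer, 0, buffer.Length);
        headers[path] = buffer;      // ConcurrentDictionary is safe to write to from multiple threads
    });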
One possibility is that your drive is going to sleep (typically a drive will be configured to power down after 15-30 minutes of inactivity). This can add a significant delay (5 seconds would be a typical figure) while the hard drive is spun back up to speed.
Luckily, this is an easy thing to test. Just set the power-down time to, say, 6 hours, and then test if the behaviour has changed.
I asked this question earlier but it was closed because it wasn't "focused". So I have deleted that question to provide what I hope is a more focused question:
I have a task where I need to look for an image file in a folder on a network share. This folder can contain 1 to 2 million images, and some of them are 10 megabytes in size. I have no control over this folder, so I can't restructure it. I am just providing the application the customer uses to look for image files in this big folder.
I was going to use the C# File.Exists() method to look up the file.
Is the performance of File.Exists affected by the number of files in the directory and/or the size of those files?
The performance of File.Exists() mostly depends on the underlying file system (of the machine at the other end) and, of course, on the network. Any reasonable file system will implement it in such a way that file size doesn't matter.
However, the total number of files may affect performance, because of the indexing of a large number of directory entries. But again, a self-respecting file system will use some kind of logarithmic (or even constant-time) lookup, so it should be negligible: even for 5 million files and a logarithmic lookup, the file system has to examine at most about 23 entries, which is nothing. The network will definitely be the bottleneck here.
That being said, YMMV and I encourage you to simply measure it yourself.
In my experience the size of the images will not be a factor, but the number of them will be. Those folders are unreasonably large and are going to be slow for many different I/O operations, including just listing them.
That aside, this is such a simple operation to test you really should just benchmark it yourself. Creating a simple console application that can connect to the network folder and check for known existing files, and known missing files will give you an idea of the time per operation you're looking at. It's not like you have to do a ton of implementation in order to test a single standard library function.
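Something like this is enough to get rough numbers; the UNC path and file names are placeholders you would point at real files on the share:

    using System;
    using System.Diagnostics;
    using System.IO;

    class ExistsBenchmark
    {
        static void Main()
        {
            // Placeholder paths - replace with real files on the network share
            string existing = @"\\server\images\known-existing.jpg";
            string missing  = @"\\server\images\known-missing.jpg";

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < 100; i++)
            {
                File.Exists(existing);
                File.Exists(missing);
            }
            sw.Stop();
            Console.WriteLine($"Average per File.Exists call: {sw.Elapsed.TotalMilliseconds / 200:F2} ms");
        }
    }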
At work I have come across an error where a service I am running times out while waiting for a file to be created. What I am wondering is whether there is a way to measure, using an external application, how long files take to be completely created in a particular directory.
The accuracy does not need to be perfect; I just want a general indication of how quickly files are being generated. I was thinking that I could poll a directory for new files, record their size on first discovery, and wait until the file size stops growing, but this seems cumbersome and far from an ideal or accurate solution (a rough sketch of what I mean is below).
Suggestions would be greatly appreciated.
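For reference, this is roughly what I had in mind; the folder path and the polling interval are just placeholders:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Threading;

    class FileGrowthWatcher
    {
        static void Main()
        {
            string folder = @"C:\watched";                    // placeholder path
            var firstSeen = new Dictionary<string, DateTime>();
            var lastSize  = new Dictionary<string, long>();
            var done      = new HashSet<string>();

            while (true)
            {
                foreach (var file in Directory.GetFiles(folder))
                {
                    if (done.Contains(file)) continue;

                    long size = new FileInfo(file).Length;
                    if (!firstSeen.ContainsKey(file))
                    {
                        firstSeen[file] = DateTime.UtcNow;    // new file discovered
                        lastSize[file]  = size;
                    }
                    else if (size == lastSize[file])
                    {
                        // Size stopped growing between two polls: treat the file as complete
                        var elapsed = DateTime.UtcNow - firstSeen[file];
                        Console.WriteLine($"{file}: ~{elapsed.TotalSeconds:F1} s to complete");
                        done.Add(file);
                    }
                    else
                    {
                        lastSize[file] = size;                // still growing
                    }
                }
                Thread.Sleep(1000);                           // polling interval (placeholder)
            }
        }
    }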
I am really stumped by this problem, and as a result I have stopped working on it for a while. I work with really large amounts of data: I get approximately 200 GB of .txt data every week, and it can run to 500 million lines. A lot of these lines are duplicates; I would guess only about 20 GB is unique. I have had several custom programs written, including ones that remove duplicates with hashing and ones that do external deduplication, but none seem to work. The latest one used a temporary database but took several days to remove the duplicates.
The problem with all the programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered here before, and I have spent the last 3 hours reading about 50 threads, but none seem to have the same problem as me, i.e. huge datasets.
Can anyone recommend anything? It needs to be very accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.
The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 gb machine.
By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. pay) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundreds of dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
    fileNumber = 0
    while not end-of-input
        load as many lines as you can into a list
        sort the list
        filename = "file" + fileNumber
        write sorted list to filename, optionally removing duplicates
        fileNumber = fileNumber + 1
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
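A rough C# sketch of that split phase, assuming ordinal string comparison and a fixed lines-per-chunk count (both of those are choices you would tune for your data):

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class ChunkSorter
    {
        // Reads the input in chunks, sorts each chunk, drops duplicates within the
        // chunk, and writes it to a temporary file. Returns the temp file paths.
        public static List<string> SplitIntoSortedChunks(string inputFile, int linesPerChunk)
        {
            var tempFiles = new List<string>();
            using var reader = new StreamReader(inputFile);
            var chunk = new List<string>(linesPerChunk);
            string line;

            while ((line = reader.ReadLine()) != null)
            {
                chunk.Add(line);
                if (chunk.Count == linesPerChunk)
                {
                    tempFiles.Add(WriteSortedChunk(chunk));
                    chunk.Clear();
                }
            }
            if (chunk.Count > 0)
                tempFiles.Add(WriteSortedChunk(chunk));
            return tempFiles;
        }

        static string WriteSortedChunk(List<string> chunk)
        {
            chunk.Sort(StringComparer.Ordinal);
            string path = Path.GetTempFileName();
            using var writer = new StreamWriter(path);
            string previous = null;
            foreach (var line in chunk)
            {
                if (line != previous)      // skip duplicates within the sorted chunk
                    writer.WriteLine(line);
                previous = line;
            }
            return path;
        }
    }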
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
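If you're on .NET 6 or later, PriorityQueue<TElement, TPriority> makes the merge fairly direct. A sketch, again assuming ordinal comparison and the list of temporary files from the split phase:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class KWayMerge
    {
        public static void MergeUnique(List<string> tempFiles, string outputFile)
        {
            var readers = tempFiles.ConvertAll(f => new StreamReader(f));
            var queue = new PriorityQueue<StreamReader, string>(StringComparer.Ordinal);

            // Seed the queue with the first line of each temporary file
            foreach (var r in readers)
            {
                string first = r.ReadLine();
                if (first != null) queue.Enqueue(r, first);
            }

            using var writer = new StreamWriter(outputFile);
            string previous = null;

            while (queue.TryDequeue(out var reader, out var line))
            {
                if (line != previous)              // skip duplicates across files
                    writer.WriteLine(line);
                previous = line;

                string next = reader.ReadLine();
                if (next != null)
                    queue.Enqueue(reader, next);   // refill from the file we just consumed
                else
                    reader.Dispose();
            }
        }
    }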
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed up the program is to operate on smaller chunks and sort one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting; you end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
I am no specialist in such algorithms, but if it is textual data (or numbers, it doesn't matter), you can try reading your big file and distributing its lines into several files according to their first two or three characters: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" go to aab.txt, and so on. You'll get lots of files within which the data form an equivalence relation: any duplicate of a line is in the same file as the line itself. Now just deduplicate each file in memory and you're done.
Again, I'm not sure it will work, but I'd try this approach first...
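If you want to experiment with that idea, here is a rough sketch; the prefix length, bucket directory and the handling of short lines are arbitrary choices, and with many distinct prefixes you may need to cap the number of simultaneously open bucket files:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    static class PrefixPartitioner
    {
        public static void Deduplicate(string inputFile, string workDir, string outputFile, int prefixLength = 2)
        {
            Directory.CreateDirectory(workDir);
            var writers = new Dictionary<string, StreamWriter>();

            // Pass 1: scatter lines into bucket files keyed by their first characters,
            // so all copies of a given line land in the same bucket.
            foreach (var line in File.ReadLines(inputFile))
            {
                string key = line.Length >= prefixLength
                    ? line.Substring(0, prefixLength)
                    : line.PadRight(prefixLength, '_');
                string safe = string.Concat(key.Select(c => char.IsLetterOrDigit(c) ? c : '_'));
                if (!writers.TryGetValue(safe, out var w))
                    writers[safe] = w = new StreamWriter(Path.Combine(workDir, safe + ".txt"));
                w.WriteLine(line);
            }
            foreach (var w in writers.Values) w.Dispose();

            // Pass 2: each bucket should now be small enough to deduplicate in memory
            using var output = new StreamWriter(outputFile);
            foreach (var bucket in Directory.GetFiles(workDir, "*.txt"))
                foreach (var line in File.ReadLines(bucket).Distinct())
                    output.WriteLine(line);
        }
    }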
I have a web application, and the requirement is that we need to load millions of byte arrays into memory to supply them to an SDK method that takes an IEnumerable as its argument. The problem is converting such a huge number of files into byte arrays (each file to a byte[]); there are about 10 million such files, and they take a lot of time and memory to load. How can I accomplish this task? Any suggestion would be greatly appreciated.
This is very probably not a good idea.
It's probably best to save your data in files, load a file in memory when you need it, and keep a cache of the n most recently used files. That way you can manage the amount of memory you consume and your server won't be bogged down by what you are doing.
You didn't mention how large the files are, BTW, but file systems are pretty fast now and in combination with that cache, performance will probably be acceptable. I would test this scenario before trying anything funny in-memory.
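A minimal sketch of that kind of most-recently-used cache, assuming byte[] payloads and a fixed entry count (just the idea, not production-grade eviction):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class FileCache
    {
        private readonly int _capacity;
        private readonly Dictionary<string, LinkedListNode<(string Path, byte[] Data)>> _map = new();
        private readonly LinkedList<(string Path, byte[] Data)> _lru = new();

        public FileCache(int capacity) => _capacity = capacity;

        public byte[] Get(string path)
        {
            if (_map.TryGetValue(path, out var node))
            {
                _lru.Remove(node);                     // move to the front: most recently used
                _lru.AddFirst(node);
                return node.Value.Data;
            }

            byte[] data = File.ReadAllBytes(path);     // cache miss: load from disk
            _map[path] = _lru.AddFirst((path, data));

            if (_map.Count > _capacity)                // evict the least recently used entry
            {
                var last = _lru.Last;
                _lru.RemoveLast();
                _map.Remove(last.Value.Path);
            }
            return data;
        }
    }

Usage would be something like var cache = new FileCache(1000); followed by byte[] bytes = cache.Get(path); wherever a file is needed.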
10 million files of 2 KB each is 20 gigabytes of data. Even if it was in a single file it'd take on the order of three minutes to load at the typical disk transfer speed of 100 megabytes per second. But because you're opening 10 million individual files it's going to take a lot longer.
If those 10 million files are in a single directory it's going to take even longer. NTFS does not perform well when you have that many files in a single directory.
If the files are in a single directory, I'd suggest splitting them up. You're better off having fewer than 10,000 (and preferably fewer than 1,000) files in a single directory. Create a directory hierarchy to hold the files.
That still leaves you with having to open 10 million individual files. If the data doesn't change often, you should create a single binary file that contains the file names and the associated data. You'd have to recreate that file every time one of the constituent files changes, but you already have to restart your application if one of the files changes.
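A sketch of one way to pack the files; the name/length/data layout here is just an illustration, not a recommendation of a particular format:

    using System;
    using System.Collections.Generic;
    using System.IO;

    static class FilePacker
    {
        // Writes each file as: name (length-prefixed by BinaryWriter), payload length, payload bytes
        public static void Pack(string sourceDir, string packedFile)
        {
            using var writer = new BinaryWriter(File.Create(packedFile));
            foreach (var path in Directory.EnumerateFiles(sourceDir))
            {
                byte[] data = File.ReadAllBytes(path);
                writer.Write(Path.GetFileName(path));
                writer.Write(data.Length);
                writer.Write(data);
            }
        }

        // Reads the packed file back, one entry at a time
        public static IEnumerable<(string Name, byte[] Data)> Unpack(string packedFile)
        {
            using var reader = new BinaryReader(File.OpenRead(packedFile));
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                string name = reader.ReadString();
                int length = reader.ReadInt32();
                yield return (name, reader.ReadBytes(length));
            }
        }
    }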
But all told, I really don't understand why you want to load all this data into memory. If your Web app is going to squirt this down the pipe to some requesting application, the data transfer time will be, at best, the same speed as reading the data from a file. So you're better off having something that reads the data from the file and streams it to the requesting application.
If your application requires that this 20 GB be in memory so that you can send it to the requesting app, then there's probably something seriously wrong with your application design.
One more thing: as I recall, IIS recycles processes from time to time. If your Web app is idle for a long period, then IIS might very well flush it from memory. So the next time somebody makes a request to your application, it will have to reload the data. If you want the data to truly be persistent, you probably want a Windows service that will load the data and keep it in memory. The Web app can query the service for the data when it needs to.
Foreseeable issues:
Performance: Sequential serialization of a large amount of files can be time consuming.
RAM: Payload total size may request critical amounts of memory.
Possible solutions:
Distribute the serialization task. Spawn worker threads, set processor affinity for each in order to evenly distribute workload. Your disk/repository I/O will probably be the bottleneck.
Implement paging. Don't try to load everything into memory; serialize blocks on demand. As long as your serialization is faster than the available physical network bandwidth, there will be no 'buffer underrun' situations, i.e. an empty channel waiting for the server to answer. That way your process may even start replying sooner than if you tried to do the full serialization before starting to transmit (see the sketch after this list).
Cache as much as you want, as little as you can. Don't redo costly work.
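A sketch of that on-demand idea, assuming the SDK method really does accept an IEnumerable<byte[]> and enumerates it lazily (worth verifying, since a method that materializes the whole sequence internally defeats the purpose):

    using System.Collections.Generic;
    using System.IO;

    static class LazyFileSource
    {
        // Yields one file at a time; nothing is read from disk until the SDK
        // enumerates the sequence, and only one file's bytes are held at once.
        public static IEnumerable<byte[]> StreamFiles(string directory)
        {
            foreach (var path in Directory.EnumerateFiles(directory))
                yield return File.ReadAllBytes(path);
        }
    }

The call site would then pass LazyFileSource.StreamFiles(folder) to the SDK method instead of a pre-built list.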
That said...
...I completely agree with Roy Dictus and the others - seems a very bad model to me.
I'm not really sure what causes this so please forgive me if I couldn't find the information I needed in a search. Here is an example:
Let's say that we have a folder with 1,000,000 files. Running Directory.GetFiles() on it will take a few minutes; however, running it again right afterwards takes only a few seconds. Why does this happen? Are the results being cached somewhere? How can I get it to run with the original, uncached timing again?
Hard drives have internal caches that will help speed up subsequent reads. Try reading a bunch of other directory information in a completely different sector to clear the cache.
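If you want to see the effect yourself, timing two consecutive calls makes the difference obvious; the folder path below is just a placeholder:

    using System;
    using System.Diagnostics;
    using System.IO;

    class Program
    {
        static void Main()
        {
            string folder = @"D:\million-files";   // placeholder path

            var sw = Stopwatch.StartNew();
            Directory.GetFiles(folder);
            Console.WriteLine($"Cold run: {sw.Elapsed}");

            sw.Restart();
            Directory.GetFiles(folder);
            Console.WriteLine($"Warm run: {sw.Elapsed}");   // typically much faster
        }
    }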