Remove All Duplicates In A Large Text File

Remove All Duplicates In A Large Text File - c#

I am really stumped at this problem and as a result I have stopped working for a while. I work with really large pieces of data. I get approx 200gb of .txt data every week. The data can range up to 500 million lines. A lot of these are duplicate. I would guess only 20gb is unique. I have had several custom programs made including hash remove duplicates, external remove duplicates but none seem to work. The latest one was using a temp database but took several days to remove the data.
The problem with all the programs is that they crash after a certain point and after spending a large amount of money on these programs I thought I would come online and see if anyone can help. I understand this has been answered on here before and I have spent the last 3 hours reading about 50 threads on here but none seem to have the same problem as me i.e huge datasets.
Can anyone recommend anything for me? It needs to be super accurate and fast. Preferably not memory based as I only have 32gb of ram to work with.

The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 gb machine.
By the way, you might be able to do this with an off-the-shelf program. Certainly the GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. That utility has a --unique option, so you should be able to just sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then slowly working up to larger files.
With GNU sort and the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk, or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
There might also be a commercial (i.e. pay) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundreds of dollars, you're probably money ahead if your time is worth anything.
If you can't use a ready made program, then . . .
If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.
If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.
In pseudocode, it looks something like this:
fileNumber = 0
while not end-of-input
load as many lines as you can into a list
sort the list
filename = "file"+fileNumber
write sorted list to filename, optionally removing duplicates
fileNumber = fileNumber + 1
You don't really have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.
Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.
You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file and put it into the queue along with a reference to the file that it came from. Then, you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.
Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
I published a series of articles some time back about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).
One thing you can do to speed the program is operate on smaller chunks and be sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that extra time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.
Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.

I am no specialist in such algorithms, but if it is a textual data (or numbers, doesn't matter), you can try to read your big file and write it into several files by first two or three symbols: all lines starting with "aaa" go to aaa.txt, all lines starting with "aab" - to aab.txt, etc. You'll get lots of files within which the data are in the equivalence relation: a duplicate to a word is in the same file as the word itself. Now, just parse each file in the memory and you're done.
Again, not sure that it will work, but i'd try this approach first...

Related

Slow reading hundreds of files

I have problem with processing of more binary files. I have many many folders, in each there is about 200 bin files.
I choose 2 of these directories, then all bin files (their paths) from these 2 directories save to List, and make some filtering with this list. At the end of this, in list is about 200 bin files.
Then I'm iterating over all filtered files, and from each read first 4x8 Bytes (I tried FileStream or BinaryReader). All this operations take about 2-6 seconds, but only for the first time. Next time it's fast enough. If nothing happens with files for a long time (about 30 minutes), the problem appears again.
So probably it's something about caching or what?
Can someone help me please? Thanks

It is very possible that the handles to the files are disposed and that's why after a while the GC removes them and it takes longer or simply that the files are loaded in RAM by the OS and then it serves them to you from there and that's why it is faster, but that is not the issue, the process runs slow because it is slow, it isn't relevant that it is faster the 2nd time because you mustn't rely on that.
What i suggest is to parallellise as much as possible the processing of those files to be able to harness the full power of the hardware at hand.
Start by isolating the code that handles a file and then run the code within a Parallel.ForEach and see if that helps.

One possibility is that your drive is going to sleep (typically a drive will be configured to power down after 15-30 minutes). This can add a significant delay (5 seconds would be a typical figure) as the hard-drive is span back up to speed.
Luckily, this is an easy thing to test. Just set the power-down time to, say, 6 hours, and then test if the behaviour has changed.

Sorting gigantic binary files with C#

I have a large file of roughly 400 GB of size. Generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters and to read the whole file you would just repeat until EOF. With each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a period of time represented by .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I load the Ticks into a struct and the byte[n] start and end positions and read to the end of the file. After this, I sort the List in memory by the Ticks property and then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?

At great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits from the file it needs to. So do the same thing as you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).

Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort

If you can learn Erlang or Go, they could be very powerful and scale extremely well, as you have 24 threads. Utilize Async I/O. Merge Sort.
And since you have 32GB of Ram, try to load as much as you can into RAM and sort it there then write back to disk.

I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory. Then concatenate the files.
It is somewhat similar to the hashsort (I think).

Multiple iterations through a file structure (C#)

I am writing a program which iterates through the file system multiple times using simple loops and recursion.
The problem is that, because I am iterating through multiple times, it is taking a long time because (I guess) the hard drive can only work at a certain pace.
Is there any way to optimize this process? Maybe by iterating though once, saving all the relevant information in a collection and then referring to the collection when I need to?
I know I can cache my results like this but I have absolutely no idea how to go about it.
Edit:
There are three main pieces of information I am trying to obtain from a given directory:
The size of the directory (the sum of the size of each file within that directory)
The number of files within the directory
The number of folders within the directory
All of the above includes sub-directories too. Currently, I am performing an iteration of a given directory to obtain each piece of information, i.e. three iterations per directory.
My output is basically a spreadsheet which looks like this:

To improve performance, you could access the Master File Table (MFT) of the NTFS file system directly. There is a excellent code sample on MSDN social forum.
It seems that accessing the MFT is about 10x faster than enumerating the file system using FindFirst/FindNext file.
Hope, this helps.

Yes anything you can do to minimize hard drive I/O will improve the performance. I would also suggest putting in a Stopwatch and measure the time it takes so you can get a sense of how your improvements are affecting the speed.

Preventing Memory issues when handling large amounts of text

I have written a program which analyzes a project's source code and reports various issues and metrics based on the code.
To analyze the source code, I load the code files that exist in the project's directory structure and analyze the code from memory. The code goes through extensive processing before it is passed to other methods to be analyzed further.
The code is passed around to several classes when it is processed.
The other day I was running it on one of the larger project my group has, and my program crapped out on me because there was too much source code loaded into memory. This is a corner case at this point, but I want to be able to handle this issue in the future.
What would be the best way to avoid memory issues?
I'm thinking about loading the code, do the initial processing of the file, then serialize the results to disk, so that when I need to access them again, I do not have to go through the process of manipulating the raw code again. Does this make sense? Or is the serialization/deserialization more expensive then processing the code again?
I want to keep a reasonable level of performance while addressing this problem. Most of the time, the source code will fit into memory without issue, so is there a way to only "page" my information when I am low on memory? Is there a way to tell when my application is running low on memory?
Update:
The problem is not that a single file fills memory, its all of the files in memory at once fill memory. My current idea is to rotate off the disk drive when I process them

1.6GB is still manageable and by itself should not cause memory problems. Inefficient string operations might do it.
As you parse the source code your probably split it apart into certain substrings - tokens or whatver you call them. If your tokens combined account for entire source code, that doubles memory consumption right there. Depending on the complexity of the processing you do the mutiplier can be even bigger.
My first move here would be to have a closer look on how you use your strings and find a way to optimize it - i.e. discarding the origianl after the first pass, compress the whitespaces, or use indexes (pointers) to the original strings rather than actual substrings - there is a number of techniques which can be useful here.
If none of this would help than I would resort to swapping them to and fro the disk

If the problem is that a single copy of your code causing you to fill the memory available then there are atleast two options.
serialize to disk
compress files in memory. If you have a lot of CPU it can be faster to zip and unzip information in memory, instead of caching to disk.
You should also check if you are disposing of objects properly. Do you have memory problems due to old copies of objects being in memory?

Use WinDbg with SOS to see what is holding on the string references (or what ever is causing the extreme memory usage).

Serializing/deserializing sounds like a good strategy. I've done a fair amount of this and it is very fast. In fact I have an app that instantiates objects from a DB and then serializes them to the hard drives of my web nodes. It has been a while since I benchmarked it, but it was serializing several hundred a second and maybe over 1k back when I was load testing.
Of course it will depend on the size of your code files. My files were fairly small.

Larger File Streams using C#

There are some text files(Records) which i need to access using C#.Net. But the matter is those files are larger than 1GB. (minimum size is 1 GB)
what should I need to do?
What are the factors which I need to be concentrate on?
Can some one give me an idea to over come from this situation.
EDIT:
Thanks for the fast responses. yes they are fixed length records. These text files coming from a local company. (There last month transaction records)
Is it possible to access these files like normal text files (using normal file stream).
and
How about the memory management????

Expanding on CasperOne's answer
Simply put there is no way to reliably put a 100GB file into memory at one time. On a 32 bit machine there is simply not enough addressing space. In a 64 bit machine there is enough addressing space but during the time in which it would take to actually get the file in memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream() class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that will only progress down a stream a specific number of bytes. You will need to use these methods in order to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details or your record delimeters or some sample lines from the file?
Update
If they are fixed length records then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that requests the number of bytes to be read from the file. Since they are fixed length records this should work well for your scenario.
As long as you don't call ReadAllText() and instead use the Stream.Read() methods which take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (that is of course, unless you ask it to :) ).

You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.

What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?

Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed length strings etc) you can use the BinaryReader class. Easier than pulling out n bytes and then trying to interrogate that.
Also note, the read method on System.IO.Stream is a non blocking operation. If you ask for 100 bytes it may return less than that, but still not have reached end of file.
The BinaryReader.ReadBytes method will block until it reads the requested number of bytes, or End of file - which ever comes first.
Nice collaboration lads :)

Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.