Extract Data of a 2 GB XML File in C#

I have a 2 GB XML file containing around 2.5 million records. I am not able to load it in C#; it throws an OutOfMemoryException. Please help me resolve this in a straightforward way.

A simple, general methodology for problems like this:
As written by mjwills and TheGeneral, compile for 64-bit.
As written by Prateek, use XmlReader and don't load the whole file into memory. Don't use XDocument/XmlDocument/XmlSerializer (a minimal sketch follows below).
If the size of the output is proportional to the size of the input (for example, you are converting between formats), write the result of your reading one piece at a time. If possible, you shouldn't hold the whole output in memory at once: read an object (a node) from the source file, do your processing, write the result to a new file or a database, then discard it.
If instead the output is a summary of the input (for example, you are computing statistics over it), so its size is sub-proportional to the size of the input, then it is normally fine to keep it in memory.
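To make that concrete, here is a minimal sketch of the XmlReader approach; the element name "record" and the "id" attribute are assumptions standing in for whatever the real file contains:

using System;
using System.Xml;

static class HugeXmlScanner
{
    public static void Scan(string path)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };

        using (XmlReader reader = XmlReader.Create(path, settings))
        {
            // Move from record to record without ever materializing the whole document.
            while (reader.ReadToFollowing("record"))
            {
                using (XmlReader record = reader.ReadSubtree())
                {
                    record.Read(); // position the subtree reader on the <record> element
                    // Process one record here, then let it go; e.g. write one row to a database.
                    Console.WriteLine(record.GetAttribute("id"));
                }
            }
        }
    }
}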

Related

ReadInnerXml Out Of Memory

While parsing large data in an XML file (about 1 GB), the idea was to write the various elements to separate temporary files. Each of those files could then be processed later, while the overall content of the file could be advertised to the end user up front. The user may select only a few of the elements for further processing. This is accomplished by calling the reader's ReadInnerXml() method, which returns a string that is in turn easily handed to a StreamWriter. However, when an exception such as an out-of-memory error results, I am uncertain whether (1) I can skip past it and read the other (hopefully smaller) elements, and (2) I can avoid the string completely and somehow write the XML element to a temporary file without consuming so much physical memory for the string.
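On point (2), a hedged sketch: instead of ReadInnerXml() building the whole element as one string, XmlWriter.WriteNode can stream the current element from the reader straight into a temporary file. The class, method and parameter names here are placeholders:

using System.Xml;

static class ElementSpooler
{
    // Stream the element the reader is currently positioned on into a temporary file,
    // without ever holding its markup as one large string in memory.
    public static void SpoolElementToFile(XmlReader reader, string tempPath)
    {
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        using (XmlReader subtree = reader.ReadSubtree())
        {
            subtree.Read();                  // move onto the element itself
            writer.WriteNode(subtree, true); // copies node by node; no big intermediate string
        }
    }
}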

Sorting gigantic binary files with C#

I have a large file of roughly 400 GB of size. Generated daily by an external closed system. It is a binary file with the following format:
byte[8]byte[4]byte[n]
Where n is equal to the int32 value of byte[4].
This file has no delimiters; to read the whole file you simply repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].
The file looks like
byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF
byte[8] is a 64-bit number representing a point in time, stored as .NET Ticks. I need to sort this file but can't seem to figure out the quickest way to do so.
Presently, I load the Ticks and the byte[n] start and end positions into a struct and read to the end of the file. After this, I sort the List in memory by the Ticks property, then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write it to an external file.
At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk I/O seems to be an issue.
Server Specs:
2x 2.6 GHz Intel Xeon (Hex-Core with HT) (24-threads)
32GB RAM
500GB RAID 1+0
2TB RAID 5
I've looked all over the internet and can only find examples where a huge file is 1GB (makes me chuckle).
Does anyone have any advice?
A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whichever parts of the file it needs. So do the same thing you're doing right now, except read from memory instead of using a BinaryReader/seek/read.
You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
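A rough sketch of that memory-mapping idea, assuming the byte[8]byte[4]byte[n] layout described in the question and a 64-bit process; the struct and method names are placeholders:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;

struct IndexEntry
{
    public long Ticks;   // the byte[8] timestamp (.NET Ticks)
    public long Offset;  // where this record's byte[n] payload starts
    public int Length;   // the byte[4] payload length
}

static class TickIndexer
{
    public static List<IndexEntry> BuildIndex(string path)
    {
        long fileLength = new FileInfo(path).Length;
        var index = new List<IndexEntry>();

        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor())
        {
            long pos = 0;
            while (pos < fileLength)
            {
                long ticks = view.ReadInt64(pos);      // byte[8]
                int length = view.ReadInt32(pos + 8);  // byte[4]
                index.Add(new IndexEntry { Ticks = ticks, Offset = pos + 12, Length = length });
                pos += 12 + length;                    // skip the byte[n] payload
            }
        }

        // Sort the lightweight index by time; the payloads stay on disk until copied out.
        index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
        return index;
    }
}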
Use merge sort.
It's online and parallelizes well.
http://en.wikipedia.org/wiki/Merge_sort
If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Use async I/O and a merge sort.
And since you have 32 GB of RAM, load as much as you can into RAM, sort it there, then write it back to disk.
I would do this in several passes. On the first pass, I would create a list of ticks, then distribute them evenly into many (hundreds of?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (these much smaller files represent groups of ticks in the order that you want). Then I would sort each file separately in memory, and finally concatenate the files. A sketch of the distribution pass follows below.
It is somewhat similar to the hashsort (I think).
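A hedged sketch of that distribution pass, assuming the minimum and maximum tick values are known (for example from the first pass); all names are placeholders:

using System;
using System.IO;

static class BucketDistributor
{
    // Route each byte[8]byte[4]byte[n] record into one of bucketCount files keyed by its
    // tick value, so every bucket can later be sorted in memory on its own.
    public static void DistributeIntoBuckets(string inputPath, string bucketDir,
                                             long minTicks, long maxTicks, int bucketCount)
    {
        var writers = new BinaryWriter[bucketCount];
        for (int i = 0; i < bucketCount; i++)
            writers[i] = new BinaryWriter(File.Create(Path.Combine(bucketDir, "bucket_" + i + ".bin")));

        try
        {
            using (var reader = new BinaryReader(File.OpenRead(inputPath)))
            {
                long fileLength = reader.BaseStream.Length;
                while (reader.BaseStream.Position < fileLength)
                {
                    long ticks = reader.ReadInt64();           // byte[8]
                    int length = reader.ReadInt32();           // byte[4]
                    byte[] payload = reader.ReadBytes(length); // byte[n]

                    // Map the tick value onto a bucket index in [0, bucketCount).
                    double fraction = (double)(ticks - minTicks) / (maxTicks - minTicks + 1);
                    int bucket = Math.Min(bucketCount - 1, Math.Max(0, (int)(fraction * bucketCount)));

                    writers[bucket].Write(ticks);
                    writers[bucket].Write(length);
                    writers[bucket].Write(payload);
                }
            }
        }
        finally
        {
            foreach (var w in writers) w.Dispose();
        }
    }
}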

How to estimate memory need by XPathDocument for a specific xml file

Is there any way to estimate the memory requirement for creating an XpathDocument instance based on the file size of the xml?
XPathDocument xdoc = new XPathDocument(xmlfile);
Is there any way to programmatically stop the process of creating the XpathDocument if memory drops to a very low level?
Since it loads the entire XML into memory, it would be nice to know ahead of time whether the XML is too big. What I have found is that when I create a new XPathDocument from a big XML file, an OutOfMemoryException is never thrown, but the process slows to a crawl, only 5 MB of memory remains available, and Task Manager reports it as not responding. This happened with a 266 MB XML file when there was 584 MB of RAM. I was able to load a 150 MB file with no problems in 18.
After loading the XML, I want to run XPath queries using an XPathNavigator and an XPathNodeIterator. I am using .NET 2.0 on XP SP3.
In short, no, you cannot, unless you always have similar files from which to gather statistical data before making the estimate.
Since tag, attribute, prefix and namespace strings are interned, how efficient the storage can be depends largely on the structure of the XML file, and the ratio compared to the file on disk also depends on the encoding used.
In general, .NET stores strings as UTF-16 in memory. Therefore, even if there were no significant structural overhead (imagine an XML file with only a single root tag and lots of plain text in it), the memory used would still be double that of a UTF-8 source file (or ASCII or any other 8-bit encoding). So string encoding is the first part of the equation.
The other thing is that a data structure is built in memory to allow efficient traversal of the document. Typically, nodes are constructed and linked together with references, so each node uses up a certain amount of memory. Since most non-value data are references, the memory used here also depends heavily on the architecture (a 64-bit system uses twice as much memory for a single reference as a 32-bit system). So if you have a very complex document with little data (e.g. lots of different tags with little text or few attribute values), your memory usage will be much higher than the original document size, and it will also depend a lot on the architecture your application runs on.
If you have a file with a few very long tag and attribute names, and maybe heavy default-namespace usage, the memory used may also be much lower than the file on disk.
So for an arbitrary XML file with an unknown encoding and a reasonable amount of data and complexity, it will be very difficult to get a reliable estimate. However, if your XML files are always similar in the respects mentioned, you could gather some statistics to get a factor that puts the ratio about right for your specific platform.
However, note that looking at "free memory" in Task Manager or talking of a "very low memory level" are very vague quantifications. Virtual memory, caches, background applications and services, etc. all influence the raw memory that is effectively available. The .NET Framework therefore cannot reliably guess how much memory it should allow a single process to use and remain performant, or even when it can throw an OutOfMemoryException safely. If you get one of those exceptions, you are usually well beyond a possible recovery point for your application, and you should not try to catch and handle them.
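If your files really are always similar, one rough way to gather that statistic is to load a representative sample and compare managed-heap growth to the file size. This is only a sketch, and GC measurements are approximate by nature; the class and method names are placeholders:

using System;
using System.IO;
using System.Xml.XPath;

static class XmlMemoryProbe
{
    // Load one representative sample file and compare managed-heap growth to its size
    // on disk, to derive a rough memory-per-file-byte factor for similar files.
    public static double MeasureMemoryFactor(string sampleXmlFile)
    {
        long before = GC.GetTotalMemory(true);

        XPathDocument doc = new XPathDocument(sampleXmlFile);

        long after = GC.GetTotalMemory(true);
        long fileSize = new FileInfo(sampleXmlFile).Length;

        GC.KeepAlive(doc); // keep the document alive until after the measurement

        // e.g. a result of 3.2 means roughly 3.2 bytes of managed memory per byte of file.
        return (after - before) / (double)fileSize;
    }
}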
You can simply check the file size and back out if it exceeds a certain upper bound.
var xmlFileInfo = new FileInfo(xmlfile);
var isTooBig = xmlFileInfo.Length > maximumSize;
This will not be foolproof, because you cannot guess at what the correct maximum size will be.
Yes, sure, you can do it with the FileInfo class.
System.IO.FileInfo foo = new System.IO.FileInfo("<your file path as string>");
long Size = foo.Length;

C#: Why is the last byte of a copied file different?

I am writing a program to read and write a specific binary file format.
I believe I have it 95% working, but I am running into a strange problem.
In the screenshot I am showing a program I wrote that compares two files byte by byte. The very last byte should be 0 but is FFFFFFF.
Using a binary viewer I can see no difference in the files. They appear to be identical.
Also, Windows tells me the sizes of the files are different, but the size on disk is the same.
Can someone help me understand what is going on?
The original is on the left and my copy is on the right.
Possible answers:
You forgot to call Stream.Close() or Stream.Dispose().
Your code is mixing up text and other kinds of data (e.g. casting a -1 returned by a Read() method to a char, then writing it).
We need to see your code, though...
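For reference, a minimal sketch of a copy loop that avoids both pitfalls named above (streams left undisposed, and a -1 sentinel written as data); the paths and buffer size are placeholders:

using System.IO;

static class FileCopier
{
    public static void CopyBinaryFile(string sourcePath, string destinationPath)
    {
        byte[] buffer = new byte[81920];

        // The using blocks guarantee both streams are flushed and closed even on exceptions,
        // which is the most common cause of a truncated or padded copy.
        using (FileStream source = File.OpenRead(sourcePath))
        using (FileStream destination = File.Create(destinationPath))
        {
            int bytesRead;
            // Read() returns 0 at end of file; never treat a -1 from ReadByte() as data.
            while ((bytesRead = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                destination.Write(buffer, 0, bytesRead);
            }
        }
    }
}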
Size on disk vs Size
First of all, you should note that Size on disk is almost always different from Size, because Size on disk reflects the allocated drive storage while Size reflects the actual length of the file.
A disk drive splits its space into blocks of the same size. For example, if your drive works with 4 KB blocks, then even the smallest file containing a single byte still takes up 4 KB on the disk, because that is the minimum space it can allocate. Once you write 4 KB + 1 byte, another 4 KB block is allocated, making it 8 KB on disk. Hence Size on disk is always a multiple of 4 KB. So the fact that the source and destination files have the same Size on disk does not mean the files are the same length. (Different drives have different block sizes; it is not always 4 KB.)
The Size value is the actual defined length of the file data within the disk blocks.
Your Size Issue
As your Size values are different, it means the operating system has saved different lengths of data. Hence you have a fundamental problem with your copying routine, not just an issue with the last byte as you currently think. One of your files is 3,434 bytes and the other 2,008, which is a big difference. Your first step must be to work out why there is such a big difference.
If your hex-comparison routine simply looks at the block data, it will think the files are the same length, because it is comparing disk blocks rather than the actual file lengths.

Larger File Streams using C#

There are some text files (records) which I need to access using C#.NET. The problem is that those files are larger than 1 GB (the minimum size is 1 GB).
What do I need to do?
What factors do I need to concentrate on?
Can someone give me an idea of how to overcome this situation?
EDIT:
Thanks for the fast responses. Yes, they are fixed-length records. These text files come from a local company (they are last month's transaction records).
Is it possible to access these files like normal text files (using a normal file stream)?
And what about memory management?
Expanding on CasperOne's answer
Simply put, there is no way to reliably put a 100 GB file into memory at one time. On a 32-bit machine there is simply not enough addressing space. On a 64-bit machine there is enough addressing space, but during the time it would take to actually get the file into memory, your user will have killed your process out of frustration.
The trick is to process the file incrementally. The base System.IO.Stream class is designed to process a variable (and possibly infinite) stream in distinct quantities. It has several Read methods that only progress down the stream a specific number of bytes. You will need to use these methods to divide up the stream.
I can't give more information because your scenario is not specific enough. Can you give us more details, your record delimiters, or some sample lines from the file?
Update
If they are fixed-length records, then System.IO.Stream will work just fine. You can even use File.Open() to get access to the underlying Stream object. Stream.Read has an overload that takes the number of bytes to read from the file. Since they are fixed-length records, this should work well for your scenario (see the sketch below).
As long as you don't call ReadAllText() and instead use the Stream.Read() methods that take explicit byte arrays, memory won't be an issue. The underlying Stream class will take care not to put the entire file into memory (unless, of course, you ask it to :)).
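A hedged sketch of that pattern, assuming fixed-length ASCII records; the record length, the encoding and the names are placeholders:

using System.IO;
using System.Text;

static class RecordScanner
{
    // Walk a file of fixed-length text records one record at a time.
    public static void ProcessFixedLengthRecords(string path, int recordLength)
    {
        byte[] record = new byte[recordLength];

        using (FileStream stream = File.OpenRead(path))
        {
            while (true)
            {
                // Stream.Read may return fewer bytes than requested, so fill the buffer in a loop.
                int filled = 0;
                while (filled < recordLength)
                {
                    int n = stream.Read(record, filled, recordLength - filled);
                    if (n == 0) break; // end of file
                    filled += n;
                }
                if (filled < recordLength) break; // no complete record left

                string line = Encoding.ASCII.GetString(record);
                // ... process this one record, then let it go; nothing else stays in memory ...
            }
        }
    }
}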
You aren't specifically listing the problems you need to overcome. A file can be 100GB and you can have no problems processing it.
If you have to process the file as a whole then that is going to require some creative coding, but if you can simply process sections of the file at a time, then it is relatively easy to move to the location in the file you need to start from, process the data you need to process in chunks, and then close the file.
More information here would certainly be helpful.
What are the main problems you are having at the moment? The big thing to remember is to think in terms of streams - i.e. keep the minimum amount of data in memory that you can. LINQ is excellent at working with sequences (although there are some buffering operations you need to avoid, such as OrderBy).
For example, here's a way of handling simple records from a large file efficiently (note the iterator block).
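The code example referred to here is not reproduced above; the following is only a sketch of the kind of iterator block meant, assuming line-based records (names and the usage path are placeholders):

using System.Collections.Generic;
using System.IO;

static class RecordReader
{
    // Iterator block: lines are yielded one at a time, so only the record currently
    // being processed has to live in memory.
    public static IEnumerable<string> ReadRecords(string path)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}

// Usage: foreach (string record in RecordReader.ReadRecords("transactions.txt")) { ... }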
For performing multiple aggregates/analysis over large data from files, consider Push LINQ in MiscUtil.
Can you add more context to the problems you are thinking of?
Expanding on JaredPar's answer.
If the file is a binary file (i.e. ints stored as 4 bytes, fixed-length strings, etc.) you can use the BinaryReader class. That is easier than pulling out n bytes and then trying to interrogate them.
Also note that the Read method on System.IO.Stream does not guarantee to fill the buffer: if you ask for 100 bytes it may return fewer than that, even without having reached the end of the file.
The BinaryReader.ReadBytes method keeps reading until it has the requested number of bytes or reaches the end of the file, whichever comes first.
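A minimal sketch of the difference; the file name and field layout (a 4-byte id followed by a fixed 60-byte field) are assumptions purely for illustration:

using System.IO;

static class TypedRecordReader
{
    // BinaryReader.ReadBytes keeps reading until it has the requested count or hits end of
    // file, so no manual fill loop is needed for fixed-size fields.
    public static void ReadTypedRecords(string path)
    {
        using (BinaryReader reader = new BinaryReader(File.OpenRead(path)))
        {
            while (reader.BaseStream.Position < reader.BaseStream.Length)
            {
                int id = reader.ReadInt32();        // a 4-byte integer field
                byte[] name = reader.ReadBytes(60); // a fixed 60-byte string field
                // ... interpret the fields here ...
            }
        }
    }
}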
Nice collaboration lads :)
Hey Guys, I realize that this post hasn't been touched in a while, but I just wanted to post a site that has the solution to your problem.
http://thedeveloperpage.wordpress.com/c-articles/using-file-streams-to-write-any-size-file-introduction/
Hope it helps!
-CJ
