While parsing a large XML file (about 1 GB), the idea is to write the various elements to separate temporary files. Each of those files can then be processed later, while the overall content of the file is initially advertised to the end user, who may select only a few of the elements for further processing. This is accomplished by calling the reader's ReadInnerXml() method, which returns a string that is then handed to a StreamWriter. However, when an exception such as an out-of-memory error occurs, I am uncertain whether (1) I can skip past the failing element and read the other (hopefully smaller) elements, and (2) whether I can avoid the string entirely and somehow write the XML element to a temporary file without consuming so much physical memory.
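One way to address (2), assuming the temporary files only need to contain each element's markup, is to hand the element's subtree reader straight to an XmlWriter so the content is copied node by node instead of being materialized as one string. A minimal sketch (the input file name and the naming scheme for the temporary files are assumptions):

using System.Xml;

// Copy each child element of the root to its own temporary file without
// ever holding the element's inner XML as a single string.
int index = 0;
using (XmlReader reader = XmlReader.Create("huge.xml"))          // hypothetical input file
{
    reader.MoveToContent();                                       // position on the root element
    while (reader.Read())
    {
        if (reader.NodeType != XmlNodeType.Element)
            continue;

        string tempPath = "element_" + (index++) + ".xml";        // hypothetical naming scheme
        using (XmlReader subtree = reader.ReadSubtree())
        using (XmlWriter writer = XmlWriter.Create(tempPath))
        {
            subtree.MoveToContent();
            writer.WriteNode(subtree, true);                      // streams the subtree node by node
        }
        // When the subtree reader is closed, 'reader' is left on the element's
        // end tag, so the loop continues with the next sibling element.
    }
}

Since WriteNode pulls from the reader and pushes to the writer incrementally, a single oversized element no longer has to be held in memory as a string.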
Related
I have a 2 GB XML file containing around 2.5 million records. I am not able to load it in C#; it throws an out-of-memory exception. Please help me resolve this in a simple way.
Simple and general methodology when you have these problems:
As written by mjwills and TheGeneral, compile for 64-bit.
As written by Prateek, use XmlReader. Don't load the whole file into memory; don't use XDocument/XmlDocument/XmlSerializer.
If the size of the output is proportional to the size of the input (you are converting between formats, for example), write the result of your reading one piece at a time. If possible, you shouldn't hold the whole output in memory at once: read an object (a node) from the source file, do your processing, write the result to a new file or to a database, and discard it before moving on (see the sketch below).
If the output is instead a summary of the input (for example, you are computing statistics over it), so that its size is sub-proportional to the size of the input, then it is normally fine to keep it in memory.
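A minimal sketch of that read-one-piece, write-one-piece pattern with XmlReader; the file names and the "record"/"id" names are assumptions for illustration:

using System;
using System.IO;
using System.Xml;

// Convert the 2 GB XML file to CSV one <record> at a time. Memory use stays
// flat because nothing is kept after a record's output line has been written.
long count = 0;
using (XmlReader reader = XmlReader.Create("records.xml"))
using (StreamWriter output = new StreamWriter("records.csv"))
{
    while (reader.ReadToFollowing("record"))
    {
        output.WriteLine(reader.GetAttribute("id"));   // one output line per input record
        count++;                                       // the only state that accumulates
    }
}
Console.WriteLine(count + " records converted.");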
I have two file of about 50GB each: an input and an output file.
I am using Memory Mapped File to manage these two files.
The input file contains 3 million Web pages, and after deciding on a permutation π of them, I have to write the Web pages into the output file in the new order.
So, I can choose to read the input file sequentially and write the Web pages to different locations of the output file according to the permutation π.
Or I can do the opposite: read the input file randomly according to the permutation π and write sequentially into the output file.
Which option is faster? Why?
TL;DR: Due to caching, all file-append operations are sequential. Even writes to the middle of files will be elevator sorted and performed at block size, etc.
Random writing tends to be faster than random reading for several reasons:
When a file grows, the filesystem can choose where to put the new block.
Writes don't have to be performed immediately; the write buffer can ensure that an entire block is written at once, so data doesn't have to be appended to an existing block that already has a fixed location.
Your processing can't take place until reads complete, and reads rely on predictive caching: the OS is good at pre-caching sequential reads and poor at random reads. If your reads are smaller than a block, things are even worse: the actual amount of data read from the disk will be greater than the size of the file.
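A sketch of the suggested direction (sequential read, permuted write) using MemoryMappedFile. The fixed record size and the precomputed permutation array pi are assumptions, since Web pages are normally variable-length:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class PermutationCopy
{
    const int RecordSize = 4096;   // assumed fixed page size, for illustration only

    // pi[i] is the output slot of input record i.
    static void Permute(string inputPath, string outputPath, long[] pi)
    {
        long length = new FileInfo(inputPath).Length;

        using (var input = MemoryMappedFile.CreateFromFile(
                   inputPath, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var output = MemoryMappedFile.CreateFromFile(
                   outputPath, FileMode.Create, null, length))
        using (var reader = input.CreateViewStream(0, length, MemoryMappedFileAccess.Read))
        using (var writer = output.CreateViewAccessor())
        {
            byte[] buffer = new byte[RecordSize];
            for (long i = 0; i < pi.Length; i++)
            {
                reader.Read(buffer, 0, RecordSize);                           // sequential read
                writer.WriteArray(pi[i] * RecordSize, buffer, 0, RecordSize); // random write
            }
        }
    }
}

The write side benefits from the write-back caching described above, while the read side stays sequential and prefetch-friendly.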
I have hundreds of thousands of small text files, between 0 and 8 KB each, on a LAN network share. I can use some interop calls with kernel32.dll and FindFileEx to recursively pull a list of the fully qualified UNC path of each file and store the paths in memory in a collection class such as List<string>. Using this approach I was able to populate the List<string> fairly quickly (about 30 seconds per 50k file names, compared to 3 minutes with Directory.GetFiles).
However, once I've crawled the directories and stored the file paths in the List<string>, I want to make a pass over every path in the list, read the contents of each small text file, and perform some action based on the values read in.
As a test bed I iterated over each file path in a List<string> that stored 42,945 file paths to this LAN network share and performed the following lines on each FileFullPath:
StreamReader file = new StreamReader(FileFullPath);
file.ReadToEnd();
file.Close();
So with just these lines, it takes 13-15 minutes of runtime for all 42,945 file paths stored in my list.
Is there a more optimal way to load in many small text files via C#? Is there some interop I should consider? Or is this pretty much the best I can expect? It just seems like an awfully long time.
I would consider using Directory.EnumerateFiles, and then processing your files as you read them.
This would prevent the need to actually store the list of 42,945 files at once, as well as open up the potential of doing some of the processing in parallel using PLINQ (depending on the processing requirements of the files).
If the processing has a reasonably large CPU portion of the total time (and it's not purely I/O bound), this could potentially provide a large benefit in terms of complete time required.
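A sketch of that suggestion, with the share path, file pattern and per-file "action" as placeholders:

using System;
using System.IO;
using System.Linq;

// Enumerate the share lazily and process files in parallel; nothing forces
// all 42,945 paths to be materialized in a List<string> first.
var matches = Directory
    .EnumerateFiles(@"\\server\share", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .WithDegreeOfParallelism(8)                     // tune for the network/CPU mix
    .Select(path => new
    {
        Path = path,
        Contents = File.ReadAllText(path)           // each file is only 0-8 KB
    })
    .Where(f => f.Contents.Contains("ERROR"))       // placeholder for the real action
    .ToList();

Console.WriteLine(matches.Count + " files matched.");

How much the parallelism helps will depend on how the total time splits between waiting on the network share and processing the contents.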
I have an HDD (say 1 TB) with FAT32 and NTFS partitions, and I don't have information on which files are stored on it, but when needed I want to quickly find large files, say over 500 MB. I don't want to scan the whole HDD, since that is very time consuming; I need quick results. I was wondering if there are any NTFS/FAT32 APIs that I can call directly; I mean, if they keep some metadata about the stored files, that would be quicker. I want to write my program in C++ and C#.
EDIT
If scanning the HDD is the only option, what can I do to ensure the best performance? For example, I could skip scanning system folders, since I am only interested in user data.
If you're willing to do a lot of extra work yourself to speed things up, you might be able to accomplish something. A lot is going to depend on what you need.
Let's start with FAT32. FAT (in general, not just the 32-bit variant) is named for the File Allocation Table. This is a block of data toward the beginning of the partition that records which clusters in the partition belong to which files. The FAT is basically organized as linked lists of clusters. If you just want to find the data areas of the large files, you can read the FAT in as a number of raw sectors and scan through that data for linked lists of more than X clusters (where X defines the lower limit of what you consider a large file). You can then access those clusters and see the actual data associated with each file. Oddly, what you won't know is the name of that file. The file names are contained in directories, which are basically like files, except that what they contain are fixed-size records of a specified format. You have to start from the root directory and read through the directory tree to find the file names.
NTFS is both simpler and more complex. NTFS has a Master File Table (MFT) that contains records for all the files in a partition. The good point is that you can read the MFT and get information about every file on the disk without chasing through the directory tree to get it. The bad point is that decoding the contents of an NTFS partition is definitely non-trivial. Reading data (meaningfully) is quite difficult, and writing data is much more difficult still. Also, recent versions of Windows have added more restrictions on raw reading from disk partitions, so depending on which partition you're after, you may not be able to access the data you need at all.
None of this, however, is anything that's more than minimally supported. To do it, you open a file named "\\.\D:" (where D is the letter of the drive you care about). You can then read raw sectors from that drive (assuming that opening it worked). This lets you see the raw data for the entire disk (or partition, as the case may be), starting from the boot sector and going through everything else that's there (FAT, root directory, subdirectories, etc., all as sectors of raw data). The system will let you read the raw data, but all the work of making any sense of that data is 100% your responsibility. If the speed you've asked about is an absolute necessity, this may be a possibility, but it'll take a fair amount of work for FAT volumes and considerably more for NTFS. Unless you really need the extra speed you've mentioned, it's probably not even worth considering.
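For completeness, a hedged sketch of just the "open the raw volume and read sectors" step from C#; it needs administrator rights, reads must be in whole sectors, and everything beyond getting raw bytes (actually decoding the FAT or MFT) remains entirely up to you:

using System;
using System.ComponentModel;
using System.IO;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

class RawVolumeReader
{
    const uint GENERIC_READ = 0x80000000;
    const uint FILE_SHARE_READ = 0x1, FILE_SHARE_WRITE = 0x2;
    const uint OPEN_EXISTING = 3;

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern SafeFileHandle CreateFile(
        string lpFileName, uint dwDesiredAccess, uint dwShareMode,
        IntPtr lpSecurityAttributes, uint dwCreationDisposition,
        uint dwFlagsAndAttributes, IntPtr hTemplateFile);

    static void Main()
    {
        // "D" is a placeholder drive letter; run elevated.
        SafeFileHandle handle = CreateFile(@"\\.\D:", GENERIC_READ,
            FILE_SHARE_READ | FILE_SHARE_WRITE, IntPtr.Zero,
            OPEN_EXISTING, 0, IntPtr.Zero);
        if (handle.IsInvalid)
            throw new Win32Exception(Marshal.GetLastWin32Error());

        using (handle)
        using (var volume = new FileStream(handle, FileAccess.Read))
        {
            byte[] bootSector = new byte[512];            // one sector of raw data
            volume.Read(bootSector, 0, bootSector.Length);
            // From here on, interpreting the bytes (BPB, FAT, MFT...) is your job.
            Console.WriteLine("OEM ID: " + System.Text.Encoding.ASCII.GetString(bootSector, 3, 8));
        }
    }
}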
If you're willing to target Vista and beyond, you can use the search indexer APIs.
If you look here, you can find information about the search indexer. The search indexer does index file size, so it may do what you want.
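If the index is available, one way to query it from C# is through the Windows Search OLE DB provider; a sketch (the 500 MB threshold comes from the question, the rest is an assumption about your setup):

using System;
using System.Data.OleDb;

// Ask the Windows Search index for files larger than roughly 500 MB.
string connectionString =
    "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';";
string query =
    "SELECT System.ItemPathDisplay, System.Size " +
    "FROM SystemIndex WHERE System.Size > 524288000";

using (var connection = new OleDbConnection(connectionString))
using (var command = new OleDbCommand(query, connection))
{
    connection.Open();
    using (OleDbDataReader row = command.ExecuteReader())
    {
        while (row.Read())
            Console.WriteLine(row[0] + " (" + row[1] + " bytes)");
    }
}

This only sees locations the indexer has been configured to crawl, so it complements rather than replaces a raw scan.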
Not possible. Neither filesystem keeps a list of big files that you could query directly. You'd have to recursively look at every folder and check the size of every file to find whatever you consider big.
Your only prayer is to latch onto a file indexer; otherwise you will have to iterate through all the files. Depending on your computer, you might be able to latch onto the native Microsoft indexer (searchindexer.exe), or if you have Google Desktop Search you may be able to latch onto that.
Possible way to latch onto Microsoft's indexer
Is there any way to estimate the memory requirement for creating an XPathDocument instance based on the file size of the XML?
XPathDocument xdoc = new XPathDocument(xmlfile);
Is there any way to programmatically stop the process of creating the XPathDocument if memory drops to a very low level?
Since it loads the entire XML into memory, it would be nice to know ahead of time whether the XML is too big. What I have found is that when I create a new XPathDocument from a big XML file, an OutOfMemoryException is never thrown; instead the process slows to a crawl, only 5 MB of memory remains available, and Task Manager reports it as not responding. This happened with a 266 MB XML file when there was 584 MB of RAM. I was able to load a 150 MB file with no problems in 18.
After loading the XML, I want to run XPath queries using an XPathNavigator and an XPathNodeIterator. I am using .NET 2.0 on XP SP3.
In short, no, you cannot, unless you always have similar files from which to gather statistical data before starting the estimation.
Since tag, attribute, prefix and namespace strings are interned, how efficient the storage can be depends largely on the structure of the XML file, and the ratio compared to the file on disk also depends on the encoding used.
In general, .NET stores any string as UTF-16 in memory. Therefore, even if there were no significant structural overhead (imagine an XML file with only a single root tag and lots of plain text in it), the memory used would still be double that of a UTF-8 (or ASCII or any other 8-bit encoded) source file. So string encoding is the first part of the equation.
The other thing is that a data structure is built in memory to allow efficient traversal of the document. Typically, nodes are constructed and linked together with references, so each node uses up a certain amount of memory; since most non-value data are references, the memory used here also depends heavily on the architecture (a 64-bit system uses twice as much memory for a single reference as a 32-bit system). So if you have a very complex document with little data (e.g. a large number of different tags with little text or few attribute values), your memory usage will be much higher than the original document size, and it will also depend a lot on the architecture your application runs on.
If you have a file with a few very long tag and attribute names and perhaps heavy default-namespace usage, the memory used may also be much lower than the file on disk.
So, assuming an arbitrary XML file with an unknown encoding and a reasonable amount of data and complexity, it will be very difficult to get a reliable estimate. However, if your XML files are always similar in the respects mentioned, you could gather some statistics to get a factor that puts the ratio about right for your specific platform (see the sketch below).
However, note that looking at "free memory" in Task Manager or talking about a "very low memory level" are very vague quantifications. Virtual memory, caches, background applications and services, etc. will influence the effective amount of memory available. The .NET Framework therefore cannot reliably guess how much memory it should allow a single process to use and still remain performant, or even safely predict when an OutOfMemoryException will be thrown. If you get one of those exceptions, you are usually well beyond a possible recovery point for your application, and you should not try to catch and handle them.
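If you do take the statistical route described above, the check itself is trivial; the factor is the part you have to measure on your own representative files. A sketch with placeholder numbers:

using System;
using System.IO;

// Both constants are made-up placeholders: measure the ratio (peak managed
// memory after loading / file size on disk) on files typical for your data,
// and pick a budget appropriate for your machines.
const double EmpiricalMemoryFactor = 3.5;
const long SafeMemoryBudget = 400L * 1024 * 1024;   // e.g. 400 MB

long fileSize = new FileInfo("input.xml").Length;
long estimatedMemory = (long)(fileSize * EmpiricalMemoryFactor);

if (estimatedMemory > SafeMemoryBudget)
    Console.WriteLine("Probably too large to load as an XPathDocument on this machine.");
else
    Console.WriteLine("Probably safe to load.");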
You can simply check the file size and back out if it exceeds a certain upper bound.
var xmlFileInfo = new FileInfo(xmlfile);
var isTooBig = xmlFileInfo.Length > maximumSize;
This will not be foolproof, because you cannot guess at what the correct maximum size will be.
Yes, sure, you can do it with the FileInfo class.
System.IO.FileInfo foo = new System.IO.FileInfo("<your file path as string>");
long Size = foo.Length;