I'm attempting to read huge CSV files (50M+ rows, ~30 columns, multiple gigabyte files).
This will be run on business desktop-spec machines, so loading the file into memory isn't going to cut it. Streaming rows as they're parsed seems to be the sanest option.
To make things slightly more interesting, I only need 2 of the columns in the file, but the ordering of fields is not guaranteed and has to be derived from column headings.
As such, an iterator that returns array-per-row or similar would be excellent.
I can't just split on line breaks, as some of the field values may span multiple lines. I'd prefer to avoid manually checking which fields are quoted, unescaping as appropriate, etc...
Is there anything in the framework that will do this for me? If not, can someone give me some hints on how best to approach this?
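For reference, one thing that is in the framework and copes with quoted fields (including, as far as I know, embedded line breaks) is Microsoft.VisualBasic.FileIO.TextFieldParser, usable from C# by referencing Microsoft.VisualBasic.dll. Below is a rough sketch of header-driven, streaming extraction of two columns; the column names in the usage comment are made up, and it's worth verifying the multi-line behaviour against your own data before relying on it:

using System;
using System.Collections.Generic;
using Microsoft.VisualBasic.FileIO;   // add a reference to Microsoft.VisualBasic.dll

static IEnumerable<string[]> ReadTwoColumns(string path, string colA, string colB)
{
    using (var parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true;

        // Derive the two column indexes from the header row.
        string[] header = parser.ReadFields();
        int idxA = Array.IndexOf(header, colA);
        int idxB = Array.IndexOf(header, colB);

        while (!parser.EndOfData)
        {
            string[] fields = parser.ReadFields();
            yield return new[] { fields[idxA], fields[idxB] };
        }
    }
}

// Usage (hypothetical column names):
// foreach (var pair in ReadTwoColumns("huge.csv", "CustomerId", "Amount")) { ... }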
You can try Cinchoo ETL, an open source library for reading and writing CSV files:
using (var reader = new ChoCSVReader("test.csv").WithFirstLineHeader()
    .WithField("Field1")
    .WithField("Field2")
    )
{
    foreach (dynamic item in reader)
    {
        Console.WriteLine(item.Field1);
        Console.WriteLine(item.Field2);
    }
}
Please check out the articles on CodeProject on how to use it.
Hope it helps.
Disclaimer: I'm the author of this library
Quick question; the following line should be pretty self-explanatory:
doc.Descendants("DOB").Select(dob => dob.ToString()).All(dob => DateTime.Parse(dob.ToString()) != DateTime.Parse(processing.DateOfBirth))
But just in case: I want to return false if any value of the DOB node is the same datetime as processing.DateOfBirth, because I'll need to add the date of birth to the XML if it's not in there.
My two questions are:
Is this the shortest amount of code to accomplish this with LINQ to XML? (I think it's not.)
and
This will be run against several million records; is there a more efficient way to accomplish this?
EDIT
I miscommunicated, sorry. The XML is small. There are millions of rows in a database representing a single person, with a column PersonXml that just has name, dob, number, and a few other things. The rows are read in through a SqlDataReader and validated/updated, this being part of that.
1) Not sure here; at least I can't think of a shorter way to write it.
2) If the format is always the same, you could consider working on the data directly. Either work on the string or on a stream directly. Parsing the string into an XDocument will always take its share of time. You mentioned the XML is very small and might not change.
For a project where I am writing an XML file that is way too large to keep in memory, I have written a class (I actually found parts of that code on SO) that works on a FileStream and looks for matching patterns byte by byte. In your case this might be bad style and a bit inconvenient to work with, but if speed matters, this will beat XDocument any time.
Edit: Reading your edits, I think speed is not really something you are too concerned about, and this is a task you might have to do only once to correct some old data? In that case I'd suggest you just stick with your solution; maybe use a TaskFactory to spawn a new task for each row you receive from the DB and let it run overnight. In my mind that's the easiest and safest solution.
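For what it's worth, here is a rough sketch of that per-row check-and-add, assuming a hypothetical People table with Id and PersonXml columns; the write-back and the source of the expected date of birth are left as comments:

using System;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

static void FixMissingDob(string connectionString)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var cmd = new SqlCommand("SELECT Id, PersonXml FROM People", conn))
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                int id = reader.GetInt32(0);
                XElement person = XElement.Parse(reader.GetString(1));

                // Hypothetical: the expected date of birth would come from the
                // processing object for this row.
                DateTime dateOfBirth = DateTime.Parse("1980-01-01");

                // Check whether any DOB element already matches.
                bool hasDob = person.Descendants("DOB")
                                    .Any(d => DateTime.Parse(d.Value) == dateOfBirth);

                if (!hasDob)
                {
                    person.Add(new XElement("DOB", dateOfBirth.ToString("yyyy-MM-dd")));
                    // ... write person.ToString() back with a separate UPDATE command
                    //     (or a second connection), keyed on id ...
                }
            }
        }
    }
}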
I'm kind of stuck having to use .NET 2.0, so LINQ to XML isn't available, although I would be interested in how it would compare...
I had to write an internal program to download, extract, and compare some large XML files (about 10 MB each) that are essentially build configurations. I first attempted using libraries such as Microsoft's XML diff/patch, but comparing the files was taking 2-3 minutes, even when ignoring whitespace, namespaces, etc. (I tested each ignore option one at a time to try to figure out what was speediest). Then I tried to implement my own ideas - lists of nodes from XmlDocument objects, dictionaries of keys of the root's direct descendants (45,000 children, by the way) pointing to ints that indicate the node's position in the XML document... all took at least 2 minutes to run.
My final implementation finishes in 1-2 seconds - I made a system process call to diff with a few lines of context and saved those results to display (our development machines include cygwin, thank goodness).
I can't help but think there is a better, XML-specific way to do this that would be just as fast as a plain text diff - especially since all I'm really interested in is the Name element that is a child of each direct descendant, and I could throw away 4/5 of the file for my purposes (we only need to know which files were included, not anything else involving language or version).
So, as popular as XML is, I'm sure somebody out there has had to do something similar. What is a fast, efficient way to compare these large XML files? (preferably open source or free)
Edit: a sample of the nodes - I only need to find missing Name elements (there are over 45k of these nodes):
<file>
    <name>SomeFile</name>
    <version>10.234</version>
    <countries>CA,US</countries>
    <languages>EN</languages>
    <types>blah blah</types>
    <internal>N</internal>
</file>
XmlDocument source = new XmlDocument();
source.Load("source.xml");

Dictionary<string, XmlNode> files = new Dictionary<string, XmlNode>();
foreach (XmlNode file in source.SelectNodes("//file"))
    files.Add(file.SelectSingleNode("./name").InnerText, file);

XmlDocument source2 = new XmlDocument();
source2.Load("source2.xml");

XmlNode value;
foreach (XmlNode file in source2.SelectNodes("//file"))
{
    if (files.TryGetValue(file.SelectSingleNode("./name").InnerText, out value))
    {
        // This file is both in source and source2.
    }
    else
    {
        // This file is only in source2.
    }
}
I am not sure exactly what you want; I hope this example will help you in your quest.
Diffing XML can be done many ways. You're not being very specific regarding the details, though. What does come across is that the files are large and you only need a fraction of the information.
Well, then the algorithm is as follows:
Normalize and reduce the documents to the information that matters.
Save the results.
Compare the results.
And the implementation:
Use the XmlReader API, which is efficient, to produce plain text representations of your information (a sketch follows after this list). Why a plain text representation? Because diff tools are predicated on the assumption that they operate on plain text. And so are our eyeballs. Why XmlReader? You could use SAX, which is memory-efficient, but XmlReader is more efficient. As for the precise spec of that plain text file... you're just not including enough information.
Save the plain text files to some temp directory.
Use a command-line diff utility like GnuWin32 diff to get some diff output. Yeah, I know, not pure and proper, but works out of the box and there's no coding to be done. If you are familiar with some C# diff API (I am not), well, then use that API instead, of course.
Delete the temp files. (Or optionally keep them if you're going to reuse them.)
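A minimal sketch of the extraction step, assuming the <file>/<name> layout from the sample above and made-up file names; the command-line diff then runs over the two generated text files:

using System.IO;
using System.Xml;

static void ExtractNames(string xmlPath, string txtPath)
{
    // Stream through the document and write one <name> value per line.
    var settings = new XmlReaderSettings { IgnoreWhitespace = true };
    using (var reader = XmlReader.Create(xmlPath, settings))
    using (var writer = new StreamWriter(txtPath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "name")
            {
                // ReadElementContentAsString also advances the reader past </name>.
                writer.WriteLine(reader.ReadElementContentAsString());
            }
            else
            {
                reader.Read();
            }
        }
    }
}

// ExtractNames("build1.xml", "build1.txt");
// ExtractNames("build2.xml", "build2.txt");
// then: diff -u build1.txt build2.txt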
I have an application where an XLS file with lots of data entered by the user is opened and the data in it is converted to XML. I have already mapped the columns in the XLS file to XML Maps. When I try to use the ExportXml method in XMLMaps, I get a string with the proper XML representation of the XLS file. I parse this string a bit and upload it to my server.
The problem is, when my XLS file is really large, the string produced for the XML is over 2 GB and I get an OutOfMemory exception. I understand that the limit for CLR objects is 2 GB. But in my case I need to handle this scenario. Presently I just display a message asking the user to send less data.
Any ideas on how I can do this?
EDIT:
This is just a gist of the operations I need to perform on the generated XML:
Remove certain fields which are not needed for the server data.
Add something like ID numbers for each row of data.
Modify the values of certain elements.
Do validation on the data.
While the XmlReader stream is a good idea, I cannot perform these operations that way. While data validation can be done by Excel itself, the other things cannot be done there.
Using XmlTextReader and XmlTextWriter and creating a custom method for each of the steps is a solution I had thought of. But to cover the list above, it requires the XML document to be processed 4 times. This is just not efficient.
If the XML is that large, then you might be able to use Export to a temporary file, rather than using ExportXML to a string - http://msdn.microsoft.com/en-us/library/microsoft.office.interop.excel.xmlmap.export.aspx
If you then need to parse/handle the XML in C#, then for handling such large XML structures you'll probably be better off implementing a custom XmlReader (or XmlWriter) that works at the stream level. See this question for some similar advice: What is the best way to parse large XML (size of 1GB) in C#?
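To avoid the four separate passes mentioned in the question, the four operations can in principle be folded into one streaming pass over the XML. A rough sketch, with made-up element names (Row, UnwantedField) since the real schema isn't shown:

using System.Xml;

static void TransformInOnePass(string inputPath, string outputPath)
{
    int rowId = 0;
    using (var reader = XmlReader.Create(inputPath))
    using (var writer = XmlWriter.Create(outputPath))
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            // 1) Remove fields the server doesn't need (hypothetical element name).
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "UnwantedField")
            {
                reader.Skip();   // jumps past the whole subtree
                continue;        // re-check the new current node without an extra Read()
            }

            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    writer.WriteStartElement(reader.Name);

                    // Copy existing attributes.
                    while (reader.MoveToNextAttribute())
                        writer.WriteAttributeString(reader.Name, reader.Value);
                    reader.MoveToElement();

                    // 2) Add an ID to each row element (hypothetical element name).
                    if (reader.Name == "Row")
                        writer.WriteAttributeString("Id", (++rowId).ToString());

                    if (reader.IsEmptyElement)
                        writer.WriteEndElement();
                    break;

                case XmlNodeType.Text:
                    // 3) Modify and 4) validate element values here before writing them.
                    writer.WriteString(reader.Value);
                    break;

                case XmlNodeType.EndElement:
                    writer.WriteEndElement();
                    break;
            }
            reader.Read();
        }
    }
}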
I guess there is no other way than using a 64-bit OS and framework if you really need to hold the whole thing in RAM, but using some other way to process the data, like Stuart suggested, is probably the better way to go...
What you need to do is use "stream chaining", i.e. you open up an input stream that reads from your Excel file and an output stream that writes to your XML file. Then your conversion class/method takes the two streams as input and reads just enough data from the input stream to be able to write to the output.
Edit: a very simple, minimal example.
Converting from file:
123
1244125
345345345
4566
11
to
<List>
<ListItem>123</ListItem>
<ListItem>1244125</ListItem>
...
</List>
using
void Convert(Stream fromStream, Stream toStream)
{
    using (StreamReader from = new StreamReader(fromStream))
    using (StreamWriter to = new StreamWriter(toStream))
    {
        to.WriteLine("<List>");
        while (!from.EndOfStream)
        {
            string bulk = from.ReadLine(); // in this case, a single line is sufficient
            // some code to parse the bulk or clean it up, e.g. remove '\r\n'
            to.WriteLine(string.Format("<ListItem>{0}</ListItem>", bulk));
        }
        to.WriteLine("</List>");
    }
}
Convert(File.OpenRead("source.xls"), File.OpenWrite("source.xml"));
Of course you could do this in a much more elegant, abstract manner, but this is only to show my point.
I've searched a lot, but I couldn't find a proper solution for my problem. I wrote an XML file containing all episode information for a TV show. It's 38 KB and contains attributes and strings for about 680 variables. At first I simply read it with the help of XmlTextReader, which worked fine on my quad core. But my wife's five-year-old laptop took about 30 seconds to read it. So I thought about multithreading, but I get an exception because the file is already opened.
The thread start looks like this:
while (reader.Read())
{
    ...
    else if (reader.NodeType == XmlNodeType.Element)
    {
        if (reader.Name.Equals("Season1"))
        {
            current.seasonNr = 0;
            current.currentSeason = season[0];
            current.reader = reader;
            seasonThread[0].Start(current);
        }
        else if (reader.Name.Equals("Season2"))
        {
            current.seasonNr = 1;
            current.currentSeason = season[1];
            current.reader = reader;
            seasonThread[1].Start(current);
        }
And the parsing method looks like this:
reader.Read();
for (episodeNr = 0; episodeNr < tmp.currentSeason.episode.Length; episodeNr++)
{
    reader.MoveToFirstAttribute();
    tmp.currentSeason.episode[episodeNr].id = reader.ReadContentAsInt();
    ...
}
But it doesn't work...
I pass the reader because I want the 'cursor' to be in the right position. But I also have no clue if this could work at all.
Please help!
EDIT:
Guys, where did I write about IE? The program I wrote parses the file. I run it on my PC and on the laptop. No IE at all.
EDIT2:
I did some Stopwatch measurements and figured out that parsing the XML file only takes about 200 ms on my PC and 800 ms on my wife's laptop. Is it WPF being so slow? What can I do?
I agree with most everyone's comments. Reading a 38 KB file should not take so long. Do you have something else running on the machine, antivirus, etc., that could be interfering with the processing?
The amount of time it would take you to create a thread will be far greater than the amount of time spent reading the file. If you could post the actual code used to read the file and the file itself, it might help analyze performance bottlenecks.
I think you can't parse XML in multiple threads, at least not in a way that would bring performance benefits, because to read from some point in the file, you need to know everything that comes before it, if nothing else, to know at what level you are.
Your code, if it worked, would do something like this:
main       season1    season2
read
read
skip       read
skip       read
read
skip                  read
skip                  read
Note that to do “skip”, you need to fully parse the XML, which means you're doing the same amount of work as before on the main thread. The only difference is that you're doing some additional work on the background threads.
Regarding the slowness, just parsing such a small XML file should be very fast. If it's slow, you're most likely doing something else that is slow, or you're parsing the file multiple times.
If I am understanding how your .xml file is being used, you have essentially created an .xml database.
If that's correct, I would recommend breaking your XML into separate .xml files, with an indexed .xml document. You could then query - using LINQ to XML - a set of XML data from a specific .xml source.
Of course, this means you will still need to load an .xml file; however, you will be loading significantly smaller files and you would be able to, although highly discouraged, asynchronously load .xml document objects.
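To make that concrete, here is a rough sketch of the index-plus-split layout, with entirely made-up file and element names (index.xml, season1.xml, a Season element with number and file attributes):

using System.Linq;
using System.Xml.Linq;

// Hypothetical index.xml:
// <Index>
//   <Season number="1" file="season1.xml" />
//   <Season number="2" file="season2.xml" />
// </Index>
static XDocument LoadSeason(int seasonNumber)
{
    XDocument index = XDocument.Load("index.xml");
    string file = index.Descendants("Season")
                       .Where(s => (int)s.Attribute("number") == seasonNumber)
                       .Select(s => (string)s.Attribute("file"))
                       .First();

    // Only the (much smaller) per-season file is loaded into memory.
    return XDocument.Load(file);
}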
Your XML schema doesn't lend itself to parallelism, since you have node names (Season1, Season2) that contain the same kind of data but must be parsed individually. You could redesign your schema to use the same node name (i.e. Season) with attributes that express the differences in the data (i.e. Number to indicate the season number). Then you can parallelize, e.g. using LINQ to XML and PLINQ:
XDocument doc = XDocument.Load(@"TVShowSeasons.xml");
var seasonData = doc.Descendants("Season")
                    .AsParallel()
                    .Select(x => new Season()
                    {
                        Number = (int)x.Attribute("Number"),
                        Description = x.Value
                    }).ToList();
I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that will suit my needs, and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (a 4 GB file with some field values easily having several million characters).
While this seems to be a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this. I also have a feeling that I probably didn't have to write a parser from scratch for this anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
If for some reason that doesn't do it for you, try just reading line by line with a string.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily loaded, so don't close or alter the StreamReader until you've iterated (or caused a full load operation like ToList/ToArray or such - given your file size, however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from every line where the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O Completion Ports. You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records / text lines (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety, you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
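A minimal sketch of that chunked-read idea, assuming the 'þ'-quoted layout from the question and the built-in MemoryMappedFile class from .NET 4 (an assumption; the wrapper mentioned above would look slightly different):

using System.Collections.Generic;
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

// Scan the file in 8K chunks and yield one field at a time.
static IEnumerable<string> ReadFields(string path)
{
    using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
    using (var view = mmf.CreateViewStream())
    using (var reader = new StreamReader(view, Encoding.Default))
    {
        var buffer = new char[8192];
        var current = new StringBuilder();
        bool inField = false;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                char c = buffer[i];
                if (c == 'þ')
                {
                    if (inField) { yield return current.ToString(); current.Length = 0; }
                    inField = !inField;   // toggle between opening and closing quote
                }
                else if (inField)
                {
                    current.Append(c);    // only keep characters inside quotes
                }
                // characters between fields (the delimiter, line breaks, and any
                // trailing page padding from the mapped view) are ignored here
            }
        }
    }
}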
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different from anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
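As a rough illustration (not a complete implementation), the core of such a reader might look like the following, assuming the 'þ' quote character from the question and a single-byte encoding; EddReader and ReadFieldAsText are the hypothetical names used in the usage example below:

using System;
using System.IO;
using System.Text;

// Minimal sketch of a byte-at-a-time, state-machine field reader.
class EddReader : IDisposable
{
    private readonly Stream _stream;

    public EddReader(Stream stream) { _stream = stream; }

    // Reads the next 'þ'-quoted field as text, or returns null at end of file.
    public string ReadFieldAsText()
    {
        var sb = new StringBuilder();
        bool inField = false;
        int b;
        while ((b = _stream.ReadByte()) != -1)   // a buffered FileStream makes this cheap
        {
            char c = (char)b;                    // assumes a single-byte encoding
            if (c == 'þ')
            {
                if (inField) return sb.ToString();   // closing quote: field complete
                inField = true;                      // opening quote: start collecting
            }
            else if (inField)
            {
                sb.Append(c);
            }
            // bytes outside quotes (the field delimiter, line breaks) are skipped
        }
        return inField ? sb.ToString() : null;
    }

    public void Dispose() { _stream.Dispose(); }
}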
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
While this doesn't help address the large input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.