LINQ to XML Speed - C#

Quick question; the following line should be pretty self-explanatory:
doc.Descendants("DOB").Select(dob => dob.Value).All(dob => DateTime.Parse(dob) != DateTime.Parse(processing.DateOfBirth))
But just in case: I want to return false if any DOB node's value is the same datetime as
processing.DateOfBirth, because I'll need to add the date of birth to the XML if it's not in there.
My two questions are:
Is this the shortest amount of code to accomplish this with LINQ to XML? (I think it's not.)
and
This will be run against several million records; is there a more efficient way to accomplish it?
EDIT
I miscommunicated, sorry. The XML is small. There are millions of rows in a database, each representing a single person, with a column PersonXml that just has name, DOB, number, and a few other things. The rows are read in through a SqlDataReader and validated/updated; this check is part of that.

1) Not sure here; at least I can't think of a shorter way to write it.
2) If the format is always the same, you could consider working on the data directly, either on the string or on a stream. Parsing the string into an XDocument will always take its share of the time. You mentioned the XML is very small and might not change.
For a project where I am writing an XML file that is way too large to keep in memory, I have written a class (I actually found parts of that code on SO) that works on a FileStream and looks for matching patterns byte by byte. In your case this might be bad style and a bit inconvenient to work with, but if speed matters it will beat XDocument anytime.
-edit- Reading your edits, I think speed is not really something you are too concerned about, and this is a task you might have to do only once to correct some old data? In that case I'd suggest you just stick with your solution, maybe using a TaskFactory to spawn a new task for each row you receive from the database, and let it run overnight. In my mind that's the easiest and safest solution.
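For reference, here's a minimal sketch of the per-row check-and-add, assuming the PersonXml column value has been read into a personXml string for the current row and that the new element goes directly under the root (both assumptions on my part):

using System;
using System.Linq;
using System.Xml.Linq;

// Parse the PersonXml column value for the current row.
var doc = XDocument.Parse(personXml);
var target = DateTime.Parse(processing.DateOfBirth);

// Add the date of birth only if no existing DOB node already holds it.
if (doc.Descendants("DOB").All(dob => DateTime.Parse(dob.Value) != target))
{
    doc.Root.Add(new XElement("DOB", target.ToString("yyyy-MM-dd")));
    // ...write doc back to the PersonXml column as part of the update.
}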

Related

Parse a large CSV and stream the resulting rows

I'm attempting to read huge CSV files (50M+ rows, ~30 columns, multiple gigabyte files).
This will be run on business desktop-spec machines, so loading the file into memory isn't going to cut it. Streaming rows as they're parsed seems to be the sanest option.
To make things slightly more interesting, I only need 2 of the columns in the file, but the ordering of fields is not guaranteed and has to be derived from column headings.
As such, an iterator that returns array-per-row or similar would be excellent.
I can't just split on line breaks, as some of the field values may span multiple lines. I'd prefer to avoid manually checking which fields are quoted, unescaping as appropriate, etc...
Is there anything in the framework that will do this for me? If not, can someone give me some hints on how best to approach this?
You can try Cinchoo ETL, an open source library to read and write CSV files:
using ChoETL;

using (var reader = new ChoCSVReader("test.csv").WithFirstLineHeader()
    .WithField("Field1")
    .WithField("Field2")
    )
{
    foreach (dynamic item in reader)
    {
        Console.WriteLine(item.Field1);
        Console.WriteLine(item.Field2);
    }
}
Please check out articles at CodeProject on how to use it.
Hope it helps your needs.
Disclaimer: I'm the author of this library
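As for whether there is anything in the framework itself: Microsoft.VisualBasic.FileIO.TextFieldParser (usable from C# by referencing the Microsoft.VisualBasic assembly) handles quoted fields, including values that span multiple lines. A minimal sketch, with the file path being a placeholder:

using System.Collections.Generic;
using Microsoft.VisualBasic.FileIO;

static IEnumerable<string[]> ReadRows(string path)
{
    using (var parser = new TextFieldParser(path))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        parser.HasFieldsEnclosedInQuotes = true; // handles quoted values spanning lines
        while (!parser.EndOfData)
            yield return parser.ReadFields();
    }
}

The first yielded row is the header, so you can find the positions of your two columns there and index into the remaining rows.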

how to read a file in batches/parts/1000 lines at a time

I am currently trying to make a method that can handle reading in large XML files. All I need is a method that loads in, say, 1000 lines at a time, or in small batches.
I have been looking at StreamReaders, XmlReaders and FileStreams. I have seen some mentions of just keeping the stream open while processing data to get what I need, but I can't seem to get my head around it.
I have spent a long time checking the similar questions but can't seem to find anything that will help me.
P.S. My first thought was to do a for loop around ReadLine with a counter of 1000, but I can't seem to figure out how to continue from those 1000 lines to reading another 1000, etc., until the end of the file.
My feeling is that this will require a custom XML reader implementation.
For example - if your structure looks something like:
<root>
  <item>
    stuff
  </item>
  <item>
    stuff
  </item>
  <item>
    stuff
  </item>
  <item>
    stuff
  </item>
</root>
You'll have to write code that reads a number of <item> blocks (as many as you wish to process in a batch), and then converts them into a valid XML doc for further processing; a sketch follows below.
If, however, your XML doc is one massive sprawling entity, I don't think there's any elegant way you can process it piecemeal.
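For illustration, a minimal sketch of that idea using XmlReader plus XNode.ReadFrom to stream <item> elements in batches (the element name and batch size are assumptions):

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

static IEnumerable<List<XElement>> ReadItemBatches(string path, int batchSize)
{
    using (var reader = XmlReader.Create(path))
    {
        var batch = new List<XElement>(batchSize);
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.Name == "item")
            {
                // XNode.ReadFrom consumes the whole <item> subtree and
                // leaves the reader positioned on the following node.
                batch.Add((XElement)XNode.ReadFrom(reader));
                if (batch.Count == batchSize)
                {
                    yield return batch;
                    batch = new List<XElement>(batchSize);
                }
            }
            else
            {
                reader.Read();
            }
        }
        if (batch.Count > 0)
            yield return batch;
    }
}

Each returned batch is a self-contained chunk you can wrap in a new root element or process directly.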

Equivalent to HashSet.Contains that returns HashSet or index?

I have a large list of emails that I need to check to see if they contain a string. I only need to do this once. Originally I only needed to check whether each email matched any of the emails from a list of emails.
I was using if(ListOfEmailsToRemoveHashSet.Contains(email)) { Discard(email); }
This worked great, but now I need to check for partial matches, so I am trying to invert it. If I used the same method, I would be testing it like
if (ListOfEmailsHashSet.Contains(badString)). Obviously that tells me which string is being found, but not which item in the HashSet contains the bad string.
I can't see any way of making this work while still being fast.
Does anyone know of a function I can use that will return the HashSet of matches, the index of a matched item, or any way around this?
I only need to do this once.
If this is the case, performance shouldn't really be a consideration. Something like this should work:
if(StringsToDisallow.Any(be => email.Contains(be))) {...}
On a side note, you may want to consider using Regular Expressions rather than a straight black-list of contained strings. They'll give you a much more powerful, flexible way to find matches.
If performance does turn out to be an issue after all, you'll have to find a data structure that works better for full-text searching. It might be best to leverage an existing tool like Lucene.NET.
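If you do go the regex route, one way (a sketch, reusing the StringsToDisallow list from above) is to build a single alternation pattern up front, so each email is scanned only once:

using System.Linq;
using System.Text.RegularExpressions;

// Escape each blacklisted fragment and join them into one pattern.
var pattern = string.Join("|", StringsToDisallow.Select(Regex.Escape));
var blacklist = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

if (blacklist.IsMatch(email)) { Discard(email); }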
Just a note here: we had a program that was tasked with uploading in excess of 100,000 PDF/Excel/doc etc. files, and every time a file was uploaded an entry was made in a text file. Every night when the program ran, it would read this file and load the records into a static HashSet<string> FilesVisited = new HashSet<string>(); via FilesVisited.Add(reader.ReadLine());.
When the program attempted to upload a file, we had to first scan through the HashSet to see if we had already worked on the file. What we found was that
if (!FilesVisited.Contains(newFilePath))... would take a lot of time and would not give us the correct results (even if the file path was in there); alternatively, FilesVisited.Any(m => m.Contains(newFilePath)) was also a slow operation.
The best way we found to be fast was the traditional way of
foreach (var item in FilesVisited)
{
    if (item.Contains(newFilePath))
    {
        alreadyUploaded = true;
        break;
    }
}
Just thought I would share this....

getting element offset from XMLReader

how's everyone doing this morning?
I'm writing a program that will parse a(several) xml files.
This stage of the program is going to be focusing on adding/editing skills/schools/abilities/etc for a tabletop rpg (L5R). What I learn by this one example should carry me through the rest of the program.
So I've got the XML reading set up using XmlReader. The file I'm reading looks like...
<skills>
  <skill>
    <name>some name</name>
    <description>a skill</description>
    <type>high</type>
    <stat>perception</stat>
    <page>42</page>
    <availability>all</availability>
  </skill>
</skills>
I set up a Skill class, which holds the data, and a SkillEdit class which reads in the data, and will eventually have methods for editing and adding.
I'm currently able to read everything in correctly, but since the description can vary in length, I had the thought that once I write the edit method, the best way to ensure no data is overwritten would be to append the edited skill to the end of the file and wipe out its previous entry.
In order to do that, I would need to know the file offset of the <skill> tag and of the matching </skill> tag, but I can't seem to find any way of getting those offsets.
Is there a way to do that, or can you guys suggest a better implementation for editing an already existing skill?
If you read your XML into LINQ to XML's XDocument (or XElement), everything could become very easy. You can read, edit, add stuff, etc. to XML files using a simple interface.
e.g.,
var xmlStr = @"<skills>
<skill>
<name>some name</name>
<description>a skill</description>
<type>high</type>
<stat>perception</stat>
<page>42</page>
<availability>all</availability>
</skill>
</skills>
";
var doc = XDocument.Parse(xmlStr);
// find the skill "some name"
var mySkill = doc
    .Descendants("skill")                               // out of all skills
    .Where(e => e.Element("name").Value == "some name") // the one named "some name"
    .SingleOrDefault();                                 // select it
if (mySkill != null) // if found...
{
    var skillType = mySkill.Element("type").Value; // read the type
    var skillPage = (int)mySkill.Element("page");  // read the page (as an int)
    mySkill.Element("description").Value = "an AWESOME skill"; // change the description
    // etc...
}
No need to calculate offsets, do manual step-by-step reading, or maintain other state; it is all taken care of for you.
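One thing the example above does not show is persisting the change; assuming the document was loaded from a file (the file name here is hypothetical), writing it back is a single call:
// overwrite the original file with the modified in-memory document
doc.Save("skills.xml");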
Don't do it! In general, you can't reliably know anything about physical offsets in the serialized XML because of possible character encoding differences, entity references, embedded comments and a host of other things that can cause the physical and logical layers to have a complex relationship.
If your XML is just sitting on the file system, your safest option is to have a method in your skill class which serializes to XML (you already have one that reads it), and re-serialize whole objects when you need to.
Tyler,
Umm, sounds like you're suffering from a textbook case of premature optimization... Have you PROVEN that reading and writing the COMPLETE skill list to/from the XML file is TOO slow? No? Well then, until that's been proven, there IS NO PERFORMANCE ISSUE, right? So we just write the simplest code that works (i.e. does what we want, without worrying too much about performance), and then move on directly to the next bit of tricky functionality... testing as we go.
Iff (which is short for if-and-only-if) I had a PROVEN performance problem, then-and-only-then I'd consider writing each skill to an individual XML file, to avoid the necessity of rewriting a potentially large list of skills each time a single skill was modified... But this is "reference data", right? I mean you wouldn't de/serialize your (volatile) game data to/from an XML file, would you? Because an RDBMS is known to be much better at that job, right? So you're NOT going to be rewriting this file often?
Cheers. Keith.

What is the fastest way to parse text with custom delimiters and some very, very large field values in C#?

I've been trying to deal with some delimited text files that have non-standard delimiters (not comma/quote or tab delimited). The delimiters are random ASCII characters that don't show up often between the delimiters. After searching around, I seem to have found no solutions in .NET that suit my needs, and the custom libraries that people have written for this seem to have some flaws when it comes to gigantic input (a 4GB file with some field values easily having several million characters).
While this seems a bit extreme, it is actually a standard in the Electronic Document Discovery (EDD) industry for some review software to have field values that contain the full contents of a document. For reference, I've previously done this in Python using the csv module with no problems.
Here's an example input:
Field delimiter =
quote character = þ
þFieldName1þþFieldName2þþFieldName3þþFieldName4þ
þValue1þþValue2þþValue3þþSomeVery,Very,Very,Large value(5MB or so)þ
...etc...
Edit:
So I went ahead and created a delimited file parser from scratch. I'm kind of wary of using this solution, as it may be prone to bugs. It also doesn't feel "elegant" or correct to have to write my own parser for a task like this, and I have a feeling I probably didn't have to anyway.
Use the File Helpers API. It's .NET and open source. It's extremely high performance using compiled IL code to set fields on strongly typed objects, and supports streaming.
It supports all sorts of file types and custom delimiters; I've used it to read files larger than 4GB.
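A rough sketch of what that can look like with FileHelpers (the record layout and file name are assumptions, and I'm using the sample's 'þ' as the split character, just as the fallback below does; check the library docs for the exact attribute set in your version):

using System;
using FileHelpers;

[DelimitedRecord("þ")]
public class EddRecord
{
    public string Field1;
    public string Field2;
}

class Program
{
    static void Main()
    {
        // The async engine streams records instead of loading the whole file.
        var engine = new FileHelperAsyncEngine<EddRecord>();
        using (engine.BeginReadFile("huge-input.txt"))
        {
            foreach (EddRecord record in engine)
                Console.WriteLine(record.Field1);
        }
    }
}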
If for some reason that doesn't do it for you, try just reading line by line with a String.Split:
public IEnumerable<string[]> CreateEnumerable(StreamReader input)
{
    string line;
    while ((line = input.ReadLine()) != null)
    {
        yield return line.Split('þ');
    }
}
That'll give you simple string arrays representing the lines in a streamy fashion that you can even LINQ into ;) Remember, however, that the IEnumerable is lazily loaded, so don't close or alter the StreamReader until you've iterated (or caused a full-load operation like ToList/ToArray or such; given your file size, however, I assume you won't do that!).
Here's a good sample use of it:
using (StreamReader sr = new StreamReader("c:\\test.file"))
{
    var qry = from l in CreateEnumerable(sr).Skip(1)
              where l[3].Contains("something")
              select new { Field1 = l[0], Field2 = l[1] };
    foreach (var item in qry)
    {
        Console.WriteLine(item.Field1 + " , " + item.Field2);
    }
}
Console.ReadLine();
This will skip the header line, then print out the first two fields from the file wherever the 4th field contains the string "something". It will do this without loading the entire file into memory.
Windows and high-performance I/O means: use I/O completion ports (IOCP). You may have to do some extra plumbing to get it working in your case.
This is with the understanding that you want to use C#/.NET, and according to Joe Duffy:
18) Don't use Windows Asynchronous Procedure Calls (APCs) in managed code.
I had to learn that one the hard way ;), but ruling out APC use, IOCP is the only sane option. It also supports many other types of I/O, frequently used in socket servers.
As far as parsing the actual text, check out Eric White's blog for some streamlined stream use.
I would be inclined to use a combination of memory-mapped files (MSDN points to a .NET wrapper here) and a simple incremental parse, yielding back an IEnumerable of your records/text lines (or whatever).
You mention that some fields are very, very big; if you try to read them into memory in their entirety you may be getting yourself into trouble. I would read through the file in 8K (or similarly small) chunks, parse the current buffer, and keep track of state.
What are you trying to do with this data that you are parsing? Are you searching for something? Are you transforming it?
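A minimal sketch of that combination, assuming .NET 4's built-in System.IO.MemoryMappedFiles wrapper rather than a third-party one:

using System.IO;
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile("huge-input.txt", FileMode.Open))
using (var stream = mmf.CreateViewStream())
{
    var buffer = new byte[8192];
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Feed buffer[0..read) into an incremental parser here,
        // carrying parser state across chunk boundaries.
    }
}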
I don't see a problem with you writing a custom parser. The requirements seem sufficiently different to anything already provided by the BCL, so go right ahead.
"Elegance" is obviously a subjective thing. In my opinion, if your parser's API looks and works like a standard BCL "reader"-type API, then that is quite "elegant".
As for the large data sizes, make your parser work by reading one byte at a time and use a simple state machine to work out what to do. Leave the streaming and buffering to the underlying FileStream class. You should be OK with performance and memory consumption.
Example of how you might use such a parser class:
using (var reader = new EddReader(new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.Read, 8192)))
{
    // Read a small field
    string smallField = reader.ReadFieldAsText();
    // Read a large field
    Stream largeField = reader.ReadFieldAsStream();
}
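For illustration, a minimal sketch of the byte-at-a-time state machine such a reader could be built on; the 0xFE byte value for 'þ' assumes a Windows-1252-style single-byte encoding, and fields are buffered whole here rather than streamed:

using System.Collections.Generic;
using System.IO;
using System.Text;

static IEnumerable<string> ReadFields(Stream input)
{
    const byte Quote = 0xFE; // 'þ' in Windows-1252; an assumption about the encoding
    var encoding = Encoding.GetEncoding(1252);
    var field = new MemoryStream();
    bool inField = false;
    int b;
    while ((b = input.ReadByte()) != -1)
    {
        if (b == Quote)
        {
            if (inField)
            {
                // Closing quote: emit the buffered field value.
                yield return encoding.GetString(field.ToArray());
                field.SetLength(0);
            }
            inField = !inField;
        }
        else if (inField)
        {
            field.WriteByte((byte)b);
        }
        // Bytes between fields (the field delimiter, line breaks) are skipped.
    }
}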
While this doesn't help address the large-input issue, a possible solution to the parsing issue might include a custom parser that uses the strategy pattern to supply a delimiter.
