Which XML structure allows faster Add/Del/Update - C#

Which XML structure allows me faster adding, deleting and updating of a node?
My assumption is the first one, as the XML hierarchy is not that deep.
What do you think?
<Departments>
<Department Id="a Guid" IsVisible="True" />
</Departments>
OR
<Departments>
<Department>
<Id>a Guid</Id>
<IsVisible>True</IsVisible>
</Department>
</Departments>

It doesn't matter.
You have to read the entire file and parse it into a document structure, do the updates, then write the entire file. Updating the object structure is so little work compared to the file I/O that the structure doesn't matter.
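For illustration, a minimal sketch of that load/update/save cycle using LINQ to XML, based on the attribute layout from the question (the file name and Guid value are placeholders):

using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Load the whole file into memory (this dominates the cost).
        XDocument doc = XDocument.Load("departments.xml");

        // Update: flip IsVisible on one department.
        XElement dept = doc.Root
            .Elements("Department")
            .FirstOrDefault(d => (string)d.Attribute("Id") == "a Guid");
        if (dept != null)
            dept.SetAttributeValue("IsVisible", "False");

        // Add a new department.
        doc.Root.Add(new XElement("Department",
            new XAttribute("Id", Guid.NewGuid()),
            new XAttribute("IsVisible", true)));

        // Write the whole file back out.
        doc.Save("departments.xml");
    }
}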

In general
In general terms I tend to agree with the other answers here, but I'd like to add a few remarks. Performance is normally bound by the slowest factor involved, which is the network, the database connection, the file system, or even main memory when I/O is part of the picture. If we take that as a given, a possible conclusion is that the smaller the data, the bigger the performance improvement.
Other factors
But there's another factor. Attributes and elements are implemented differently. Attributes are implemented as something like key/value pairs with a uniqueness constraint and roughly take the size of chars * 2 + sizeof(int). Elements require a much larger structure in memory; for the sake of brevity, I like to use one simple factor that is an average over several implementations: 3.5 * chars. I use chars here because whether you store the document as UTF-8 or UTF-16 makes a storage difference, but not an in-memory difference.
The previous paragraph implies that attributes are faster. But this still isn't a simple fact, because attributes are not implemented as normal nodes, and searching for their data is generally slower than searching for data in element nodes. This is hard to measure in general terms and requires profiling for each particular situation.
LINQ
Then there's LINQ to XML. If you use it, reading and writing are done over streaming XML, which is relatively fast. The in-memory representation is usually much smaller, and parsing much faster, than with XmlDocument.
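As a small, hedged illustration of the streaming-write side: XStreamingElement defers evaluating its content until Save is called, so a large document can be written out element by element instead of being built up in memory first (the data below is made up):

using System;
using System.Linq;
using System.Xml.Linq;

class StreamingWrite
{
    static void Main()
    {
        // The query is only evaluated while saving, so the departments are
        // written one at a time rather than materialized as a full tree.
        var departments = new XStreamingElement("Departments",
            Enumerable.Range(0, 1000000).Select(_ => new XElement("Department",
                new XAttribute("Id", Guid.NewGuid()),
                new XAttribute("IsVisible", true))));

        departments.Save("departments.xml");
    }
}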
Names
The size of the names, i.e. the element and attribute names, does not matter. Internally they are keyed and given a unique ID. The contents of the elements and attributes, however, do add to the overall memory footprint.
If the names are very large compared to their content, minifying them will make your XML less readable, but will also require less I/O or network bandwidth. As such, in some cases, using small names may improve performance.
UTF-8 or UTF-16
Finally, I should add a note on the way you store it. Common sense says: store it as UTF-8. But that requires the parser to read each character and transform it in memory to UTF-16, which costs time. Sometimes a larger file (using UTF-16) can outperform a smaller one (using UTF-8) because the processing overhead is too big. Again, measuring your performance in several scenarios can help. Oh, and if you use a lot of (very) high characters, UTF-16 should be the preferred choice, because UTF-8 may then use 3 or even 4 bytes per character.
Summary
To sum it up, if speed is imperative and you cannot resort to a binary format:
Prefer attributes over elements, but only if DOM use is anticipated and searching / keying is not too important;
Prefer UTF-8 over UTF-16 only when the files are very large and you use few (very) high characters, measure to find out;
Prefer streaming over DOM for all your uses (LINQ typically uses streaming);
Don't bother using small names unless your I/O really is a bottleneck and the markup overhead is very large relative to the data;
Define a few typical usage scenarios and measure.
PS: the above is what comes to mind when thinking about XML. There may, of course, be many other factors that improve or degrade performance, the largest perhaps being your own skill in writing efficient procedures for your CRUD operations.

I doubt very much that you'd see a difference. XML parsing is very fast.
You'd have to test with hundreds of thousands, if not millions of records to measure the difference, which I think would be tiny.

The only way to find out which one is faster is to create some sample queries and run them a bunch of times while profiling and averaging. I doubt you'll find a difference.
I would go with whichever schema is more expressive and meets your requirements. To me that's the first one, since I doubt you'd ever want more than one Id or IsVisible per Department.
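If you do want numbers, a rough profiling sketch along those lines could look like this (the file names and the target Id are placeholders for two sample documents, one per layout):

using System;
using System.Diagnostics;
using System.Linq;
using System.Xml.Linq;

class Benchmark
{
    static void Main()
    {
        Console.WriteLine("attributes: {0}", Time(1000, () =>
            XDocument.Load("departments-attributes.xml").Root
                .Elements("Department")
                .Count(d => (string)d.Attribute("Id") == "a Guid")));

        Console.WriteLine("elements:   {0}", Time(1000, () =>
            XDocument.Load("departments-elements.xml").Root
                .Elements("Department")
                .Count(d => (string)d.Element("Id") == "a Guid")));
    }

    // Run the query many times and report the average, as suggested above.
    static TimeSpan Time(int iterations, Func<int> query)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) query();
        sw.Stop();
        return TimeSpan.FromTicks(sw.Elapsed.Ticks / iterations);
    }
}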

It would depend on what you were using to do the adding, updating and deleting. All things being equal, I would expect the first to be faster, but by a truly very, very negligible amount. I would also not be even slightly amazed to find that some libraries work faster with the second (due to differences in in-memory model representations, which are completely implementation-defined).
Assuming there will only ever be one Id and one IsVisible per Department, I'd go for the first (with the unquoted-attribute bug in the original fixed), as it helps restrict the format in itself and is a clear fit. I wouldn't be upset at having to use the latter, though.

Related

Effective approach on fast look up of unique words in C#

I have the following problem: I have to store a list of unique words in multiple languages in memory, and of course when I add new words I have to check whether the new word already exists.
Of course this needs to be blazingly fast, primarily because of the huge number of words.
I was thinking about implementing a Suffix Tree, but I wondered whether there is an easier approach with some already implemented internal structures.
P.S. Number of words ≈ 10^7.
First, note that suffix trees might be overkill here, since they allow fast search for any suffix of any word, which is probably more than you are looking for. A trie is a very similar data structure that also allows fast lookup of a whole word, but since it does not support fast search for arbitrary suffixes, it is simpler to build (both to program and in terms of efficiency).
Another, simpler alternative is a plain hash table, which is available in C# as HashSet<string>. While a hash set is in theory slower in the worst case, the average lookup takes constant time, and it might well be enough for your application.
My suggestion is:
First try using a HashSet<string>, which requires much less effort to implement; benchmark it and check whether it is enough.
Make sure your data structure is swappable, so you can replace it with very little effort if you later decide to. This is usually done by introducing an interface responsible for the addition and lookup of words (see the sketch after this list); if you need to change the structure, just provide a different implementation of that interface.
If you do decide to add suffix tree or trie - use community resources, no need to reinvent the wheel - someone has already implemented most of these data structures, and they are available online.
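A minimal sketch of such an interface with a HashSet-backed implementation (the names IWordSet and HashWordSet are made up for illustration):

using System;
using System.Collections.Generic;

// Swappable contract: HashSet today, a trie or suffix tree later if profiling demands it.
public interface IWordSet
{
    bool Add(string word);      // returns false if the word was already present
    bool Contains(string word);
    int Count { get; }
}

public sealed class HashWordSet : IWordSet
{
    // Ordinal comparison avoids culture-specific string handling costs.
    private readonly HashSet<string> words = new HashSet<string>(StringComparer.Ordinal);

    public bool Add(string word) => words.Add(word);
    public bool Contains(string word) => words.Contains(word);
    public int Count => words.Count;
}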

How to avoid System.OutOfMemoryException in c# when building a non recombining trinomial tree

I have "successfully" implemented a non recombining trinomial tree to price certain fixed-income derivatives. (Something like shown in the picture below - but with three branches that don't reconnect)
Unfortunately it turned out that the number of nodes I can use was severely limited by the available memory. If I build a tree with 20 time-steps this results in 3^19 nodes (so 1,1 Billion nodes)
The nodes of each time step are saved in a List<Node>, and these lists are stored in a Dictionary<double, List<Node>>.
Each node is instantiated via new Node(...). I also instantiate each of the lists and the dictionary via their constructors (new List<Node>(), new Dictionary<double, List<Node>>()). Perhaps this is the source of my error.
Also, the System.OutOfMemoryException isn't thrown because the Dictionary/List object is too large (as is often the case) but because I seem to have too many Nodes: after a while new Node(...) can't allocate any further memory. Eventually the 2GB maximum object size will also kick in, I think, seeing as the lists grow exponentially larger with each time step.
Perhaps my data-structure is too wasteful or not really suited for the task at hand.
A possible solution could be to save the tree to a text-file thus avoiding the memory-problem completely. This however would necessitate a HUGE workaround.
Edit:
To add some more background: I need the tree to price path-dependent products, which means that unfortunately I will have to access all the nodes. What is more, after the tree has been built, I start from the leaves and go backwards in time to determine the price. I also already generate only the nodes I need.
Edit2:
I have given the topic some thought and also considered the various responses. Could it be that I just need to serialize the respective tree levels to the hard drive? So basically, I create one time step (List<Node>), write it to disk, and so on. Later on, when I start from the leaves, I will just have to load the levels in reverse order.
You basically have two choices: evaluate only the branches you care about (Andrew's yield suggestion) and don't store the results, or build up your tree, save it to disk, and implement a custom collection interface on top of it that accesses the right part of the disk. In that case you still keep only a minimal amount of data in your process memory and rely on the OS to do proper disk caching to make access fast. If you start working with large data sets, the second option is a good tool to have in your tool belt, so you should probably write it with reuse in mind.
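A rough sketch of the storage layer for the second option, assuming (purely for illustration) that a Node can be reduced to a few primitive fields, here a single double, and that each time step goes into its own binary file:

using System.Collections.Generic;
using System.IO;

// Sketch only: a real Node will have more fields, and levels near the leaves
// may still need to be split into chunks to respect object size limits.
public sealed class Node
{
    public double Value;
}

public static class TreeLevelStore
{
    public static void SaveLevel(int timeStep, List<Node> level)
    {
        using (var writer = new BinaryWriter(File.Create("level-" + timeStep + ".bin")))
        {
            writer.Write(level.Count);
            foreach (Node node in level)
                writer.Write(node.Value);
        }
    }

    public static List<Node> LoadLevel(int timeStep)
    {
        using (var reader = new BinaryReader(File.OpenRead("level-" + timeStep + ".bin")))
        {
            int count = reader.ReadInt32();
            var level = new List<Node>(count);
            for (int i = 0; i < count; i++)
                level.Add(new Node { Value = reader.ReadDouble() });
            return level;
        }
    }
}

Backward induction then becomes: load the last level, roll it back, load the previous level, and so on, so only one or two levels sit in memory at any time.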
What we have here is a classic problem of doing an enormous amount of processing up front... and then storing EVERYTHING into memory to be processed at a later time.
While simple, given harsh enough conditions (like having a billion entries), it will eat up all the memory.
Now, the OP didn't really specify what the intention of the tree was or how it was going to be used... but I would propose that instead of building it all at once... build it as you need it.
Lazy Evaluation with yield
Instead of doing everything all at once and having to store it... it might be ideal to do it ONLY when you actually require it. Check out this post for more info and examples of using yield.
This won't work great, though, if you need to traverse the tree a bunch of times... but it might still allow you to reach greater depth than you currently can.
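A minimal sketch of that idea, assuming (for illustration only) that the tree is defined by a fixed set of additive per-step moves; each path through the non-recombining tree is produced on demand and nothing is stored:

using System.Collections.Generic;
using System.Linq;

public static class LazyTree
{
    // Lazily enumerate every path through a non-recombining tree with the
    // given branching moves. Only the path currently being consumed exists in memory.
    public static IEnumerable<double[]> Paths(double start, int steps, double[] moves)
    {
        if (steps == 0)
        {
            yield return new[] { start };
            yield break;
        }
        foreach (double move in moves)                          // e.g. three branches
            foreach (double[] rest in Paths(start + move, steps - 1, moves))
                yield return new[] { start }.Concat(rest).ToArray();
    }
}

For example, foreach (var path in LazyTree.Paths(100.0, 20, new[] { -1.0, 0.0, 1.0 })) { /* price the path */ } walks all 3^20 paths one at a time.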
I don't think serializing to disk will help much. For one, when you attempt to deserialize the list you will still run out of memory (as, to the best of my knowledge, there is no way to partially deserialize an object).
Have you considered changing your data structure into a relational database model and storing it in a SQLEXPRESS database?
This would give you the added benefit of performing queries with indexes instead of your custom tree traversal logic.

Sparse matrix compression with fast access time

I'm writing a lexer generator as a spare time project, and I'm wondering about how to go about table compression. The tables in question are 2D arrays of short and very sparse. They are always 256 characters in one dimension. The other dimension is varying in size according to the number of states in the lexer.
The basic requirements for the compression are that:
The data should be accessible without decompressing the full data set, and accessible in constant, O(1), time.
Reasonably fast to compute the compressed table.
I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
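For readers who haven't met it, a minimal sketch of the lookup side of row displacement (the field names are illustrative): the sparse rows are overlaid into one flat next array, and a parallel check array records which state owns each slot, which keeps lookups O(1):

public sealed class DisplacedTable
{
    private readonly short[] next;      // overlaid row data
    private readonly int[] check;       // which state wrote each slot
    private readonly int[] rowOffset;   // displacement chosen per state during generation

    public DisplacedTable(short[] next, int[] check, int[] rowOffset)
    {
        this.next = next;
        this.check = check;
        this.rowOffset = rowOffset;
    }

    // O(1): if the slot belongs to this state return its entry, otherwise the default (0).
    public short Lookup(int state, byte symbol)
    {
        int index = rowOffset[state] + symbol;
        return (index < next.Length && check[index] == state) ? next[index] : (short)0;
    }
}

The expensive part, which the question is really about, is choosing the rowOffset values so that the non-zero entries of different rows don't collide.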
I suppose an option would be to use a Dictionary, but that feels like cheating, and I would like the fast access times that I would be able to get if I use straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.
From what I can gather, flex does not use this algorithm for its lexing tables. Instead it seems to use something called row/column equivalence, which I haven't really been able to find any explanation for.
I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
Edit: To clarify more about what this data actually is: it is state information for state transitions in the lexer. The data needs to be stored in a compressed format in memory, since the state tables can potentially be huge. It's also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute, in part due to my silly implementation.
Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.
The data is very sparse. It is usually a big bunch of zeroes followed by a few shorts for each state. It would be trivial to, for instance, run-length encode it, but that would spoil the linear access time.
Flex apparently has two pairs of tables, base and default for the first pair and next and check for the second pair. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.
This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.
Are your tables known beforehand, so that you just need an efficient way to store and access them?
I'm not really familiar with the problem domain, but if your table has a fixed size along one axis (256), would an array of size 256, where each element is a variable-length vector, work? Do you want to be able to pick out an element given an (x, y) pair?
Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you get minimal space requirements and O(1) lookups (i.e. no collisions).
None of these solutions employ any type of compression, though; they just minimize the amount of space wasted.
What's unclear is whether your table has a "sequence property" in one dimension or the other.
Sequence property naturally happens in human speech, since a word is composed of many letters, and the sequence of letters is likely to appear later on. It's also very common in binary program, source code, etc.
On the other hand, sampled data, such as raw audio or seismic values, does not exhibit a sequence property. Such data can still be compressed, but using another model (such as a simple "delta model" followed by "entropy").
If your data has a sequence property in either of the two dimensions, then you can use a common compression algorithm, which will give you both speed and reliability. You just need to feed it input that is "sequence friendly" (i.e. select your dimension).
If speed is a concern for you, have a look at this C# implementation of a fast compressor which is also a very fast decompressor: https://github.com/stangelandcl/LZ4Sharp

What is faster in xml parsing: elements or attributes?

I am writing code that parses XML.
I would like to know what is faster to parse: elements or attributes.
This will have a direct effect over my XML design.
Please target the answers to C# and the differences between LINQ and XmlReader.
Thanks.
Design your XML schema so that the representation of the information actually makes sense. Usually, the decision between making something an attribute or an element will not affect performance.
Performance problems with XML are in most cases related to large amounts of data being represented in a very verbose XML dialect. A typical countermeasure is to zip the XML data when storing it or transmitting it over the wire.
If that is not sufficient then switching to another format such as JSON, ASN.1 or a custom binary format might be the way to go.
Addressing the second part of your question: The main difference between the XDocument (LINQ) and the XmlReader class is that the XDocument class builds a full document object model (DOM) in memory, which might be an expensive operation, whereas the XmlReader class gives you a tokenized stream on the input document.
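A common middle ground, sketched here on the assumption that you only need one element at a time, is to drive an XmlReader manually but materialize each interesting element as an XElement via XNode.ReadFrom:

using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;

public static class XmlStreaming
{
    // Streams through the document with XmlReader, but hands back convenient
    // XElement objects one at a time instead of building the whole DOM.
    public static IEnumerable<XElement> StreamElements(string path, string name)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            reader.MoveToContent();   // position on the root element
            reader.Read();            // step inside it
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == name)
                    yield return (XElement)XNode.ReadFrom(reader);  // consumes the element
                else
                    reader.Read();
            }
        }
    }
}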
With XML, speed is dependent on a lot of factors.
With regards to attributes or elements, pick the one that more closely matches the data. As a guideline, we use attributes for, well, attributes of an object, and elements for contained sub-objects.
Depending on the amount of data you are talking about, using attributes can save you a bit on the size of your XML streams. For example, <person id="123" /> is smaller than <person><id>123</id></person>. This doesn't really impact the parsing, but it will impact the speed of sending the data across a network or loading it from disk; if we are talking about thousands of such records, then it may make a difference to your application.
Of course, if that actually does make a difference then using JSON or some binary representation is probably a better way to go.
The first question you need to ask is whether XML is even required. If it doesn't need to be human readable then binary is probably better. Heck, a CSV or even a fixed-width file might be better.
With regards to LINQ vs XmlReader, this is going to boil down to what you do with the data as you are parsing it. Do you need to instantiate a bunch of objects and handle them that way or do you just need to read the stream as it comes in? You might even find that just doing basic string manipulation on the data might be the easiest/best way to go.
Point is, you will probably need to examine the strengths of each approach beyond just "what parses faster".
Without having any hard numbers to prove it, I know that the WCF team at Microsoft chose to make the DataContractSerializer their standard for WCF. It's limited in that it doesn't support XML attributes, but it is indeed up to 10-15% faster than the XmlSerializer.
From that information, I would assume that using XML attributes will be slower to parse than if you use only XML elements.
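For reference, a minimal, hedged DataContractSerializer sketch; note that the [DataMember] properties come out as XML elements, never attributes (the Department type is just an illustration):

using System;
using System.IO;
using System.Runtime.Serialization;

[DataContract]
public class Department
{
    [DataMember] public Guid Id { get; set; }
    [DataMember] public bool IsVisible { get; set; }
}

public static class Demo
{
    public static void Main()
    {
        var serializer = new DataContractSerializer(typeof(Department));
        using (var stream = new MemoryStream())
        {
            serializer.WriteObject(stream, new Department { Id = Guid.NewGuid(), IsVisible = true });
            stream.Position = 0;
            Console.WriteLine(new StreamReader(stream).ReadToEnd());   // element-only XML
        }
    }
}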

In-memory search index for application takes up too much memory - any suggestions?

In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates problems with out-of-memory errors, as the 32-bit Windows limit of 2GB of memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
And this is created during the data load, when each object is processed such that the ApplicationObject's key string and description words are stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively is it possible to create a kind of CompressedDictionary? As it is storing lots of strings I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index - might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings, use a Trie
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency is like - how many of your dictionary entries have single element lists, how many have two element lists etc. You could potentially store several separate dictionaries - one for "I've only got one element" (direct mapping) then "I've got two elements" (map to a Pair struct with the two references in) etc until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts etc).
If none of this makes much sense, let me know and I'll try to come up with some code.
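A hedged sketch of that tiered layout (ApplicationObject stands in for the type from the question; everything else is illustrative):

using System;
using System.Collections.Generic;

public class ApplicationObject { /* as defined elsewhere in the application */ }

public sealed class CompactIndex
{
    private readonly Dictionary<string, ApplicationObject> single =
        new Dictionary<string, ApplicationObject>();
    private readonly Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>> pair =
        new Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>>();
    private readonly Dictionary<string, List<ApplicationObject>> many =
        new Dictionary<string, List<ApplicationObject>>();

    public void Add(string word, ApplicationObject item)
    {
        if (many.TryGetValue(word, out var list)) { list.Add(item); return; }
        if (pair.TryGetValue(word, out var two))
        {
            // Promote a two-entry word to a full list on its third occurrence.
            pair.Remove(word);
            many[word] = new List<ApplicationObject> { two.Key, two.Value, item };
            return;
        }
        if (single.TryGetValue(word, out var one))
        {
            // Promote a single-entry word to a pair.
            single.Remove(word);
            pair[word] = new KeyValuePair<ApplicationObject, ApplicationObject>(one, item);
            return;
        }
        single[word] = item;
    }

    public IEnumerable<ApplicationObject> Retrieve(string word)
    {
        if (single.TryGetValue(word, out var one)) return new[] { one };
        if (pair.TryGetValue(word, out var two)) return new[] { two.Key, two.Value };
        if (many.TryGetValue(word, out var list)) return list;
        return Array.Empty<ApplicationObject>();
    }
}

The win comes from single- and double-entry words never paying for a List and its mostly empty backing array.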
I agree with bobwienholt, but if you are indexing datasets I assume they came from a database somewhere. Would it make sense to just search that with a search engine like DTSearch or Lucene.net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and reflect the Dictionary class, you will see that it is composed of buckets. You can probably use each of these buckets as a page and a physical file (this way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" record to the file, and clean the file up every so often.
By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
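A minimal sketch of what that might look like, assuming the stored object's identity is its key string (the property names are made up):

public sealed class ApplicationObject
{
    public string Key { get; private set; }
    public string Description { get; private set; }

    public ApplicationObject(string key, string description)
    {
        Key = key;
        Description = description;
    }

    // Two objects with the same key are considered the same entry.
    public override bool Equals(object obj)
    {
        var other = obj as ApplicationObject;
        return other != null && Key == other.Key;
    }

    public override int GetHashCode()
    {
        return Key == null ? 0 : Key.GetHashCode();
    }
}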
How about using Memory Mapped File Win32 API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
Is the index only added to or do you remove keys from it as well?
