Options for dealing with very, very large strings - C#

I have a C# program that uses a production grammar to generate 3D models of trees, flowers, and similar organic entities (see the Wikipedia entry on L-Systems for more info). When I'm generating a large tree with leaves, I (expectedly) get exponential growth in the string, which would go up to hundreds of gigabytes if I let it (and I'd like to).
Constraints - I have to do this (sort of) in C# - the C++/native side is busy compiling and rendering the rather immense geometry that's produced.
So StringBuilder is right out --- even if it could handle it, I don't have enough memory!
I don't want to do a pure file based solution - waaaaaayyyyyyyy toooooooooooo sloooooooooooowwww!
I can't change the grammar - I realize I could compress the standard L-Systems notation, but it's a context sensitive grammar, so once you've got it working, you become positively superstitious about fiddling with it.
Things I've considered
Memory mapped files - I don't mind using P/Invoke to get to the native layer to support things, I just don't want to rewrite the whole production system in C++ - but I haven't found much in the way of handy libraries for C# to access this functionality
Low level mucking about with memory management/page faulting, etc - but hey, if I did that I might as well sell it as a product - makes the slow pure file based solution not look like such a bad idea
Anybody got any ideas here? How do I efficiently traverse/manipulate/expand multi-gig strings produced by a production grammar?

If you can upgrade to .NET 4.0, then you can use memory-mapped files without needing to P/Invoke.
http://msdn.microsoft.com/en-us/library/dd997372.aspx
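For example, here's a minimal sketch of what that API looks like (the file name, map name, and sizes are made up; in practice you'd stream the production through a moving window rather than a single fixed 1 KB view):

```csharp
using System.IO.MemoryMappedFiles;
using System.Text;

class MmfDemo
{
    static void Main()
    {
        const long capacity = 1L << 30; // 1 GB backing file to start; grow as needed

        // Back the map with a real file so the OS pages it in and out for you.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            "production.dat", System.IO.FileMode.Create, "lsystem", capacity))
        {
            // Write into one window of the file...
            using (var accessor = mmf.CreateViewAccessor(0, 1024))
            {
                byte[] symbols = Encoding.ASCII.GetBytes("F[+F]F[-F]F");
                accessor.WriteArray(0, symbols, 0, symbols.Length);
            }

            // ...and read it back through another view, without loading the rest.
            using (var accessor = mmf.CreateViewAccessor(0, 1024))
            {
                var buffer = new byte[1024];
                accessor.ReadArray(0, buffer, 0, buffer.Length);
            }
        }
    }
}
```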

You're quite right that the typical approach to compression involves the notion of a pre-existing plaintext. What I'm talking about here is something like the idea of using a trie data structure as opposed to a dictionary. It's not just about passively compressing, but rather using an inherently more compact representation that encodes the redundancies implicitly. If you're at the 100G mark today, you're within an order of magnitude of bursting past the limits of affordable hard drives, so you might benefit from rethinking the solution.
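As a rough illustration of the idea (the names and structure here are mine, not anything from the question): a trie stores many symbol sequences while sharing common prefixes, so repeated fragments of the expansion are represented once instead of over and over.

```csharp
using System.Collections.Generic;

// A minimal sketch of a character trie; common prefixes reuse nodes,
// which is the implicit compression being described above.
class SymbolTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsTerminal;
    }

    private readonly Node _root = new Node();

    // Add one production fragment; shared prefixes reuse existing nodes.
    public void Insert(string fragment)
    {
        var node = _root;
        foreach (char symbol in fragment)
        {
            Node next;
            if (!node.Children.TryGetValue(symbol, out next))
            {
                next = new Node();
                node.Children.Add(symbol, next);
            }
            node = next;
        }
        node.IsTerminal = true;
    }

    public bool Contains(string fragment)
    {
        var node = _root;
        foreach (char symbol in fragment)
        {
            if (!node.Children.TryGetValue(symbol, out node))
                return false;
        }
        return node.IsTerminal;
    }
}
```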

If this is only for your development machines then a "back to the future" solution might be a RAM Disk, aka RAM Drive.
A RAM disk or RAM drive is a block of RAM (primary storage or volatile memory) that a computer's software is treating as if the memory were a disk drive (secondary storage).
Here's one product, for example. Search for "RAM disk" or "RAM drive" and you'll get a cornucopia of choices.

Related

Is there a difference between storing data in your program or a document file?

I'm developing a text-based game and was wondering if there would be any issues if I were to just write all the text in the code instead of making something like a CSV file to read the data from. It won't be as organised, but I was wondering if the game would take more memory or have worse performance if I were to put the text in code instead of a text document or CSV file.
Some advantages and disadvantages:
In-a-file
Easy to modify, easy to localize
Not very secure, anyone can look at it and hack it
Has to be parsed at runtime, what to do about errors if badly formed?
Tiny performance hit on load (probably not worth worrying about)
In an embedded resource
No separate file lying on the hard drive for your users to examine and hack (easily)
Can localize fairly easily
In-code
No need to parse the input file; syntax and structure are checked at compile time
Can define your game resources using strongly-typed values, e.g. Room cave = new Room("Cave", "Long description ...").
Can define more complex relationships between objects without resorting to string-id references between them. cave.ConnectsTo(passageway), cave.Contains(sword), ...
In terms of memory consumption it's a wash - the strings will be in memory in either case - unless you are writing a huge game, in which case a database would be more appropriate, with the ability to easily load individual areas of the map and eject ones no longer needed from memory.
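For concreteness, here's a hedged sketch of the in-code option; the Room members are hypothetical, just fleshing out the snippets above:

```csharp
using System.Collections.Generic;

// Rooms defined as strongly typed objects, with relationships expressed
// directly instead of via string-id references. API is illustrative only.
class Room
{
    public string Name { get; private set; }
    public string Description { get; private set; }
    private readonly List<Room> _exits = new List<Room>();
    private readonly List<string> _items = new List<string>();

    public Room(string name, string description)
    {
        Name = name;
        Description = description;
    }

    public void ConnectsTo(Room other) { _exits.Add(other); }
    public void Contains(string item) { _items.Add(item); }
}

static class World
{
    public static Room Build()
    {
        var cave = new Room("Cave", "A damp cave. Water drips somewhere in the dark.");
        var passageway = new Room("Passageway", "A narrow passage sloping downward.");
        cave.ConnectsTo(passageway);   // relationship checked at compile time
        cave.Contains("sword");
        return cave;
    }
}
```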

Effectively caching large data structures

I have an application which gathers information from the file system. For a roughly 200 GB hard drive, the data structures in my application use up to 1 GB of RAM. My data structures are, in particular, Dictionaries and Tries. So my question is: how do I effectively cache these large data structures? What I can think of is:
Serialization - plain text or XML; which would be faster?
Swap space - is it possible? I couldn't find any helpful resources.
Local database - I don't think this will be fast, since it will have to handle a lot of queries.
Anything else - feel free to let me know
Thank you in advance guys.
Take a look at memory-mapped files.

MemoryMappedFile and b-tree for cache application

This is just an idea; I don't have any code yet, and I need some design advice. I would implement a cache (non-distributed in the first instance) using MemoryMappedFile in C#. I think it would be good to have a B-tree as the underlying structure, but this is debatable as well.
So the questions are:
Is a B-tree a good structure for fast item lookup when the underlying storage is a memory-mapped file?
What tips and tricks are there for memory-mapped files? How large can a view be, and what are the drawbacks when it is too small or too large?
Multithreading considerations: how do we deal with memory-mapped files and concurrency? Caches are expected to be hit heavily by clients, so what strategy gives the best performance?
As #Internal Server Error asked, I'm adding this to the question:
The key would be a string, about 64 chars max length. The data would be a byte[] about 1024 bytes long, but consider an average of 128 bytes. Or better: what I want to cache are OR/M entities, so consider how long a serialized entity is in bytes with something like a BSON serializer.
B-Tree is good (with memory-mapped files), but if the file is not always entirely kept in resident memory then a page-aligned B+Tree is much better. See also.
The trick with memory-mapped files is to use a 64-bit architecture so that you can map the entire file into memory; otherwise you'd have to map only parts of it, and plain cached reads might end up faster than the mmap.
Try CAS (compare-and-swap) over the shared memory. See also.
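Here's a rough sketch of the page-aligned idea, under assumptions that aren't in the question (4 KB pages, a persisted file, one view per node read); a real cache would pool views, dispose the map, and add the concurrency control discussed above:

```csharp
using System.IO.MemoryMappedFiles;

// Reads one page-aligned B+Tree node out of a memory-mapped cache file.
// Page size, file name, and node layout are invented for illustration.
class PageStore
{
    private const int PageSize = 4096;          // align nodes to OS pages
    private readonly MemoryMappedFile _mmf;

    public PageStore(string path, long capacity)
    {
        _mmf = MemoryMappedFile.CreateFromFile(
            path, System.IO.FileMode.OpenOrCreate, "cache", capacity);
    }

    // Map just the one page that holds the requested node.
    public byte[] ReadPage(long pageNumber)
    {
        var buffer = new byte[PageSize];
        using (var view = _mmf.CreateViewAccessor(pageNumber * PageSize, PageSize))
        {
            view.ReadArray(0, buffer, 0, PageSize);
        }
        return buffer;
    }
}
```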

Algorithm for Source Control System?

I need to write a simple source control system and wonder what algorithm I should use for file differences.
I don't want to look into existing source code due to license concerns. I need to have it licensed under MPL so I can't look at any of the existing systems like CVS or Mercurial as they are all GPL licensed.
Just to give some background: I just need some really simple functionality - binary files in a folder, no subfolders, and every file behaves like its own repository. No metadata except for some permissions.
Overall really simple stuff. My single concern is how to store only the differences of a file from revision to revision without wasting too much space, but also without being too inefficient (maybe store a full version every X changes, a bit like keyframes in video?).
Longest Common Subsequence algorithms are the primary mechanism used by diff-like tools, and can be leveraged by a source code control system.
"Reverse Deltas" are a common approach to storage, since you primarily need to move backwards in time from the most recent revision.
Patience Diff is a good algorithm for finding deltas between two files that are likely to make sense to people. This often gives better results than the naive "longest common subsequence" algorithm, but results are subjective.
Having said that, many modern revision control systems store complete files at each stage, and compute the actual differences later, only when needed. For binary files (which probably aren't terribly compressible), you may find that storing reverse deltas might be ultimately more efficient.
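For reference, here is a minimal sketch of the textbook dynamic-programming LCS length over two arrays of lines (illustrative only; a real diff also walks the table back to emit the insert/delete hunks):

```csharp
static class Diff
{
    // dp[i, j] = length of the longest common subsequence of a[0..i) and b[0..j)
    public static int LongestCommonSubsequence(string[] a, string[] b)
    {
        var dp = new int[a.Length + 1, b.Length + 1];
        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                dp[i, j] = a[i - 1] == b[j - 1]
                    ? dp[i - 1, j - 1] + 1                          // lines match: extend
                    : System.Math.Max(dp[i - 1, j], dp[i, j - 1]);  // skip one line
            }
        }
        return dp[a.Length, b.Length];
    }
}
```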
How about looking at the source code of Subversion? It's licensed under the Apache License 2.0.
Gene Myers has written a good paper An O(ND) Difference Algorithm and its Variations. When it comes to comparing sequences, Myers is the man. You probably should also read Walter Tichy's paper on RCS; it explains how to store a set of files by storing the most recent version plus differences.
The idea of storing deltas (forwards or backwards) is classic with respect to version control. The issue has always been, "what delta do you store?"
Lots of source control systems store deltas as computed essentially by "diff", e.g., the line-oriented complement of longest common subsequences. But you can compute deltas for specific types of documents in a way specific to those documents, to get smaller (and often more understandable) deltas.
For programming language source code, one can compute Levenshtein distances over program structures. A set of tools that do essentially this for a variety of popular programming languages can be found at Smart Differencer.
If you are storing non-text files, you might be able to take advantage of their structure to compute smaller deltas.
Of course, if what you want is a minimal implementation, then just storing the complete image of each file version is easy. Terabyte disks make that solution workable if not pretty. (The PDP10 file system used to do this implicitly).
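As a concrete reference point for the Levenshtein idea above, here is a plain token-level edit distance sketch (the Smart Differencer applies the same notion to parse-tree structures rather than flat token arrays; this is just the classic algorithm):

```csharp
static class EditDistance
{
    // Classic Levenshtein distance between two token sequences.
    public static int Levenshtein(string[] a, string[] b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;  // delete everything
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;  // insert everything

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = System.Math.Min(
                    System.Math.Min(d[i - 1, j] + 1,    // delete a[i-1]
                                    d[i, j - 1] + 1),   // insert b[j-1]
                    d[i - 1, j - 1] + cost);            // substitute
            }
        }
        return d[a.Length, b.Length];
    }
}
```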
Though fossil is GPL, the delta algorithm is based on rsync and described here
I was actually thinking about something similar to this the other day... (odd, huh?)
I don't have a great answer for you, but I did come to the conclusion that if I were to write a file diff tool, I would do so with an algorithm (for finding diffs) that behaves somewhat like regexes do with their greediness.
As for storing diffs... if I were you, instead of storing forward-facing diffs (i.e. you start with your original file and then compute 150 diffs against it when you're working with version 151), store diffs for your history but keep your latest file as a full version. If you do it this way, then whenever you're working with the latest file (which is probably 99% of the time), you'll get the best performance.
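A hedged sketch of that storage layout (the delta format and the applyDelta function are placeholders, not a real diff implementation): the newest revision is kept whole, and older versions are rebuilt by applying reverse deltas backwards.

```csharp
using System.Collections.Generic;

class FileHistory
{
    public byte[] LatestFull;                                 // revision N, stored whole
    public List<byte[]> ReverseDeltas = new List<byte[]>();   // N->N-1, N-1->N-2, ...

    // applyDelta is a hypothetical patch function supplied by your diff code.
    public byte[] Checkout(int stepsBack, System.Func<byte[], byte[], byte[]> applyDelta)
    {
        var current = LatestFull;
        for (int i = 0; i < stepsBack; i++)
            current = applyDelta(current, ReverseDeltas[i]);  // walk backwards in time
        return current;
    }
}
```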

Should I store localization content in the application state

I am developing my first multilingual C# site and everything is going ok except for one crucial aspect. I'm not 100% sure what the best option is for storing strings (typically single words) that will be translated by code from my code behind pages.
On the front end of the site I am going to use asp.net resource files for the wording on the pages. This part is fine. However, this site will make XML calls and the XML responses are only ever in english. I have been given an excel sheet with all the words that will be returned by the XML broken into the different languages but I'm not sure how best to store/access this information. There are roughly 80 words x 7 languages.
I am thinking about creating a dictionary object for each language, created by my Global.asax file when the application starts up and just kept in memory. The plus side of doing this is that each dictionary object will only have to be created once (until IIS restarts) and can be accessed by any user without needing to be rebuilt, but the downside is that I have 7 dictionary objects constantly stored in memory. The server is Windows 2008 64-bit with 4 GB of RAM, so should I even be concerned about the memory taken up by this method?
What do you guys think would be the best way to store/retrieve different language words that would be used by all users?
Thanks for your input.
Rich
From what you say, you are looking at 560 words which need to differ based on locale. This is a drop in the ocean. The resource file method which you have contemplated is fit for purpose, and I would recommend using it. Resource files integrate with controls, so you will be making the most of them.
If it did trouble you, you could put them in a sliding cache (a 20-minute sliding expiration, for example), but I do not see anything wrong with your choice in this solution.
OMO
Cheers,
Andrew
P.S. Have a read through this to see how you can find and bind values in different resource files to controls and literals, and use them programmatically.
http://msdn.microsoft.com/en-us/magazine/cc163566.aspx
As long as you are aware of the impact of doing so, then yes, storing this data in memory would be fine (provided you have enough memory to do so). Once you know what is appropriate for the current user, tossing it into memory is fine. You might look at something like MemCached Win32 or Velocity, though, to offload the storage to another app server. Use this even in your local application for the time being; that way, when it is time to push this to another server or grow your app, you have a clear separation of concerns defined at your caching layer.

Keep in mind that the more languages you support, the more you are storing in memory, so keep an eye on the amount of data held on your lone app server as it could become overwhelming over time. Also, make sure that the keys you are using are specific to the language; otherwise you might find that you are storing a menu in German for an English user.
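For what it's worth, here's a minimal sketch of the in-memory approach described in the question (the language codes and the loader are placeholders; the point is one dictionary per language, built once at application start, with language-specific keys):

```csharp
using System.Collections.Generic;

public static class XmlWordCache
{
    // language code -> (English word from the XML -> translated word)
    private static readonly Dictionary<string, Dictionary<string, string>> Cache =
        new Dictionary<string, Dictionary<string, string>>();

    // Call once from Application_Start in Global.asax.
    public static void Load()
    {
        // The seven language codes here are placeholders.
        foreach (var lang in new[] { "en", "fr", "de", "es", "it", "nl", "pt" })
        {
            Cache[lang] = LoadWordsFor(lang);   // roughly 80 entries each
        }
    }

    public static string Translate(string lang, string englishWord)
    {
        Dictionary<string, string> words;
        string translated;
        if (Cache.TryGetValue(lang, out words) &&
            words.TryGetValue(englishWord, out translated))
        {
            return translated;
        }
        return englishWord;   // fall back to the English word from the XML response
    }

    private static Dictionary<string, string> LoadWordsFor(string lang)
    {
        // Placeholder: read from a resource file, database, or the spreadsheet export.
        return new Dictionary<string, string>();
    }
}
```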
