I am making a state-saving replay system in Unity3D and I want the replay to be written to a file. What file format is best to use when saving a replay? XML maybe? For now I'm storing the transform data, and I've implemented the option to add additional frame data.
It depends heavily on what you are trying to do; there are tradeoffs for every solution in this case.
Based on the extra information you have given in the comments, the best solution I can think of in this case is marshalling your individual "recording sessions" into a file. However, there is a little extra work needed to achieve this.
Create a class called Frame and another class called Record that holds a List<Frame> frames. That way, you can place any information you would like captured each frame as fields of the Frame class.
Since you can't marshal a generic type, you will have to marshal each frame individually. I suggest implementing a method in the Record class called MarshalRecording() that handles that for you.
However, in doing that, you will find it difficult to unmarshal your records, because the frames may have different sizes in binary form and there would be no separator indicating where one frame ends and the next begins. I suggest prepending the size information to each marshalled frame; that way you will be able to unmarshal all of the frames even if they have different sizes.
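A minimal sketch of that length-prefix approach, assuming a hypothetical Frame that only captures a position and one extra value (adapt the fields to whatever you record per frame):

```csharp
using System.Collections.Generic;
using System.IO;

public class Frame
{
    public float X, Y, Z;   // position
    public float RotY;      // example extra frame data

    public byte[] Marshal()
    {
        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms))
        {
            w.Write(X); w.Write(Y); w.Write(Z);
            w.Write(RotY);
            return ms.ToArray();
        }
    }
}

public class Record
{
    public List<Frame> Frames = new List<Frame>();

    // Writes each frame prefixed with its byte length so it can be read back
    // even if frames end up with different sizes.
    public void MarshalRecording(Stream output)
    {
        using (var w = new BinaryWriter(output))
        {
            foreach (var frame in Frames)
            {
                byte[] bytes = frame.Marshal();
                w.Write(bytes.Length);   // 4-byte length prefix
                w.Write(bytes);          // frame payload
            }
        }
    }
}
```

Reading it back is the mirror image: read the 4-byte length, then read exactly that many bytes and unmarshal them into a Frame.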
As @PiotrK pointed out in his answer, you could use protobuf. However, I don't recommend it for your specific use case; personally, I think it is overkill (too much work for too little result, and protobufs can be a PITA sometimes).
If you are worried about storage size, you could LZ4 the whole thing (if you are concatenating the binary information in memory), or LZ4 each frame (if you are processing each frame individually and then appending it to a file). I recommend the latter because, depending on the number of frames, you may run out of memory while marshalling the whole record at once.
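A rough sketch of the per-frame variant, using GZip from the BCL as a stand-in (swap in an LZ4 library such as K4os.Compression.LZ4 if you specifically want LZ4):

```csharp
using System.IO;
using System.IO.Compression;

public static class FrameCompression
{
    // Compress one marshalled frame; the compressed bytes are then written
    // to the file with the same length prefix as before.
    public static byte[] Compress(byte[] frameBytes)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionLevel.Fastest))
                gzip.Write(frameBytes, 0, frameBytes.Length);
            return buffer.ToArray();
        }
    }
}
```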
PS: Never use XML for this; it is far too verbose for per-frame data.
The best way I know is to use some kind of format generator (like Google Protobuf). This has several advantages over the default C# serializer:
Versioning of the format is as easy as possible; you can add new features to the format without breaking replays already existing in the field
The result is stored as either text or binary, with binary preferable as it has a very small footprint (for example, if some of your data has default values, it won't be present in the output at all - the loader will handle it gracefully)
It's Google technology! :-)
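A rough sketch of what a versionable replay message could look like, assuming the protobuf-net library (one common protobuf option for C#); the field numbers are what give you painless versioning, since new fields get new numbers and old files still load:

```csharp
using System.Collections.Generic;
using System.IO;
using ProtoBuf;

[ProtoContract]
public class ReplayFrame
{
    [ProtoMember(1)] public float X;
    [ProtoMember(2)] public float Y;
    [ProtoMember(3)] public float Z;
    // Added in a later version; old replays simply leave it at the default.
    [ProtoMember(4)] public float Speed;
}

[ProtoContract]
public class Replay
{
    [ProtoMember(1)] public List<ReplayFrame> Frames = new List<ReplayFrame>();
}

public static class ReplayIo
{
    public static void Save(Replay replay, string path)
    {
        using (var file = File.Create(path))
            Serializer.Serialize(file, replay);
    }

    public static Replay Load(string path)
    {
        using (var file = File.OpenRead(path))
            return Serializer.Deserialize<Replay>(file);
    }
}
```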
Related
I'm developing a PC app in Visual Studio where I'm showing the status of hundreds of sensors that are connected via WiFi. The thing is that I need to hold on to the sensor data even after I close the app, so I'm considering some form of permanent storage. These are the options I've considered:
1) My Sensor object is relatively compact with only a few properties. I could serialize all the objects before closing the app and load them every time the app starts anew.
2) I could throw all the properties (which are mostly strings and doubles) into a simple text file and create a custom protocol for storage and retrieval.
3) I could integrate a database with my app. Someone told me this is the best way to go about it, but I'm a bit hesitant seeing as I'm not familiar with DBs.
Which method would yield the best results in terms of resource usage and speed? Or is there some other, better way to go about this?
The first thing you need to do is understand your problem. For example, when the program is running, do you need to have everything in memory at the same time, or do you work with your sensors one at a time?
What is a "large amount of data"? For example, to me that would never be less than a million records (or a billion in some cases).
Once you know that, you shouldn't be scared of using something just because you are not familiar with it. Otherwise you are not looking for the best solution to your problem; you are just hacking around it in a way that feels comfortable.
That being said, you have several ways of doing this. As you said, you can serialize the data, store it as JSON, and use a few other alternatives, but if we are talking about a "large amount of data that we want to persist", I would always reach for a database (the name says a lot). If you don't need to have everything in memory at the same time, then I believe this is your best option.
I personally don't like them (again, a personal choice), but one way of avoiding most of the SQL while still working with your objects is to use an ORM like NHibernate (you will also need to learn how to use it properly so you don't make things slower).
If you need to have everything loaded at the same time (most often that is not the case, so be sure of this), you need to know what you want to keep and serialize it. If you want that data to be readable by another tool or organized in a given way, consider a data format like XML or JSON.
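A minimal sketch of the serialize-on-shutdown option, assuming a hypothetical Sensor class and System.Text.Json (Json.NET would work just as well): dump the whole list to a file when the app closes and load it back on startup.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

public class Sensor
{
    public string Id { get; set; }
    public string Name { get; set; }
    public double LastReading { get; set; }
}

public static class SensorStore
{
    // Call on shutdown: serialize every sensor to one JSON file.
    public static void Save(List<Sensor> sensors, string path) =>
        File.WriteAllText(path, JsonSerializer.Serialize(sensors));

    // Call on startup: load the previous state, or start empty.
    public static List<Sensor> Load(string path) =>
        File.Exists(path)
            ? JsonSerializer.Deserialize<List<Sensor>>(File.ReadAllText(path))
            : new List<Sensor>();
}
```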
Also, you can use a memory-mapped file.
The file is persistent and keeps the data between program runs.
So you just keep your data structs in the memory-mapped area, and that's it.
MSDN manual here:
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366556%28v=vs.85%29.aspx
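A rough sketch of the same idea in C# (System.IO.MemoryMappedFiles wraps the Win32 API the MSDN page describes); SensorRecord here is a made-up blittable struct, since memory-mapped structs cannot contain strings.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

public struct SensorRecord
{
    public int Id;
    public double LastReading;
    public long LastUpdateTicks;
}

public class SensorMap
{
    private static readonly int RecordSize =
        System.Runtime.InteropServices.Marshal.SizeOf(typeof(SensorRecord));

    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;

    public SensorMap(string path, int maxSensors)
    {
        // Backed by a real file, so the data survives program restarts.
        _file = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, (long)RecordSize * maxSensors);
        _view = _file.CreateViewAccessor();
    }

    public void Write(int index, SensorRecord record) =>
        _view.Write((long)index * RecordSize, ref record);

    public SensorRecord Read(int index)
    {
        _view.Read((long)index * RecordSize, out SensorRecord record);
        return record;
    }
}
```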
Since you need to load all the data once at the start of the program, the database case seems doubtful; a database is most useful when you need to load a little bit of data many times.
So the first two options seem preferable. I would advise hiding the specific solution behind an interface; then you can change it later.
Standard .NET serialization of the sensor array is probably simpler, and it will be easier to extend.
I am implementing a file-based queue of serialized objects, using C#.
Push() will serialize an object as binary and append it to the end of the file.
Pop() should deserialize an object from the beginning of the file (this part I got working). Then, the deserialized part should be removed from the file, making the next object to be "first".
From the standpoint of the file system, that would just mean copying the file header several bytes further along the disk and then moving the "beginning of the file" pointer. The question is how to implement this in C#. Is it at all possible?
The easiest approach that I can see:
1) Stream out (like a log; dump records into a file).
(Note: you'd need some delimiters and a consistent format for your file, based on what your data is.)
2) Later, stream in (just read the file from the start, in one go, and process it without removing anything).
That would work fine as FIFO (first in, first out).
So my suggestion: don't try to optimize that by removing, skipping, etc.; rather regroup and use more files.
3) If you worry about the scale of things, then just partition that into small enough files, e.g. 100 or 1,000 records each (it depends; do some calculations).
You may need to make some sort of 'virtualizer' here, which maps the files and keeps track of your 'database' as if it were spread over multiple files. The simplest approach is to just use the file system and check file times etc., or add some basic code to improve on that; a minimal sketch of that approach follows.
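A minimal sketch of the "stream out, stream in, partition into files" idea, assuming records that can be written as length-prefixed byte arrays; names like FileQueue and RollSize are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public class FileQueue
{
    private const int RollSize = 1000;   // records per partition file
    private readonly string _dir;
    private string _currentFile;
    private int _countInCurrent;

    public FileQueue(string dir)
    {
        _dir = dir;
        Directory.CreateDirectory(dir);
        _currentFile = Path.Combine(dir, DateTime.UtcNow.Ticks + ".queue");
    }

    // Push: append a length-prefixed record to the current partition file.
    public void Push(byte[] record)
    {
        using (var stream = new FileStream(_currentFile, FileMode.Append))
        using (var writer = new BinaryWriter(stream))
        {
            writer.Write(record.Length);
            writer.Write(record);
        }
        if (++_countInCurrent >= RollSize)   // roll over to a new file
        {
            _currentFile = Path.Combine(_dir, DateTime.UtcNow.Ticks + ".queue");
            _countInCurrent = 0;
        }
    }

    // Drain: read partition files oldest-first, process every record, and
    // delete each file only after it has been fully processed ("rollback"
    // simply means not deleting the file and trying it again later).
    public void Drain(Action<byte[]> process)
    {
        var files = Directory.GetFiles(_dir, "*.queue");
        Array.Sort(files);   // tick-named files sort oldest-first
        foreach (var file in files)
        {
            using (var stream = File.OpenRead(file))
            using (var reader = new BinaryReader(stream))
            {
                while (stream.Position < stream.Length)
                {
                    int length = reader.ReadInt32();
                    process(reader.ReadBytes(length));
                }
            }
            File.Delete(file);
        }
    }
}
```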
However, I think you may have problems if you have to ensure 'transactions' - i.e. if things fail, you need to keep track of where the file left off, retrace, etc.
That might be an issue, but you know best whether it's really necessary to have that (how critical it is). You can always work 'per file', with smaller files. If processing fails, roll back and do the file again (or log the problem). If it succeeds, you can delete the file (after success) and go on like that.
This is a very 'hand-made' approach, but it should get you going with a simple and not too demanding solution (like the one you're describing). Or something along those lines.
I should probably add...
You could also save yourself some trouble and use some portable database for this, or something similar. The above was purely based on the idea of hand-coding the simplest solution (and we could probably come up with something smarter, but it being late, this is what I have :).
Files don't work that way. You can trim off the end, but not the beginning. In order to mutate a file to remove content at the beginning you need to re-write the entire file.
I expect you'll want to find some other way to solve your problem. But a linear file is totally inappropriate for representing a FIFO queue.
I'm writing a lexer generator as a spare-time project, and I'm wondering how to go about table compression. The tables in question are 2D arrays of shorts and are very sparse. They are always 256 characters in one dimension. The other dimension varies in size according to the number of states in the lexer.
The basic requirements of the compression are that:
The data should be accessible without decompressing the full data set, and accessible in constant O(1) time.
It should be reasonably fast to compute the compressed table.
I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
I suppose an option would be to use a Dictionary, but that feels like cheating, and I would like the fast access times that I would be able to get if I use straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.
From what I can gather, flex does not use this algorithm for its lexing tables. Instead it seems to use something called row/column equivalence, which I haven't really been able to find any explanation of.
I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
Edit: To clarify more about what this data actually is: it is state information for state transitions in the lexer. The data needs to be stored in a compressed format in memory, since the state tables can potentially be huge. It's also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute - in part due to my silly implementation.
Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.
The data is very sparse. It is usually a big bunch of zeroes followed by a few shorts for each state. It would be trivial to, for instance, run-length encode it, but that would spoil the constant-time access.
Flex apparently has two pairs of tables: base and default for the first pair, and next and check for the second pair. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.
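For what it's worth, a sketch of how the textbook base/default/next/check lookup is usually described (the dragon book's row-displacement scheme, not flex's actual generated code; the array names here are illustrative):

```csharp
public static class TransitionTable
{
    public static int NextState(short[] baseOf, short[] defaultOf,
                                short[] next, short[] check,
                                int state, int symbol)
    {
        // Walk the default chain until we reach a row that actually owns
        // the entry for this symbol, then take its transition.
        while (check[baseOf[state] + symbol] != state)
            state = defaultOf[state];

        return next[baseOf[state] + symbol];
    }
}
```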
This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.
Are your tables known beforehand, and do you just need an efficient way to store and access them?
I'm not really familiar with the problem domain, but if your table has a fixed size along one axis (256), would an array of size 256, where each element is a variable-length vector, work? Do you want to be able to pick out an element given an (x, y) pair?
Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you get minimal space requirements and O(1) lookups (i.e. no collisions).
None of these solutions employs any actual compression, though; they just minimize the amount of space wasted.
What's unclear is whether your table has the "sequence property" in one dimension or the other.
The sequence property naturally occurs in human language, since a word is composed of many letters and the same sequence of letters is likely to appear again later. It's also very common in binary programs, source code, etc.
On the other hand, sampled data, such as raw audio or seismic values, does not exhibit the sequence property. Such data can still be compressed, but using another model (such as a simple "delta model" followed by entropy coding).
If your data has the "sequence property" in either of the two dimensions, then you can use a common compression algorithm, which will give you both speed and reliability. You just need to feed it input that is "sequence friendly" (i.e. select the right dimension).
If speed is a concern for you, have a look at this C# implementation of a fast compressor which is also a very fast decompressor: https://github.com/stangelandcl/LZ4Sharp
I need to generate etags for image files on the web. One of the possible solutions I thought of would be to calculate CRCs for the image files, and then use those as the etag.
This would require CRCs to be calculated every time someone requests an image on the server, so it's very important that it can be done fast.
So, how fast are algorithms to generate CRCs? Or is this a stupid idea?
Instead, use a more robust hashing algorithm such as SHA1.
Speed depends on the size of the image. Most time will be spent on loading data from the disk, rather than in CPU processing. You can cache your generated hashes.
But I would also advise creating the ETag based on the last update date of the file, which is much quicker and does not require loading the whole file.
Remember, an ETag only needs to be unique for a particular resource, so if two different images have the same last update time, that is fine.
Most implementations, including Microsoft's own, use the last modified date or other file headers as the ETag, and I suggest you use that method.
Depends on the method used, and the length. Generally pretty fast, but why not cache them?
If there won't be changes to the files more often than the resolution of the system used to store them (that is, file modification times for the filesystem, or SQL Server datetime if stored in a database), then why not just use the date of modification at the relevant resolution?
I know RFC 2616 advises against the use of timestamps, but this is only because HTTP timestamps have 1-second resolution and there can be changes more frequent than that. However:
That's still fine if you don't change images more than once a second.
It's also fine to base your ETag on the time, as long as the precision is high enough that two versions of the same resource won't end up with the same value.
With this approach you are guaranteed a unique ETag (collisions are unlikely with a large CRC, but certainly possible), which is what you want.
Of course, if you don't ever change the image at a given URI, it's even easier, as you can just use a fixed string (I prefer the string "immutable").
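A rough sketch of building an ETag from the file's last write time, as suggested above (the quoting follows the ETag syntax; the path handling is illustrative only):

```csharp
using System.IO;

public static class ETagHelper
{
    // Ticks give sub-second precision, so two writes in the same second still
    // get different tags as long as the filesystem records finer timestamps.
    public static string FromLastModified(string imagePath)
    {
        long ticks = File.GetLastWriteTimeUtc(imagePath).Ticks;
        return "\"" + ticks.ToString("x") + "\"";
    }
}
```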
I would suggest calculating the hash once, when adding an image into the database, and then just returning it with a SELECT along with the image itself.
If you are using SQL Server and the images are not very large (max 8000 bytes), you can leverage the HASHBYTES() function, which is able to generate SHA-1, MD5, ...
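If the images are too large for HASHBYTES(), a sketch of doing the same thing once in application code at insert time (the table and column names in the comment are made up):

```csharp
using System;
using System.Security.Cryptography;

public static class ImageHashing
{
    // Compute once at insert time and store alongside the image, e.g.
    //   INSERT INTO Images (Data, ETag) VALUES (@data, @etag)
    // then just SELECT the stored ETag when serving the image.
    public static string ComputeETag(byte[] imageBytes)
    {
        using (var sha1 = SHA1.Create())
            return Convert.ToBase64String(sha1.ComputeHash(imageBytes));
    }
}
```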
In our desktop application, we have implemented a simple search engine using an inverted index.
Unfortunately, some of our users' datasets can get very large, e.g. taking up ~1GB of memory before the inverted index has been created. The inverted index itself takes up a lot of memory, almost as much as the data being indexed (another 1GB of RAM).
Obviously this creates out-of-memory problems, as the 32-bit Windows limit of 2GB of memory per application is hit, or users with lower-spec computers struggle to cope with the memory demand.
Our inverted index is stored as a:
Dictionary<string, List<ApplicationObject>>
This is created during the data load, when each object is processed such that the ApplicationObject's key string and description words are stored in the inverted index.
So, my question is: is it possible to store the search index more efficiently space-wise? Perhaps a different structure or strategy needs to be used? Alternatively is it possible to create a kind of CompressedDictionary? As it is storing lots of strings I would expect it to be highly compressible.
If it's going to be 1GB... put it on disk. Use something like Berkeley DB. It will still be very fast.
Here is a project that provides a .net interface to it:
http://sourceforge.net/projects/libdb-dotnet
I see a few solutions:
If you have the ApplicationObjects in an array, store just the index - might be smaller.
You could use a bit of C++/CLI to store the dictionary, using UTF-8.
Don't bother storing all the different strings; use a trie.
I suspect you may find you've got a lot of very small lists.
I suggest you find out roughly what the frequency is like - how many of your dictionary entries have single element lists, how many have two element lists etc. You could potentially store several separate dictionaries - one for "I've only got one element" (direct mapping) then "I've got two elements" (map to a Pair struct with the two references in) etc until it becomes silly - quite possibly at about 3 entries - at which point you go back to normal lists. Encapsulate the whole lot behind a simple interface (add entry / retrieve entries). That way you'll have a lot less wasted space (mostly empty buffers, counts etc).
If none of this makes much sense, let me know and I'll try to come up with some code.
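Here is a rough sketch of the split-dictionary idea described above; CompactIndex, the promotion thresholds, and the ApplicationObject stub are all made up for illustration.

```csharp
using System.Collections.Generic;

public class ApplicationObject { }   // stand-in for the existing class

public class CompactIndex
{
    private readonly Dictionary<string, ApplicationObject> _single =
        new Dictionary<string, ApplicationObject>();
    private readonly Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>> _pairs =
        new Dictionary<string, KeyValuePair<ApplicationObject, ApplicationObject>>();
    private readonly Dictionary<string, List<ApplicationObject>> _many =
        new Dictionary<string, List<ApplicationObject>>();

    public void Add(string word, ApplicationObject item)
    {
        if (_many.TryGetValue(word, out var list)) { list.Add(item); return; }

        if (_pairs.TryGetValue(word, out var pair))
        {
            // Third entry: promote to a real list.
            _pairs.Remove(word);
            _many[word] = new List<ApplicationObject> { pair.Key, pair.Value, item };
            return;
        }

        if (_single.TryGetValue(word, out var only))
        {
            // Second entry: promote to a pair, still no List overhead.
            _single.Remove(word);
            _pairs[word] = new KeyValuePair<ApplicationObject, ApplicationObject>(only, item);
            return;
        }

        _single[word] = item;   // first entry: just the reference
    }

    public IEnumerable<ApplicationObject> Get(string word)
    {
        if (_single.TryGetValue(word, out var one)) return new[] { one };
        if (_pairs.TryGetValue(word, out var pair)) return new[] { pair.Key, pair.Value };
        if (_many.TryGetValue(word, out var list)) return list;
        return System.Array.Empty<ApplicationObject>();
    }
}
```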
I agree with bobwienholt, but if you are indexing datasets, I assume they came from a database somewhere. Would it make sense to just search that with a search engine like dtSearch or Lucene.Net?
You could take the approach Lucene did. First, you create a random-access in-memory stream (System.IO.MemoryStream); this stream mirrors an on-disk one, but only a portion of it (if you have the wrong portion, load another one off the disk). This does cause one headache: you need a file-mappable format for your dictionary. Wikipedia has a description of the paging technique.
On the file-mappable scenario: if you open up Reflector and inspect the Dictionary class, you will see that it is composed of buckets. You can probably use each of these buckets as a page in a physical file (that way inserts are faster). You can then also loosely delete values by simply appending an "item x deleted" marker to the file and cleaning the file up every so often.
By the way, buckets hold values with identical hashes. It is very important that the values you store override the GetHashCode() method (and the compiler will warn you about Equals(), so override that as well). You will get a significant speed increase in lookups if you do this.
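A minimal sketch of overriding GetHashCode()/Equals() on the stored type, assuming the ApplicationObject is identified by its key string:

```csharp
public class ApplicationObject
{
    public string Key { get; set; }
    public string Description { get; set; }

    // Hash and equality based only on the identifying key, so dictionary
    // lookups don't fall back to reference equality.
    public override int GetHashCode() =>
        Key != null ? Key.GetHashCode() : 0;

    public override bool Equals(object obj) =>
        obj is ApplicationObject other && Key == other.Key;
}
```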
How about using the memory-mapped file Win32 API to transparently back your memory structure?
http://www.eggheadcafe.com/articles/20050116.asp has the PInvokes necessary to enable it.
Is the index only added to or do you remove keys from it as well?