Can an implementation of Protobuf-Net beat what I currently have? [closed] - c#

I posted a related but still different question regarding Protobuf-Net before, so here goes:
I wonder whether someone (esp. Marc) could comment on which of the following would most likely be faster:
(a) I currently store serialized built-in data types in a binary file: specifically, a long (8 bytes) and 2 floats (2 x 4 bytes). Each group of three later makes up one object in deserialized state. The long represents DateTime ticks for lookup purposes. I use a binary search to find the start and end locations of a data request. A method then reads the data in one chunk (from start to end location), knowing that each chunk consists of many of the above-described triplets (1 long, 1 float, 1 float) and that each triplet is always 16 bytes long. Thus the number of triplets retrieved is always (endLocation - startLocation) / 16. I then iterate over the retrieved byte array, deserialize each built-in type (using BitConverter), instantiate a new object from each triplet, and store the objects in a list for further processing.
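For reference, the read path of (a) boils down to something like this (a condensed sketch rather than my actual code; the Tick name and Bid/Ask fields are just stand-ins for my triplet):

// Decode a chunk of fixed 16-byte records (long ticks, float, float) with BitConverter.
// 'chunk' is assumed to be the byte range between the two offsets found by the binary search.
public sealed class Tick
{
    public long Ticks; public float Bid; public float Ask;
    public Tick(long ticks, float bid, float ask) { Ticks = ticks; Bid = bid; Ask = ask; }
}

public static List<Tick> DecodeChunk(byte[] chunk)
{
    const int RecordSize = 16;
    var result = new List<Tick>(chunk.Length / RecordSize);
    for (int offset = 0; offset + RecordSize <= chunk.Length; offset += RecordSize)
    {
        long ticks = BitConverter.ToInt64(chunk, offset);
        float bid = BitConverter.ToSingle(chunk, offset + 8);
        float ask = BitConverter.ToSingle(chunk, offset + 12);
        result.Add(new Tick(ticks, bid, ask));
    }
    return result;
}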
(b) Would it be faster to do the following? Build a separate file (or implement a header) that functions as an index for lookup purposes. Then I would not store individual binary versions of the built-in types, but instead use Protobuf-net to serialize a List of the above-described objects (each object built from a triplet of long, float, float). Each List would always contain exactly one day's worth of data (remember, the long represents DateTime ticks). Obviously each List would vary in size, hence my idea of generating another file or header for index lookups, because each data read request would only ever ask for a multiple of full days. When I want to retrieve the serialized list for one day I would simply look up the index, read the byte array, deserialize it using Protobuf-Net, and already have my List of objects. I guess I am asking because I do not fully understand how deserialization of collections works in protobuf-net.
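What I imagine for (b) is roughly the following (a sketch only, I have not written this yet; it reuses the Tick triplet from the sketch above, now annotated for protobuf-net, and omits the index handling):

[ProtoContract]
public sealed class Tick
{
    [ProtoMember(1)] public long Ticks;
    [ProtoMember(2)] public float Bid;
    [ProtoMember(3)] public float Ask;
}

[ProtoContract]
public sealed class DayBlock
{
    [ProtoMember(1)] public List<Tick> Ticks = new List<Tick>();
}

public static void WriteDay(Stream stream, DayBlock day)
{
    // the stream position before/after this call would go into the separate index file
    Serializer.SerializeWithLengthPrefix(stream, day, PrefixStyle.Base128, 1);
}

public static DayBlock ReadDay(Stream stream)
{
    // caller seeks to the offset recorded in the index first
    return Serializer.DeserializeWithLengthPrefix<DayBlock>(stream, PrefixStyle.Base128, 1);
}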
To give a better idea about the magnitude of the data: each binary file is about 3 GB in size and thus contains many millions of serialized objects. Each file contains about 1000 days' worth of data. Each data request may ask for any number of days' worth of data.
What, in your opinion, is faster in raw processing time? I wanted to garner some input before potentially writing a lot of code to implement (b). I currently have (a) and am able to process about 1.5 million objects per second on my machine (process = from data request to returned List of deserialized objects).
Summary: I am asking whether binary data can be read from disk and deserialized faster using approach (a) or approach (b).

I currently store serialized built-in datatypes in a binary file. Specifically, a long(8 bytes), and 2 floats (2x 4 bytes).
What you have is (and no offence intended) some very simple data. If you're happy dealing with raw data (and it sounds like you are) then it sounds to me like the optimum way to treat this is: as you are. Offsets are a nice clean multiple of 16, etc.
Protocol buffers generally (not just protobuf-net, which is a single implementation of the protobuf specification) is intended for a more complex set of scenarios:
nested/structured data (think: xml i.e. complex records, rather than csv i.e. simple records)
optional fields (some data may not be present at all in the data)
extensible / version tolerant (unexpected or only semi-expected values may be present)
in particular, can add/deprecate fields without it breaking
cross-platform / schema-based
and where the end-user doesn't need to get involved in any serialization details
It is a bit of a different use case! As part of this, protocol buffers uses a small but necessary field-header notation (usually one byte per field), and you would need a mechanism to separate records, since they aren't fixed-size - which is typically another 2 bytes per record. And ultimately, the protocol buffers handling of float is IEEE-754, so you would be storing the exact same 2 x 4 bytes, but with added overhead. The handling of a long integer can be fixed or variable size within the protocol buffers specification.
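For illustration, a rough sketch of how that fixed/variable choice can be expressed per member in protobuf-net (placeholder type and member names):

[ProtoContract]
public sealed class Sample
{
    // DataFormat.FixedSize forces a fixed 8-byte encoding; the default is varint,
    // which is smaller for small values but can be larger for big tick counts.
    [ProtoMember(1, DataFormat = DataFormat.FixedSize)] public long Ticks;

    // floats are always 4 bytes on the wire (IEEE-754), plus the field header.
    [ProtoMember(2)] public float Bid;
    [ProtoMember(3)] public float Ask;
}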
For what you are doing, and since you care about fastest raw processing time, simple seems best. I'd leave it "as is".

I think using a "chunk" per day together with an index is a good idea since it will let you do random access as long as each record is 16 byte fixed size. If you have an index keeping track of the offset to each day in the file, you can also use memory mapped files to create a very fast view of the data for a specific day or range of days.
One of the characteristics of protocol buffers is that they turn fixed-size data into variable-size data, since values are compressed (e.g. a long value of zero is written using one byte). That is usually a benefit, but it may give you issues with random access in huge volumes of data.
I'm not the protobuf expert (I have a feeling that Marc will fill you in here) but my feeling is that Protocol Buffers are really best suited for small to medium sized volumes of nontrivial structured data accessed as a whole (or at least in whole records). For very large random access streams of data I don't think there will be a performance gain as you may lose the ability to do simple random access when different records may be compressed by different amounts.


Protobuf-net IsPacked=true for user defined structures

Is it currently possible to use IsPacked=true for user defined structures? If not, then is it planned in the future?
I get the following exception when I try to apply that attribute to a field of type ColorBGRA8[]: System.InvalidOperationException : Only simple data-types can use packed encoding
My scenario is as follows: I'm writing a game and have tons of blittable structures for various things such as colors, vectors, matrices, vertices and constant buffers. Their memory layout needs to be precisely defined at compile time in order to match, for example, the constant buffer layout from a shader (where fields generally need to be aligned on a 16-byte boundary).
I don't mean to waste anyone's time, but I couldn't find any recent information about this particular question.
Edit after it has been answered
I am currently testing a solution which uses protobuf-net for almost everything except large arrays of user-defined but blittable structures. All my fields that were arrays of custom structures have been replaced by arrays of bytes, which can be packed. After protobuf-net has finished deserializing the data, I use memcpy via p/invoke to turn the bytes back into an array of custom structures.
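The copy step amounts to something like the following (a simplified sketch using GCHandle pinning and Marshal.Copy rather than the p/invoked memcpy; it assumes the struct really is blittable):

// Requires System.Runtime.InteropServices.
// struct[] -> byte[] before serialization, and back again after deserialization.
public static byte[] ToBytes<T>(T[] source) where T : struct
{
    int size = Marshal.SizeOf(typeof(T)) * source.Length;
    var bytes = new byte[size];
    GCHandle handle = GCHandle.Alloc(source, GCHandleType.Pinned);
    try { Marshal.Copy(handle.AddrOfPinnedObject(), bytes, 0, size); }
    finally { handle.Free(); }
    return bytes;
}

public static T[] FromBytes<T>(byte[] bytes) where T : struct
{
    var result = new T[bytes.Length / Marshal.SizeOf(typeof(T))];
    GCHandle handle = GCHandle.Alloc(result, GCHandleType.Pinned);
    try { Marshal.Copy(bytes, 0, handle.AddrOfPinnedObject(), bytes.Length); }
    finally { handle.Free(); }
    return result;
}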
The following numbers are from a test which serializes one instance containing a single field of either byte[] or ColorBGRA8[]. The raw test data is ~38 MiB, i.e. 1,000,000 entries in the color array. Serialization was done in memory using a MemoryStream.
Writing
Platform.Copy + Protobuf: 51ms, Size: 38.15 MiB
Protobuf: 2093ms, Size: 109.45 MiB
Reading
Platform.Copy + Protobuf: 43ms
Protobuf: 2307ms
The test shows that for huge arrays of more or less random data, a noticeable memory overhead can occur. This wouldn't have been such a big deal if not for the (de)serialization times. I understand protobuf-net might not be designed for my extreme case, let alone optimized for it, but the slowdown is something I am not willing to accept.
I think I will stick with this hybrid approach, as protobuf-net works extremely well for everything else.
Simply "does not apply". To quote from the encoding specification:
Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".
This doesn't work with custom structures or classes. The two approaches that apply here are strings (length-prefixed) and groups (start/end tokens). The latter is often cheaper to encode, but Google prefer the former.
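For illustration, opting into group encoding per member in protobuf-net looks roughly like this (the Vertex/Scene types are just examples):

[ProtoContract]
public struct Vertex
{
    [ProtoMember(1)] public float X;
    [ProtoMember(2)] public float Y;
    [ProtoMember(3)] public float Z;
}

[ProtoContract]
public sealed class Scene
{
    // Default: each Vertex sub-object is length-prefixed (the "string" wire type).
    [ProtoMember(1)] public List<Vertex> LengthPrefixed;

    // Group encoding uses start/end tokens instead of a length prefix, so the
    // serializer does not have to measure each sub-object before writing it.
    [ProtoMember(2, DataFormat = DataFormat.Group)] public List<Vertex> Grouped;
}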
Protobuf is not designed to arbitrarily match some other byte layout. It is its own encoding format and is only designed to process / output protobuf data. It would be like saying "I'm writing XML, but I want it to look like {non-xml} instead".

Which is fast to read, .xml, .ini or .txt? [closed]

My application has to read the data stored in a file and get the values for the variables or arrays to work on them.
My question is: which file format will be fast and easy for retrieving data from the file?
I was thinking of using .xml, .ini, or just a simple .txt file, but to read a .txt file I will have to write a lot of code with many if/else conditions.
I don't know how to use .ini or .xml files yet, but if they are better and faster I'll learn them first and then use them. Kindly guide me.
I will assume what you are indicating here is that raw performance is not a priority over robustness of the system.
For simple data which is a value paired with a name, an INI file would probably be the simplest solution. More complex structured data would lead you toward XML. According to a previously asked question, if you are working in C# (and hence, it's assumed, in .NET), XML is generally preferred, as support for it is built into the .NET libraries. As XML is more flexible and can change with the needs of the program, I would also personally recommend XML over INI as a file standard. It will take more work to learn the XML library, however it will quickly pay off and is a standardized system.
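For example, a minimal XmlSerializer round trip might look like this (the Settings type here is made up for illustration):

// Requires System.Xml.Serialization and System.IO.
public class Settings
{
    public string Name { get; set; }
    public int Threshold { get; set; }
    public double[] Values { get; set; }
}

public static void Save(Settings settings, string path)
{
    var serializer = new XmlSerializer(typeof(Settings));
    using (var stream = File.Create(path))
        serializer.Serialize(stream, settings);
}

public static Settings Load(string path)
{
    var serializer = new XmlSerializer(typeof(Settings));
    using (var stream = File.OpenRead(path))
        return (Settings)serializer.Deserialize(stream);
}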
Text could be fast, but you would either be sacrificing a great deal of robust parsing behavior for the sake of speed or spending far more man-hours developing and maintaining a high-speed specialized parser.
For references on reading in xml files: (natively supported in .Net libraries)
MSDN XMLTextReader Article
MSDN XMLReader Article
Writing Data to XML with XMLSerializer
For references on reading in ini files: (not natively supported in .Net libraries)
Related Question
If it's tabular data, then it is probably faster to just use CSV (comma-separated values) files.
If it is structured data (like a tree or something), then you can use the XML parser in C#, which is fast (but will take some learning effort on your part).
If the data is like a dictionary, then INI will be a better option. It really depends on the type of data in your application
Or if you don't mind an RDBMS, then that would be a better option. Usually, a good RDBMS is optimized to handle large data and read them really quickly.
If you don't mind having a binary file (one that people can't read and modify themselves), the fastest would be serializing an array of numbers to a file, and deserializing it from the file.
The file will be smaller because the data is stored more efficiently, requiring fewer I/O operations to read it. It will also require minimal parsing (really minimal), so reading will be lightning fast.
Suppose your numbers are located here:
int[] numbers = ..... ;
You save them to file with this code:
using (var file = new FileStream(filename, FileMode.Create))
{
    var formatter = new BinaryFormatter();
    formatter.Serialize(file, numbers); // the stream comes first, then the object graph
}
To read the data from the file, you open it and then use:
numbers = (int[])formatter.Deserialize(file);
I think that #Ian T. Small addressed the difference between the file types well.
Given #Shaharyar's responses to #Aniket, I just wanted to add to the DBMS conversation as a solution, given the limited scope of info we have.
Will the data set grow? How many entries constitute "many fields"?
I agree that an R-DBMS (relational) is a potential solution for a large data set. The next question is what counts as a large data set.
When (and which) a DBMS is a good idea
When #Shaharyar says many fields, are we talking tens or hundreds of fields?
=> 10-20 fields wouldn't necessitate the overhead (install size, CRUD code, etc.) of an R-DBMS. XML serialization of the object is far simpler.
=> If there is an indeterminate number of fields (i.e. the number of fields increases over time), he needs ACID compliance, or he has hundreds of fields, then I'd say #Aniket is spot on.
#Matt's suggestion of NoSQL is also great. It will provide high throughput (far more than required for an update every few seconds) and simplified serialization/deserialization.
The only downside I see here is application size/configuration. (Even the lightweight, easy-to-configure MongoDB will add tens of MB for the DBMS facilities and driver - not ideal for a small < 1 MB application meant for fast, easy distribution.) Oh, and #Shaharyar, if you do require ACID compliance, please be sure to check the database first. Mongo, for example, does not offer it. That's not to say you will ever lose data; there are just no guarantees.
Another Option - No DBMS but increased throughput
The last suggestion I'd like to make will require a little code (specifically an object to act as a buffer).
If
1. the data set is small (tens, not hundreds of fields)
2. the number of fields is fixed
3. there is no requirement for ACID compliance
4. you're concerned about increased transaction loads (i.e. lots of updates per second)
then you can just cache changes in a datastore object and flush them on program close, or via a timer every 'n' seconds/minutes/etc.
Per #Ian T. Small's post we would use native XML class serialization built into the .Net framework.
The following is just oversimplified pseudo-code but should give you an idea:
public class FieldContainer
{
    private bool changeMade;
    private readonly System.Timers.Timer timer = new System.Timers.Timer(5 * 60 * 1000); // 5 minutes

    public FieldContainer()
    {
        timer.Elapsed += OnTimerTick;
        timer.Start();
    }

    private void OnTimerTick(object sender, System.Timers.ElapsedEventArgs e)
    {
        if (changeMade)
        {
            UpdateXMLFlatFile();
            changeMade = false;
        }
    }
}
How fast does it need to be?
txt will be the fastest option. But you have to program the parser yourself. (speed does come at a cost)
xml is probably the easiest to implement, as you have XmlSerializer (or other classes) to do the hard work.
For small configuration files (~0.5 MB and smaller) you won't be able to tell any difference in speed. When it comes to really big files, txt and a custom file format are probably the way to go. However, you can always choose either way: look at projects like OpenStreetMap, they have huge XML files (> 10 GB) and it is still usable.

Sparse matrix compression with fast access time

I'm writing a lexer generator as a spare time project, and I'm wondering about how to go about table compression. The tables in question are 2D arrays of short and very sparse. They are always 256 characters in one dimension. The other dimension is varying in size according to the number of states in the lexer.
The basic requirements for the compression are:
The data should be accessible without decompressing the full data set, and accessible in constant O(1) time.
It should be reasonably fast to compute the compressed table.
I understand the row displacement method, which is what I currently have implemented. It might be my naive implementation, but what I have is horrendously slow to generate, although quite fast to access. I suppose I could make this go faster using some established algorithm for string searching such as one of the algorithms found here.
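The access side of what I have boils down to something like this (a condensed sketch rather than my actual code; 'packed' and 'check' are the shared arrays that all shifted rows are merged into):

short[] rowOffset;       // one entry per state: where that state's row starts in the shared arrays
short[] packed, check;   // shared arrays holding all of the displaced rows

short Transition(int state, int symbol)   // symbol is 0..255
{
    int index = rowOffset[state] + symbol;
    // 'check' records which state owns each slot; anything else means "no transition"
    return check[index] == state ? packed[index] : (short)0;
}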
I suppose an option would be to use a Dictionary, but that feels like cheating, and I would like the fast access times that I would be able to get if I use straight arrays with some established algorithm. Perhaps I'm worrying needlessly about this.
From what I can gather, flex does not use this algorithm for its lexing tables. Instead it seems to use something called row/column equivalence, which I haven't really been able to find any explanation for.
I would really like to know how this row/column equivalence algorithm that flex uses works, or if there is any other good option that I should consider for this task.
Edit: To clarify what this data actually is: it is state information for state transitions in the lexer. The data needs to be stored in a compressed format in memory since the state tables can potentially be huge. It's also from this memory that the actual values will be accessed directly, without decompressing the tables. I have a working solution using row displacement, but it's murderously slow to compute - in part due to my silly implementation.
Perhaps my implementation of the row displacement method will make it clearer how this data is accessed. It's a bit verbose and I hope it's OK that I've put it on pastebin instead of here.
The data is very sparse. It is usually a big bunch of zeroes followed by a few shorts for each state. It would be trivial to, for instance, run-length encode it, but that would spoil the linear access time.
Flex apparently has two pairs of tables: base and default for the first pair, and next and check for the second pair. These tables seem to index one another in ways I don't understand. The dragon book attempts to explain this, but as is often the case with that tome of arcane knowledge, what it says is lost on lesser minds such as mine.
This paper, http://www.syst.cs.kumamoto-u.ac.jp/~masato/cgi-bin/rp/files/p606-tarjan.pdf, describes a method for compressing sparse tables, and might be of interest.
Are your tables known beforehand, so that you just need an efficient way to store and access them?
I'm not really familiar with the problem domain, but if your table has a fixed size along one axis (256), then would an array of size 256, where each element is a variable-length vector, work? Do you want to be able to pick out an element given an (x, y) pair?
Another cool solution that I've always wanted to use for something is a perfect hash table, http://burtleburtle.net/bob/hash/perfect.html, where you generate a hash function from your data, so you will get minimal space requirements, and O(1) lookups (ie no collisions).
None of these solutions employ any type of compression, though; they just minimize the amount of space wasted.
What's unclear is whether your table has the "sequence property" in one dimension or another.
The sequence property naturally occurs in human speech, since a word is composed of many letters and the same sequence of letters is likely to appear again later on. It's also very common in binary programs, source code, etc.
On the other hand, sampled data, such as raw audio, seismic values, etc. do not advertise sequence property. Their data can still be compressed, but using another model (such as a simple "delta model" followed by "entropy").
If your data has "sequence property" in any of the 2 dimensions, then you can use common compression algorithm, which will give you both speed and reliability. You just need to provide it with an input which is "sequence friendly" (i.e. select your dimension).
If speed is a concern for you, you can have a look at this C# implementation of a fast compressor which is also a very fast decompressor : https://github.com/stangelandcl/LZ4Sharp

Is serialization a must in order to transfer data across the wire?

Below is something I read and was wondering if the statement is true.
Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
Serialization is just a fancy way of describing what you do when you want a certain data structure, class, etc to be transmitted.
For example, say I have a structure:
struct Color
{
int R, G, B;
};
When you transmit this over a network you don't just say "send Color". You create a sequence of bits and send it. I could create an unsigned char* buffer, concatenate R, G, and B into it, and then send that. I just did serialization.
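In C# terms, that manual step might look something like this (just a sketch; the 4-bytes-per-channel layout and field order are choices this example makes):

public struct Color { public int R, G, B; }

// Pack the three channels into a 12-byte buffer and unpack them on the other side.
public static byte[] Serialize(Color c)
{
    var buffer = new byte[12];
    BitConverter.GetBytes(c.R).CopyTo(buffer, 0);
    BitConverter.GetBytes(c.G).CopyTo(buffer, 4);
    BitConverter.GetBytes(c.B).CopyTo(buffer, 8);
    return buffer;
}

public static Color Deserialize(byte[] buffer)
{
    return new Color
    {
        R = BitConverter.ToInt32(buffer, 0),
        G = BitConverter.ToInt32(buffer, 4),
        B = BitConverter.ToInt32(buffer, 8)
    };
}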
Serialization of some kind is required, but this can take many forms. It can be something like dotNET serialization, that is handled by the language, or it can be a custom built format. Maybe a series of bytes where each byte represents some "magic value" that only you and your application understand.
For example, in dotNET I can create a class with a single string property, mark it as serializable, and the dotNET framework takes care of most everything else.
I can also build my own custom format where the first 4 bytes represent the length of the data being sent and all subsequent bytes are the characters of a string. But then of course you need to worry about byte ordering, Unicode vs. ANSI encoding, etc.
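Something along those lines might look like this (a sketch that assumes a little-endian host and UTF-8 text):

// Frame: a 4-byte length prefix, then the UTF-8 bytes of the string.
public static void WriteMessage(Stream stream, string text)
{
    byte[] payload = Encoding.UTF8.GetBytes(text);
    stream.Write(BitConverter.GetBytes(payload.Length), 0, 4);   // BitConverter uses host byte order
    stream.Write(payload, 0, payload.Length);
}

public static string ReadMessage(Stream stream)
{
    var lengthBytes = new byte[4];
    ReadExactly(stream, lengthBytes);
    int length = BitConverter.ToInt32(lengthBytes, 0);
    var payload = new byte[length];
    ReadExactly(stream, payload);
    return Encoding.UTF8.GetString(payload);
}

static void ReadExactly(Stream stream, byte[] buffer)
{
    int read = 0;
    while (read < buffer.Length)
    {
        int n = stream.Read(buffer, read, buffer.Length - read);
        if (n == 0) throw new EndOfStreamException();
        read += n;
    }
}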
Typically it is easier to make use of whatever framework your language/OS/dev framework uses, but it is not required.
Yes, serialization is the only way to transmit data over the wire. Consider what the purpose of serialization is: you define the way that the class is stored. In memory, though, you have no way to know exactly where each portion of the class is. For instance, if you have a list that was allocated early but then reallocated, it's likely to be fragmented all over the place, so it's not one contiguous block of memory. How do you send that fragmented class over the line?
For that matter, if you send a List<ComplexType> over the wire, how does the receiver know where each ComplexType begins and ends?
The real problem here is not getting over the wire, the problem is ending up with the same semantic object on the other side of the wire. For properly transporting data between dissimilar systems -- whether via TCP/IP, floppy, or punch card -- the data must be encoded (serialized) into a platform independent representation.
Because of alignment and type-size issues, if you attempted to do a straight binary transfer of your object it would cause Undefined Behavior (to borrow the definition from the C/C++ standards).
For example the size and alignment of the long datatype can differ between architectures, platforms, languages, and even different builds of the same compiler.
Is serialization a must in order to transfer data across the wire?
Literally no.
It is conceivable that you can move data from one address space to another without serializing it. For example, a hypothetical system using distributed virtual memory could move data / objects from one machine to another by sending pages ... without any specific serialization step.
And within a machine, objects could be transferred by switching pages from one virtual address space to another.
But in practice, the answer is yes. I'm not aware of any mainstream technology that works that way.
For anything more complex than a primitive or a homogeneous run of primitives, yes.
Binary serialization is not the only option. You can also serialize an object as an XML file, for example, or as JSON.
I think you're asking the wrong question. Serialization is a concept in computer programming and there are certain requirements which must be satisfied for something to be considered a serialization mechanism.
Any means of preparing data such that it can be transmitted or stored in such a way that another program (including but not limited to another instance of the same program on another system or at another time) can read the data and re-instantiate whatever objects the data represents.
Note I slipped the term "objects" in there. If I write a program that stores a bunch of text in a file, and I later use some other program, or some instance of that first program, to read that data ... I haven't really used a "serialization" mechanism. If I write it in such a way that the text is also stored with some state about how it was being manipulated ... that might entail serialization.
The term is used mostly to convey the concept that active combinations of behavior and state are being rendered into a form which can be read by another program/instance and instantiated. Most serialization mechanisms are bound to a particular programming language or virtual machine system (in the sense of a Java VM, a C# VM, etc.; not in the sense of "VMware" virtual machines). JSON (and YAML) are a notable exception to this. They represent data for which there are reasonably close object classes with reasonably similar semantics, such that they can be instantiated in multiple different programming languages in a meaningful way.
It's not that all data transmission or storage entails "serialization" ... it's that certain ways of storing and transmitting data can be used for serialization. At the very least it must be possible to disambiguate among the types of data that the programming language supports. If it reads 1, it has to know whether that's text, an integer, a real (equivalent to 1.0), or a bit.
Strictly speaking it isn't the only option; you could make an argument that "remoting" meets the meaning in the text: here a fake object is created at the receiver that contains no state. All calls (methods, properties, etc.) are intercepted and only the call and result are transferred. This avoids the need to transfer the object itself, but it can get very expensive if overly "chatty" usage is involved (i.e. lots of calls), as each call incurs the latency of the speed of light (which adds up).
However, "remoting" is now rather out of fashion. Most often, yes: the object will need to be serialised and deserialized in some way (there are lots of options here). The paragraph is then pretty-much correct.
Having messages as objects and serializing them into bytes is a better way of understanding and managing what is transmitted over the wire. In the old days protocols and data were much simpler; often, programmers just put bytes into an output stream, and a common understanding was shared by having well-known and simple specifications.
I would say serialization is needed to store objects in a file for persistence, but dynamically allocated pointers in objects need to be rebuilt when we deserialize. Serialization for transfer, however, depends on the physical protocol and the mechanism used: for example, if I use a UART to transfer data then it is serialized bit by bit, but if I use a parallel port then 8 bits are transferred together, which is not serialized.

Reading custom binary data formats in C# .NET

I'm trying to write a simple reader for AutoCAD's DWG files in .NET. I don't actually need to access all data in the file so the complexity that would otherwise be involved in writing a reader/writer for the whole file format is not an issue.
I've managed to read in the basics, such as the version, all the header data, the section locator records, but am having problems with reading the actual sections.
The problem seems to stem from the fact that the format uses a custom method of storing some data types. I'm going by the specs here:
http://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf
Specifically, the types that depend on reading individual bits are the ones I'm struggling with. A large part of the problem seems to be that C#'s BinaryReader only lets you read whole bytes at a time, when in fact I believe I need the ability to read individual bits, not just 8 bits or a multiple thereof at a time.
It could be that I'm misunderstanding the spec and how to interpret it, but if anyone could clarify how I might go about reading in individual bits from a stream, or even how to read in some of the variables types in the above spec that require more complex manipulation of bits than simply reading in full bytes then that'd be excellent.
I do realise there are commercial libraries out there for this, but the price is simply too high on all of them to be justifiable for the task at hand.
Any help much appreciated.
You can always use the BitArray class to do bit-wise manipulation: read bytes from the file, load them into a BitArray, and then access individual bits.
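If BitArray's least-significant-bit-first ordering gets in the way, a small hand-rolled bit reader is another option; a sketch (the MSB-first ordering here is an assumption you would need to check against the DWG spec):

public sealed class BitReader
{
    private readonly byte[] data;
    private int bitPosition;

    public BitReader(byte[] data) { this.data = data; }

    // Reads 'count' bits (MSB-first within each byte) and returns them as an unsigned value.
    public uint ReadBits(int count)
    {
        uint value = 0;
        for (int i = 0; i < count; i++)
        {
            int byteIndex = bitPosition >> 3;
            int bitIndex = 7 - (bitPosition & 7);
            uint bit = (uint)((data[byteIndex] >> bitIndex) & 1);
            value = (value << 1) | bit;
            bitPosition++;
        }
        return value;
    }
}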
For the price of any of those libraries you definitely cannot develop something stable yourself. How much time have you spent on it so far?
