Reducing String Size By Zipping And Storing In Object - c#

Our application at work has to create over a million objects each night to run a numerical simulation over weather observations recorded during the day.
Each object contains a few string properties and one very large xml property (about 2 MB). Because of the size of the large xml property we don't load it up front; instead we prefer to go to the database when we need access to this xml blob (which we do for each object).
I was wondering if it makes sense to retrieve the xml data (the 2 MB), compress it in memory, and store it in the object. This would prevent us having to do a database query for each object when we come to process it.
I would much rather zip the data, store it in the object, and at processing time unzip and process.
Is it possible to zip a string in process, and how can I do this without creating millions of MemoryStreams / zip streams, one for each object?

I would think that compression is not a good idea - it adds quite an overhead to processing, which already appears to be quite intensive.
Perhaps a light-weight format would be better - JSON or a binary serialized object representing the data.
Without more detail, it is difficult to give a definite answer or suggest better options.

Well, there is DotNetZip, which has a simple API, so you can do something like this:
byte[] compressedProperty;
public string MyProperty
{
    // Ionic.Zlib.DeflateStream from DotNetZip exposes static string helpers
    get { return DeflateStream.UncompressString(compressedProperty); }
    set { compressedProperty = DeflateStream.CompressString(value); }
}
Not sure if it will work out performance-wise for you though.
Update:
I only know the GZipStream and the DeflateStream classes. Neither of them exposes a string interface. Even DotNetZip uses a stream under the hood when you call the functions above; it's just wrapped in a nicer interface (which you could do with the System.IO.Compression classes on your own). I'm not sure what your problem is with streams.
If you really want to avoid streams then you probably have to roll your own compression. Here is a guy who rolled a simple Huffman encoder to encode strings in F#. I don't know how well it works, but if you want to avoid 3rd-party libs and streams then you could give it a crack.
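For reference, here is a minimal sketch of such string helpers built only on the System.IO.Compression classes (the StringCompressor name and the UTF-8 choice are this example's assumptions, not any library's API):
using System.IO;
using System.IO.Compression;
using System.Text;

static class StringCompressor
{
    // Compress a string to a byte[] with DeflateStream.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            using (var deflate = new DeflateStream(output, CompressionMode.Compress))
            {
                deflate.Write(raw, 0, raw.Length);
            } // the DeflateStream must be closed before the buffer is complete
            return output.ToArray();
        }
    }

    // Decompress a byte[] produced by Compress back into a string.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }
}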


Binary serialization in c#: result size and performance

I'm trying to understand my problem and decide on the best approach.
I have an xsd that represents the schema of the information that I agreed on with a client.
Now, in my application (c#, .net 3.5) I use and consume an object that has been deserialized from an xml created according to the xsd schema.
As soon as I fill the object with data, I want to pass it to another application and also store it in a db. I have two questions:
I'd like to serialize the object to pass it quickly to the other application: is binary or xml serialization better?
Unfortunately in the db I have a limited-size field to store the info, so I need some sort of compression of the serialized object. Does binary serialization create smaller data than xml serialization, or do I need to compress the data in any case? If yes, how?
Thanks!
I'd like to serialize the object to pass it quickly to the other application: is binary or xml serialization better?
Neither is specific enough; binary can be good or bad, and xml can be good or bad. Generally speaking, binary is smaller and faster to process, but switching to it makes the data unusable from code that expects xml.
Does binary serialization create smaller data than xml serialization, or do I need to compress the data in any case?
It can be smaller, or it can be larger; indeed, compression can make things smaller or larger too.
If space is your primary concern, I would suggest running it through something like protobuf-net (a binary serializer without the versioning issues common to BinaryFormatter), and then speculatively try compressing it with GZipStream. If the compressed version is smaller: store that (and a marker - perhaps a preamble - that says "I'm compressed"). If the compressed version gets bigger than the original version, store the original (again with a preamble).
Here's a recent breakdown of the performance (speed and size) of the common .NET serializers: http://theburningmonk.com/2013/09/binary-and-json-serializer-benchmarks-updated/
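As a rough illustration of that speculative approach, here is a sketch (assuming a protobuf-net [ProtoContract] type; the one-byte preamble convention is this example's, not the library's):
using System.IO;
using System.IO.Compression;
using ProtoBuf;

static class Packer
{
    public static byte[] Pack<T>(T obj)
    {
        byte[] raw;
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, obj); // protobuf-net binary serialization
            raw = ms.ToArray();
        }

        byte[] zipped;
        using (var ms = new MemoryStream())
        {
            using (var gzip = new GZipStream(ms, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            zipped = ms.ToArray();
        }

        // Keep whichever is smaller; the preamble byte tells the
        // reader whether to decompress (0 = raw, 1 = gzipped).
        bool useZip = zipped.Length < raw.Length;
        byte[] payload = useZip ? zipped : raw;
        var result = new byte[payload.Length + 1];
        result[0] = (byte)(useZip ? 1 : 0);
        payload.CopyTo(result, 1);
        return result;
    }
}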

Comparison of serializing methods [duplicate]

Possible Duplicate:
Fastest serializer and deserializer with lowest memory footprint in C#?
I'm using the BinaryFormatter class to serialize a structure or a class. (After serialization, I'm going to encrypt the serialized file before saving, and of course decrypt it before deserialization.)
But I heard that some other serialization classes are present in the .Net Framework, like XmlSerializer, JavaScriptSerializer, DataContractSerializer and protobuf-net.
I want to know: which one is best for me?
The RAM needed to serialize/deserialize is the most important thing for me. Speed is also important.
If your aim is to reduce memory demands, then don't serialize then encrypt: instead - serialize directly to an encrypting Stream. The Stream API is designed to be chained (decorator pattern) to perform multiple transformations without excessive buffering. Likewise: deserialize from a decrypting stream; don't decrypt then deserialize. Done this way, data is encrypted/decrypted on-the-fly as needed; in addition to reducing memory, it is also good for security - since this also means the entire data never exists in decrypted form as a single buffer. See CryptoStream on MSDN for a full example.
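A minimal sketch of that chaining, assuming protobuf-net as the serializer and AES as the cipher (key/IV management is outside the scope of this example):
using System.IO;
using System.Security.Cryptography;
using ProtoBuf;

static class SecureSerializer
{
    // Serialize straight into an encrypting stream: the full plaintext
    // never exists in memory as a single buffer.
    public static void Serialize<T>(T obj, Stream destination, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var encryptor = aes.CreateEncryptor(key, iv))
        using (var crypto = new CryptoStream(destination, encryptor, CryptoStreamMode.Write))
        {
            Serializer.Serialize(crypto, obj); // protobuf-net writes through the chain
            crypto.FlushFinalBlock();
        } // note: disposing the CryptoStream also closes 'destination' on older frameworks
    }

    // Deserialize from a decrypting stream; data is decrypted on the fly.
    public static T Deserialize<T>(Stream source, byte[] key, byte[] iv)
    {
        using (var aes = Aes.Create())
        using (var decryptor = aes.CreateDecryptor(key, iv))
        using (var crypto = new CryptoStream(source, decryptor, CryptoStreamMode.Read))
        {
            return Serializer.Deserialize<T>(crypto);
        }
    }
}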
Some additional notes: if you do happen to use protobuf-net, there are ways of reducing any in-memory buffering by using "grouped" encoding. The default for sub-messages (including lists) is "length prefixed", and the way the library usually does this is by buffering the data in memory to calculate the length. However, protobuf also supports a format that uses a start/end marker, which never requires knowing the length, and so never requires buffering; the entire sequence can then be written in a single pass direct to output (well, it does still use a buffer internally to improve IO, but it pools the buffer there for maximum re-use). For sub-objects, this is as simple as:
[ProtoMember(11, DataFormat = DataFormat.Grouped)]
public Customer Customer {get;set;} // a sub-object
(where there is no significance in the 11)
See http://code.google.com/p/protobuf-net/wiki/Performance for a comparison of performance.

Is serialization a must in order to transfer data across the wire?

Below is something I read, and I was wondering if the statement is true.
Serialization is the process of converting a data structure or object into a sequence of bits so that it can be stored in a file or memory buffer, or transmitted across a network connection link to be "resurrected" later in the same or another computer environment.[1] When the resulting series of bits is reread according to the serialization format, it can be used to create a semantically identical clone of the original object. For many complex objects, such as those that make extensive use of references, this process is not straightforward.
Serialization is just a fancy way of describing what you do when you want a certain data structure, class, etc. to be transmitted.
For example, say I have a structure:
struct Color
{
    int R, G, B;
};
When you transmit this over a network you don't say "send Color". You create a line of bits and send it. I could create an unsigned char*, concatenate R, G, and B, and then send these. I just did serialization.
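The same idea as a hand-rolled sketch in C# (the 12-byte layout and the BitConverter endianness are assumptions of this example, not a standard format):
using System;

struct Color
{
    public int R, G, B;

    // Flatten the three ints into a 12-byte array.
    public byte[] Serialize()
    {
        var bytes = new byte[12];
        BitConverter.GetBytes(R).CopyTo(bytes, 0);
        BitConverter.GetBytes(G).CopyTo(bytes, 4);
        BitConverter.GetBytes(B).CopyTo(bytes, 8);
        return bytes;
    }

    // Rebuild the struct from the same 12-byte layout.
    public static Color Deserialize(byte[] bytes)
    {
        return new Color
        {
            R = BitConverter.ToInt32(bytes, 0),
            G = BitConverter.ToInt32(bytes, 4),
            B = BitConverter.ToInt32(bytes, 8)
        };
    }
}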
Serialization of some kind is required, but this can take many forms. It can be something like dotNET serialization that is handled by the language, or it can be a custom-built format: maybe a series of bytes where each byte represents some "magic value" that only you and your application understand.
For example, in dotNET I can create a class with a single string property, mark it as serializable, and the dotNET framework takes care of most everything else.
I can also build my own custom format where the first 4 bytes represent the length of the data being sent and all subsequent bytes are characters in a string (a sketch of this follows below). But then of course you need to worry about byte ordering, unicode vs ansi encoding, etc.
Typically it is easier to make use of whatever serialization support your language/OS/dev framework provides, but it is not required.
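A sketch of that hypothetical length-prefixed format (4-byte length followed by UTF-8 text; both choices are this example's, not a standard):
using System;
using System.IO;
using System.Text;

static class Wire
{
    public static void WriteMessage(Stream stream, string text)
    {
        byte[] payload = Encoding.UTF8.GetBytes(text);
        byte[] length = BitConverter.GetBytes(payload.Length); // byte order is the local machine's
        stream.Write(length, 0, 4);
        stream.Write(payload, 0, payload.Length);
    }

    public static string ReadMessage(Stream stream)
    {
        var lengthBytes = new byte[4];
        stream.Read(lengthBytes, 0, 4); // a robust reader would loop until all 4 bytes arrive
        int length = BitConverter.ToInt32(lengthBytes, 0);
        var payload = new byte[length];
        stream.Read(payload, 0, length); // the same caveat applies here
        return Encoding.UTF8.GetString(payload);
    }
}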
Yes, serialization is the only way to transmit data over the wire. Consider what the purpose of serialization is: you define the way a class is stored. In memory, though, you have no way to know exactly where each portion of the class is. Especially if you have, for instance, a list that was allocated early and then reallocated, it's likely to be fragmented all over the place, so it's not one contiguous block of memory. How do you send that fragmented class over the line?
For that matter, if you send a List<ComplexType> over the wire, how does the receiver know where each ComplexType begins and ends?
The real problem here is not getting over the wire, the problem is ending up with the same semantic object on the other side of the wire. For properly transporting data between dissimilar systems -- whether via TCP/IP, floppy, or punch card -- the data must be encoded (serialized) into a platform independent representation.
Because of alignment and type-size issues, if you attempted to do a straight binary transfer of your object it would cause Undefined Behavior (to borrow the definition from the C/C++ standards).
For example, the size and alignment of the long datatype can differ between architectures, platforms, languages, and even different builds of the same compiler.
Is serialization a must in order to transfer data across the wire?
Literally no.
It is conceivable that you can move data from one address space to another without serializing it. For example, a hypothetical system using distributed virtual memory could move data / objects from one machine to another by sending pages ... without any specific serialization step.
And within a machine, objects could be transferred by switching pages from one virtual address space to another.
But in practice, the answer is yes. I'm not aware of any mainstream technology that works that way.
For anything more complex than a primitive or a homogeneous run of primitives, yes.
Binary serialization is not the only option. You can also serialize an object as an XML file, for example, or as JSON.
I think you're asking the wrong question. Serialization is a concept in computer programming, and there are certain requirements which must be satisfied for something to be considered a serialization mechanism:
Any means of preparing data such that it can be transmitted or stored in such a way that another program (including but not limited to another instance of the same program on another system or at another time) can read the data and re-instantiate whatever objects the data represents.
Note I slipped the term "objects" in there. If I write a program that stores a bunch of text in a file, and I later use some other program, or another instance of that first program, to read that data ... I haven't really used a "serialization" mechanism. If I write it in such a way that the text is also stored with some state about how it was being manipulated ... that might entail serialization.
The term is used mostly to convey the concept that active combinations of behavior and state are being rendered into a form which can be read by another program/instance and instantiated. Most serialization mechanisms are bound to a particular programming language or virtual machine system (in the sense of a Java VM, a C# VM, etc.; not in the sense of "VMware" virtual machines). JSON (and YAML) are a notable exception to this. They represent data for which there are reasonably close object classes with reasonably similar semantics, such that they can be instantiated in multiple different programming languages in a meaningful way.
It's not that all data transmission or storage entails "serialization"; it's that certain ways of storing and transmitting data can be used for serialization. At the very least it must be possible to disambiguate among the types of data that the programming language supports: if it reads 1, it has to know whether that's text or an integer or a real (equivalent to 1.0) or a bit.
Strictly speaking it isn't the only option; you could argue that "remoting" meets the meaning in the text. Here a fake object is created at the receiver that contains no state; all calls (methods, properties, etc.) are intercepted and only the call and result are transferred. This avoids the need to transfer the object itself, but can get very expensive if overly "chatty" usage is involved (i.e. lots of calls), as each call incurs latency bounded by the speed of light (which adds up).
However, "remoting" is now rather out of fashion. Most often, yes: the object will need to be serialized and deserialized in some way (there are lots of options here). The paragraph is then pretty much correct.
Having messages as objects and serializing them into bytes is a better way of understanding and managing what is transmitted over the wire. In the old days, protocols and data were much simpler; programmers often just put bytes onto the output stream, and common understanding was shared through well-known, simple specifications.
I would say serialization is needed to store objects in a file for persistence, though dynamically allocated pointers inside the objects need to be rebuilt when we deserialize. Whether a transfer is serial, however, depends on the physical protocol and mechanism used: if I use a UART to transfer data it is sent bit by bit, but if I use a parallel port then 8 bits are transferred together, which is not serial.

Reading custom binary data formats in C# .NET

I'm trying to write a simple reader for AutoCAD's DWG files in .NET. I don't actually need to access all the data in the file, so the complexity that would otherwise be involved in writing a reader/writer for the whole file format is not an issue.
I've managed to read in the basics, such as the version, all the header data, and the section locator records, but am having problems with reading the actual sections.
The problem seems to stem from the fact that the format uses a custom method of storing some data types. I'm going by the specs here:
http://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf
Specifically, the types that depend on reading individual bits are the ones I'm struggling with. A large part of the problem seems to be that C#'s BinaryReader only lets you read whole bytes at a time, when in fact I believe I need the ability to read individual bits rather than 8 bits, or a multiple thereof, at a time.
It could be that I'm misunderstanding the spec and how to interpret it, but if anyone could clarify how I might go about reading individual bits from a stream, or even how to read some of the variable types in the above spec that require more complex bit manipulation than simply reading full bytes, that'd be excellent.
I do realise there are commercial libraries out there for this, but the price is simply too high on all of them to be justifiable for the task at hand.
Any help much appreciated.
You can always use the BitArray class for bit-wise manipulation: read bytes from the file, load them into a BitArray, and then access individual bits.
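A minimal sketch (the file name is hypothetical; note that BitArray orders bits least-significant-first within each byte, which may not match the bit order the DWG spec expects):
using System;
using System.Collections;
using System.IO;

class BitDump
{
    static void Main()
    {
        byte[] buffer = File.ReadAllBytes("drawing.dwg"); // hypothetical input file
        var bits = new BitArray(buffer);

        // Print the first 16 bits; bits[0] is the least significant
        // bit of buffer[0], so reorder if the format expects MSB-first.
        for (int i = 0; i < 16 && i < bits.Length; i++)
        {
            Console.Write(bits[i] ? '1' : '0');
        }
        Console.WriteLine();
    }
}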
For the price of any of those libraries, you definitely cannot develop something stable yourself. How much time have you spent so far?

Protocol Buffers c# (protobuf-net) Message::ByteSize

I am looking for the protobuf-net equivalent to the C++ API Message::ByteSize to find out the serialized message length in bytes.
I haven't played with the C++ API, so you'll have to give me a bit more context / information. What does this method do? Perhaps a sample usage?
If you are consuming data from a stream, there are "WithLengthPrefix" versions to automate limiting to discrete messages, or I believe the method to just read the next length from the stream is on the public API.
If you want to get a length in place of serializing, then currently I suspect the easiest option might be to serialize to a dummy stream and track the length. Oddly enough, an early version of protobuf-net did have "get the length without doing the work" methods, but after discussion I removed these. The data serialized is still tracked, obviously; however, because the API is different, the binary data length for objects is not available "for free".
If you clarify what the use-case is, I'm sure we can make it easily available (if it isn't already).
Re the comment: that is what I suspected. Because protobuf-net defers the binary translation to the last moment (because it is dealing with regular .NET types, not some self-generated code), there is no automatic way of getting this value without doing the work. I could add a mechanism to let you get this value by writing to Stream.Null, but if you need the data anyway you might benefit from just writing to a MemoryStream and checking the .Length in advance of copying the data.
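A sketch of that measure-then-copy approach (the helper name is this example's; Serializer.Serialize is protobuf-net's standard entry point):
using System.IO;
using ProtoBuf;

static class Measured
{
    // Serialize once to a MemoryStream, read the length off it,
    // and hand back the same buffer so no work is repeated.
    public static byte[] SerializeWithLength<T>(T obj, out long byteSize)
    {
        using (var ms = new MemoryStream())
        {
            Serializer.Serialize(ms, obj);
            byteSize = ms.Length; // the protobuf-encoded size in bytes
            return ms.ToArray();
        }
    }
}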
