I am saving my protobuf messages to a file and the format is all messed up. I have seen it done before where the protobuf messages were saved to disk in nearly the same format as the .proto file. I am doing it like:
using (Stream output = File.OpenWrite(@"logs\listings.txt"))
{
    listingBook.AddClisting(_listing);
    listingBook.Build().WriteTo(output);
}
But what I get is a mangled file that appears to be newline-separated with strange tags. What I want it to look like when it is saved to disk is this example:
# Textual representation of a protocol buffer.
# This is *not* the binary format used on the wire.
person {
  name: "John Doe"
  email: "jdoe@example.com"
}
Pay close attention to the comment: "This is *not* the binary format used on the wire."
Protobuf messages are not designed to be human-readable. Storing them in a text file makes no sense; they are not text.
The primary protobuf encoding format is binary. There is a secondary text format exposed by some implementations, but it loses a lot of the advantages of protobuf, and library support for it is patchy (where it is formally defined at all). I would say: if you want human-readable, use XML or JSON, not protocol buffers.
Finally found it: using PrintTo instead of WriteTo keeps the data in a readable format.
As protobuf is intended to be fast, binary-compatible, and compact, storing messages in human-readable form is mostly out of the question. There is, however, the JSONFormatter utility. Its primary purpose is exactly what you asked for, but be aware that it will probably make everything significantly slower, since the conversion adds overhead.
Related
Is there a 'correct' or preferred manner for sending data over a web socket connection?
In my case, I am sending the information from a C# application to a python (tornado) web server, and I am simply sending a string consisting of several elements separated by commas. In python, I use rudimentary techniques to split the string and then structure the elements into an object.
e.g:
'foo,0,bar,1'
becomes:
object = {
    'foo': 0,
    'bar': 1
}
In the other direction, I am sending the information as a JSON string which I then deserialise using Json.NET
I imagine there is no strictly right or wrong way of doing this, but are there significant advantages and disadvantages that I should be thinking of? And, somewhat related, is there a consensus for using string vs. binary formats?
Writing a custom encoding (e.g., as "k,v,..") is different from "using binary". It is still text, just a rigid, under-defined, one-off, hand-rolled format that must be manually replicated. (What happens if a key or value contains a comma? What happens if the data needs to contain nested objects? How can null be distinguished from '' or the string 'null'?)
While JSON is definitely the most ubiquitous format for WebSockets one shouldn't (for interchange purposes) write JSON by hand - one uses an existing serialization library on both ends. (There are many reasons why JSON is ubiquitous which are covered in other answers - this doesn't mean it is always the 'best' format, however.)
To this end a binary serializer can also be used (BSON being a trivial example as it is effectively JSON-like in structure and operation). Just replace JSON.parse with FORMATX.parse as appropriate.
The only requirements are then:
There is a suitable serializer/deserializer for all the clients and servers. JSON works well here because it is so popular and there is no shortage of implementations.
There are various binary serialization libraries with both Python and C# libraries, but it will require finding a 'happy intersection'.
The serialization format can represent the data. JSON usually works sufficiently and it has a very nice 1-1 correspondence with basic object graphs and simple values. It is also inherently schema-less.
Some formats are better at certain tasks and have different characteristics, features, or tool-chains. However, most concepts (and arguably most DTOs) can be mapped onto JSON easily, which makes it a good 'default' choice. The remaining differences between the various binary and text serializations are mostly dressing, unless you start talking about schema vs. schema-less, extensibility, external tooling, metadata, non-compressed encoded sizes (or size after transport compression), compliance with a specific existing protocol, etc..
.. but the point to take away is: don't create a 'new' one-off format, unless of course you just like reinventing wheels or there is a very specific use case to fit.
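To make those failure modes concrete, here is a minimal Python sketch (the `naive_decode` helper is illustrative, not from the question) showing where a hand-rolled comma format breaks and JSON does not:

```python
import json

# Naive hand-rolled codec for 'k,v,k,v' strings.
def naive_decode(s):
    parts = s.split(",")
    return {parts[i]: parts[i + 1] for i in range(0, len(parts), 2)}

print(naive_decode("foo,0,bar,1"))  # {'foo': '0', 'bar': '1'} -- so far so good

# A comma inside a value silently corrupts the result:
print(naive_decode("note,hello, world,bar"))  # {'note': 'hello', ' world': 'bar'}

# JSON handles embedded commas, nesting, and null unambiguously.
payload = {"note": "hello, world", "nested": {"bar": 1}, "missing": None}
assert json.loads(json.dumps(payload)) == payload
```

Note that everything the naive codec returns is a string; JSON also preserves the distinction between numbers, strings, and null for free.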
First advice would be to use the same format for both ways, not plain text in one direction and JSON in the other.
I personally think {'foo':0,'bar':1} is better than foo,0,bar,1 because everybody understands JSON, but for your custom format they might not without some explanation. The point is that you are inventing a data interchange format when JSON already is one, and @jfriend00 is right: pretty much every language now understands JSON, Python included.
Regarding text vs. binary, there isn't any consensus. As @user2864740 mentions in the comments to my answer, as long as the two sides understand each other, it doesn't really matter. This only becomes relevant if one of the sides has a preference for a format (consider for example opening the connection from the browser, using JavaScript - for that, people might prefer JSON over binary).
My advice is to go with something simple like JSON and design your app so that you can change the wire format later by swapping in another implementation without affecting the logic of your application.
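One minimal way to keep the wire format swappable, sketched in Python (the codec registry and function names are illustrative assumptions, not a prescribed design):

```python
import json

# Each codec exposes the same (encode, decode) pair, so application logic
# never touches the wire format directly and it can be swapped out later.
CODECS = {
    "json": (lambda obj: json.dumps(obj).encode("utf-8"),
             lambda raw: json.loads(raw.decode("utf-8"))),
}

def send(codec, obj):
    encode, _ = CODECS[codec]
    return encode(obj)          # bytes handed to the websocket

def receive(codec, raw):
    _, decode = CODECS[codec]
    return decode(raw)

message = {"foo": 0, "bar": 1}
assert receive("json", send("json", message)) == message
```

Adding a binary format later means registering another (encode, decode) pair; the send/receive call sites stay unchanged.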
I hope this is the right place for my question, since there's definitely more than one way to do it.
I have a file format (XML) that I compress and encrypt. Now I want to attach some basic unencrypted metadata to my file, for ease of access to certain parameters.
Is there a right way to do what I want to do, otherwise what are some best practices to keep in mind?
The approach that I'm thinking about now is to use Bouncy Castle in C# to encrypt my actual data while prepending my tag data to the front of the file.
e.g.
<metadata>
//tag information about the file
</metadata>
<secretdata>
//Grandma's secret recipe
</secretdata>
After encrypting the secret data only:
<metadata>
//tag information about the file
</metadata>
^&RF&^Tb87tyfg76rfvhjb8
hnjikhuhik*&GHd65rh87yn
NNCV&^FVU^R75rft78b875t
One challenge here is getting the plain-text XML out of the front of the file while leaving the input stream at exactly the start of the encrypted and compressed data. Since the XML reading libraries in C# were not built with this usage in mind, they may not behave well (e.g. the reader may read more bytes than it needs, leaving the underlying stream past the start of the encrypted data).
One possible way to handle it is to prepend a header in a well-known format that provides the length of the XML metadata. So the file would look something like:
Header (5 bytes):
Version* (1 byte, unsigned int) = 1
Metadata Length** (4 bytes, unsigned int) = N
Metadata (N bytes):
well formed XML
Encrypted Data (rest of file)
(* - including versioning when defining a file format is always a good idea)
(** - if you're going to exceed the range of a 32-bit uint for the length of the metadata, you should consider another solution.)
Then you can read the 5 byte header directly, parse out the length of the XML, read that many bytes out exactly, and the input stream should be in the right place to start decrypting and decompressing the rest of the file.
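That layout can be sketched in a few lines; here is a Python version using the stdlib struct module (the big-endian byte order and function names are my own assumptions for illustration):

```python
import struct

HEADER = ">BI"  # version: 1 unsigned byte; metadata length: 4-byte unsigned int

def pack_file(metadata_xml: bytes, encrypted: bytes) -> bytes:
    # 5-byte header, then the XML, then the encrypted/compressed payload.
    return struct.pack(HEADER, 1, len(metadata_xml)) + metadata_xml + encrypted

def unpack_file(blob: bytes):
    version, meta_len = struct.unpack_from(HEADER, blob)
    offset = struct.calcsize(HEADER)              # 5 bytes
    metadata = blob[offset:offset + meta_len]     # read exactly N bytes of XML
    encrypted = blob[offset + meta_len:]          # reader lands exactly here
    return version, metadata, encrypted

blob = pack_file(b"<metadata/>", b"\x9c\x41\x07")
assert unpack_file(blob) == (1, b"<metadata/>", b"\x9c\x41\x07")
```

Because the header gives an exact byte count, the XML parser never touches the stream; you hand it a fully extracted buffer and the remaining bytes go straight to the decryptor.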
Of course, now that you've got a binary header, you could consider just having the metadata in the header itself, instead of putting it in XML.
Combining non-encrypted and encrypted data using XML like you do is indeed one way to go. There are a few drawbacks which may or may not be relevant in your situation:
The compression is rather limited. If encrypted data is large, you should consider storing it in binary format directly. Also, CDATA may be a compromise, although the range of characters you'll be able to put in a CDATA is limited as well.
Parsing of XML may be slow if the encrypted data is large. Also, it often requires to keep the whole document in memory, which is probably not what you want. Again, storing encrypted data directly in binary format is a solution. CDATA won't help here.
The benefit of XML is to be readable by a human. Although relevant for metadata, it seems weird when most of data is encrypted anyway.
Other alternatives you may consider:
Two files side by side. One will contain the binary data, and the other one (named identically but with a different extension) will have the metadata (for example in XML format). The difficulty is that you have to handle cases such as the presence of binary data file but not the corresponding metadata file or the opposite, as well as the copying/moving of data (NTFS has transactions, but you have to use Interop, unless the latest version of .NET Framework adds the support for Transactional NTFS).
Metadata and encrypted data stored in a single file in binary format. The answer by scottfavre shows one possibility. I agree with his explanation, but would rather compress the metadata as well, for two reasons: (1) to save space and (2) to prevent end users from modifying the metadata by hand, which would make the header invalid.
I wouldn't recommend the single-binary-file approach, since it makes the format difficult to use; the valid case for it would be if you found (after enough benchmarking and profiling) that there is an important performance benefit.
Metadata stored in Alternate Data Streams (which can be used on NTFS only, so beware of FAT-formatted flash drives). Here, the benefit is that you don't have to deal with offsets stored in a header: NTFS does that for you. But this is not an approach I would recommend either, unless you absolutely need to keep the metadata together with the file and you know that the file will always be stored on NTFS disks (and transferred with ADS-aware applications).
From all I've read it seems that it's always of the form string=string&string=string... (all the strings being encoded to exclude & and =) however, searching for it (e.g. Wikipedia, SO, ...) I haven't found that mentioned as an explicit restriction.
(Of course a base64 string of a binary of complex objects can be sent. That's not the question.) But:
Can POST contain complex objects directly or is it all sent as a string?
There is nothing in HTTP that prevents the posting of binary data. You do not have to convert binary data to base64 or other text encodings. Though the common "key1=val1&key2=val2" usage is widely conventional and convenient, it is not required. It only depends on what the sender and receiver agree upon. See these threads or google "http post binary data" or the like.
Sending binary data over http
How to correctly send binary data over HTTPS POST?
It is just a string, just like any binary stream. There's various ways to encode complex objects to fit into a string though. base64 is an option, and so is json (the latter probably being more desirable).
PHP has a specific way to deal with this. This:
a[]=1&a[]=2
Will result in an array with 1, 2.
This:
a[foo]=bar&a[gir]=zim
also creates an array, with two keys.
I've also seen this format in some frameworks:
a.foo=bar&b.gir=zim
So while urlencoding does not have a specific, standard syntax for this, that does not mean you can't add your own meaning and do your own post-processing. If you're building an API, you are probably best off not using urlencoding at all; there are far more capable formats, and you can use whatever Content-Type you'd like.
HTTP itself is just based on strings. There's no notion of "objects", only text. The definition of "object" is dependent on whatever data format you transport over HTTP (XML, JSON, binary files, ...).
So, POST can contain "complex objects" if they are appropriately encoded into text.
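To illustrate the above, Python's stdlib urllib.parse shows both the conventional key=value encoding and how repeated keys behave (to it, PHP-style bracket keys are just opaque key names; the values here are made up):

```python
from urllib.parse import parse_qs, urlencode

# The conventional form: encoded keys and values joined by '=' and '&'.
body = urlencode({"key1": "val1", "key2": "a & b"})
print(body)  # key1=val1&key2=a+%26+b -- '&' and ' ' are escaped

# Repeated keys accumulate into lists; 'a[]' has no special meaning here.
print(parse_qs("a[]=1&a[]=2"))  # {'a[]': ['1', '2']}
print(parse_qs("a=1&a=2"))      # {'a': ['1', '2']}
```

The bracket convention only becomes an "array" because PHP's parser post-processes the key names, which is exactly the kind of added meaning described above.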
Sometimes, when reading data involving many names and figures, reading line by line needs some serious concatenation work. Is there any method that would allow me to read a specific data type, like the good old fscanf in C?
Thanks
Sara
I don't know if this is exactly what you are looking for, but the FileHelpers library has many utilities to help with reading fixed length and delimited text files.
From the site:
You can strongly type your flat file (fixed or delimited) simply by describing a class that maps to each record, and later read/write your file as a strongly typed .NET array.
If what you are looking for is getting strongly typed objects from files, you should look at serialization and deserialization in the .net framework. These allow you to save object state into a file and read them back at a later time.
The System.IO namespace has everything you need.
You could use BinaryReader to read the data from the file, assuming the data is written out in binary form. For example, an Int32 would be written out as 4 bytes, i.e. the binary representation rather than the text representation of the integer.
The usefulness of BinaryReader depends on how much control you have over the code generating the file (i.e. can you write the data out using a BinaryWriter?) and, of course, on how human-readable you need the file to be.
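The same round trip sketched in Python with the stdlib struct module (the values are made up; .NET's BinaryWriter/BinaryReader use little-endian byte order):

```python
import io
import struct

# Write an Int32 and a Double the way BinaryWriter would (little-endian).
buf = io.BytesIO()
buf.write(struct.pack("<i", 1234))  # 4 bytes: the binary form, not the text "1234"
buf.write(struct.pack("<d", 2.5))   # 8 bytes

# Read them back, BinaryReader-style: fixed sizes, no delimiters needed.
buf.seek(0)
value = struct.unpack("<i", buf.read(4))[0]   # ReadInt32 equivalent
ratio = struct.unpack("<d", buf.read(8))[0]   # ReadDouble equivalent
assert (value, ratio) == (1234, 2.5)
```

Because every field has a fixed size, the reader always knows exactly how many bytes to consume; that is what replaces fscanf's format string in the binary case.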
I use the excellent FileHelpers library when I work with text data. It allows me to very easily dump text fields from a file or in-memory string into a class that represents the data.
In working with a big endian microcontroller-based system I need to read a serial data stream. In order to save space on the very limited microcontroller platform I need to write raw binary data which contains field of various multi-byte types (essentially just dumping a struct variable out the serial port).
I like the architecture of FileHelpers. I create a class that represents the data and tag it with attributes that tell the engine how to put data into the class. I can feed the engine a string representing a single record and get a deserialized representation of the data. However, this is different from object serialization in that the raw data is not delimited in any way; it's a simple binary fixed-record format.
FileHelpers is probably not suitable for reading such binary data, as it cannot handle the nulls that show up* and I suspect there might be Unicode issues (the engine takes input as a string, so I have to read bytes from the serial port and translate them into a Unicode string before they go to my data-converter classes). As an experiment I have set it up to read the binary stream, and as long as I'm careful not to send nulls it works quite well so far. It is easy to set up new converters that read the raw data and account for endian formatting issues and such. It currently fails on nulls and cannot process multiple records (it expects a CRLF between records).
What I want to know is if anyone knows of an open-source library that works similarly to FileHelpers but that is designed to handle binary data.
I'm considering deriving something from FileHelpers to handle this task, but it seems like there ought to be something already available to do this.
*It turns out that it does not complain about nulls in the input stream. I had an unrelated bug in my test program that came up where I expected a problem with the nulls. Should have investigated a little deeper first!
I haven't used filehelpers, so I can't do a direct comparison; however, if you have an object-model that represents your objects, you could try protobuf-net; it is a binary serialization engine for .NET using Google's compact "protocol buffers" wire format. Much more efficient than things like xml, but without the need to write all your own serialization code.
Note that "protocol buffers" does include some very terse markers between fields (typically one byte); this adds a little padding, but greatly improves version tolerance. For "packed" data (i.e. blocks of ints, say, from an array) this can be omitted if desired.
So: if you just want a compact output, it might be good. If you need a specific output, probably less so.
Disclosure: I'm the author, so I'm biased; but it is free.
When I am fiddling with GPS data in the SIRFstarIII binary mode, I use the Python interactive prompt with the serial module to fetch the stream from the USB/serial port and the struct module to convert the bytes as needed (per some format defined by SIRF). Using the interactive prompt is very flexible because I can read the string to a variable, process it, view the results and try again if needed. After the prototyping stage is finished, I have the data format strings that I need to put into the final program.
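As a sketch of that struct-module workflow (the record layout below is invented for illustration; it is not the actual SIRF format):

```python
import struct

# Hypothetical big-endian record: 2-byte message id, 4-byte timestamp,
# and two 2-byte signed coordinates -- the '>' prefix selects big-endian.
RECORD = ">HIhh"

raw = struct.pack(RECORD, 41, 1000, -5, 7)  # stand-in for bytes off the serial port
msg_id, stamp, x, y = struct.unpack(RECORD, raw)
assert (msg_id, stamp, x, y) == (41, 1000, -5, 7)

print(struct.calcsize(RECORD))  # 10 -- bytes per record, so the stream
                                # can be consumed in fixed-size chunks
```

Once a format string like this is nailed down at the interactive prompt, it is the only piece that needs to be carried over into the final program.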
Your question doesn't mention why you have a C# tag. I understand FileHelpers is a C# library, but that doesn't tell me what environment you are working in. There is an implementation of Python for .NET called IronPython.
I realize this answer might mean you have to learn a new language, but having an interactive prompt is a very powerful tool for any programmer.