I'm trying to write a simple reader for AutoCAD's DWG files in .NET. I don't actually need to access all data in the file so the complexity that would otherwise be involved in writing a reader/writer for the whole file format is not an issue.
I've managed to read in the basics, such as the version, all the header data, and the section locator records, but I'm having problems reading the actual sections.
The problem seems to stem from the fact that the format uses a custom method of storing some data types. I'm going by the specs here:
http://www.opendesign.com/files/guestdownloads/OpenDesign_Specification_for_.dwg_files.pdf
Specifically, the types that depend on reading individual bits are the ones I'm struggling with. A large part of the problem seems to be that C#'s BinaryReader only lets you read whole bytes at a time, when I believe I need the ability to read individual bits rather than just 8 bits, or a multiple thereof, at a time.
It could be that I'm misunderstanding the spec and how to interpret it, but if anyone could clarify how I might go about reading individual bits from a stream, or how to read some of the variable types in the above spec that require more complex bit manipulation than simply reading full bytes, that would be excellent.
I do realise there are commercial libraries out there for this, but the price is simply too high on all of them to be justifiable for the task at hand.
Any help much appreciated.
You can always use the BitArray class to do bit-wise manipulation. So you read bytes from the file, load them into a BitArray, and then access the individual bits.
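As a rough illustration, here's a minimal bit-reader sketch built on a plain Stream; the BitReader class name and the MSB-first bit ordering are assumptions you'd need to verify against the DWG spec:

using System;
using System.IO;

// Hypothetical helper: reads a stream one bit at a time (MSB-first within each byte).
class BitReader
{
    private readonly Stream _stream;
    private int _currentByte;
    private int _bitsLeft;   // unread bits remaining in _currentByte

    public BitReader(Stream stream) { _stream = stream; }

    public int ReadBit()
    {
        if (_bitsLeft == 0)
        {
            _currentByte = _stream.ReadByte();
            if (_currentByte < 0) throw new EndOfStreamException();
            _bitsLeft = 8;
        }
        _bitsLeft--;
        return (_currentByte >> _bitsLeft) & 1;
    }

    // Read 'count' bits as an unsigned value, most significant bit first.
    public uint ReadBits(int count)
    {
        uint value = 0;
        for (int i = 0; i < count; i++)
            value = (value << 1) | (uint)ReadBit();
        return value;
    }
}

With something like this you can implement the spec's bit-coded types (for example a couple of flag bits followed by a variable number of data bits) on top of ReadBit/ReadBits.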
For the price of any of those libraries you definitely cannot develop something stable yourself. How much time have you spent on this so far?
I hope this is the right place for my question, since there's definitely more than one way to do it.
I have a file format (XML) that I compress and encrypt. The thing is that I now want to attach some basic unencrypted metadata to my file for ease of access to certain parameters.
Is there a right way to do what I want to do, otherwise what are some best practices to keep in mind?
The approach that I'm thinking about now is to use Bouncy Castle in C# to encrypt my actual data while prepending my tag data to the front of the file.
e.g.
<metadata>
//tag information about the file
</metadata>
<secretdata>
//Grandma's secret recipe
</secretdata>
Encrypt secret data only
<metadata>
//tag information about the file
</metadata>
^&RF&^Tb87tyfg76rfvhjb8
hnjikhuhik*&GHd65rh87yn
NNCV&^FVU^R75rft78b875t
One challenge here is getting the plain-text XML out of the front of the file while leaving the input stream at exactly the start of the encrypted and compressed data. Since the XML reading libraries in C# were not built with this usage in mind, they may not behave well (e.g. the reader may read more bytes than it needs, leaving the underlying stream past the start of the encrypted data).
One possible way to handle it is to prepend a header in a well-known format that provides the length of the XML metadata. So the file would look something like:
Header (5 bytes):
Version* (1 byte, unsigned int) = 1
Metadata Length** (4 bytes, unsigned int) = N
Metadata (N bytes):
well formed XML
Encrypted Data (rest of file)
(* - including versioning when defining a file format is always a good idea)
(** - if you're going to be exceeding the range of a 32-bit uint for the length of the metadata, you should consider another solution.)
Then you can read the 5 byte header directly, parse out the length of the XML, read that many bytes out exactly, and the input stream should be in the right place to start decrypting and decompressing the rest of the file.
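A minimal sketch of the reading side, assuming the 5-byte header above with a little-endian length (the method and variable names are just for illustration):

using System;
using System.IO;
using System.Text;

// Reads the 1-byte version and 4-byte metadata length, then exactly N bytes of XML,
// leaving the stream positioned at the start of the encrypted data.
static string ReadMetadata(Stream input, out byte version)
{
    var header = new byte[5];
    if (input.Read(header, 0, header.Length) != header.Length)
        throw new EndOfStreamException();

    version = header[0];
    uint length = BitConverter.ToUInt32(header, 1);   // assumes little-endian layout

    var xmlBytes = new byte[length];
    int offset = 0;
    while (offset < xmlBytes.Length)
    {
        int read = input.Read(xmlBytes, offset, xmlBytes.Length - offset);
        if (read == 0) throw new EndOfStreamException();
        offset += read;
    }

    return Encoding.UTF8.GetString(xmlBytes);
}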
Of course, now that you've got a binary header, you could consider just having the metadata in the header itself, instead of putting it in XML.
Combining non-encrypted and encrypted data using XML like you do is indeed one way to go. There are a few drawbacks which may or may not be relevant in your situation:
The compression is rather limited. If the encrypted data is large, you should consider storing it in binary format directly. CDATA may be a compromise, although the range of characters you can put in a CDATA section is limited as well.
Parsing the XML may be slow if the encrypted data is large. It also often requires keeping the whole document in memory, which is probably not what you want. Again, storing the encrypted data directly in binary format is a solution; CDATA won't help here.
The benefit of XML is that it is human-readable. While that matters for the metadata, it seems odd when most of the data is encrypted anyway.
Other alternatives you may consider:
Two files side by side. One contains the binary data, and the other (named identically but with a different extension) holds the metadata (for example in XML format). The difficulty is that you have to handle cases such as the binary data file being present without the corresponding metadata file (or the opposite), as well as copying/moving the data (NTFS has transactions, but you have to use interop, unless the latest version of the .NET Framework adds support for Transactional NTFS).
Metadata and encrypted data stored in a single file in binary format. The answer by scottfavre shows one possible way to do it. I agree with his explanation, but would rather compress the metadata as well, for two reasons: (1) to save space, and (2) to prevent end users from modifying the metadata by hand, which would make the header invalid.
Otherwise I wouldn't recommend the single binary file approach, since it makes the format difficult to use; the valid case for it would be if you found (after enough benchmarking and profiling) that there is an important performance benefit.
Metadata stored in Alternate Data Streams (which work only on NTFS, so beware of FAT-formatted flash drives). Here, the benefit is that you don't have to deal with offsets stored in a header: NTFS does that for you. But this is not an approach I would recommend either, unless you absolutely need to keep the data together with the file and you know that the file will always be stored on NTFS disks (and transferred with ADS-aware applications).
My application has to read the data stored in a file and get the values for the variables or arrays to work on them.
My question is: which file format will be fast and easy for retrieving data from the file?
I was thinking of using .xml, .ini, or just a simple .txt file. But to read a .txt file I will have to write a lot of code with many if/else conditions.
I don't know how to use .ini or .xml files, but if they are better and faster then I'll learn them first and use them. Kindly guide me.
I will take this to mean that raw performance is not a priority over the robustness of the system.
For simple data that is just name/value pairs, an INI file would probably be the simplest solution. More complex structured data would lead you toward XML. According to a previously asked question, if you are working in C# (and hence, it's assumed, .NET), XML is generally preferred, as support for it is built into the .NET libraries. As XML is more flexible and can change with the needs of the program, I would also personally recommend XML over INI as a file standard. It will take more work to learn the XML library, but it will quickly pay off and is a standardized system; there's a small serialization sketch after the reference links below.
Text could be fast, but you would be sacrificing either a vast amount of robust parsing behavior for the sake of speed, or spending far more man-hours developing and maintaining a high-speed specialized parser.
For references on reading XML files (natively supported in the .NET libraries):
MSDN XMLTextReader Article
MSDN XMLReader Article
Writing Data to XML with XMLSerializer
For references on reading INI files (not natively supported in the .NET libraries):
Related Question
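As a rough illustration of the XmlSerializer route (the Settings type and file name here are invented for the example):

using System.IO;
using System.Xml.Serialization;

// Hypothetical settings class; any public properties will be serialized.
public class Settings
{
    public string Name { get; set; }
    public int[] Values { get; set; }
}

class Program
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Settings));

        // Write the settings out as XML.
        using (var output = File.Create("settings.xml"))
            serializer.Serialize(output, new Settings { Name = "demo", Values = new[] { 1, 2, 3 } });

        // Read them back in.
        using (var input = File.OpenRead("settings.xml"))
        {
            var settings = (Settings)serializer.Deserialize(input);
        }
    }
}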
If it is tabular data, then it is probably faster to just use CSV (comma-separated values) files.
If it is structured data (like a tree or something), then you can use the XML parser in C#, which is fast (but will take some learning effort on your part).
If the data is like a dictionary, then INI will be the better option. It really depends on the type of data in your application.
Or, if you don't mind an RDBMS, that would be a better option. Usually, a good RDBMS is optimized to handle large amounts of data and read them really quickly.
If you don't mind having a binary file (one that people can't read and modify themselves), the fastest option would be serializing an array of numbers to a file and deserializing it from the file.
The file will be smaller because the data is stored more efficiently, requiring fewer I/O operations to read it. It will also require minimal parsing (really minimal), so reading will be lightning fast.
Suppose your numbers are located here:
int[] numbers = ..... ;
You save them to file with this code:
using (var file = new FileStream(filename, FileMode.Create))
{
    // BinaryFormatter lives in System.Runtime.Serialization.Formatters.Binary
    var formatter = new BinaryFormatter();
    formatter.Serialize(file, numbers);   // the stream comes first, then the object graph
}
To read the data from the file, open it again and use:
using (var file = new FileStream(filename, FileMode.Open))
{
    var formatter = new BinaryFormatter();
    numbers = (int[])formatter.Deserialize(file);
}
I think that @Ian T. Small addressed the difference between the file types well.
Given @Shaharyar's responses to @Aniket, I just wanted to add to the DBMS conversation as a solution, given the limited scope information we have.
Will the data set grow? How many entries constitute "many fields"?
I agree that an R-DBMS (relational) is a potential solution for a large data set. The next question is: what is a large data set?
When (and which) a DBMS is a good idea
When @Shaharyar says many fields, are we talking tens or hundreds of fields?
=> 10-20 fields wouldn't necessitate the overhead (install size, CRUD code, etc.) of an R-DBMS. XML serialization of the object is far simpler.
=> If there is an indeterminate number of fields (i.e. the number of fields increases over time), if he needs ACID compliance, or if there are hundreds of fields, then I'd say @Aniket is spot on.
@Matt's suggestion of NoSQL is also great. It will provide high throughput (far more than required for an update every few seconds) and simplified serialization/deserialization.
The only downside I see here is application size/configuration. (Even the lightweight, easy-to-configure MongoDB will add tens of MB for the DBMS facilities and driver; not ideal for a small < 1 MB application meant for fast, easy distribution.) Oh, and @Shaharyar, if you do require ACID compliance, please be sure to check the database first. Mongo, for example, does not offer it. That's not to say you will ever lose data; there are just no guarantees.
Another Option - No DBMS but increased throughput
The last suggestion I'd like to make will require a little code (specifically an object to act as a buffer).
If
1. the data set is small (tens of fields, not hundreds),
2. the number of fields is fixed,
3. there is no requirement for ACID compliance, and
4. you're concerned about increased transaction loads (i.e. lots of updates per second),
then you can just cache changes in a datastore object and flush them on program close, or via a timer every 'n' seconds/minutes/etc.
Per @Ian T. Small's post, we would use the native XML serialization built into the .NET Framework.
The following is an oversimplified sketch, but it should give you an idea:
using System.Timers;

public class FieldContainer
{
    private bool _changeMade;
    private readonly Timer _timer = new Timer(5 * 60 * 1000);   // 5 minutes

    public FieldContainer()
    {
        // flush cached changes to the XML flat file whenever the timer fires
        _timer.Elapsed += (s, e) => { if (_changeMade) { UpdateXmlFlatFile(); _changeMade = false; } };
        _timer.Start();
    }

    public void MarkChanged() { _changeMade = true; }

    private void UpdateXmlFlatFile() { /* serialize the data store to XML here */ }
}
How fast does it need to be?
A txt file will be the fastest option, but you have to program the parser yourself (speed does come at a cost).
XML is probably the easiest to implement, as you have XmlSerializer (and other classes) to do the hard work.
For small configuration files (~0.5 MB and smaller) you won't be able to tell any difference in speed. When it comes to really big files, txt with a custom file format is probably the way to go. However, you can always choose either way: look at projects like OpenStreetMap, which has huge XML files (> 10 GB) that are still usable.
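If your files ever do get that big, a streaming XmlReader avoids loading the whole document into memory; a rough sketch (the file name and the "entry" element are placeholders):

using System.Xml;

class StreamingScan
{
    static void ScanLargeFile()
    {
        using (var reader = XmlReader.Create("bigdata.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "entry")
                {
                    // process each record here as it streams past
                }
            }
        }
    }
}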
Is there a way to customize the way a BinaryWriter writes out to files so that I can read the file back from a C++ program?
Eg:
myBinaryWriter.Write(myInt);
myBinaryWriter.Write(myBool);
And in C++:
fread(&myInt, 1, sizeof(int), fileHandle);
fread(&myBool, 1, sizeof(bool), fileHandle);
EDIT: From what I can see, if the length of a string is small enough to fit into one byte then that's how it writes it, which is bad if I want to read it back in C++.
If you want to guarantee binary compatibility, possibly the easiest approach from C# is to ditch BinaryWriter and just use a stream to write the bytes yourself. That way you get full control of the output data.
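For example, a sketch of writing the fields as raw bytes so that the C++ fread calls above see the expected layout (this assumes a little-endian C++ target with sizeof(int) == 4 and sizeof(bool) == 1; the file name is just an example):

using System;
using System.IO;

static class RawWriter
{
    static void WriteRecord(string path, int myInt, bool myBool)
    {
        using (var stream = File.Create(path))
        {
            byte[] intBytes = BitConverter.GetBytes(myInt);   // 4 bytes, machine endianness
            stream.Write(intBytes, 0, intBytes.Length);
            stream.WriteByte(myBool ? (byte)1 : (byte)0);     // 1 byte, no padding

            // For strings, write an explicit 4-byte length followed by the raw bytes,
            // instead of BinaryWriter's 7-bit-encoded length prefix.
            byte[] text = System.Text.Encoding.ASCII.GetBytes("hello");
            stream.Write(BitConverter.GetBytes(text.Length), 0, 4);
            stream.Write(text, 0, text.Length);
        }
    }
}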
Another approach would be to create an assembly that writes the data using C++/CLI, so you get direct compatibility with C++ from managed code.
You have a few options to choose from:
Write it byte by byte yourself. This is possibly the worst option: it requires more work on both sides (the serializing and the deserializing side).
Use a cross-platform serializer like Protobuf. It has ports for almost any platform, including C# (protobuf-net) and C++, and it is also easy to use with really good performance.
Use a struct and convert it to a byte array using Marshal.StructureToPtr (if you need the other direction, use Marshal.PtrToStructure). There are plenty of examples on the internet, and a sketch follows this list. If you can use C++/CLI, you can share the same struct definition, so a change in one place changes both sides.
If you use C++/CLI, you can also use the built-in serializers, such as BinaryFormatter...
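A rough sketch of the Marshal-based option (the Record struct and its fields are made up; the key point is a sequential, packed layout that matches the C/C++ side):

using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential, Pack = 1)]
struct Record
{
    public int MyInt;
    public byte MyBool;   // marshalled as a single byte
}

static class RecordSerializer
{
    // Convert the struct to raw bytes that a matching C/C++ struct can read directly.
    public static byte[] ToBytes(Record record)
    {
        var buffer = new byte[Marshal.SizeOf(typeof(Record))];
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            Marshal.StructureToPtr(record, handle.AddrOfPinnedObject(), false);
        }
        finally
        {
            handle.Free();
        }
        return buffer;
    }
}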
I am looking for the protobuf-net equivalent to the C++ API Message::ByteSize to find out the serialized message length in bytes.
I haven't played with the C++ API, so you'll have to give me a bit more context / information. What does this method do? Perhaps a sample usage?
If you are consuming data from a stream, there are "WithLengthPrefix" versions to automate limiting to discrete messages, or I believe the method to just read the next length from the stream is on the public API.
If you want to get a length in place of serializing, then currently I suspect the easiest option might be to serialize to a dummy stream and track the length. Oddly enough, an early version of protobuf-net did have "get the length without doing the work" methods, but after discussion I removed them. The data serialized is still tracked, obviously. However, because the API works differently, the binary data length for objects is not available "for free".
If you clarify what the use-case is, I'm sure we can make it easily available (if it isn't already).
Re the comment: that is what I suspected. Because protobuf-net defers the binary translation to the last moment (because it is dealing with regular .NET types, not some self-generated code), there is no automatic way of getting this value without doing the work. I could add a mechanism to let you get this value by writing to Stream.Null, but if you need the data anyway you might benefit from just writing to a MemoryStream and checking its .Length in advance of copying the data.
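A rough sketch of that MemoryStream approach (the generic message parameter stands in for whatever [ProtoContract] type you are actually serializing):

using System.IO;
using ProtoBuf;

static class LengthHelper
{
    // Serialize to an in-memory buffer and report its length: the rough
    // equivalent of the C++ ByteSize(), at the cost of doing the work.
    public static long GetSerializedLength<T>(T message)
    {
        using (var buffer = new MemoryStream())
        {
            Serializer.Serialize(buffer, message);
            return buffer.Length;
        }
    }
}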
I use the excellent FileHelpers library when I work with text data. It allows me to very easily dump text fields from a file or in-memory string into a class that represents the data.
In working with a big endian microcontroller-based system I need to read a serial data stream. In order to save space on the very limited microcontroller platform I need to write raw binary data which contains field of various multi-byte types (essentially just dumping a struct variable out the serial port).
I like the architecture of FileHelpers. I create a class that represents the data and tag it with attributes that tell the engine how to put data into the class. I can feed the engine a string representing a single record and get a deserialized representation of the data. However, this is different from object serialization in that the raw data is not delimited in any way; it's a simple binary fixed-record format.
FileHelpers is probably not suitable for reading such binary data, as it cannot handle the nulls that show up* and I suspect there might be Unicode issues (the engine takes input as a string, so I have to read bytes from the serial port and translate them into a Unicode string before they go to my data converter classes). As an experiment I have set it up to read the binary stream, and as long as I'm careful not to send nulls it works quite well so far. It is easy to set up new converters that read the raw data and account for endian formatting issues and such. It currently fails on nulls and cannot process multiple records (it expects a CRLF between records).
What I want to know is if anyone knows of an open-source library that works similarly to FileHelpers but that is designed to handle binary data.
I'm considering deriving something from FileHelpers to handle this task, but it seems like there ought to be something already available to do this.
* It turns out that it does not complain about nulls in the input stream. I had an unrelated bug in my test program that showed up where I expected a problem with the nulls. I should have investigated a little deeper first!
I haven't used FileHelpers, so I can't do a direct comparison; however, if you have an object model that represents your objects, you could try protobuf-net; it is a binary serialization engine for .NET using Google's compact "protocol buffers" wire format. It's much more efficient than things like XML, but without the need to write all your own serialization code.
Note that "protocol buffers" does include some very terse markers between fields (typically one byte); this adds a little padding, but greatly improves version tolerance. For "packed" data (i.e. blocks of ints, say, from an array) this can be omitted if desired.
So: if you just want a compact output, it might be good. If you need a specific output, probably less so.
Disclosure: I'm the author, so I'm biased; but it is free.
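For reference, a minimal protobuf-net sketch of the attribute-tagged approach (the SensorRecord type and its fields are invented for the example):

using System.IO;
using ProtoBuf;

[ProtoContract]
public class SensorRecord
{
    [ProtoMember(1)] public int Timestamp { get; set; }
    [ProtoMember(2)] public short Reading { get; set; }
}

class Demo
{
    static void Main()
    {
        var record = new SensorRecord { Timestamp = 1234, Reading = 42 };
        using (var stream = File.Create("record.bin"))
            Serializer.Serialize(stream, record);
    }
}

Note that the output is protobuf wire format, not your microcontroller's fixed struct layout, so this fits the "compact output" case rather than the "specific output" case above.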
When I am fiddling with GPS data in the SIRFstarIII binary mode, I use the Python interactive prompt with the serial module to fetch the stream from the USB/serial port and the struct module to convert the bytes as needed (per some format defined by SIRF). Using the interactive prompt is very flexible because I can read the string to a variable, process it, view the results and try again if needed. After the prototyping stage is finished, I have the data format strings that I need to put into the final program.
Your question doesn't mention anything about why you have a C# tag. I understand FileHelpers is a C# library, but that doesn't tell me what environment you are working in. There is an implementation of Python for .NET called IronPython.
I realize this answer might mean you have to learn a new language, but having an interactive prompt is a very powerful tool for any programmer.