I want to read a binary file which was created outside of my program. One obvious way in C# to read a binary file is to define a class representing the file, then use a BinaryReader, read from the file via the Read* methods, and assign the return values to the class properties.
What I don't like about this approach is that I have to manually write the code that reads the file, even though the defined structure already describes how the file is stored. I also have to keep the read order correct.
After looking around a bit I came across the BinaryFormatter, which can automatically serialize and deserialize objects in binary format. One great advantage would be that I could read and also write the file without writing additional code. However, I wonder whether this approach is suitable for files created by other programs, not just for serialized .NET objects. Take for example a graphics file format like BMP. Would it be a good idea to read the file with a BinaryFormatter, or is it better to read and write manually via BinaryReader and BinaryWriter? Or are there other approaches that suit better? I'm not looking for concrete examples, just for advice on the best way to implement this.
You'd have to be very VERY lucky to find an external file format that happened to map perfectly to the format the BinaryFormatter puts out. The BinaryFormatter obviously adds information on the types/things you're serializing, as well as the data itself, whereas a "normal" binary file format will generally be "these bytes are this, then these bytes are this".
When I've done this in the past (reading SWF headers springs to mind), I've always just used a file stream and processed/mapped it manually.
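For example, a minimal sketch of reading the start of a BMP file field by field with a BinaryReader (assuming the standard BITMAPFILEHEADER/BITMAPINFOHEADER layout; the file name is just a placeholder):

using System.IO;

using (var reader = new BinaryReader(File.OpenRead("image.bmp")))
{
    var signature = new string(reader.ReadChars(2));    // "BM"
    var fileSize  = reader.ReadInt32();                 // total file size in bytes
    reader.ReadInt32();                                  // skips the two reserved 16-bit fields
    var pixelDataOffset = reader.ReadInt32();            // where the pixel data starts
    var dibHeaderSize   = reader.ReadInt32();            // size of the DIB header
    var width  = reader.ReadInt32();
    var height = reader.ReadInt32();
}

Each Read* call has to match the order and size of the fields in the format specification, which is exactly the manual bookkeeping the question describes.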
I'm trying to understand and to decide the best approach to my problem.
I have an xsd that represents the schema of the information I agreed on with a client.
Now, in my application (c#, .net3.5) I use and consume an object that has been deserialized from an xml created according to the xsd schema.
As soon as I fill the object with data, I want to pass it to another application and also store it in a db. I have two questions:
I'd like to serialize the object to pass it quickly to the other application: is binary or xml serialization better?
Unfortunately, in the db I have a limited-size field to store the info, so I need some sort of compression of the serialized object. Does binary serialization create smaller data than xml serialization, or do I need to compress the data in any case? If yes, how?
Thanks!
I'd like to serialize the object to pass it quickly to the other application: is binary or xml serialization better?
Neither is specific enough; binary can be good or bad; xml can be good or bad. Generally speaking, binary is smaller and faster to process, but switching to binary will make the data unusable from code that expects xml.
Does binary serialization create smaller data than xml serialization, or do I need to compress the data in any case?
It can be smaller; or it can be larger; indeed, compression can make things smaller or larger too.
If space is your primary concern, I would suggest running it through something like protobuf-net (a binary serializer without the versioning issues common to BinaryFormatter), and then speculatively try compressing it with GZipStream. If the compressed version is smaller: store that (and a marker - perhaps a preamble - that says "I'm compressed"). If the compressed version gets bigger than the original version, store the original (again with a preamble).
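A rough sketch of that idea (Payload and its members are hypothetical; the real class would come from your xsd):

using System;
using System.IO;
using System.IO.Compression;
using ProtoBuf;

[ProtoContract]
class Payload
{
    [ProtoMember(1)] public string Name { get; set; }
    [ProtoMember(2)] public string Data { get; set; }
}

static byte[] SerializeForDb(Payload obj)
{
    byte[] raw;
    using (var ms = new MemoryStream())
    {
        Serializer.Serialize(ms, obj);              // protobuf-net binary serialization
        raw = ms.ToArray();
    }

    byte[] zipped;
    using (var ms = new MemoryStream())
    {
        using (var gz = new GZipStream(ms, CompressionMode.Compress))
            gz.Write(raw, 0, raw.Length);
        zipped = ms.ToArray();
    }

    // Keep whichever is smaller, with a one-byte preamble so the reader knows which it got.
    return zipped.Length < raw.Length ? Prepend(1, zipped) : Prepend(0, raw);
}

static byte[] Prepend(byte marker, byte[] body)
{
    var result = new byte[body.Length + 1];
    result[0] = marker;
    Buffer.BlockCopy(body, 0, result, 1, body.Length);
    return result;
}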
Here's a recent breakdown of the performance (speed and size) of the common .NET serializers: http://theburningmonk.com/2013/09/binary-and-json-serializer-benchmarks-updated/
Ok. I know how to use serialization and such, but since that only applies to objects that have been marked with the Serializable attribute - how can I, for example, load data and use it in an application without using serialization? Say a data file.
Or, create a serialized data container that holds files which are not themselves serialized.
The methods I've used are binary serialization and XML serialization. Are there any other ways to load unknown data and perhaps somehow use it in C#?
JSON serialization using JSON.NET
This eats everything! Including anonymous types.
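A minimal sketch (the property names here are just placeholders):

using System.Collections.Generic;
using Newtonsoft.Json;

// Works even with anonymous types:
var json = JsonConvert.SerializeObject(new { Name = "observation", Count = 3 });
// json is {"Name":"observation","Count":3}

// And back again, into a matching class or a loosely typed structure:
var data = JsonConvert.DeserializeObject<Dictionary<string, object>>(json);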
Edit
I know you said "you don't want serialization", but based on your statement "[...] objects that have been marked with the Serializable attribute", I believe you haven't tried JSON serialization using JSON.NET!
Maybe a definition of terms is in order; serialization is "the process of converting a data structure or object state into a format that can be stored and "resurrected" later in the same or another computer environment". Pretty much any method of converting "volatile" memory into persistent data and back is "serialization", so even if you roll your own scheme to do it, you're "serializing".
That said, it sounds like you simply don't want to use .NET binary serialization. That's actually the right idea; binary serialization is simple, but very code- and environment-dependent. Moving a serializable class to a different namespace, or serializing a file using the Microsoft CLR and then trying to deserialize it in Mono, can break binary serialization.
First and foremost, you MUST be able to determine what type of object you should try to create based on the file. You simply cannot open some "random" file and expect to be able to get anything meaningful out of it without knowing how the data is structured within the file. The easiest way is for the file to tell you, by specifying the type name of the object it was created from (which you will hopefully have available in your codebase). Most built-in serializers do it this way. Other ways the file can inform consumers of its format include file, row and/or field header codes (very common in older standards as they economize on file size) and extension/MIME type.
With that sorted out, deserialization can take place. If the file was serialized using a built-in serializer, simply use that, but if it's an older format (CSV, fixed-length) then you will have to parse the file, line by line, into objects representing lines, collected within a main object representing the file.
Have a look at the ETL (Extract-Transform-Load) process pattern. This is a modular, scalable architecture pattern for taking files and turning them into data the program can work with:
Extract - This part of the system is pointed at the filesystem, or other incoming "pipe" for raw data, and its job is to open the file, extract the data into a very basic object format that can be further manipulated, and put those objects into an in-memory "queue" for the Transform step. The goal is to get data from the pipe as fast and efficiently as possible, but you are required at this point to have some knowledge of the data you are working with so that you can effectively encapsulate it for further processing; actually turning the data into the format you really want happens later.
Transform - This part of the system takes the extracted data, and performs the logic that will put that data into a hydrated object from your codebase. This is where, given information from the Extract step about the type of file the data was extracted from, you instantiate a domain object that represents the data model, slice the raw data up into the chunks that will be stored as data members, perform any type conversions (data you get from a file is usually either in string format or in raw bits and must be marshalled or otherwise converted into data types that better represent the concept of the data), and validate that the internal structure of the new object is consistent and meets known business rules. Hydrated, valid objects are placed in an output queue to be processed by the Load step.
Load - This step takes the hydrated, valid business objects from the Transform step and persists them into the data store that is used by your system (such as a SQL database or the program's native flat file format).
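As a rough illustration of the three steps, here is a sketch under some assumed inputs (a hypothetical comma-delimited file of weather readings in the form "station,temperature"):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class Reading { public string Station; public double Temperature; }

// Extract: pull raw lines off the filesystem as quickly as possible.
static IEnumerable<string> Extract(string path) => File.ReadLines(path);

// Transform: slice each raw line into fields, convert types, and hydrate a domain object.
static IEnumerable<Reading> Transform(IEnumerable<string> lines) =>
    lines.Select(line => line.Split(','))
         .Where(parts => parts.Length == 2)                 // basic validation
         .Select(parts => new Reading
         {
             Station = parts[0],
             Temperature = double.Parse(parts[1])           // string -> double conversion
         });

// Load: persist the hydrated, validated objects into whatever store the system uses.
static void Load(IEnumerable<Reading> readings)
{
    foreach (var r in readings)
        Console.WriteLine($"{r.Station}: {r.Temperature}"); // stand-in for a real data store
}

Each step only talks to the next through a simple queue or sequence, which is what makes the pattern easy to scale or swap out piece by piece.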
Well, the old fashioned way was to use stream access operations and read out the data you wanted. This way you could read/write to pretty much any file.
Serialization simply automates this process based on some contract.
Based on your comment, I'm guessing that your requirement is to read any kind of file without having a contract in the first place.
Let's say you have a raw file where the first byte specifies the length of a string and the next set of bytes represents the string:
For example, 5 | H | e | l | l | o
using System.IO;
using System.Text;

var stream = File.OpenRead(filename);        // open the file for reading
var length = stream.ReadByte();              // first byte holds the string length
byte[] b = new byte[length];
stream.Read(b, 0, length);                   // read that many bytes into the buffer
var text = Encoding.ASCII.GetString(b);      // "Hello"
Binary I/O is as raw as it gets.
Check MSDN for more.
I have a FileStream that consists of several files put into one file, and I have a list of the lengths of the files; in other words, I can easily calculate the position and length of each contained file. What I want to create is an Open method that takes a file index and returns a stream containing only that file. Currently I've implemented this using a MemoryStream, but that forces me to copy the whole contained file (not the container, but the whole contained file) into memory, and I don't want to do that.
So, what I would like is a class that implements Stream and takes another stream, an offset and a length parameter, and is then readable and seekable, except that Seek(0) should take you to the offset in the underlying stream. Like an adapter class. I was wondering if this is possible, or even a good idea, or if anyone has any better ideas for how to solve this problem.
I realize that if I do it the way I just described I need to make sure that access to the underlying stream is synchronized, and that each open partial stream holds a private variable telling it where in the stream it currently is, but this should be doable, right? Has anyone done anything like this before? Or is there a simple .NET class I can just use? Any help would be appreciated.
Oh, and sorry for the bad English, I forgot to install my browser in English, so the spellchecker tells me everything is wrong.
If you're using .NET 4.0, you could use memory-mapped files. They do pretty much what you've described: you can map a "view" of a large file, specified by an offset and a length, into memory, and access just that part of the file using a Stream.
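A minimal sketch of that approach (the path and the offset/length values are placeholders; in practice they come from your list of file lengths):

using System.IO;
using System.IO.MemoryMappedFiles;

using (var mmf = MemoryMappedFile.CreateFromFile("container.bin"))
using (var view = mmf.CreateViewStream(offset: 1024, size: 2048))
{
    // 'view' is a Stream limited to the requested slice of the container file.
    var reader = new BinaryReader(view);
    // ... read the contained file through 'reader' ...
}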
Otherwise, I think your approach sounds good. Just watch out for corner cases involving reading or writing beyond the boundaries of the intended file!
Our application at work basically has to create over a million objects each night to run a numerical simulation involving some weather observations that were recorded during the day.
Each object contains a few string properties (and one very large xml property - about 2 MB). Because of the size of the large xml property we don't load it up front, and instead prefer to go to the database when we need access to this xml blob (which we do for each object).
I was wondering if it makes sense to somehow retrieve the xml data (which is 2 MB), compress it in memory and store it in the object - this prevents us from having to do a database query for each object when we come to process it.
I would much rather zip the data, store it in the object and at processing time, unzip and process
Is it possible to zip a string in process and how can I do this without creating millions of MemoryStreams / zip streams for each object?
I would think that compression is not a good idea - it adds quite an overhead to processing, which already appears to be quite intensive.
Perhaps a light-weight format would be better - JSON or a binary serialized object representing the data.
Without more detail, it is difficult to give a definite answer, or better options.
Well, there is DotNetZip which has a simple API so you can do something like this:
using Ionic.Zlib;   // DotNetZip's DeflateStream

byte[] compressedProperty;

public string MyProperty
{
    get { return DeflateStream.UncompressString(compressedProperty); }   // decompress on read
    set { compressedProperty = DeflateStream.CompressString(value); }    // compress on write
}
Not sure if it will work out performance wise for you though.
Update:
I only know the GZipStream and the DeflateStream class. Neither of them exposes a string interface. Even DotNetZip uses a stream under the hood when you call the functions above; it's just wrapped in a nicer interface (which you could do with the System.IO.Compression classes on your own). Not sure what your problem is with streams.
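For example, a small sketch of wrapping the System.IO.Compression classes yourself; the MemoryStreams are short-lived and created per call, which is usually cheap enough:

using System.IO;
using System.IO.Compression;
using System.Text;

static byte[] CompressString(string value)
{
    var bytes = Encoding.UTF8.GetBytes(value);
    using (var output = new MemoryStream())
    {
        using (var gzip = new GZipStream(output, CompressionMode.Compress))
            gzip.Write(bytes, 0, bytes.Length);   // the GZipStream must be closed before the buffer is read
        return output.ToArray();
    }
}

static string DecompressString(byte[] compressed)
{
    using (var input = new MemoryStream(compressed))
    using (var gzip = new GZipStream(input, CompressionMode.Decompress))
    using (var reader = new StreamReader(gzip, Encoding.UTF8))
        return reader.ReadToEnd();
}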
If you really want to avoid streams then you probably have to roll your own compression. Here is a guy who rolled a simple Huffman encoder to encode strings in F#. Don't know how well it works, but if you want to avoid 3rd party libs and streams then you could give it a crack.
Sometimes when reading data involving many names and figures,
reading line by line needs some serious concatenation work.
Is there any method that would allow me to read a specific data type, like the good old fscanf in C?
Thanks
Sara
I don't know if this is exactly what you are looking for, but the FileHelpers library has many utilities to help with reading fixed length and delimited text files.
From the site:
You can strong type your flat file (fixed or delimited) simply describing a class that maps to each record and later read/write your file as an strong typed .NET array
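A minimal sketch of the FileHelpers approach (the record layout and file name are hypothetical):

using FileHelpers;

[DelimitedRecord(",")]
public class Measurement
{
    public string Name;
    public double Value;
}

// ...

var engine = new FileHelperEngine<Measurement>();
Measurement[] records = engine.ReadFile("data.csv");   // one strongly typed object per line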
If what you are looking for is getting strongly typed objects from files, you should look at serialization and deserialization in the .NET Framework. These allow you to save object state into a file and read it back at a later time.
The System.IO namespace has everything you need.
You could use BinaryReader to read the data from the file, assuming the data is written out in binary form. For example, an Int32 would be written out as 4 bytes, i.e. the binary representation and not the text representation of the integer.
The usefulness of BinaryReader depends on the control you have over the code generating the file, i.e. whether you can write the data out using a BinaryWriter, and of course on how human-readable you need the file to be.
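A minimal sketch of that round trip, assuming you control both sides (the values and file name are just placeholders):

using System.IO;

// Write the values out in binary form...
using (var writer = new BinaryWriter(File.Create("data.bin")))
{
    writer.Write(42);            // Int32 stored as 4 raw bytes
    writer.Write(3.14);          // Double stored as 8 raw bytes
    writer.Write("station-A");   // length-prefixed string
}

// ...and read them back in the same order with matching Read* calls.
using (var reader = new BinaryReader(File.OpenRead("data.bin")))
{
    int count   = reader.ReadInt32();
    double mean = reader.ReadDouble();
    string name = reader.ReadString();
}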