Writing disparate objects to a single file - C#

This is an offshoot of my question found at Pulling objects from binary file and putting in List<T> and the question posed by David Torrey at Serializing and Deserializing Multiple Objects.
I don't know if this is even possible. Say that I have an application that uses five different classes to contain the necessary information to do whatever it is that the application does. The classes do not refer to each other and have various numbers of internal variables and the like. I want to save the collection of objects spawned from these classes into a single save file and then be able to read them back out. Since the objects are created from different classes, they can't go into a single typed list to be sent to disk. I originally thought I could use something like sizeof(this) to record each object's size in a table saved at the beginning of the file, along with a common GetObjectType() that returns an actionable value identifying the kind of object, but apparently sizeof doesn't work the way I thought it would, so now I'm back at square one.

Several options. First: wrap all of your objects in a larger object and serialize that. The problem with that is that you can't deserialize just one object; you have to load all of them. If you have 10 objects of several megabytes each, that isn't a good idea. You want random access to any of the objects, but you don't know the offsets in the file. The "wrapper" object can be something as simple as a List<object>, but I'd use my own object for better type safety.
Second option: use a fixed-size file header/footer. Serialize each object to a MemoryStream, then dump the individual memory streams into the file, remembering the number of bytes in each. Finally, add a fixed-size block at the start or end of the file that records the offsets where the individual objects begin. In the example below, the header of the file holds first the number of objects in the file (as an integer), then one integer per object giving that object's size.
// Serialize two objects into the same file.
// object1.Serialize() / object2.Serialize() stand in for whatever
// serializer you use; each returns the object's bytes.
byte[] bytes1 = object1.Serialize();
byte[] bytes2 = object2.Serialize();

using (var file = new BinaryWriter(File.Create("objects.bin")))
{
    // Write header:
    file.Write(2);             // Number of objects
    file.Write(bytes1.Length); // Size of first object
    file.Write(bytes2.Length); // Size of second object

    // Write data:
    file.Write(bytes1);
    file.Write(bytes2);
}
The offset of object N is the size of the header PLUS the sum of all sizes up to N. So to read the file, you first read the header back:
// file is a BinaryReader over the same file
int numObjs = file.ReadInt32();
var sizes = new int[numObjs];
for (int i = 0; i < numObjs; i++)
    sizes[i] = file.ReadInt32();
After which you can compute and seek to the correct location for object N.
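The offset arithmetic, continuing the fragment above:
int n = 1; // index of the object to read (0-based)
long offset = sizeof(int) * (1L + numObjs); // header: count + one size per object
for (int i = 0; i < n; i++)
    offset += sizes[i];
file.BaseStream.Seek(offset, SeekOrigin.Begin);
byte[] data = file.ReadBytes(sizes[n]);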
Third option: use a format that does option #2 for you, for example zip. In .NET you can use System.IO.Packaging to create a structured zip (the OPC format), or you can use a third-party zip library if you want to roll your own zip-based format.
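A minimal sketch with System.IO.Packaging (the OPC/zip route), reusing bytes1 from the example above; the file and part names are illustrative:
using (Package package = Package.Open("objects.pack", FileMode.Create))
{
    // Each object becomes its own part, individually addressable later.
    Uri partUri = PackUriHelper.CreatePartUri(new Uri("/object1.bin", UriKind.Relative));
    PackagePart part = package.CreatePart(partUri, "application/octet-stream");
    using (Stream s = part.GetStream())
        s.Write(bytes1, 0, bytes1.Length);
}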


Keep space for arrays/buffers for repeated use

I'm using large arrays (many MB each), but at any one time there is only one array: one gets disposed and another is created to take its place. They are not of equal length, but the length does have an upper bound.
Instead of having a new array allocated every time, is there a way to allocate space for the largest possible array (whose size I can find out) and use whatever portion of it each new array needs? I can't reuse the exact same array with a separate length variable, because I need to feed the array to other methods that I do not control, and those methods require arrays that are exactly the length of the data they contain (which is not constant). I remember reading about a class that can do this: you ask it for a buffer and later return the buffer to the class.
You can create your own memory manager that allocates a new array when the one you have is too small, or otherwise hands back the previously allocated one.
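A minimal sketch of such a manager (the class is illustrative, not a BCL type):
// Hands out one shared backing array, growing it only when needed.
// Note: callers that require an exact-length array still have to copy.
class BufferManager
{
    private byte[] buffer = new byte[0];

    public byte[] Take(int length)
    {
        if (buffer.Length < length)
            buffer = new byte[length]; // grow to the new maximum
        return buffer;
    }
}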
You can also use an InMemoryRandomAccessStream (a WinRT type) to store your data. The stream resizes itself to hold whatever you write to it.
Using DataWriter or DataReader, you can then easily write data to, or read it from, the stream.
To get an input or output stream from the InMemoryRandomAccessStream, use GetInputStreamAt(0) or GetOutputStreamAt(0).
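A short sketch, assuming a Windows Store/UWP project where Windows.Storage.Streams is available (run inside an async method):
byte[] myData = new byte[408]; // sample payload
var stream = new InMemoryRandomAccessStream();

// Write: the stream grows as bytes are appended.
var writer = new DataWriter(stream.GetOutputStreamAt(0));
writer.WriteBytes(myData);
await writer.StoreAsync();

// Read everything back from the start.
var reader = new DataReader(stream.GetInputStreamAt(0));
await reader.LoadAsync((uint)stream.Size);
byte[] roundTripped = new byte[stream.Size];
reader.ReadBytes(roundTripped);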

Can I create a new array reference which contains a subset of an existing array, without copying memory?

I am building a file reader in C#, and large volumes of data will be enumerated. I want to use the same buffer for each element I read out, then pass the buffer on for further processing by the client. The API would be cleaner if I could return a byte[] of the correct size, rather than the raw buffer and a length.
Is it possible to do this in C# without copying memory?
You can use ArraySegment<T>
http://msdn.microsoft.com/en-us/library/1hsbd92d.aspx
That lets you specify the start and end of the segment you want to pass on without copying any data.
If you can change your API parameter types, I think you could use an ArraySegment<T>.
ArraySegment<T> is a generic struct that stores a reference to an array together with the offset and count of a range within it. It facilitates optimizations that reduce memory copying and heap allocations.
From MSDN:
The Array property returns the entire original array, not a copy of the array; therefore, changes made to the array returned by the Array property are made to the original array.
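As a short demo sketch of that behavior:
// A "window" over the middle of the array; no bytes are copied.
int[] numbers = { 0, 1, 2, 3, 4, 5 };
var segment = new ArraySegment<int>(numbers, 1, 3); // offset 1, count 3

// Writing through segment.Array changes the original array.
segment.Array[segment.Offset] = 42; // numbers[1] is now 42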

Finding "empty" portions in a file

EDIT 1:
I'm building a torrent application that downloads from different clients simultaneously. Each download represents a portion of my file, and different clients provide different portions.
After a download completes, I need to know which portion to fetch next by finding "empty" portions in my file.
One way to create a file with a fixed size:
File.WriteAllBytes(@"C:\upload\BigFile.rar", new byte[bigSize]); // bigSize = the known final size
My BitArray that represents my file as portions:
BitArray TorrentPartsState = new BitArray(10);
For example:
File size is 100.
TorrentPartsState[0] = true;  // positions 0 through 9 are already filled; I don't need to fetch them
TorrentPartsState[1] = false; // positions 10 through 19 still need to be filled
I'm searching for an effective way to persist what the BitArray contains, even if the computer/application shuts down. One way I thought of is an XML file, updated each time a portion completes.
I don't think that's a smart or effective solution. Any ideas for a better one?
It sounds like you know the following when you start a transfer:
The size of the final file.
The (maximum) number of streams you intend to use for the file.
Create the output file and allocate the required space.
Create a second "control" file with a related filename, e.g. add your own extension. In that file, maintain an array of stream status structures corresponding to the network streams. Each status consists of the starting offset and the number of bytes transferred. Periodically flush the stream buffers and then update the control file to reflect the progress made and committed.
Variations on the theme:
The control file can define segments to be transferred, e.g. 16MB chunks, and treated as a work queue by threads that look for an incomplete segment and a suitable server from which to retrieve it.
The control file could be a separate fork within the result file. (Who am I kidding?)
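A minimal sketch of the control-file idea, using the question's BitArray as the persisted state (the file layout is an assumption, not a standard):
// Save: pack the BitArray into bytes and write the companion file.
static void SaveState(string dataPath, BitArray state)
{
    byte[] packed = new byte[(state.Length + 7) / 8];
    state.CopyTo(packed, 0);
    File.WriteAllBytes(dataPath + ".control", packed);
}

// Load: restore the BitArray, or start fresh if no control file exists.
static BitArray LoadState(string dataPath, int segmentCount)
{
    string control = dataPath + ".control";
    if (!File.Exists(control)) return new BitArray(segmentCount);
    return new BitArray(File.ReadAllBytes(control)) { Length = segmentCount };
}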
You could use a BitArray (in System.Collections).
Then, when you visit an offset in the file, you can set the BitArray at that offset to true.
So for your 10,000 byte file:
BitArray ba = new BitArray(10000);
// Visited offset, mark in the BitArray
ba[4] = true;
Implement a file system (like on a disk) inside your file. Just use something simple; there should be something available in the FOSS arena.

Is there a way to mark the end of each protobuf-net record

I am saving a series of protobuf-net objects in a database cell as a Byte[] of length-prefixed protobuf-net objects:
// Retrieve existing protobufs from the database and convert to Byte[]
object q = sql_agent_cmd.ExecuteScalar();
Byte[] olderPbfs = (Byte[])q;
// ... the new pbf has already been serialized into MemoryStream m ...
// Now write the old bytes and the new pbf-net Byte[] into a memory stream
var s = new System.IO.MemoryStream();
s.Write(olderPbfs, 0, olderPbfs.Length);
s.Write(m.GetBuffer(), 0, m.ToArray().Length); // append new bytes at the end of the old
Byte[] sumPbfs = s.ToArray();
// sumPbfs = old pbfs + new pbf. Insert sumPbfs into the database
This works fine. My concern is what happens if there is slight database corruption. It would no longer be possible to know which byte is the length prefix, and the entire cell contents would have to be discarded. Wouldn't it be advisable to also use some kind of end-of-pbf-object indicator (like the \n or EOF indicators used in text files)? That way, even if one record gets corrupted, the other records would be recoverable.
If so, what is the recommended way to add end-of-record indicators at the end of each pbf?
Using protobuf-net v2 and C# on Visual Studio 2010.
Thanks
Manish
If you use a vanilla message via Serialize / Deserialize, then no: that isn't part of the specification (because the format is designed to be appendable).
If, however, you use SerializeWithLengthPrefix, it will dump the length at the start of the message; it will then know in advance how much data is expected. You deserialize with DeserializeWithLengthPrefix, and it will complain loudly if it doesn't have enough data. However! It will not complain if you have extra data, since this too is designed to be appendable.
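As a sketch, a round trip with the length-prefixed API might look like this (MyRecord stands in for your own [ProtoContract] type, newRecord for the instance to append):
// Append one length-prefixed record to the file.
using (var stream = File.Open("records.bin", FileMode.Append))
{
    Serializer.SerializeWithLengthPrefix(stream, newRecord, PrefixStyle.Base128, 1);
}

// Read every record back, one length-prefixed message at a time.
using (var stream = File.OpenRead("records.bin"))
{
    foreach (MyRecord rec in Serializer.DeserializeItems<MyRecord>(stream, PrefixStyle.Base128, 1))
    {
        // process rec
    }
}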
In terms of Jon's reply, the default usage of the *WithLengthPrefix methods stores data exactly as Jon suggests; it pretends there is a wrapper object and behaves accordingly. The differences are:
no wrapper object actually exists
the *WithLengthPrefix methods explicitly stop after a single occurrence, rather than merging any later data into the same object (useful for, say, sending multiple discrete objects to a single file, or down a single socket)
The difference between the two "appendable"s here is that the first means "merge into a single object", whereas the second means "I expect multiple records".
Unrelated suggestion:
s.Write(m.GetBuffer(), 0, m.ToArray().Length);
should be:
s.Write(m.GetBuffer(), 0, (int)m.Length);
(no need to create an extra buffer)
(Note: I don't know much about protobuf-net itself, but this is generally applicable to Protocol Buffer messages.)
Typically if you want to record multiple messages, it's worth just using a "wrapper" message: make the "real" message a repeated field within it. Each "real" message will then be length-prefixed by the natural wire format of Protocol Buffers.
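In protobuf-net terms, such a wrapper might look like this (type names are illustrative):
[ProtoContract]
class RecordBatch
{
    private readonly List<MyRecord> records = new List<MyRecord>();

    // The "real" messages live in a repeated field (field number 1).
    [ProtoMember(1)]
    public List<MyRecord> Records { get { return records; } }
}
Per the *WithLengthPrefix discussion above, serializing such a wrapper produces the same stored bytes as appending the individual length-prefixed records.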
This won't detect corruption of course - but to be honest, if the database ends up getting corrupted you've got bigger problems. You could potentially detect corruption, e.g. by keeping a hash along with each record... but you need to consider the possibility of the corruption occurring within the length prefix, or within the hash itself. Think about what you're really trying to achieve here - what scenarios you're trying to protect against, and what level of recovery you need.

which data structure is good for temporary big binary data storage?

I plan to have a data structure that stores temporary binary data in memory for analysis.
The max size of the data will be about 10 MB.
Data will be added at the end, 408 bytes at a time.
There are no search or retrieve operations on this temporary binary data.
The data will be wiped out and the storage reused for the next analysis.
Questions:
Which structure is good for this purpose: byte[10MB], List<byte>(10MB), List<MyStruct>(24000), or something else?
How do I quickly wipe out the data (not List.Clear(), just set the values to 0) for a List or array?
If I call List.Clear(), does the memory for the List shrink, or is the capacity still there so that no allocation happens when I call List.AddRange() after the Clear()?
Does List.Insert() make the List larger, or does it just replace the existing item?
You will have to describe what you are doing in more detail to get better answers, but it sounds like you are worried about efficiency/performance, so, taking the questions in order:
byte[]
no need to clear the array; just keep track of where the 'end' of your current cycle is
n/a
n/a
If your data is usually the same size, and always under a certain size, use a byte array.
Create a byte[] and an int that tells you where the "full" part of the buffer stops and the "free" part starts. You never need to clear it; just overwrite what was there. The only problem with this is if your data is sometimes 100 KB, sometimes 10 MB, and sometimes a bit larger than you originally planned for.
A List will be slower to use and larger in memory, although it handles various sizes of data out of the box.
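A minimal sketch of that fixed-buffer approach, using the sizes from the question (10 MB upper bound, 408-byte appends):
// Reusable analysis buffer: append at the end, "wipe" by resetting a marker.
class AnalysisBuffer
{
    private readonly byte[] buffer = new byte[10 * 1024 * 1024]; // upper bound
    private int end; // logical end of valid data

    public void Append(byte[] chunk) // 408 bytes at a time in the question
    {
        Buffer.BlockCopy(chunk, 0, buffer, end, chunk.Length);
        end += chunk.Length;
    }

    // Old bytes are simply overwritten during the next cycle.
    public void Reset() { end = 0; }
}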
Alternatively, let a MemoryStream manage the storage:
using (System.IO.MemoryStream memStream = new System.IO.MemoryStream())
{
    // do stuff with memStream here
} // the using ensures disposal occurs here, so you don't have to worry about cleaning up
