I am using the ProtoWriter/ProtoReader classes to implement something similar to the DataTableSerializer included with the Protobuf-net source. One difference is that after the initial transfer of the table contents all future updates are serialised incrementally.
Currently I'm not disposing the ProtoWriter instance until the program ends (as I want all future updates to be serialised with the same writer). This has the effect of delaying all writing to the output stream until the internal buffer size of 1024 bytes is reached.
Should I be creating a new ProtoWriter for each incremental update? Is there another way to force the writer to write to the stream?
Sample code:
private readonly ProtoWriter _writer;

private void WriteUpdate(IEnumerable<IReactiveColumn> columns, int rowIndex)
{
    // Start the row group
    ProtoWriter.WriteFieldHeader(ProtobufOperationTypes.Update, WireType.StartGroup, _writer);
    var token = ProtoWriter.StartSubItem(rowIndex, _writer);

    var rowId = rowIndex;
    // Send the row id so that it can be matched against the local row id at the other end.
    ProtoWriter.WriteFieldHeader(ProtobufFieldIds.RowId, WireType.Variant, _writer);
    ProtoWriter.WriteInt32(rowId, _writer);

    foreach (var column in columns)
    {
        var fieldId = _columnsToFieldIds[column.ColumnId];
        WriteColumn(column, fieldId, rowId);
    }

    ProtoWriter.EndSubItem(token, _writer);
}
Interesting question. The flush method isn't exposed because internally it is not always appropriate to flush, but I guess there's no huge reason not to expose it and just let it no-op when it isn't. On the other hand:
it is already a lightweight wrapper around a stream: you could dispose and recreate
or you could just keep writing and make full use of the extra buffering
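For example, one way to take the dispose-and-recreate route (a minimal sketch, assuming the v2-era ProtoWriter(Stream, TypeModel, SerializationContext) constructor and a _stream field held by your class) is to wrap each incremental update in its own short-lived writer, so that disposing it flushes the buffered bytes:

// Sketch only: recreate a writer per update so disposing it flushes the
// buffered bytes to the underlying stream. _stream is an assumed field.
private void WriteUpdateFlushed(IEnumerable<IReactiveColumn> columns, int rowIndex)
{
    using (var writer = new ProtoWriter(_stream, RuntimeTypeModel.Default, null))
    {
        WriteUpdate(writer, columns, rowIndex); // same body as WriteUpdate, but taking the writer as a parameter
    } // disposing the writer pushes its internal buffer out to _stream
}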
I'm creating files that have a certain structure to them. They begin with a Header, then contain a block of DataElements. (The exact details don't matter to this question.)
I have a DataFileWriter connected to a FileStream for output. The problem is, the service that's consuming the files I'm building will reject any data file whose size is larger than the arbitrary value TOOBIG.
Given these constraints:
Every file must start with a Header
Every file must contain one or more DataElements, which must be written out completely; a file ending with an incomplete DataElement is invalid
It is perfectly valid to stop at the end of the current element and begin writing a new file with a new header, as long as each DataElement is written exactly once
The DataFileWriter doesn't and should not know that it's writing to a FileStream as opposed to some other type of stream; all it knows is that it has a Stream and in other cases that could be a completely different setup.
DataElement does not have a fixed size, but it's reasonable to assume any given element won't exceed 4 KB in size.
What's the best way to set up a system that will ensure, assuming that no massive DataElements come through, that no file exceeding a size of TOOBIG will be created? Basic architecture is given below; how would I need to modify it?
public class DataFileWriter : IDisposable
{
    private readonly Stream _output;
    private readonly IEnumerable<DataElement> _input;
    private const long TOOBIG = 4L * 1024 * 1024 * 1024; // 4 GB

    public DataFileWriter(IEnumerable<DataElement> input, Stream output)
    {
        _input = input;
        _output = output;
    }

    public void Write()
    {
        WriteHeader(); // writes the header to _output
        foreach (var element in _input)
        {
            WriteElement(element); // serializes the record to _output
        }
    }

    public void Dispose()
    {
        _output.Dispose();
    }
}
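One possible way to satisfy those constraints while keeping the writer stream-agnostic is to inject a stream factory and track how many bytes have gone into the current file, rolling over to a fresh stream before another worst-case element could push it past TOOBIG. A rough sketch only (the Func<Stream> factory, the byte-counting WriteHeader/WriteElement return values, and the 4 KB margin are assumptions, not part of the architecture above):

public class RollingDataFileWriter : IDisposable
{
    private const long TooBig = 4L * 1024 * 1024 * 1024; // 4 GB limit per file
    private const long ElementMargin = 4 * 1024;         // assume no element exceeds 4 KB

    private readonly IEnumerable<DataElement> _input;
    private readonly Func<Stream> _nextOutput; // supplies a fresh stream for each new file
    private Stream _output;
    private long _bytesWritten;

    public RollingDataFileWriter(IEnumerable<DataElement> input, Func<Stream> nextOutput)
    {
        _input = input;
        _nextOutput = nextOutput;
    }

    public void Write()
    {
        StartNewFile();
        foreach (var element in _input)
        {
            // Roll over while there is still room for one worst-case element.
            if (_bytesWritten + ElementMargin > TooBig)
                StartNewFile();
            _bytesWritten += WriteElement(element, _output);
        }
    }

    private void StartNewFile()
    {
        _output?.Dispose();
        _output = _nextOutput();
        _bytesWritten = WriteHeader(_output); // every file starts with a Header
    }

    public void Dispose()
    {
        _output?.Dispose();
    }

    // Assumed helpers: the same serialization logic as before, but returning the bytes written.
    private long WriteHeader(Stream output) { /* ... */ return 0; }
    private long WriteElement(DataElement element, Stream output) { /* ... */ return 0; }
}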
I'm writing an application that runs script plugins to automate what a user used to do manually through a serial terminal, so I am essentially implementing the serial terminal's functionality in code. One of the terminal's functions was to send a command that kicked off continuous streaming of data from a device until the user pressed the space bar, which stopped the streaming. While the data was streaming, the user would set some values in another application on some other devices and watch the streamed data change in the terminal.
Now, the streamed data can take different shapes, depending on the particular command that's sent. For instance, one response may look like:
---RESPONSE HEADER---
HERE: 1
ARE: 2 SOME:3
VALUES: 4
---RESPONSE HEADER---
HERE: 5
ARE: 6 SOME:7
VALUES: 8
....
another may look like:
here are some values
in cols and rows
....
So, my idea is to have a different parser based on the command I send. So, I have done the following:
public class Terminal
{
    private SerialPort port;
    private IResponseHandler pollingResponseHandler;
    private object locker = new object();
    private List<Response1Clazz> response1;
    private List<Response2Clazz> response2;

    // setter omitted for brevity
    // get a snapshot of the data at any point in time while the response is polling.
    public List<Response1Clazz> Response1 { get { lock (locker) return new List<Response1Clazz>(response1); } }

    // setter omitted for brevity
    public List<Response2Clazz> Response2 { get { lock (locker) return new List<Response2Clazz>(response2); } }

    public Terminal()
    {
        port = new SerialPort(){ /*initialize data*/ }; // open port etc etc
    }

    void StartResponse1Polling()
    {
        Response1 = new List<Response1Clazz>();
        Parser<List<Response1Clazz>> parser = new KeyValueParser(Response1); // parser is of type T
        pollingResponseHandler = new PollingResponseHandler(parser);
        // write command to start polling response 1 in a task
    }

    void StartResponse2Polling()
    {
        Response2 = new List<Response2Clazz>();
        Parser<List<Response2Clazz>> parser = new RowColumnParser(Response2); // parser is of type T
        pollingResponseHandler = new PollingResponseHandler(parser); // this accepts a parser of type T
        // write command to start polling response 2
    }

    void OnSerialDataReceived(object sender, Args a)
    {
        lock (locker)
        {
            // do some processing yada yada
            // we pass the serial data to the handler, which in turn delegates to the parser.
            pollingResponseHandler.Handle(processedSerialData);
        }
    }
}
The caller of the class would then be something like:
public class Plugin : BasePlugin
{
    public override void PluginMain()
    {
        Terminal terminal = new Terminal();
        terminal.StartResponse1Polling();

        // update some other data;
        List<Response1Clazz> response = terminal.Response1;
        // process response

        // update more data
        response = terminal.Response1;
        // process response

        // terminal1.StopPolling();
    }
}
My question is quite general, but I'm wondering if this is the best way to handle the situation. Right now I am required to pass in an object/List that I want modified, and it's modified via a side effect. For some reason this feels a little ugly because there is really no indication in code that this is what is happening. I am purely doing it because the "Start" method is the location that knows which parser to create and which data to update. Maybe this is Kosher, but I figured it is worth asking if there is another/better way. Or at least a better way to indicate that the "Handle" method produces side effects.
Thanks!
I don't see a problem with modifying a List<> that is received as a parameter. It isn't the most beautiful thing in the world, but it is quite common. Sadly, C# doesn't have a const modifier for parameters (compare this with C/C++, where unless you declare a parameter const, it is fine for the method to modify it). You just have to give the parameter a self-explaining name (like outputList) and put a comment on the method (you know, an XML comment block, like /// <param name="outputList">This list will receive...</param>).
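For instance, something as small as this (names are purely illustrative, not taken from your code) already makes the side effect visible at the call site and in IntelliSense:

/// <summary>Parses one chunk of processed serial data.</summary>
/// <param name="outputList">This list will receive the parsed Response1Clazz items.</param>
public void Parse(string processedSerialData, List<Response1Clazz> outputList)
{
    // ... parse processedSerialData and Add(...) the results to outputList ...
}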
To give a more complete response, I would need to see the whole code. You have omitted an example of Parser and an example of Handler.
Instead, I see a problem with your lock in { lock (locker) return new List<Response1Clazz>(response1); }, and it seems to be nonsense considering that you then do Response1 = new List<Response1Clazz>();, even though Response1 only has a getter.
I've been sifting through the posts and forums but could not find a way to achieve this.
I have an array of 10,000,000 Person objects. I'm sending these objects over the network using a Streamed WCF Net.Tcp web service.
The problem is I want to read the first, for example, 5000 Person objects of the array as it arrives and process only those. After which I will advance the stream and read another 5000, etc...
I haven't been able to find a way to do this because as far as I can tell there is no explicit size of objects in C#. As in, I can't just read the first 312 Bytes of the stream and say "Yes this is the first Person object. Now read the next 312 Bytes to get the next person.".
I ideally would like to use ProtoBuf-Net to serialize my objects but the .NET BinaryFormatter is fine as well.
I'm also open to sending the data in chunks, such as arrays of 5000, but I want to do so without opening a brand new TCP connection every time. If only there were a way to tell the code that reads the stream: "OK, deserialize everything I just sent you (an array of 5000) and then I will continue writing another 5000 to the stream".
Any ideas?
Thanks.
There may not be an explicit size for most objects in .NET but you can find the size of a serialized object. First send the size (in bytes) of the serialized object, then send the serialized object.
// pseudo-code
byte[] serializedObj = DoSerialization(person); // we can see the length of an array
using (var writer = new BinaryWriter(stream, Encoding.UTF8, leaveOpen: true))
{
    writer.Write(serializedObj.Length); // 4-byte length prefix
    writer.Write(serializedObj);        // then the payload itself
}
You can also do this in bulk by modifying what and how you send your objects. You could create a List<Person>, add N number of Person, serialize the List and send as before.
Although I am not sure if sending the size before sending the data is necessary, it can help when you are reading the stream, to know how many bytes you are expecting.
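The matching read side would then be something along these lines (DoDeserialization is assumed to mirror the DoSerialization above and return a batch of people):

// Read one length-prefixed batch: a 4-byte length, then exactly that many payload bytes.
using (var reader = new BinaryReader(stream, Encoding.UTF8, leaveOpen: true))
{
    int length = reader.ReadInt32();
    byte[] payload = reader.ReadBytes(length);
    List<Person> batch = DoDeserialization(payload); // e.g. 5000 Person objects
    // process the batch, then loop back for the next length prefix
}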
You can do this with protobuf-net simply by using an ObservableCollection<Person> in your receiving system. When the collection grows larger than 5000 objects during deserialization, remove and process the items in an ObservableCollection<T>.CollectionChanged callback. Then process any remaining items in an [OnDeserialized] callback.
For instance, consider the following root object:
[ProtoContract]
public class RootObject
{
    public RootObject()
    {
        this.People = new ObservableCollection<Person>();
    }

    [ProtoMember(1)]
    public ObservableCollection<Person> People { get; private set; }

    public event EventHandler<EventArgs<StreamingContext>> OnDeserialized;

    [OnDeserialized]
    internal void OnDeserializedMethod(StreamingContext context)
    {
        var onDeserialized = OnDeserialized;
        if (onDeserialized != null)
            onDeserialized(this, new EventArgs<StreamingContext> { Value = context });
    }
}

public class EventArgs<T> : EventArgs
{
    public T Value { get; set; }
}
Say you have a method you would like to call to process each 5000 Person objects as they are added to the collection, for instance:
const int ProcessIncrement = 5000;

void ProcessItems(ICollection<Person> people, bool force)
{
    if (people == null || people.Count == 0)
        return;
    if (people.Count >= ProcessIncrement || force)
    {
        // Remove and process the items, possibly on a different thread.
        Console.WriteLine(string.Format("Processing {0} people.", people.Count));
        people.Clear();
    }
}
You can pre-allocate your RootObject and add listeners with the necessary logic, and merge contents of the serialization stream into the root:
// Allocate a new RootObject
var newRoot = new RootObject();

// Add listeners to process chunks of Person objects as they are added
newRoot.People.CollectionChanged += (o, e) =>
{
    // Process each chunk of 5000.
    var collection = (ICollection<Person>)o;
    ProcessItems(collection, false);
};
newRoot.OnDeserialized += (o, e) =>
{
    // Forcibly process any remaining no matter how many.
    ProcessItems(((RootObject)o).People, true);
};

// Deserialize from the stream onto the pre-allocated newRoot
Serializer.Merge(stream, newRoot);
As required, ProcessItems will be called every time an object is added to the collection, processing them in increments of 5000 then processing the remainder unconditionally.
Now, the only question is, does protobuf-net load the entire stream into memory before deserializing the collection, or does it do streaming deserialization? As it turns out, it does the latter, as shown by this sample fiddle that shows the stream position being gradually incremented as the items in the People collection are added, processed and removed.
Here I added the listeners to RootObject manually before deserialization. If you were to add them in the constructor itself, you could use ProtoBuf.Serializer.Deserialize<RootObject>(Stream stream) instead of Serializer.Merge onto a pre-allocated root object, which might be easier to integrate into your current architecture.
Incidentally, this technique should work with XmlSerializer and Json.NET as well.
Which of the following approaches is better? That is, is it better to read the stream's data into memory, close the stream, and then do whatever operations are needed on that data, or to just perform the operations while the stream is open? Assume that the input from the stream is huge.
First method:
public static int calculateSum(string filePath)
{
    int sum = 0;
    var list = new List<int>();
    using (StreamReader sr = new StreamReader(filePath))
    {
        while (!sr.EndOfStream)
        {
            list.Add(int.Parse(sr.ReadLine()));
        }
    }
    foreach (int item in list)
        sum += item;
    return sum;
}
Second method:
public static int calculateSum(string filePath)
{
    int sum = 0;
    using (StreamReader sr = new StreamReader(filePath))
    {
        while (!sr.EndOfStream)
        {
            sum += int.Parse(sr.ReadLine());
        }
    }
    return sum;
}
If the file is modified or accessed by other processes often, then read the data in quickly and work with it afterwards. If it is not accessed often, then you are fine to read the file one line at a time and work with each line separately.
In general, if you can do it in a single pass, then do it in a single pass. You indicate that the input is huge, so it might not all fit into memory. If that's the case, then your first option isn't even possible.
Of course, there are exceptions to every rule of thumb. But you don't indicate that there's anything special about the file or the access pattern (other processes wanting to access it, for example) that prevents you from keeping it open longer than absolutely necessary to copy the data.
I don't know if your example is a real-world scenario or if you're just using the sum thing as a placeholder for more complex processing. In any case, if you're processing a file line-by-line, you can save yourself a lot of trouble by using File.ReadLines:
int sum = 0;
foreach (var line in File.ReadLines(filePath))
{
    sum += int.Parse(line);
}
This does not read the entire file into memory at once. Rather, it uses an enumerator to present one line at a time, and only reads as much as it must to maintain a relatively small (probably four kilobyte) buffer.
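As an aside, when the per-line work really is just a sum, the same lazy, line-at-a-time behaviour collapses to a one-liner with LINQ (add using System.Linq;):

int sum = File.ReadLines(filePath).Sum(int.Parse); // still streams one line at a time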
I have the following logic:
public void InQueueTable(DataTable Table)
{
    int incomingRows = Table.Rows.Count;

    if (incomingRows >= RowsThreshold)
    {
        // asyncWriteRows(Table)
        return;
    }

    if ((RowsInMemory + incomingRows) >= RowsThreshold)
    {
        // copy and clear internal table
        // asyncWriteRows(copyTable)
    }

    internalTable.Merge(Table);
}
There is one problem with this algorithm. Given RowsThreshold = 10000:
If incomingRows puts RowsInMemory over RowsThreshold: (1) asynchronously write out the data, (2) merge the incoming data.
If incomingRows itself is over RowsThreshold, asynchronously write out the incoming data.
But what if a second thread spins up and calls asyncWriteRows(xxxTable), so that each thread owning the asynchronous method is writing to the same table in SQL Server? Does SQL Server handle this sort of multi-threaded write to the same table?
Follow up
Based on Greg D's suggestion:
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(connectionString,
    SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.UseInternalTransaction))
{
    // perform bulkcopy
}
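For illustration, the // perform bulkcopy placeholder would be roughly the following (the destination table name is made up):

bulkCopy.DestinationTableName = "dbo.TargetTable"; // made-up name
bulkCopy.WriteToServer(copyTable);                 // copyTable is the snapshot DataTable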
Regardless, I still have the issue of signaling the asyncWriteRows(copyTable). The algorithm needs to determine when to go ahead and copy internalTable, clear internalTable, and call asyncWriteRows(copyTable). I think what I need to do is move the internalTable.Copy() call to its own method:
private DataTable CopyTable(DataTable srcTable)
{
    lock (key)
    {
        return srcTable.Copy();
    }
}
...and then the following changes to the InQueue method:
public void InQueueTable(DataTable Table)
{
    int incomingRows = Table.Rows.Count;

    if (incomingRows >= RowsThreshold)
    {
        // asyncWriteRows(Table)
        return;
    }

    if ((RowsInMemory + incomingRows) >= RowsThreshold)
    {
        // copy and clear internal table
        // asyncWriteRows(CopyTable(Table))
    }

    internalTable.Merge(Table);
}
...finally, add a callback method:
private void WriteCallback(IAsyncResult iaSyncResult)
{
    int rowCount = (int)iaSyncResult.AsyncState;
    if (RowsInMemory >= rowCount)
    {
        asyncWriteRows(CopyTable(internalTable));
    }
}
This is what I have determined as a solution. Any feedback?
Is there some reason you can't use transactions?
I'll admit now that I'm not an expert in this field.
With transactions and cursors you will get lock escalation if your operation is large. E.g. your operation will start by locking a row, then a page, then a table if it needs to, preventing other operations from functioning.
The idiot that I was assumed that SQL Server would just queue these blocked operations up and wait for locks to be released, but it just returns errors and it's up to the API programmer to keep retrying (someone correct me if I'm wrong, or if this has been fixed in a later version).
If you are happy to read possibly stale data that you then copy over, as we were, you can change the isolation mode to stop the server blocking operations unnecessarily:
ALTER DATABASE [dbname] SET READ_COMMITTED_SNAPSHOT ON;
You may also alter your insert statements to use NOLOCK. But please read up on this first.