How to write JSON to Event Hub correctly - c#

I'm batching serialized records (in a JArray) to send to Event Hub. When I write the data to Event Hubs, extra quotation marks are being inserted around the JSON, i.e. what gets written is "{"myjson":"blah"}" rather than {"myjson":"blah"}, so downstream I'm having trouble reading it.
Based on this guidance, I must convert the JSON to a string and then use GetBytes to pass it into an EventData object. I suspect my attempt at following this guidance is where the issue arises.
using System.Text;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
// EventData comes from your Event Hubs client library (e.g. Microsoft.ServiceBus.Messaging or Microsoft.Azure.EventHubs).

public static class EventDataTransform
{
    public static EventData ToEventData(dynamic eventObject, out int payloadSize)
    {
        // Serialize the token to compact JSON, then encode it as UTF-8 bytes for the EventData payload.
        string json = eventObject.ToString(Formatting.None);
        payloadSize = Encoding.UTF8.GetByteCount(json);
        var payload = Encoding.UTF8.GetBytes(json);
        var eventData = new EventData(payload);
        return eventData;
    }
}
How should an item from a JArray containing serialized data be converted into the contents of an EventData message?
Code call location - used for batching into parcels of up to 256 KB
public bool MoveNext()
{
    var batch = new List<EventData>(_allEvents.Count);
    var batchSize = 0;
    for (int i = _lastBatchedEventIndex; i < _allEvents.Count; i++)
    {
        dynamic evt = _allEvents[i];
        int payloadSize = 0;
        var eventData = EventDataTransform.ToEventData(evt, out payloadSize);
        var eventSize = payloadSize + EventDataOverheadBytes;
        if (batchSize + eventSize > MaxBatchSizeBytes)
        {
            break;
        }
        batch.Add(eventData);
        batchSize += eventSize;
    }
    _lastBatchedEventIndex += batch.Count();
    _currentBatch = batch;
    return _currentBatch.Count() > 0;
}

Sounds like the JArray already contains serialized objects (strings). Calling .ToString(Formatting.None) will serialize it a second time (wrapping it in quotes).
Interestingly enough, if you call .ToString() without passing in a Formatting, it will not serialize it again.
This fiddle demonstrates this: https://dotnetfiddle.net/H4p6KL
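For illustration, here is a minimal sketch of the difference, assuming each JArray element is a JValue wrapping an already-serialized JSON string (the element value here is made up):

// Hypothetical element: a JValue holding a string that is already JSON.
JToken item = new JValue("{\"myjson\":\"blah\"}");

string quoted = item.ToString(Formatting.None); // "{\"myjson\":\"blah\"}" - serialized again, wrapped in extra quotes
string raw = item.ToString();                   // {"myjson":"blah"} - the original payload

// raw (not quoted) is what should be handed to EventData (Encoding lives in System.Text).
var eventData = new EventData(Encoding.UTF8.GetBytes(raw));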

Related

Protocol Buffer Field Fails to Serialize, Ends Up as Null

I'm using Google Protobuf 3.17.3.
PS C:\Users\Name> protoc --version
libprotoc 3.17.3
I'm running into an issue where, upon serialization of a given object, a field that is non-null is not being serialized. I know this because, when the proto message gets to the server side, its fields are null despite being non-null on the client side.
For example, in the code snippet below, while Google.Protobuf.ExchangeProto.Update's UpdateNewOrder field is non-null, after deserializing on the server side this field ends up as null. The UpdateType field serializes just fine.
public async Task SendNewOrderAsync(NewOrder newOrder)
{
    var networkStream = _tcpClient.GetStream();
    using var ms = new MemoryStream();
    {
        using var gms = new Google.Protobuf.CodedOutputStream(ms);
        var update = new Google.Protobuf.ExchangeProto.Update()
        {
            UpdateType = Google.Protobuf.ExchangeProto.Update.Types.UpdateType.NewOrder,
            UpdateNewOrder = ProtoAdapter.NewOrderToProto(newOrder), // <-- Problematic field!
        };
        update.WriteTo(gms);
    }
    await networkStream.WriteAsync(ms.ToArray());
}
It appears this isn't a server-side problem, though. I noticed, post-serialization on the client side, that the underlying MemoryStream object is particularly small (only 27 bytes).
Client Side Memory Stream Object Details
Server Side Object Post-Parse - Missing UpdateNewOrder Field
ProtoAdapter is where the to-and-from-proto logic resides. This particular NewOrderToProto method looks like this:
public static Google.Protobuf.ExchangeProto.NewOrder NewOrderToProto(NewOrder newOrder)
{
    return new Google.Protobuf.ExchangeProto.NewOrder()
    {
        Symbol = newOrder.Symbol,
        IsBuy = newOrder.IsBuy,
        OrderId = newOrder.OrderID,
        Price = newOrder.Price,
        Quantity = newOrder.Quantity,
    };
}
The update message I'm attempting to serialize looks like the following:
message Update
{
    enum UpdateType
    {
        UpdateType_NewOrder = 0;
        UpdateType_CancelOrder = 1;
    }

    UpdateType update_type = 1;
    NewOrder update_new_order = 2;
    CancelOrder update_cancel_order = 3;
}
The message that isn't serializing properly, referenced by the Update message's update_new_order field, looks like this:
message NewOrder
{
    string symbol = 1;
    int32 quantity = 2;
    int64 price = 3;
    bool is_buy = 4;
    uint64 order_id = 5;
}
Question: Why is the UpdateNewOrder field not being serialized?
So I looked into this and started playing around with the order of operations on the server side. The issue wasn't on the client side; I found that out by deserializing the message on the client side right after serializing it and noticing that the UpdateNewOrder field was non-null, which is exactly what I needed to rule out a client-side issue.
I then started playing with the server side and noticed that the Parser's ParseFrom method can take any object that inherits from Stream, but it seems to work best when passing in a CodedInputStream, as shown below.
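The original snippet wasn't included above, so here is a minimal sketch of what that server-side parse might look like, assuming networkStream is the stream the client writes to and Update is the generated message type:

// Wrap the incoming stream in a CodedInputStream and parse the Update message from it.
using var cis = new Google.Protobuf.CodedInputStream(networkStream);
var update = Google.Protobuf.ExchangeProto.Update.Parser.ParseFrom(cis);

// UpdateNewOrder should now be populated when UpdateType is NewOrder.
Console.WriteLine(update.UpdateType);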

C# - OutOfMemoryException saving a List on a JSON file

I'm trying to save the streaming data of a pressure map.
Basically I have a pressure matrix defined as:
double[,] pressureMatrix = new double[e.Data.GetLength(0), e.Data.GetLength(1)];
Basically, I get a new pressureMatrix every 10 milliseconds, and I want to save all the information in a JSON file so I can reproduce it later.
What I do is, first of all, write what I call the header with all the settings used to do the recording like this:
recordedData.softwareVersion = Assembly.GetExecutingAssembly().GetName().Version.Major.ToString() + "." + Assembly.GetExecutingAssembly().GetName().Version.Minor.ToString();
recordedData.calibrationConfiguration = calibrationConfiguration;
recordedData.representationConfiguration = representationSettings;
recordedData.pressureData = new List<PressureMap>();
var json = JsonConvert.SerializeObject(csvRecordedData, Formatting.None);
File.WriteAllText(this.filePath, json);
Then, every time I get a new pressure map I create a new Thread to add the new PressureMatrix and re-write the file:
var newPressureMatrix = new PressureMap(datos, DateTime.Now);
recordedData.pressureData.Add(newPressureMatrix);
var json = JsonConvert.SerializeObject(recordedData, Formatting.None);
File.WriteAllText(this.filePath, json);
After about 20-30 minutes I get an OutOfMemoryException because the system cannot hold the recordedData variable: the List<PressureMap> inside it has grown too large.
How can I handle this so I can keep saving the data? I would like to record 24-48 hours of information.
Your basic problem is that you are holding all of your pressure map samples in memory rather than writing each one individually and then allowing it to be garbage collected. What's worse, you are doing this in two different places:
1. You serialize your entire list of samples to a JSON string json before writing the string to a file. Instead, as explained in Performance Tips: Optimize Memory Usage, you should serialize and deserialize directly to and from your file in such situations. For instructions on how to do this see this answer to Can Json.NET serialize / deserialize to / from a stream? and also Serialize JSON to a file; a minimal sketch follows after this list.
2. The recordedData.pressureData list accumulates every pressure map sample, and you rewrite all of them to disk every time a new sample arrives. A better solution would be to write each sample once and forget it, but the requirement for each sample to be nested inside some container objects in the JSON makes it nonobvious how to do that.
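For issue #1, a minimal sketch of the stream-based fix (it still rewrites the whole file on every call, so it only addresses the string-in-memory problem, not issue #2):

// Serialize straight to the file instead of building the whole JSON string in memory first.
using (var streamWriter = new StreamWriter(this.filePath))
{
    JsonSerializer.CreateDefault().Serialize(streamWriter, recordedData);
}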
So, how to attack issue #2?
First, let's modify your data model as follows, partitioning the header data into a separate class:
public class PressureMap
{
    public double[,] PressureMatrix { get; set; }
}

public class CalibrationConfiguration
{
    // Data model not included in question
}

public class RepresentationConfiguration
{
    // Data model not included in question
}

public class RecordedDataHeader
{
    public string SoftwareVersion { get; set; }
    public CalibrationConfiguration CalibrationConfiguration { get; set; }
    public RepresentationConfiguration RepresentationConfiguration { get; set; }
}

public class RecordedData
{
    // Ensure the header is serialized first.
    [JsonProperty(Order = 1)]
    public RecordedDataHeader RecordedDataHeader { get; set; }

    // Ensure the pressure data is serialized last.
    [JsonProperty(Order = 2)]
    public IEnumerable<PressureMap> PressureData { get; set; }
}
Option #1 is a version of the producer-consumer pattern. It involves spinning up two threads: one to generate PressureMap samples, and one to serialize the RecordedData. The first thread will generate samples and add them to a BlockingCollection<PressureMap> that is passed to the second thread. The second thread will then serialize BlockingCollection<PressureMap>.GetConsumingEnumerable() as the value of RecordedData.PressureData.
The following code gives a skeleton for how to do this:
var sampleCount = 400; // Or whatever stopping criterion you prefer
var sampleInterval = 10; // in ms

using (var pressureData = new BlockingCollection<PressureMap>())
{
    // Adapted from
    // https://learn.microsoft.com/en-us/dotnet/standard/collections/thread-safe/blockingcollection-overview
    // https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.blockingcollection-1?view=netframework-4.7.2

    // Spin up a Task to sample the pressure maps
    using (Task t1 = Task.Factory.StartNew(() =>
    {
        for (int i = 0; i < sampleCount; i++)
        {
            var data = GetPressureMap(i);
            Console.WriteLine("Generated sample {0}", i);
            pressureData.Add(data);
            System.Threading.Thread.Sleep(sampleInterval);
        }
        pressureData.CompleteAdding();
    }))
    {
        // Spin up a Task to consume the BlockingCollection
        using (Task t2 = Task.Factory.StartNew(() =>
        {
            var recordedDataHeader = new RecordedDataHeader
            {
                SoftwareVersion = softwareVersion,
                CalibrationConfiguration = calibrationConfiguration,
                RepresentationConfiguration = representationConfiguration,
            };
            var settings = new JsonSerializerSettings
            {
                ContractResolver = new CamelCasePropertyNamesContractResolver(),
            };
            using (var stream = new FileStream(this.filePath, FileMode.Create))
            using (var textWriter = new StreamWriter(stream))
            using (var jsonWriter = new JsonTextWriter(textWriter))
            {
                int j = 0;
                var query = pressureData
                    .GetConsumingEnumerable()
                    .Select(p =>
                    {
                        // Flush the writer periodically in case the process terminates abnormally
                        jsonWriter.Flush();
                        Console.WriteLine("Serializing item {0}", j++);
                        return p;
                    });
                var recordedData = new RecordedData
                {
                    RecordedDataHeader = recordedDataHeader,
                    // Since PressureData is declared as IEnumerable<PressureMap>, evaluation will be lazy.
                    PressureData = query,
                };
                Console.WriteLine("Beginning serialization of {0} to {1}:", recordedData, this.filePath);
                JsonSerializer.CreateDefault(settings).Serialize(textWriter, recordedData);
                Console.WriteLine("Finished serialization of {0} to {1}.", recordedData, this.filePath);
            }
        }))
        {
            Task.WaitAll(t1, t2);
        }
    }
}
Notes:
This solution uses the fact that, when serializing an IEnumerable<T>, Json.NET will not materialize the enumerable as a list. Instead it will take full advantage of lazy evaluation and simply enumerate through it, writing then forgetting each individual item encountered.
The first thread generates PressureMap samples and adds them to the blocking collection.
The second thread wraps the blocking collection in an IEnumerable<PressureMap> and then serializes that as RecordedData.PressureData.
During serialization, the serializer will enumerate through the IEnumerable<PressureMap>, streaming each item to the JSON file and then proceeding to the next, effectively blocking until the next sample becomes available.
You will need to do some experimentation to make sure that the serialization thread can "keep up" with the sampling thread, possibly by setting a BoundedCapacity during construction (see the sketch after these notes). If not, you may need to adopt a different strategy.
PressureMap GetPressureMap(int count) should be some method of yours (not shown in the question) that returns the current pressure map sample.
In this technique the JSON file remains open for the duration of the sampling session. If sampling terminates abnormally the file may be truncated. I make some attempt to ameliorate the problem by flushing the writer periodically.
While data serialization will no longer require an unbounded amount of memory, deserializing a RecordedData later will materialize the PressureData sequence into a concrete List<PressureMap>. This may cause memory issues during downstream processing.
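As a small aside on the BoundedCapacity note above, a hypothetical bounded construction looks like this (the capacity of 1000 is an arbitrary illustration, not a recommendation); Add() then blocks once 1000 samples are queued but not yet serialized:

var pressureData = new BlockingCollection<PressureMap>(boundedCapacity: 1000);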
Demo fiddle #1 here.
Option #2 would be to switch from a JSON file to a Newline Delimited JSON file. Such a file consists of sequences of JSON objects separated by newline characters. In your case, you would make the first object contain the RecordedDataHeader information, and the subsequent objects be of type PressureMap:
var sampleCount = 100; // Or whatever
var sampleInterval = 10;
var recordedDataHeader = new RecordedDataHeader
{
    SoftwareVersion = softwareVersion,
    CalibrationConfiguration = calibrationConfiguration,
    RepresentationConfiguration = representationConfiguration,
};
var settings = new JsonSerializerSettings
{
    ContractResolver = new CamelCasePropertyNamesContractResolver(),
};

// Write the header
Console.WriteLine("Beginning serialization of sample data to {0}.", this.filePath);
using (var stream = new FileStream(this.filePath, FileMode.Create))
{
    JsonExtensions.ToNewlineDelimitedJson(stream, new[] { recordedDataHeader });
}

// Write each sample incrementally
for (int i = 0; i < sampleCount; i++)
{
    Thread.Sleep(sampleInterval);
    Console.WriteLine("Performing sample {0} of {1}", i, sampleCount);
    var map = GetPressureMap(i);
    using (var stream = new FileStream(this.filePath, FileMode.Append))
    {
        JsonExtensions.ToNewlineDelimitedJson(stream, new[] { map });
    }
}
Console.WriteLine("Finished serialization of sample data to {0}.", this.filePath);
Using the extension methods:
public static partial class JsonExtensions
{
    // Adapted from the answer to
    // https://stackoverflow.com/questions/44787652/serialize-as-ndjson-using-json-net
    // by dbc https://stackoverflow.com/users/3744182/dbc
    public static void ToNewlineDelimitedJson<T>(Stream stream, IEnumerable<T> items)
    {
        // Let caller dispose the underlying stream
        using (var textWriter = new StreamWriter(stream, new UTF8Encoding(false, true), 1024, true))
        {
            ToNewlineDelimitedJson(textWriter, items);
        }
    }

    public static void ToNewlineDelimitedJson<T>(TextWriter textWriter, IEnumerable<T> items)
    {
        var serializer = JsonSerializer.CreateDefault();
        foreach (var item in items)
        {
            // Formatting.None is the default; I set it here for clarity.
            using (var writer = new JsonTextWriter(textWriter) { Formatting = Formatting.None, CloseOutput = false })
            {
                serializer.Serialize(writer, item);
            }
            // http://specs.okfnlabs.org/ndjson/
            // Each JSON text MUST conform to the [RFC7159] standard and MUST be written to the stream followed by the newline character \n (0x0A).
            // The newline character MAY be preceded by a carriage return \r (0x0D). The JSON texts MUST NOT contain newlines or carriage returns.
            textWriter.Write("\n");
        }
    }

    // Adapted from the answer to
    // https://stackoverflow.com/questions/29729063/line-delimited-json-serializing-and-de-serializing
    // by Yuval Itzchakov https://stackoverflow.com/users/1870803/yuval-itzchakov
    public static IEnumerable<TBase> FromNewlineDelimitedJson<TBase, THeader, TRow>(TextReader reader)
        where THeader : TBase
        where TRow : TBase
    {
        bool first = true;
        using (var jsonReader = new JsonTextReader(reader) { CloseInput = false, SupportMultipleContent = true })
        {
            var serializer = JsonSerializer.CreateDefault();
            while (jsonReader.Read())
            {
                if (jsonReader.TokenType == JsonToken.Comment)
                    continue;
                if (first)
                {
                    yield return serializer.Deserialize<THeader>(jsonReader);
                    first = false;
                }
                else
                {
                    yield return serializer.Deserialize<TRow>(jsonReader);
                }
            }
        }
    }
}
Later, you can process the newline delimited JSON file as follows:
using (var stream = File.OpenRead(filePath))
using (var textReader = new StreamReader(stream))
{
    foreach (var obj in JsonExtensions.FromNewlineDelimitedJson<object, RecordedDataHeader, PressureMap>(textReader))
    {
        if (obj is RecordedDataHeader)
        {
            var header = (RecordedDataHeader)obj;
            // Process the header
            Console.WriteLine(JsonConvert.SerializeObject(header));
        }
        else
        {
            var row = (PressureMap)obj;
            // Process the row.
            Console.WriteLine(JsonConvert.SerializeObject(row));
        }
    }
}
Notes:
This approach looks simpler because the samples are added incrementally to the end of the file, rather than inserted inside some overall JSON container.
With this approach both serialization and downstream processing can be done with bounded memory use.
The sample file does not remain open for the duration of sampling, so is less likely to be truncated.
Downstream applications may not have built-in tools for processing newline delimited JSON.
This strategy may integrate more simply with your current threading code.
Demo fiddle #2 here.

Validate object against a schema before serialization

I want to serialize a C# object as JSON into a stream, but to avoid the serialization if the object is not valid according to a schema. How should I proceed with this task using JSON.NET and Json.NET Schema? From what I see there is no method in the JSON.NET library which allows the validation of a C# object against a JSON schema. It seems somewhat weird that there is no direct method to just validate the C# object without encoding it. Do you have any idea why this method is not available?
It seems this API is not currently available. At a guess, this might be because recursively generating the JSON values to validate involves most of the work of serializing the object. Or it could just be because no one at Newtonsoft ever designed, specified, implemented, tested, documented and shipped that feature.
If you want, you could file an enhancement request requesting this API, probably as a part of the SchemaExtensions class.
In the meantime, if you do need to test-validate a POCO without generating a complete serialization of it (because e.g. the result would be very large), you could grab NullJsonWriter from Reference to automatically created objects, wrap it in a JSchemaValidatingWriter and test-serialize your object as shown in Validate JSON with JSchemaValidatingWriter. NullJsonWriter doesn't actually write anything, and so using it eliminates the performance and memory overhead of generating a complete serialization (either as a string or as a JToken).
First, add the following static method:
public static class JsonExtensions
{
    public static bool TestValidate<T>(T obj, JSchema schema, SchemaValidationEventHandler handler = null, JsonSerializerSettings settings = null)
    {
        using (var writer = new NullJsonWriter())
        using (var validatingWriter = new JSchemaValidatingWriter(writer) { Schema = schema })
        {
            int count = 0;
            if (handler != null)
                validatingWriter.ValidationEventHandler += handler;
            validatingWriter.ValidationEventHandler += (o, a) => count++;
            JsonSerializer.CreateDefault(settings).Serialize(validatingWriter, obj);
            return count == 0;
        }
    }
}

// Used to enable Json.NET to traverse an object hierarchy without actually writing any data.
class NullJsonWriter : JsonWriter
{
    public NullJsonWriter()
        : base()
    {
    }

    public override void Flush()
    {
        // Do nothing.
    }
}
Then use it like:
// Example adapted from
// https://www.newtonsoft.com/jsonschema/help/html/JsonValidatingWriterAndSerializer.htm
// by James Newton-King
string schemaJson = @"{
  'description': 'A person',
  'type': 'object',
  'properties': {
    'name': {'type':'string'},
    'hobbies': {
      'type': 'array',
      'maxItems': 3,
      'items': {'type':'string'}
    }
  }
}";

var schema = JSchema.Parse(schemaJson);
var person = new
{
    Name = "James",
    Hobbies = new[] { ".Net", "Blogging", "Reading", "XBox", "LOLCATS" },
};
var settings = new JsonSerializerSettings { ContractResolver = new CamelCasePropertyNamesContractResolver() };
var isValid = JsonExtensions.TestValidate(person, schema, (o, a) => Console.WriteLine(a.Message), settings);
// Prints Array item count 5 exceeds maximum count of 3. Path 'hobbies'.
Console.WriteLine("isValid = {0}", isValid);
// Prints isValid = False
Watch out for casing, by the way: Json.NET Schema is case sensitive, so you will need to use an appropriate contract resolver when test-validating.
Sample fiddle.
You cannot do that from the JSON string; you need an object and a schema to compare with first.
public void Validate()
{
    //...
    JsonSchema schema = JsonSchema.Parse("{'pattern':'lol'}");
    JToken stringToken = JToken.FromObject("pie");
    stringToken.Validate(schema);
}

Is it possible to include the initializer from for loop in the name of a variable that will be declared inside it?

static void Main(string[] args)
{
    for (int i = 0; i < int.MaxValue; i++)
    {
        client client[i] = new client();
    }
}
I need the loop to declare a new variable automatically on every iteration; the variables would be called client1, then client2, and so on. Or is there a better way to achieve this with a loop?
Instead of creating n variables for n iterations, why not use a List<client>, like so:
var clients = new List<client>();
for (int i = 0; i < int.MaxValue; i++)
{
    clients.Add(new client());
}
Or even simpler:
var clients = Enumerable.Range(0, int.MaxValue).Select(x => new client()).ToList();
Or still simpler (though note that Enumerable.Repeat reuses the same single client instance for every element, unlike the Select version above):
var clients = Enumerable.Repeat(new client(), int.MaxValue);
Now you can access any client by its index, e.g.:
client c = clients[0];
Anyway, be aware that you're creating int.MaxValue client instances; depending on what client is, you're going to burn through your memory.
NB: Please follow the usual naming conventions, e.g. name classes like your client class in UpperCase: Client.
If you really need access to all your variables, #HimBromBeere is right. But if, for example, you have a static field in the Client class and only need the current variable, you can do something like:
while(true) { var c = new Client(); }
In this case you can check your current state. For example, c.Name will give you information about the client being processed in the current iteration. This works if you do all your work for each client inside the loop and don't need to store info about the rest of the clients any longer.
Update
My answer was ambiguous. I meant that the Client constructor can do something like:
class Client
{
    static int count = 0;

    public string Name { get; set; }

    public Client()
    {
        Name = string.Format("client{0}", count++);
    }
}
In this case the class has a static count field that records how many clients have been created. We don't use it elsewhere in the code, but we can identify the current client by its name.
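A quick usage sketch, assuming the Client class above: each new instance picks up the next sequential name (note the counter starts at 0).

var first = new Client();       // first.Name == "client0"
var second = new Client();      // second.Name == "client1"
Console.WriteLine(second.Name); // prints "client1"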

Classification of instances in Weka

I'm trying to use Weka in my C# application. I've used IKVM to bring the Java parts into my .NET application. This seems to be working quite well. However, I am at a loss when it comes to Weka's API. How exactly do I classify instances if they are programmatically passed around in my application and not available as ARFF files?
Basically, I am trying to integrate a simple co-reference analysis using Weka's classifiers. I've built the classification model in Weka directly and saved it to disk, from where my .NET application opens it and uses the IKVM port of Weka to predict the class value.
Here is what I've got so far:
// This is the "entry" method for the classification method
public IEnumerable<AttributedTokenDecorator> Execute(IEnumerable<TokenPair> items)
{
TokenPair[] pairs = items.ToArray();
Classifier model = ReadModel(); // reads the Weka generated model
FastVector fv = CreateFastVector(pairs);
Instances instances = new Instances("licora", fv, pairs.Length);
CreateInstances(instances, pairs);
for(int i = 0; i < instances.numInstances(); i++)
{
Instance instance = instances.instance(i);
double classification = model.classifyInstance(instance); // array index out of bounds?
if(AsBoolean(classification))
MakeCoreferent(pairs[i]);
}
throw new NotImplementedException(); // TODO
}
// This is a helper method to create instances from the internal model files
private static void CreateInstances(Instances instances, IEnumerable<TokenPair> pairs)
{
instances.setClassIndex(instances.numAttributes() - 1);
foreach(var pair in pairs)
{
var instance = new Instance(instances.numAttributes());
instance.setDataset(instances);
for (int i = 0; i < instances.numAttributes(); i++)
{
var attribute = instances.attribute(i);
if (pair.Features.ContainsKey(attribute.name()) && pair.Features[attribute.name()] != null)
{
var value = pair.Features[attribute.name()];
if (attribute.isNumeric()) instance.setValue(attribute, Convert.ToDouble(value));
else instance.setValue(attribute, value.ToString());
}
else
{
instance.setMissing(attribute);
}
}
//instance.setClassMissing();
instances.add(instance);
}
}
// This creates the data set's attributes vector
private FastVector CreateFastVector(TokenPair[] pairs)
{
var fv = new FastVector();
foreach (var attribute in _features)
{
Attribute att;
if (attribute.Type.Equals(ArffType.Nominal))
{
var values = new FastVector();
ExtractValues(values, pairs, attribute.FeatureName);
att = new Attribute(attribute.FeatureName, values);
}
else
att = new Attribute(attribute.FeatureName);
fv.addElement(att);
}
{
var classValues = new FastVector(2);
classValues.addElement("0");
classValues.addElement("1");
var classAttribute = new Attribute("isCoref", classValues);
fv.addElement(classAttribute);
}
return fv;
}
// This extracts observed values for nominal attributes
private static void ExtractValues(FastVector values, IEnumerable<TokenPair> pairs, string featureName)
{
var strings = (from x in pairs
where x.Features.ContainsKey(featureName) && x.Features[featureName] != null
select x.Features[featureName].ToString())
.Distinct().ToArray();
foreach (var s in strings)
values.addElement(s);
}
private Classifier ReadModel()
{
return (Classifier) SerializationHelper.read(_model);
}
private static bool AsBoolean(double classifyInstance)
{
return classifyInstance >= 0.5;
}
For some reason, Weka throws an IndexOutOfRangeException when I call model.classifyInstance(instance). I have no idea why, nor can I come up with a way to rectify the issue.
I am hoping someone might know where I went wrong. The only documentation for Weka I found relies on ARFF files for prediction, and I don't really want to go there.
For some odd reason, this exception was raised by the DTNB classifier (I was using three in a majority vote classification model). Apparently, not using DTNB "fixed" the issue.
