Microsoft Avro serializer is mangling provided schemas - c#

I'm trying to write out Avro files, and having some real trouble with the serialization. I'm using Microsoft.Avro.Core, and recently discovered that when I give it a schema containing a type with an associated logicalType, it will inexplicably extract the inner type and use that to replace it! This means that my DateTime declaration of "type": {"type": "long", "logicalType": "timestamp-micros"} is now a simple "type": "long", which the recipient is unable to interpret properly.
If it were simply doing this internally to understand what data types it's working with, that would be one thing. But this mangled schema is actually being written to the output file, which is completely incorrect behavior. Does anyone know a way to fix or work around this?
(And yes, the library hasn't been updated in 5 years and is probably completely unsupported. But it was the only .NET Avro serializer I could find that fulfills one crucial requirement: allowing me to work with arbitrary types not known at compile-time. Everything else seems to want to only use generic serializers of type T, but my use case can't provide that T. So I can't abandon this library for something better unless there actually is something better that I can use. But if there is, I'd be open to it.)

After a fair amount of searching and poking through poorly-documented source code, I found a solution. Even though the official Avro library from Apache does require a generic type parameter for all of its readers and writers, if you specify the type as GenericRecord, it will let you work with the GenericRecord as a runtime-defined structure. (This is not an arbitrary dynamic type, as the values you assign to its fields still have to match the provided schema.)
Meanwhile, this library's type system has much wider support for Avro's type set than the abandoned Microsoft one does. It correctly works with Avro's logical types and converts between them and CLR types the way you would expect, with one notable exception: serializing a DateTime will convert it from the system's local time to UTC. (There's probably a way to work around this, but I haven't found it yet.)
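Here's a minimal sketch of what that approach looks like, assuming the Apache.Avro NuGet package (a recent version with logical type support); the schema and names are a cut-down example, not my real ones:

using System;
using Avro;
using Avro.File;
using Avro.Generic;

class LogicalTypeDemo
{
    static void Main()
    {
        // a record with a timestamp-micros logical type; Apache.Avro keeps the
        // logicalType in the schema it writes to the container file
        var schema = (RecordSchema)Schema.Parse(@"
        {
            ""type"": ""record"",
            ""name"": ""Event"",
            ""fields"": [
                { ""name"": ""Occurred"", ""type"": { ""type"": ""long"", ""logicalType"": ""timestamp-micros"" } }
            ]
        }");

        var record = new GenericRecord(schema);
        // values assigned to fields still have to match the schema; the logical type
        // handles the DateTime <-> long conversion (beware the local-time-to-UTC behavior above)
        record.Add("Occurred", DateTime.UtcNow);

        using (var writer = DataFileWriter<GenericRecord>.OpenWriter(
            new GenericDatumWriter<GenericRecord>(schema), "events.avro"))
        {
            writer.Append(record);
        }
    }
}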
Just leaving this here in case anyone runs into similar problems in the future.

I tried to reproduce the behavior and put together a small program that doesn't use generics.
Indeed, from what I can see, the DateTime subtype is still omitted from the schema, which is really confusing and frustrating: the receiver needs to know up front which long fields are DateTimes and which are not, possibly only by the field name?! By default it uses Ticks to represent DateTimes. I looked a bit through the Avro library on GitHub and saw it uses runtimeType for DateTime and DateTimeOffset.
It probably doesn't help you too much, but maybe it helps someone out there with similar problems.
// namespaces as shipped in Microsoft.Avro.Core (Microsoft.Hadoop.Avro)
using System;
using System.Collections;
using System.IO;
using System.Runtime.Serialization;
using Microsoft.Hadoop.Avro;
using Microsoft.Hadoop.Avro.Container;

[DataContract]
public struct TestMsg
{
    [DataMember]
    public int Id;
    [DataMember]
    public double Amount;
    [DataMember]
    public DateTime OrderSubmitted;

    public TestMsg(int id, double amount, DateTime orderSubmitted)
    {
        Id = id;
        Amount = amount;
        OrderSubmitted = orderSubmitted;
    }
}
internal class Program
{
    static void Main(string[] args)
    {
        string fileName = "Messages.avro";
        // Path.Combine handles the separator on both Windows and Unix,
        // replacing the original Environment.NewLine check
        string filePath = Path.Combine(Directory.GetCurrentDirectory(), fileName);

        var al = new ArrayList
        {
            new TestMsg(1, 189.12, DateTime.Now),
            new TestMsg(2, 345.94, new DateTime(2000, 1, 10, 15, 20, 23, 103))
        };

        var schema = @"
        {
            ""type"" : ""record"",
            ""name"" : ""TestAvro.TestMsg"",
            ""fields"" : [
                {
                    ""name"" : ""Id"",
                    ""type"" : ""int""
                },
                {
                    ""name"" : ""Amount"",
                    ""type"" : ""double""
                },
                {
                    ""name"" : ""OrderSubmitted"",
                    ""type"" : ""long"",
                    ""runtimeType"" : ""DateTime""
                }
            ]
        }";

        using (var dataStream = new FileStream(filePath, FileMode.Create))
        {
            var serializer = AvroSerializer.CreateGeneric(schema);
            using (var avroWriter = AvroContainer.CreateGenericWriter(schema, dataStream, Codec.Null))
            using (var seqWriter = new SequentialWriter<object>(avroWriter, al.Count))
            {
                foreach (var item in al)
                {
                    dynamic record = new AvroRecord(serializer.WriterSchema);
                    record.Id = ((TestMsg)item).Id;
                    record.Amount = ((TestMsg)item).Amount;
                    // Avro has no native DateTime here, so the long field carries Ticks
                    record.OrderSubmitted = ((TestMsg)item).OrderSubmitted.Ticks;
                    seqWriter.Write(record);
                }
            }
        }

        Console.WriteLine("Now reading file.");

        using (var dataStream = new FileStream(filePath, FileMode.Open))
        using (var avroReader = AvroContainer.CreateGenericReader(schema, dataStream, true, new CodecFactory()))
        using (var seqReader = new SequentialReader<object>(avroReader))
        {
            foreach (dynamic obj in seqReader.Objects)
            {
                Console.WriteLine($" {obj.Id} - {obj.Amount} - {new DateTime(obj.OrderSubmitted)}");
            }
        }

        Console.ReadLine();
    }
}

Related

How to serialize/deserialize from ksql avro format to c# using confluent platform

I am using ksqlDB with a table of the following form:
ksqlDB query
create table currency (id integer,name varchar) with (kafka_topic='currency',partitions=1,value_format='avro');
C# model
public class Currency
{
    public int Id { get; set; }
    public string Name { get; set; }
}
Now I want to know how I should write/read data from this topic in C# using the Confluent library:
Writing
IProducer<int, Currency> producer = ....
Currency cur = new Currency();
Message<int, Currency> message = new Message<int, Currency>
{
    Key = cur.Id,
    Timestamp = new Timestamp(DateTime.UtcNow, TimestampType.CreateTime),
    Value = cur
};
DeliveryResult<int, Currency> delivery = await producer.ProduceAsync(topic, message);
Reading
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetKeyDeserializer(Deserializers.Int32) // I assume I need to use the id from my dto
    .SetValueDeserializer(...) // which deserializer?
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
Currency message = // which deserializer? JsonSerializer.Deserialize<Currency>(result.Message.Value);
I am not sure how to go about this, so I tried looking for a serializer. I found this library, AvroSerializer, but I do not get where the author fetches the schema.
Any help on how to read/write to a specific topic that would match my ksqldb models?
Update
After some research and some answers here, I have started using the schema registry:
var config = new ConsumerConfig
{
    GroupId = kafkaConfig.ConsumerGroup,
    BootstrapServers = kafkaConfig.ServerUrl,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
var schemaRegistryConfig = new SchemaRegistryConfig
{
    Url = kafkaConfig.SchemaRegistryUrl
};
var schemaRegistry = new CachedSchemaRegistryClient(schemaRegistryConfig);
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetKeyDeserializer(new AvroDeserializer<int>(schemaRegistry).AsSyncOverAsync())
    .SetValueDeserializer(new AvroDeserializer<Currency>(schemaRegistry).AsSyncOverAsync())
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
Now I am getting another error:
Expecting data framing of length 5 bytes or more but total data size is 4 bytes
As someone kindly pointed out, it seems I am retrieving only the id from the schema registry.
How can I just insert into currency (id,name) values (1,3) and retrieve it in C# as a POCO (listed above)?
Update 2
After I found this source program, it seems I am not able to publish messages to tables for some reason.
There is no error when sending the message, but it is not published to Kafka.
I found this library, AvroSerializer, but I do not get where the author fetches the schema.
Unclear why you need to use a library other than the Confluent one, but they get it from the Schema Registry. You can use CachedSchemaRegistryClient to get the schema string easily; however, you shouldn't need this in your code, as the deserializer will download the schema from the registry on its own.
If you refer to the examples/ directory in the confluent-kafka-dotnet repo for specific Avro consumption, you can see they generate the User class from a User.avsc file, which seems to be exactly what you want to do here for Currency rather than writing it yourself.
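As a rough sketch of that route (assuming the Confluent.SchemaRegistry.Serdes.Avro package and a Currency class generated from a Currency.avsc with the avrogen tool; the registry URL, topic, and producerConfig are illustrative):

// one-time code generation from the schema (Apache.Avro.Tools dotnet tool):
//   avrogen -s Currency.avsc .
// the generated Currency implements ISpecificRecord, which the Avro serdes require

var schemaRegistryConfig = new SchemaRegistryConfig { Url = "http://localhost:8081" };

using (var schemaRegistry = new CachedSchemaRegistryClient(schemaRegistryConfig))
using (var producer = new ProducerBuilder<int, Currency>(producerConfig)
    .SetKeySerializer(new AvroSerializer<int>(schemaRegistry))       // matches AvroDeserializer<int> on the consumer
    .SetValueSerializer(new AvroSerializer<Currency>(schemaRegistry)) // registers/uses the schema automatically
    .Build())
{
    var cur = new Currency { Id = 1, Name = "dollar" };
    await producer.ProduceAsync("currency", new Message<int, Currency> { Key = cur.Id, Value = cur });
}

Produced this way, the message carries the schema-registry framing (magic byte plus 4-byte schema id) that the AvroDeserializer on the consumer side expects.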
I have solved the problem by defining my own custom serializer, implementing the ISerializer<T> and IDeserializer<T> interfaces, which underneath are just wrappers over System.Text.Json.JsonSerializer (or Newtonsoft.Json).
Serializer
// requires: using System; using System.Text; using Confluent.Kafka;
public class MySerializer<T> : ISerializer<T>, IDeserializer<T>
{
    public byte[] Serialize(T data, SerializationContext context)
    {
        var str = System.Text.Json.JsonSerializer.Serialize(data); // you can also use Newtonsoft here
        return Encoding.UTF8.GetBytes(str);
    }

    public T Deserialize(ReadOnlySpan<byte> data, bool isNull, SerializationContext context)
    {
        return isNull ? default : System.Text.Json.JsonSerializer.Deserialize<T>(data);
    }
}
Usage
var config = new ConsumerConfig
{
    GroupId = kafkaConfig.ConsumerGroup,
    BootstrapServers = kafkaConfig.ServerUrl,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetValueDeserializer(new MySerializer<Currency>())
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
P.S. I am not even using the schema registry here after I implemented the interface.

C# - Saving and Loading data to file

I decided to get into coding and am learning C#. After making a few small projects, I decided to step it up a little and make a text adventure game, with saving and loading, and if I get to feeling zany I'll try to add some multiplayer. While I haven't really hit a roadblock because of it, I can't help but feel that I am doing the load function REALLY sub-optimally. The save is fine and works for me, but the load I feel can be really simplified; I just don't know what to use.
I also wouldn't really mind, but this way, if I add other attributes/skills or whatever else that needs to be saved, I will have to add everything to the load function as well, and it will get even longer.
I have tried to search around on here, the C# documentation, and other sites, but can't find a solution that works for this case. Can anyone help me find a better way of doing this? Or is this the best I can really do, since it's varying data types?
Edit: To simplify and clarify what answer I am searching for: I am trying to find a simpler and more scalable way to save and load the data to a file.
static void LoadGame(CharData PlayerData)
{
    Console.WriteLine("Enter the name of the character to load as shown below.");
    // get current directory info
    DirectoryInfo di = new DirectoryInfo(Directory.GetCurrentDirectory());
    // need to initialize these outside of the loop
    int SaveFiles = 0;
    string DisplayName = " ";
    int DisplayNameLength = 0;
    // look through files in the working directory ending in '.fasv', display them in the format '{x}. John Smith'
    foreach (var fi in di.GetFiles("*.fasv"))
    {
        SaveFiles++;
        DisplayNameLength = fi.Name.Length;
        // remove .fasv from the displayed name to make it look nicer
        DisplayName = fi.Name.Remove(DisplayNameLength - 5, 5);
        Console.WriteLine(SaveFiles.ToString() + ". " + DisplayName);
    }
    string toLoad = Console.ReadLine();
    using StreamReader sr = new StreamReader(toLoad + ".fasv");
    // the name is easy to get since it's a string. but integers...
    PlayerData.Name = sr.ReadLine();
    // ... not so much. i hate all of this and i feel like it's gross, but i don't know how else to do it
    int hp, xp, level, toughness, innovation, mind, empathy, spryness;
    Int32.TryParse(sr.ReadLine(), out hp);
    Int32.TryParse(sr.ReadLine(), out xp);
    Int32.TryParse(sr.ReadLine(), out level);
    Int32.TryParse(sr.ReadLine(), out toughness);
    Int32.TryParse(sr.ReadLine(), out innovation);
    Int32.TryParse(sr.ReadLine(), out mind);
    Int32.TryParse(sr.ReadLine(), out empathy);
    Int32.TryParse(sr.ReadLine(), out spryness);
    PlayerData.Health = hp;
    PlayerData.Level = level;
    PlayerData.XP = xp;
    PlayerData.Toughness = toughness;
    PlayerData.Innovation = innovation;
    PlayerData.Mind = mind;
    PlayerData.Empathy = empathy;
    PlayerData.Spryness = spryness;
    InGame(PlayerData);
}

static void SaveGame(CharData PlayerData)
{
    using (StreamWriter sw = new StreamWriter(PlayerData.Name + ".fasv"))
    {
        // write player data properties to the file line by line,
        // using stat to iterate through the player data properties
        foreach (System.Reflection.PropertyInfo stat in PlayerData.GetType().GetProperties())
        {
            sw.WriteLine(stat.GetValue(PlayerData));
        }
    }
}
If you aren't set on a particular data format for the file, I would recommend using a serializer such as JSON.NET. You can use NuGet to add Newtonsoft.Json to your project, which would allow you to do something similar to:
using (StreamWriter file = File.CreateText(pathToPlayerFile))
{
    var serializer = new JsonSerializer();
    serializer.Serialize(file, playerData);
}
And then your code to read from the file would be pretty similar:
using (var file = File.OpenText(pathToPlayerFile))
{
    var serializer = new JsonSerializer();
    return (CharData)serializer.Deserialize(file, typeof(CharData));
}
I borrowed those code snippets from newtonsoft.com. CreateText will create (or overwrite) the file and write the object as a JSON object.
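Applied to the question's code, the whole save/load pair then collapses to something like this (a sketch using Newtonsoft.Json's static JsonConvert helpers; the .json extension and method shapes are my own choices):

using System.IO;
using Newtonsoft.Json;

static void SaveGame(CharData playerData)
{
    // serializes every public property, including any added later, with no code changes
    File.WriteAllText(playerData.Name + ".json",
        JsonConvert.SerializeObject(playerData, Formatting.Indented));
}

static CharData LoadGame(string characterName)
{
    return JsonConvert.DeserializeObject<CharData>(File.ReadAllText(characterName + ".json"));
}

New stats added to CharData are picked up by both methods automatically, which addresses the scaling concern in the question.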

C# GoogleAPI - How to set a time duration when variable type is "object"?

I'm stuck with a problem using Google.Apis.Testing.v1.Data, and the documentation doesn't help me.
I have to set a "timeout" value (a duration), but the variable's type is object instead of, for example, float. I tried to put in an int, a float, and a string, but none of those worked.
The object API doc is here. My variable is TestTimeout, which is definitely a duration.
When I searched for a solution, I saw that in Java the variable's type is string, but that didn't help (here).
Just for your information, I'm using this lib to execute my Android application on their test devices; it's a service called Test Lab in Firebase. The timeout value needs to be higher because I don't have enough time to execute my test. Here is my code; everything is working well besides this timeout.
TestMatrix testMatrix = new TestMatrix();
testMatrix.TestSpecification = new TestSpecification();
testMatrix.TestSpecification.TestTimeout = 600.0f; // I tested 600, 600.0f, "600", "30m", "500s"
testMatrix.EnvironmentMatrix = new EnvironmentMatrix();
testMatrix.EnvironmentMatrix.AndroidDeviceList = new AndroidDeviceList();
testMatrix.EnvironmentMatrix.AndroidDeviceList.AndroidDevices = new List<AndroidDevice>();
foreach (TestMatrixModel.TestData testData in _model.ListTests)
{
    if (testData.IsSelected)
    {
        // Here I'm using my own data class to set GoogleAPI objects; it's simple,
        // as it asks me for strings even for integer numbers, and it's working.
        foreach (int indice in testData.ChosenAndroidVersionsIndices)
        {
            AndroidDevice device = new AndroidDevice();
            device.AndroidModelId = testData.ModelID;
            device.AndroidVersionId = testData.AvailableAndroidVersions[indice];
            device.Locale = testData.AvailableLocales[testData.ChosenLocale];
            device.Orientation = testData.Orientation;
            testMatrix.EnvironmentMatrix.AndroidDeviceList.AndroidDevices.Add(device);
        }
    }
}
OK, and here is the result of the request:
{
    "testMatrixId": "matrix-2dntrwio3kco7",
    "testSpecification": {
        "testTimeout": "300s",
        "testSetup": {},
        "androidTestLoop": {
            "appApk": {
                "gcsPath": "gs://myLinkIntoGoogleCloudStorage.apk"
            }
        }
    },
    "environmentMatrix": {
        "androidDeviceList": {
            "androidDevices": [
                {
                    "androidModelId": "grandpplte",
                    "androidVersionId": "23",
                    "locale": "en_001",
                    "orientation": "landscape"
                },
                {
                    "androidModelId": "hero2lte",
                    "androidVersionId": "23",
                    "locale": "en_001",
                    "orientation": "landscape"
                },
                etc.....
As you can see, it seems to be a string set to "300s"... so why won't "500s" go in?
Thanks a lot.
OK, I got my answer:
testMatrix.TestSpecification.TestTimeout = "600s";
So it was a string and needed to end with "s". Why didn't that work when I tried? Because my code was overridden by another TestSpecification afterwards... my bad.

Deserialize Avro Spark

I'm pushing a stream of data to Azure Event Hub with the following code leveraging Microsoft.Hadoop.Avro. This code runs every 5 seconds and simply plops in the same two Avro-serialised items 👍🏼:
var strSchema = File.ReadAllText("schema.json");
var avroSerializer = AvroSerializer.CreateGeneric(strSchema);
var rootSchema = avroSerializer.WriterSchema as RecordSchema;

var itemList = new List<AvroRecord>();

dynamic record_one = new AvroRecord(rootSchema);
record_one.FirstName = "Some";
record_one.LastName = "Guy";
itemList.Add(record_one);

dynamic record_two = new AvroRecord(rootSchema);
record_two.FirstName = "A.";
record_two.LastName = "Person";
itemList.Add(record_two);

using (var buffer = new MemoryStream())
{
    using (var writer = AvroContainer.CreateGenericWriter(strSchema, buffer, Codec.Null))
    {
        using (var streamWriter = new SequentialWriter<object>(writer, itemList.Count))
        {
            foreach (var item in itemList)
            {
                streamWriter.Write(item);
            }
        }
    }
    eventHubClient.SendAsync(new EventData(buffer.ToArray()));
}
The schema used here is, again, very simple:
{
    "type": "record",
    "name": "User",
    "namespace": "SerDes",
    "fields": [
        {
            "name": "FirstName",
            "type": "string"
        },
        {
            "name": "LastName",
            "type": "string"
        }
    ]
}
I have validated this is all good with a simple view in Azure Stream Analytics on the portal.
So far so good, but I cannot, for the life of me, correctly deserialize this in Databricks leveraging the from_avro() command under Scala.
Load (the exact same) schema as a string:
val sampleJsonSchema = dbutils.fs.head("/mnt/schemas/schema.json")
Configure EventHub
val connectionString = ConnectionStringBuilder("<CONNECTION_STRING>")
.setEventHubName("<NAME_OF_EVENT_HUB>")
.build
val eventHubsConf = EventHubsConf(connectionString).setStartingPosition(EventPosition.fromEndOfStream)
val eventhubs = spark.readStream.format("eventhubs").options(eventHubsConf.toMap).load()
Read the data:
// this works, and i can see the serialised data
display(eventhubs.select($"body"))
// this fails, and with an exception: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
display(eventhubs.select(from_avro($"body", sampleJsonSchema)))
So essentially, what is going on here? I am serialising the data with the same schema I deserialize with, but something is malformed. The documentation is incredibly sparse on this front (very, very minimal on the Microsoft website).
The issue
After additional investigation (and mainly with the help of this article), I found what my problem was: from_avro(data: Column, jsonFormatSchema: String) expects the Spark schema format and not the Avro schema format. The documentation is not very clear on this.
Solution 1
Databricks provides a handy method from_avro(column: Column, subject: String, schemaRegistryUrl: String) that fetches the needed Avro schema from a Kafka schema registry and automatically converts it to the correct format.
Unfortunately, it is not available for pure Spark, nor is it possible to use it without a Kafka schema registry.
Solution 2
Use schema conversion provided by spark:
// define avro deserializer
class AvroDeserializer() extends AbstractKafkaAvroDeserializer {
  override def deserialize(payload: Array[Byte]): String = {
    // call the base class deserializer (super, not this, to avoid infinite recursion),
    // then render the record as a JSON string
    val genericRecord = super.deserialize(payload).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

// create deserializer instance
val deserializer = new AvroDeserializer()

// register deserializer
spark.udf.register("deserialize_avro", (bytes: Array[Byte]) =>
  deserializer.deserialize(bytes)
)

// get avro schema from registry (but I presume it should also work with a schema read from a local file)
val registryClient = new CachedSchemaRegistryClient(kafkaSchemaRegistryUrl, 128)
val avroSchema = registryClient.getLatestSchemaMetadata(topic + "-value").getSchema
val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

// consume data
df.selectExpr("deserialize_avro(value) as data")
  .select(from_json(col("data"), sparkSchema.dataType).as("data"))
  .select("data.*")

Migrate serialized objects to new version

I would like to migrate my previously serialized objects in the database to a new schema.
My previous object:
public interface MyReport
{
    string Id { get; set; }
    string Name { get; set; }
    Dictionary<string, string> PropColl { get; set; }
}
But for some reasons we had to make interface changes:
public interface IMarkme
{
}

public interface MyReport<T> where T : IMarkme
{
    string Id { get; set; }
    string Name { get; set; }
    T ExtendedProp { get; set; }
}

public class NewProp : IMarkme
{
    // some code here
}
So as you can see, my interface has been modified, and I would like to migrate my serialized objects, which were based on the old non-generic MyReport, to the new MyReport<T>.
Can someone provide me some input on what kind of utility I should aim to write to migrate my serialized objects to the new, modified interface version?
Thanks,
AG
I have actually done something similar recently, where I created a simple console application to transform some serialized objects from one version to another. I simply used both versions of the DLLs and reflection to read and write the values of the different properties. Perhaps you'll find this helpful as inspiration ;)
static void Main(string[] args)
{
    object test;
    AppDomain.CurrentDomain.AssemblyResolve += domain_AssemblyResolve;
    using (var con = new SqlConnection(connectionString))
    {
        using (var cmd = new SqlCommand())
        {
            cmd.CommandText = "select top 1 Data_Blob from dbo.Serialized";
            cmd.CommandType = CommandType.Text;
            cmd.Connection = con;
            con.Open();
            var blob = (byte[])cmd.ExecuteScalar();
            var bf = new BinaryFormatter();
            var stream = new MemoryStream(blob);
            bf.AssemblyFormat = FormatterAssemblyStyle.Full;
            test = bf.Deserialize(stream);
        }
    }
    var objNewVersion = Activator.CreateInstance(Type.GetType("ObjectGraphLibrary.Test, ObjectGraphLibrary, Version=1.0.0.10, Culture=neutral, PublicKeyToken=33c7c38cf0d65826"));
    var oldType = test.GetType();
    var newType = objNewVersion.GetType();
    var oldName = (string)oldType.GetProperty("Name").GetValue(test, null);
    var oldAge = (int)oldType.GetProperty("Age").GetValue(test, null);
    newType.GetProperty("Name").SetValue(objNewVersion, oldName, null);
    newType.GetProperty("DateOfBirth").SetValue(objNewVersion, DateTime.Now.AddYears(-oldAge), null);
    Console.Read();
}

static Assembly domain_AssemblyResolve(object sender, ResolveEventArgs args)
{
    var assName = new AssemblyName(args.Name);
    var uriBuilder = new UriBuilder(Assembly.GetExecutingAssembly().CodeBase);
    var assemblyPath = Uri.UnescapeDataString(uriBuilder.Path);
    var codeBase = Path.GetDirectoryName(assemblyPath);
    var assPath = Path.Combine(codeBase, string.Format("old\\{0}.{1}.{2}.{3}\\{4}.dll",
        assName.Version.Major, assName.Version.Minor, assName.Version.Build,
        assName.Version.Revision, assName.Name));
    return File.Exists(assPath) ? Assembly.LoadFile(assPath) : null;
}
1) Write a utility that reads the serialized objects in the old object definition.
2) The utility writes your objects into the DB in a non-serialized manner (i.e., with one piece of data in every field, etc.).
Don't get into the habit of serializing objects and storing them somewhere in persistent storage for retrieval (much) later. Serialization was not built for that.
You have run into the problem of C programmers in the old days: they would create a struct in memory and save that struct into a file. Then the struct's members would change, and they would wonder how to read it back, since the data was encoded differently.
Then along came database formats, INI files, and so on, specifically to address this need: saving data in one format and then being able to read it back without error.
So don't repeat the errors of the past. Serialization was created to facilitate short-term binary storage and the ability to, say, transmit an object over TCP/IP.
At worst, store your data as XML, not as a serialized binary stream. Also, there is no assurance that I know of from MS that serialized data from one version of .NET can be read by another. Convert your data to a legible format while you can.
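For example, here's a minimal sketch of the XML route with System.Xml.Serialization (the Report class and file name are illustrative only, not from the question):

using System;
using System.IO;
using System.Xml.Serialization;

public class Report
{
    public string Id { get; set; }
    public string Name { get; set; }
}

class Program
{
    static void Main()
    {
        var serializer = new XmlSerializer(typeof(Report));

        // write the object as human-legible XML
        using (var writer = new StreamWriter("report.xml"))
        {
            serializer.Serialize(writer, new Report { Id = "1", Name = "Annual" });
        }

        // read it back; unlike a binary stream, the XML stays readable
        // (and hand-fixable) even if the class gains or loses members
        using (var reader = new StreamReader("report.xml"))
        {
            var report = (Report)serializer.Deserialize(reader);
            Console.WriteLine(report.Name);
        }
    }
}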
