Deserialize Avro Spark - c#

I'm pushing a stream of data to Azure EventHub with the following code leveraging Microsoft.Hadoop.Avro. This code runs every 5 seconds and simply plops the same two Avro-serialised items 👍🏼:
var strSchema = File.ReadAllText("schema.json");
var avroSerializer = AvroSerializer.CreateGeneric(strSchema);
var rootSchema = avroSerializer.WriterSchema as RecordSchema;
var itemList = new List<AvroRecord>();
dynamic record_one = new AvroRecord(rootSchema);
record_one.FirstName = "Some";
record_one.LastName = "Guy";
itemList.Add(record_one);
dynamic record_two = new AvroRecord(rootSchema);
record_two.FirstName = "A.";
record_two.LastName = "Person";
itemList.Add(record_two);
using (var buffer = new MemoryStream())
{
    using (var writer = AvroContainer.CreateGenericWriter(strSchema, buffer, Codec.Null))
    {
        using (var streamWriter = new SequentialWriter<object>(writer, itemList.Count))
        {
            foreach (var item in itemList)
            {
                streamWriter.Write(item);
            }
        }
    }
    eventHubClient.SendAsync(new EventData(buffer.ToArray()));
}
The schema used here is, again, v. simple:
{
  "type": "record",
  "name": "User",
  "namespace": "SerDes",
  "fields": [
    {
      "name": "FirstName",
      "type": "string"
    },
    {
      "name": "LastName",
      "type": "string"
    }
  ]
}
I have validated this is all good with a simple view in Azure Stream Analytics on the portal.
So far so good, but I cannot, for the life of me, correctly deserialize this in Databricks leveraging the from_avro() command under Scala.
Load (the exact same) schema as a string:
val sampleJsonSchema = dbutils.fs.head("/mnt/schemas/schema.json")
Configure EventHub
val connectionString = ConnectionStringBuilder("<CONNECTION_STRING>")
.setEventHubName("<NAME_OF_EVENT_HUB>")
.build
val eventHubsConf = EventHubsConf(connectionString).setStartingPosition(EventPosition.fromEndOfStream)
val eventhubs = spark.readStream.format("eventhubs").options(eventHubsConf.toMap).load()
Read the data..
// this works, and i can see the serialised data
display(eventhubs.select($"body"))
// this fails, and with an exception: org.apache.spark.SparkException: Malformed records are detected in record parsing. Current parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
display(eventhubs.select(from_avro($"body", sampleJsonSchema)))
So essentially, what is going on here? I am serialising the data with the same schema I am deserialising with, but something is malformed. The documentation is incredibly sparse on this front (very, very minimal on the Microsoft website).

The issue
After additional investigation (and mainly with the help of this article), I found what my problem was: from_avro(data: Column, jsonFormatSchema: String) expects the Spark schema format and not the Avro schema format. The documentation is not very clear on this.
Solution 1
Databricks provides a handy method, from_avro(column: Column, subject: String, schemaRegistryUrl: String), that fetches the needed Avro schema from a Kafka schema registry and automatically converts it to the correct format.
Unfortunately, it is not available in plain Spark, nor is it possible to use it without a Kafka schema registry.
Solution 2
Use schema conversion provided by spark:
// imports assumed for this snippet (Confluent schema-registry client + Spark Avro helpers)
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import io.confluent.kafka.serializers.AbstractKafkaAvroDeserializer
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.functions.{col, from_json}

// define avro deserializer (note: the inherited schemaRegistry client still has to be
// configured, e.g. via configure() or an auxiliary constructor, before deserialize is called)
class AvroDeserializer() extends AbstractKafkaAvroDeserializer {
  override def deserialize(payload: Array[Byte]): String = {
    val genericRecord = super.deserialize(payload).asInstanceOf[GenericRecord]
    genericRecord.toString
  }
}

// create deserializer instance
val deserializer = new AvroDeserializer()

// register deserializer
spark.udf.register("deserialize_avro", (bytes: Array[Byte]) =>
  deserializer.deserialize(bytes)
)

// get avro schema from registry (but I presume that it should also work with a schema read from a local file)
val registryClient = new CachedSchemaRegistryClient(kafkaSchemaRegistryUrl, 128)
val avroSchema = registryClient.getLatestSchemaMetadata(topic + "-value").getSchema
val sparkSchema = SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchema))

// consume data
df.selectExpr("deserialize_avro(value) as data")
  .select(from_json(col("data"), sparkSchema.dataType).as("data"))
  .select("data.*")

Related

Microsoft Avro serializer is mangling provided schemas

I'm trying to write out Avro files, and having some real trouble with the serialization. I'm using Microsoft.Avro.Core, and recently discovered that when I give it a schema containing a type with an associated logicalType, it will inexplicably extract the inner type and use that to replace it! This means that my DateTime declaration of "type": {"type": "long", "logicalType": "timestamp-micros"} is now a simple "type": "long", which the recipient is unable to interpret properly.
If it were simply doing this internally to understand what data types it's working with, that would be one thing. But this mangled schema is actually being written to the output file, which is completely incorrect behavior. Does anyone know a way to fix or work around this?
(And yes, the library hasn't been updated in 5 years and is probably completely unsupported. But it was the only .NET Avro serializer I could find that fulfills one crucial requirement: allowing me to work with arbitrary types not known at compile-time. Everything else seems to want to only use generic serializers of type T, but my use case can't provide that T. So I can't abandon this library for something better unless there actually is something better that I can use. But if there is, I'd be open to it.)
After a fair amount of searching and poking through poorly-documented source code, I found a solution. Even though the official Avro library from Apache does require a generic type parameter for all of its readers and writers, if you specify the type as GenericRecord, it will let you work with the GenericRecord as a runtime-defined structure. (This is not an arbitrary dynamic type, as the values you assign to its fields still have to match the provided schema.)
Meanwhile, this library's type system has much wider support for Avro's type set than the abandoned Microsoft one does. It correctly works with Avro's logical types and converts between them and CLR types the way you would expect it should work, with one notable exception: serializing a DateTime will convert it from the system's local time to UTC. (There's probably a way to work around this but I haven't found it yet.)
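For reference, here is a minimal sketch of that GenericRecord approach using the Apache.Avro package; the schema, field names, and file name below are made up for illustration, and error handling is omitted:
using Avro;
using Avro.File;
using Avro.Generic;

// Runtime-parsed schema; nothing about the record type is known at compile time.
var schema = (RecordSchema)Schema.Parse(@"
{
  ""type"": ""record"",
  ""name"": ""User"",
  ""fields"": [
    { ""name"": ""FirstName"", ""type"": ""string"" },
    { ""name"": ""LastName"", ""type"": ""string"" }
  ]
}");

var record = new GenericRecord(schema);
record.Add("FirstName", "Some");
record.Add("LastName", "Guy");

// DataFileWriter produces a standard Avro container file with the schema embedded.
using (var writer = DataFileWriter<GenericRecord>.OpenWriter(
           new GenericDatumWriter<GenericRecord>(schema), "users.avro"))
{
    writer.Append(record);
}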
Just leaving this here in case anyone runs into similar problems in the future.
I tried to reproduce the behavior and put together a small program that doesn't use generics.
Indeed, from what I can see the DateTime sub-type is still omitted from the schema, which is really confusing and frustrating, because the receiver needs to know up front (possibly by using the field name?!) which long fields are of type DateTime and which are not. By default it uses Ticks to parse DateTimes. I looked a bit over the Avro library on GitHub and I saw it uses runtimeType for DateTime and DateTimeOffset.
It probably doesn't help you too much, but maybe it helps someone out there with similar problems.
[DataContract]
public struct TestMsg
{
    [DataMember]
    public int Id;
    [DataMember]
    public double Amount;
    [DataMember]
    public DateTime OrderSubmitted;

    public TestMsg(int id, double amount, DateTime orderSubmitted)
    {
        Id = id;
        Amount = amount;
        OrderSubmitted = orderSubmitted;
    }
}

internal class Program
{
    static void Main(string[] args)
    {
        string line = Environment.NewLine;
        string fileName = "Messages.avro";
        string filePath = null;
        if (Environment.NewLine.Contains("\r"))
        {
            filePath = new DirectoryInfo(".") + @"\" + fileName;
        }
        else
        {
            filePath = new DirectoryInfo(".") + @"/" + fileName;
        }

        ArrayList al = new ArrayList
        {
            new TestMsg(1, 189.12, DateTime.Now),
            new TestMsg(2, 345.94, new DateTime(2000, 1, 10, 15, 20, 23, 103))
        };

        var schema = @"
        {
            ""type"" : ""record"",
            ""name"" : ""TestAvro.TestMsg"",
            ""fields"" : [
                {
                    ""name"" : ""Id"",
                    ""type"": ""int""
                },
                {
                    ""name"" : ""Amount"",
                    ""type"": ""double""
                },
                {
                    ""name"" : ""OrderSubmitted"",
                    ""type"": ""long"",
                    ""runtimeType"": ""DateTime""
                }
            ]
        }";

        using (var dataStream = new FileStream(filePath, FileMode.Create))
        {
            var serializer = AvroSerializer.CreateGeneric(schema);
            using (var avroWriter = AvroContainer.CreateGenericWriter(schema, dataStream, Codec.Null))
            {
                using (var seqWriter = new SequentialWriter<object>(avroWriter, al.Count))
                {
                    var avroRecords = new List<AvroRecord>();
                    foreach (var item in al)
                    {
                        dynamic record = new AvroRecord(serializer.WriterSchema);
                        record.Id = ((TestMsg)item).Id;
                        record.Amount = ((TestMsg)item).Amount;
                        record.OrderSubmitted = ((TestMsg)item).OrderSubmitted.Ticks;
                        seqWriter.Write(record);
                    }
                }
            }
            dataStream.Dispose();
        }

        Console.WriteLine("Now reading file.");

        using (var dataStream = new FileStream(filePath, FileMode.Open))
        {
            using (var avroReader = AvroContainer.CreateGenericReader(schema, dataStream, true, new CodecFactory()))
            {
                using (var seqReader = new SequentialReader<object>(avroReader))
                {
                    foreach (var dynamicItem in seqReader.Objects)
                    {
                        dynamic obj = (dynamic)dynamicItem;
                        Console.WriteLine($" {obj.Id} - {obj.Amount} - {new DateTime(obj.OrderSubmitted).ToString()}");
                    }
                }
            }
            dataStream.Dispose();
        }

        Console.ReadLine();
    }
}

How to serialize/deserialize from ksqlDB Avro format to C# using the Confluent platform

I am using ksqlDB with a table of the following form:
KSQL-DB Query
create table currency (id integer,name varchar) with (kafka_topic='currency',partitions=1,value_format='avro');
C# model
public class Currency
{
public int Id{get;set;}
public string Name{get;set;}
}
Now I want to know how I should write/read data from this topic in C# using the Confluent library:
Writing
IProducer<int, Currency> producer = ....
Currency cur = new Currency();
Message<int, Currency> message = new Message<int, Currency>
{
    Key = cur.Id,
    Timestamp = new Timestamp(DateTime.UtcNow, TimestampType.CreateTime),
    Value = cur
};
DeliveryResult<int, Currency> delivery = await this.producer.ProduceAsync(topic, message);
Reading
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetKeyDeserializer(Deserializers.Int32) // I assume I need to use the id from my dto
    .SetValueDeserializer(...) // what deserializer?
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
Currency message = // what deserializer? JsonSerializer.Deserialize<Currency>(result.Message.Value);
I am not sure how to go about this, so I tried looking for a serializer. I found this library, AvroSerializer, but I do not get where the author fetches the schema.
Any help on how to read/write to a specific topic that would match my ksqlDB models?
Update
After some research and some answers here, I have started using the schema registry:
var config = new ConsumerConfig
{
    GroupId = kafkaConfig.ConsumerGroup,
    BootstrapServers = kafkaConfig.ServerUrl,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
var schemaRegistryConfig = new SchemaRegistryConfig
{
    Url = kafkaConfig.SchemaRegistryUrl
};
var schemaRegistry = new CachedSchemaRegistryClient(schemaRegistryConfig);
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetKeyDeserializer(new AvroDeserializer<int>(schemaRegistry).AsSyncOverAsync())
    .SetValueDeserializer(new AvroDeserializer<Currency>(schemaRegistry).AsSyncOverAsync())
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
Now I am getting another error:
Expecting data framing of length 5 bytes or more but total data size is 4 bytes
As someone kindly pointed out, it seems I am retrieving only the id from the schema registry.
How can I just insert into currency (id,name) values (1,3) and retrieve it in C# as a POCO (listed above)?
Update 2
After I found this source program, it seems I am not able to publish messages to tables for some reason.
There is no error when sending the message, but it is not published to Kafka.
I found this library, AvroSerializer, but I do not get where the author fetches the schema.
Unclear why you need to use a library other than the Confluent one, but they get it from the Schema Registry. You can use CachedSchemaRegistryClient to get the schema string easily; however, you shouldn't need this in the code, as the deserializer will download the schema from the registry on its own.
If you refer to the examples/ folder in the confluent-kafka-dotnet repo for specific Avro consumption, you can see they generate the User class from the User.avsc file, which seems to be exactly what you want to do here for Currency rather than writing it yourself.
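To make that concrete, here is a rough sketch of the consuming side with the Confluent packages, assuming a Currency class generated from the registered schema with avrogen (so it implements ISpecificRecord); the broker/registry URLs, group id, and topic name are placeholders:
using System;
using Confluent.Kafka;
using Confluent.Kafka.SyncOverAsync;
using Confluent.SchemaRegistry;
using Confluent.SchemaRegistry.Serdes;

var consumerConfig = new ConsumerConfig
{
    BootstrapServers = "localhost:9092",
    GroupId = "currency-reader",
    AutoOffsetReset = AutoOffsetReset.Earliest
};

using var schemaRegistry = new CachedSchemaRegistryClient(
    new SchemaRegistryConfig { Url = "http://localhost:8081" });

// The value deserializer fetches the writer schema from the registry using the id
// embedded in each message, so no schema has to be handled by hand here.
using var consumer = new ConsumerBuilder<int, Currency>(consumerConfig)
    .SetValueDeserializer(new AvroDeserializer<Currency>(schemaRegistry).AsSyncOverAsync())
    .Build();

consumer.Subscribe("currency");
var result = consumer.Consume();
Console.WriteLine($"{result.Message.Value.Id} - {result.Message.Value.Name}");
The key is left to the built-in int deserializer on purpose: ksqlDB usually writes the key in the plain KAFKA format (4 bytes, no schema-registry framing), which would explain the "Expecting data framing of length 5 bytes or more but total data size is 4 bytes" error above if an Avro key deserializer is used.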
I have solved the problem by defining my own custom serializer, implementing the ISerializer<T> and IDeserializer<T> interfaces, which under the hood are just wrappers over System.Text.Json.JsonSerializer (or Newtonsoft.Json).
Serializer
public class MySerializer<T> : ISerializer<T>
{
    public byte[] Serialize(T data, SerializationContext context)
    {
        var str = System.Text.Json.JsonSerializer.Serialize(data); // you can also use Newtonsoft here
        var bytes = Encoding.UTF8.GetBytes(str);
        return bytes;
    }
}
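The deserializer half isn't shown above; a sketch of what it could look like, again just wrapping System.Text.Json:
public class MyDeserializer<T> : IDeserializer<T>
{
    public T Deserialize(ReadOnlySpan<byte> data, bool isNull, SerializationContext context)
    {
        if (isNull || data.IsEmpty)
            return default;
        // The payload is plain UTF-8 JSON because the matching serializer wrote it that way.
        return System.Text.Json.JsonSerializer.Deserialize<T>(data);
    }
}
If both interfaces are instead implemented on the same MySerializer<T> class, as described above, the SetValueDeserializer call in the usage below compiles as written.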
Usage
var config = new ConsumerConfig
{
    GroupId = kafkaConfig.ConsumerGroup,
    BootstrapServers = kafkaConfig.ServerUrl,
    AutoOffsetReset = AutoOffsetReset.Earliest
};
IConsumer<int, Currency> consumer = new ConsumerBuilder<int, Currency>(config)
    .SetValueDeserializer(new MySerializer<Currency>())
    .Build();
ConsumeResult<int, Currency> result = consumer.Consume();
P.S. I am not even using the schema registry here after I implemented the interfaces.

Azure KeyPhrase API returning 400 at times

I'm getting mixed results with the Azure KeyPhrase API - sometimes successful (by that I mean a 200 result) and other times I'm getting 400 Bad Request. To test the service, I'm sending the contents of an Azure PDF on their NoSQL service.
The documentation says that each document may be up to 5k characters. To rule that out (I started off with 5k), I'm limiting each document to at most 1k characters.
How can I get more info on the cause of the failure? I've already checked the Portal, but there's not much detail there.
I am using this endpoint: https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases
Some sample failures:
{"documents":[{"language":"en","id":1,"text":"David Chappell Understanding NoSQL on Microsoft Azure Sponsored by Microsoft Corporation Copyright © 2014 Chappell & Associates"}]}
{"documents":[{"language":"en","id":1,"text":"3 Relational technology has been the dominant approach to working with data for decades. Typically accessed using Structured Query Language (SQL), relational databases are incredibly useful. And as their popularity suggests, they can be applied in many different situations. But relational technology isn’t always the best approach. Suppose you need to work with very large amounts of data, for example, too much to store on a single machine. Scaling relational technology to work effectively across many servers (physical or virtual) can be challenging. Or suppose your application works with data that’s not a natural fit for relational systems, such as JavaScript Object Notation (JSON) documents. Shoehorning the data into relational tables is possible, but a storage technology expressly designed to work with this kind of information might be simpler. NoSQL technologies have been created to address problems like these. As the name suggests, the label encompasses a variety of storage"}]}
*** added my quick/dirty POC code ***
List<string> sendRequest(object data)
{
    string url = "https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases";
    string key = "api-code-here";
    string hdr = "Ocp-Apim-Subscription-Key";
    var wc = new WebClient();
    wc.Headers.Add(hdr, key);
    wc.Headers.Add(HttpRequestHeader.ContentType, "application/json");
    TextAnalyticsResult results = null;
    string json = JsonConvert.SerializeObject(data);
    try
    {
        var bytes = Encoding.Default.GetBytes(json);
        var d2 = wc.UploadData(url, bytes);
        var dataString = Encoding.Default.GetString(d2);
        results = JsonConvert.DeserializeObject<TextAnalyticsResult>(dataString);
    }
    catch (Exception ex)
    {
        var s = ex.Message;
    }
    System.Threading.Thread.Sleep(125);
    if (results != null && results.documents != null)
        return results.documents.SelectMany(x => x.keyPhrases).ToList();
    else
        return new List<string>();
}
Called by:
foreach (var k in vals)
{
    data.documents.Clear();
    int countSpaces = k.Count(Char.IsWhiteSpace);
    if (countSpaces > 3)
    {
        if (k.Length > maxLen)
        {
            var v = k;
            while (v.Length > maxLen)
            {
                var tmp = v.Substring(0, maxLen);
                var idx = tmp.LastIndexOf(" ");
                tmp = tmp.Substring(0, idx).Trim();
                data.documents.Add(new
                {
                    language = "en",
                    id = data.documents.Count() + 1,
                    text = tmp
                });
                v = v.Substring(idx + 1).Trim();
                phrases.AddRange(sendRequest(data));
                data.documents.Clear();
            }
            data.documents.Add(new
            {
                language = "en",
                id = data.documents.Count() + 1,
                text = v
            });
            phrases.AddRange(sendRequest(data));
            data.documents.Clear();
        }
        else
        {
            data.documents.Add(new
            {
                language = "en",
                id = 1,
                text = k
            });
            phrases.AddRange(sendRequest(data));
            data.documents.Clear();
        }
    }
}
I manually created some requests using the document samples that you indicated had errors and they were processed by the service correctly and returned key phrases. So an encoding issue looks likely.
In the future, you can also look at the inner error returned by the service. Usually you'll see some more details like in the response sample below.
{
  "code": "BadRequest",
  "message": "Invalid request",
  "innerError": {
    "code": "InvalidRequestContent",
    "message": "Request contains duplicated Ids. Make sure each document has a unique Id."
  }
}
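Note that the sendRequest method above swallows the exception (only ex.Message is kept), so this inner error never becomes visible. A small sketch of how the error body could be read out of the WebException instead:
try
{
    var bytes = Encoding.UTF8.GetBytes(json);
    var d2 = wc.UploadData(url, bytes);
    // ... deserialize the response as before ...
}
catch (WebException ex) when (ex.Response != null)
{
    using (var reader = new StreamReader(ex.Response.GetResponseStream()))
    {
        // Prints the full error payload, including the innerError block shown above.
        Console.WriteLine(reader.ReadToEnd());
    }
}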
Also, there is a .NET SDK for Text Analytics that can help simplify calling the service.
https://github.com/Azure/azure-rest-api-specs/tree/current/specification/cognitiveservices/data-plane/TextAnalytics
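For newer projects, a current client package is Azure.AI.TextAnalytics rather than the SDK in the link above; a minimal sketch, with the endpoint and key as placeholders:
using System;
using Azure;
using Azure.AI.TextAnalytics;

var client = new TextAnalyticsClient(
    new Uri("https://<your-resource>.cognitiveservices.azure.com/"),
    new AzureKeyCredential("<key>"));

// The client handles the JSON encoding and surfaces service errors
// (including the error code/message) as RequestFailedException.
KeyPhraseCollection phrases = client.ExtractKeyPhrases(
    "Relational technology has been the dominant approach to working with data for decades.");

foreach (string phrase in phrases)
    Console.WriteLine(phrase);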
Try changing this line
var bytes = Encoding.Default.GetBytes(json);
to
var bytes = Encoding.UTF8.GetBytes(json);

Elasticsearch - MapperParsingException[Malformed content, must start with an object]

I am trying to bulk index documents into ES using BulkDescriptor in C#. I am using ES v1.7. Following is my piece of code:
public IBulkResponse IndexBulk(string index, string type, List<string> documents)
{
    BulkDescriptor descriptor = new BulkDescriptor();
    foreach (var doc in documents)
    {
        JObject data = JObject.Parse(doc);
        descriptor.Index<object>(i => i
            .Index(index)
            .Type(type)
            .Id(data["Id"].ToString())
            .Document(doc));
    }
    return _Client.Bulk(descriptor);
}
But it is not inserting the documents. When I verified the response, I saw the following message: MapperParsingException[Malformed content, must start with an object]
Sample JSON document
{
"a" : "abc",
"b": { "c": ["1","2"]}
}
What went wrong in it?
The issue here is that you are passing raw JSON through the strongly typed fluent bulk method.
What you are actually sending to Elasticsearch is:
{"index":{"_index":"test1","_type":"string"}}
"{"a" : "abc","b": { "c": ["1","2"]}}"
which is not correct.
A few ideas for what you can do about this:
use JObject to send a correctly serialized object to Elasticsearch:
descriptor.Index<JObject>(i => i
    .Index(index)
    .Type(type)
    .Id(data["Id"].ToString())
    .Document(JObject.Parse(doc)));
take advantage of the .Raw client to send raw JSON:
var json = new StringBuilder();
json.AppendLine(@"{""index"":{""_index"":""indexName"",""_type"":""typeName""}}");
json.AppendLine(@"{""a"" : ""abc"",""b"": { ""c"": [""1"",""2""]}}");
_Client.Raw.Bulk(json.ToString());
Hope it helps.

How do I convert JSON to JSON-LD in .NET

I am trying to convert JSON to JSON-LD. So far I have tried the json-ld.net library from NuGet (it is part of nuget3): https://www.nuget.org/packages/json-ld.net/
var jtoken = JsonLD.Util.JSONUtils.FromString(response);
var options = new JsonLdOptions();
options.SetBase("http://json-ld.org/test-suite/tests/");
options.SetProduceGeneralizedRdf(true);
var context = JsonLD.Util.JSONUtils.FromString(Properties.Resources.jasonldcontext);
options.SetExpandContext((JObject)context);
var jtokenout = JsonLdProcessor.Compact(jtoken, context, options);
var sz = JSONUtils.ToString(jtokenout);
the context resource:
{"#context": {
"ex": "http://example.org/",
"term1": {"#id": "ex:term1", "#type": "ex:datatype"},
"term2": {"#id": "ex:term2", "#type": "#id"}
}}
My JSON is present and valid. It comes from a REST service (response), and jtoken is populated. However, sz only contains the context:
context":{"ex":"http://example.org/","term1":
{"#id":"ex:term1","#type":"ex:datatype"},"term2":
{"#id":"ex:term2","#type":"#id"}}}
MXTires Microdata .NET is a good one. It converts .NET classes to Schema.org structured data in the form of JSON-LD.
I think I framed the question incorrectly. POCO to JSON-LD can be accomplished easily with JsonLD.Entities on GitHub. If I start with a POCO, or convert the JSON to a POCO, then this works easily.
var person = new Person
{
    Id = new Uri("http://t-code.pl/#tomasz"),
    Name = "Tomasz",
    LastName = "Pluskiewicz"
};
var @context = JObject.Parse("{ '@context': 'http://example.org/context/Person' }");
var contextProvider = new StaticContextProvider();
contextProvider.SetContext(typeof(Person), @context);
// when
IEntitySerializer serializer = new EntitySerializer(contextProvider);
dynamic json = serializer.Serialize(person);
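For completeness, the Person class assumed by the snippet above would just be a POCO along these lines (property names inferred from the object initializer):
public class Person
{
    public Uri Id { get; set; }
    public string Name { get; set; }
    public string LastName { get; set; }
}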
