C# - OutOfMemoryException saving a List on a JSON file

I'm trying to save the streaming data of a pressure map.
Basically I have a pressure matrix defined as:
double[,] pressureMatrix = new double[e.Data.GetLength(0), e.Data.GetLength(1)];
I'm getting one of these pressureMatrix arrays every 10 milliseconds, and I want to save all of them in a JSON file so I can play the recording back later.
What I do is, first of all, write what I call the header with all the settings used to do the recording like this:
recordedData.softwareVersion = Assembly.GetExecutingAssembly().GetName().Version.Major.ToString() + "." + Assembly.GetExecutingAssembly().GetName().Version.Minor.ToString();
recordedData.calibrationConfiguration = calibrationConfiguration;
recordedData.representationConfiguration = representationSettings;
recordedData.pressureData = new List<PressureMap>();
var json = JsonConvert.SerializeObject(recordedData, Formatting.None);
File.WriteAllText(this.filePath, json);
Then, every time I get a new pressure map I create a new Thread to add the new PressureMap and rewrite the file:
var newPressureMatrix = new PressureMap(datos, DateTime.Now);
recordedData.pressureData.Add(newPressureMatrix);
var json = JsonConvert.SerializeObject(recordedData, Formatting.None);
File.WriteAllText(this.filePath, json);
After about 20-30 minutes I get an OutOfMemoryException, because the system cannot hold the recordedData variable: the List<PressureMap> in it has grown too big.
How can I handle this so I can save the data? I would like to save 24-48 hours of information.

Your basic problem is that you are holding all of your pressure map samples in memory rather than writing each one individually and then allowing it to be garbage collected. What's worse, you are doing this in two different places:
First, you serialize your entire list of samples to a JSON string json before writing the string to a file. Instead, as explained in Performance Tips: Optimize Memory Usage, you should serialize and deserialize directly to and from your file in such situations. For instructions on how to do this see this answer to Can Json.NET serialize / deserialize to / from a stream? and also Serialize JSON to a file.
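For instance, here is a minimal sketch of writing directly to the file (it assumes the recordedData variable and this.filePath field from the question). Note that this only addresses issue #1; the whole object graph is still held in memory:
using (var stream = new FileStream(this.filePath, FileMode.Create))
using (var textWriter = new StreamWriter(stream))
using (var jsonWriter = new JsonTextWriter(textWriter))
{
    // Serialize straight to the file rather than building the entire JSON string in memory first.
    JsonSerializer.CreateDefault().Serialize(jsonWriter, recordedData);
}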
Second, recordedData.pressureData = new List<PressureMap>(); accumulates all pressure map samples, and then all of them are rewritten every time a new sample is taken.
A better solution would be to write each sample once and then forget it, but the requirement that each sample be nested inside some container objects in the JSON makes it non-obvious how to do that.
So, how do we attack issue #2?
First, let's modify your data model as follows, partitioning the header data into a separate class:
public class PressureMap
{
public double[,] PressureMatrix { get; set; }
}
public class CalibrationConfiguration
{
// Data model not included in question
}
public class RepresentationConfiguration
{
// Data model not included in question
}
public class RecordedDataHeader
{
public string SoftwareVersion { get; set; }
public CalibrationConfiguration CalibrationConfiguration { get; set; }
public RepresentationConfiguration RepresentationConfiguration { get; set; }
}
public class RecordedData
{
// Ensure the header is serialized first.
[JsonProperty(Order = 1)]
public RecordedDataHeader RecordedDataHeader { get; set; }
// Ensure the pressure data is serialized last.
[JsonProperty(Order = 2)]
public IEnumerable<PressureMap> PressureData { get; set; }
}
Option #1 is a version of the producer-consumer pattern. It involves spinning up two threads: one to generate PressureMap samples, and one to serialize the RecordedData. The first thread will generate samples and add them to a BlockingCollection<PressureMap> collection that is passed to the second thread. The second thread will then serialize BlockingCollection<PressureMap>.GetConsumingEnumerable() as the value of RecordedData.PressureData.
The following code gives a skeleton for how to do this:
var sampleCount = 400; // Or whatever stopping criterion you prefer
var sampleInterval = 10; // in ms
using (var pressureData = new BlockingCollection<PressureMap>())
{
// Adapted from
// https://learn.microsoft.com/en-us/dotnet/standard/collections/thread-safe/blockingcollection-overview
// https://learn.microsoft.com/en-us/dotnet/api/system.collections.concurrent.blockingcollection-1?view=netframework-4.7.2
// Spin up a Task to sample the pressure maps
using (Task t1 = Task.Factory.StartNew(() =>
{
for (int i = 0; i < sampleCount; i++)
{
var data = GetPressureMap(i);
Console.WriteLine("Generated sample {0}", i);
pressureData.Add(data);
System.Threading.Thread.Sleep(sampleInterval);
}
pressureData.CompleteAdding();
}))
{
// Spin up a Task to consume the BlockingCollection
using (Task t2 = Task.Factory.StartNew(() =>
{
var recordedDataHeader = new RecordedDataHeader
{
SoftwareVersion = softwareVersion,
CalibrationConfiguration = calibrationConfiguration,
RepresentationConfiguration = representationConfiguration,
};
var settings = new JsonSerializerSettings
{
ContractResolver = new CamelCasePropertyNamesContractResolver(),
};
using (var stream = new FileStream(this.filePath, FileMode.Create))
using (var textWriter = new StreamWriter(stream))
using (var jsonWriter = new JsonTextWriter(textWriter))
{
int j = 0;
var query = pressureData
.GetConsumingEnumerable()
.Select(p =>
{
// Flush the writer periodically in case the process terminates abnormally
jsonWriter.Flush();
Console.WriteLine("Serializing item {0}", j++);
return p;
});
var recordedData = new RecordedData
{
RecordedDataHeader = recordedDataHeader,
// Since PressureData is declared as IEnumerable<PressureMap>, evaluation will be lazy.
PressureData = query,
};
Console.WriteLine("Beginning serialization of {0} to {1}:", recordedData, this.filePath);
JsonSerializer.CreateDefault(settings).Serialize(textWriter, recordedData);
Console.WriteLine("Finished serialization of {0} to {1}.", recordedData, this.filePath);
}
}))
{
Task.WaitAll(t1, t2);
}
}
}
Notes:
This solution uses the fact that, when serializing an IEnumerable<T>, Json.NET will not materialize the enumerable as a list. Instead it will take full advantage of lazy evaluation and simply enumerate through it, writing then forgetting each individual item encountered.
The first thread samples pressure maps and adds them to the blocking collection.
The second thread wraps the blocking collection in an IEnumerable<PressureMap> and then serializes that as RecordedData.PressureData.
During serialization, the serializer will enumerate through the IEnumerable<PressureMap>, streaming each sample to the JSON file and then proceeding to the next, blocking until the next one becomes available.
You will need to do some experimentation to make sure that the serialization thread can "keep up" with the sampling thread, possibly by setting a BoundedCapacity during construction. If not, you may need to adopt a different strategy.
PressureMap GetPressureMap(int count) should be some method of yours (not shown in the question) that returns the current pressure map sample.
In this technique the JSON file remains open for the duration of the sampling session. If sampling terminates abnormally the file may be truncated. I make some attempt to ameliorate the problem by flushing the writer periodically.
While data serialization will no longer require unbounded amounts of memory, deserializing a RecordedData later will materialize the PressureData enumerable into a concrete List<PressureMap>. This may cause memory issues during downstream processing; see the sketch after the fiddle link below.
Demo fiddle #1 here.
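If downstream memory use does become a problem, a rough sketch of one possible workaround (untested here) is to read the samples back one at a time with a JsonTextReader instead of deserializing the whole RecordedData:
using (var stream = File.OpenRead(filePath))
using (var textReader = new StreamReader(stream))
using (var jsonReader = new JsonTextReader(textReader))
{
    var serializer = JsonSerializer.CreateDefault();
    while (jsonReader.Read())
    {
        // Each sample object lives inside the "pressureData" array (camel-cased by the
        // contract resolver above), so deserialize only those objects, one per iteration.
        if (jsonReader.TokenType == JsonToken.StartObject
            && jsonReader.Path.StartsWith("pressureData[", StringComparison.Ordinal))
        {
            var sample = serializer.Deserialize<PressureMap>(jsonReader);
            // Process the sample here, then let it be garbage collected.
        }
    }
}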
Option #2 would be to switch from a JSON file to a Newline Delimited JSON file. Such a file consists of sequences of JSON objects separated by newline characters. In your case, you would make the first object contain the RecordedDataHeader information, and the subsequent objects be of type PressureMap:
var sampleCount = 100; // Or whatever
var sampleInterval = 10;
var recordedDataHeader = new RecordedDataHeader
{
SoftwareVersion = softwareVersion,
CalibrationConfiguration = calibrationConfiguration,
RepresentationConfiguration = representationConfiguration,
};
var settings = new JsonSerializerSettings
{
ContractResolver = new CamelCasePropertyNamesContractResolver(),
};
// Write the header
Console.WriteLine("Beginning serialization of sample data to {0}.", this.filePath);
using (var stream = new FileStream(this.filePath, FileMode.Create))
{
JsonExtensions.ToNewlineDelimitedJson(stream, new[] { recordedDataHeader });
}
// Write each sample incrementally
for (int i = 0; i < sampleCount; i++)
{
Thread.Sleep(sampleInterval);
Console.WriteLine("Performing sample {0} of {1}", i, sampleCount);
var map = GetPressureMap(i);
using (var stream = new FileStream(this.filePath, FileMode.Append))
{
JsonExtensions.ToNewlineDelimitedJson(stream, new[] { map });
}
}
Console.WriteLine("Finished serialization of sample data to {0}.", this.filePath);
Using the extension methods:
public static partial class JsonExtensions
{
// Adapted from the answer to
// https://stackoverflow.com/questions/44787652/serialize-as-ndjson-using-json-net
// by dbc https://stackoverflow.com/users/3744182/dbc
public static void ToNewlineDelimitedJson<T>(Stream stream, IEnumerable<T> items)
{
// Let caller dispose the underlying stream
using (var textWriter = new StreamWriter(stream, new UTF8Encoding(false, true), 1024, true))
{
ToNewlineDelimitedJson(textWriter, items);
}
}
public static void ToNewlineDelimitedJson<T>(TextWriter textWriter, IEnumerable<T> items)
{
var serializer = JsonSerializer.CreateDefault();
foreach (var item in items)
{
// Formatting.None is the default; I set it here for clarity.
using (var writer = new JsonTextWriter(textWriter) { Formatting = Formatting.None, CloseOutput = false })
{
serializer.Serialize(writer, item);
}
// http://specs.okfnlabs.org/ndjson/
// Each JSON text MUST conform to the [RFC7159] standard and MUST be written to the stream followed by the newline character \n (0x0A).
// The newline character MAY be preceded by a carriage return \r (0x0D). The JSON texts MUST NOT contain newlines or carriage returns.
textWriter.Write("\n");
}
}
// Adapted from the answer to
// https://stackoverflow.com/questions/29729063/line-delimited-json-serializing-and-de-serializing
// by Yuval Itzchakov https://stackoverflow.com/users/1870803/yuval-itzchakov
public static IEnumerable<TBase> FromNewlineDelimitedJson<TBase, THeader, TRow>(TextReader reader)
where THeader : TBase
where TRow : TBase
{
bool first = true;
using (var jsonReader = new JsonTextReader(reader) { CloseInput = false, SupportMultipleContent = true })
{
var serializer = JsonSerializer.CreateDefault();
while (jsonReader.Read())
{
if (jsonReader.TokenType == JsonToken.Comment)
continue;
if (first)
{
yield return serializer.Deserialize<THeader>(jsonReader);
first = false;
}
else
{
yield return serializer.Deserialize<TRow>(jsonReader);
}
}
}
}
}
Later, you can process the newline delimited JSON file as follows:
using (var stream = File.OpenRead(filePath))
using (var textReader = new StreamReader(stream))
{
foreach (var obj in JsonExtensions.FromNewlineDelimitedJson<object, RecordedDataHeader, PressureMap>(textReader))
{
if (obj is RecordedDataHeader)
{
var header = (RecordedDataHeader)obj;
// Process the header
Console.WriteLine(JsonConvert.SerializeObject(header));
}
else
{
var row = (PressureMap)obj;
// Process the row.
Console.WriteLine(JsonConvert.SerializeObject(row));
}
}
}
Notes:
This approach looks simpler because the samples are added incrementally to the end of the file, rather than inserted inside some overall JSON container.
With this approach both serialization and downstream processing can be done with bounded memory use.
The sample file does not remain open for the duration of sampling, so is less likely to be truncated.
Downstream applications may not have built-in tools for processing newline delimited JSON.
This strategy may integrate more simply with your current threading code.
Demo fiddle #2 here.

Related

How to Bulk Insert in Cosmos DB with .NET Core 2.1 and Stream API

I'm trying to implement bulk insert with this CosmosDB sample. This sample is created with .NET Core 3.* and support of System.Text.Json.
When using the CreateItemAsync method, it works perfectly:
var concurrentTasks = new List<Task<ItemResponse<Notification>>>();
foreach (var entity in entities)
{
entity.Id = GenerateId(entity);
var requestOptions = new ItemRequestOptions();
requestOptions.EnableContentResponseOnWrite = false; // We don't need the entire body returned.
concurrentTasks.Add(Container.CreateItemAsync(entity, new PartitionKey(entity.UserId), requestOptions));
}
await Task.WhenAll(concurrentTasks);
However, I'm trying to see if I can reduce the number of RU's by streaming the data directly into CosmosDB, hoping CosmosDB doesn't charge me for deserializing JSON itself.
I'm working in .NET Core 2.1 with Newtonsoft.Json. This is my code, which does not return a successful status code. The sub-status code in the response header is "0".
Notification[] notifications = entities.ToArray();
var itemsToInsert = new Dictionary<PartitionKey, Stream>();
foreach (var notification in notifications)
{
MemoryStream ms = new MemoryStream();
StreamWriter writer = new StreamWriter(ms);
JsonTextWriter jsonWriter = new JsonTextWriter(writer);
JsonSerializer ser = new JsonSerializer();
ser.Serialize(jsonWriter, notification);
await jsonWriter.FlushAsync();
await writer.FlushAsync();
itemsToInsert.Add(new PartitionKey(notification.UserId), ms);
}
List<Task> tasks = new List<Task>(notifications.Length);
foreach (KeyValuePair<PartitionKey, Stream> item in itemsToInsert)
{
tasks.Add(Container.CreateItemStreamAsync(item.Value, item.Key)
.ContinueWith((Task<ResponseMessage> task) =>
{
using (ResponseMessage response = task.Result)
{
if (!response.IsSuccessStatusCode)
{
Console.WriteLine($"Received {response.StatusCode} ({response.ErrorMessage}).");
}
else
{
}
}
}));
}
// Wait until all are done
await Task.WhenAll(tasks);
response.StatusCode: BadRequest
response.ErrorMessage: null
I'm assuming I don't serialize into the Stream in a correct way. Anyone got a clue?
Update
I discovered that the new System.Text.Json package also targets .NET Standard 2.0, so I installed it from NuGet. Now I can copy the sample code from GitHub, mentioned earlier.
Notification[] notifications = entities.ToArray();
var itemsToInsert = new List<Tuple<PartitionKey, Stream>>();
foreach (var notification in notifications)
{
notification.id = $"{notification.UserId}:{Guid.NewGuid()}";
MemoryStream stream = new MemoryStream();
await JsonSerializer.SerializeAsync(stream, notification);
itemsToInsert.Add(new Tuple<PartitionKey, Stream>(new PartitionKey(notification.RoleId), stream));
}
List<Task> tasks = new List<Task>(notifications.Length);
foreach (var item in itemsToInsert)
{
tasks.Add(Container.CreateItemStreamAsync(item.Item2, item.Item1)
.ContinueWith((Task<ResponseMessage> task) =>
{
using (ResponseMessage response = task.Result)
{
if (!response.IsSuccessStatusCode)
{
Console.WriteLine($"Received {response.StatusCode} ({response.ErrorMessage}).");
}
else
{
}
}
}));
}
// Wait until all are done
await Task.WhenAll(tasks);
I double checked that BulkInsert is enabled (otherwise the first method wouldn't work either). Still, I get a BadRequest with a null ErrorMessage.
I also checked whether the data was added to the container despite the BadRequest; it wasn't.
I found the problem.
I've setup my Cosmos Context with the following options:
var cosmosSerializationOptions = new CosmosSerializationOptions();
cosmosSerializationOptions.PropertyNamingPolicy = CosmosPropertyNamingPolicy.CamelCase;
CosmosClientOptions cosmosClientOptions = new CosmosClientOptions();
cosmosClientOptions.SerializerOptions = cosmosSerializationOptions;
Hence the camelCase convention. In my first (working) code sample, I let the Cosmos DB context serialize the objects to JSON. It serialized them with this camelCase convention, so my partition key property UserId was serialized as userId.
However, to save some RUs I want to use CreateItemStreamAsync, which makes me responsible for the serialization. And that was the mistake: my property was defined as:
public int UserId { get; set; }
So it was serialized to JSON as "UserId": 1.
However, the partition key is defined as /userId. So once I add the JsonPropertyName attribute, it works:
[JsonPropertyName("userId")]
public int UserId { get; set; }
...if only an error message would tell me that.
There is about a 3% RU saving from using this CreateItemStream method, but over time that would add up, I guess.
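An alternative I have not verified end to end would be to tell System.Text.Json to emit camelCase names globally instead of attributing every property. A rough sketch:
var options = new JsonSerializerOptions { PropertyNamingPolicy = JsonNamingPolicy.CamelCase };
MemoryStream stream = new MemoryStream();
await JsonSerializer.SerializeAsync(stream, notification, options);
stream.Position = 0; // rewind defensively so the request body is read from the start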
It looks like the stream is not readable, hence the bad request.
I would make a small modification to how the MemoryStream is created:
foreach (var notification in notifications)
{
itemsToInsert.Add(new PartitionKey(notification.UserId), new MemoryStream(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(notification))));
}
Of course, I am using Newtonsoft.Json for JsonConvert.
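In context, the modified loop would look something like this (a sketch; the property and variable names are taken from the question). Constructing the MemoryStream from the already-serialized bytes leaves its Position at 0, so the SDK can read the request body immediately:
var itemsToInsert = new Dictionary<PartitionKey, Stream>();
foreach (var notification in notifications)
{
    var bytes = Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(notification));
    itemsToInsert.Add(new PartitionKey(notification.UserId), new MemoryStream(bytes));
}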

Validate object against a schema before serialization

I want to serialize a C# object as JSON into a stream, but to avoid the serialization if the object is not valid according to a schema. How should I proceed with this task using JSON.NET and Json.NET Schema? From what I see there is no method in the JSON.NET library which allows the validation of a C# object against a JSON schema. It seems somewhat weird that there is no direct method to just validate the C# object without encoding it. Do you have any idea why this method is not available?
It seems this API is not currently available. At a guess, this might be because recursively generating the JSON values to validate involves most of the work of serializing the object. Or it could just be because no one at Newtonsoft ever designed, specified, implemented, tested, documented and shipped that feature.
If you want, you could file an enhancement request requesting this API, probably as a part of the SchemaExtensions class.
In the meantime, if you do need to test-validate a POCO without generating a complete serialization of it (because e.g. the result would be very large), you could grab NullJsonWriter from Reference to automatically created objects, wrap it in a JSchemaValidatingWriter and test-serialize your object as shown in Validate JSON with JSchemaValidatingWriter. NullJsonWriter doesn't actually write anything, and so using it eliminates the performance and memory overhead of generating a complete serialization (either as a string or as a JToken).
First, add the following static method:
public static class JsonExtensions
{
public static bool TestValidate<T>(T obj, JSchema schema, SchemaValidationEventHandler handler = null, JsonSerializerSettings settings = null)
{
using (var writer = new NullJsonWriter())
using (var validatingWriter = new JSchemaValidatingWriter(writer) { Schema = schema })
{
int count = 0;
if (handler != null)
validatingWriter.ValidationEventHandler += handler;
validatingWriter.ValidationEventHandler += (o, a) => count++;
JsonSerializer.CreateDefault(settings).Serialize(validatingWriter, obj);
return count == 0;
}
}
}
// Used to enable Json.NET to traverse an object hierarchy without actually writing any data.
class NullJsonWriter : JsonWriter
{
public NullJsonWriter()
: base()
{
}
public override void Flush()
{
// Do nothing.
}
}
Then use it like:
// Example adapted from
// https://www.newtonsoft.com/jsonschema/help/html/JsonValidatingWriterAndSerializer.htm
// by James Newton-King
string schemaJson = @"{
'description': 'A person',
'type': 'object',
'properties': {
'name': {'type':'string'},
'hobbies': {
'type': 'array',
'maxItems': 3,
'items': {'type':'string'}
}
}
}";
var schema = JSchema.Parse(schemaJson);
var person = new
{
Name = "James",
Hobbies = new [] { ".Net", "Blogging", "Reading", "XBox", "LOLCATS" },
};
var settings = new JsonSerializerSettings { ContractResolver = new CamelCasePropertyNamesContractResolver() };
var isValid = JsonExtensions.TestValidate(person, schema, (o, a) => Console.WriteLine(a.Message), settings);
// Prints Array item count 5 exceeds maximum count of 3. Path 'hobbies'.
Console.WriteLine("isValid = {0}", isValid);
// Prints isValid = False
Watch out for casing, by the way: Json.NET Schema is case sensitive, so you will need to use an appropriate contract resolver when test-validating.
Sample fiddle.
You cannot do that from the JSON string; you need an object and a schema to compare against first.
public void Validate()
{
//...
JsonSchema schema = JsonSchema.Parse("{'pattern':'lol'}");
JToken stringToken = JToken.FromObject("pie");
stringToken.Validate(schema);
}

CosmosDB Query Performance

I wrote my latest update, and then got the following error from Stack Overflow: "Body is limited to 30000 characters; you entered 38676."
It's fair to say I have been very verbose in documenting my adventures, so I've rewritten what I have here to be more concise.
I have stored my (long) original post and updates on pastebin. I don't think many people will read them, but I put a lot of effort in to them so it'd be nice not to have them lost.
I have a collection which contains 100,000 documents for learning how to use CosmosDB and for things like performance testing.
Each of these documents has a Location property which is a GeoJSON Point.
According to the documentation, a GeoJSON point should be automatically indexed.
Azure Cosmos DB supports automatic indexing of Points, Polygons, and LineStrings
I've checked the Indexing Policy for my collection, and it has the entry for automatic point indexing:
{
"automatic":true,
"indexingMode":"Consistent",
"includedPaths":[
{
"path":"/*",
"indexes":[
...
{
"kind":"Spatial",
"dataType":"Point"
},
...
]
}
],
"excludedPaths":[ ]
}
I've been looking for a way to list, or otherwise interrogate the indexes that have been created, but I haven't found such a thing yet, so I haven't been able to confirm that this property definitely is being indexed.
I created a GeoJSON Polygon, and then used that to query my documents.
This is my query:
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri)
.Where(document => document.Type == this.documentType && document.Location.Intersects(target.Area));
And I then pass that query object to the following method so I can get the results while tracking the Request Units used:
protected async Task<IEnumerable<T>> QueryTrackingUsedRUsAsync(IQueryable<T> query)
{
var documentQuery = query.AsDocumentQuery();
var documents = new List<T>();
while (documentQuery.HasMoreResults)
{
var response = await documentQuery.ExecuteNextAsync<T>();
this.AddUsedRUs(response.RequestCharge);
documents.AddRange(response);
}
return documents;
}
The point locations are randomly chosen from 10s of millions of UK addresses, so they should have a fairly realistic spread.
The polygon is made up of 16 points (with the first and last point being the same), so it's not very complex. It covers most of the most southern part of the UK, from London down.
An example run of this query returned 8728 documents, using 3917.92 RU, in 170717.151 ms, which is just under 171 seconds, or just under 3 minutes.
3918 RU / 171 s = 22.91 RU/s
I currently have the Throughput (RU/s) set to the lowest value, at 400 RU/s.
It was my understanding that this is the reserved level you are guaranteed to get. You can "burst" above that level at times, but do that too frequently and you'll be throttled back to your reserved level.
The "query speed" of 23 RU/s is, obviously, much much lower than the Throughput setting of 400 RU/s.
I am running the client "locally" i.e. in my office, and not up in the Azure data center.
Each document is roughly 500 bytes (0.5 kb) in size.
So what's happening?
Am I doing something wrong?
Am I misunderstanding how my query is being throttled with regard to RU/s?
Is this the speed at which the GeoSpatial indexes operate, and so the best performance I'll get?
Is the GeoSpatial index not being used?
Is there a way I can view the created indexes?
Is there a way I can check if the index is being used?
Is there a way I can profile the query and get metrics about where time is being spent? e.g. x s was used looking up documents by their type, y s was used filtering them GeoSpatially, and z s was used transferring the data.
UPDATE 1
Here's the polygon I'm using in the query:
Area = new Polygon(new List<LinearRing>()
{
new LinearRing(new List<Position>()
{
new Position(1.8567 ,51.3814),
new Position(0.5329 ,51.4618),
new Position(0.2477 ,51.2588),
new Position(-0.5329 ,51.2579),
new Position(-1.17 ,51.2173),
new Position(-1.9062 ,51.1958),
new Position(-2.5434 ,51.1614),
new Position(-3.8672 ,51.139 ),
new Position(-4.1578 ,50.9137),
new Position(-4.5373 ,50.694 ),
new Position(-5.1496 ,50.3282),
new Position(-5.2212 ,49.9586),
new Position(-3.7049 ,50.142 ),
new Position(-2.1698 ,50.314 ),
new Position(0.4669 ,50.6976),
new Position(1.8567 ,51.3814)
})
})
I have also tried reversing it (since ring orientation matters), but the query with the reversed polygon took significantly longer (I don't have the exact time to hand) and returned 91272 items.
Also, the coordinates are specified as Longitude/Latitude, as this is how GeoJSON expects them (i.e. as X/Y), rather than the traditional order used when speaking of Latitude/Longitude.
The GeoJSON specification specifies longitude first and latitude second.
UPDATE 2
Here's the JSON for one of my documents:
{
"GeoTrigger": null,
"SeverityTrigger": -1,
"TypeTrigger": -1,
"Name": "13, LONSDALE SQUARE, LONDON, N1 1EN",
"IsEnabled": true,
"Type": 2,
"Location": {
"$type": "Microsoft.Azure.Documents.Spatial.Point, Microsoft.Azure.Documents.Client",
"type": "Point",
"coordinates": [
-0.1076407397346815,
51.53970315059827
]
},
"id": "0dc2c03e-082b-4aea-93a8-79d89546c12b",
"_rid": "EQttAMGhSQDWPwAAAAAAAA==",
"_self": "dbs/EQttAA==/colls/EQttAMGhSQA=/docs/EQttAMGhSQDWPwAAAAAAAA==/",
"_etag": "\"42001028-0000-0000-0000-594943fe0000\"",
"_attachments": "attachments/",
"_ts": 1497973747
}
UPDATE 3
I created a minimal reproduction of the issue, and found that the issue no longer occurred.
This indicated that the problem was indeed in my own code.
I set out to check all the differences between the original and the reproduction code and eventually found that something that had appeared fairly innocent to me was in fact having a big impact. Thankfully, that code wasn't needed at all, so it was an easy fix to simply stop using it.
At one point I was using a custom ContractResolver and I hadn't removed it once it was no longer needed.
Here's the offending reproduction code:
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Spatial;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;
namespace Repro.Cli
{
public class Program
{
static void Main(string[] args)
{
JsonConvert.DefaultSettings = () =>
{
return new JsonSerializerSettings
{
ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>()
{
{ "ID", "id" }
})
};
};
//AJ: Init logging
Trace.AutoFlush = true;
Trace.Listeners.Add(new ConsoleTraceListener());
Trace.Listeners.Add(new TextWriterTraceListener("trace.log"));
//AJ: Increase available threads
//AJ: https://learn.microsoft.com/en-us/azure/storage/storage-performance-checklist#subheading10
//AJ: https://github.com/Azure/azure-documentdb-dotnet/blob/master/samples/documentdb-benchmark/Program.cs
var minThreadPoolSize = 100;
ThreadPool.SetMinThreads(minThreadPoolSize, minThreadPoolSize);
//AJ: https://learn.microsoft.com/en-us/azure/cosmos-db/performance-tips
//AJ: gcServer enabled in app.config
//AJ: Prefer 32-bit disabled in project properties
//AJ: DO IT
var program = new Program();
Trace.TraceInformation($"Starting # {DateTime.UtcNow}");
program.RunAsync().Wait();
Trace.TraceInformation($"Finished # {DateTime.UtcNow}");
//AJ: Wait for user to exit
Console.WriteLine();
Console.WriteLine("Hit enter to exit...");
Console.ReadLine();
}
public async Task RunAsync()
{
using (new CodeTimer())
{
var client = await this.GetDocumentClientAsync();
var documentCollectionUri = UriFactory.CreateDocumentCollectionUri(ConfigurationManager.AppSettings["databaseID"], ConfigurationManager.AppSettings["collectionID"]);
//AJ: Prepare Test Documents
var documentCount = 10000; //AJ: 10,000
var documentsForUpsert = this.GetDocuments(documentCount);
await this.UpsertDocumentsAsync(client, documentCollectionUri, documentsForUpsert);
var allDocuments = this.GetAllDocuments(client, documentCollectionUri);
var area = this.GetArea();
var documentsInArea = this.GetDocumentsInArea(client, documentCollectionUri, area);
}
}
private async Task<DocumentClient> GetDocumentClientAsync()
{
using (new CodeTimer())
{
var serviceEndpointUri = new Uri(ConfigurationManager.AppSettings["serviceEndpoint"]);
var authKey = ConfigurationManager.AppSettings["authKey"];
var connectionPolicy = new ConnectionPolicy
{
ConnectionMode = ConnectionMode.Direct,
ConnectionProtocol = Protocol.Tcp,
RequestTimeout = new TimeSpan(1, 0, 0),
RetryOptions = new RetryOptions
{
MaxRetryAttemptsOnThrottledRequests = 10,
MaxRetryWaitTimeInSeconds = 60
}
};
var client = new DocumentClient(serviceEndpointUri, authKey, connectionPolicy);
await client.OpenAsync();
return client;
}
}
private List<TestDocument> GetDocuments(int count)
{
using (new CodeTimer())
{
return External.CreateDocuments(count);
}
}
private async Task UpsertDocumentsAsync(DocumentClient client, Uri documentCollectionUri, List<TestDocument> documents)
{
using (new CodeTimer())
{
//TODO: AJ: Parallelise
foreach (var document in documents)
{
await client.UpsertDocumentAsync(documentCollectionUri, document);
}
}
}
private List<TestDocument> GetAllDocuments(DocumentClient client, Uri documentCollectionUri)
{
using (new CodeTimer())
{
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri, new FeedOptions()
{
MaxItemCount = 1000
});
var documents = query.ToList();
return documents;
}
}
private Polygon GetArea()
{
//AJ: Longitude,Latitude i.e. X/Y
//AJ: Ring orientation matters
return new Polygon(new List<LinearRing>()
{
new LinearRing(new List<Position>()
{
new Position(1.8567 ,51.3814),
new Position(0.5329 ,51.4618),
new Position(0.2477 ,51.2588),
new Position(-0.5329 ,51.2579),
new Position(-1.17 ,51.2173),
new Position(-1.9062 ,51.1958),
new Position(-2.5434 ,51.1614),
new Position(-3.8672 ,51.139 ),
new Position(-4.1578 ,50.9137),
new Position(-4.5373 ,50.694 ),
new Position(-5.1496 ,50.3282),
new Position(-5.2212 ,49.9586),
new Position(-3.7049 ,50.142 ),
new Position(-2.1698 ,50.314 ),
new Position(0.4669 ,50.6976),
//AJ: Last point must be the same as first point
new Position(1.8567 ,51.3814)
})
});
}
private List<TestDocument> GetDocumentsInArea(DocumentClient client, Uri documentCollectionUri, Polygon area)
{
using (new CodeTimer())
{
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri, new FeedOptions()
{
MaxItemCount = 1000
})
.Where(document => document.Location.Intersects(area));
var documents = query.ToList();
return documents;
}
}
}
public class TestDocument : Resource
{
public string Name { get; set; }
public Point Location { get; set; } //AJ: Longitude,Latitude i.e. X/Y
public TestDocument()
{
this.Id = Guid.NewGuid().ToString("N");
}
}
//AJ: This should be "good enough". The times being recorded are seconds or minutes.
public class CodeTimer : IDisposable
{
private Action<TimeSpan> reportFunction;
private Stopwatch stopwatch = new Stopwatch();
public CodeTimer([CallerMemberName]string name = "")
: this((ellapsed) =>
{
Trace.TraceInformation($"{name} took {ellapsed}, or {ellapsed.TotalMilliseconds} ms.");
})
{ }
public CodeTimer(Action<TimeSpan> report)
{
this.reportFunction = report;
this.stopwatch.Start();
}
public void Dispose()
{
this.stopwatch.Stop();
this.reportFunction(this.stopwatch.Elapsed);
}
}
public class PropertyNameMapContractResolver : DefaultContractResolver
{
private Dictionary<string, string> propertyNameMap;
public PropertyNameMapContractResolver(Dictionary<string, string> propertyNameMap)
{
this.propertyNameMap = propertyNameMap;
}
protected override string ResolvePropertyName(string propertyName)
{
if (this.propertyNameMap.TryGetValue(propertyName, out string resolvedName))
return resolvedName;
return base.ResolvePropertyName(propertyName);
}
}
}
I was using a custom ContractResolver and that was evidently having a big impact on the performance of the DocumentDB classes from the .Net SDK.
This was how I was setting the ContractResolver:
JsonConvert.DefaultSettings = () =>
{
return new JsonSerializerSettings
{
ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>()
{
{ "ID", "id" }
})
};
};
And this is how it was implemented:
public class PropertyNameMapContractResolver : DefaultContractResolver
{
private Dictionary<string, string> propertyNameMap;
public PropertyNameMapContractResolver(Dictionary<string, string> propertyNameMap)
{
this.propertyNameMap = propertyNameMap;
}
protected override string ResolvePropertyName(string propertyName)
{
if (this.propertyNameMap.TryGetValue(propertyName, out string resolvedName))
return resolvedName;
return base.ResolvePropertyName(propertyName);
}
}
The solution was easy: don't set JsonConvert.DefaultSettings, so the ContractResolver isn't used.
Results:
I was able to perform my spatial query in 21799.0221 ms, which is 22 seconds.
Previously it took 170717.151 ms, which is 2 minutes 50 seconds.
That's about 8x faster!
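If I still needed the custom name mapping for my own types, one option (a sketch only; it assumes an SDK version that exposes this constructor overload, and I haven't measured whether it avoids the same performance hit) would be to scope the settings to the DocumentClient instead of setting the global default:
var serializerSettings = new JsonSerializerSettings
{
    ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>()
    {
        { "ID", "id" }
    })
};
var client = new DocumentClient(serviceEndpointUri, authKey, serializerSettings, connectionPolicy);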

How to write JSON to Event Hub correctly

I'm batching serialized records (in a JArray) to send to Event Hub. When I write the data to Event Hubs it seems to insert extra quotation marks around the JSON, i.e. what is written is "{"myjson":"blah"}" rather than {"myjson":"blah"}, so downstream I'm having trouble reading it.
Based on this guidance, I must convert JSON to string and then use GetBytes to pass it into an EventData object. I suspect my attempt at following this guidance is where my issue is arising.
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
public static class EventDataTransform
{
public static EventData ToEventData(dynamic eventObject, out int payloadSize)
{
string json = eventObject.ToString(Formatting.None);
payloadSize = Encoding.UTF8.GetByteCount(json);
var payload = Encoding.UTF8.GetBytes(json);
var eventData = new EventData(payload)
{
};
return eventData;
}
}
How should an item from a JArray containing serialized data be converted into the contents of an EventData message?
Code call location - used to batch parcels of up to 256 KB:
public bool MoveNext()
{
var batch = new List<EventData>(_allEvents.Count);
var batchSize = 0;
for (int i = _lastBatchedEventIndex; i < _allEvents.Count; i++)
{
dynamic evt = _allEvents[i];
int payloadSize = 0;
var eventData = EventDataTransform.ToEventData(evt, out payloadSize);
var eventSize = payloadSize + EventDataOverheadBytes;
if (batchSize + eventSize > MaxBatchSizeBytes)
{
break;
}
batch.Add(eventData);
batchSize += eventSize;
}
_lastBatchedEventIndex += batch.Count();
_currentBatch = batch;
return _currentBatch.Count() > 0;
}
Sounds like the JArray already contains serialized objects (strings). Calling .ToString(Formatting.None) will serialize it a second time (wrapping it in quotes).
Interestingly enough, if you call .ToString() without passing in a Formatting, it will not serialize it again.
This fiddle demonstrates this: https://dotnetfiddle.net/H4p6KL
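Roughly, the behaviour being described is the following (a sketch; the string content is just an illustration):
var alreadySerialized = JToken.FromObject("{\"myjson\":\"blah\"}"); // a JValue holding an already-serialized string
Console.WriteLine(alreadySerialized.ToString(Formatting.None));     // "{\"myjson\":\"blah\"}" (serialized a second time, quoted and escaped)
Console.WriteLine(alreadySerialized.ToString());                    // {"myjson":"blah"} (the raw string value)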

retrieving partial content using multiple http requsets to fetch data via parllel tasks

I am trying to be as thorough as I can in this post, as it is very important to me, though the issue itself is very simple; just by reading the title of this question you can get the idea.
The question is: with healthy bandwidth (30 Mb VDSL) available, how is it possible to issue multiple HttpWebRequests for a single piece of data / a single file, so that each request downloads only a portion of the data, and then, when all the instances have completed, the parts are joined back into one piece?
Code:
What I have working so far is the same idea, except that each task = one HttpWebRequest = a different file, so the speedup is pure task parallelism rather than the acceleration of one download using multiple tasks/threads, as in my question. See the code below.
The next part is just a more detailed explanation and some background on the subject, if you don't mind reading it.
I am still on a similar project that differs from this one in that it (see the code below) tries to fetch as many different data sources as possible, each in its own task (different downloads/files). The speedup was gained there because each task does not have to wait for the previous one to complete before it gets a chance to execute.
What I am trying to do in this question (with almost everything ready in the code below) is to target the same URL for the same data, so this time the speedup I want is for a single task - the current download.
The idea is to implement the same approach as in the code below, only this time have SmartWebClient target the same URL using multiple instances. Then (only theory for now) each instance would request a partial range of the data with its own request.
The last issue is that I need to "put the puzzle back into one piece" - another problem I need to figure out.
As you can see in this code, the part I have not worked on yet is only the data parsing/processing, which I find very easy using HtmlAgilityPack, so that is no problem.
current code
main entry:
var htmlDictionary = urlsForExtraction.urlsConcrDict();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
foreach (var pair in htmlDictionary)
{
///Process(pair);
MessageBox.Show(pair.Value);
}
public class urlsForExtraction
{
const string URL_Dollar= "";
const string URL_UpdateUsersTimeOut="";
public ConcurrentDictionary<string, string> urlsConcrDict()
{
//need to find the syntax to extract the field names so it would be possible to iterate over each one instead of specifying them
ConcurrentDictionary<string, string> retDict = new ConcurrentDictionary<string, string>();
retDict.TryAdd("URL_Dollar", "Any.Url.com");
retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
return retDict;
}
}
/// <summary>
/// second-stage class: consumes the dictionary of URLs for extraction,
/// then downloads each one via Parallel.ForEach using the SmartWebClient (Download()).
/// </summary>
public class InitConcurentHtmDictExtrct
{
private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
webClient.Encoding = Encoding.GetEncoding("UTF-8");
webClient.Proxy = null;
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
private ConcurrentDictionary<string, string> htmlDictionary;
public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(Dictionary<string, string> urlList)
{
htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
return htmlDictionary;
}
}
/// <summary>
/// the extraction process, done via HtmlAgilityPack;
/// easy to use for collecting information within a given HTML document by referencing element attributes
/// </summary>
public class Results
{
public struct ExtracionParameters
{
public string FileNameToSave;
public string directoryPath;
public string htmlElementType;
}
public enum Extraction
{
ById, ByClassName, ByElementName
}
public void ExtractHtmlDict(ConcurrentDictionary<string, string> htmlResults, Extraction by)
{
// helps with easy elements extraction from the page.
HtmlAttribute htAgPcAttrbs;
HtmlDocument HtmlAgPCDoc = new HtmlDocument();
/// will hold the name + content of each document part that was eventually extracted;
/// from this container the result page can then be built
Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();
foreach (KeyValuePair<string, string> htmlPair in htmlResults)
{
Process(htmlPair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.Proxy = null;
this.Encoding = Encoding.GetEncoding("UTF-8");
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
}
This allows me to take advantage of the good bandwidth, but I am far from the solution this question is about, and I would really appreciate any clue on where to start.
If the server supports what Wikipedia calls byte serving, you can multiplex a file download by spawning multiple requests, each with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
Here is some sample code that implements a parallel download of a file using byte range:
public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
{
if (uri == null)
throw new ArgumentNullException("uri");
// determine file size first
long size = GetFileSize(uri);
using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
file.SetLength(size); // set the length first
object syncObject = new object(); // synchronize file writes
Parallel.ForEach(LongRange(0, 1 + size / chunkSize), (start) =>
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
lock (syncObject)
{
using (Stream stream = response.GetResponseStream())
{
file.Seek(start * chunkSize, SeekOrigin.Begin);
stream.CopyTo(file);
}
}
});
}
}
public static long GetFileSize(string uri)
{
if (uri == null)
throw new ArgumentNullException("uri");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
return response.ContentLength;
}
private static IEnumerable<long> LongRange(long start, long count)
{
long i = 0;
while (true)
{
if (i >= count)
{
yield break;
}
yield return start + i;
i++;
}
}
And sample usage:
private static void TestParallelDownload()
{
string uri = "http://localhost/welcome.png";
string fileName = Path.GetFileName(uri);
ParallelDownloadFile(uri, fileName, 10000);
}
PS: I'd be curious to know if it's really more interesting to do this parallel thing rather than to just use WebClient.DownloadFile... Maybe in slow network scenarios?
