I'm trying to implement bulk insert with this CosmosDB sample. The sample is built with .NET Core 3.* and uses System.Text.Json.
When using the CreateItemAsync method, it works perfectly:
var concurrentTasks = new List<Task<ItemResponse<Notification>>>();
foreach (var entity in entities)
{
entity.Id = GenerateId(entity);
var requestOptions = new ItemRequestOptions();
requestOptions.EnableContentResponseOnWrite = false; // We don't need the entire body returned.
concurrentTasks.Add(Container.CreateItemAsync(entity, new PartitionKey(entity.UserId), requestOptions));
}
await Task.WhenAll(concurrentTasks);
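For context, bulk execution is enabled on the client (more on that below). A minimal sketch of that setup, assuming SDK v3.4+ where CosmosClientOptions.AllowBulkExecution exists; the connection string and database/container names are placeholders:
// Bulk mode groups the concurrent CreateItemAsync calls into batched service requests behind the scenes.
var clientOptions = new CosmosClientOptions { AllowBulkExecution = true };
var cosmosClient = new CosmosClient(connectionString, clientOptions);
var container = cosmosClient.GetContainer("MyDatabase", "MyContainer");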
However, I'm trying to see if I can reduce the number of RUs by streaming the data directly into CosmosDB, hoping CosmosDB doesn't charge me for deserializing the JSON itself.
I'm working in .NET Core 2.1 with Newtonsoft.Json. This is my code, which does not return a successful status code; the sub-status code in the response header is "0".
Notification[] notifications = entities.ToArray();
var itemsToInsert = new Dictionary<PartitionKey, Stream>();
foreach (var notification in notifications)
{
MemoryStream ms = new MemoryStream();
StreamWriter writer = new StreamWriter(ms);
JsonTextWriter jsonWriter = new JsonTextWriter(writer);
JsonSerializer ser = new JsonSerializer();
ser.Serialize(jsonWriter, notification);
await jsonWriter.FlushAsync();
await writer.FlushAsync();
itemsToInsert.Add(new PartitionKey(notification.UserId), ms);
}
List<Task> tasks = new List<Task>(notifications.Length);
foreach (KeyValuePair<PartitionKey, Stream> item in itemsToInsert)
{
tasks.Add(Container.CreateItemStreamAsync(item.Value, item.Key)
.ContinueWith((Task<ResponseMessage> task) =>
{
using (ResponseMessage response = task.Result)
{
if (!response.IsSuccessStatusCode)
{
Console.WriteLine($"Received {response.StatusCode} ({response.ErrorMessage}).");
}
else
{
}
}
}));
}
// Wait until all are done
await Task.WhenAll(tasks);
response.StatusCode: BadRequest
response.ErrorMessage: null
I'm assuming I'm not serializing into the Stream correctly. Anyone got a clue?
Update
I discovered that the new System.Text.Json package also targets .NET Standard 2.0, so I installed it from NuGet. Now I can copy the sample code from the GitHub repository mentioned earlier.
Notification[] notifications = entities.ToArray();
var itemsToInsert = new List<Tuple<PartitionKey, Stream>>();
foreach (var notification in notifications)
{
notification.id = $"{notification.UserId}:{Guid.NewGuid()}";
MemoryStream stream = new MemoryStream();
await JsonSerializer.SerializeAsync(stream, notification);
itemsToInsert.Add(new Tuple<PartitionKey, Stream>(new PartitionKey(notification.RoleId), stream));
}
List<Task> tasks = new List<Task>(notifications.Length);
foreach (var item in itemsToInsert)
{
tasks.Add(Container.CreateItemStreamAsync(item.Item2, item.Item1)
.ContinueWith((Task<ResponseMessage> task) =>
{
using (ResponseMessage response = task.Result)
{
if (!response.IsSuccessStatusCode)
{
Console.WriteLine($"Received {response.StatusCode} ({response.ErrorMessage}).");
}
else
{
}
}
}));
}
// Wait until all are done
await Task.WhenAll(tasks);
I double-checked that BulkInsert is enabled (otherwise the first method wouldn't work either). Still I get a BadRequest and a null ErrorMessage.
I also checked whether the data was added to the container despite the BadRequest; it was not.
I found the problem.
I've setup my Cosmos Context with the following options:
var cosmosSerializationOptions = new CosmosSerializationOptions();
cosmosSerializationOptions.PropertyNamingPolicy = CosmosPropertyNamingPolicy.CamelCase;
CosmosClientOptions cosmosClientOptions = new CosmosClientOptions();
cosmosClientOptions.SerializerOptions = cosmosSerializationOptions;
Hence the camelCase convention. In my first (working) code sample, I let the CosmosDB context serialize the entity to JSON. It serializes with this camelCase convention, so my partition key property UserId is written as userId.
However, to reduce some RUs I am now using CreateItemStreamAsync, which makes me responsible for the serialization. And there was the mistake; my property was defined as:
public int UserId { get; set; }
So it was serialized to JSON as UserId: 1.
However, the partition key is defined as /userId. So if I add the JsonPropertyName attribute, it works:
[JsonPropertyName("userId")]
public int UserId { get; set; }
...if only an error message would tell me that.
Using CreateItemStreamAsync saves about 3% RU per request. That is not much, but over time it should slowly add up.
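As an alternative to attributing every property, System.Text.Json can also be told to camel-case all property names at serialization time, so the stream matches the container's /userId partition key path. A minimal sketch, assuming these options are applied everywhere the entity is serialized manually:
// Mirrors the CosmosClient's CamelCase serializer setting for hand-rolled serialization.
var camelCaseOptions = new JsonSerializerOptions
{
    PropertyNamingPolicy = JsonNamingPolicy.CamelCase
};
MemoryStream stream = new MemoryStream();
await JsonSerializer.SerializeAsync(stream, notification, camelCaseOptions);
stream.Position = 0; // rewind before handing the stream to CreateItemStreamAsync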
It looks like the stream is not readable, hence the bad request.
I would make a small modification to how the MemoryStream is created:
foreach (var notification in notifications)
{
itemsToInsert.Add(new PartitionKey(notification.UserId), new MemoryStream(Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(notification))));
}
Of course, I am using Newtonsoft.Json for JsonConvert.
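If you would rather keep the writer-based serialization from the question, rewinding the stream before handing it to the SDK should address the same readability concern. A sketch under that assumption:
MemoryStream ms = new MemoryStream();
using (var writer = new StreamWriter(ms, new UTF8Encoding(false), 1024, leaveOpen: true)) // no BOM
using (var jsonWriter = new JsonTextWriter(writer))
{
    new JsonSerializer().Serialize(jsonWriter, notification);
    jsonWriter.Flush(); // make sure everything reaches the MemoryStream
}
ms.Position = 0; // rewind so CreateItemStreamAsync reads from the start
itemsToInsert.Add(new PartitionKey(notification.UserId), ms);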
I want to serialize a C# object as JSON into a stream, but to avoid the serialization if the object is not valid according to a schema. How should I proceed with this task using JSON.NET and Json.NET Schema? From what I see there is no method in the JSON.NET library which allows the validation of a C# object against a JSON schema. It seems somewhat weird that there is no direct method to just validate the C# object without encoding it. Do you have any idea why this method is not available?
It seems this API is not currently available. At a guess, this might be because recursively generating the JSON values to validate involves most of the work of serializing the object. Or it could just be because no one at Newtonsoft ever designed, specified, implemented, tested, documented and shipped that feature.
If you want, you could file an enhancement request requesting this API, probably as a part of the SchemaExtensions class.
In the meantime, if you do need to test-validate a POCO without generating a complete serialization of it (because e.g. the result would be very large), you could grab NullJsonWriter from Reference to automatically created objects, wrap it in a JSchemaValidatingWriter and test-serialize your object as shown in Validate JSON with JSchemaValidatingWriter. NullJsonWriter doesn't actually write anything, and so using it eliminates the performance and memory overhead of generating a complete serialization (either as a string or as a JToken).
First, add the following static method:
public static class JsonExtensions
{
public static bool TestValidate<T>(T obj, JSchema schema, SchemaValidationEventHandler handler = null, JsonSerializerSettings settings = null)
{
using (var writer = new NullJsonWriter())
using (var validatingWriter = new JSchemaValidatingWriter(writer) { Schema = schema })
{
int count = 0;
if (handler != null)
validatingWriter.ValidationEventHandler += handler;
validatingWriter.ValidationEventHandler += (o, a) => count++;
JsonSerializer.CreateDefault(settings).Serialize(validatingWriter, obj);
return count == 0;
}
}
}
// Used to enable Json.NET to traverse an object hierarchy without actually writing any data.
class NullJsonWriter : JsonWriter
{
public NullJsonWriter()
: base()
{
}
public override void Flush()
{
// Do nothing.
}
}
Then use it like:
// Example adapted from
// https://www.newtonsoft.com/jsonschema/help/html/JsonValidatingWriterAndSerializer.htm
// by James Newton-King
string schemaJson = @"{
'description': 'A person',
'type': 'object',
'properties': {
'name': {'type':'string'},
'hobbies': {
'type': 'array',
'maxItems': 3,
'items': {'type':'string'}
}
}
}";
var schema = JSchema.Parse(schemaJson);
var person = new
{
Name = "James",
Hobbies = new [] { ".Net", "Blogging", "Reading", "XBox", "LOLCATS" },
};
var settings = new JsonSerializerSettings { ContractResolver = new CamelCasePropertyNamesContractResolver() };
var isValid = JsonExtensions.TestValidate(person, schema, (o, a) => Console.WriteLine(a.Message), settings);
// Prints Array item count 5 exceeds maximum count of 3. Path 'hobbies'.
Console.WriteLine("isValid = {0}", isValid);
// Prints isValid = False
Watch out for casing, by the way. Json.NET Schema is case sensitive, so you will need to use an appropriate contract resolver when test-validating.
Sample fiddle.
You cannot do that from the JSON string; you need an object and a schema to compare against first.
public void Validate()
{
    //...
    JsonSchema schema = JsonSchema.Parse("{'pattern':'lol'}");
    JToken stringToken = JToken.FromObject("pie");
    stringToken.Validate(schema); // throws a JsonSchemaException if the token does not match the schema
}
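If you just want a boolean result rather than an exception, the IsValid extension method can be used instead; a minimal sketch using the same (older) JsonSchema API:
JsonSchema schema = JsonSchema.Parse("{'pattern':'lol'}");
JToken stringToken = JToken.FromObject("pie");
bool valid = stringToken.IsValid(schema); // false - "pie" does not match the pattern 'lol'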
I wrote my latest update, and then got the following error from Stack Overflow: "Body is limited to 30000 characters; you entered 38676."
It's fair to say I have been very verbose in documenting my adventures, so I've rewritten what I have here to be more concise.
I have stored my (long) original post and updates on pastebin. I don't think many people will read them, but I put a lot of effort in to them so it'd be nice not to have them lost.
I have a collection which contains 100,000 documents for learning how to use CosmosDB and for things like performance testing.
Each of these documents has a Location property which is a GeoJSON Point.
According to the documentation, a GeoJSON point should be automatically indexed.
Azure Cosmos DB supports automatic indexing of Points, Polygons, and LineStrings
I've checked the Indexing Policy for my collection, and it has the entry for automatic point indexing:
{
"automatic":true,
"indexingMode":"Consistent",
"includedPaths":[
{
"path":"/*",
"indexes":[
...
{
"kind":"Spatial",
"dataType":"Point"
},
...
]
}
],
"excludedPaths":[ ]
}
I've been looking for a way to list, or otherwise interrogate the indexes that have been created, but I haven't found such a thing yet, so I haven't been able to confirm that this property definitely is being indexed.
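The closest I've come is reading the collection definition back through the SDK, which only echoes the configured policy rather than the physical indexes. A minimal sketch, assuming ReadDocumentCollectionAsync and the IndexingPolicy object model behave as I expect:
var collectionResponse = await client.ReadDocumentCollectionAsync(documentCollectionUri);
foreach (var includedPath in collectionResponse.Resource.IndexingPolicy.IncludedPaths)
{
    foreach (var index in includedPath.Indexes)
    {
        // Spatial index entries appear here alongside the hash/range ones.
        Console.WriteLine($"{includedPath.Path}: {index.Kind}");
    }
}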
I created a GeoJSON Polygon, and then used that to query my documents.
This is my query:
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri)
.Where(document => document.Type == this.documentType && document.Location.Intersects(target.Area));
And I then pass that query object to the following method so I can get the results while tracking the Request Units used:
protected async Task<IEnumerable<T>> QueryTrackingUsedRUsAsync(IQueryable<T> query)
{
var documentQuery = query.AsDocumentQuery();
var documents = new List<T>();
while (documentQuery.HasMoreResults)
{
var response = await documentQuery.ExecuteNextAsync<T>();
this.AddUsedRUs(response.RequestCharge);
documents.AddRange(response);
}
return documents;
}
The point locations are randomly chosen from 10s of millions of UK addresses, so they should have a fairly realistic spread.
The polygon is made up of 16 points (with the first and last point being the same), so it's not very complex. It covers most of the most southern part of the UK, from London down.
An example run of this query returned 8728 documents, using 3917.92 RU, in 170717.151 ms, which is just under 171 seconds, or just under 3 minutes.
3918 RU / 171 s = 22.91 RU/s
I currently have the Throughput (RU/s) set to the lowest value, at 400 RU/s.
It was my understanding that this is the reserved level you are guaranteed to get. You can "burst" above that level at times, but do that too frequently and you'll be throttled back to your reserved level.
The "query speed" of 23 RU/s is, obviously, much much lower than the Throughput setting of 400 RU/s.
I am running the client "locally" i.e. in my office, and not up in the Azure data center.
Each document is roughly 500 bytes (0.5 KB) in size.
So what's happening?
Am I doing something wrong?
Am I misunderstanding how my query is being throttled with regard to RU/s?
Is this the speed at which the GeoSpatial indexes operate, and so the best performance I'll get?
Is the GeoSpatial index not being used?
Is there a way I can view the created indexes?
Is there a way I can check if the index is being used?
Is there a way I can profile the query and get metrics about where time is being spent? e.g. how much time was spent looking up documents by their type, how much filtering them geospatially, and how much transferring the data.
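For that last point, one thing I plan to try is asking the service for query execution metrics; a sketch, assuming the SDK version I'm on exposes FeedOptions.PopulateQueryMetrics and FeedResponse.QueryMetrics:
var feedOptions = new FeedOptions
{
    MaxItemCount = 1000,
    PopulateQueryMetrics = true // ask the service to return per-partition execution metrics
};
var metricsQuery = client
    .CreateDocumentQuery<TestDocument>(documentCollectionUri, feedOptions)
    .Where(document => document.Type == this.documentType && document.Location.Intersects(target.Area))
    .AsDocumentQuery();
while (metricsQuery.HasMoreResults)
{
    var response = await metricsQuery.ExecuteNextAsync<TestDocument>();
    foreach (var partitionMetrics in response.QueryMetrics)
    {
        // Includes index lookup time, document load time, runtime execution time, etc.
        Console.WriteLine($"{partitionMetrics.Key}: {partitionMetrics.Value}");
    }
}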
UPDATE 1
Here's the polygon I'm using in the query:
Area = new Polygon(new List<LinearRing>()
{
new LinearRing(new List<Position>()
{
new Position(1.8567 ,51.3814),
new Position(0.5329 ,51.4618),
new Position(0.2477 ,51.2588),
new Position(-0.5329 ,51.2579),
new Position(-1.17 ,51.2173),
new Position(-1.9062 ,51.1958),
new Position(-2.5434 ,51.1614),
new Position(-3.8672 ,51.139 ),
new Position(-4.1578 ,50.9137),
new Position(-4.5373 ,50.694 ),
new Position(-5.1496 ,50.3282),
new Position(-5.2212 ,49.9586),
new Position(-3.7049 ,50.142 ),
new Position(-2.1698 ,50.314 ),
new Position(0.4669 ,50.6976),
new Position(1.8567 ,51.3814)
})
})
I have also tried reversing it (since ring orientation matters), but the query with the reversed polygon took significantly longer (I don't have the timing to hand) and returned 91272 items.
Also, the coordinates are specified as Longitude/Latitude, as this is how GeoJSON expects them (i.e. as X/Y), rather than the traditional order used when speaking of Latitude/Longitude.
The GeoJSON specification specifies longitude first and latitude second.
UPDATE 2
Here's the JSON for one of my documents:
{
"GeoTrigger": null,
"SeverityTrigger": -1,
"TypeTrigger": -1,
"Name": "13, LONSDALE SQUARE, LONDON, N1 1EN",
"IsEnabled": true,
"Type": 2,
"Location": {
"$type": "Microsoft.Azure.Documents.Spatial.Point, Microsoft.Azure.Documents.Client",
"type": "Point",
"coordinates": [
-0.1076407397346815,
51.53970315059827
]
},
"id": "0dc2c03e-082b-4aea-93a8-79d89546c12b",
"_rid": "EQttAMGhSQDWPwAAAAAAAA==",
"_self": "dbs/EQttAA==/colls/EQttAMGhSQA=/docs/EQttAMGhSQDWPwAAAAAAAA==/",
"_etag": "\"42001028-0000-0000-0000-594943fe0000\"",
"_attachments": "attachments/",
"_ts": 1497973747
}
UPDATE 3
I created a minimal reproduction of the issue, and found that the issue no longer occurred.
This indicated that the problem was indeed in my own code.
I set out to check all the differences between the original and the reproduction code, and eventually found that something that appeared fairly innocent to me was in fact having a big impact. Thankfully, that code wasn't needed at all, so it was an easy fix to simply stop using it.
At one point I was using a custom ContractResolver and I hadn't removed it once it was no longer needed.
Here's the offending reproduction code:
using System;
using System.Collections.Generic;
using System.Configuration;
using System.Diagnostics;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Spatial;
using Newtonsoft.Json;
using Newtonsoft.Json.Serialization;
namespace Repro.Cli
{
public class Program
{
static void Main(string[] args)
{
JsonConvert.DefaultSettings = () =>
{
return new JsonSerializerSettings
{
ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>()
{
{ "ID", "id" }
})
};
};
//AJ: Init logging
Trace.AutoFlush = true;
Trace.Listeners.Add(new ConsoleTraceListener());
Trace.Listeners.Add(new TextWriterTraceListener("trace.log"));
//AJ: Increase available threads
//AJ: https://learn.microsoft.com/en-us/azure/storage/storage-performance-checklist#subheading10
//AJ: https://github.com/Azure/azure-documentdb-dotnet/blob/master/samples/documentdb-benchmark/Program.cs
var minThreadPoolSize = 100;
ThreadPool.SetMinThreads(minThreadPoolSize, minThreadPoolSize);
//AJ: https://learn.microsoft.com/en-us/azure/cosmos-db/performance-tips
//AJ: gcServer enabled in app.config
//AJ: Prefer 32-bit disabled in project properties
//AJ: DO IT
var program = new Program();
Trace.TraceInformation($"Starting # {DateTime.UtcNow}");
program.RunAsync().Wait();
Trace.TraceInformation($"Finished # {DateTime.UtcNow}");
//AJ: Wait for user to exit
Console.WriteLine();
Console.WriteLine("Hit enter to exit...");
Console.ReadLine();
}
public async Task RunAsync()
{
using (new CodeTimer())
{
var client = await this.GetDocumentClientAsync();
var documentCollectionUri = UriFactory.CreateDocumentCollectionUri(ConfigurationManager.AppSettings["databaseID"], ConfigurationManager.AppSettings["collectionID"]);
//AJ: Prepare Test Documents
var documentCount = 10000; //AJ: 10,000
var documentsForUpsert = this.GetDocuments(documentCount);
await this.UpsertDocumentsAsync(client, documentCollectionUri, documentsForUpsert);
var allDocuments = this.GetAllDocuments(client, documentCollectionUri);
var area = this.GetArea();
var documentsInArea = this.GetDocumentsInArea(client, documentCollectionUri, area);
}
}
private async Task<DocumentClient> GetDocumentClientAsync()
{
using (new CodeTimer())
{
var serviceEndpointUri = new Uri(ConfigurationManager.AppSettings["serviceEndpoint"]);
var authKey = ConfigurationManager.AppSettings["authKey"];
var connectionPolicy = new ConnectionPolicy
{
ConnectionMode = ConnectionMode.Direct,
ConnectionProtocol = Protocol.Tcp,
RequestTimeout = new TimeSpan(1, 0, 0),
RetryOptions = new RetryOptions
{
MaxRetryAttemptsOnThrottledRequests = 10,
MaxRetryWaitTimeInSeconds = 60
}
};
var client = new DocumentClient(serviceEndpointUri, authKey, connectionPolicy);
await client.OpenAsync();
return client;
}
}
private List<TestDocument> GetDocuments(int count)
{
using (new CodeTimer())
{
return External.CreateDocuments(count);
}
}
private async Task UpsertDocumentsAsync(DocumentClient client, Uri documentCollectionUri, List<TestDocument> documents)
{
using (new CodeTimer())
{
//TODO: AJ: Parallelise
foreach (var document in documents)
{
await client.UpsertDocumentAsync(documentCollectionUri, document);
}
}
}
private List<TestDocument> GetAllDocuments(DocumentClient client, Uri documentCollectionUri)
{
using (new CodeTimer())
{
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri, new FeedOptions()
{
MaxItemCount = 1000
});
var documents = query.ToList();
return documents;
}
}
private Polygon GetArea()
{
//AJ: Longitude,Latitude i.e. X/Y
//AJ: Ring orientation matters
return new Polygon(new List<LinearRing>()
{
new LinearRing(new List<Position>()
{
new Position(1.8567 ,51.3814),
new Position(0.5329 ,51.4618),
new Position(0.2477 ,51.2588),
new Position(-0.5329 ,51.2579),
new Position(-1.17 ,51.2173),
new Position(-1.9062 ,51.1958),
new Position(-2.5434 ,51.1614),
new Position(-3.8672 ,51.139 ),
new Position(-4.1578 ,50.9137),
new Position(-4.5373 ,50.694 ),
new Position(-5.1496 ,50.3282),
new Position(-5.2212 ,49.9586),
new Position(-3.7049 ,50.142 ),
new Position(-2.1698 ,50.314 ),
new Position(0.4669 ,50.6976),
//AJ: Last point must be the same as first point
new Position(1.8567 ,51.3814)
})
});
}
private List<TestDocument> GetDocumentsInArea(DocumentClient client, Uri documentCollectionUri, Polygon area)
{
using (new CodeTimer())
{
var query = client
.CreateDocumentQuery<TestDocument>(documentCollectionUri, new FeedOptions()
{
MaxItemCount = 1000
})
.Where(document => document.Location.Intersects(area));
var documents = query.ToList();
return documents;
}
}
}
public class TestDocument : Resource
{
public string Name { get; set; }
public Point Location { get; set; } //AJ: Longitude,Latitude i.e. X/Y
public TestDocument()
{
this.Id = Guid.NewGuid().ToString("N");
}
}
//AJ: This should be "good enough". The times being recorded are seconds or minutes.
public class CodeTimer : IDisposable
{
private Action<TimeSpan> reportFunction;
private Stopwatch stopwatch = new Stopwatch();
public CodeTimer([CallerMemberName]string name = "")
: this((ellapsed) =>
{
Trace.TraceInformation($"{name} took {ellapsed}, or {ellapsed.TotalMilliseconds} ms.");
})
{ }
public CodeTimer(Action<TimeSpan> report)
{
this.reportFunction = report;
this.stopwatch.Start();
}
public void Dispose()
{
this.stopwatch.Stop();
this.reportFunction(this.stopwatch.Elapsed);
}
}
public class PropertyNameMapContractResolver : DefaultContractResolver
{
private Dictionary<string, string> propertyNameMap;
public PropertyNameMapContractResolver(Dictionary<string, string> propertyNameMap)
{
this.propertyNameMap = propertyNameMap;
}
protected override string ResolvePropertyName(string propertyName)
{
if (this.propertyNameMap.TryGetValue(propertyName, out string resolvedName))
return resolvedName;
return base.ResolvePropertyName(propertyName);
}
}
}
I was using a custom ContractResolver and that was evidently having a big impact on the performance of the DocumentDB classes from the .Net SDK.
This was how I was setting the ContractResolver:
JsonConvert.DefaultSettings = () =>
{
return new JsonSerializerSettings
{
ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>()
{
{ "ID", "id" }
})
};
};
And this is how it was implemented:
public class PropertyNameMapContractResolver : DefaultContractResolver
{
private Dictionary<string, string> propertyNameMap;
public PropertyNameMapContractResolver(Dictionary<string, string> propertyNameMap)
{
this.propertyNameMap = propertyNameMap;
}
protected override string ResolvePropertyName(string propertyName)
{
if (this.propertyNameMap.TryGetValue(propertyName, out string resolvedName))
return resolvedName;
return base.ResolvePropertyName(propertyName);
}
}
The solution was easy: don't set JsonConvert.DefaultSettings, so the ContractResolver is never used.
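If the mapping is still needed for your own serialization elsewhere, it can be scoped instead of set globally. A sketch, assuming a newer SDK where the DocumentClient constructor accepts JsonSerializerSettings (someObject is a placeholder):
// Keep the mapping out of JsonConvert.DefaultSettings so it no longer affects every serializer in the process.
var mySettings = new JsonSerializerSettings
{
    ContractResolver = new PropertyNameMapContractResolver(new Dictionary<string, string>() { { "ID", "id" } })
};
// Use it explicitly where it is needed...
string json = JsonConvert.SerializeObject(someObject, mySettings);
// ...or hand it to the client alone rather than to the whole process.
var client = new DocumentClient(serviceEndpointUri, authKey, mySettings, connectionPolicy);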
Results:
I was able to perform my spatial query in 21799.0221 ms, which is 22 seconds.
Previously it took 170717.151 ms, which is 2 minutes 50 seconds.
That's about 8x faster!
I'm batching serialized records (in a JArray) to send to Event Hubs. When I write the data to Event Hubs, it seems to insert extra speech marks around the JSON, i.e. what gets written is "{"myjson":"blah"}" rather than {"myjson":"blah"}, so downstream I'm having trouble reading it.
Based on this guidance, I must convert JSON to string and then use GetBytes to pass it into an EventData object. I suspect my attempt at following this guidance is where my issue is arising.
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
public static class EventDataTransform
{
public static EventData ToEventData(dynamic eventObject, out int payloadSize)
{
string json = eventObject.ToString(Formatting.None);
payloadSize = Encoding.UTF8.GetByteCount(json);
var payload = Encoding.UTF8.GetBytes(json);
var eventData = new EventData(payload)
{
};
return eventData;
}
}
How should an item from a JArray containing serialized data be converted into the contents of an EventData message?
Code call location - used when batching parcels of up to 256 KB:
public bool MoveNext()
{
var batch = new List<EventData>(_allEvents.Count);
var batchSize = 0;
for (int i = _lastBatchedEventIndex; i < _allEvents.Count; i++)
{
dynamic evt = _allEvents[i];
int payloadSize = 0;
var eventData = EventDataTransform.ToEventData(evt, out payloadSize);
var eventSize = payloadSize + EventDataOverheadBytes;
if (batchSize + eventSize > MaxBatchSizeBytes)
{
break;
}
batch.Add(eventData);
batchSize += eventSize;
}
_lastBatchedEventIndex += batch.Count();
_currentBatch = batch;
return _currentBatch.Count() > 0;
}
Sounds like the JArray already contains serialized objects (strings). Calling .ToString(Formatting.None) will serialize them a second time (wrapping each in quotes).
Interestingly enough, if you call .ToString() without passing in a Formatting, it does not serialize it again.
This fiddle demonstrates this: https://dotnetfiddle.net/H4p6KL
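A minimal sketch of the difference, assuming the JArray items are already-serialized JSON stored as string values:
var array = new JArray("{\"myjson\":\"blah\"}");
JToken item = array[0];
Console.WriteLine(item.ToString(Formatting.None)); // "{\"myjson\":\"blah\"}" - re-serialized, wrapped in quotes
Console.WriteLine(item.ToString());                // {"myjson":"blah"} - the raw string value
// So in ToEventData, taking the underlying string value avoids the double serialization before Encoding.UTF8.GetBytes(...).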
I am trying to be as thorough as I can in this post, as it is very important to me, though the issue itself is very simple, and you can get the idea just from reading the title of this question.
The question is:
With healthy bandwidth (30 Mb VDSL) available, how is it possible to issue multiple HttpWebRequests for a single piece of data / file, so that each request downloads only a portion of the data, and then, when all instances have completed, the parts are joined back into one piece?
Code:
What I have working so far is the same idea, except that each task = HttpWebRequest = a different file, so the speedup comes from plain task parallelism rather than from accelerating a single download with multiple tasks/threads, as in my question.
See the code below.
The next part is just a more detailed explanation and background on the subject, if you don't mind reading it.
I am still on a similar project that differs from this one in that it (see the code below) tries to fetch as many different data sources as possible, each in its own separate task (different downloads/files). The speedup was gained because each task does not have to wait for the previous one to complete before it gets a chance to execute.
What I am trying to do in this question (with almost everything ready in the code below) is to target the same URL for the same data, so this time the speedup to gain is for the single task - the current download.
The idea is to implement the same approach as in the code below, only this time have SmartWebClient target the same URL using multiple instances, and then (only theory for now) request partial content of the data, with one request per instance.
The last issue is that I need to "put the puzzle back into one piece" - another problem I need to figure out.
As you can see in this code, the only thing I have not got to work on yet is the data parsing/processing, which I find very easy using HtmlAgilityPack, so that's no problem.
current code
main entry:
var htmlDictionary = urlsForExtraction.urlsConcrDict();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
foreach (var pair in htmlDictionary)
{
///Process(pair);
MessageBox.Show(pair.Value);
}
public class urlsForExtraction
{
const string URL_Dollar= "";
const string URL_UpdateUsersTimeOut="";
public ConcurrentDictionary<string, string> urlsConcrDict()
{
//need to find the syntax to extract field names so it would be possible to iterate over each instead of specifying them
ConcurrentDictionary<string, string> retDict = new ConcurrentDictionary<string, string>();
retDict.TryAdd("URL_Dollar", "Any.Url.com");
retDict.TryAdd("URL_UpdateUserstbl", "http://bing.com");
return retDict;
}
}
/// <summary>
/// Second-stage class: consumes the dictionary of URLs for extraction,
/// then downloads each via Parallel.ForEach using the SmartWebClient (Download()).
/// </summary>
public class InitConcurentHtmDictExtrct
{
private void Download(string url, ConcurrentDictionary<string, string> htmlDictionary)
{
using (var webClient = new SmartWebClient())
{
webClient.Encoding = Encoding.GetEncoding("UTF-8");
webClient.Proxy = null;
htmlDictionary.TryAdd(url, webClient.DownloadString(url));
}
}
private ConcurrentDictionary<string, string> htmlDictionary;
public ConcurrentDictionary<string, string> LoopOnUrlsVia_SmartWC(Dictionary<string, string> urlList)
{
htmlDictionary = new ConcurrentDictionary<string, string>();
Parallel.ForEach(
urlList.Values,
new ParallelOptions { MaxDegreeOfParallelism = 20 },
url => Download(url, htmlDictionary)
);
return htmlDictionary;
}
}
/// <summary>
/// The extraction process, done via HtmlAgilityPack;
/// makes it easy to collect information from a given HTML document by referencing element attributes.
/// </summary>
public class Results
{
public struct ExtracionParameters
{
public string FileNameToSave;
public string directoryPath;
public string htmlElementType;
}
public enum Extraction
{
ById, ByClassName, ByElementName
}
public void ExtractHtmlDict(ConcurrentDictionary<string, string> htmlResults, Extraction by)
{
// helps with easy elements extraction from the page.
HtmlAttribute htAgPcAttrbs;
HtmlDocument HtmlAgPCDoc = new HtmlDocument();
/// will hold the name + content of each document part that was eventually extracted;
/// from this container the result page can then be built
Dictionary<string, HtmlDocument> dictResults = new Dictionary<string, HtmlDocument>();
foreach (KeyValuePair<string, string> htmlPair in htmlResults)
{
Process(htmlPair);
}
}
private static void Process(KeyValuePair<string, string> pair)
{
// do the html processing
}
}
public class SmartWebClient : WebClient
{
private readonly int maxConcurentConnectionCount;
public SmartWebClient(int maxConcurentConnectionCount = 20)
{
this.Proxy = null;
this.Encoding = Encoding.GetEncoding("UTF-8");
this.maxConcurentConnectionCount = maxConcurentConnectionCount;
}
protected override WebRequest GetWebRequest(Uri address)
{
var httpWebRequest = (HttpWebRequest)base.GetWebRequest(address);
if (httpWebRequest == null)
{
return null;
}
if (maxConcurentConnectionCount != 0)
{
httpWebRequest.ServicePoint.ConnectionLimit = maxConcurentConnectionCount;
}
return httpWebRequest;
}
}
}
This allows me to take advantage of good bandwidth, but I am far from the solution I'm after. I would really appreciate any clue on where to start.
If the server supports what Wikipedia calls byte serving, you can multiplex a file download by spawning multiple requests, each with a specific Range header value (using the AddRange method; see also How to download the data from the server discontinuously?). Most serious HTTP servers do support byte ranges.
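One way to check whether a given server advertises byte-range support before going parallel (a sketch, separate from the code below):
HttpWebRequest headRequest = (HttpWebRequest)WebRequest.Create(uri);
headRequest.Method = "HEAD";
using (HttpWebResponse headResponse = (HttpWebResponse)headRequest.GetResponse())
{
    // "Accept-Ranges: bytes" indicates the server supports byte serving;
    // if it's missing or "none", fall back to a single download.
    bool supportsRanges = string.Equals(headResponse.Headers["Accept-Ranges"], "bytes", StringComparison.OrdinalIgnoreCase);
    Console.WriteLine($"Byte ranges supported: {supportsRanges}");
}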
Here is some sample code that implements a parallel download of a file using byte range:
public static void ParallelDownloadFile(string uri, string filePath, int chunkSize)
{
if (uri == null)
throw new ArgumentNullException("uri");
// determine file size first
long size = GetFileSize(uri);
using (FileStream file = new FileStream(filePath, FileMode.Create, FileAccess.Write, FileShare.Write))
{
file.SetLength(size); // set the length first
object syncObject = new object(); // synchronize file writes
Parallel.ForEach(LongRange(0, 1 + size / chunkSize), (start) =>
{
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.AddRange(start * chunkSize, start * chunkSize + chunkSize - 1);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
lock (syncObject)
{
using (Stream stream = response.GetResponseStream())
{
file.Seek(start * chunkSize, SeekOrigin.Begin);
stream.CopyTo(file);
}
}
});
}
}
public static long GetFileSize(string uri)
{
if (uri == null)
throw new ArgumentNullException("uri");
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);
request.Method = "HEAD";
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
return response.ContentLength;
}
private static IEnumerable<long> LongRange(long start, long count)
{
long i = 0;
while (true)
{
if (i >= count)
{
yield break;
}
yield return start + i;
i++;
}
}
And sample usage:
private static void TestParallelDownload()
{
string uri = "http://localhost/welcome.png";
string fileName = Path.GetFileName(uri);
ParallelDownloadFile(uri, fileName, 10000);
}
PS: I'd be curious to know if it's really more interesting to do this parallel thing rather than to just use WebClient.DownloadFile... Maybe in slow network scenarios?