I found a osrm-machine and it returns a json string when i request. The json has some spesific information about 2 location and I am processing the json and getting the value of distance property for building distance matrix of these locations. I have more than 2000 location and this processing takes approximately 4 hours. I need to decrease the execution time with parallelism, but I am very new to the topic. Here is my work, what should I do for optimizing the parallel loop ? or maybe you can drive me to new approach. Thanks
var client = new RestClient("http://127.0.0.1:5000/route/v1/table/");
var watch = System.Diagnostics.Stopwatch.StartNew();
//rawCount = 2500
Parallel.For(0, rowCount, i =>
{
Parallel.For(0, rowCount, j =>
{
//request a server with spesific lats,longs
var request = new RestRequest(String.Format("{0},{1};{2},{3}", le.LocationList[i].longitude, le.LocationList[i].latitude,
le.LocationList[j].longitude, le.LocationList[j].latitude));
//reading the response and deserialize into object
var response = client.Execute<RootObject>(request);
//defining objem with the List of attributes in routes
var objem = response.Data.routes;
//this part reading all distances and durations in each response and take them into dist and dur matrixes.
Parallel.ForEach(objem, (o) =>
{
dist[i, j] = o.distance;
dur[i, j] = o.duration;
threads[i,j] = Thread.CurrentThread.ManagedThreadId;
Thread.Sleep(10);
});
});
});
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
I cleaned it up a bit. One use of parallelism is enough. There still is one loop that you need to look at, because it's overwriting data and I do not know what to do with it, that's your call.
You need to experiement with the maxThreads variable's value a bit. Normally, .NET would spin up enough threads so your processor can handle them, in your case you can spin up more because you know that all of them are just idling waiting for the network stack.
using System;
using System.Linq;
using System.Threading.Tasks;
namespace ConsoleApp7
{
class Program
{
static void Process(int i, int j)
{
var client = new RestClient("http://127.0.0.1:5000/route/v1/table/");
//request a server with spesific lats,longs
var request = new RestRequest(String.Format("{0},{1};{2},{3}", le.LocationList[i].longitude, le.LocationList[i].latitude, le.LocationList[j].longitude, le.LocationList[j].latitude));
//reading the response and deserialize into object
var response = client.Execute<RootObject>(request);
//defining objem with the List of attributes in routes
var objem = response.Data.routes;
//this part reading all distances and durations in each response and take them into dist and dur matrixes.
// !!!
// !!! THIS LOOP NEEDS TO GO.
// !!! IT MAKES NO SENSE!
// !!! YOU ARE OVERWRITING YOUR OWN DATA!
// !!!
Parallel.ForEach(objem, (o) =>
{
dist[i, j] = o.distance;
dur[i, j] = o.duration;
threads[i, j] = Thread.CurrentThread.ManagedThreadId;
Thread.Sleep(10);
});
}
static void Main(string[] args)
{
var watch = System.Diagnostics.Stopwatch.StartNew();
var rowCount = 2500;
var maxThreads = 100;
var allPairs = Enumerable.Range(0, rowCount).SelectMany(x => Enumerable.Range(0, rowCount).Select(y => new { X = x, Y = y }));
Parallel.ForEach(allPairs, new ParallelOptions { MaxDegreeOfParallelism = maxThreads }, pair => Process(pair.X, pair.Y));
watch.Stop();
var elapsedMs = watch.ElapsedMilliseconds;
}
}
}
Related
Below is my current code which gets 500 documents(JSON format) from the documentDB per call. I can only do 500 per search and adding it to a concurrent bag(in parallel). The data fetched is based on the id number I provide where to the API and picks it from that range. E.g. id = 500 [gets documents from 501 - 1000]. The below code fills concurrent bag with 25k documents as expected.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
id = 500;
while (id <= 25000)
{
docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
docClass lastdoc = docs.Documents.Last();
lastid = lastdoc.Id.Id;
Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
{
concurrentBag.Add(item);
});
id = id + 500;
}
}
I wanted to run this whole while loop in threading so that I can do a multiple call to API and fetch 500 documents parallely. I tried to modify the code as below but always I see only 500 documents still in the concurrent bag 'concurrentBag' after the whole run and the skip id stays at 500 and doesnt increment.
int threadNumber = 5;
var concurrentBag = new ConcurrentBag<docClass>();
if (batch == 25000)
{
id = 500;
Task[] tasks = new Task[threadNumber];
for (int j = 0; j < threadNumber; j++)
{
tasks[j] = Task.Run(async() =>
{
while (id <= 25000)
{
docs = await client.SearchDocuments<docClass>(GetFollowUpRequest(id), requestOptions);
docClass lastdoc = docs.Documents.Last();
lastid = lastdoc.Id.Id;
Parallel.ForEach(docs.Documents, new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount }, item =>
{
concurrentBag.Add(item);
});
id = id + 500;
}
});
}
}
Can you please help what am I doing wrong here?
For loading document from external resources use asynchronous approach without extra threads.
Note, that when you download external resources in parallel, extra threads doing no work, but just waiting for the response, so threads are just being wasted ;)
Asynchronous approach provide possibility to launch multiple requests almost simultaneously, without waiting for every task to complete, but wait only when all tasks are ready.
var maxDocuments = 25000;
var step = 500;
var documentTasks = Enumerable.Range(1, int.Max)
.Select(offset => step * offset)
.TakeWhile(id => id <= maxDocuments)
.Select(id => client.Search<docClass>(GetFollowUpRequest(id), requestOptions))
.ToArray();
await Task.WhenAll(documentTasks);
var allDocuments = documentTasks
.Select(task = task.Result)
.SelectMany(documents => documents)
.ToArray();
I'm trying to implement a multiple download function, that would fire and download like 10 files at the same time.
I implemented the IProgress interface too to let me know what's the progress for that.
What I need now is a simple counter inside each of them to say: this is download #1, this is download #2, etc..
I can't do a normal counter since they may all run at the same time and update the counter before me using/storing the value, causing me to have a wrong value or the same value for many of them.
I've looked into the Interlocked class, but I'm just not able to find a suitable implementation for that, that would store a specific number for a each async dynamic function that is triggered.
I'm using the number to store the progress in an array, so I'll just call like:
IProgress<double> progressHandler = new Progress<double>(p => HandleUnitProgressBar(p, downloadIndex));
and let the handler store the progress in the designated cell of the array.
Can anyone point me in the right direction?
Edit #1: Adding code I'm trying to use:
for (int i = 0; i < _downloadList.Count; i++)
{
var url = _downloadList.ToArray()[i];
Task.Factory.StartNew
(
async () =>
{
try
{
MegaApiClient client = new MegaApiClient();
//client.LoginAnonymous();
downloadIndex = i;
IProgress<double> progressHandler = new Progress<double>(p => HandleUnitProgressBar(p, downloadIndex));
await client.DownloadFileAsync(fileLink, url.Value, progressHandler);
}
catch (Exception e)
{
//will add later
}
}
, CancellationToken.None
, TaskCreationOptions.None
, TaskScheduler.Current
);
}
Problem with this code is that by the time it reaches downloadIndex = i, i is already 4 (for 4 simultaneous downloads) whereas I want to send 0,1,2,3 and not 4 to all the handlers.
I don't know if I fully understand your question, but I'll give it a shot. If I understand correctly, you are asking for an integer to be assigned to a request for tracking purposes right? I've done this before while doing a bit of stress testing and sends the requests out in waves of 100. The code looked a little like this:
static void Main(string[] args)
{
var app = new Program();
var i = 0;
var j = 0;
var tasks = new List<Task>();
while (j < 1000)
{
tasks.Add(app.CreateOpp(j));
Console.WriteLine(i);
if (i == 100)
{
var done = Task.WhenAll(tasks);
done.Wait();
i = 0;
tasks = new List<Task>();
}
i++;
j++;
}
}
private async Task CreateOpp(int i)
{
var client = new RestClient("https...");
var request = new RestRequest(Method.POST);
request.AddHeader("Authorization", "Bearer Token");
request.AddHeader("Content-Type", "application/json");
var response = await client.ExecuteTaskAsync(request);
Console.WriteLine(i + " Status: " + response.StatusCode);
Console.WriteLine();
}
After each task completes, instead of having it write to the console like I did on the line that says "Console.WriteLine(i + " Status: " + response.StatusCode);", you can probably just increment some value. Hope this helps or at least leads you down a new path!
Hi i am spidering the site and reading the contents.I want to keep the request rate reasonable. Up to approx 10 requests per second should probably be ok.Currently it is 5k request per minute and it is causing security issues as this looks to be a bot activity.
How to do this? Here is my code
protected void Iterareitems(List<Item> items)
{
foreach (var item in items)
{
GetImagesfromItem(item);
if (item.HasChildren)
{
Iterareitems(item.Children.ToList());
}
}
}
protected void GetImagesfromItem(Item childitems)
{
var document = new HtmlWeb().Load(completeurl);
var urls = document.DocumentNode.Descendants("img")
.Select(e => e.GetAttributeValue("src", null))
.Where(s => !string.IsNullOrEmpty(s)).ToList();
}
You need System.Threading.Semaphore, using which you can control the max concurrent threads/tasks. Here is an example:
var maxThreads = 3;
var semaphore = new Semaphore(maxThreads, maxThreads);
for (int i = 0; i < 10; i++) //10 tasks in total
{
var j = i;
Task.Factory.StartNew(() =>
{
semaphore.WaitOne();
Console.WriteLine("start " + j.ToString());
Thread.Sleep(1000);
Console.WriteLine("end " + j.ToString());
semaphore.Release();
});
}
You can see at most 3 tasks are working, others are pending by semaphore.WaitOne() because the maximum limit reached, and the pending thread will continue if another thread released the semaphore by semaphore.Release().
I have a database (RavenDB) which needs to be able to handle 300 queries (Full text search) every 10 seconds. To increase peformance I splitted up the database so I have multiple documentStores
my Code:
var watch = Stopwatch.StartNew();
int taskcnt = 0;
int sum = 0;
for (int i = 0; i < 11; i++)
{
Parallel.For(0, 7, new Action<int>((x) =>
{
for(int docomentStore = 0;docomentStore < 5; docomentStore++)
{
var stopWatch = Stopwatch.StartNew();
Task<IList<eBayItem>> task = new Task<IList<eBayItem>>(Database.ExecuteQuery, new Filter()
{
Store = "test" + docomentStore,
MaxPrice = 600,
MinPrice = 200,
BIN = true,
Keywords = new List<string>() { "Canon", "MP", "Black" },
ExcludedKeywords = new List<string>() { "G1", "T3" }
});
task.ContinueWith((list) => {
stopWatch.Stop();
sum += stopWatch.Elapsed.Milliseconds;
taskcnt++;
if (taskcnt == 300)
{
watch.Stop();
Console.WriteLine("Average time: " + (sum / (float)300).ToString());
Console.WriteLine("Total time: " + watch.Elapsed.ToString() + "ms");
}
});
task.Start();
}
}));
Thread.Sleep(1000);
}
Average query time: 514,13 ms
Total time: 00:01:29.9108016
The code where I query ravenDB:
public static IList<eBayItem> ExecuteQuery(object Filter)
{
IList<eBayItem> items;
Filter filter = (Filter)Filter;
if (int.Parse(filter.Store.ToCharArray().Last().ToString()) > 4)
{
Console.WriteLine(filter.Store); return null;
}
using (var session = Shards[filter.Store].OpenSession())
{
var query = session.Query<eBayItem, eBayItemIndexer>().Where(y => y.Price <= filter.MaxPrice && y.Price >= filter.MinPrice);
query = filter.Keywords.ToArray()
.Aggregate(query, (q, term) =>
q.Search(xx => xx.Title, term, options: SearchOptions.And));
if (filter.ExcludedKeywords.Count > 0)
{
query = filter.ExcludedKeywords.ToArray().Aggregate(query, (q, exterm) =>
q.Search(it => it.Title, exterm, options: SearchOptions.Not));
}
items = query.ToList<eBayItem>();
}
return items;
}
And the initialization of RavenDB:
static Dictionary<string, EmbeddableDocumentStore> Shards = new Dictionary<string, EmbeddableDocumentStore>();
public static void Connect()
{
Shards.Add("test0", new EmbeddableDocumentStore() { DataDirectory = "test.db" });
Shards.Add("test1", new EmbeddableDocumentStore() { DataDirectory = "test1.db" });
Shards.Add("test2", new EmbeddableDocumentStore() { DataDirectory = "test2.db" });
Shards.Add("test3", new EmbeddableDocumentStore() { DataDirectory = "test3.db" });
Shards.Add("test4", new EmbeddableDocumentStore() { DataDirectory = "test4.db" });
foreach (string key in Shards.Keys)
{
EmbeddableDocumentStore store = Shards[key];
store.Initialize();
IndexCreation.CreateIndexes(typeof(eBayItemIndexer).Assembly, store);
}
}
How can I optimize my code so my total time is lower ? Is it good to divide my database up in 5 different ones ?
EDIT: The program has only 1 documentStore instead of 5. (As sugested by Ayende Rahien)
Also this is the Query on its own:
Price_Range:[* TO Dx600] AND Price_Range:[Dx200 TO NULL] AND Title:(Canon) AND Title:(MP) AND Title:(Black) -Title:(G1) -Title:(T3)
No, this isn't good.
Use a single embedded RavenDB. If you need sharding, this involved multiple machines.
In general, RavenDB queries are in the few ms each. You need to show what your queries looks like (you can call ToString() on them to see that).
Having shards of RavenDB in this manner means that all of them are fighting for CPU and IO
I know this is an old post but this was the top search result I got.
I had the same problem that my queries were taking 500ms. It now takes 100ms by applying the following search practices: http://ravendb.net/docs/article-page/2.5/csharp/client-api/querying/static-indexes/searching
Suppose there are an arbitrary number of threads in my C# program. Each thread needs to look up the changeset ids for a particular path by looking up it's history. The method looks like so:
public List<int> GetIdsFromHistory(string path, VersionControlServer tfsClient)
{
IEnumerable submissions = tfsClient.QueryHistory(
path,
VersionSpec.Latest,
0,
RecursionType.None, // Assume that the path is to a file, not a directory
null,
null,
null,
Int32.MaxValue,
false,
false);
List<int> ids = new List<int>();
foreach(Changeset cs in submissions)
{
ids.Add(cs.ChangesetId);
}
return ids;
}
My question is, does each thread need it's own VersionControlServer instance or will one suffice? My intuition tells me that each thread needs its own instance since the TFS SDK uses webservices and I should probably have more than one connection open if I'm really going to get the parallel behavior. If I only use one connection, my intuition tells me that I'll get serial behavior even though I've got multiple threads.
If I need as many instances as there are threads, I think of using an Object-Pool pattern, but will the connections time out and close over a long period if not being used? The docs seem sparse in this regard.
It would appear that threads using the SAME client is the fastest option.
Here's the output from a test program that runs 4 tests 5 times each and returns the average result in milliseconds. Clearly using the same client across multiple threads is the fastest execution:
Parallel Pre-Alloc: Execution Time Average (ms): 1921.26044
Parallel AllocOnDemand: Execution Time Average (ms): 1391.665
Parallel-SameClient: Execution Time Average (ms): 400.5484
Serial: Execution Time Average (ms): 1472.76138
For reference, here's the test program itself (also on GitHub):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.TeamFoundation;
using Microsoft.TeamFoundation.Client;
using Microsoft.TeamFoundation.VersionControl.Client;
using System.Collections;
using System.Threading.Tasks;
using System.Diagnostics;
namespace QueryHistoryPerformanceTesting
{
class Program
{
static string TFS_COLLECTION = /* TFS COLLECTION URL */
static VersionControlServer GetTfsClient()
{
var projectCollectionUri = new Uri(TFS_COLLECTION);
var projectCollection = TfsTeamProjectCollectionFactory.GetTeamProjectCollection(projectCollectionUri, new UICredentialsProvider());
projectCollection.EnsureAuthenticated();
return projectCollection.GetService<VersionControlServer>();
}
struct ThrArg
{
public VersionControlServer tfc { get; set; }
public string path { get; set; }
}
static List<string> PATHS = new List<string> {
// ASSUME 21 FILE PATHS
};
static int NUM_RUNS = 5;
static void Main(string[] args)
{
var results = new List<TimeSpan>();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelPreAlloc());
}
Console.WriteLine("Parallel Pre-Alloc: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelAllocOnDemand());
}
Console.WriteLine("Parallel AllocOnDemand: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestParallelSameClient());
}
Console.WriteLine("Parallel-SameClient: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
results.Clear();
for (int i = NUM_RUNS; i > 0; i--)
{
results.Add(RunTestSerial());
}
Console.WriteLine("Serial: Execution Time Average (ms): " + results.Select(t => t.TotalMilliseconds).Average());
}
static TimeSpan RunTestParallelPreAlloc()
{
var paths = new List<ThrArg>();
paths.AddRange( PATHS.Select( x => new ThrArg { path = x, tfc = GetTfsClient() } ) );
return RunTestParallel(paths);
}
static TimeSpan RunTestParallelAllocOnDemand()
{
var paths = new List<ThrArg>();
paths.AddRange(PATHS.Select(x => new ThrArg { path = x, tfc = null }));
return RunTestParallel(paths);
}
static TimeSpan RunTestParallelSameClient()
{
var paths = new List<ThrArg>();
var _tfc = GetTfsClient();
paths.AddRange(PATHS.Select(x => new ThrArg { path = x, tfc = _tfc }));
return RunTestParallel(paths);
}
static TimeSpan RunTestParallel(List<ThrArg> args)
{
var allIds = new List<int>();
var stopWatch = new Stopwatch();
stopWatch.Start();
Parallel.ForEach(args, s =>
{
allIds.AddRange(GetIdsFromHistory(s.path, s.tfc));
}
);
stopWatch.Stop();
return stopWatch.Elapsed;
}
static TimeSpan RunTestSerial()
{
var allIds = new List<int>();
VersionControlServer tfsc = GetTfsClient();
var stopWatch = new Stopwatch();
stopWatch.Start();
foreach (string s in PATHS)
{
allIds.AddRange(GetIdsFromHistory(s, tfsc));
}
stopWatch.Stop();
return stopWatch.Elapsed;
}
static List<int> GetIdsFromHistory(string path, VersionControlServer tfsClient)
{
if (tfsClient == null)
{
tfsClient = GetTfsClient();
}
IEnumerable submissions = tfsClient.QueryHistory(
path,
VersionSpec.Latest,
0,
RecursionType.None, // Assume that the path is to a file, not a directory
null,
null,
null,
Int32.MaxValue,
false,
false);
List<int> ids = new List<int>();
foreach(Changeset cs in submissions)
{
ids.Add(cs.ChangesetId);
}
return ids;
}