Parallel.ForEach blocking calling method - C#

I am having a problem with Parallel.ForEach. I have written a simple application that adds file names to a queue, then iterates through the queue with a while loop, downloading one file at a time. When a file has been downloaded, another async method is called to create objects from the downloaded MemoryStream. The returned task of that method is not awaited; it is discarded, so the next download starts immediately. Everything works fine if I use a simple foreach for the object creation - objects are created while the download continues. But if I try to speed up object creation by using Parallel.ForEach, it stops the download process until the objects are created. The UI is fully responsive, but it just won't download the next object. I don't understand why this is happening - the Parallel.ForEach is inside await Task.Run(), and to my limited knowledge of asynchronous programming that should do the trick. Can anyone help me understand why it blocks the first method and how to avoid it?
Here is a small sample:
public async Task DownloadFromCloud(List<string> constructNames)
{
    _downloadDataQueue = new Queue<string>();
    var _gcsClient = StorageClient.Create();
    foreach (var item in constructNames)
    {
        _downloadDataQueue.Enqueue(item);
    }
    while (_downloadDataQueue.Count > 0)
    {
        var memoryStream = new MemoryStream();
        await _gcsClient.DownloadObjectAsync("companyprojects",
            _downloadDataQueue.Peek(), memoryStream);
        memoryStream.Position = 0;
        _ = ReadFileXml(memoryStream);
        _downloadDataQueue.Dequeue();
    }
}
private async Task ReadFileXml(MemoryStream memoryStream)
{
    var reader = new XmlReader();
    var properties = reader.ReadXmlTest(memoryStream);
    await Task.Run(() =>
    {
        var entityList = new List<Entity>();
        foreach (var item in properties)
        {
            entityList.Add(CreateObjectsFromDownloadedProperties(item));
        }
        //Parallel.ForEach(properties, item =>
        //{
        //    entityList.Add(CreateObjectsFromDownloadedProperties(item));
        //});
    });
}
EDIT
This is a simplified object creation method:
public Entity CreateObjectsFromDownloadedProperties(RebarProperties properties)
{
    var path = new LinearPath(properties.Path);
    var section = new Region(properties.Region);
    var sweep = section.SweepAsMesh(path, 1);
    return sweep;
}

The returned task of that method is not awaited; it is discarded, so the next download starts immediately.
This is also dangerous. "Fire and forget" means "I don't care when this operation completes, or if it completes. Just discard all exceptions because I don't care." So fire-and-forget should be extremely rare in practice. It's not appropriate here.
The UI is fully responsive, but it just won't download the next object.
I have no idea why it would block the downloads, but there's a definite problem in switching to Parallel.ForEach: List<T>.Add is not thread-safe.
private async Task ReadFileXml(MemoryStream memoryStream)
{
    var reader = new XmlReader();
    var properties = reader.ReadXmlTest(memoryStream);
    await Task.Run(() =>
    {
        var entityList = new List<Entity>();
        Parallel.ForEach(properties, item =>
        {
            var itemToAdd = CreateObjectsFromDownloadedProperties(item);
            lock (entityList) { entityList.Add(itemToAdd); }
        });
    });
}
One tip: if you have a result value, PLINQ is often cleaner than Parallel:
private async Task ReadFileXml(MemoryStream memoryStream)
{
    var reader = new XmlReader();
    var properties = reader.ReadXmlTest(memoryStream);
    await Task.Run(() =>
    {
        var entityList = properties
            .AsParallel()
            .Select(CreateObjectsFromDownloadedProperties)
            .ToList();
    });
}
However, the code still suffers from the fire-and-forget problem.
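A minimal way to address just that part (a sketch only, reusing the question's method names) would be to keep the processing tasks and await them before the method returns, so downloads still overlap with processing but nothing is fired and forgotten:
public async Task DownloadFromCloud(List<string> constructNames)
{
    var gcsClient = StorageClient.Create();
    var processingTasks = new List<Task>();

    foreach (var constructName in constructNames)
    {
        var memoryStream = new MemoryStream();
        await gcsClient.DownloadObjectAsync("companyprojects", constructName, memoryStream);
        memoryStream.Position = 0;

        // Start processing, but keep the task instead of discarding it.
        processingTasks.Add(ReadFileXml(memoryStream));
    }

    // Downloads still overlap with processing, but exceptions are now observed.
    await Task.WhenAll(processingTasks);
}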
For a better fix, I'd recommend taking a step back and using something more suited to "pipeline"-style processing. E.g., TPL Dataflow:
public async Task DownloadFromCloud(List<string> constructNames)
{
    // Set up the pipeline.
    var gcsClient = StorageClient.Create();
    var downloadBlock = new TransformBlock<string, MemoryStream>(async constructName =>
    {
        var memoryStream = new MemoryStream();
        await gcsClient.DownloadObjectAsync("companyprojects", constructName, memoryStream);
        memoryStream.Position = 0;
        return memoryStream;
    });
    var processBlock = new TransformBlock<MemoryStream, List<Entity>>(memoryStream =>
    {
        var reader = new XmlReader();
        var properties = reader.ReadXmlTest(memoryStream);
        return properties
            .AsParallel()
            .Select(CreateObjectsFromDownloadedProperties)
            .ToList();
    });
    var resultsBlock = new ActionBlock<List<Entity>>(entities => { /* TODO */ });
    downloadBlock.LinkTo(processBlock, new DataflowLinkOptions { PropagateCompletion = true });
    processBlock.LinkTo(resultsBlock, new DataflowLinkOptions { PropagateCompletion = true });

    // Push data into the pipeline.
    foreach (var constructName in constructNames)
        await downloadBlock.SendAsync(constructName);
    downloadBlock.Complete();

    // Wait for pipeline to complete.
    await resultsBlock.Completion;
}
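As a side note (not something the code above configures), each block can also take ExecutionDataflowBlockOptions if you want to cap its parallelism or buffer size; the values below are only illustrative:
var processBlock = new TransformBlock<MemoryStream, List<Entity>>(
    memoryStream =>
    {
        var reader = new XmlReader();
        var properties = reader.ReadXmlTest(memoryStream);
        return properties
            .AsParallel()
            .Select(CreateObjectsFromDownloadedProperties)
            .ToList();
    },
    new ExecutionDataflowBlockOptions
    {
        MaxDegreeOfParallelism = 2, // process at most two files at a time
        BoundedCapacity = 4         // SendAsync waits while the block's buffer is full
    });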

Related

Is my approach correct for concurrent network requests?

I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - The private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
    "https://example.com/items?id=ID_HERE",
    "https://another_example.com/items?id=ID_HERE"
};
private readonly List<SemaphoreSlim> _semaphores;

public Crawler() {
    ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
    _client = new HttpClient();
    _semaphores = new List<SemaphoreSlim>();
    foreach (var _ in _services) {
        _semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
    }
}
Single HttpClient instance.
_services is just a string array that contains the URLs; they are not on the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to rely on the network queue (I don't remember what it's called).
2 - The Run method, which is the one I will call to start crawling.
public async Task Run(List<int> ids) {
    const int BATCH_COUNT = 1000;
    var svcIndex = 0;
    var tasks = new List<Task<string>>(BATCH_COUNT);
    foreach (var itemId in ids) {
        tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
        if (++svcIndex >= _services.Length) {
            svcIndex = 0;
        }
        if (tasks.Count >= BATCH_COUNT) {
            var results = await Task.WhenAll(tasks);
            await SaveDownloadedData(results);
            tasks.Clear();
        }
    }
    if (tasks.Count > 0) {
        var results = await Task.WhenAll(tasks);
        await SaveDownloadedData(results);
        tasks.Clear();
    }
}
DownloadItem is an async function that actually makes the GET request; note that I'm not awaiting it here.
If the number of tasks reaches BATCH_COUNT, I await them all to complete and save the results to a file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
    var needReleaseSemaphore = true;
    var result = string.Empty;
    try {
        await _semaphores[serviceIndex].WaitAsync();
        var r = await _client.GetStringAsync(link);
        _semaphores[serviceIndex].Release();
        needReleaseSemaphore = false;

        // DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
        var obj = JObject.Parse(r);
        if (obj.ContainsKey("blah")) {
            obj.Remove("blah");
        }
        result = obj.ToString(Formatting.None);
    } catch {
        result = string.Empty;
        // SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
        // IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
        if (!needReleaseSemaphore) {
            await _semaphores[serviceIndex].WaitAsync();
            needReleaseSemaphore = true;
        }
        await Task.Delay(60_000);
    } finally {
        // RELEASE THE SEMAPHORE, IF NEEDED.
        if (needReleaseSemaphore) {
            _semaphores[serviceIndex].Release();
        }
    }
    return result;
}
4 - The function that saves the result.
private async Task SaveDownloadedData(IEnumerable<string> myData) {
    using var fs = new FileStream("./output.dat", FileMode.Append);
    foreach (var res in myData) {
        var blob = Encoding.UTF8.GetBytes(res);
        await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
        await fs.WriteAsync(blob);
    }
    await fs.DisposeAsync();
}
5 - Finally, the Main function.
static async Task Main(string[] args) {
    var crawler = new Crawler();
    var items = LoadItemIds();
    await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests; it will take some weeks/months to gather all the data I need (due to the connection limit).
After 12-14 hours it just stops and I need to manually restart the app (memory usage is OK; my VPS has 1 GB and it never uses more than 60%).

Never coming back from Task.WaitAll [duplicate]

This question already has answers here:
An async/await example that causes a deadlock
(5 answers)
Closed 2 years ago.
Still coming up to speed on Xamarin and C#.
I have some code like:
List<Task<int>> taskList = new List<Task<int>>();

ConfigEntry siteId = new ConfigEntry
{
    ConfigKey = KEY_SITE_ID,
    ConfigValue = siteInfo.siteId
};
taskList.Add(ConfigDatabase.SaveConfigAsync(siteId));

ConfigEntry productId = new ConfigEntry
{
    ConfigKey = KEY_PRODUCT_ID1,
    ConfigValue = siteInfo.products[0].productId.ToString()
};
taskList.Add(ConfigDatabase.SaveConfigAsync(productId));
There's a total of nine of these getting added to taskList. Each of these inserts stuff into SQLite. Here is the code being run:
public async Task<int> SaveConfigAsync(ConfigEntry entry)
{
    if (entry.ConfigKey == null)
    {
        throw new Square1Exception("Config entry key not defined:" + entry);
    }
    else
    {
        try
        {
            ConfigEntry existing = await GetConfigAsync(entry.ConfigKey);
            if (existing == null)
            {
                return await _database.InsertAsync(entry);
            }
            else
            {
                existing.UpdateFrom(entry);
                return await _database.UpdateAsync(entry);
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine("Error while saving value:" + entry.ConfigKey);
            throw ex;
        }
    }
}
So at the end of building this taskList, I have the following line:
Task.WaitAll(taskList.ToArray());
I had hoped this would wait until all of the adds completed before continuing. Instead, it never comes back from this call; it just hangs my whole app. I'm not seeing anything in the log either. Does it (potentially) start the task when created, or wait until something like WaitAll?
If I replace each of the adds with an await and single-thread them, it works fine. Maybe it's blocking on the database or disk?
Ideas?
You shouldn't block on asynchronous code.
The best fix is to change Task.WaitAll(taskList.ToArray()); to await Task.WhenAll(taskList);.
If you must block, then you can use Task.Run to push the work to background threads, as such:
taskList.Add(Task.Run(() => ConfigDatabase.SaveConfigAsync(siteId)));
...
taskList.Add(Task.Run(() => ConfigDatabase.SaveConfigAsync(productId)));
But then you would be blocking your UI thread at the Task.WaitAll, so I don't recommend that approach. Using await Task.WhenAll is better.
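In context, that recommended one-line change looks like this (assuming the containing method is, or can be made, async):
// instead of: Task.WaitAll(taskList.ToArray());
int[] results = await Task.WhenAll(taskList);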
You are missing an await when waiting for the tasks - use await Task.WhenAll(tasks) instead of the blocking Task.WaitAll.
Try the following solution:
ConfigEntry siteId = new ConfigEntry
{
    ConfigKey = KEY_SITE_ID,
    ConfigValue = siteInfo.siteId
};
ConfigEntry productId = new ConfigEntry
{
    ConfigKey = KEY_PRODUCT_ID1,
    ConfigValue = siteInfo.products[0].productId.ToString()
};

var insertResultFirstTask = ConfigDatabase.SaveConfigAsync(siteId);
var insertResultSecondTask = ConfigDatabase.SaveConfigAsync(productId);

IEnumerable<Task> tasks = new List<Task>() {
    insertResultFirstTask,
    insertResultSecondTask
};

await Task.WhenAll(tasks);

var insertResultFirst = insertResultFirstTask.Result;
var insertResultSecond = insertResultSecondTask.Result;

ConcurrentBag is empty outside of Parallel.ForEach?

My mediaBag is getting populated inside the Parallel.ForEach, but by the time execution reaches the bottom line, mediaBag is empty. Why?
var mediaBag = new ConcurrentBag<MediaDto>();
Parallel.ForEach(mediaList,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    async media =>
    {
        var imgBytes = await this.blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image);
        var fileContent = Convert.ToBase64String(imgBytes);
        var image = new MediaDto()
        {
            ImageId = media.MediaID,
            Title = media.Title,
            Description = media.Description,
            ImageContent = fileContent
        };
        mediaBag.Add(image);
    });
return mediaBag.ToList();
Is this because my blob storage function is not thread-safe? What would that mean, and what is the solution if that is the case?
Parallel.ForEach doesn't work well with async actions.
You could start and store the tasks returned by ReadMedia in an array and then wait for them all to complete using Task.WhenAll before you create the MediaDto objects in parallel. Something like this:
var mediaBag = new ConcurrentBag<MediaDto>();

// Start all downloads, keeping each task paired with its media item.
var downloads = mediaList
    .Select(media => (media, task: blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image)))
    .ToArray();
await Task.WhenAll(downloads.Select(d => d.task));

Parallel.ForEach(downloads,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    download =>
    {
        var fileContent = Convert.ToBase64String(download.task.Result);
        var image = new MediaDto()
        {
            ImageId = download.media.MediaID,
            Title = download.media.Title,
            Description = download.media.Description,
            ImageContent = fileContent
        };
        mediaBag.Add(image);
    });
return mediaBag.ToList();
Parallelism isn't concurrency. Parallel.ForEach is meant for data parallelism, not executing concurrent actions. It partitions the input data and uses as many worker tasks as there are cores to process one partition each. It doesn't work at all with asynchronous methods because that would defeat its very purpose.
What you're asking for is concurrent operations - e.g., downloading 100 files, 4 or 6 at a time. One way would be to just launch all 100 tasks and wait for them to finish. That's a bit extreme and will probably flood the network connection.
A better way to do this would be to use a TPL Dataflow block like TransformBlock with a specific degree of parallelism (DOP), e.g.:
var options = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 };
var buffer = new BufferBlock<MediaDto>();
var block = new TransformBlock<ThatMedia, MediaDto>(async media =>
{
    var imgBytes = await this.blobStorageService.ReadMedia(media.BlobID, Enums.MediaType.Image);
    var fileContent = Convert.ToBase64String(imgBytes);
    var image = new MediaDto()
    {
        ImageId = media.MediaID,
        Title = media.Title,
        Description = media.Description,
        ImageContent = fileContent
    };
    return image;
}, options);
block.LinkTo(buffer);
After that, you can start posting entries to the block.
foreach (var entry in mediaList)
{
    block.Post(entry);
}
block.Complete();
await block.Completion;

if (buffer.TryReceiveAll(out var theNewList))
{
    ...
}
Thanks for the advice, I believe I may have misunderstood the Parallel.ForEach use case.
I have modified the function to use a list of tasks instead and it works very nicely. Below are the changes I made.
var mediaBag = new ConcurrentBag<MediaDto>();
IEnumerable<Task> mediaTasks = mediaList.Select(async m =>
{
    var imgBytes = await this.blobStorageService.ReadMedia(m.BlobID, Enums.MediaType.Image);
    var fileContent = Convert.ToBase64String(imgBytes);
    var image = new MediaDto()
    {
        ImageId = m.MediaID,
        Title = m.Title,
        Description = m.Description,
        ImageContent = fileContent
    };
    mediaBag.Add(image);
});
await Task.WhenAll(mediaTasks);
return mediaBag.ToList();
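One caveat with this version (not raised in the thread): it starts every download at once. If that ever floods the connection, as the earlier answer warned, a SemaphoreSlim can cap the concurrency. A sketch of that variation, with an arbitrary limit of 8, replacing the Select above:
var throttler = new SemaphoreSlim(8); // arbitrary concurrency limit
IEnumerable<Task> mediaTasks = mediaList.Select(async m =>
{
    await throttler.WaitAsync();
    try
    {
        var imgBytes = await this.blobStorageService.ReadMedia(m.BlobID, Enums.MediaType.Image);
        mediaBag.Add(new MediaDto
        {
            ImageId = m.MediaID,
            Title = m.Title,
            Description = m.Description,
            ImageContent = Convert.ToBase64String(imgBytes)
        });
    }
    finally
    {
        throttler.Release();
    }
});
await Task.WhenAll(mediaTasks);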

Processing large number of tasks concurrently and asynchronously

I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these URLs in parallel while adhering to the provider's rules.
This is my current code:
static void Main(string[] args)
{
    process_urls().GetAwaiter().GetResult();
}

public static async Task process_urls()
{
    // let's say there is a list of 50,000+ URLs
    var urls = System.IO.File.ReadAllLines("urls.txt");
    var allTasks = new List<Task>();
    var throttler = new SemaphoreSlim(initialCount: 5);
    foreach (var url in urls)
    {
        await throttler.WaitAsync();
        allTasks.Add(
            Task.Run(async () =>
            {
                try
                {
                    Console.WriteLine(String.Format("Starting {0}", url));
                    var client = new HttpClient();
                    var xml = await client.GetStringAsync(url);
                    //do some processing on xml output
                    client.Dispose();
                }
                finally
                {
                    throttler.Release();
                }
            }));
    }
    await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object for the target web service, but this is just to keep the code generic.
Is this the correct approach to handle and process a huge list of connections? And is there any way I can limit the number of established connections per second to 5, since the current implementation does not consider any timeframe?
Thanks
Reading values from a web service is an I/O operation, which can be done asynchronously without multithreading.
The threads would do nothing but wait for the response in this case, so using Parallel is just a waste of resources.
public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");
    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            var task = ProcessUrl(url);
            tasks.Add(task);
        }
        // This delay makes sure the next 5 urls are used only after 1 second
        tasks.Add(Task.Delay(1000));
        await Task.WhenAll(tasks.ToArray());
    }
}

private static async Task ProcessUrl(string url)
{
    using (var client = new HttpClient())
    {
        var xml = await client.GetStringAsync(url);
        //do some processing on xml output
    }
}

private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    var count = 0;
    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];
        group[count] = url;
        count++;
        if (count < GROUP_SIZE)
            continue;
        yield return group;
        group = null;
        count = 0;
    }
    if (group != null && count > 0)
    {
        // Return only the URLs actually filled into the last, partial group.
        yield return group.Take(count);
    }
}
Because you mention that "processing" the response is also an I/O operation, the async/await approach is the most efficient: it uses only a few threads and works on other tasks while previous tasks are waiting for a response from the web service or for file-write I/O to complete.

How to use HttpClient PostAsync() with threadpool in C#?

I'm using the following code to post an image to a server.
var image = Image.FromFile(@"C:\Image.jpg");
Task<string> upload = Upload(image);
upload.Wait();

public static async Task<string> Upload(Image image)
{
    var uriBuilder = new UriBuilder
    {
        Host = "somewhere.net",
        Path = "/path/",
        Port = 443,
        Scheme = "https",
        Query = "process=false"
    };

    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("locale", "en_US");
        client.DefaultRequestHeaders.Add("country", "US");

        var content = ConvertToHttpContent(image);
        content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");

        using (var mpcontent = new MultipartFormDataContent("--myFakeDividerText--")
            {
                { content, "fakeImage", "myFakeImageName.jpg" }
            })
        {
            using (var message = await client.PostAsync(uriBuilder.Uri, mpcontent))
            {
                var input = await message.Content.ReadAsStringAsync();
                return "nothing for now";
            }
        }
    }
}
I'd like to modify this code to run multiple threads. I've used "ThreadPool.QueueUserWorkItem" before and started to modify the code to leverage it.
private void UseThreadPool()
{
    int minWorker, minIOC;
    ThreadPool.GetMinThreads(out minWorker, out minIOC);
    ThreadPool.SetMinThreads(1, minIOC);

    int maxWorker, maxIOC;
    ThreadPool.GetMaxThreads(out maxWorker, out maxIOC);
    ThreadPool.SetMinThreads(4, maxIOC);

    var events = new List<ManualResetEvent>();
    foreach (var image in ImageCollection)
    {
        var resetEvent = new ManualResetEvent(false);
        ThreadPool.QueueUserWorkItem(
            arg =>
            {
                var img = Image.FromFile(image.getPath());
                Task<string> upload = Upload(img);
                upload.Wait();
                resetEvent.Set();
            });
        events.Add(resetEvent);

        if (events.Count <= 0) continue;
        foreach (ManualResetEvent e in events) e.WaitOne();
    }
}
The problem is that only one thread executes at a time due to the call to "upload.Wait()". So I'm still executing each thread in sequence. It's not clear to me how I can use PostAsync with a thread-pool.
How can I post images to a server using multiple threads by tweaking the code above? Is HttpClient PostAsync the best way to do this?
I'd like to modify this code to run multiple threads.
Why? The thread pool should only be used for CPU-bound work (and I/O completions, of course).
You can do concurrency just fine with async:
var tasks = ImageCollection.Select(image =>
{
    var img = Image.FromFile(image.getPath());
    return Upload(img);
});
await Task.WhenAll(tasks);
Note that I removed your Wait. You should avoid using Wait or Result with async tasks; use await instead. Yes, this will cause async to grow through your code, and you should use async "all the way".
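A sketch of what "async all the way" can look like at the call site (the wrapping method name here is hypothetical):
// The caller becomes async instead of calling upload.Wait():
private async Task UploadAllAsync() // hypothetical method name
{
    var tasks = ImageCollection.Select(image =>
    {
        var img = Image.FromFile(image.getPath());
        return Upload(img);
    });

    // All uploads run concurrently; await them instead of blocking.
    string[] responses = await Task.WhenAll(tasks);
}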
