How to correctly download files simultaneously? - C#

I am trying to download multiple files simultaneously, but the files are downloaded one by one, sequentially. First http://download.geofabrik.de/europe/cyprus-latest.osm.pbf is downloaded, then http://download.geofabrik.de/europe/finland-latest.osm.pbf starts downloading, then http://download.geofabrik.de/europe/great-britain-latest.osm.pbf, and so on.
I would like them to download simultaneously.
I have the following code, based on the code from this answer:
static void Main(string[] args)
{
    Task.Run(async () =>
    {
        await DownloadFiles();
    }).GetAwaiter().GetResult();
}
public static async Task DownloadFiles()
{
    IList<string> urls = new List<string>
    {
        @"http://download.geofabrik.de/europe/cyprus-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/finland-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/great-britain-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf"
    };
    foreach (var url in urls)
    {
        string fileName = url.Substring(url.LastIndexOf('/'));
        await DownloadFile(url, fileName);
    }
}
public static async Task DownloadFile(string url, string fileName)
{
    string address = @"D:\Downloads";
    using (var client = new WebClient())
    {
        await client.DownloadFileTaskAsync(url, $"{address}{fileName}");
    }
}
However, when I look at my file system, I see that the files are downloaded one by one, sequentially, not simultaneously.
I have also tried the following approach, but the downloads are still not simultaneous:
static void Main(string[] args)
{
    IList<string> urls = new List<string>
    {
        @"http://download.geofabrik.de/europe/cyprus-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/finland-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/great-britain-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf"
    };
    Parallel.ForEach(urls,
        new ParallelOptions { MaxDegreeOfParallelism = 10 },
        DownloadFile);
}
public static void DownloadFile(string url)
{
    string address = @"D:\Downloads";
    using (var sr = new StreamReader(WebRequest.Create(url)
        .GetResponse().GetResponseStream()))
    using (var sw = new StreamWriter(address + url.Substring(url.LastIndexOf('/'))))
    {
        sw.Write(sr.ReadToEnd());
    }
}
Could you tell me how to download the files simultaneously?
Any help would be greatly appreciated.

foreach (var url in urls)
{
    string fileName = url.Substring(url.LastIndexOf('/'));
    await DownloadFile(url, fileName); // this waits for each download to finish before moving on to the next one
}
Instead, you should create all the tasks first and then wait for all of them to complete.
public static Task DownloadFiles()
{
    IList<string> urls = new List<string>
    {
        @"http://download.geofabrik.de/europe/cyprus-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/finland-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/great-britain-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf"
    };
    // Start every download without awaiting it, then wait for all of them together
    var tasks = urls.Select(url =>
    {
        var fileName = url.Substring(url.LastIndexOf('/'));
        return DownloadFile(url, fileName);
    }).ToArray();
    return Task.WhenAll(tasks);
}
The rest of your code can remain the same.

Eldar's solution works with some minor edits. This is the full working DownloadFiles method that was edited:
public static async Task DownloadFiles()
{
    IList<string> urls = new List<string>
    {
        @"http://download.geofabrik.de/europe/cyprus-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/finland-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/great-britain-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf",
        @"http://download.geofabrik.de/europe/belgium-latest.osm.pbf"
    };
    var tasks = urls.Select(t =>
    {
        var fileName = t.Substring(t.LastIndexOf('/'));
        return DownloadFile(t, fileName);
    }).ToArray();
    await Task.WhenAll(tasks);
}

This will download them asynchronously, but one after the other:
await DownloadFile(url, fileName);
await DownloadFile(url2, fileName2);
This will do what you actually want to achieve:
var task1 = DownloadFile(url, fileName);
var task2 = DownloadFile(url2, fileName2);
await Task.WhenAll(task1, task2);
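To see the difference without touching the network, here is a minimal sketch in which a hypothetical FakeDownload method (a Task.Delay standing in for DownloadFile) is run both ways. It is written as a C# 9+ top-level program; the method name, the file names and the 200 ms delay are assumptions for illustration only.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for DownloadFile: pretends every download takes 200 ms.
Task FakeDownload(string fileName) => Task.Delay(200);

string[] names = { "a.pbf", "b.pbf", "c.pbf" };

// Sequential: each await completes before the next download is started.
var stopwatch = Stopwatch.StartNew();
foreach (var n in names)
{
    await FakeDownload(n);
}
long sequentialMs = stopwatch.ElapsedMilliseconds;

// Concurrent: start all downloads first, then await them together.
stopwatch.Restart();
Task[] running = names.Select(n => FakeDownload(n)).ToArray();
await Task.WhenAll(running);
long concurrentMs = stopwatch.ElapsedMilliseconds;

Console.WriteLine($"sequential: {sequentialMs} ms, concurrent: {concurrentMs} ms");
```

With three simulated downloads, the sequential loop should take roughly three times as long as the concurrent version, since the delays overlap only in the second case.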

Related

Is my approach correct for concurrent network requests?

I wrote a web crawler and I want to know if my approach is correct. The only issue I'm facing is that it stops after some hours of crawling. No exception, it just stops.
1 - the private members and the constructor:
private const int CONCURRENT_CONNECTIONS = 5;
private readonly HttpClient _client;
private readonly string[] _services = new string[2] {
    "https://example.com/items?id=ID_HERE",
    "https://another_example.com/items?id=ID_HERE"
};
private readonly List<SemaphoreSlim> _semaphores;
public Crawler() {
    ServicePointManager.DefaultConnectionLimit = CONCURRENT_CONNECTIONS;
    _client = new HttpClient();
    _semaphores = new List<SemaphoreSlim>();
    foreach (var _ in _services) {
        _semaphores.Add(new SemaphoreSlim(CONCURRENT_CONNECTIONS));
    }
}
I use a single HttpClient instance.
_services is just a string array that contains the URLs; they are not on the same domain.
I'm using semaphores (one per domain) since I read that it's not a good idea to use the network queue (I don't remember what it is called).
2 - The Run method, which is the one I will call to start crawling.
public async Task Run(List<int> ids) {
    const int BATCH_COUNT = 1000;
    var svcIndex = 0;
    var tasks = new List<Task<string>>(BATCH_COUNT);
    foreach (var itemId in ids) {
        tasks.Add(DownloadItem(svcIndex, _services[svcIndex].Replace("ID_HERE", $"{itemId}")));
        if (++svcIndex >= _services.Length) {
            svcIndex = 0;
        }
        if (tasks.Count >= BATCH_COUNT) {
            var results = await Task.WhenAll(tasks);
            await SaveDownloadedData(results);
            tasks.Clear();
        }
    }
    if (tasks.Count > 0) {
        var results = await Task.WhenAll(tasks);
        await SaveDownloadedData(results);
        tasks.Clear();
    }
}
DownloadItem is an async function that actually makes the GET request; note that I'm not awaiting it here.
If the number of tasks reaches BATCH_COUNT, I await all of them and save the results to a file.
3 - The DownloadItem function.
private async Task<string> DownloadItem(int serviceIndex, string link) {
    var needReleaseSemaphore = true;
    var result = string.Empty;
    try {
        await _semaphores[serviceIndex].WaitAsync();
        var r = await _client.GetStringAsync(link);
        _semaphores[serviceIndex].Release();
        needReleaseSemaphore = false;
        // DUE TO JSON SIZE, I NEED TO REMOVE A VALUE (IT'S USELESS FOR ME)
        var obj = JObject.Parse(r);
        if (obj.ContainsKey("blah")) {
            obj.Remove("blah");
        }
        result = obj.ToString(Formatting.None);
    } catch {
        result = string.Empty;
        // SINCE I GOT AN EXCEPTION, I WILL 'LOCK' THIS SERVICE FOR 1 MINUTE.
        // IF I RELEASED THIS SEMAPHORE, I WILL LOCK IT AGAIN FIRST.
        if (!needReleaseSemaphore) {
            await _semaphores[serviceIndex].WaitAsync();
            needReleaseSemaphore = true;
        }
        await Task.Delay(60_000);
    } finally {
        // RELEASE THE SEMAPHORE, IF NEEDED.
        if (needReleaseSemaphore) {
            _semaphores[serviceIndex].Release();
        }
    }
    return result;
}
4- The function that saves the result.
private async Task SaveDownloadedData(IEnumerable<string> myData) {
    // await using disposes the stream asynchronously at the end of the method
    await using var fs = new FileStream("./output.dat", FileMode.Append);
    foreach (var res in myData) {
        var blob = Encoding.UTF8.GetBytes(res);
        // Each record is a 4-byte length prefix followed by the UTF-8 bytes
        await fs.WriteAsync(BitConverter.GetBytes((uint)blob.Length));
        await fs.WriteAsync(blob);
    }
}
5- Finally, the Main function.
static async Task Main(string[] args) {
    var crawler = new Crawler();
    var items = LoadItemIds();
    await crawler.Run(items);
}
After all this, is my approach correct? I need to make millions of requests, and it will take weeks or months to gather all the data I need (due to the connection limit).
After 12 - 14 hours, it just stops and I need to manually restart the app (memory usage is ok, my VPS has 1 GB and it never used more than 60%).
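For reference, the length-prefixed records that SaveDownloadedData appends (a 4-byte length followed by the UTF-8 bytes) could be read back along these lines. This is only a sketch under assumptions: WriteRecords, ReadRecords and ReadFully are hypothetical helper names, a MemoryStream stands in for output.dat, and the reader is assumed to run on a machine with the same byte order as the writer. It is written as a C# 9+ top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Writes records in the same layout as SaveDownloadedData: 4-byte length, then UTF-8 bytes.
void WriteRecords(Stream output, IEnumerable<string> records)
{
    foreach (var record in records)
    {
        byte[] blob = Encoding.UTF8.GetBytes(record);
        output.Write(BitConverter.GetBytes((uint)blob.Length), 0, 4);
        output.Write(blob, 0, blob.Length);
    }
}

// Stream.Read may return fewer bytes than requested, so loop until the buffer is full or the stream ends.
int ReadFully(Stream input, byte[] buffer)
{
    int total = 0;
    while (total < buffer.Length)
    {
        int read = input.Read(buffer, total, buffer.Length - total);
        if (read == 0) break;
        total += read;
    }
    return total;
}

// Reads records back until no complete length header is left.
List<string> ReadRecords(Stream input)
{
    var items = new List<string>();
    var header = new byte[4];
    while (ReadFully(input, header) == 4)
    {
        var payload = new byte[BitConverter.ToUInt32(header, 0)];
        if (ReadFully(input, payload) != payload.Length)
            throw new EndOfStreamException("Truncated record.");
        items.Add(Encoding.UTF8.GetString(payload));
    }
    return items;
}

using var memory = new MemoryStream();
WriteRecords(memory, new[] { "{\"id\":1}", "", "{\"id\":2}" });
memory.Position = 0;
List<string> roundTripped = ReadRecords(memory);
Console.WriteLine($"read {roundTripped.Count} records");
```

Note that a failed download is stored as an empty string, so it shows up as a zero-length record when the file is read back.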

Duplicate dictionary key on task list

I'm trying to generate a zip file of pdfs asynchronously to speed things up as follows:
var files = new Dictionary<string, byte[]>();
var fileTasks = new List<Task<Library.Models.Helpers.File>>();
foreach (var i in groups)
{
    var task = Task.Run(async () =>
    {
        var fileName = $"{i.Key.Title.Replace('/', '-')} - Records.pdf";
        ViewBag.GroupName = i.Key.Title;
        var html = await this.RenderViewAsync("~/Views/Report/_UserRecordsReport.cshtml", i.ToList(), true);
        return await _fileUtilityService.HtmlToPDF2(html, null, fileName);
    });
    fileTasks.Add(task);
}
var completedTaskFiles = await Task.WhenAll(fileTasks);
foreach (var item in completedTaskFiles)
{
    files.Add($"{item.FileName}", item.FileResult);
}
return _fileUtilityService.GenerateZIP(files);
I'm generating all my HTML-to-PDF file tasks and waiting for them to complete, then trying to synchronously loop through the completed tasks and add them to my dictionary for zipping, but I keep getting the following error:
An item with the same key has already been added
There is no duplicate key in the list of items being added.
EDIT - so the current idea is that because it's a scoped service, that's why I'm running into thread issues (I've attached the file utility service for information):
public class FileUtilityService : IFileUtilityService
{
    private readonly IHttpClientFactory _clientFactory;
    public FileUtilityService(IHttpClientFactory clientFactory)
    {
        _clientFactory = clientFactory;
    }
    public async Task<byte[]> HtmlToPDF(string html = null, string url = null)
    {
        try
        {
            byte[] res = null;
            if (html is null && url != null)
            {
                var client = _clientFactory.CreateClient();
                var requestResp = await client.GetAsync(url);
                using var sr = new StreamReader(await requestResp.Content.ReadAsStreamAsync());
                html = HttpUtility.HtmlDecode(await sr.ReadToEndAsync());
            }
            using (var ms = new MemoryStream())
            {
                HtmlConverter.ConvertToPdf(html, ms);
                res = ms.ToArray();
            }
            return res;
        }
        catch (Exception ex)
        {
            throw ex;
        }
    }
    public async Task<Library.Models.Helpers.File> HtmlToPDF(string html = null, string url = null, string fileName = "")
    {
        return new Library.Models.Helpers.File() { FileName = fileName, FileResult = await HtmlToPDF(html, url) };
    }
}

Function returns before await async finishes

I'm trying to send 2 emails through the SendGrid API. Sometimes 0 send, sometimes 1 sends, sometimes both send. It seems that the function does not await the promise. How can I fix it so it always sends both emails?
My function looks like this:
private async Task<bool> SendMails(string email, string name, string pdfPath, string imgPath)
{
    var client = new SendGridClient(_config["SendGrid:Key"]);
    bool messagesSent = false;
    var messageClient = new SendGridMessage
    {
        From = new EmailAddress(_config["SendGrid:Recipient"]),
        Subject = "Testmail",
        HtmlContent = _textManager.Get("getMailHtml")
    };
    var messageSecondClient = new SendGridMessage
    {
        From = new EmailAddress(_config["SendGrid:Recipient"]),
        Subject = "Second Testmail",
        HtmlContent = _textManager.Get("getSecondMailHtml")
    };
    messageClient.AddTo(email, name);
    messageSecondClient.AddTo(email, name);
    string[] fileListClient = new string[] { pdfPath };
    string[] fileListSecond = new string[] { pdfPath, imgPath };
    foreach (var file in fileListClient)
    {
        var fileInfo = new FileInfo(file);
        if (fileInfo.Exists)
            await messageClient.AddAttachmentAsync(fileInfo.Name, fileInfo.OpenRead());
    }
    foreach (var file in fileListSecond)
    {
        var fileInfo = new FileInfo(file);
        if (fileInfo.Exists)
            await messageSecondClient.AddAttachmentAsync(fileInfo.Name, fileInfo.OpenRead());
    }
    var responseClient = await client.SendEmailAsync(messageClient);
    var responseSecond = await client.SendEmailAsync(messageSecondClient);
    if (responseClient.StatusCode.ToString() == "202" && responseSecond.StatusCode.ToString() == "202")
    {
        messagesSent = true;
    }
    return messagesSent;
}
And this is how I'm calling it:
Task<bool> sendMails = await Task.FromResult(SendMails(formCollection["email"], formCollection["name"], pdfPath, imgPath));
if (!sendMails.Result)
{
    errorMessage = "Error sending mails.";
}
You're blocking on the async task:
if (!sendMails.Result)
and this can cause a deadlock. Instead of blocking, use await.
And you can also get rid of the await Task.FromResult, which isn't doing anything at all:
bool sentMails = await SendMails(formCollection["email"], formCollection["name"], pdfPath, imgPath);
if (!sentMails)
{
    errorMessage = "Error sending mails.";
}
Task.FromResult returns a new Task that is already completed, not the Task returned from SendMails.
Nothing is awaiting the completion of SendMails.
Just await the Task returned from the method:
bool result = await SendMails(formCollection["email"], formCollection["name"], pdfPath, imgPath);
The await keyword unwraps the Task.Result for you.
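A small sketch of why the Task.FromResult wrapper does not wait for the work: SendMailsStub below is a hypothetical stand-in for SendMails that simply completes after 200 ms, and the program is written with C# 9+ top-level statements.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for SendMails: finishes after a short delay.
async Task<bool> SendMailsStub()
{
    await Task.Delay(200);
    return true;
}

// Task.FromResult wraps the still-running task in an already-completed outer task,
// so this await returns immediately and only hands back the inner Task<bool>.
Task<bool> inner = await Task.FromResult(SendMailsStub());
bool innerDoneEarly = inner.IsCompleted;
Console.WriteLine($"inner task finished after the first await: {innerDoneEarly}");

// Awaiting the task itself waits for the work and unwraps the bool result.
bool sent = await inner;
Console.WriteLine($"sent: {sent}");
```

The first await only removes the outer wrapper; the value is available only after the inner task has been awaited as well, which is what a plain `await SendMails(...)` does in one step.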

Processing large number of tasks concurrently and asynchronously

I would like to process a list of 50,000 URLs through a web service. The provider of this service allows 5 connections per second.
I need to process these urls in parallel with adherence to provider's rules.
This is my current code:
static void Main(string[] args)
{
    process_urls().GetAwaiter().GetResult();
}
public static async Task process_urls()
{
    // let's say there is a list of 50,000+ URLs
    var urls = System.IO.File.ReadAllLines("urls.txt");
    var allTasks = new List<Task>();
    var throttler = new SemaphoreSlim(initialCount: 5);
    foreach (var url in urls)
    {
        await throttler.WaitAsync();
        allTasks.Add(
            Task.Run(async () =>
            {
                try
                {
                    Console.WriteLine(String.Format("Starting {0}", url));
                    var client = new HttpClient();
                    var xml = await client.GetStringAsync(url);
                    //do some processing on xml output
                    client.Dispose();
                }
                finally
                {
                    throttler.Release();
                }
            }));
    }
    await Task.WhenAll(allTasks);
}
Instead of var client = new HttpClient(); I will create a new object of the target web service but this is just to make the code generic.
Is this the correct approach to handle and process a huge list of connections? And is there any way I can limit the number of established connections per second to 5, as the current implementation does not consider any timeframe?
Thanks
Reading values from a web service is an IO operation, which can be done asynchronously without multithreading.
In this case the threads do nothing except wait for the response, so using Task.Run here is just a waste of resources.
public static async Task process_urls()
{
    var urls = System.IO.File.ReadAllLines("urls.txt");
    foreach (var urlGroup in SplitToGroupsOfFive(urls))
    {
        var tasks = new List<Task>();
        foreach (var url in urlGroup)
        {
            var task = ProcessUrl(url);
            tasks.Add(task);
        }
        // This delay ensures that the next 5 URLs are not started until at least 1 second has passed
        tasks.Add(Task.Delay(1000));
        await Task.WhenAll(tasks.ToArray());
    }
}
private static async Task ProcessUrl(string url)
{
    using (var client = new HttpClient())
    {
        var xml = await client.GetStringAsync(url);
        //do some processing on xml output
    }
}
private static IEnumerable<IEnumerable<string>> SplitToGroupsOfFive(IEnumerable<string> urls)
{
    const int GROUP_SIZE = 5;
    string[] group = null;
    int count = 0;
    foreach (var url in urls)
    {
        if (group == null)
            group = new string[GROUP_SIZE];
        group[count] = url;
        count++;
        if (count < GROUP_SIZE)
            continue;
        yield return group;
        group = null;
        count = 0;
    }
    // Return the last, partially filled group without its unused (null) slots
    if (group != null)
    {
        yield return group.Take(count);
    }
}
Because you mention that the "processing" of the response is also an IO operation, the async/await approach is the most efficient: it does not tie up a thread per request, and other tasks are processed while earlier ones are waiting for a response from the web service or for file-write IO to complete.
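On .NET 6 or later, the hand-written grouping helper could probably be replaced by Enumerable.Chunk. The following is a sketch of the same batching idea under assumptions: FakeProcessUrl is a hypothetical stand-in for the real request, ProcessInBatches is an illustrative name, the example.com URLs are placeholders, and the short intervals are chosen only so that the sample finishes quickly. It is written as a top-level program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical stand-in for ProcessUrl: pretends the request takes 50 ms.
async Task<string> FakeProcessUrl(string url)
{
    await Task.Delay(50);
    return url.ToUpperInvariant();
}

// Starts at most batchSize requests per interval and collects every result in order.
async Task<List<string>> ProcessInBatches(IEnumerable<string> urls, int batchSize, TimeSpan interval)
{
    var results = new List<string>();
    foreach (string[] batch in urls.Chunk(batchSize))
    {
        Task<string[]> work = Task.WhenAll(batch.Select(u => FakeProcessUrl(u)));
        // Wait for both the batch and the interval before starting the next batch.
        await Task.WhenAll(work, Task.Delay(interval));
        results.AddRange(work.Result); // already completed, so Result does not block
    }
    return results;
}

var sampleUrls = Enumerable.Range(1, 12).Select(i => $"http://example.com/{i}");
List<string> processed = await ProcessInBatches(sampleUrls, 5, TimeSpan.FromMilliseconds(100));
Console.WriteLine($"processed {processed.Count} urls");
```

As in the answer's code, a batch that takes longer than the interval delays the next batch, so this limits how many requests are started per interval rather than guaranteeing a fixed rate.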

Combining tasks results in a dictionary

I'm trying to parallelize work that relies on external resources, and combine it into a single resulting dictionary.
To illustrate my need, imagine I want to download a set of files and put each result in a dictionary, where the key is the URL:
string[] urls = { "http://msdn.microsoft.com", "http://www.stackoverflow.com", "http://www.google.com" };
var fileContentTask = GetUrls(urls);
fileContentTask.Wait();
Dictionary<string, string> result = fileContentTask.Result;
// Do something
However, I was not able to finish the GetUrls method. I can generate all the tasks, but I haven't found how to consolidate the results into the dictionary:
static Task<Dictionary<string,string>> GetUrls(string[] urls)
{
    var subTasks = from url in urls
                   let wc = new WebClient()
                   select wc.DownloadStringTaskAsync(url);
    return Task.WhenAll(subTasks); // Does not compile
}
How can I merge the resulting tasks into a dictionary?
You need to perform the mapping yourself. For example, you could use:
static async Task<Dictionary<string,string>> GetUrls(string[] urls)
{
    var tasks = urls.Select(async url =>
    {
        using (var client = new WebClient())
        {
            return new { url, content = await client.DownloadStringTaskAsync(url) };
        }
    }).ToList();
    var results = await Task.WhenAll(tasks);
    return results.ToDictionary(pair => pair.url, pair => pair.content);
}
Note how the method has to be async so that you can use await within it.
As an alternative to @Jon's answer, here is another approach (see the comments for why it is not working):
private static Task<Dictionary<string, string>> GetUrls(string[] urls)
{
    var tsc = new TaskCompletionSource<Dictionary<string, string>>();
    var subTasks = urls.ToDictionary(
        url => url,
        url =>
        {
            using (var wc = new WebClient())
            {
                return wc.DownloadStringTaskAsync(url);
            }
        }
    );
    Task.WhenAll(subTasks.Values).ContinueWith(allTasks =>
    {
        var actualResult = subTasks.ToDictionary(
            task => task.Key,
            task => task.Value.Result
        );
        tsc.SetResult(actualResult);
    });
    return tsc.Task;
}
Something that makes use of your existing LINQ:
static async Task<Dictionary<string, string>> GetUrls(string[] urls)
{
    IEnumerable<Task<string>> subTasks = from url in urls
                                         let wc = new WebClient()
                                         select wc.DownloadStringTaskAsync(url);
    var urlsAndData = subTasks.Zip(urls, async (data, url) => new { url, data = await data });
    return (await Task.WhenAll(urlsAndData)).ToDictionary(a => a.url, a => a.data);
}
But as that does not dispose the WebClient, I would refactor out a method, as shown below. I've also added a Distinct call, as there's no point downloading two URLs that are the same only to fall over when making the dictionary.
static async Task<Dictionary<string, string>> GetUrls(string[] urls)
{
    var distinctUrls = urls
        .Distinct().ToList();
    var urlsAndData =
        distinctUrls
            .Select(DownloadStringAsync)
            .Zip(distinctUrls, async (data, url) => new { url, data = await data });
    return (await Task.WhenAll(urlsAndData)).ToDictionary(a => a.url, a => a.data);
}
private static async Task<string> DownloadStringAsync(string url)
{
    using (var client = new WebClient())
    {
        return await client.DownloadStringTaskAsync(url);
    }
}
