Task Parallel Library WaitAny design - C#

I've just begun to explore the TPL and have a design question.
My Scenario:
I have a list of URLs that each refer to an image. I want each image to be downloaded in parallel. As soon as at least one image is downloaded, I want to execute a method that does something with the downloaded image. That method should NOT be parallelized -- it should be serial.
I think the following will work, but I'm not sure if it's the right way to do it. Because I have separate classes for collecting the images and for doing "something" with the collected images, I end up passing around an array of Tasks, which seems wrong since it exposes the inner workings of how images are retrieved. But I don't know a way around it. In reality there is more to both of these methods, but that's not important here. Just know that they really shouldn't be lumped into one large method that both retrieves and does something with the image.
//From the Director class
Task<Image>[] downloadTasks = collector.RetrieveImages(listOfURLs);

for (int i = 0; i < listOfURLs.Count; i++)
{
    //Wait for any of the remaining downloads to complete
    int completedIndex = Task.WaitAny(downloadTasks);
    Image completedImage = downloadTasks[completedIndex].Result;

    //Now do something with the image (this "something" must happen serially)
    //Uses the "Formatter" class to accomplish this, let's say
}
///////////////////////////////////////////////////
//From the Collector class
public Task<Image>[] RetrieveImages(List<string> urls)
{
    Task<Image>[] tasks = new Task<Image>[urls.Count];
    int index = 0;
    foreach (string url in urls)
    {
        string lambdaVar = url; //Required... Bleh
        int lambdaIndex = index; //Same capture issue applies to the index
        tasks[index] = Task<Image>.Factory.StartNew(() =>
        {
            //TODO: Replace with live image locations
            string fileName = Path.Combine(Application.StartupPath,
                String.Format("{0}.png", lambdaIndex));
            using (WebClient client = new WebClient())
            {
                client.DownloadFile(lambdaVar, fileName);
            }
            return Image.FromFile(fileName);
        },
        TaskCreationOptions.LongRunning | TaskCreationOptions.AttachedToParent);
        index++;
    }
    return tasks;
}

Typically you use WaitAny to wait for one task when you don't care about the results of any of the others. For example, if you only cared about the first image that happened to be returned.
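For instance, a minimal sketch of that use case (illustrative only, not from the question):

Task<Image>[] downloadTasks = collector.RetrieveImages(listOfURLs);
int first = Task.WaitAny(downloadTasks);        // index of whichever download finished first
Image firstImage = downloadTasks[first].Result; // only the winner is used; the rest are ignored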
How about this instead.
This creates two tasks, one which loads images and adds them to a blocking collection. The second task waits on the collection and processes any images added to the queue. When all the images are loaded the first task closes the queue down so the second task can shut down.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Net;
using System.Threading.Tasks;

namespace ClassLibrary1
{
    public class Class1
    {
        readonly string _path = Directory.GetCurrentDirectory();

        public void Demo()
        {
            IList<string> listOfUrls = new List<string>();
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/editicon.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/favorite-star-on.gif");
            listOfUrls.Add("http://i3.codeplex.com/Images/v16821/arrow_dsc_green.gif");

            BlockingCollection<Image> images = new BlockingCollection<Image>();

            Parallel.Invoke(
                () => // Task 1: load the images
                {
                    Parallel.For(0, listOfUrls.Count, (i) =>
                    {
                        Image img = RetrieveImage(listOfUrls[i], i);
                        img.Tag = i;
                        images.Add(img); // Add each image to the queue
                    });
                    images.CompleteAdding(); // Done with images.
                },
                () => // Task 2: Process images serially
                {
                    foreach (var img in images.GetConsumingEnumerable())
                    {
                        string newPath = Path.Combine(_path, String.Format("{0}_rot.png", img.Tag));
                        Console.WriteLine("Rotating image {0}", img.Tag);
                        img.RotateFlip(RotateFlipType.RotateNoneFlipXY);
                        img.Save(newPath);
                    }
                });
        }

        public Image RetrieveImage(string url, int i)
        {
            using (WebClient client = new WebClient())
            {
                string fileName = Path.Combine(_path, String.Format("{0}.png", i));
                Console.WriteLine("Downloading {0}...", url);
                client.DownloadFile(url, fileName);
                Console.WriteLine("Saving {0} as {1}.", url, fileName);
                return Image.FromFile(fileName);
            }
        }
    }
}
WARNING: The code doesn't have any error checking or cancellation. It's late and you need something to do, right? :)
This is an example of the pipeline pattern. It assumes that getting an image is pretty slow and that the cost of locking inside the blocking collection isn't going to cause a problem because it happens relatively infrequently compared to the time spent downloading images.
Our book... You can read more about this and other patterns for parallel programming at http://parallelpatterns.codeplex.com/
Chapter 7 covers pipelines and the accompanying examples show pipelines with error handling and cancellation.

TPL already provides the ContinueWith function to execute one task when another finishes. Task chaining is one of the main patterns used in TPL for asynchronous operations.
The following method downloads a set of images and continues by renaming each of the files:
static void DownloadInParallel(string[] urls)
{
    var tempFolder = Path.GetTempPath();
    var downloads = from url in urls
                    select Task.Factory.StartNew<string>(() =>
                    {
                        using (var client = new WebClient())
                        {
                            var uri = new Uri(url);
                            string file = Path.Combine(tempFolder, uri.Segments.Last());
                            client.DownloadFile(uri, file);
                            return file;
                        }
                    }, TaskCreationOptions.LongRunning | TaskCreationOptions.AttachedToParent)
                    .ContinueWith(t =>
                    {
                        var filePath = t.Result;
                        File.Move(filePath, filePath + ".test");
                    }, TaskContinuationOptions.ExecuteSynchronously);
    var results = downloads.ToArray();
    Task.WaitAll(results);
}
You should also check the WebClient Async Tasks from the ParallelExtensionsExtras samples. The DownloadXXXTask extension methods handle both the creation of tasks and the asynchronous downloading of files.
The following method uses the DownloadDataTask extension to get the image's data and rotate it before saving it to disk:
static void DownloadInParallel2(string[] urls)
{
    var tempFolder = Path.GetTempPath();
    var downloads = from url in urls
                    let uri = new Uri(url)
                    let filePath = Path.Combine(tempFolder, uri.Segments.Last())
                    select new WebClient().DownloadDataTask(uri)
                        .ContinueWith(t =>
                        {
                            var img = Image.FromStream(new MemoryStream(t.Result));
                            img.RotateFlip(RotateFlipType.RotateNoneFlipY);
                            img.Save(filePath);
                        }, TaskContinuationOptions.ExecuteSynchronously);
    var results = downloads.ToArray();
    Task.WaitAll(results);
}

The best way to do this would probably be by implementing the Observer pattern: have your RetrieveImages method return an IObservable<Image>, put your "completed image action" into an IObserver<Image>'s OnNext method, and subscribe it to the observable.
I haven't tried this myself yet (I still have to play more with the task library) but I think this is the "right" way to do it.
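For what it's worth, here is a rough sketch of what that could look like with a hand-rolled IObservable<Image> (purely illustrative; the type name ImageSource is made up, and a real implementation would use Rx and support unsubscription and error reporting). The lock serializes OnNext, so the downloads run in parallel while the "do something" step stays serial:

class ImageSource : IObservable<Image>
{
    readonly IReadOnlyList<string> urls;
    public ImageSource(IReadOnlyList<string> urls) { this.urls = urls; }

    public IDisposable Subscribe(IObserver<Image> observer)
    {
        var gate = new object(); // serializes OnNext calls
        var downloads = new List<Task>();
        foreach (string url in urls)
        {
            downloads.Add(Task.Run(() =>
            {
                using (var client = new WebClient())
                {
                    var img = Image.FromStream(new MemoryStream(client.DownloadData(url)));
                    lock (gate) observer.OnNext(img);
                }
            }));
        }
        Task.WhenAll(downloads).ContinueWith(_ => observer.OnCompleted());
        return new Unsubscriber(); // cancellation omitted for brevity
    }

    sealed class Unsubscriber : IDisposable { public void Dispose() { } }
}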

//download all images
private async void GetAllImages()
{
    var downloadTasks = listOfURLs
        .Where(url => !string.IsNullOrEmpty(url))
        .Select(url => RetrieveImage(url))
        .ToArray();
    var images = await Task.WhenAll(downloadTasks);
}
//From the Collector class
public async Task<Image> RetrieveImage(string url)
{
    //TODO: Replace with live image locations
    //Derive the file name from the url (the original used an undefined loop index here)
    var fileName = String.Format("{0}.png", Path.GetFileNameWithoutExtension(new Uri(url).LocalPath));
    var filePath = Path.Combine(Application.StartupPath, fileName);
    using (WebClient client = new WebClient())
    {
        await client.DownloadFileTaskAsync(url, filePath);
    }
    return Image.FromFile(filePath);
}


How to make sure that the data of multiple Async downloads are saved in the order they were started?

I'm writing a basic Http Live Stream (HLS) downloader, where I'm re-downloading an m3u8 media playlist at an interval specified by "#EXT-X-TARGETDURATION" and then downloading the *.ts segments as they become available.
This is what the m3u8 media playlist might look like when first downloaded.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:12
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:7.975,
http://website.com/segment_1.ts
#EXTINF:7.941,
http://website.com/segment_2.ts
#EXTINF:7.975,
http://website.com/segment_3.ts
I'd like to download these *.ts segments all at the same time with HttpClient async/await. The segments do not have the same size, so even though the download of "segment_1.ts" is started first, it might finish after the other two segments.
These segments are all part of one large video, so it's important that the data of the downloaded segments are written in the order they were started, NOT in the order they finished.
My code below works perfectly fine if the segments are downloaded one after another, but not when multiple segments are downloaded at the same time, because sometimes they don't finish in the order they were started.
I thought about using Task.WhenAll, which guarantees a correct order, but I don't want to keep the downloaded segments in memory unnecessarily, because they can be a few megabytes in size. If the download of "segment_1.ts" does finish first, then it should be written to disk right away, without having to wait for the other segments to finish. Writing all the *.ts segments to separate files and joining them in the end is not an option either, because it would require double the disk space, and the total video can be a few gigabytes in size.
I have no idea how to do this and I'm wondering if somebody can help me with that. I'm looking for a way that doesn't require me to create threads manually or block a ThreadPool thread for a long period of time.
Some of the code and exception handling have been removed to make it easier to see what is going on.
// Async BlockingCollection from the AsyncEx library
private AsyncCollection<byte[]> segmentDataQueue = new AsyncCollection<byte[]>();

public void Start()
{
    RunConsumer();
    RunProducer();
}

private async void RunProducer()
{
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (!lines.Any() || lines[0] != "#EXTM3U")
            throw new Exception("Invalid m3u8 media playlist.");

        for (var i = 1; i < lines.Length; i++)
        {
            var line = lines[i];
            if (line.StartsWith("#EXT-X-TARGETDURATION"))
            {
                ParseTargetDuration(line);
            }
            else if (line.StartsWith("#EXT-X-MEDIA-SEQUENCE"))
            {
                ParseMediaSequence(line);
            }
            else if (!line.StartsWith("#"))
            {
                if (_isNewSegment)
                {
                    // Fire and forget
                    DownloadTsSegment(line);
                }
            }
        }

        // Wait until it's time to reload the m3u8 media playlist again
        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}

// async void. We never await this method, so we can download multiple segments at once
private async void DownloadTsSegment(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    var data = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);

    // Add the downloaded segment data to the AsyncCollection
    await segmentDataQueue.AddAsync(data, _cts.Token).ConfigureAwait(false);
}

private async void RunConsumer()
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        while (!_isCancelled)
        {
            // Wait until new segment data is added to the AsyncCollection and write it to disk
            var data = await segmentDataQueue.TakeAsync(_cts.Token).ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
I don't think you need a producer/consumer queue at all here. However, I do think you should avoid "fire and forget".
You can start them all at the same time, and just process them as they complete.
First, define how to download a single segment:
private async Task<byte[]> DownloadTsSegmentAsync(string tsUrl)
{
    var response = await _client.GetAsync(tsUrl, _cts.Token).ConfigureAwait(false);
    return await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
}
Then add the parsing of the playlist which results in a list of segment downloads (which are all in progress already):
private List<Task<byte[]>> DownloadTasks(string data)
{
    var result = new List<Task<byte[]>>();

    string[] lines = data.Split(new string[] { "\n" }, StringSplitOptions.RemoveEmptyEntries);
    if (!lines.Any() || lines[0] != "#EXTM3U")
        throw new Exception("Invalid m3u8 media playlist.");

    ...

    if (_isNewSegment)
    {
        result.Add(DownloadTsSegmentAsync(line));
    }

    ...

    return result;
}
Consume this list one at a time (in order) by writing to a file:
private async Task RunConsumerAsync(List<Task<byte[]>> downloads)
{
    using (FileStream fs = new FileStream(_filePath, FileMode.Create, FileAccess.Write, FileShare.Read))
    {
        foreach (var task in downloads)
        {
            var data = await task.ConfigureAwait(false);
            await fs.WriteAsync(data, 0, data.Length).ConfigureAwait(false);
        }
    }
}
And kick it all off with a producer:
public async Task RunAsync()
{
    // TODO: consider CancellationToken instead of a boolean.
    while (!_isCancelled)
    {
        var response = await _client.GetAsync(_playlistBaseUri + _playlistFilename, _cts.Token).ConfigureAwait(false);
        var data = await response.Content.ReadAsStringAsync().ConfigureAwait(false);

        var tasks = DownloadTasks(data);
        await RunConsumerAsync(tasks);

        await Task.Delay(_targetDuration * 1000, _cts.Token).ConfigureAwait(false);
    }
}
Note that this solution does run all downloads concurrently, and this can cause memory pressure. If this is a problem, I recommend you restructure to use TPL Dataflow, which has built-in support for throttling.
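For illustration, a minimal sketch of that kind of throttling with TPL Dataflow (the block and the limits are illustrative, not from the question; _client is the question's HttpClient):

var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4, // at most 4 downloads in flight at once
    BoundedCapacity = 8         // back-pressure: SendAsync waits while the block is full
};
var downloadBlock = new TransformBlock<string, byte[]>(
    url => _client.GetByteArrayAsync(url), options);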
Assign each download a sequence number. Put the results into a Dictionary<int, byte[]>. Each time a download completes it adds its own result.
It then checks if there are segments to write to disk:
while (dict.ContainsKey(lowestWrittenSegmentNumber + 1))
{
    WriteSegment(dict[lowestWrittenSegmentNumber + 1]);
    lowestWrittenSegmentNumber++;
}
That way all segments end up on disk, in order and with buffering.
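A self-contained sketch of that reordering buffer (type and member names are illustrative, not from the answer; usings for System.Collections.Generic and System.IO omitted):

class SegmentWriter
{
    readonly Dictionary<int, byte[]> buffer = new Dictionary<int, byte[]>();
    readonly object gate = new object();
    readonly Stream output;
    int nextToWrite; // sequence number of the next segment to go to disk

    public SegmentWriter(Stream output) { this.output = output; }

    // Called by each download as it completes, in any order.
    public void OnSegmentDownloaded(int sequenceNumber, byte[] data)
    {
        lock (gate)
        {
            buffer.Add(sequenceNumber, data);
            // Flush every segment that is now contiguous with what is already on disk.
            while (buffer.TryGetValue(nextToWrite, out byte[] ready))
            {
                output.Write(ready, 0, ready.Length);
                buffer.Remove(nextToWrite);
                nextToWrite++;
            }
        }
    }
}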
RunConsumer();
RunProducer();
Make sure to use async Task so that you can wait for completion with await Task.WhenAll(RunConsumer(), RunProducer()). But with the approach above you should not need RunConsumer any longer.
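For example (a sketch; the method names come from the question, assuming both are changed to return Task instead of void):

public async Task StartAsync()
{
    await Task.WhenAll(RunConsumer(), RunProducer());
}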

Reading/Writing Async Files for Universal App

I'm trying to read/write files asynchronously for a Universal App in C#.
When I write and read a file for the first time, it works... But when I retry it immediately, there are two errors: 1. UnauthorizedAccess 2. The handle with the OPLOCK has been closed.
It seems that the methods aren't finished yet, and so the file is not free.
(My frame contains a button which adds a new member to a List; the list is then serialized to an XML file. When I re-navigate to that page, the XML file is deserialized back into the List, because its content has to be displayed.)
List<Immobilie> immoListe = new List<Immobilie>();
private const string FileName_ImmoObjects = "ImmoObjects.xml";
StorageFolder sFolder = Windows.Storage.ApplicationData.Current.LocalFolder;
IStorageFile latestImmoListFile;

public Startmenue()
{
    this.InitializeComponent();
    immoListe.Add(new Immobilie()); // for testing, create an XML file first
    immoListe[0].adresse = "Foo1";
    immoListe.Add(new Immobilie());
    immoListe[1].adresse = "Foo2";

    WriteImmoListAsync();
    ReadImmoListAsync();  // These two steps work
    WriteImmoListAsync(); // everything beyond this causes errors
    ReadImmoListAsync();
}
public async void WriteImmoListAsync()
{
    try
    {
        IStorageFolder folder = await sFolder.CreateFolderAsync("Saves", CreationCollisionOption.OpenIfExists);
        latestImmoListFile = await folder.CreateFileAsync(FileName_ImmoObjects, CreationCollisionOption.ReplaceExisting);
        using (IRandomAccessStream stream = await latestImmoListFile.OpenAsync(FileAccessMode.ReadWrite))
        using (Stream outputStream = stream.AsStreamForWrite())
        {
            DataContractSerializer serializer = new DataContractSerializer(typeof(List<Immobilie>));
            serializer.WriteObject(outputStream, immoListe);
        }
    }
    catch (Exception e)
    {
        var d = new MessageDialog(e.ToString());
        await d.ShowAsync();
    }
}
public async void ReadImmoListAsync()
{
    int i = 0;
    try
    {
        IStorageFolder folder = await sFolder.GetFolderAsync("Saves");
        i = 1;
        latestImmoListFile = await folder.GetFileAsync(FileName_ImmoObjects);
        i = 2;
        using (IRandomAccessStream stream = await latestImmoListFile.OpenAsync(FileAccessMode.Read))
        {
            i = 3;
            using (Stream inputStream = stream.AsStreamForRead())
            {
                i = 4;
                DataContractSerializer deserializer = new DataContractSerializer(typeof(List<Immobilie>));
                i = 5;
                immoListe = (List<Immobilie>)deserializer.ReadObject(inputStream);
            }
        }
    }
    catch (Exception e)
    {
        var d = new MessageDialog("Fehler I = " + i + "\n" + e.ToString());
        await d.ShowAsync();
    }
}
So what can I do, and why is it so difficult? (Normal synchronous I/O is easy-peasy.)
As I describe in my MSDN article on async best practices, you should avoid async void:
public async Task WriteImmoListAsync();
public async Task ReadImmoListAsync();
Once your methods are properly async Task, then you can await them:
await WriteImmoListAsync();
await ReadImmoListAsync();
await WriteImmoListAsync();
await ReadImmoListAsync();
You can't start the methods again until you wait for them to complete. The code above starts writing to a file and then, while that write is still in progress, tries to open the same file and write to it again before the first method call has completed. You need to wait for those method calls to finish before running them again; the await keyword is what helps here.
It might be that the operations writing/reading the file are still attached to the file. You might want to take a look at this pattern for async file read/write from Microsoft:
https://msdn.microsoft.com/en-ca/library/mt674879.aspx
Also, note that if the read and write are done from different processes, you're going to have to use a mutex. Here's a great explanation of how it works:
What is a good pattern for using a Global Mutex in C#?

Use DownloadFileTaskAsync to download all files at once

Given an input text file containing the URLs, I would like to download the corresponding files all at once. I use the answer to this question, UserState using WebClient and TaskAsync download from Async CTP, as a reference.
public void Run()
{
    List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
    int index = 0;
    Task[] tasks = new Task[urls.Count()];
    foreach (string url in urls)
    {
        WebClient wc = new WebClient();
        string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);
        Task downloadTask = wc.DownloadFileTaskAsync(new Uri(url), path);
        Task outputTask = downloadTask.ContinueWith(t => Output(path));
        tasks[index] = outputTask;
        index++;
    }
    Console.WriteLine("Start now");
    Task.WhenAll(tasks);
    Console.WriteLine("Done");
}

public void Output(string path)
{
    Console.WriteLine(path);
}
I expected that the downloading of the files would begin at the point of "Task.WhenAll(tasks)". But it turns out that the output looks like
c:/temp/Output/image-2.jpg
c:/temp/Output/image-1.jpg
c:/temp/Output/image-4.jpg
c:/temp/Output/image-6.jpg
c:/temp/Output/image-3.jpg
[many lines deleted]
Start now
c:/temp/Output/image-18.jpg
c:/temp/Output/image-19.jpg
c:/temp/Output/image-20.jpg
c:/temp/Output/image-21.jpg
c:/temp/Output/image-23.jpg
[many lines deleted]
Done
Why does the downloading begin before WaitAll is called? What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
Thanks
Why does the downloading begin before WaitAll is called?
First of all, you're not calling Task.WaitAll, which synchronously blocks; you're calling Task.WhenAll, which returns an awaitable that should be awaited.
Now, as others said, when you call an async method, even without using await on it, it fires the asynchronous operation, because any method conforming to the TAP will return a "hot task".
What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
Now, if you want to defer execution until Task.WhenAll, you can use Enumerable.Select to project each element to a Task, and materialize it when you pass it to Task.WhenAll:
public async Task RunAsync()
{
    IEnumerable<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt");
    var urlTasks = urls.Select((url, index) =>
    {
        WebClient wc = new WebClient();
        string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index);
        return wc.DownloadFileTaskAsync(new Uri(url), path)
                 .ContinueWith(t => Output(path));
    });

    Console.WriteLine("Start now");
    await Task.WhenAll(urlTasks);
    Console.WriteLine("Done");
}
Why does the downloading begin before WaitAll is called?
Because:
Tasks created by its public constructors are referred to as "cold" tasks, in that they begin their life cycle in the non-scheduled TaskStatus.Created state, and it's not until Start is called on these instances that they progress to being scheduled. All other tasks begin their life cycle in a "hot" state, meaning that the asynchronous operations they represent have already been initiated and their TaskStatus is an enumeration value other than Created. All tasks returned from TAP methods must be "hot."
Since DownloadFileTaskAsync is a TAP method, it returns "hot" (that is, already started) task.
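To illustrate the difference (a sketch, not from the quoted document):

var cold = new Task<int>(() => 42); // "cold": stays in TaskStatus.Created, nothing runs yet
cold.Start();                       // only now is it scheduled
var hot = Task.Run(() => 42);       // "hot": already started, like the task every TAP method returns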
What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
I'd look at TPL Dataflow. Something like this (I've used HttpClient instead of WebClient, but, actually, it doesn't matter):
static async Task DownloadData(IEnumerable<string> urls)
{
    // we want to execute this in parallel
    var executionOptions = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };

    // this block will receive a URL and download the content it points to
    var downloadBlock = new TransformBlock<string, Tuple<string, string>>(async url =>
    {
        using (var client = new HttpClient())
        {
            var content = await client.GetStringAsync(url);
            return Tuple.Create(url, content);
        }
    }, executionOptions);

    // this block will print the number of bytes downloaded
    var outputBlock = new ActionBlock<Tuple<string, string>>(tuple =>
    {
        Console.WriteLine($"Downloaded {(string.IsNullOrEmpty(tuple.Item2) ? 0 : tuple.Item2.Length)} bytes from {tuple.Item1}");
    }, executionOptions);

    // here we link downloadBlock to outputBlock;
    // this means that when downloadBlock finishes processing an item,
    // the item is posted to outputBlock
    using (downloadBlock.LinkTo(outputBlock))
    {
        // fill downloadBlock with input data
        foreach (var url in urls)
        {
            await downloadBlock.SendAsync(url);
        }

        // tell downloadBlock that no more items are coming,
        // so it can complete once its queued items are processed
        downloadBlock.Complete();
        // wait while the data is downloaded
        await downloadBlock.Completion;

        // tell outputBlock that it is completed
        outputBlock.Complete();
        // wait while the output is printed
        await outputBlock.Completion;
    }
}
static void Main(string[] args)
{
    var urls = new[]
    {
        "http://www.microsoft.com",
        "http://www.google.com",
        "http://stackoverflow.com",
        "http://www.amazon.com",
        "http://www.asp.net"
    };

    Console.WriteLine("Start now.");
    DownloadData(urls).Wait();
    Console.WriteLine("Done.");
    Console.ReadLine();
}
Output:
Start now.
Downloaded 1020 bytes from http://www.microsoft.com
Downloaded 53108 bytes from http://www.google.com
Downloaded 244143 bytes from http://stackoverflow.com
Downloaded 468922 bytes from http://www.amazon.com
Downloaded 27771 bytes from http://www.asp.net
Done.
What can I change to achieve what I would like (i.e. all tasks will begin at the same time)?
To synchronize the beginning of the downloads you could use the Barrier class.
public void Run()
{
    List<string> urls = File.ReadAllLines(@"c:/temp/Input/input.txt").ToList();
    Barrier barrier = new Barrier(urls.Count, () => { Console.WriteLine("Start now"); });
    Task[] tasks = new Task[urls.Count()];
    Parallel.For(0, urls.Count, (int index) =>
    {
        string path = string.Format("{0}image-{1}.jpg", @"c:/temp/Output/", index + 1);
        tasks[index] = DownloadAsync(new Uri(urls[index]), path, barrier);
    });
    Task.WaitAll(tasks); // wait for completion
    Console.WriteLine("Done");
}

async Task DownloadAsync(Uri url, string path, Barrier barrier)
{
    using (WebClient wc = new WebClient())
    {
        barrier.SignalAndWait();
        await wc.DownloadFileTaskAsync(url, path);
        Output(path);
    }
}

Read a file from background task

I'm trying to call a method from inside the Run method of a background task which, among other things, deserializes an XML file. The problem is that I end up in a deadlock. This is the method that reads the file:
protected async Task<Anniversaries> readFile(string fileName)
{
    IStorageFile file;
    Anniversaries tempAnniversaries;

    file = await ApplicationData.Current.LocalFolder.GetFileAsync(fileName);
    using (IRandomAccessStream stream = await file.OpenAsync(FileAccessMode.Read))
    using (Stream inputStream = stream.AsStreamForRead())
    {
        DataContractSerializer serializer = new DataContractSerializer(typeof(Anniversaries));
        tempAnniversaries = serializer.ReadObject(inputStream) as Anniversaries;
    }
    return tempAnniversaries;
}
and here is the Run method:
public sealed class TileUpdater : IBackgroundTask
{
    GeneralAnniversariesManager generalManager = new GeneralAnniversariesManager();
    Anniversaries tempAnn = new Anniversaries();
    string test = "skata";

    public async void Run(IBackgroundTaskInstance taskInstance)
    {
        DateTime curentTime = new DateTime();
        var deferral = taskInstance.GetDeferral();

        await generalManager.InitializeAnniversariesAsync().AsAsyncAction();

        curentTime = DateTime.Now;
        var updater = TileUpdateManager.CreateTileUpdaterForApplication();
        updater.EnableNotificationQueue(true);
        updater.Clear();

        for (int i = 1; i < 6; i++)
        {
            var tile = TileUpdateManager.GetTemplateContent(TileTemplateType.TileWide310x150BlockAndText01);
            tile.GetElementsByTagName("text")[0].InnerText = test + i;
            tile.GetElementsByTagName("text")[1].InnerText = curentTime.ToString();
            updater.Update(new TileNotification(tile));
        }

        deferral.Complete();
    }
}
I'm assuming that by deadlock you mean that the deserialization method finishes too late and your original program tries to read the data before it's finished loading.
It depends on how complicated/reliable you want your solution to be and how you intend to use the program. The simplest way relies on the fact that directory creation is always an atomic operation on Windows, Unix, and OSX. For example, at the top of your readFile function, have something like this:
Directory.CreateDirectory("lock");
Before you start parsing the results of your async action in TileUpdater, have a loop that looks like this:
while (Directory.Exists("lock"))
{
    Thread.Sleep(50);
}
This assumes that everything is happening in the same directory; generally you'll want to replace "lock" with a path that leads to the user's temp directory for their version of Windows/Linux/OSX.
If you want to implement something more complicated where you're reading from a series of files while at the same time reading the deserialized output into your class, you'll want to use something like a System.Collections.Concurrent.ConcurrentQueue that allows your threads to act completely independently without blocking each other.
Incidentally, I'm assuming that you know the Process class and its WaitForExit() method exist. You can spin off a process and then, at a later point, halt the main thread until the spawned process finishes.
Actually, I think I've found where the problem is: the namespaces. I added a try/catch and got an exception about the DataContractSerializer using different namespaces. I have updated the code like this:
file = await ApplicationData.Current.LocalFolder.GetFileAsync("EortologioMovingEntries.xml");
try
{
    using (IRandomAccessStream stream = await file.OpenAsync(FileAccessMode.Read))
    using (Stream inputStream = stream.AsStreamForRead())
    {
        DataContractSerializer serializer = new DataContractSerializer(typeof(Anniversaries), "Anniversaries", "http://schemas.datacontract.org/2004/07/Eortologio.Model");
        tempAnniversaries = serializer.ReadObject(inputStream) as Anniversaries;
    }
}
catch (Exception ex)
{
    error = ex.ToString();
    tempAnniversaries.Entries.Add(new AnniversaryEntry("Ena", DateTime.Now, "skata", PriorityEnum.High));
}
I don't get any exceptions now but the tempAnniversaries returns null. Any ideas?

Mass Downloading of Webpages C#

My application requires that I download a large number of webpages into memory for further parsing and processing. What is the fastest way to do it? My current method (shown below) seems to be too slow and occasionally results in timeouts.
for (int i = 1; i <= pages; i++)
{
    string page_specific_link = baseurl + "&page=" + i.ToString();
    try
    {
        WebClient client = new WebClient();
        var pagesource = client.DownloadString(page_specific_link);
        client.Dispose();
        sourcelist.Add(pagesource);
    }
    catch (Exception)
    {
    }
}
The way you approach this problem is going to depend very much on how many pages you want to download, and how many sites you're referencing.
I'll use a good round number like 1,000. If you want to download that many pages from a single site, it's going to take a lot longer than if you want to download 1,000 pages that are spread out across dozens or hundreds of sites. The reason is that if you hit a single site with a whole bunch of concurrent requests, you'll probably end up getting blocked.
So you have to implement a type of "politeness policy," that issues a delay between multiple requests on a single site. The length of that delay depends on a number of things. If the site's robots.txt file has a crawl-delay entry, you should respect that. If they don't want you accessing more than one page per minute, then that's as fast as you should crawl. If there's no crawl-delay, you should base your delay on how long it takes a site to respond. For example, if you can download a page from the site in 500 milliseconds, you set your delay to X. If it takes a full second, set your delay to 2X. You can probably cap your delay to 60 seconds (unless crawl-delay is longer), and I would recommend that you set a minimum delay of 5 to 10 seconds.
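One possible reading of that rule as code (a sketch; the 2x multiplier and the floor and cap values follow the numbers suggested above, but tune them for your crawler):

static TimeSpan ComputeDelay(TimeSpan responseTime, TimeSpan? crawlDelay)
{
    // A robots.txt crawl-delay always wins when present.
    if (crawlDelay.HasValue)
        return crawlDelay.Value;

    // Otherwise scale the delay with how long the site takes to respond.
    var delay = TimeSpan.FromTicks(responseTime.Ticks * 2);

    // Minimum delay of 5 seconds, capped at 60 seconds.
    if (delay < TimeSpan.FromSeconds(5)) delay = TimeSpan.FromSeconds(5);
    if (delay > TimeSpan.FromSeconds(60)) delay = TimeSpan.FromSeconds(60);
    return delay;
}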
I wouldn't recommend using Parallel.ForEach for this. My testing has shown that it doesn't do a good job. Sometimes it over-taxes the connection and often it doesn't allow enough concurrent connections. I would instead create a queue of WebClient instances and then write something like:
// Create a queue of WebClient instances
BlockingCollection<WebClient> ClientQueue = new BlockingCollection<WebClient>();
// Initialize the queue with some number of WebClient instances

// now process urls
foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();
    worker.DownloadStringAsync(url, ...);
}
When you initialize the WebClient instances that go into the queue, set their DownloadStringCompleted event handlers to point to a completed event handler. That handler should save the string to a file (or perhaps you should just use DownloadFileAsync), and then the client adds itself back to the ClientQueue.
In my testing, I've been able to support 10 to 15 concurrent connections with this method. Any more than that and I run into problems with DNS resolution (DownloadStringAsync doesn't do the DNS resolution asynchronously). You can get more connections, but doing so is a lot of work.
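A sketch of the initialization and completed handler described above (MaxClients and SavePage are illustrative names, not from the answer):

const int MaxClients = 10;
var ClientQueue = new BlockingCollection<WebClient>();

for (int i = 0; i < MaxClients; i++)
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) =>
    {
        var wc = (WebClient)sender;
        if (e.Error == null)
        {
            SavePage((string)e.UserState, e.Result); // e.UserState carries the url
        }
        ClientQueue.Add(wc); // the client adds itself back, ready for the next url
    };
    ClientQueue.Add(client);
}

foreach (var url in urls_to_download)
{
    var worker = ClientQueue.Take();               // blocks until a client is free
    worker.DownloadStringAsync(new Uri(url), url); // the second argument becomes e.UserState
}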
That's the approach I've taken in the past, and it's worked very well for downloading thousands of pages quickly. It's definitely not the approach I took with my high performance Web crawler, though.
I should also note that there is a huge difference in resource usage between these two blocks of code:
WebClient MyWebClient = new WebClient();
foreach (var url in urls_to_download)
{
    MyWebClient.DownloadString(url);
}

---------------

foreach (var url in urls_to_download)
{
    WebClient MyWebClient = new WebClient();
    MyWebClient.DownloadString(url);
}
The first allocates a single WebClient instance that is used for all requests. The second allocates one WebClient for each request. The difference is huge. WebClient uses a lot of system resources, and allocating thousands of them in a relatively short time is going to impact performance. Believe me ... I've run into this. You're better off allocating just 10 or 20 WebClients (as many as you need for concurrent processing), rather than allocating one per request.
Why not just use a web crawling framework? It can handle all the stuff for you (multithreading, HTTP requests, parsing links, scheduling, politeness, etc.).
Abot (https://code.google.com/p/abot/) handles all that stuff for you and is written in c#.
In addition to @David's perfectly valid answer, I want to add a slightly cleaner "version" of his approach.
var pages = new List<string> { "http://bing.com", "http://stackoverflow.com" };
var sources = new BlockingCollection<string>();

Parallel.ForEach(pages, x =>
{
    using (var client = new WebClient())
    {
        var pagesource = client.DownloadString(x);
        sources.Add(pagesource);
    }
});
Yet another approach, that uses async:
static IEnumerable<string> GetSources(List<string> pages)
{
    var sources = new BlockingCollection<string>();
    var latch = new CountdownEvent(pages.Count);

    foreach (var p in pages)
    {
        // Create without a using block: the client must stay alive
        // until DownloadStringCompleted fires.
        var wc = new WebClient();
        wc.DownloadStringCompleted += (x, e) =>
        {
            sources.Add(e.Result);
            latch.Signal();
            wc.Dispose(); // dispose once the download has completed
        };
        wc.DownloadStringAsync(new Uri(p));
    }

    latch.Wait();
    return sources;
}
You should use parallel programming for this purpose.
There are a lot of ways to achieve what you want; the easiest would be something like this:
var pageList = new List<string>();
for (int i = 1; i <= pages; i++)
{
    pageList.Add(baseurl + "&page=" + i.ToString());
}

// pageList is a list of urls
Parallel.ForEach<string>(pageList, (page) =>
{
    try
    {
        using (WebClient client = new WebClient())
        {
            var pagesource = client.DownloadString(page);
            lock (sourcelist)
                sourcelist.Add(pagesource);
        }
    }
    catch (Exception) { }
});
I had a similar case, and this is how I solved it:
using System;
using System.Threading;
using System.Collections.Generic;
using System.Net;
using System.IO;

namespace WebClientApp
{
    class MainClassApp
    {
        private static int requests = 0;
        private static object requests_lock = new object();

        public static void Main()
        {
            List<string> urls = new List<string> { "http://www.google.com", "http://www.slashdot.org" };
            foreach (var url in urls)
            {
                ThreadPool.QueueUserWorkItem(GetUrl, url);
            }

            int cur_req = 0;
            while (cur_req < urls.Count)
            {
                lock (requests_lock)
                {
                    cur_req = requests;
                }
                Thread.Sleep(1000);
            }
            Console.WriteLine("Done");
        }

        private static void GetUrl(Object the_url)
        {
            string url = (string)the_url;
            using (WebClient client = new WebClient())
            using (Stream data = client.OpenRead(url))
            using (StreamReader reader = new StreamReader(data))
            {
                string html = reader.ReadToEnd();
                /// Do something with html
                Console.WriteLine(html);
                lock (requests_lock)
                {
                    // Maybe you could add the HTML to SourceList here
                    requests++;
                }
            }
        }
    }
}
You should think about using parallelism because the slow speed comes from your software waiting for I/O; while one thread is waiting for I/O, another one can get started.
While the other answers are perfectly valid, all of them (at the time of this writing) are neglecting something very important: calls to the web are IO bound, and having a thread block on an operation like this is going to strain system resources and have an impact on your application's responsiveness.
What you really want to do is take advantage of the async methods on the WebClient class (as some have pointed out) as well as the Task Parallel Library's ability to handle the Event-Based Asynchronous Pattern.
First, you would get the urls that you want to download:
IEnumerable<Uri> urls = pages.Select(i => new Uri(baseurl +
"&page=" + i.ToString(CultureInfo.InvariantCulture)));
Then, you would create a new WebClient instance for each url, using the TaskCompletionSource<T> class to handle the calls asynchronously (this won't burn a thread):
IEnumerable<Task<Tuple<Uri, string>>> tasks = urls.Select(url =>
{
    // Create the task completion source.
    var tcs = new TaskCompletionSource<Tuple<Uri, string>>();

    // The web client.
    var wc = new WebClient();

    // Attach to the DownloadStringCompleted event.
    wc.DownloadStringCompleted += (s, e) =>
    {
        // Dispose of the client when done.
        using (wc)
        {
            // If there is an error, set it.
            if (e.Error != null)
            {
                tcs.SetException(e.Error);
            }
            // Otherwise, set cancelled if cancelled.
            else if (e.Cancelled)
            {
                tcs.SetCanceled();
            }
            else
            {
                // Set the result.
                tcs.SetResult(new Tuple<Uri, string>(url, e.Result));
            }
        }
    };

    // Start the process asynchronously, don't burn a thread.
    wc.DownloadStringAsync(url);

    // Return the task.
    return tcs.Task;
});
Now you have an IEnumerable<T> which you can convert to an array and wait on all of the results using Task.WaitAll:
// Materialize the tasks.
Task<Tuple<Uri, string>>[] materializedTasks = tasks.ToArray();
// Wait for all to complete.
Task.WaitAll(materializedTasks);
Then, you can just use Result property on the Task<T> instances to get the pair of the url and the content:
// Cycle through each of the results.
foreach (Tuple<Uri, string> pair in materializedTasks.Select(t => t.Result))
{
    // pair.Item1 will contain the Uri.
    // pair.Item2 will contain the content.
}
Note that the above code has the caveat of not having any error handling.
If you wanted to get even more throughput, instead of waiting for the entire list to be finished, you could process the content of a single page after it's done downloading; Task<T> is meant to be used like a pipeline: when you've completed your unit of work, have it continue to the next one instead of waiting for all of the items to be done (if they can be done in an asynchronous manner).
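For instance, a sketch of that per-page continuation (ProcessPage is a hypothetical handler for a single downloaded page):

Task[] pipeline = materializedTasks.Select(t => t.ContinueWith(completed =>
{
    Tuple<Uri, string> pair = completed.Result;
    ProcessPage(pair.Item1, pair.Item2); // handle each page as soon as it arrives
})).ToArray();
Task.WaitAll(pipeline);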
I am using an active thread count and an arbitrary limit:
private static volatile int activeThreads = 0;

public static void RecordData()
{
    var groupSize = 10;
    var source = db.ListOfUrls; // Thousands of urls
    var iterations = source.Length / groupSize;
    for (int i = 0; i < iterations; i++)
    {
        var subList = source.Skip(groupSize * i).Take(groupSize);
        Parallel.ForEach(subList, (item) => RecordUri(item));
        // Wait here before processing further data, to avoid overload
        while (activeThreads > 30) Thread.Sleep(100);
    }
}

private static async Task RecordUri(Uri uri)
{
    using (WebClient wc = new WebClient())
    {
        Interlocked.Increment(ref activeThreads);
        var jsonData = await wc.DownloadStringTaskAsync(uri);
        var root = JsonConvert.DeserializeObject<RootObject>(jsonData);
        Interlocked.Decrement(ref activeThreads);
        RecordData(root);
    }
}
