I need to compute the hashes of two big files (around 10 GB each) to check whether they are equal. Currently I'm computing one hash at a time, but to save time I had the idea of computing both hashes in parallel.
Here's my method:
private bool checkEquality(FileInfo firstFile, FileInfo secondFile)
{
    //These 2 lines compute one hash at a time, currently commented out for
    //testing purposes
    //byte[] firstHash = createHash(firstFile);
    //byte[] secondHash = createHash(secondFile);

    //My attempt at running both hash computations in parallel
    Task<byte[]> fh = Task.Run(() => createHash(firstFile));
    Task<byte[]> sh = Task.Run(() => createHash(secondFile));

    byte[] firstHash = fh.Result;
    byte[] secondHash = sh.Result;

    for (int i = 0; i < firstHash.Length; i++)
    {
        if (firstHash[i] != secondHash[i]) return false;
    }
    return true;
}
Since this is my first time trying something like this, I'm not quite sure whether the code works the way I imagine. I've seen async methods usually used in combination with the await keyword in other threads, but I can't wrap my head around that concept yet.
Edit:
OK, I changed my method to:
private async Task<bool> checkEquality(FileInfo firstFile, FileInfo secondFile)
{
    //These 2 lines compute one hash at a time, currently commented out for
    //testing purposes
    //byte[] firstHash = createHash(firstFile);
    //byte[] secondHash = createHash(secondFile);

    //My attempt at running both hash computations in parallel
    Task<byte[]> fh = Task.Run(() => createHash(firstFile));
    Task<byte[]> sh = Task.Run(() => createHash(secondFile));

    byte[] firstHash = await fh;
    byte[] secondHash = await sh;

    for (int i = 0; i < firstHash.Length; i++)
    {
        if (firstHash[i] != secondHash[i]) return false;
    }
    return true;
}
Is this the correct way to run both hash computations asynchronously at the same time?
You can use Task.WhenAll to run both tasks and await their completion:
var result = await Task.WhenAll(fh, sh);
byte[] firstHash = result[0];
byte[] secondHash = result[1];
This is an appropriate use case for actual parallel code: it's CPU-bound, not I/O-bound (which is where async/await is more appropriate).
I recommend using the higher-level APIs such as Parallel or PLINQ for parallel code. In this case, PLINQ will work nicely:
private bool checkEquality(FileInfo firstFile, FileInfo secondFile)
{
    var results = new[] { firstFile, secondFile }
        .AsParallel()
        .Select(createHash)
        .ToList();
    return results[0].AsSpan().SequenceEqual(results[1]);
}
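If the caller must not block (the reason the question's edit made the method async), this synchronous PLINQ version can still be offloaded with Task.Run. A minimal sketch, not part of the original answer; checkEqualityAsync is just a made-up wrapper name:
// Sketch only: wrap the CPU-bound comparison so an async caller can await it.
private Task<bool> checkEqualityAsync(FileInfo firstFile, FileInfo secondFile)
{
    return Task.Run(() => checkEquality(firstFile, secondFile));
}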
Related
I'm building a solution to find a desired value from an API call inside a for loop.
I basically need to pass the index of a for loop to an API; one of those indices will return the desired output, and at that moment I want the loop to break. I need to make this process more efficient, so I thought of making the calls asynchronous without awaiting each one, so that when one of them returns the desired output it breaks the loop.
Each API call takes around 10 seconds, so if I make this async or multithreaded I would reduce the execution time considerably.
I haven't found any good guidance on how to make async / non-awaited HTTP requests.
Any suggestions?
for (int i = 0; i < 60000; i += 256)
{
    Console.WriteLine("Incrementing_value: " + i);
    string response = await client.GetStringAsync(
        "http://localhost:7075/api/Function1?index=" + i.ToString());
    Console.WriteLine(response);
    if (response != "null")
    {
        //found the desired output
        break;
    }
}
You can run requests in parallel, and cancel them once you have found your desired output:
public class Program {
    public static async Task Main() {
        var cts = new CancellationTokenSource();
        var client = new HttpClient();
        var tasks = new List<Task>();

        // This is the number of requests that you want to run in parallel.
        const int batchSize = 10;

        int requestId = 0;
        int batchRequestCount = 0;
        while (requestId < 60000) {
            if (batchRequestCount == batchSize) {
                // Batch size reached, wait for current requests to finish.
                await Task.WhenAll(tasks);
                tasks.Clear();
                batchRequestCount = 0;
            }
            tasks.Add(MakeRequestAsync(client, requestId, cts));
            requestId += 256;
            batchRequestCount++;
        }

        if (tasks.Count > 0) {
            // Await any remaining tasks
            await Task.WhenAll(tasks);
        }
    }

    private static async Task MakeRequestAsync(HttpClient client, int index, CancellationTokenSource cts) {
        if (cts.IsCancellationRequested) {
            // The desired output was already found, no need for any more requests.
            return;
        }
        string response;
        try {
            response = await client.GetStringAsync(
                "http://localhost:7075/api/Function1?index=" + index.ToString(), cts.Token);
        }
        catch (TaskCanceledException) {
            // Operation was cancelled.
            return;
        }
        if (response != "null") {
            // Cancel all current connections
            cts.Cancel();
            // Do something with the output ...
        }
    }
}
Note that this solution uses a simple mechanism to limit the number of concurrent requests; a more advanced solution would use semaphores (as mentioned in some of the comments).
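For illustration, a throttling sketch along those lines (it reuses the client, cts and MakeRequestAsync from the code above; the limit of 10 is arbitrary):
// Sketch of the semaphore-based throttling mentioned above.
var semaphore = new SemaphoreSlim(10); // at most 10 requests in flight
var throttledTasks = new List<Task>();
for (int i = 0; i < 60000; i += 256)
{
    await semaphore.WaitAsync();
    int index = i; // capture the current value for the lambda
    throttledTasks.Add(Task.Run(async () =>
    {
        try
        {
            await MakeRequestAsync(client, index, cts);
        }
        finally
        {
            semaphore.Release();
        }
    }));
}
await Task.WhenAll(throttledTasks);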
There are multiple ways to solve this problem. My personal favorite is to use an ActionBlock<T> from the TPL Dataflow library as a processing engine. This component invokes a provided Action<T> delegate for every data element received, and can also be provided with an asynchronous delegate (Func<T, Task>). It has many useful features, including (among others) configurable degree of parallelism/concurrency, and cancellation via a CancellationToken. Here is an implementation that takes advantage of those features:
async Task<string> GetStuffAsync()
{
    var client = new HttpClient();
    var cts = new CancellationTokenSource();
    string output = null;

    // Define the dataflow block
    var block = new ActionBlock<string>(async url =>
    {
        string response = await client.GetStringAsync(url, cts.Token);
        Console.WriteLine($"{url} => {response}");
        if (response != "null")
        {
            // Found the desired output
            output = response;
            cts.Cancel();
        }
    }, new ExecutionDataflowBlockOptions()
    {
        CancellationToken = cts.Token,
        MaxDegreeOfParallelism = 10 // Configure this to a desirable value
    });

    // Feed the block with URLs
    for (int i = 0; i < 60000; i += 256)
    {
        block.Post("http://localhost:7075/api/Function1?index=" + i.ToString());
    }
    block.Complete();

    // Wait for the completion of the block
    try { await block.Completion; }
    catch (OperationCanceledException) { } // Ignore cancellation errors

    return output;
}
The TPL Dataflow library is built into .NET Core / .NET 5, and it is available as a package for .NET Framework.
The upcoming .NET 6 will feature a new API, Parallel.ForEachAsync, that could also be used to solve this problem in a similar fashion.
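As a rough, untested sketch of that approach (it assumes .NET 6's Parallel.ForEachAsync and reuses the client, cts and output variables from the method above):
// Sketch: the same search expressed with .NET 6's Parallel.ForEachAsync.
var urls = Enumerable.Range(0, 235) // indexes 0, 256, ..., 59904, as in the loop above
    .Select(n => "http://localhost:7075/api/Function1?index=" + (n * 256));
var options = new ParallelOptions
{
    MaxDegreeOfParallelism = 10,
    CancellationToken = cts.Token
};
try
{
    await Parallel.ForEachAsync(urls, options, async (url, token) =>
    {
        string response = await client.GetStringAsync(url, token);
        if (response != "null")
        {
            output = response; // found the desired output
            cts.Cancel();      // stop scheduling further requests
        }
    });
}
catch (OperationCanceledException) { } // cancellation is the normal way to stop here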
I have 3 files, each 1 million rows long, and I'm reading them line by line. No processing, just reading, as I'm just trialling things out.
If I do this synchronously it takes 1 second. If I switch to using Threads, one for each file, it is slightly quicker (code not shown, but I simply created and started a new Thread for each file).
When I change to async, it takes 40 times as long, at 40 seconds. If I add in any actual processing work, I cannot see how I'd ever use async over synchronous code, or over Threads if I wanted a responsive application.
Or am I doing something fundamentally wrong with this code, and not using async as it was intended?
Thanks.
class AsyncTestIOBound
{
    Stopwatch sw = new Stopwatch();

    internal void Tests()
    {
        DoSynchronous();
        DoASynchronous();
    }

    #region sync
    private void DoSynchronous()
    {
        sw.Restart();
        var start = sw.ElapsedMilliseconds;
        Console.WriteLine($"Starting Sync Test");
        DoSync("Addresses", "SampleLargeFile1.txt");
        DoSync("routes ", "SampleLargeFile2.txt");
        DoSync("Equipment", "SampleLargeFile3.txt");
        sw.Stop();
        Console.WriteLine($"Ended Sync Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
        Console.ReadKey();
    }

    private long DoSync(string v, string filename)
    {
        string line;
        long counter = 0;
        using (StreamReader file = new StreamReader(filename))
        {
            while ((line = file.ReadLine()) != null)
            {
                counter++;
            }
        }
        Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
        return counter;
    }
    #endregion

    #region async
    private void DoASynchronous()
    {
        sw.Restart();
        var start = sw.ElapsedMilliseconds;
        Console.WriteLine($"Starting Async Test");
        Task a = DoASync("Addresses", "SampleLargeFile1.txt");
        Task b = DoASync("routes ", "SampleLargeFile2.txt");
        Task c = DoASync("Equipment", "SampleLargeFile3.txt");
        Task.WaitAll(a, b, c);
        sw.Stop();
        Console.WriteLine($"Ended Async Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
        Console.ReadKey();
    }

    private async Task<long> DoASync(string v, string filename)
    {
        string line;
        long counter = 0;
        using (StreamReader file = new StreamReader(filename))
        {
            while ((line = await file.ReadLineAsync()) != null)
            {
                counter++;
            }
        }
        Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
        return counter;
    }
    #endregion
}
Since you are using await many times in a giant loop (in your case, looping over each line of a "SampleLargeFile"), you are doing a lot of context switching, and the overhead can be really bad.
For each line, your code may be switching between the files. If your computer uses a hard drive, this can get even worse: imagine the drive head seeking back and forth for every single line.
When you use normal threads, you are not switching context for every line.
To solve this, read each file in a single pass. You can still use async/await (ReadToEndAsync()) and get good performance.
EDIT
So, you are trying to count lines in the text file using async, right?
Try this (no need to load the entire file in memory):
private async Task<int> CountLines(string path)
{
    int count = 0;
    await Task.Run(() =>
    {
        using (FileStream fs = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        {
            while (sr.ReadLine() != null)
            {
                count++;
            }
        }
    });
    return count;
}
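Used against the three files from the question, the counts can then run concurrently (a sketch; the file names are the ones from the question):
// Sketch: start all three counts, then await them together.
var countTasks = new[]
{
    CountLines("SampleLargeFile1.txt"),
    CountLines("SampleLargeFile2.txt"),
    CountLines("SampleLargeFile3.txt")
};
int[] counts = await Task.WhenAll(countTasks);
Console.WriteLine(string.Join(", ", counts));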
A few things. First, I would read all lines at once in the async method so that you only await once (instead of once per line):
private async Task<long> DoASync(string v, string filename)
{
    string lines;
    using (StreamReader file = new StreamReader(filename))
    {
        lines = await file.ReadToEndAsync();
    }
    long counter = lines.Split('\n').Length;
    Console.WriteLine($"{v}: T{Thread.CurrentThread.ManagedThreadId}: Lines: {counter}");
    return counter;
}
Next, you can also await each Task individually. This will cause your CPU to focus on only one file at a time, instead of possibly switching between the 3, which causes more overhead.
private async void DoASynchronous()
{
    sw.Restart();
    var start = sw.ElapsedMilliseconds;
    Console.WriteLine($"Starting Async Test");
    await DoASync("Addresses", "SampleLargeFile1.txt");
    await DoASync("routes ", "SampleLargeFile2.txt");
    await DoASync("Equipment", "SampleLargeFile3.txt");
    sw.Stop();
    Console.WriteLine($"Ended Async Test. Took {(sw.ElapsedMilliseconds - start)} mseconds");
    Console.ReadKey();
}
The reason you are seeing slower performance is the per-await overhead. For each new line, the async machinery adds processing, allocations and synchronization, and it increases CPU usage. Also, we need to transition to kernel mode twice instead of once (first to initiate the IO, then to dequeue the IO completion notification).
More info, see: Does async await increases Context switching
I need to process a very large text file (6-8 GB). I wrote the code attached below. Unfortunately, every time the output file (created next to the source file) reaches ~2 GB, I observe a sudden jump in memory consumption (from ~100 MB to a few GB) and, as a result, an out-of-memory exception.
The debugger indicates that the OOM occurs at while ((tempLine = streamReader.ReadLine()) != null).
I am targeting .NET 4.7 and x64 architecture only.
A single line is at most 50 characters long.
I could work around this by splitting the original file into smaller parts so the problem doesn't occur during processing, and merging the results back into one file at the end, but I would prefer not to do that.
Code:
public async Task PerformDecodeAsync(string sourcePath, string targetPath)
{
    var allLines = CountLines(sourcePath);
    long processedlines = default;
    using (File.Create(targetPath));
    var streamWriter = File.AppendText(targetPath);
    var decoderBlockingCollection = new BlockingCollection<string>(1000);
    var writerBlockingCollection = new BlockingCollection<string>(1000);

    var producer = Task.Factory.StartNew(() =>
    {
        using (var streamReader = new StreamReader(File.OpenRead(sourcePath), Encoding.Default, true))
        {
            string tempLine;
            while ((tempLine = streamReader.ReadLine()) != null)
            {
                decoderBlockingCollection.Add(tempLine);
            }
            decoderBlockingCollection.CompleteAdding();
        }
    });

    var consumer1 = Task.Factory.StartNew(() =>
    {
        foreach (var line in decoderBlockingCollection.GetConsumingEnumerable())
        {
            short decodeCounter = 0;
            StringBuilder builder = new StringBuilder();
            foreach (var singleChar in line)
            {
                var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);
                if (positionInDecodeKey > 0)
                    builder.Append(model.Substring(positionInDecodeKey, 1));
                else
                    builder.Append(singleChar);

                if (decodeCounter > 18)
                    decodeCounter = 0;
                else ++decodeCounter;
            }
            writerBlockingCollection.TryAdd(builder.ToString());
            Interlocked.Increment(ref processedlines);
            if (processedlines == (long)allLines)
                writerBlockingCollection.CompleteAdding();
        }
    });

    var writer = Task.Factory.StartNew(() =>
    {
        foreach (var line in writerBlockingCollection.GetConsumingEnumerable())
        {
            streamWriter.WriteLine(line);
        }
    });

    Task.WaitAll(producer, consumer1, writer);
}
Solutions, as well as advice on how to optimize it a little more, are greatly appreciated.
Like I said, I'd probably go for something simpler first, unless or until it's demonstrated that it's not performing well. As Adi said in their answer, this work appears to be I/O bound - so there seems little benefit in creating multiple tasks for it.
public void PerformDecode(string sourcePath, string targetPath)
{
    File.WriteAllLines(targetPath, File.ReadLines(sourcePath).Select(line =>
    {
        short decodeCounter = 0;
        StringBuilder builder = new StringBuilder();
        foreach (var singleChar in line)
        {
            var positionInDecodeKey = decodingKeysList[decodeCounter].IndexOf(singleChar);
            if (positionInDecodeKey > 0)
                builder.Append(model.Substring(positionInDecodeKey, 1));
            else
                builder.Append(singleChar);

            if (decodeCounter > 18)
                decodeCounter = 0;
            else ++decodeCounter;
        }
        return builder.ToString();
    }));
}
Now, of course, this code actually blocks until it's done, which is why I've not marked it async. But then, so did yours, and it should have been warning about that already.
(You could try using PLINQ instead of LINQ for the Select portion but honestly, the amount of processing we're doing here looks trivial; Profile first before applying any such change)
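If you did want to try that, the change is roughly the following (an untested sketch; DecodeLine is a hypothetical method holding the same per-line transform as the lambda above):
// Sketch of the PLINQ variant: AsOrdered keeps output lines in input order.
File.WriteAllLines(targetPath,
    File.ReadLines(sourcePath)
        .AsParallel()
        .AsOrdered()
        .Select(DecodeLine)); // DecodeLine = the per-line decode shown above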
As the work you are doing is mostly IO bound, you aren't really gaining anything from parallelization. It also looks to me (correct me if I'm wrong) like your transformation algorithm doesn't depend on reading the file line by line, so I would recommend instead doing something like this:
void Main()
{
    //Setup streams for testing
    using (var inputStream = new MemoryStream())
    using (var outputStream = new MemoryStream())
    using (var inputWriter = new StreamWriter(inputStream))
    using (var outputReader = new StreamReader(outputStream))
    {
        //Write test string and rewind stream
        inputWriter.Write("abcdefghijklmnop");
        inputWriter.Flush();
        inputStream.Seek(0, SeekOrigin.Begin);

        var inputBuffer = new byte[5];
        var outputBuffer = new byte[5];
        int inputLength;
        while ((inputLength = inputStream.Read(inputBuffer, 0, inputBuffer.Length)) > 0)
        {
            for (var i = 0; i < inputLength; i++)
            {
                //transform each character
                outputBuffer[i] = ++inputBuffer[i];
            }
            //Write to output
            outputStream.Write(outputBuffer, 0, inputLength);
        }

        //Read for testing
        outputStream.Seek(0, SeekOrigin.Begin);
        var output = outputReader.ReadToEnd();
        Console.WriteLine(output);
        //Outputs: "bcdefghijklmnopq"
    }
}
Obviously, you would use FileStreams instead of MemoryStreams, and you can increase the buffer length to something much larger (this was just a demonstrative example). Also, since your original method is async, you would use the async variants of Stream.Write and Stream.Read.
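A rough async sketch of the same loop with FileStreams (the method name, paths and buffer size are placeholders, and the byte-increment stands in for your real transform):
// Sketch: FileStream version of the loop above, using ReadAsync/WriteAsync.
public async Task TransformAsync(string sourcePath, string targetPath)
{
    using (var input = new FileStream(sourcePath, FileMode.Open, FileAccess.Read))
    using (var output = new FileStream(targetPath, FileMode.Create, FileAccess.Write))
    {
        var buffer = new byte[81920];
        int read;
        while ((read = await input.ReadAsync(buffer, 0, buffer.Length)) > 0)
        {
            for (var i = 0; i < read; i++)
            {
                buffer[i]++; // transform each byte in place (placeholder transform)
            }
            await output.WriteAsync(buffer, 0, read);
        }
    }
}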
I'm using the following code to post an image to a server.
var image = Image.FromFile(@"C:\Image.jpg");
Task<string> upload = Upload(image);
upload.Wait();

public static async Task<string> Upload(Image image)
{
    var uriBuilder = new UriBuilder
    {
        Host = "somewhere.net",
        Path = "/path/",
        Port = 443,
        Scheme = "https",
        Query = "process=false"
    };

    using (var client = new HttpClient())
    {
        client.DefaultRequestHeaders.Add("locale", "en_US");
        client.DefaultRequestHeaders.Add("country", "US");

        var content = ConvertToHttpContent(image);
        content.Headers.ContentType = MediaTypeHeaderValue.Parse("image/jpeg");

        using (var mpcontent = new MultipartFormDataContent("--myFakeDividerText--")
            {
                {content, "fakeImage", "myFakeImageName.jpg"}
            }
        )
        {
            using (
                var message = await client.PostAsync(uriBuilder.Uri, mpcontent))
            {
                var input = await message.Content.ReadAsStringAsync();
                return "nothing for now";
            }
        }
    }
}
I'd like to modify this code to run multiple threads. I've used "ThreadPool.QueueUserWorkItem" before and started to modify the code to leverage it.
private void UseThreadPool()
{
    int minWorker, minIOC;
    ThreadPool.GetMinThreads(out minWorker, out minIOC);
    ThreadPool.SetMinThreads(1, minIOC);

    int maxWorker, maxIOC;
    ThreadPool.GetMaxThreads(out maxWorker, out maxIOC);
    ThreadPool.SetMinThreads(4, maxIOC);

    var events = new List<ManualResetEvent>();
    foreach (var image in ImageCollection)
    {
        var resetEvent = new ManualResetEvent(false);
        ThreadPool.QueueUserWorkItem(
            arg =>
            {
                var img = Image.FromFile(image.getPath());
                Task<string> upload = Upload(img);
                upload.Wait();
                resetEvent.Set();
            });
        events.Add(resetEvent);

        if (events.Count <= 0) continue;
        foreach (ManualResetEvent e in events) e.WaitOne();
    }
}
The problem is that only one thread executes at a time due to the call to "upload.Wait()". So I'm still executing each thread in sequence. It's not clear to me how I can use PostAsync with a thread-pool.
How can I post images to a server using multiple threads by tweaking the code above? Is HttpClient PostAsync the best way to do this?
I'd like to modify this code to run multiple threads.
Why? The thread pool should only be used for CPU-bound work (and I/O completions, of course).
You can do concurrency just fine with async:
var tasks = ImageCollection.Select(image =>
{
    var img = Image.FromFile(image.getPath());
    return Upload(img);
});
await Task.WhenAll(tasks);
Note that I removed your Wait. You should avoid using Wait or Result with async tasks; use await instead. Yes, this will cause async to grow through your code, and you should use async "all the way".
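In sketch form, the calling method itself becomes async as well (UploadAllAsync is a made-up name for the method that previously blocked):
// Sketch only: the caller returns a Task and awaits instead of blocking.
private async Task UploadAllAsync()
{
    var tasks = ImageCollection.Select(image =>
    {
        var img = Image.FromFile(image.getPath());
        return Upload(img);
    });
    await Task.WhenAll(tasks);
}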
I have an application that needs to read very big .CSV files on application start and convert each row to an object. These are the methods that read the files:
public List<Aobject> GetAobject()
{
    List<Aobject> Aobjects = new List<Aobject>();
    using (StreamReader sr = new StreamReader(pathA, Encoding.GetEncoding("Windows-1255")))
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string[] spl = line.Split(',');
            Aobject p = new Aobject { Aprop = spl[0].Trim(), Bprop = spl[1].Trim(), Cprop = spl[2].Trim() };
            Aobjects.Add(p);
        }
    }
    return Aobjects;
}
public List<Bobject> GetBobject()
{
    List<Bobject> Bobjects = new List<Bobject>();
    using (StreamReader sr =
        new StreamReader(pathB, Encoding.GetEncoding("Windows-1255")))
    {
        //parts.Clear();
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            string[] spl = line.Split(',');
            Bobject p = new Bobject();
            p.Cat = spl[0];
            p.Name = spl[1];
            p.Serial1 = spl[3].ToUpper().Contains("1");
            if (spl[4].StartsWith("1"))
                p.Technical = 1;
            else if (spl[4].StartsWith("2"))
                p.Technical = 2;
            else
                p.Technical = 0;
            Bobjects.Add(p);
        }
    }
    return Bobjects;
}
This was blocking my UI for a few seconds, so I tried to make it multi-threaded. However, all my tests show that the non-threaded scenario is faster. This is how I tested it:
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < 1000; i++)
{
    Dal dal = new Dal();
    Thread a = new Thread(() => { ThreadedAobjects = dal.GetAobject(); });
    Thread b = new Thread(() => { ThreadedBobjects = dal.GetBobject(); });
    a.Start();
    b.Start();
    b.Join();
    a.Join();
}
sw.Stop();
txtThreaded.Text = sw.Elapsed.ToString();

Stopwatch sw2 = new Stopwatch();
sw2.Start();
for (int i = 0; i < 1000; i++)
{
    Dal dal2 = new Dal();
    NonThreadedAobjects = dal2.GetAobject();
    NonThreadedBobjects = dal2.GetBobject();
}
sw2.Stop();
txtUnThreaded.Text = sw2.Elapsed.ToString();
The results:
Threaded run: 00:01:55.1378686
Unthreaded run: 00:01:37.1197840
Compiled for .NET 4.0 (but should also work under .NET 3.5), in release mode.
Could someone please explain why this happens and how I can improve it?
You are ignoring the cost associated with creating and starting up a thread. Instead of creating new threads try using the thread pool:
ThreadPool.QueueUserWorkItem(_ => { ThreadedAobjects = dal.GetAobject(); });
You'll also need to keep a count of how many operations you have completed in order to properly calculate your total time. Have a look at this link: http://msdn.microsoft.com/en-us/library/3dasc8as.aspx
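One simple way to do that counting is a CountdownEvent (a sketch, not from the original answer; it reuses the dal and ThreadedAobjects/ThreadedBobjects fields from the question):
// Sketch: wait for both thread-pool work items before stopping the stopwatch.
using (var countdown = new CountdownEvent(2))
{
    ThreadPool.QueueUserWorkItem(_ =>
    {
        ThreadedAobjects = dal.GetAobject();
        countdown.Signal();
    });
    ThreadPool.QueueUserWorkItem(_ =>
    {
        ThreadedBobjects = dal.GetBobject();
        countdown.Signal();
    });
    countdown.Wait(); // blocks until both Signal() calls have happened
}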
I would suggest a single thread that calls GetAobject and then calls GetBobject. Your task is almost certainly I/O bound, and if those two files are very large and on the same drive, then trying to access them concurrently will cause a lot of unnecessary disk seeks. So your code becomes:
ThreadPool.QueueUserWorkItem(_ =>
{
    AObjects = GetAObject();
    BObjects = GetBObject();
});
That also simplifies your code because you only have to synchronize on one ManualResetEvent.
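In sketch form (not verbatim from this answer; it uses the same fields as the snippet above):
// Sketch: one ManualResetEvent, signalled when both loads are done.
var done = new ManualResetEvent(false);
ThreadPool.QueueUserWorkItem(_ =>
{
    AObjects = GetAObject();
    BObjects = GetBObject();
    done.Set();
});
// ... later, just before the results are needed:
done.WaitOne();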
If you run this test repeatedly you will get slightly different results every time. The time things take is influenced by everything else happening on the computer while the test runs: other processes, GC, and so on.
Your results are still reasonable, though, because having another thread means the processor needs to do more context switching, and every context switch takes time.
You can read more about context switches here:
http://en.wikipedia.org/wiki/Context_switch
Adding to Slugart's correct answer: your parallelisation is also ineffective in a number of ways. You wait for the first thread to complete while the second one may finish sooner and then sit idle for some time (look into the Task Parallel Library and PLINQ).
Also, your operations are IO bound, which means any gain from parallelism depends on the IO device (some devices perform better sequentially, and trying to do multiple reads at once will slow the overall operation down).
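For reference, a Task-based sketch of that test (using Task.Factory.StartNew so it still fits the .NET 4.0 target mentioned in the question; whether it actually helps depends on the drive, as noted above):
// Sketch: load both files via the Task Parallel Library instead of raw threads.
var dal = new Dal();
var taskA = Task.Factory.StartNew(() => dal.GetAobject());
var taskB = Task.Factory.StartNew(() => dal.GetBobject());
Task.WaitAll(taskA, taskB);
List<Aobject> aObjects = taskA.Result;
List<Bobject> bObjects = taskB.Result;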