I have a program that takes a user's integer input ("1"), increments it once per file in a directory, and stamps that number on each file (the first file gets 1, the next 2, and so on). A foreach loop walks each directory, gets its files, increments the counter, and calls the stamp method until all files are done. In this process the order is important. However, multitasking (Parallel.ForEach) doesn't guarantee order; as I understand it, items complete in whatever order the threads finish, and that may also break the i++ logic (correct me if I'm wrong).
The question is: how do I apply multithreading in this case? I'm thinking of saving the values from the foreach at the end, passing them to the stamping method, and having the method stamp x files at a time. I don't know if that's possible or how to apply it.
Here is my watermark method:
//text comes from the foreach already set.
public void waterMark(string text, string sourcePath, string destinationPath)
{
    using (Bitmap bitmap = new Bitmap(sourcePath))
    {
        //somecode
        using (Graphics graphics = Graphics.FromImage(tempBitmap))
        {
            //somecode
            tempBitmap.Save(destinationPath, ImageFormat.Tiff);
            //Error^: a generic error occurred in GDI+
            //I think it's caused by trying to save multiple files at once
        }
    }
}
The foreach loop:
var files = folder.GetFiles();
Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 4 }, (file, state, indexer) =>
{
    //somecode that calls the waterMark method in multiple spots as of now
});
Thank you in advance.
There is an overload of Parallel.ForEach that also provides an index for the item being processed:
Parallel.ForEach(someEnumerable, (val, state, idx) => Console.WriteLine(idx))
You can use it to keep track of the index in a thread-safe fashion.
As for the GDI+ stuff (Bitmap), I think you're safe as long as you use a single thread for all interactions with the bitmap. Don't try to do anything clever with async between instantiation and disposal.
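For the numbering scenario in the question, that index argument gives each file a stable stamp no matter which thread finishes first, because it reflects the item's position in the source sequence rather than the completion order. A sketch, assuming `startValue` holds the user's input and `GetDestinationPath` is a placeholder for however the output path is built:

```csharp
var files = folder.GetFiles();
Parallel.ForEach(files, new ParallelOptions { MaxDegreeOfParallelism = 4 },
    (file, state, index) =>
{
    // index is the position of 'file' in 'files', so no shared i++ is needed
    long stamp = startValue + index;
    waterMark(stamp.ToString(), file.FullName, GetDestinationPath(file));
});
```

Each call still saves to its own destination file, which avoids having multiple threads touch the same Bitmap.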
Related
I'm writing a console app to do some data migration between a legacy system and a new version. Each record has associated images stored on one web server and I'm downloading/altering/uploading each image to Azure (and also recording some data about each image in a database).
Here's a rough outline, in code:
public void MigrateData()
{
    var records = GetRecords();
    foreach (var record in records)
    {
        // ...
        MigrateImages(record.Id, record.ImageCount);
    }
}
public void MigrateImages(int recordId, int imageCount)
{
    for (int i = 1; i <= imageCount; i++)
    {
        var legacyImageData = DownloadImage("the image url");
        if (legacyImageData != null && legacyImageData.Length > 0)
        {
            // discard because we don't need the image id here, but it's used in other workflows
            var _ = InsertImage(recordId, legacyImageData);
        }
    }
}
// This method can be used elsewhere, so the int return is necessary and cannot be changed
public int InsertImage(int recordId, byte[] imageData)
{
    var urls = UploadImage(imageData).Result;
    return // method call to save image and return image ID
}
public async Task<(Uri LargeUri, Uri ThumbnailUri)> UploadImage(byte[] imageData)
{
    byte[] largeData = ResizeImageToLarge(imageData);
    byte[] thumbnailData = ResizeImageToThumbnail(imageData);
    var largeUpload = largeBlob.UploadFromByteArrayAsync(largeData, 0, largeData.Length);
    var thumbUpload = thumbsBlob.UploadFromByteArrayAsync(thumbnailData, 0, thumbnailData.Length);
    await Task.WhenAll(largeUpload, thumbUpload);
    Uri largeUrl = null; // logic to build url
    Uri thumbUrl = null; // logic to build url
    return (largeUrl, thumbUrl);
}
I'm using async/await in UploadImage() to run the large and thumbnail uploads in parallel, saving time.
My question is: how can I utilize async/await in MigrateImages() to do parallel image uploads and reduce the overall time the task takes (if that's possible/makes sense)? Does the fact that I'm already using async/await within UploadImage() hinder that goal?
It's probably obvious from my question, but async/await is still something I can't fully wrap my head around, in terms of how to correctly utilize and/or implement it.
The async/await technology is intended for facilitating asynchrony, not concurrency. In your case you want to speed-up the images-uploading process by uploading multiple images concurrently. It is not important if you are wasting some thread-pool threads by blocking them, and you have no UI thread that needs to remain unblocked. You are making a tool that you are going to use once or twice and that's it. So I suggest that you spare yourself the trouble of trying to understand why your app is not behaving the way you expect, and avoid async/await altogether. Stick with the simple and familiar synchronous programming model for this assignment.
Combining synchronous and asynchronous code is dangerous, and even more so if your experience and understanding of async/await is limited. There are many intricacies in this technology. Using the Task.Result property in particular is a red flag. When you become more proficient with async/await code, you are going to treat any use of Result like an unlocked grenade, ready to explode in your face at any time and make you look like a fool. When used in apps with a synchronization context (Windows Forms, ASP.NET) it can introduce deadlocks so easily that it's not even funny.
Here is how you can achieve the desired concurrency, without having to deal with the complexities of asynchrony:
public (Uri LargeUri, Uri ThumbnailUri) UploadImage(byte[] imageData)
{
    byte[] largeData = ResizeImageToLarge(imageData);
    byte[] thumbnailData = ResizeImageToThumbnail(imageData);
    var largeUpload = largeBlob.UploadFromByteArrayAsync(
        largeData, 0, largeData.Length);
    var thumbUpload = thumbsBlob.UploadFromByteArrayAsync(
        thumbnailData, 0, thumbnailData.Length);
    Task.WaitAll(largeUpload, thumbUpload);
    Uri largeUrl = null; // logic to build url
    Uri thumbUrl = null; // logic to build url
    return (largeUrl, thumbUrl);
}
I just replaced await Task.WhenAll with Task.WaitAll, and removed the wrapping Task from the method's return value.
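If some concurrency across images is still wanted, it can be layered on top of the synchronous version with Parallel.For, keeping the familiar programming model. A sketch (DownloadImage and InsertImage as in the question; the parallelism cap is a guess to tune against what the web server and the Azure endpoint tolerate):

```csharp
public void MigrateImages(int recordId, int imageCount)
{
    Parallel.For(1, imageCount + 1,
        new ParallelOptions { MaxDegreeOfParallelism = 4 },
        i =>
    {
        // each iteration downloads and uploads one image independently
        var legacyImageData = DownloadImage("the image url");
        if (legacyImageData != null && legacyImageData.Length > 0)
        {
            InsertImage(recordId, legacyImageData);
        }
    });
}
```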
I have a folder with many CSV files in it, which are around 3MB each in size.
example of content of one CSV:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a console app written in C# that searches each CSV file for certain strings.
The strings to search for are in a txt file.
example of search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
Then I call the method that searches for each search string in all files in the given folder:
private static object _lockObject = new object();

public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
    var result = new List<string>();
    var files = Directory.EnumerateFiles(searchFolder);

    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file);
        if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
        {
            lock (_lockObject)
            {
                foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
                {
                    result.Add(searchFound);
                }
            }
        }
    });

    return result;
}
The question now is: can I improve the performance of this operation in any way?
I have around 100GB of files to search through.
It takes approximately 1 hour to search all ~30,000 files with around 25 search strings, on an SSD disk and a good i7 CPU.
Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion that you wrote, and this is what performed best for me (removing ToLower from the LINQ yielded the biggest performance boost; search time went from 1 hour down to 16 minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
    var result = new BlockingCollection<string>();
    var files = Directory.EnumerateFiles(searchFolder);

    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
        if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
        {
            foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
            {
                result.Add(searchFound);
            }
        }
    });

    return result;
}
Something like Lucene could probably give you a performance boost: why don't you index your data so you can search it easily?
Take a look at Lucene.NET.
You'll avoid searching the data sequentially. In addition, you can build many indexes over the same data to reach certain results at light speed.
Try to:
Do .ToLower once per line instead of calling .ToLower for each element in searchList.
Do one scan of the file instead of two passes (Any and then Where). Collect the matches, and add them under the lock only if any were found. In your sample you waste time on two passes and block all the threads while searching and adding.
If you know the position where to look (in your sample you do), you can scan from that position instead of the whole string.
Use a producer-consumer pattern, for example with BlockingCollection<T>, so there is no need for a lock.
If you need a strict field match, build a HashSet from searchList and do searchHash.Contains(fieldValue); this will speed the process up dramatically.
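As a sketch of that last point, assuming each CSV line has the shape key,value; and the value field is what is being matched (the field position and separators are guesses based on the sample data in the question):

```csharp
var searchHash = new HashSet<string>(searchList);
foreach (var line in File.ReadLines(file))
{
    // split "key,value;" into its two fields and match the value exactly
    var fields = line.TrimEnd(';').Split(',');
    if (fields.Length > 1 && searchHash.Contains(fields[1]))
        results.Add(line);
}
```

An exact HashSet lookup is O(1) per line, instead of scanning every search string against every line.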
So here's a sample (not tested):
using (var searcher = new FilesSearcher(
    searchFolder: "path",
    searchList: toLookFor))
{
    searcher.SearchContentListInFiles();
}
here is the searcher:
public class FilesSearcher : IDisposable
{
    private readonly BlockingCollection<string[]> filesInMemory;
    private readonly string searchFolder;
    private readonly string[] searchList;

    public FilesSearcher(string searchFolder, string[] searchList)
    {
        // the reader thread stores lines here
        this.filesInMemory = new BlockingCollection<string[]>(
            // limit the count of files held in memory, so if the processing threads
            // are not fast enough, the reader takes a break and waits
            boundedCapacity: 100);
        this.searchFolder = searchFolder;
        this.searchList = searchList;
    }

    public IEnumerable<string> SearchContentListInFiles()
    {
        // start reading;
        // we don't need many threads here, probably 1 thread per storage device is the optimum
        var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);
        // at least one processing thread, because the reader thread is IO-bound
        var taskCount = Math.Max(1, Environment.ProcessorCount - 1);
        // start the search threads
        var tasks = Enumerable
            .Range(0, taskCount)
            .Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
            .ToArray();
        // wait for the results
        Task.WaitAll(tasks);
        // combine the results
        return tasks
            .SelectMany(t => t.Result)
            .ToArray();
    }

    private string[] Search()
    {
        // if you always get unique results, use a list
        var results = new List<string>();
        //var results = new HashSet<string>();
        foreach (var content in this.filesInMemory.GetConsumingEnumerable())
        {
            // one pass over a file
            var currentFileMatches = content
                .Where(sourceLine =>
                {
                    // lower-case once per line, so we don't need a lowered copy of the file
                    var lower = sourceLine.ToLower();
                    return this.searchList.Any(lower.Contains);
                });
            // store the current file's matches
            foreach (var currentMatch in currentFileMatches)
            {
                results.Add(currentMatch);
            }
        }
        return results.ToArray();
    }

    private void ReadFiles()
    {
        var files = Directory.EnumerateFiles(this.searchFolder);
        try
        {
            foreach (var file in files)
            {
                var fileContent = File.ReadLines(file);
                // add the file, or wait if filesInMemory is full
                this.filesInMemory.Add(fileContent.ToArray());
            }
        }
        finally
        {
            this.filesInMemory.CompleteAdding();
        }
    }

    public void Dispose()
    {
        if (filesInMemory != null)
            filesInMemory.Dispose();
    }
}
This operation is first and foremost disk-bound, and disk-bound operations do not benefit from multithreading. Indeed, all you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.
If you had loaded all the files into memory first, your operation would be memory-bound. And memory-bound operations do not benefit from multithreading either (usually; this gets into the details of CPU and memory architecture).
While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations, and nothing here looks remotely CPU-bound. So multithreading that search (one thread per file) will not make it faster, and will likely make it slower due to all the thread-switching and synchronization overhead.
In the question Why I need to overload the method when use it as ThreadStart() parameter?, I got the following solution for saving file in separate thread problem (it's required to save file when delete or add new instance of the PersonEntity):
private ObservableCollection<PersonEntity> allStaff;
private Thread dataFileTransactionsThread;

public staffRepository() {
    allStaff = getStaffDataFromTextFile();
    dataFileTransactionsThread = new Thread(UpdateDataFileThread);
}

public void UpdateDataFile(ObservableCollection<PersonEntity> allStaff)
{
    dataFileTransactionsThread.Start(allStaff);
    // If you want to wait until the save finishes, uncomment the following line
    // dataFileTransactionsThread.Join();
}

private void UpdateDataFileThread(object data) {
    var allStaff = (ObservableCollection<PersonEntity>)data;
    System.Diagnostics.Debug.WriteLine("dataFileTransactions Thread Status:" + dataFileTransactionsThread.ThreadState);
    string containsWillBeSaved = "";
    // ...
    File.WriteAllText(fullPathToDataFile, containsWillBeSaved);
    System.Diagnostics.Debug.WriteLine("Data Save Successful");
    System.Diagnostics.Debug.WriteLine("dataFileTransactions Thread Status:" + dataFileTransactionsThread.ThreadState);
}
Now, if I sequentially delete two instances of PersonEntity, a System.Threading.ThreadStateException occurs: "Thread is still executing or has already finished; a restart is impossible."
I understand the meaning of this exception as a whole; however, the following workaround would not be enough: the next time, the file would not be saved.
if (!dataFileTransactionsThread.IsAlive) {
    dataFileTransactionsThread.Start(allStaff);
}
It would probably be better to restart the thread when it has finished and then save the file again. However, the code also needs to handle the case where three or more instances are deleted sequentially. At the concept level it's simple: only the newest allStaff collection matters, so the previous unsaved allStaff collections are no longer necessary.
How can I realize the above concept in C#?
I'm going to suggest using Microsoft's Reactive Framework. NuGet "System.Reactive".
Then you can do this:
IObservable<List<PersonEntity>> query =
    Observable
        .FromEventPattern<NotifyCollectionChangedEventHandler, NotifyCollectionChangedEventArgs>(
            h => allStaff.CollectionChanged += h, h => allStaff.CollectionChanged -= h)
        .Throttle(TimeSpan.FromSeconds(2.0))
        .Select(x => allStaff.ToList())
        .ObserveOn(Scheduler.Default);

IDisposable subscription =
    query
        .Subscribe(u =>
        {
            string containsWillBeSaved = "";
            // ...
            File.WriteAllText(fullPathToDataFile, containsWillBeSaved);
            System.Diagnostics.Debug.WriteLine("Data Save Successful");
        });
This code watches your allStaff collection for changes. For every change it waits 2 seconds to see if any further changes come through, and if they don't, it takes a copy of your collection (this is crucial for the threading to work) and saves it.
It will save no more than once every 2 seconds, and only when there has been at least one change.
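Remember to dispose the subscription when you are done with it, so the CollectionChanged handler is detached and no further saves fire; for example (assuming this lives in a Windows Forms form):

```csharp
protected override void OnFormClosed(FormClosedEventArgs e)
{
    subscription.Dispose(); // unhooks CollectionChanged and stops further saves
    base.OnFormClosed(e);
}
```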
I have a method like this :
public ConcurrentBag<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    for (var i = 0; i < foldersPath.Count; i++)
    {
        var index = i;
        new Thread(delegate()
        {
            foreach (var file in BrowseFiles(foldersPath[index]))
            {
                if (file.Name.Contains(someList[0]) || file.Name.Contains(someList[1]))
                {
                    using (var fileStream = File.Open(file.Path, FileMode.Open))
                    using (var bufferedStream = new BufferedStream(fileStream))
                    using (var streamReader = new StreamReader(bufferedStream))
                        ...
To give you more details:
This method starts n threads (n = foldersPath.Count), and each thread reads all the files whose names contain the strings listed in someList.
Right now my list contains only 2 strings (conditions), which is why I'm doing:
file.Name.Contains(someList[0]) || file.Name.Contains(someList[1])
What I want to do now is to replace this line with something that check all elements in the list someList
How can I do that?
Edit
Now that I replaced that line by if (someList.Any(item => file.Name.Contains(item)))
The next question is how I can optimize the performance of this code, knowing that each item in foldersPath is a separate hard drive in my network (and there are never more than 5 hard drives).
You could use something like if (someList.Any(item => file.Name.Contains(item)))
This will iterate each item in someList, and check if any of the items are contained in the file name, returning a boolean value to indicate whether any matches were found or not
Firstly.
There is an old saying in computer science: "There are two hard problems in CS: naming, cache invalidation, and off-by-one errors."
Don't use for loops unless you absolutely have to; the tiny perf gain you get isn't worth the debug time (assuming there is any perf gain in this version of .NET).
Secondly
new Thread. Don't do that. Creating a thread is extremely slow and takes up a lot of resources, especially for a short-lived job like this one. On top of that, there is overhead in passing data between threads. Use ThreadPool.QueueUserWorkItem(WaitCallback) instead, if you MUST run short-lived work on separate threads.
However, as I previously alluded to, threads are an abstraction over CPU resources. I honestly doubt you are CPU-bound; threading is going to cost you more than you think. Stick to a single thread. You ARE, however, I/O-bound, so make full use of asynchronous I/O.
public async Task<IEnumerable<FileModel>> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    var results = new List<FileModel>();
    foreach (var folderPath in foldersPath)
    foreach (var file in BrowseFiles(folderPath))
    {
        if (!someList.Any(x => file.Name.IndexOf(x, StringComparison.InvariantCultureIgnoreCase) >= 0))
            continue;
        using (var fileStream = new FileStream(file.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
        using (var bufferedStream = new BufferedStream(fileStream))
        using (var streamReader = new StreamReader(bufferedStream))
            ...
        results.Add(new FileModel());
    }
    return results;
}
I am extremely new to C#, so excuse me if I don't explain this well.
I'm retrieving images from my computer's camera and, along with displaying them in a PictureBox, I'm encoding them as JPEGs and sending them to a shared dictionary. Here's my code:
void CurrentCamera_OnImageCaptured(object sender, CameraEventArgs e)
{
    this.pictureBoxMe.Image = e.Image;
    if (myName != "" && Form1.PicSent)
    {
        SendPic sendP = new SendPic((Image)e.Image.Clone());
        new System.Threading.Thread(new System.Threading.ThreadStart(sendP.send)).Start();
    }
}

public class SendPic
{
    Image im;

    public SendPic(Image im)
    {
        this.im = im;
    }

    public void send()
    {
        Form1.PicSent = false;
        var memoryStream = new MemoryStream();
        im.Save(memoryStream, ImageFormat.Jpeg);
        var byteArray = memoryStream.ToArray();
        Form1.sd["/" + myName + "/video"] = byteArray;
        memoryStream.Close();
        Form1.PicSent = true;
    }
}
The problem is that I'm getting an "Object is currently in use elsewhere." error on the line SendPic sendP = new SendPic((Image)e.Image.Clone());.
Based on other forum posts I've found, I already changed the code so that the image is passed to the thread, and so that it's a clone. However, I'm still getting the same error (though it lasts longer before crashing now).
I read something about locking. How do I implement that in this case? Or is there something else I need to do?
Thanks.
It behaves as though the OnImageCaptured method runs on a thread. Which isn't unlikely for camera interfaces. Set a breakpoint and use the debugger's Debug + Windows + Threads window to see what thread is running this code.
The failure mode is then that the UI thread is accessing the image to paint the picture box simultaneously with this worker thread calling Clone(). GDI+ does not permit two threads to access the same image object at the same time. It would indeed be flaky; there's no telling at what exact moment in time the UI thread starts painting. PicSent is another accident waiting to happen.
One thing that catches my eye is that the SendPic class accesses your dictionary asynchronously (the line: Form1.sd["/" + myName + "/video"] = byteArray;).
However, dictionaries and hashtables are not guaranteed to be thread-safe for write operations. You should be safe if you make the code accessing the dictionary to be thread-safe. A simple lock would be a way to start.
Sort of like this:
public class SendPic
{
    // static: a new SendPic is created per frame, so every instance must share the same lock
    private static readonly object lockobj = new object();

    // .... whatever other code ...

    public void send()
    {
        // .... whatever previous code ...
        lock (lockobj)
        {
            // assuming that the sd dictionary already has the relevant key
            // otherwise you'd need to do a Form1.sd.Add(key, byteArray)
            Form1.sd["/" + myName + "/video"] = byteArray;
        }
        // .... whatever following code ...
    }
}
Quote from MSDN:
Thread Safety: A Dictionary<TKey, TValue> can support multiple readers concurrently, as long as the collection is not modified. Even so, enumerating through a collection is intrinsically not a thread-safe procedure. In the rare case where an enumeration contends with write accesses, the collection must be locked during the entire enumeration. To allow the collection to be accessed by multiple threads for reading and writing, you must implement your own synchronization.
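On .NET 4 and later, another option is to replace the plain dictionary with a ConcurrentDictionary, whose indexer adds or overwrites entries atomically, so no explicit lock is needed for this write pattern. A minimal sketch (the field name sd and the key format are taken from the question; declaring it on Form1 is an assumption):

```csharp
using System.Collections.Concurrent;

public partial class Form1
{
    // Thread-safe replacement for the shared dictionary in the question.
    public static ConcurrentDictionary<string, byte[]> sd =
        new ConcurrentDictionary<string, byte[]>();
}

// In SendPic.send(), the existing line then works unchanged and safely:
// Form1.sd["/" + myName + "/video"] = byteArray;
```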