Parallel.ForEach use case - c#

I am wondering if I should be using Parallel.ForEach() for my case. For a bit of context: I am developing a small music player using the NAudio library. I want to use Parallel.ForEach() in a factory method to quickly access .mp3 files and create TrackModel objects to represent them (about 400). The code looks like this:
public static List<TrackModel> CreateTracks(string[] files)
{
    // Guard Clause
    if (files == null || files.Length == 0) throw new ArgumentException();

    var output = new List<TrackModel>();
    TrackModel track;

    Parallel.ForEach(files, file =>
    {
        using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
        {
            track = new TrackModel()
            {
                FilePath = file,
                Title = File.Create(file).Tag.Title,
                Artist = File.Create(file).Tag.FirstPerformer,
                TrackLength = musicPlayer.GetLengthInSeconds(),
            };
        }

        lock (output)
        {
            output.Add(track);
        }
    });

    return output;
}
Note: I use lock to prevent multiple threads from adding elements to the list at the same time.
My question is the following: Should I be using Parallel.ForEach() in this situation or am I better off writing a normal foreach loop? Is this the right approach to achieve better performance and should I be using multithreading in combination with file access in the first place?

You're better off avoiding both a foreach and Parallel.ForEach. In this case AsParallel() is your friend.
Try this:
public static List<TrackModel> CreateTracks(string[] files)
{
    if (files == null || files.Length == 0) throw new ArgumentException();

    return
        files
            .AsParallel()
            .AsOrdered()
            .WithDegreeOfParallelism(2)
            .Select(file =>
            {
                using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
                {
                    return new TrackModel()
                    {
                        FilePath = file,
                        Title = File.Create(file).Tag.Title,
                        Artist = File.Create(file).Tag.FirstPerformer,
                        TrackLength = musicPlayer.GetLengthInSeconds(),
                    };
                }
            })
            .ToList();
}
This handles all the parallel logic and the locking behind the scenes for you.

Combining the suggestions from the comments and answers and adapting them to my code, I was able to solve my issue with the following code:
public List<TrackModel> CreateTracks(string[] files)
{
    var output = files
        .AsParallel()
        .Select(file =>
        {
            using MusicPlayer musicPlayer = new MusicPlayer(file, 0f);
            using File musicFile = File.Create(file);

            return new TrackModel()
            {
                FilePath = file,
                Title = musicFile.Tag.Title,
                Artist = musicFile.Tag.FirstPerformer,
                Length = musicPlayer.GetLengthInSeconds(),
            };
        })
        .ToList();

    return output;
}
Using AsParallel() helped significantly decrease the loading time, which is what I was looking for. I will mark Enigmativity's answer as correct because of the clever idea. Initially it threw a weird AggregateException, but I was able to solve it by saving the output in a variable and then returning it.
Credit to marsze as well, whose suggestion helped me fix a memory leak in the application and shave off 16MB of memory (!).

Related

Update console text from multiple threads not working

I am executing/processing very big files in multi-threaded mode in a console app.
When I don't update/write to the console from the threads, the whole process takes about 1 minute for testing.
But when I try to update/write to the console from the threads to show progress, the process gets stuck and never finishes (I have waited several minutes, even hours). The console text/window is also not updated as it should be.
Update-1: As requested by a few kind responders, I added minimal code that can reproduce the same error/problem.
Here is the code from the thread function/method:
using System;
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
namespace Large_Text_To_Small_Text
{
class Program
{
static string sAppPath;
static ArrayList objThreadList;
private struct ThreadFileInfo
{
public string sBaseDir, sRFile;
public int iCurFile, iTFile;
public bool bIncludesExtension;
}
static void Main(string[] args)
{
string sFileDir;
DateTime dtStart;
Console.Clear();
sAppPath = Path.GetDirectoryName(System.Reflection.Assembly.GetExecutingAssembly().Location);
sFileDir = @"d:\Test";
dtStart = DateTime.Now;
///process in multi threaded mode
List<string> lFiles;
lFiles = new List<string>();
lFiles.AddRange(Directory.GetFiles(sFileDir, "*.*", SearchOption.AllDirectories));
if (Directory.Exists(sFileDir + "-Processed") == true)
{
Directory.Delete(sFileDir + "-Processed", true);
}
Directory.CreateDirectory(sFileDir + "-Processed");
sPrepareThreading();
for (int iFLoop = 0; iFLoop < lFiles.Count; iFLoop++)
{
//Console.WriteLine(string.Format("{0}/{1}", (iFLoop + 1), lFiles.Count));
sThreadProcessFile(sFileDir + "-Processed", lFiles[iFLoop], (iFLoop + 1), lFiles.Count, Convert.ToBoolean(args[3]));
}
sFinishThreading();
Console.WriteLine(DateTime.Now.Subtract(dtStart).ToString());
Console.ReadKey();
return;
}
private static void sProcSO(object oThreadInfo)
{
var inputLines = new BlockingCollection<string>();
long lACounter, lCCounter;
ThreadFileInfo oProcInfo;
lACounter = 0;
lCCounter = 0;
oProcInfo = (ThreadFileInfo)oThreadInfo;
var readLines = Task.Factory.StartNew(() =>
{
foreach (var line in File.ReadLines(oProcInfo.sRFile))
{
inputLines.Add(line);
lACounter++;
}
inputLines.CompleteAdding();
});
var processLines = Task.Factory.StartNew(() =>
{
Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
{
lCCounter++;
/*
some process goes here
*/
/*If i Comment out these lines program get stuck!*/
//Console.SetCursorPosition(0, oProcInfo.iCurFile);
//Console.Write(oProcInfo.iCurFile + " = " + lCCounter.ToString());
});
});
Task.WaitAll(readLines, processLines);
}
private static void sPrepareThreading()
{
objThreadList = new ArrayList();
for (var iTLoop = 0; iTLoop < 5; iTLoop++)
{
objThreadList.Add(null);
}
}
private static void sThreadProcessFile(string sBaseDir, string sRFile, int iCurFile, int iTFile, bool bIncludesExtension)
{
Boolean bMatched;
Thread oCurThread;
ThreadFileInfo oProcInfo;
Salma_RecheckThread:
bMatched = false;
for (int iTLoop = 0; iTLoop < 5; iTLoop++)
{
if (objThreadList[iTLoop] == null || ((System.Threading.Thread)(objThreadList[iTLoop])).IsAlive == false)
{
oProcInfo = new ThreadFileInfo()
{
sBaseDir = sBaseDir,
sRFile = sRFile,
iCurFile = iCurFile,
iTFile = iTFile,
bIncludesExtension = bIncludesExtension
};
oCurThread = new Thread(sProcSO);
oCurThread.IsBackground = true;
oCurThread.Start(oProcInfo);
objThreadList[iTLoop] = oCurThread;
bMatched = true;
break;
}
}
if (bMatched == false)
{
System.Threading.Thread.Sleep(250);
goto Salma_RecheckThread;
}
}
private static void sFinishThreading()
{
Boolean bRunning;
Salma_RecheckThread:
bRunning = false;
for (int iTLoop = 0; iTLoop < 5; iTLoop++)
{
if (objThreadList[iTLoop] != null && ((System.Threading.Thread)(objThreadList[iTLoop])).IsAlive == true)
{
bRunning = true;
}
}
if (bRunning == true)
{
System.Threading.Thread.Sleep(250);
goto Salma_RecheckThread;
}
}
}
}
And here is the screenshot of what happens when I try to update the console window:
You see? Neither the line number (oProcInfo.iCurFile) nor the whole line is correct!
It should be like this:
1 = xxxxx
2 = xxxxx
3 = xxxxx
4 = xxxxx
5 = xxxxx
Update-1: To test, just change sFileDir to any folder that has some big text files, or if you like you can download some big text files from the following link:
https://wetransfer.com/downloads/8aecfe05bb44e35582fc338f623ad43b20210602005845/bcdbb5
Am I missing any function/method to update console text from threads?
I can't reproduce it. In my tests the process always runs to completion, without getting stuck. The output is all over the place though, because the two lines below are not synchronized:
Console.SetCursorPosition(0, oProcInfo.iCurFile);
Console.Write(oProcInfo.iCurFile + " = " + lCCounter.ToString());
Each of the many threads involved in the computation invokes these two statements concurrently with the other threads. This makes it possible for one thread to preempt another and move the cursor before the first thread has had the chance to write to the console. To solve this problem you must add proper synchronization, and the easiest way to do it is to use the lock statement:
class Program
{
static object _locker = new object();
And in the sProcSO method:
lock (_locker)
{
Console.SetCursorPosition(0, oProcInfo.iCurFile);
Console.Write(oProcInfo.iCurFile + " = " + lCCounter.ToString());
}
If you want to know more about thread synchronization, I recommend this online resource: Threading in C# - Part 2: Basic Synchronization
If you would like to hear my opinion about the code in the question, and you don't mind receiving criticism, my honest opinion is that the code is so riddled with problems that the best course of action would be to throw it away and start from scratch. Use of archaic data structures (ArrayList???), liberal use of casting from object to specific types, liberal use of the goto statement, and use of Hungarian notation in public type members all make the code difficult to follow and easy for bugs to creep into. I found it particularly problematic that each file is processed concurrently with all other files using a dedicated thread, and then each dedicated thread uses a ThreadPool thread (Task.Factory.StartNew) to start a parallel loop (Parallel.ForEach) with unconfigured MaxDegreeOfParallelism. This setup ensures that the ThreadPool will be saturated so badly that there is no hope the availability of threads will ever match the demand. Most probably it will also result in highly inefficient use of the storage device, especially if the hardware is a classic hard disk.
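To illustrate the last point, here is a minimal sketch (my addition, not code from the question) of bounding the inner loop with ParallelOptions, assuming the rest of the method stays as posted:
var options = new ParallelOptions { MaxDegreeOfParallelism = 2 }; // example value only
Parallel.ForEach(inputLines.GetConsumingEnumerable(), options, line =>
{
    // ... the per-line processing goes here ...
});
This alone does not fix the one-thread-per-file design, but it at least keeps each parallel loop from trying to claim every available ThreadPool thread.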
Your freezing problem may not be C# or code related:
on the top left of your console window, right-click the icon
select Properties
remove the Quick Edit Mode and Insert Mode options
You can google that feature, but it essentially manifests as the problem you describe above.
The formatting problem, on the other hand, does seem to be code related. Here you need to create a class that serializes writes to the console window from a single thread. A consumer/producer pattern would work; you could use a BlockingCollection to implement this quite easily, as in the sketch below.
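A rough sketch of that idea (untested, with made-up names such as ConsoleReporter and Report): worker threads only enqueue messages, and one dedicated thread owns the console.
// requires System.Collections.Concurrent and System.Threading
public class ConsoleReporter : IDisposable
{
    private readonly BlockingCollection<KeyValuePair<int, string>> _queue =
        new BlockingCollection<KeyValuePair<int, string>>();
    private readonly Thread _writerThread;

    public ConsoleReporter()
    {
        // a single thread serializes all console writes
        _writerThread = new Thread(() =>
        {
            foreach (var message in _queue.GetConsumingEnumerable())
            {
                Console.SetCursorPosition(0, message.Key);
                Console.Write(message.Key + " = " + message.Value);
            }
        });
        _writerThread.IsBackground = true;
        _writerThread.Start();
    }

    // safe to call from any worker thread
    public void Report(int line, string text)
    {
        _queue.Add(new KeyValuePair<int, string>(line, text));
    }

    public void Dispose()
    {
        _queue.CompleteAdding();
        _writerThread.Join();
    }
}
The workers would then call Report(oProcInfo.iCurFile, lCCounter.ToString()) instead of touching the Console directly.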

Proper way to execute a long task so as not to affect frame rate in Unity

I have a task of deserialization that happens when the game starts. I need to basically pull some images from the persistent path and create a bunch of assets from them. The images can be large (10-50MB) and there can be lots of them, so this can freeze my frame on a single task forever. I tried using Coroutines, but I might be misunderstanding how to use them properly.
Since Coroutines are really single-threaded, they are not exactly going to let me finish creating these assets while the UI is running. I also can't just create a new thread to do this work and jump back on the main thread with a callback when done, because Unity won't let me access its API from another thread (I am creating Texture2D, Button(), parenting objects, etc.).
How the hell do I go about this? Do I really need to create a massive IEnumerable function and put a bunch of yield return null every other line of code? That seems a little excessive. Is there a way to call a time-consuming method that requires access to the main thread in Unity, and have Unity spread it across as many frames as needed so that it doesn't bog down the UI?
Here's an example of a Deserialize method:
public IEnumerator Deserialize()
{
// (Konrad) Deserialize Images
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
try
{
var images = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json")));
if (images != null)
{
foreach (var i in images)
{
if (!File.Exists(Path.Combine(dataPath, i.Value.Name))) continue;
var bytes = File.ReadAllBytes(Path.Combine(dataPath, i.Value.Name));
var texture = new Texture2D(2, 2);
if (bytes.Length <= 0) continue;
if (!texture.LoadImage(bytes)) continue;
i.Value.Texture = texture;
}
}
Images = images;
}
catch (Exception e)
{
Debug.Log("Failed to deserialize Images: " + e.Message);
}
}
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
try
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
catch (Exception e)
{
Debug.Log("Failed to deserialize Projects: " + e.Message);
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
So this would take like 10 seconds to complete. It needs to deserialize images from the drive, create button assets, set a bunch of parenting relationships, etc. I would appreciate any ideas.
Ps. I haven't updated to the experimental .NET 4.6, so I am still on .NET 3.5.
OK, reading your comments below, I figured I could give this a try. I put the IO operations on a different thread. They don't need the Unity API, so I can finish them there, store the byte[], and load the bytes into the Texture when done. Here's a try:
public IEnumerator Deserialize()
{
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
var bytes = new Dictionary<Item, byte[]>();
var done = false;
new Thread(() => {
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
var items = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json"))).Values;
foreach (var i in items)
{
if (!File.Exists(Path.Combine(dataPath, i.Name))) continue;
var b = File.ReadAllBytes(Path.Combine(dataPath, i.Name));
if (b.Length <= 0) continue;
bytes.Add(i, b);
}
}
done = true;
}).Start();
while (!done)
{
yield return null;
}
var result = new Dictionary<string, Item>();
foreach (var b in bytes)
{
var texture = new Texture2D(2, 2);
if (!texture.LoadImage(b.Value)) continue;
b.Key.Texture = texture;
result.Add(b.Key.Id, b.Key);
}
Debug.Log("Finished loading images!");
Images = result;
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
I have to concede that it helps a little, but it's still not great. Looking at the profiler, I am getting a pretty big stall right out of the gate:
That's my Deserialize routine causing it:
Any way to work around this?
There are two main ways to spread work across multiple frames:
multithreading and
coroutines
Multithreading has the limitation you pointed out, so a coroutine seems appropriate.
The key thing to remember with coroutines is that they will not allow the next frame to begin until a yield statement is run. The other thing to remember is that if you yield too frequently, there is a cap on how many times you will hit a yield return per second (based on your framerate), so you don't want to yield too early, or it will take too much real time for the work to finish.
What you want is a frequent opportunity for the function to yield, but you don't want the opportunity to always be taken. The best way to do this is to use the Stopwatch class (be sure to use the full name or add a "using" statement at the top of your file) or something similar.
Here is an example modification of your second code snippet.
public IEnumerator Deserialize()
{
var dataPath = Path.Combine(Application.persistentDataPath, "Images");
var bytes = new Dictionary<Item, byte[]>();
var done = false;
new Thread(() => {
if (File.Exists(Path.Combine(dataPath, "images.json")))
{
var items = JsonConvert.DeserializeObject<Dictionary<string, Item>>(File.ReadAllText(Path.Combine(dataPath, "images.json"))).Values;
foreach (var i in items)
{
if (!File.Exists(Path.Combine(dataPath, i.Name))) continue;
var b = File.ReadAllBytes(Path.Combine(dataPath, i.Name));
if (b.Length <= 0) continue;
bytes.Add(i, b);
}
}
done = true;
}).Start();
while (!done)
{
yield return null;
}
// MOD: added stopwatch and started
System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();
int MAX_MILLIS = 5; // tweak this to prevent frame rate reduction
watch.Start();
var result = new Dictionary<string, Item>();
foreach (var b in bytes)
{
// MOD: Check if enough time has passed since last yield
if (watch.ElapsedMilliseconds > MAX_MILLIS)
{
watch.Reset();
yield return null;
watch.Start();
}
var texture = new Texture2D(2, 2);
if (!texture.LoadImage(b.Value)) continue;
b.Key.Texture = texture;
result.Add(b.Key.Id, b.Key);
}
Debug.Log("Finished loading images!");
Images = result;
// (Konrad) Deserialize Projects.
if (Projects == null) Projects = new List<Project>();
if (File.Exists(Path.Combine(dataPath, "projects.json")))
{
var projects = JsonConvert.DeserializeObject<List<Project>>(File.ReadAllText(Path.Combine(dataPath, "projects.json")));
if (projects != null)
{
foreach (var p in projects)
{
AddProject(p);
foreach (var f in p.Folders)
{
AddFolder(f, true);
foreach (var i in f.Items)
{
// MOD: check if enough time has passed since the last yield
if (watch.ElapsedMilliseconds > MAX_MILLIS)
{
watch.Reset();
yield return null;
watch.Start();
}
var image = Images != null && Images.ContainsKey(i.ParentImageId)
? Images[i.ParentImageId]
: null;
if (image == null) continue;
i.ThumbnailTexture = image.Texture;
// (Konrad) Call methods that would normally be called by the event system
// as content is getting downloaded.
AddItemThumbnail(i, true); // creates new button
UpdateImageDescription(i, image); // sets button description
AddItemContent(i, image); // sets item Material
}
}
}
}
}
if (Images == null) Images = new Dictionary<string, Item>();
yield return true;
}
Edit: Further notes for those wanting more general advice...
The two main systems are multithreading and coroutines. Their pros and cons are:
Coroutine Advantages:
Little setup.
No data-sharing or locking concerns.
Can perform any unity main-thread operation.
Multithreading Advantages:
Doesn't take time away from the main thread, leaving you as much CPU power as possible
Can utilize a full CPU core rather than whatever is left over from the main thread.
To sum up, coroutines are best for quick-and-dirty solutions or when modifications to Unity objects need to be made. However, if large amounts of processing need to be performed, it's best to offload as much as possible to another thread. Very few devices have fewer than two cores these days (safe to say none that are used to play games?).
In this case, a hybrid solution was possible, offloading some work to separate thread and keeping the unity dependent work on the main thread. This is a powerful solution, and coroutines can make it easy.
As an example, I made a voxel engine which offloaded the running of the algorithm onto a separate thread and then created the actual meshes on the main thread, allowing for a 50-70% reduction in how long it took to generate meshes, and perhaps more importantly reducing the impact on the game's end performance. It did this with queues of jobs that were passed back and forth between the threads.

File search optimisation in C# using Parallel

I have a folder with many CSV files in it, which are around 3MB each in size.
example of content of one CSV:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a console app written in C# that searches each CSV file for certain strings.
The strings to search for are in a txt file.
example of search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
Then I call this method to search for each search string in all files in the given folder:
private static object _lockObject = new object();
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
var result = new List<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file);
if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
{
lock (_lockObject)
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
{
result.Add(searchFound);
}
}
}
});
return result;
}
The question now is: can I improve the performance of this operation in any way?
I have around 100GB of files to search through.
It takes approximately 1 hour to search all ~30,000 files with around 25 search strings, on an SSD disk and a good i7 CPU.
Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion that you wrote, and this is what performed best for me (removing ToLower from the LINQ yielded the biggest performance boost; search time went from 1 hour to 16 minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
var result = new BlockingCollection<string>();
var files = Directory.EnumerateFiles(searchFolder);
Parallel.ForEach(files, (file) =>
{
var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
{
foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
{
result.Add(searchFound);
}
}
});
return result;
}
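If you still need case-insensitive matching, an alternative I have not benchmarked is an ordinal-ignore-case IndexOf instead of ToLower + Contains, which avoids allocating a lowered copy of every line:
if (fileContent.Any(x => searchList.Any(y => x.IndexOf(y, StringComparison.OrdinalIgnoreCase) >= 0)))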
Something like Lucene could probably give you a performance boost: why don't you index your data so you can search it easily?
Take a look at Lucene .NET
You'll avoid searching the data sequentially. In addition, you can model many indexes based on the same data to be able to get to certain results at light speed.
Try to:
Do .ToLower once per line instead of calling .ToLower for each element in searchList.
Do one scan of the file instead of two passes (Any and then Where). Get the list and then add it with a lock if anything was found. In your sample you waste time on two passes and block all threads while searching and adding.
If you know the position to look at (in your sample you do), you can scan from that position instead of searching the whole string.
Use the producer/consumer pattern, for example with BlockingCollection<T>, so there is no need to use lock.
If you need to search strictly within a field, build a HashSet from searchList and do searchHash.Contains(fieldValue); this will speed up the process dramatically.
So here is a sample (not tested):
using(var searcher = new FilesSearcher(
searchFolder: "path",
searchList: toLookFor))
{
searcher.SearchContentListInFiles();
}
here is the searcher:
public class FilesSearcher : IDisposable
{
private readonly BlockingCollection<string[]> filesInMemory;
private readonly string searchFolder;
private readonly string[] searchList;
public FilesSearcher(string searchFolder, string[] searchList)
{
// reader thread stores lines here
this.filesInMemory = new BlockingCollection<string[]>(
// limit count of files stored in memory, so if processing threads not so fast, reader will take a break and wait
boundedCapacity: 100);
this.searchFolder = searchFolder;
this.searchList = searchList;
}
public IEnumerable<string> SearchContentListInFiles()
{
// start read,
// we don't need many threads here; probably 1 thread per storage device is the optimum
var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);
// at least one processing thread, because the reader thread is IO bound
var taskCount = Math.Max(1, Environment.ProcessorCount - 1);
// start search threads
var tasks = Enumerable
.Range(0, taskCount)
.Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
.ToArray();
// await for results
Task.WaitAll(tasks);
// combine results
return tasks
.SelectMany(t => t.Result)
.ToArray();
}
private string[] Search()
{
// if you always get unique results use list
var results = new List<string>();
//var results = new HashSet<string>();
foreach (var content in this.filesInMemory.GetConsumingEnumerable())
{
// one pass by a file
var currentFileMatches = content
.Where(sourceLine =>
{
// to lower once per line; we don't need to make a lowered copy of the file
var lower = sourceLine.ToLower();
return this.searchList.Any(lower.Contains);
});
// store current file matches
foreach (var currentMatch in currentFileMatches)
{
results.Add(currentMatch);
}
}
return results.ToArray();
}
private void ReadFiles()
{
var files = Directory.EnumerateFiles(this.searchFolder);
try
{
foreach (var file in files)
{
var fileContent = File.ReadLines(file);
// add file, or wait if filesInMemory are full
this.filesInMemory.Add(fileContent.ToArray());
}
}
finally
{
this.filesInMemory.CompleteAdding();
}
}
public void Dispose()
{
if (filesInMemory != null)
filesInMemory.Dispose();
}
}
This operation is first and foremost disk bound. Disk-bound operations do not benefit from multithreading. Indeed, all you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.
If you had loaded all the files into memory first, your operation would be memory bound. And memory-bound operations do not benefit from multithreading either (usually; it gets into the details of CPU and memory architecture here).
While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations. Nothing in there looks remotely CPU bound. So multithreading that search (one thread per file) will not make it faster, and will indeed likely make it slower due to all the thread-switching and synchronization overhead.

tesseract multithreading c#

I have code for Tesseract that runs in a single instance. How can I parallelize the code so that it can run on quad-core or 8-core processor systems? Here is my code block. Thanks in advance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
foreach (string ab in files)
{
using (var pages = Pix.LoadFromFile(ab))
{
using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
{
string text = page.GetText();
OCRedText.Append(text);
}
}
}
}
This has worked for me:
static IEnumerable<string> Ocr(string directory, string sep)
=> Directory.GetFiles(directory, sep)
.AsParallel()
.Select(x =>
{
using var engine = new TesseractEngine(tessdata, "eng", EngineMode.Default);
using var img = Pix.LoadFromFile(x);
using var page = engine.Process(img);
return page.GetText();
}).ToList();
I am no expert on the matter of parallelization, but this function OCRs 8 TIFFs in 12 seconds.
However, it creates an engine for every TIFF. I have not been able to call engine.Process concurrently.
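One way around that, which I have not benchmarked (so treat it as a sketch rather than a tested solution), is to create one engine per worker thread with ThreadLocal<TesseractEngine>, reusing engines across files instead of creating one per file:
static IEnumerable<string> Ocr(string directory, string sep)
{
    // one engine per thread; same tessdata path and language assumptions as above
    using (var engines = new ThreadLocal<TesseractEngine>(
        () => new TesseractEngine(tessdata, "eng", EngineMode.Default),
        trackAllValues: true))
    {
        var texts = Directory.GetFiles(directory, sep)
            .AsParallel()
            .Select(x =>
            {
                using var img = Pix.LoadFromFile(x);
                using var page = engines.Value.Process(img);
                return page.GetText();
            })
            .ToList();

        // each thread's engine is disposed once all files are processed
        foreach (var engine in engines.Values) engine.Dispose();
        return texts;
    }
}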
The simplest way to run this code in parallel is to use PLINQ. Calling AsParallel() on the enumeration will automatically run the query that follows it (.Select(...)) simultaneously on all available CPU cores.
It is crucial to run only thread-safe code in parallel. Assuming TesseractEngine is thread-safe (as you suggest in a comment; I didn't verify it myself), as well as Pix.LoadFromFile(), then the only problematic part could be OCRedText.Append(). It is not clear from the code what OCRedText is, so I assume it is a StringBuilder or a List and therefore not thread-safe. So I removed this part from the code that will run in parallel and process it later in a single thread; since the .Append() method is likely to run fast, this shouldn't have a significant adverse effect on overall performance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
var texts = files.AsParallel().Select(ab =>
{
using (var pages = Pix.LoadFromFile(ab))
{
using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
{
return page.GetText();
}
}
});
foreach (string text in texts)
{
OCRedText.Append(text);
}
}

Should I use a ConcurrentQueue this way or individual threads

I'm doing what amounts to a glorified mail merge and then file conversion to PDF... Based on .NET 4.5, I see a couple of ways I can do the threading. The one using a thread-safe queue seems interesting (Plan A), but I can see a potential problem. What do you think? I'll try to keep it short, but include what is needed.
This works on the assumption that it will take far more time to do the database processing than the PDF conversion.
In both cases, the database processing for each file is done in its own thread/task, but the PDF conversion could be done in many single threads/tasks (Plan B) or in a single long-running thread (Plan A). It is that PDF conversion I am wondering about. It is all in a try/catch statement, but that thread must not fail or everything fails (Plan A). Do you think that is a good idea? Any suggestions would be appreciated.
/* A class to process a file: */
public class c_FileToConvert
{
public string InFileName { get; set; }
public int FileProcessingState { get; set; }
public string ErrorMessage { get; set; }
public List<string> listData = null;
c_FileToConvert(string inFileName)
{
InFileName = inFileName;
FileProcessingState = 0;
ErrorMessage = ""; // yah, yah, yah - String.Empty
listData = new List<string>();
}
public void doDbProcessing()
{
// get the data from database and put strings in this.listData
DAL.getDataForFile(this.InFileName, this.ErrorMessage); // static function
if(this.ErrorMessage != "")
this.FileProcessingState = -1; //fatal error
else // Open file and append strings to it
{
foreach(string s in this.listData)
...
FileProcessingState = 1; // enum DB_WORK_COMPLETE ...
}
}
public void doPDFProcessing()
{
PDFConverter cPDFConverter = new PDFConverter();
cPDFConverter.convertToPDF(InFileName, InFileName + ".PDF");
FileProcessingState = 2; // enum PDF_WORK_COMPLETE ...
}
}
/*** These only for Plan A ***/
public ConcurrentQueue<c_FileToConvert> ConncurrentQueueFiles = new ConcurrentQueue<c_FileToConvert>();
public bool bProcessPDFs;
public void doProcessing() // This is the main thread of the Windows Service
{
List<c_FileToConvert> listcFileToConvert = new List<c_FileToConvert>();
/*** Only for Plan A ***/
bProcessPDFs = true;
Task task1 = new Task(new Action(startProcessingPDFs)); // Start it and forget it
task1.Start();
while(1 == 1)
{
List<string> listFileNamesToProcess = new List<string>();
DAL.getFileNamesToProcessFromDb(listFileNamesToProcess);
foreach(string s in listFileNamesToProcess)
{
c_FileToConvert cFileToConvert = new c_FileToConvert(s);
listcFileToConvert.Add(cFileToConvert);
}
foreach(c_FileToConvert c in listcFileToConvert)
if(c.FileProcessingState == 0)
new Thread(c.doDbProcessing).Start();
/** This is Plan A - throw it on single long running PDF processing thread **/
foreach(c_FileToConvert c in listcFileToConvert)
if(c.FileProcessingState == 1)
ConncurrentQueueFiles.Enqueue(c);
/*** This is Plan B - traditional thread for each file conversion ***/
foreach(c_FileToConvert c in listcFileToConvert)
if(c.FileProcessingState == 1)
new Thread(c.doPDFProcessing).Start();
// iterate backwards so RemoveAt does not shift items that have not been checked yet
for(int iCount = listcFileToConvert.Count - 1; iCount >= 0; iCount--)
{
c_FileToConvert c = listcFileToConvert[iCount];
if((c.FileProcessingState == -1) || (c.FileProcessingState == 2))
{
DAL.updateProcessingState(c.FileProcessingState);
listcFileToConvert.RemoveAt(iCount);
}
}
Thread.Sleep(1000);
}
}
public void startProcessingPDFs() /*** Only for Plan A ***/
{
while (bProcessPDFs == true)
{
if (ConncurrentQueueFiles.IsEmpty == false)
{
c_FileToConvert cFileToConvert = null;
try
{
if (ConncurrentQueueFiles.TryDequeue(out cFileToConvert) == true)
cFileToConvert.doPDFProcessing();
}
catch(Exception e)
{
if (cFileToConvert != null)
{
cFileToConvert.FileProcessingState = -1;
cFileToConvert.ErrorMessage = e.Message;
}
}
}
}
}
Plan A seems like a nice solution, but what if the Task fails somehow? Yes, the PDF conversion can be done with individual threads, but I want to reserve those for the database processing.
This was written in a text editor as the simplest code I could manage, so there may be something off, but I think I got the idea across.
How many files are you working with? 10? 100,000? If the number is very large, using 1 thread to run the DB queries for each file is not a good idea.
Threads are a very low-level control flow construct, and I advise you try to avoid a lot of messy and detailed thread spawning, joining, synchronizing, etc. etc. in your application code. Keep it stupidly simple if you can.
How about this: put the data you need for each file in a thread-safe queue. Create another thread-safe queue for results. Spawn some number of threads which repeatedly pull items from the input queue, run the queries, convert to PDF, then push the output into the output queue. The threads should share absolutely nothing but the input and output queues.
You can pick any number of worker threads which you like, or experiment to see what a good number is. Don't create 1 thread for each file -- just pick a number which allows for good CPU and disk utilization.
OR, if your language/libraries have a parallel map operator, use that. It will save you a lot of messing around.
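A sketch of that worker-pool idea in C#, where fileNamesToProcess, RunQueries, and ConvertToPdf are placeholders for your own DAL and PDFConverter calls (an illustration of the shape, not drop-in code):
var inputQueue = new BlockingCollection<string>();   // file names to process
var outputQueue = new BlockingCollection<string>();  // finished PDF paths
int workerCount = 4;                                 // experiment with this number

var workers = Enumerable.Range(0, workerCount)
    .Select(_ => Task.Run(() =>
    {
        foreach (var fileName in inputQueue.GetConsumingEnumerable())
        {
            var data = RunQueries(fileName);            // database work
            var pdfPath = ConvertToPdf(fileName, data); // PDF conversion
            outputQueue.Add(pdfPath);
        }
    }))
    .ToArray();

foreach (var fileName in fileNamesToProcess)
    inputQueue.Add(fileName);
inputQueue.CompleteAdding();

Task.WaitAll(workers);
outputQueue.CompleteAdding();
Each worker shares nothing with the others except the two queues, and the number of workers is independent of the number of files.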
