Tesseract multithreading in C#

I have code that runs Tesseract in a single instance. How can I parallelize it so that it takes advantage of quad-core or 8-core systems? Here is my code block. Thanks in advance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    foreach (string ab in files)
    {
        using (var pages = Pix.LoadFromFile(ab))
        {
            using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
            {
                string text = page.GetText();
                OCRedText.Append(text);
            }
        }
    }
}

This has worked for me:
static IEnumerable<string> Ocr(string directory, string sep)
    => Directory.GetFiles(directory, sep)
        .AsParallel()
        .Select(x =>
        {
            using var engine = new TesseractEngine(tessdata, "eng", EngineMode.Default);
            using var img = Pix.LoadFromFile(x);
            using var page = engine.Process(img);
            return page.GetText();
        }).ToList();
I am no expert on parallelization, but this function OCRs 8 TIFFs in 12 seconds.
However, it creates an engine for every TIFF; I have not been able to call engine.Process concurrently.
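If constructing an engine per file turns out to be costly, one possible variation is a small engine pool, so each worker borrows an idle engine instead of creating a new one. This is only an untested sketch (it assumes System.Collections.Concurrent, System.IO, System.Linq and the Tesseract namespaces are imported, and that an engine may be reused sequentially across files, which matches the code above):
static IEnumerable<string> OcrWithEnginePool(string directory, string pattern, string tessdata)
{
    var pool = new ConcurrentBag<TesseractEngine>();
    try
    {
        return Directory.GetFiles(directory, pattern)
            .AsParallel()
            .Select(path =>
            {
                // borrow an idle engine, or create one for this worker
                if (!pool.TryTake(out var engine))
                    engine = new TesseractEngine(tessdata, "eng", EngineMode.Default);
                try
                {
                    using var img = Pix.LoadFromFile(path);
                    using var page = engine.Process(img);
                    return page.GetText();
                }
                finally
                {
                    pool.Add(engine); // return the engine for reuse
                }
            })
            .ToList();
    }
    finally
    {
        foreach (var engine in pool)
            engine.Dispose();
    }
}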

The simplest way to run this code in parallel is PLINQ. Calling AsParallel() on the enumeration makes the query that follows it (.Select(...)) run concurrently on all available CPU cores.
It is crucial to run only thread-safe code in parallel. Assuming TesseractEngine is thread-safe (as you suggest in a comment; I did not verify it myself), as is Pix.LoadFromFile(), the only problematic part would be OCRedText.Append(). It is not clear from the code what OCRedText is, so I assume it is a StringBuilder or a List, and therefore not thread-safe. I removed that call from the code that runs in parallel and process the results afterwards on a single thread; since .Append() is likely fast, this shouldn't have a significant adverse effect on overall performance.
using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    var texts = files.AsParallel().Select(ab =>
    {
        using (var pages = Pix.LoadFromFile(ab))
        {
            using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
            {
                return page.GetText();
            }
        }
    });
    foreach (string text in texts)
    {
        OCRedText.Append(text);
    }
}

Related

Parallel.ForEach use case

I am wondering if I should be using Parallel.ForEach() for my case. For a bit of context: I am developing a small music player using the NAudio library. I want to use Parallel.ForEach() in a factory method to quickly access .mp3 files and create TrackModel objects to represent them (about 400). The code looks like this:
public static List<TrackModel> CreateTracks(string[] files)
{
    // Guard Clause
    if (files == null || files.Length == 0) throw new ArgumentException();

    var output = new List<TrackModel>();
    TrackModel track;

    Parallel.ForEach(files, file =>
    {
        using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
        {
            track = new TrackModel()
            {
                FilePath = file,
                Title = File.Create(file).Tag.Title,
                Artist = File.Create(file).Tag.FirstPerformer,
                TrackLength = musicPlayer.GetLengthInSeconds(),
            };
        }
        lock (output)
        {
            output.Add(track);
        }
    });
    return output;
}
Note: I use lock to prevent multiple threads from adding elements to the list at the same time.
My question is the following: Should I be using Parallel.ForEach() in this situation or am I better off writing a normal foreach loop? Is this the right approach to achieve better performance and should I be using multithreading in combination with file access in the first place?
You're better off avoiding both a foreach and Parallel.ForEach. In this case AsParallel() is your friend.
Try this:
public static List<TrackModel> CreateTracks(string[] files)
{
    if (files == null || files.Length == 0) throw new ArgumentException();
    return
        files
            .AsParallel()
            .AsOrdered()
            .WithDegreeOfParallelism(2)
            .Select(file =>
            {
                using (MusicPlayer musicPlayer = new MusicPlayer(file, 0f))
                {
                    return new TrackModel()
                    {
                        FilePath = file,
                        Title = File.Create(file).Tag.Title,
                        Artist = File.Create(file).Tag.FirstPerformer,
                        TrackLength = musicPlayer.GetLengthInSeconds(),
                    };
                }
            })
            .ToList();
}
This handles all the parallel logic and the locking behind the scenes for you.
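For completeness, a minimal call site might look like this (the folder path and file pattern are just placeholders):
string[] files = Directory.GetFiles(@"C:\Music", "*.mp3");
List<TrackModel> tracks = CreateTracks(files);
Console.WriteLine($"Loaded {tracks.Count} tracks.");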
Combining the suggestions from the comments and answers and adapting them to my code, I was able to solve my issue with the following:
public List<TrackModel> CreateTracks(string[] files)
{
    var output = files
        .AsParallel()
        .Select(file =>
        {
            using MusicPlayer musicPlayer = new MusicPlayer(file, 0f);
            using File musicFile = File.Create(file);
            return new TrackModel()
            {
                FilePath = file,
                Title = musicFile.Tag.Title,
                Artist = musicFile.Tag.FirstPerformer,
                Length = musicPlayer.GetLengthInSeconds(),
            };
        })
        .ToList();
    return output;
}
Using AsParallel() helped significantly decrease the loading time which is what I was looking for. I will mark Enigmativity's answer as correct because of the clever idea. Initially, it threw a weird AggregateException, but I was able to solve it by saving the output in a variable and then returning it.
Credit to marsze as well, whose suggestion helped me fix a memory leak in the application and shave off 16MB of memory (!).

File search optimisation in C# using Parallel

I have a folder with many CSV files in it, which are around 3MB each in size.
Example of the content of one CSV file:
afkla890sdfa9f8sadfkljsdfjas98sdf098,-1dskjdl4kjff;
afkla890sdfa9f8sadfkljsdfjas98sdf099,-1kskjd11kjsj;
afkla890sdfa9f8sadfkljsdfjas98sdf100,-1asfjdl1kjgf;
etc...
Now I have a console app written in C# that searches each CSV file for certain strings.
The strings to search for are in a txt file.
Example of the search txt file:
-1gnmjdl5dghs
-17kn3mskjfj4
-1plo3nds3ddd
Then I call the method that searches for each search string in all files in the given folder:
private static object _lockObject = new object();

public static IEnumerable<string> SearchContentListInFiles(string searchFolder, List<string> searchList)
{
    var result = new List<string>();
    var files = Directory.EnumerateFiles(searchFolder);
    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file);
        if (fileContent.Any(x => searchList.Any(y => x.ToLower().Contains(y))))
        {
            lock (_lockObject)
            {
                foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.ToLower().Contains(y))))
                {
                    result.Add(searchFound);
                }
            }
        }
    });
    return result;
}
The question now is: can I improve the performance of this operation in any way?
I have around 100 GB of files to search through.
It takes approximately 1 hour to search all ~30,000 files with around 25 search strings, on an SSD and a good i7 CPU.
Would it make a difference to have larger or smaller CSV files? I just want this search to be as fast as possible.
UPDATE
I have tried every suggestion you wrote, and this is what performed best for me (removing ToLower from the LINQ gave the biggest performance boost; search time went from 1 hour down to 16 minutes!):
public static IEnumerable<string> SearchContentListInFiles(string searchFolder, HashSet<string> searchList)
{
    var result = new BlockingCollection<string>();
    var files = Directory.EnumerateFiles(searchFolder);
    Parallel.ForEach(files, (file) =>
    {
        var fileContent = File.ReadLines(file); //.Select(x => x.ToLower());
        if (fileContent.Any(x => searchList.Any(y => x.Contains(y))))
        {
            foreach (string searchFound in fileContent.Where(x => searchList.Any(y => x.Contains(y))))
            {
                result.Add(searchFound);
            }
        }
    });
    return result;
}
Something like Lucene could probably give you a performance boost: why not index your data so you can search it easily?
Take a look at Lucene.NET.
You'll avoid scanning the data sequentially. In addition, you can build many indexes over the same data to reach certain results extremely quickly.
Try to:
- Call .ToLower() once per line instead of calling .ToLower() for every element in searchList.
- Make one pass over each file instead of two (Any followed by Where): collect the matches in a single scan and only then add them under the lock. In your sample you waste time on the second pass and block all threads while searching and adding.
- If you know the position to look at (in your sample you do), you can scan from that position instead of the whole string.
- Use a producer/consumer pattern, for example BlockingCollection<T>, so there is no need for a lock.
- If you need a strict match on a field, build a HashSet from searchList and call searchHash.Contains(fieldValue); this speeds things up dramatically (see the sketch after the sample below).
So here is a sample (not tested):
using (var searcher = new FilesSearcher(
    searchFolder: "path",
    searchList: toLookFor))
{
    searcher.SearchContentListInFiles();
}
Here is the searcher:
public class FilesSearcher : IDisposable
{
    private readonly BlockingCollection<string[]> filesInMemory;
    private readonly string searchFolder;
    private readonly string[] searchList;

    public FilesSearcher(string searchFolder, string[] searchList)
    {
        // the reader thread stores file contents here
        this.filesInMemory = new BlockingCollection<string[]>(
            // limit the number of files held in memory, so if the processing threads
            // are not fast enough, the reader takes a break and waits
            boundedCapacity: 100);
        this.searchFolder = searchFolder;
        this.searchList = searchList;
    }

    public IEnumerable<string> SearchContentListInFiles()
    {
        // start reading;
        // we don't need many threads here, roughly 1 thread per storage device is the optimum
        var filesReaderTask = Task.Factory.StartNew(ReadFiles, TaskCreationOptions.LongRunning);

        // at least one processing thread, because the reader thread is I/O bound
        var taskCount = Math.Max(1, Environment.ProcessorCount - 1);

        // start the search threads
        var tasks = Enumerable
            .Range(0, taskCount)
            .Select(x => Task<string[]>.Factory.StartNew(Search, TaskCreationOptions.LongRunning))
            .ToArray();

        // wait for the results
        Task.WaitAll(tasks);

        // combine the results
        return tasks
            .SelectMany(t => t.Result)
            .ToArray();
    }

    private string[] Search()
    {
        // if you always get unique results, a List is enough
        var results = new List<string>();
        //var results = new HashSet<string>();
        foreach (var content in this.filesInMemory.GetConsumingEnumerable())
        {
            // one pass over the file
            var currentFileMatches = content
                .Where(sourceLine =>
                {
                    // lower-case once per line, so we don't need a lower-cased copy of the whole file
                    var lower = sourceLine.ToLower();
                    return this.searchList.Any(lower.Contains);
                });

            // store the matches for the current file
            foreach (var currentMatch in currentFileMatches)
            {
                results.Add(currentMatch);
            }
        }
        return results.ToArray();
    }

    private void ReadFiles()
    {
        var files = Directory.EnumerateFiles(this.searchFolder);
        try
        {
            foreach (var file in files)
            {
                var fileContent = File.ReadLines(file);
                // add the file, or wait if filesInMemory is full
                this.filesInMemory.Add(fileContent.ToArray());
            }
        }
        finally
        {
            this.filesInMemory.CompleteAdding();
        }
    }

    public void Dispose()
    {
        if (filesInMemory != null)
            filesInMemory.Dispose();
    }
}
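The sample above does not illustrate the last point (exact matching on a field). A rough, untested sketch of that idea, assuming each line has the form "<key>,<value>;" and searchList holds the exact values to find:
static IEnumerable<string> SearchByField(IEnumerable<string> lines, HashSet<string> searchSet)
{
    foreach (var line in lines)
    {
        var commaIndex = line.IndexOf(',');
        if (commaIndex < 0)
            continue;
        // the field of interest runs from the comma to the trailing ';' (if any)
        var field = line.Substring(commaIndex + 1).TrimEnd(';');
        // O(1) hash lookup instead of scanning every search string per line
        if (searchSet.Contains(field))
            yield return line;
    }
}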
This operation is first and foremost disk bound, and disk-bound operations do not benefit from multithreading. All you will do is swamp the disk controller with a ton of conflicting requests at the same time, which a feature like NCQ then has to straighten out again.
If you had loaded all the files into memory first, your operation would be memory bound, and memory-bound operations usually do not benefit from multithreading either (the details depend on CPU and memory architecture).
While a certain amount of multitasking is mandatory in programming, true multithreading only helps with CPU-bound operations, and nothing here looks remotely CPU bound. So multithreading that search (one thread per file) will not make it faster; it will likely make it slower due to all the thread switching and synchronization overhead.

Amazon S3 - Transfer utility won't work in parallel

I'm trying to upload multiple files to S3 using the TransferUtility class from the Amazon SDK for .NET.
My thinking was that, since the SDK doesn't allow uploading multiple files from different folders at once, I'd create multiple threads and upload from them. But it looks like the Amazon SDK has some kind of check against this, because I don't see any parallel execution of my uploading method.
Here is the pseudo-code I'm using:
int totalUploaded = 0;
foreach (var dItem in Detection.Items.AsParallel())
{
    UploadFile(dItem);
    totalUploaded++;
    Action a = () => { lblStatus.Text = $"Uploaded {totalUploaded} from {Detection.ItemsCount}"; };
    BeginInvoke(a);
}
I'm using .AsParallel to spawn multiple threads. My CPU (i7-5930K) has 6 cores and supports multithreading, so AsParallel should be able to spawn as many threads as needed.
And here is the upload method:
private void UploadFile(Detection.Item item)
{
    Debug.WriteLine("Enter " + item.FileInfo);
    Interlocked.Increment(ref _threadsCount);
    ...
    using (var client = AmazonS3Client)
    {
        ....
        // if we are here we need to upload
        TransferUtilityUploadRequest request = new TransferUtilityUploadRequest
        {
            Key = s3Key,
            BucketName = settings.BucketName,
            CannedACL = S3CannedACL.PublicRead,
            FilePath = item.FileInfo.FullName,
            ContentType = "image/png",
        };
        TransferUtility utility = new TransferUtility(client);
        utility.Upload(request);
    }
}
Can't see what could be wrong here. Any ideas are highly appreciated.
Thanks.
The problem in your code is that you are only constructing the ParallelEnumerable but treating it like a simple IEnumerable:
foreach (var dItem in Detection.Items.AsParallel())
This part of the code simply iterates over the collection. If you want parallel execution, you have to use the extension method ForAll():
Detection.Items.AsParallel().ForAll(dItem =>
{
    //Do parallel stuff here
});
Also you can simply use the Parallel class:
Parallel.ForEach(Detection.Items, dItem =>
{
    //Do parallel stuff here
});
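Applied to the loop from the question, that might look roughly like this (an untested sketch; UploadFile, Detection, lblStatus and BeginInvoke come from the question, and Interlocked keeps the counter thread-safe):
int totalUploaded = 0;
Parallel.ForEach(Detection.Items, dItem =>
{
    UploadFile(dItem);
    // thread-safe increment, since several uploads can finish at the same time
    int done = Interlocked.Increment(ref totalUploaded);
    // marshal the label update back onto the UI thread
    BeginInvoke((Action)(() => lblStatus.Text = $"Uploaded {done} from {Detection.ItemsCount}"));
});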

How can I simplify and optimize this C# code

I have a method like this:
public ConcurrentBag<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    for (var i = 0; i < foldersPath.Count; i++)
    {
        var index = i;
        new Thread(delegate()
        {
            foreach (var file in BrowseFiles(foldersPath[index]))
            {
                if (file.Name.Contains(someList[0]) || file.Name.Contains(someList[1]))
                {
                    using (var fileStream = File.Open(file.Path, FileMode.Open))
                    using (var bufferedStream = new BufferedStream(fileStream))
                    using (var streamReader = new StreamReader(bufferedStream))
                    ...
To give you more details:
This method starts n threads (n = foldersPath.Count), and each thread reads all the files whose names contain the strings listed in someList.
Right now my list contains only 2 strings (conditions), which is why I'm doing:
file.Name.Contains(someList[0]) || file.Name.Contains(someList[1])
What I want to do now is replace this line with something that checks all elements in the list someList.
How can I do that?
Edit
Now that I have replaced that line with if (someList.Any(item => file.Name.Contains(item))):
The next question is how I can optimize the performance of this code, knowing that each item in foldersPath is a separate hard drive on my network (never more than 5 hard drives).
You could use something like if (someList.Any(item => file.Name.Contains(item)))
This will iterate over each item in someList and check whether any of them is contained in the file name, returning a boolean that indicates whether a match was found.
Firstly.
There is an old saying in computer science: "There are two hard problems in CS: naming, cache invalidation, and off-by-one errors."
Don't use for loops unless you absolutely have to; the tiny perf gain you get isn't worth the debugging time (assuming there is any perf gain in this version of .NET).
Secondly
new Thread. Don't do that. Creating a thread is extremely slow and takes up lots of resources, especially for short-lived work like this, and on top of that there is overhead in passing data between threads. Use ThreadPool.QueueUserWorkItem(WaitCallback) instead, if you MUST run short-lived work on separate threads.
However, as I previously alluded to, threads are an abstraction over CPU resources, and I honestly doubt you are CPU bound; threading is going to cost you more than you think, so stick to a single thread. You ARE I/O bound, though, so make full use of asynchronous I/O.
public async IAsyncEnumerable<FileModel> GetExceptionFiles(List<string> foldersPath, List<string> someList)
{
    foreach (var folderPath in foldersPath)
    foreach (var file in BrowseFiles(folderPath))
    {
        // skip files whose names contain none of the search strings (case-insensitive)
        if (!someList.Any(x => file.Name.IndexOf(x, StringComparison.InvariantCultureIgnoreCase) >= 0))
            continue;
        // open the file for asynchronous reading
        using (var fileStream = new FileStream(file.Path, FileMode.Open, FileAccess.Read, FileShare.Read, 4096, useAsync: true))
        using (var bufferedStream = new BufferedStream(fileStream))
        using (var streamReader = new StreamReader(bufferedStream))
        ...
        yield return new FileModel();
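If the method is written as an async stream as sketched above (C# 8 / .NET Core 3.0+), a caller would consume it with await foreach:
await foreach (var fileModel in GetExceptionFiles(foldersPath, someList))
{
    // handle each FileModel as soon as it has been read
    HandleFile(fileModel); // hypothetical handler
}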

How does IEnumerable differ from IObservable under the hood?

I'm curious how IEnumerable differs from IObservable under the hood. I understand the pull and push patterns, respectively, but how does C#, in terms of memory etc., notify subscribers (for IObservable) that they should receive the next bit of data to process? How does the observed instance know it has had a change in data to push to the subscribers?
My question comes from a test I was performing, reading in lines from a file. The file was about 6 MB in total.
Standard Time Taken: 4.7s, lines: 36587
Rx Time Taken: 0.68s, lines: 36587
How is Rx able to massively improve a normal iteration over each of the lines in the file?
private static void ReadStandardFile()
{
    var timer = Stopwatch.StartNew();
    var linesProcessed = 0;
    foreach (var l in ReadLines(new FileStream(_filePath, FileMode.Open)))
    {
        var s = l.Split(',');
        linesProcessed++;
    }
    timer.Stop();
    _log.DebugFormat("Standard Time Taken: {0}s, lines: {1}",
        timer.Elapsed.ToString(), linesProcessed);
}

private static void ReadRxFile()
{
    var timer = Stopwatch.StartNew();
    var linesProcessed = 0;
    var query = ReadLines(new FileStream(_filePath, FileMode.Open)).ToObservable();
    using (query.Subscribe((line) =>
    {
        var s = line.Split(',');
        linesProcessed++;
    }));
    timer.Stop();
    _log.DebugFormat("Rx Time Taken: {0}s, lines: {1}",
        timer.Elapsed.ToString(), linesProcessed);
}

private static IEnumerable<string> ReadLines(Stream stream)
{
    using (StreamReader reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
            yield return reader.ReadLine();
    }
}
My hunch is the behavior you're seeing is reflecting the OS caching the file. I would imagine if you reversed the order of the calls you would see a similar difference in speeds, just swapped.
You could improve this benchmark by performing a few warm-up runs or by copying the input file to a temp file using File.Copy prior to testing each one. This way the file would not be "hot" and you would get a fair comparison.
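A minimal sketch of the warm-up idea (each variant runs once untimed first, so the timed runs that follow see the same cache state):
// warm-up runs: prime the OS file cache, ignore the timings they log
ReadStandardFile();
ReadRxFile();

// timed runs: now both variants read an equally "hot" file
ReadStandardFile();
ReadRxFile();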
I'd suspect that you're seeing some kind of internal optimization of the CLR. It probably caches the content of the file in memory between the two calls so that ToObservable can pull the content much faster...
Edit: Oh, the colleague with the crazy nickname, eeh ... @sixlettervariables, was faster, and he's probably right: it's the OS doing the optimizing rather than the CLR.
