I'm building a Windows Service application that takes as input a directory containing scanned images. The application iterates through all the images and, for each one, performs some OCR operations to grab the barcode, invoice number and customer number.
Some background info:
The tasks performed by the application are pretty CPU-intensive
There are a large number of images to process, and the scanned image files are large (~2 MB)
The application runs on an 8-core server with 16 GB of RAM.
My question:
Since it's working with images on the file system, I'm unsure whether it will really make a difference if I change my application to use .NET Parallel Tasks.
Can anybody give me advice about this?
Many thanks!
If processing an image takes longer than reading N images from the disk, then processing multiple images concurrently is a win. Figure you can read a 2 MB file from disk in under 100 ms (including seek time). Figure one second to read 8 images into memory.
So if your image processing takes more than a second per image, I/O isn't a problem. Do it concurrently. You can scale that down if you need to (i.e. if processing takes 1/2 second, then you're probably best off with only 4 concurrent images).
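A minimal sketch of what the concurrent version might look like, using `Parallel.ForEach` with a capped degree of parallelism. The `ProcessImage` body is a placeholder, not your actual OCR code, and the class and method names are invented for illustration:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class OcrRunner
{
    // Stand-in for the per-image OCR step (hypothetical; the real work
    // would extract the barcode, invoice number and customer number).
    static void ProcessImage(string path)
    {
        byte[] data = File.ReadAllBytes(path); // I/O-bound part
        // ... CPU-bound OCR would go here ...
    }

    // Processes every image in the directory in parallel and returns the count.
    public static int ProcessAll(string inputDir)
    {
        int processed = 0;
        var options = new ParallelOptions
        {
            // Cap parallelism at the core count so the disk is not swamped.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };
        Parallel.ForEach(Directory.EnumerateFiles(inputDir), options, path =>
        {
            ProcessImage(path);
            Interlocked.Increment(ref processed);
        });
        return processed;
    }
}
```

Capping `MaxDegreeOfParallelism` at the core count matches the advice above: with 8 cores and CPU-bound work, more than 8 concurrent images mostly adds contention.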
You should be able to test this fairly quickly: write a program that randomly reads images off the disk, and calculate the average time to open, read, and close the file. Also write a program that processes a sample of the images and compute the average processing time. Those numbers should tell you whether or not concurrent processing will be helpful.
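The read-time half of that test could look something like this rough sketch (`Stream.Null` just discards the bytes, so only the disk work is timed; you would time your OCR routine the same way):

```csharp
using System;
using System.Diagnostics;
using System.IO;

static class IoBenchmark
{
    // Average wall-clock milliseconds to open, read and close each file.
    public static double AverageReadMs(string[] paths)
    {
        var sw = Stopwatch.StartNew();
        foreach (string path in paths)
        {
            using var fs = File.OpenRead(path);
            fs.CopyTo(Stream.Null);  // read the whole file, discard the bytes
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds / paths.Length;
    }
}
```

If the average processing time comes out well above the average read time, concurrency should pay off.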
I think the answer is, 'It Depends'.
I'd try running the application with some type of Performance Monitoring (even the one in Task Manager) and see how high the CPU gets.
If the CPU is maxing out, running it in parallel would improve performance. If not, the disk is the bottleneck, and without some other changes you probably wouldn't get much (if any) gain.
The Problem:
Download a batch of PDF files from pickup.fileserver (SFTP or windows share) to local hard drive (Polling is involved here to check if files are available to download)
Process (resize, apply barcodes etc) the PDF files, create some metadata files, update database etc
Upload this batch to dropoff.fileserver (SFTP)
Await a response from dropoff.fileserver (again, polling is the only option). Once the batch response is available, download it to the local HD.
Parse the batch response, update database and finally upload report to pickup.fileserver
Archive all batch files to a SAN location and go back to step 1.
The Current Solution
We are expecting many such batches so we have created a windows service which can keep polling at certain time intervals and perform the steps mentioned above. It takes care of one batch at a time.
The Concern
The current solution works fine; however, I'm concerned that it is NOT making the best use of available resources, and there is certainly a lot of room for improvement. I have very little idea how I can scale this Windows service to process as many batches simultaneously as it can, and then, if required, how to involve multiple instances of this Windows service hosted on different servers to scale further.
I have read some MSDN articles and some SO answers on similar topics. There are suggestions about using producer-consumer patterns (BlockingCollection&lt;T&gt; etc.). Some say that it wouldn't make sense to create a multi-threaded app for I/O-intensive tasks. What we have here is a mixture of disk, network and processor-intensive tasks. I need to understand how best to use threading or any other technology to make the best use of available resources on one server, and to go beyond one server (if required) to scale further.
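For reference, the producer-consumer shape mentioned above can be sketched with `BlockingCollection<T>`. In this toy version the `int` ids stand in for downloaded files, and the bounded capacity keeps the downloader from racing ahead of the processors:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class BatchPipeline
{
    // One producer "downloads" items; several consumers process them.
    public static int Run(int fileCount, int workers)
    {
        using var queue = new BlockingCollection<int>(boundedCapacity: 16);
        int processed = 0;

        var producer = Task.Run(() =>
        {
            for (int i = 0; i < fileCount; i++)
                queue.Add(i);            // stands in for "file finished downloading"
            queue.CompleteAdding();      // signal consumers no more work is coming
        });

        var consumers = new Task[workers];
        for (int w = 0; w < workers; w++)
            consumers[w] = Task.Run(() =>
            {
                foreach (int id in queue.GetConsumingEnumerable())
                    System.Threading.Interlocked.Increment(ref processed); // CPU work goes here
            });

        Task.WaitAll(consumers);
        producer.Wait();
        return processed;
    }
}
```

The number of consumers is the knob: match it to the core count for CPU-bound processing, or raise it for I/O-bound steps.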
Typical Batch Size
We regularly get batches of ~200 files, ~300 MB total size. The number of batches could grow to about 50 to 100 in the next year or two. A couple of times a year, we get batches of 5k to 10k files.
As you say, what you have is a mixture of tasks, and it's probably going to be hard to implement a single pipeline that optimizes all your resources. I would look at breaking this down into 6 services (one per step) that can then be tuned, multiplied or multi-threaded to provide the throughput you need.
Your sources are probably correct that you're not going to improve performance of your network tasks much by multithreading them. By breaking your application into several services, your resizing and barcoding service can start processing a file as soon as it's done downloading, while the download service moves on to downloading the next file.
The current solution works fine
Then keep it. That's my $0.02. Who cares if it's not terribly efficient? As long as it is efficient enough, then why change it?
That said...
I need to understand how best to use threading or any other technology to make best use of available resources on one server
If you want a new toy, I'd recommend using TPL Dataflow. It is designed specifically for wiring up pipelines that contain a mixture of I/O-bound and CPU-bound steps. Each step can be independently parallelized, and TPL Dataflow blocks understand asynchronous code, so they also work well with I/O.
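As a rough illustration (this assumes the System.Threading.Tasks.Dataflow NuGet package, and the `Task.Delay` calls merely stand in for real SFTP and resize work), a three-stage pipeline might be wired up like this:

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet: System.Threading.Tasks.Dataflow

static class DataflowSketch
{
    public static async Task<int> RunAsync(string[] files)
    {
        int uploaded = 0;

        // I/O-bound: allow many concurrent downloads.
        var download = new TransformBlock<string, string>(
            async file => { await Task.Delay(10); return file; },  // stand-in for SFTP download
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        // CPU-bound: cap at the core count.
        var process = new TransformBlock<string, string>(
            file => file,  // resize / barcode work would go here
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = Environment.ProcessorCount });

        // I/O-bound upload step.
        var upload = new ActionBlock<string>(
            async file => { await Task.Delay(10); System.Threading.Interlocked.Increment(ref uploaded); },
            new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 8 });

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        download.LinkTo(process, link);
        process.LinkTo(upload, link);

        foreach (var f in files) download.Post(f);
        download.Complete();
        await upload.Completion;
        return uploaded;
    }
}
```

Each block gets its own degree of parallelism, which is exactly the "tune each step independently" property described above.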
and go beyond one server (if required) to scale further.
That's a totally different question. You'd need to use reliable queues and break the different steps into different processes, which can then run anywhere. This is a good place to start.
According to this article, you could implement background worker jobs (preferably with Hangfire) in your application layer, reduce the code and deployment management of multiple Windows services, and possibly achieve the same result.
Also, you won't need to bother with managing multiple Windows services.
Additionally, it can recover from application-level failures or restart events.
There is no magic technology that will solve your problem, you need to analyse each part of it step by step.
You will need to profile the application and determine what areas are slow performing and refactor the code to resolve the problem.
This might mean increasing the demand on one resource to decrease demand on another. For example, you might find that you are doing a database lookup 10 times for each file you process, and that caching the data before you start processing files is quicker, but perhaps only when a batch is larger than xx files.
What increases the processing speed of the whole batch may not be the optimal method for a single file.
As your program has multiple steps then you can look at each of these in turn, and as a whole.
My guess would be that the FTP download and upload would take the most time, so you could look at running those in parallel. Whether that means running xx threads at once, each processing a file, or having a separate task/thread for each stage in your process, you can only determine by testing.
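One way to test the "xx threads at once" variant is to throttle concurrent transfers with a `SemaphoreSlim`. In this sketch `downloadOne` is a placeholder for whatever actually fetches a single file over SFTP:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class ThrottledDownloader
{
    // Downloads all files with at most maxConcurrent transfers in flight.
    public static async Task RunAsync(string[] files, int maxConcurrent,
                                      Func<string, Task> downloadOne)
    {
        using var gate = new SemaphoreSlim(maxConcurrent);
        var tasks = new Task[files.Length];
        for (int i = 0; i < files.Length; i++)
        {
            string file = files[i];
            tasks[i] = Task.Run(async () =>
            {
                await gate.WaitAsync();          // wait when the limit is reached
                try { await downloadOne(file); }
                finally { gate.Release(); }
            });
        }
        await Task.WhenAll(tasks);
    }
}
```

Varying `maxConcurrent` and measuring the total batch time tells you where the network saturates.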
A good design is critical for performance. But there are limits and sometimes it just takes time to do some tasks.
Don't forget that you must weigh this up against the time and effort needed to implement it, and the benefit. If the service runs overnight and takes 6 hours, is it really a benefit to cut that to 4 hours if the people who need the results won't be in the office until much later anyway?
For this kind of problem: are there any specific file types that you download from the SFTP server? I had a similar problem downloading large files, though in my case it was not a Windows service but an EXE driven by System.Timers.
Try creating dedicated threads for the file types that are large in size, e.g. PDFs.
You can check for these file types while listing the SFTP file paths and assign them to a separate thread process to download.
You will also need to do the same for uploads, in the other direction.
In my case, all I could do was tweak the existing application and create a separate thread process for the large file types. That solved my problem, as the flat files and the large PDF files are downloaded on parallel threads.
I'm writing an application that uploads large files to a web service using HttpWebRequest.
This application will be run by various people with various internet speeds.
I asynchronously read the file in chunks, and asynchronously write those chunks to the request stream. I do this in a loop using callbacks. And I keep doing this until the whole file has been sent.
The speed of the upload is calculated between writes and the GUI is subsequently updated to show said speed.
The issue I'm facing is deciding on a buffer size. If I make it too large, users with slow connections will not see frequent updates to the speed. If I make it too small, users with fast connections will end up "hammering" the read/write methods causing CPU usage to spike.
What I'm doing now is starting the buffer off at 128kb, and then every 10 writes I check the average write speed of those 10 writes, and if it's under a second I increase the buffer size by 128kb. I also shrink the buffer in a similar fashion if the write speed drops below 5 seconds.
This works quite well, but it all feels very arbitrary and it seems like there is room for improvement. My question is, has anybody dealt with a similar situation and what course of action did you take?
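For what it's worth, the grow/shrink rule you describe can be isolated into a small pure function, which makes the thresholds easy to tune and test. The 1-second and 5-second limits and the 128 kB step are taken from your description (reading "drops below 5 seconds" as "takes more than 5 seconds per write"); the min/max clamps are assumptions:

```csharp
using System;

static class AdaptiveBuffer
{
    const int Step = 128 * 1024;       // grow/shrink in 128 kB increments
    const int Min  = 128 * 1024;
    const int Max  = 8 * 1024 * 1024;  // assumed upper bound

    // Returns the next buffer size given the average seconds per write
    // over the last sampling window (the "every 10 writes" check).
    public static int Next(int current, double avgSecondsPerWrite)
    {
        if (avgSecondsPerWrite < 1.0)       // fast link: send bigger chunks
            return Math.Min(current + Step, Max);
        if (avgSecondsPerWrite > 5.0)       // slow link: keep GUI updates frequent
            return Math.Max(current - Step, Min);
        return current;                     // in the comfort zone, leave it alone
    }
}
```

Keeping the policy separate from the I/O loop also lets you experiment with other rules (e.g. multiplicative growth) without touching the upload code.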
Thanks
I think this is a good approach; I used it too for large file uploads, with one small tweak: I determined the connection speed up front, in the first request, by placing a call to a separate service. This saved the overhead of recalculating the speed with every request. My primary reasons for doing so were:
On a slow connection, the speed usually fluctuates a lot, so recalculating it on every request doesn't make sense.
I also had to provide a resume facility, so the user could re-upload the file from the point where it stopped last time.
For scalability, I fixed the buffer size based on the first request. Let me know if this helped.
My multithreaded application takes some files from the HD and then processes the data in those files. I reuse the same instance of a class (dataProcessing) to create threads (I just change the parameters of the calling method).
processingThread[i] = new Thread(new ThreadStart(dataProcessing.parseAll));
I am wondering if the cause could be all threads reading from the same memory.
It takes about half a minute to process each file. The files are read quickly since they are just 200 KB. After I process the files, I write all the results to a single destination file. I don't think the problem is reading from or writing to the disk. All the threads are working on the task, but for some reason the processor is not being fully used. I tried adding more threads to see if I could reach 100% processor usage, but at some point it slows down and processor usage decreases instead of going up. Does anyone have an idea what could be wrong?
Here some points you might want to consider:
Most CPUs today are hyper-threaded. Even though the OS treats each hyper-threaded core as two logical processors, the hardware does not really provide two full pipelines; the benefit depends heavily on the CPU and the arithmetic operations you are performing. While most CPUs have two integer units per core, there is only one FP unit, so most floating-point operations gain no benefit from the hyper-threaded architecture.
Since each file is only 200 KB, I can only assume it is all copied to the cache, so this is not a memory/disk issue.
Are you using external DLLs? Some operations, like reading/saving JPEG files using the native Bitmap class, are not parallel, and you won't see any speed-up from doing multiple executions at once.
Performance decreases once you reach the point where switching between threads costs more than the work they are doing.
Are you only reading the data, or are you also modifying it? If each thread also modifies the data, there will be a lot of lock and cache contention. It would be better for each thread to gather its own data in its own memory and combine all the data only after all the threads have done their jobs.
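The last point can be sketched with the thread-local overload of `Parallel.ForEach`; summing is just a stand-in for whatever per-file aggregation the real code does:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class LocalStateSum
{
    // Each worker accumulates into its own local total (no shared writes,
    // so no lock contention or cache-line ping-pong), then merges once at the end.
    public static long SumParallel(int[] data)
    {
        long total = 0;
        Parallel.ForEach(
            Partitioner.Create(0, data.Length),
            () => 0L,                                   // thread-local initial value
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                    local += data[i];                   // touch only private state
                return local;
            },
            local => System.Threading.Interlocked.Add(ref total, local)); // one merge per thread
        return total;
    }
}
```

The single `Interlocked.Add` per thread replaces what would otherwise be millions of contended writes to shared state.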
Yesterday, I asked a question here: how to disable the disk cache in C# by invoking the Win32 CreateFile API with FILE_FLAG_NO_BUFFERING.
My performance tests (write and read, 1,000 files, 220 MB total) show that FILE_FLAG_NO_BUFFERING doesn't improve performance; it is slower than the .NET default disk cache. When I change FILE_FLAG_NO_BUFFERING to FILE_FLAG_SEQUENTIAL_SCAN, I can match the .NET default disk cache and go slightly faster.
Before this, I tried using MongoDB's GridFS feature as a replacement for the Windows file system; that wasn't good (and I don't need the distributed feature, it was just a taste).
In my product, the server receives a lot of smaller files (60-100 KB) per second over TCP/IP and needs to save them to disk, and a third service reads these files once (just reads once and processes). Would asynchronous I/O help me get the best speed with the lowest CPU usage? Can someone give me a suggestion? Or can I still use the FileStream class?
Update 1
Could a memory-mapped file meet my requirements, i.e. writing all the files into one big file (or several) and reading them back from it?
If your PC is taking 5-10 seconds to write a 100kB file to disk, then you either have the world's oldest, slowest PC, or your code is doing something very inefficient.
Turning off disk caching will probably make things worse rather than better. With a disk cache in place, your writes will be fast, and Windows will do the slow part of flushing the data to disk later. Indeed, increasing I/O buffering usually results in significantly improved I/O in general.
You definitely want to use asynchronous writes - that means your server starts the data writing, and then goes back to responding to its clients while the OS deals with writing the data to disk in the background.
There shouldn't be any need to queue the writes (as the OS will already be doing that if disk caching is enabled), but that is something you could try if all else fails: it could potentially help by writing only one file at a time to minimise the need for disk seeks.
Generally for I/O, using larger buffers helps to increase your throughput. For example instead of writing each individual byte to the file in a loop, write a buffer-ful of data (ideally the entire file, for the sizes you mentioned) in one Write operation. This will minimise the overhead (instead of calling a write function for every byte, you call a function once for the entire file). I suspect you may be doing something like this, as it's the only way I know to reduce performance to the levels you've suggested you are getting.
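A sketch of the "one Write for the whole file" idea, combined with the asynchronous writes suggested above (the path and payload are whatever your server receives; `useAsync: true` is what makes `WriteAsync` truly overlapped):

```csharp
using System.IO;
using System.Threading.Tasks;

static class FileWriter
{
    // Writes the whole payload with a single asynchronous call instead of
    // a byte-at-a-time loop, minimising per-call overhead.
    public static async Task SaveAsync(string path, byte[] payload)
    {
        using var fs = new FileStream(path, FileMode.Create, FileAccess.Write,
                                      FileShare.None, bufferSize: 64 * 1024,
                                      useAsync: true);
        await fs.WriteAsync(payload, 0, payload.Length);
    }
}
```

For 60-100 KB files the entire payload fits comfortably in one write, so the server thread is free again almost immediately.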
Memory-mapped files will not help you. They're really best for accessing the contents of huge files.
One of the biggest and most significant improvements in your case, imo, would be to process the files without saving them to disk first and then, if you really need to store them, push them onto a queue and process that queue in another thread, which saves them to disk. By doing this you immediately get the processed data you need, without losing time saving it to disk first, but you still end up with a file on disk afterwards, without losing the computational power of your file processor.
How do you measure the characteristics of file (hard-disk) I/O? For example, on a machine with a hard disk (with speed X), an i7 CPU (or whatever number of cores) and Y amount of RAM (with a Z Hz BIOS), what would be (on Windows):
Optimum number of files that can be written to the HD in parallel?
Optimum number of files that can be read from HD in parallel?
File-system facilities that help with faster writes. (For example: is there a feature or tool that lets you write batches of binary data to different sectors (or disks) and then bind them together as a file? I don't know much about the underlying file I/O in the OS, but it would be reasonable for such tools to exist!)
If tools like those in the previous point exist, are they available in .NET too?
I want to write large files (streamed over the web or from another source) as fast (and as parallel) as possible. I am coding this in C#, and it acts like a download manager: if streaming gets interrupted, it can carry on later.
The answer (as so often) depends on your usage. The whole operating system is one big tradeoff between different usage scenarios. For the NTFS filesystem one could mention the block size being set to 4k, NTFS storing files smaller than the block size in the MFT, the size of files, the number of files, fragmentation, etc.
If you are planning to write large files, then a block size of 64k may be good; that is, if you plan to read large amounts of data. If you read smaller amounts of data, then smaller block sizes are good. The OS works in 4k pages, so 4k is good. Compression (and encryption?) as well as SQL and Exchange only work on 4k pages (iirc).
If you write small files (&lt;4k), they will be stored inside the MFT itself, so you don't have to make "an extra jump". This is especially useful for write operations (reads may have the MFT cached). The MFT stores files in sequences (i.e. blocks 1000-1010, 2000-2010), so fragmentation will make the MFT bigger. Writing files to disk in parallel is one of the main causes of fragmentation; the other is deleting files. You can pre-allocate the required size for a file, and Windows will try to find a suitable place on the disk to counter fragmentation. There are also real-time defragmentation programs like O&amp;O Defrag.
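Pre-allocating from .NET is just a `SetLength` call on a freshly created stream; a small sketch:

```csharp
using System.IO;

static class Preallocate
{
    // Reserving the final size up front lets NTFS pick a contiguous run,
    // which counters the fragmentation caused by many parallel writers.
    public static FileStream CreateWithSize(string path, long expectedBytes)
    {
        var fs = new FileStream(path, FileMode.Create, FileAccess.Write);
        fs.SetLength(expectedBytes);   // allocate now, fill in later
        return fs;
    }
}
```

This fits the download-manager case well, since the final file size is usually known from the Content-Length before the first byte arrives.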
Windows maps a binary stream pretty much directly to the physical location on the disk, so using different read/write methods will not yield as much of a performance boost as the other factors. For maximum speed, programs use the technique of mapping memory directly to disk. See http://en.wikipedia.org/wiki/Memory-mapped_file
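In .NET that technique is exposed through the `MemoryMappedFile` class. A rough sketch (the offset/size handling is illustrative, not tuned; in a real download manager you would keep the mapping open across chunks):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedWrite
{
    // Writes a chunk at its final offset via a memory-mapped view; the OS
    // pages the data to disk in the background. This suits the resume case:
    // each chunk can land at its final position as soon as it arrives.
    public static void WriteChunk(string path, long fileSize, long offset, byte[] chunk)
    {
        using var mmf = MemoryMappedFile.CreateFromFile(
            path, FileMode.OpenOrCreate, null, fileSize);
        using var view = mmf.CreateViewAccessor(offset, chunk.Length);
        view.WriteArray(0, chunk, 0, chunk.Length);
    }
}
```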
There is an option in Windows (under Device Manager, hard disks) to increase caching on the disk. This is dangerous, as it could damage the filesystem if the computer bluescreens or loses power, but it gives a big performance boost when writing smaller files (and on all writes). If the disk is busy this is especially valuable, as the seek time will decrease. Windows uses what is called the elevator algorithm, which basically means it moves the hard-disk heads back and forth across the surface, serving any application in the direction it is moving.
Hope this helps. :)