Generating PDF for 90K records - c#

Currently I am using LocalReport.Render to create PDFs for 90K records. Using a normal for loop, it takes around 4 hours just to create the PDFs. I have tried many options.
I tried Parallel.ForEach, with and without setting MaxDegreeOfParallelism, using different values. There are 2 processors in my system. With MaxDegreeOfParallelism (MDP) = 4, it takes the same time as the normal for loop. I thought increasing MDP to 40 would speed up the process, but I didn't get the expected results: it took 900 minutes.
I also tried spawning one thread per record:
var list = new List<Thread>();
foreach (var record in records)
{
    var thread = new Thread(() => GeneratePDF());
    thread.Start();
    list.Add(thread);
}
foreach (var thread in list)
{
    thread.Join();
}
That ended up creating far too many threads and took even longer.
I need help using Parallel.ForEach to speed up the process of creating PDFs for 90K records. Suggestions to change the code are also welcome.
Any help would be much appreciated.
Thanks

I don't know any PDF generators, so I can only assume there is a lot of overhead in initializing and finalizing things. Here's what I'd do:
Find an open source PDF generator.
Let it generate a few separate pieces of a PDF - header, footer, etc.
Dig through the code to find where the header/footer is produced, and try to work around those parts so you can reuse generator state without running through the entire process.
Try to stitch together a PDF from stored states, with the generator writing only the parts that differ.
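As for the Parallel.ForEach part of the question, a minimal sketch might look like the following. It assumes a GeneratePDF(record) overload that renders a single record, and that each call builds its own LocalReport rather than sharing one instance across threads. Capping MaxDegreeOfParallelism at the core count avoids the oversubscription that made MDP = 40 slower:

using System;
using System.Threading.Tasks;

Parallel.ForEach(
    records,
    new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
    record => GeneratePDF(record)); // assumes each call uses its own LocalReport

If rendering is CPU-bound, running more threads than cores only adds context-switching overhead, which is consistent with the 900-minute result at MDP = 40.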


Is there a more efficient way to write this loop?

I have the following loops, which iterate for a long time. queryResult has 397,464 rows and each row has 15 columns, so the inner loop runs 397,464 × 15 = 5,961,960 times; adding the outer loop's 397,464 iterations gives 6,359,424 in total.
The problem is that this takes a very long time, resulting in page timeouts.
Could this be written in a more efficient way?
var rowHtml = String.Empty;
foreach (DataRow row in queryResult.Rows)
{
    rowHtml += "<tr>";
    for (int i = 0; i < queryResult.Columns.Count; i++)
    {
        rowHtml += $"<td>{row[i]}</td>";
    }
    rowHtml += "</tr>";
}
Building the string: consider using a StringBuilder. Every time you concatenate strings with the + operator, a new string is allocated on the heap. That is fine for occasional use, but it becomes a major slowdown in large workloads like yours. You can specify the StringBuilder's starting and maximum capacities in the constructor, giving you more control over the app's memory usage.
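A minimal sketch of the same loop with a StringBuilder; the starting capacity here is a guess and should be tuned to your typical output size:

using System.Data;
using System.Text;

var sb = new StringBuilder(4 * 1024 * 1024); // pre-size to cut internal re-allocations
foreach (DataRow row in queryResult.Rows)
{
    sb.Append("<tr>");
    for (int i = 0; i < queryResult.Columns.Count; i++)
    {
        sb.Append("<td>").Append(row[i]).Append("</td>");
    }
    sb.Append("</tr>");
}
var rowHtml = sb.ToString();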
Parallelization: I do not know your app's exact context, but have a look at the System.Threading.Tasks.Parallel class. Its For/ForEach methods iterate over a collection using the thread pool, which can greatly accelerate processing by spreading it across multiple cores.
Be careful though: if the order of elements matters, you should divide the workload into packages instead and build a substring for each of those.
Edit - correction: string concatenation can only truly be parallelized in rare cases where the exact length of each substring produced by the loop is fixed and known in advance. In that special case, results can be written directly to a large pre-allocated destination buffer. This is perfectly viable when working with char arrays or pointers, but not advisable with normal C# strings or StringBuilders.
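A hedged sketch of that special case, assuming every row renders to the same known width; RenderRow is a hypothetical fixed-width renderer:

using System.Threading.Tasks;

const int rowWidth = 64;      // assumed fixed, known output width per row
const int rowCount = 397464;
var buffer = new char[rowCount * rowWidth];

Parallel.For(0, rowCount, i =>
{
    // Each iteration writes only to its own slice, so no locking is needed.
    string rendered = RenderRow(i).PadRight(rowWidth); // hypothetical renderer
    rendered.CopyTo(0, buffer, i * rowWidth, rowWidth);
});

var html = new string(buffer);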
Asynchronous processing: it looks like you are writing some kind of web app or server backend. If the content is required on demand and does not need to be ready the exact moment the page loads, consider displaying a loading bar or a "please wait" notification while the page waits for the server to send the finished result.
Edit: as suggested in the comments, there are better ways to solve this than constructing the HTML string from a table. Consider those before building some elaborate content loading scheme.

Different option than Dictionary

I made a program that is, at its heart, a keyboard hook: I press a specific button and it performs a specific action. Since there is a fairly large list of options to select from via a ComboBox, I made a Dictionary called ECCMDS (stands for "embedded controller commands"). I can then set the ComboBox items to ECCMDS.Keys and select a command by name. That makes saving easy too: because the key is a string, I just write it to an XML file. The program monitors anywhere from 4-8 buttons. The problem comes at runtime: the program uses about 53 MB of memory (of course, I look over at it now and it says 16 MB :/). The tablet this runs on has 3 GB of memory and an Atom processor. Normally I'd scoff at 53 MB, but with a huge switch statement instead, the program used about 2 or 3 MB (it's been some time since I actually looked at its usage, so I can't remember exactly).
So although the Dictionary greatly reduces the complexity of my RunCommand method, I'm wondering about the memory usage. This tablet at idle is using 80% of its memory, so I'd like to make as little impact on that as possible. Is there another solution to this problem? Here is a small example of the dictionary:
ECCMDS = new Dictionary<string, Action>()
{
    {"Decrease Backlight", EC.DescreaseBrightness},
    {"Increase Backlight", EC.IncreaseBrightness},
    {"Toggle WiFi", new Action(delegate { EC.WirelessState = GetToggledState(EC.WirelessState); })},
    {"Enable WiFi", new Action(delegate { EC.WirelessState = ObjectState.Enabled; })},
    {"Disable WiFi", new Action(delegate { EC.WirelessState = ObjectState.Disabled; })},
    {"{PRINTSCRN}", new Action(delegate { VKeys.User32Input.DoPressRawKey(0x2C); })},
};
Is it possible to use reflection or something to achieve this?
EDIT
After the nice suggestion of making a new program and comparing the two methods, I've determined that it is not my Dictionary. I didn't think WPF made that big a difference over WinForms, but it must. The new program hardly has any pictures (unlike before; most of my graphics are generated now), and the results are as follows:
Main Entry Point: 32356 KB
Before Huge Dictionary: 33724 KB
After Initialization: 35732 KB
After 10000 runs: 37824 KB (took 932 ms)
After Huge Dictionary: 38444 KB
Before Huge Switch Statement: 39060 KB
After Initialization: 39696 KB
After 10000 runs: 40076 KB (took 1136 ms)
After Huge Switch Statement: 40388 KB
I suggest you extract the Dictionary to a separate program and see how much space it occupies, before you worry about how much space it is taking and whether it is actually your problem.
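A minimal sketch of that measurement, assuming a hypothetical BuildEccmds() factory that constructs the dictionary exactly as in the question:

using System;
using System.Collections.Generic;

// Force full collections so the delta reflects the dictionary, not leftover garbage.
long before = GC.GetTotalMemory(true);
Dictionary<string, Action> eccmds = BuildEccmds(); // hypothetical factory
long after = GC.GetTotalMemory(true);
Console.WriteLine("Dictionary cost: {0} KB", (after - before) / 1024);
GC.KeepAlive(eccmds); // keep the dictionary reachable until measured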

c# multithreading file reading and page parsing

I have a file with more than 500,000 URLs. I want to read the file and parse every URL with a function that returns a string message. For now everything works fine, but the performance is not good, so I need to start the parsing in simultaneous threads (for example, 100 threads):
ParseEngine parseEngine = new ParseEngine(parseFormulas);
using (StreamReader reader = new StreamReader("urls.txt"))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        string result = parseEngine.Parse(line);
        Console.WriteLine(result);
    }
}
It would also be good if I could stop all the threads with a button click, and change the number of threads. Any help and tips?
Be sure to check out this article on PLINQ performance compared to other techniques for parsing a text file, line by line, using multithreading.
Not only does it provide sample source code for doing something almost identical to what you want, but the authors also discovered a "gotcha" with PLINQ that can result in abnormally slow times: if you feed PLINQ directly from File.ReadAllLines() or StreamReader.ReadLine(), you'll spoil the performance because PLINQ can't properly divide the input up that way. They solved the problem by reading all the lines into an indexed array first, and THEN processing it with PLINQ.
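A minimal sketch of that shape, reusing parseEngine from the question:

using System.IO;
using System.Linq;

// Materialize the lines first so PLINQ can range-partition an indexed array.
string[] lines = File.ReadAllLines("urls.txt");
var results = lines.AsParallel()
                   .AsOrdered() // keep the output in file order
                   .Select(line => parseEngine.Parse(line))
                   .ToList();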
Honestly, for the performance difference I would just try Parallel.ForEach in .NET 4.0 if that is an option:
using System.Threading.Tasks;

Parallel.ForEach(enumerableList, p =>
{
    parseEngine.Parse(p);
});
It's a decent start to running things in parallel and should minimize your thread troubleshooting headaches.
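For the stop-button part of the question, a hedged sketch using a CancellationTokenSource; stopButton_Click is a hypothetical handler name:

using System;
using System.Threading;
using System.Threading.Tasks;

var cts = new CancellationTokenSource();
// In the button handler: void stopButton_Click(object s, EventArgs e) { cts.Cancel(); }

try
{
    Parallel.ForEach(
        enumerableList,
        new ParallelOptions { CancellationToken = cts.Token, MaxDegreeOfParallelism = 8 },
        p => parseEngine.Parse(p));
}
catch (OperationCanceledException)
{
    // Expected when the user clicks Stop.
}

The MaxDegreeOfParallelism option also covers changing the number of worker threads.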
A producer/consumer setup would be good for this: one thread reading from the file and writing to a queue, and the other threads reading from the queue.
You mentioned an example of 100 threads. With that many threads, you would want to read from the queue in batches, since you'd probably have to lock the queue before reading: a plain Queue is only thread-safe for a single reader plus writer.
.NET 4.0 does add a generic ConcurrentQueue<T> (in System.Collections.Concurrent) that handles this for you.
You really only want one reader of the file.
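A minimal sketch of that setup on .NET 4.0, using BlockingCollection (which wraps ConcurrentQueue) so no manual locking is needed:

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

var urls = new BlockingCollection<string>(boundedCapacity: 1000);

// Producer: the single file reader.
var producer = Task.Factory.StartNew(() =>
{
    foreach (var line in File.ReadLines("urls.txt"))
        urls.Add(line);
    urls.CompleteAdding(); // signal consumers that no more items are coming
});

// Consumers: a fixed pool of parsing workers.
var consumers = new Task[8];
for (int i = 0; i < consumers.Length; i++)
{
    consumers[i] = Task.Factory.StartNew(() =>
    {
        foreach (var url in urls.GetConsumingEnumerable())
            Console.WriteLine(parseEngine.Parse(url)); // parseEngine from the question
    });
}

Task.WaitAll(consumers);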
You could use Parallel.ForEach() to schedule work for each item in the list. That would spread the work across all available processors, assuming parseEngine takes some time to run. If parseEngine completes quickly (say, in under 250 ms), increase the number of "on-demand" threads by calling ThreadPool.SetMinThreads(), which will let more threads execute at once.

Speeding up the loading of a List of images

I'm loading a List<Image> from a folder of about 250 images. I did a DateTime comparison and it takes a full 11 seconds to load those 250 images. That's slow as hell, and I'd very much like to speed it up.
The images are on my local hard drive, not even an external one.
The code:
DialogResult dr = imageFolderBrowser.ShowDialog();
if(dr == DialogResult.OK) {
    DateTime start = DateTime.Now;

    //Get all images in the folder and place them in a List<>
    files = Directory.GetFiles(imageFolderBrowser.SelectedPath);
    foreach(string file in files) {
        sourceImages.Add(Image.FromFile(file));
    }

    DateTime end = DateTime.Now;
    timeLabel.Text = end.Subtract(start).TotalMilliseconds.ToString();
}
EDIT: yes, I need all the pictures. The plan is to take the center 30 pixel columns of each and make a new image out of them, kind of like a 360-degree picture. Right now, I'm just testing with random images.
I know there are probably way better frameworks out there for this, but I need this to work first.
EDIT 2: Switched to a Stopwatch; the difference is just a few milliseconds. Also tried Directory.EnumerateFiles, but no difference at all.
EDIT 3: I am running .NET 4, on a 32-bit Windows 7 client.
Do you actually need to load all the images? Can you get away with loading them lazily? Alternatively, can you load them on a separate thread?
You cannot speed up your HDD access and decoding speed. However, a good idea would be to load the images on a background thread.
Perhaps you should consider showing a placeholder until an image is actually loaded.
Caution: you'll need to insert the loaded images on your UI thread anyway!
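A hedged sketch of that approach, using .NET 4 tasks and marshalling the result back to the UI thread (WinForms assumed, given the DialogResult code in the question):

using System.Drawing;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

Task.Factory.StartNew(() =>
        Directory.GetFiles(imageFolderBrowser.SelectedPath)
                 .Select(Image.FromFile)
                 .ToList())
    .ContinueWith(t =>
    {
        sourceImages = t.Result; // back on the UI thread thanks to the scheduler below
        timeLabel.Text = "Loaded " + sourceImages.Count + " images";
    }, TaskScheduler.FromCurrentSynchronizationContext());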
You could use Directory.EnumerateFiles along with PLINQ's AsParallel() to spread the work over as many CPUs as you have:
var directory = "C:\\foo";
var files = Directory.EnumerateFiles(directory, "*.jpg");
var images = files.AsParallel().Select(file => Image.FromFile(file)).ToList();
As loading an image does both file I/O and CPU work, you should get some speedup from using more than one thread.
If you are using .NET 4, tasks would be the way to go.
Given that you likely already know the path (from the dialog box?), you might be better off using Directory.EnumerateFiles and then working with the collection it returns instead of a list.
http://msdn.microsoft.com/en-us/library/dd383458.aspx
[edit]
I just noticed you're also loading the files into your app within the loop - how big are they? Depending on their size, that might actually be a pretty good speed!
Do you need to load them at this point? Can you change some display code elsewhere to load on demand?
You probably can't speed things up much, as the bottleneck is reading the files from disk and parsing them as images.
What you can do, though, is cache the list after it's loaded; any subsequent calls to your code will then be a lot faster.

Parallel programming in C#

I'm interested in learning about parallel programming in C#.NET (not everything there is to know, but the basics and maybe some good practices), so I've decided to rewrite an old program of mine called ImageSyncer. ImageSyncer is a really simple program: all it does is scan through a folder and find all files ending in .jpg, then calculate the new position of each file based on the date it was taken (parsing of EXIF data, or whatever it's called). After a location has been generated, the program checks for an existing file at that location; if one exists, it compares the last write times of the file to copy and the file "in its way". If those are equal the file is skipped. If not, an MD5 checksum of both files is computed and compared. If there is no match, the file to be copied is given a new location to be copied to (for instance, if it was to be copied to "C:\test.jpg" it's copied to "C:\test(1).jpg" instead). The result of this operation is pushed into a queue of a struct type that contains two strings: the original file and the position to copy it to. Then that queue is iterated over until it is empty and the files are copied.
In other words there are 4 operations:
1. Scan directory for jpegs
2. Parse files for xif and generate copy-location
3. Check for file existence and if needed generate new path
4. Copy files
So I want to rewrite this program to make it parallel, able to perform several of the operations at the same time, and I was wondering what the best way to achieve that would be. I've come up with two different models, but neither may be any good. The first is to parallelize the 4 steps of the old program, so that step 1 is executed on several threads, and when all of step 1 is finished, step 2 begins. The other (which I find more interesting, because I have no idea how to do it) is to create a sort of worker/consumer model, so that when a thread is finished with step 1, another one takes over and performs step 2 on that object (or something like that). But as I said, I don't know whether either of these is a good solution. I also don't know much about parallel programming at all. I know how to create a thread and make it run a function taking an object as its only parameter, and I've used the BackgroundWorker class on one occasion, but I'm not that familiar with any of it.
Any input would be appreciated.
There are a few options:
Parallel LINQ: Running Queries On Multi-Core Processors
Task Parallel Library (TPL): Optimize Managed Code For Multi-Core Machines
If you are interested in basic threading primitives and concepts: Threading in C#
[But as @John Knoeller pointed out, the example you gave is likely to be sequential-I/O bound.]
This is the reference I use for C# thread: http://www.albahari.com/threading/
As a single PDF: http://www.albahari.com/threading/threading.pdf
For your second approach:
I've worked on some producer/consumer multithreaded apps where each task is some code that loops forever. An external "initializer" starts a separate thread for each task and initializes an EventWaitHandle for each. Each task has a global queue that is used to produce/consume its input.
In your case, the external program would add each directory to the queue for Task 1 and Set Task 1's EventWaitHandle. Task 1 would "wake up" from its EventWaitHandle, get the count of directories in its queue, and while the count is greater than 0, dequeue a directory, scan it for all the .jpgs, add each .jpg location to a second queue, and set the EventWaitHandle for Task 2. Task 2 reads its input, processes it, and forwards it to a queue for Task 3...
It can be a bit of a pain getting all the locking to work right (I basically lock any access to the queue, even something as simple as getting its count). .NET 4.0 is supposed to have data structures that support a producer/consumer queue with no locks.
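A minimal sketch of the first stage of such a pipeline on .NET 4.0, where ConcurrentQueue removes the need for explicit locking:

using System.Collections.Concurrent;
using System.IO;
using System.Threading;

var dirQueue  = new ConcurrentQueue<string>();
var dirSignal = new AutoResetEvent(false);
var jpgQueue  = new ConcurrentQueue<string>();
var jpgSignal = new AutoResetEvent(false);

// Task 1: sleep until signalled, drain the directory queue, feed the jpg queue.
var task1 = new Thread(() =>
{
    while (true)
    {
        dirSignal.WaitOne();
        string dir;
        while (dirQueue.TryDequeue(out dir))
        {
            foreach (var jpg in Directory.GetFiles(dir, "*.jpg"))
            {
                jpgQueue.Enqueue(jpg);
                jpgSignal.Set(); // wake Task 2 (not shown), which parses the EXIF data
            }
        }
    }
});
task1.IsBackground = true;
task1.Start();

// The external initializer seeds the pipeline:
dirQueue.Enqueue(@"C:\photos");
dirSignal.Set();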
Interesting problem.
I came up with two approaches: the first is based on PLINQ, and the second on the Rx Framework.
The first one iterates through the files in parallel.
The second one generates the files from the directory asynchronously.
Here is how it looks in a much simplified version (the first method requires .NET 4.0, since it uses PLINQ):
string directory = "Mydirectory";
var jpegFiles = System.IO.Directory.EnumerateFiles(directory, "*.jpg");

// -- PLinq --------------------------------------------
jpegFiles
    .AsParallel()
    .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
    .ForAll(fileInfo =>
    {
        if (!File.Exists(fileInfo.NewLocation) ||
            File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
            File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
    });
// -----------------------------------------------------

// -- Rx Framework --------------------------------------
var resetEvent = new AutoResetEvent(false);
var doTheWork =
    jpegFiles.ToObservable()
        .Select(imageFile => new { OldLocation = imageFile, NewLocation = GenerateCopyLocation(imageFile) })
        .Subscribe(fileInfo =>
        {
            if (!File.Exists(fileInfo.NewLocation) ||
                File.GetCreationTime(fileInfo.OldLocation) != File.GetCreationTime(fileInfo.NewLocation))
                File.Copy(fileInfo.OldLocation, fileInfo.NewLocation);
        }, () => resetEvent.Set());
resetEvent.WaitOne();
doTheWork.Dispose();
// -----------------------------------------------------
