I have a process I need to optimize, and I was wondering how long a multiplication of two doubles takes. If I can cut out 1000 of these, will it actually make a difference to the overall performance of my process?
This is highly system specific. On my system, it only takes a few milliseconds to do 10 million multiplication operations. Removing 1000 is probably not going to be noticeable.
If you really want to optimize your routine, this isn't the best approach. The better approach is to profile it, and find the bottleneck in your current implementation (which will likely not be what you expect). Then look at that bottleneck, and try to come up with a better algorithm. Focus on overall algorithms first, and optimize those.
If it's still too slow, then you can start trying to optimize the actual routine in the slower sections or the ones called many times, first.
The only effective means of optimizing is to measure first (and after!).
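To put a number on "a few milliseconds" for your own machine, a minimal sketch using Stopwatch might look like the following. The class and method names are made up for illustration, and a micro-benchmark like this only gives a rough order of magnitude, not a precise per-operation cost.

```csharp
using System;
using System.Diagnostics;

static class MultiplyTiming
{
    // Multiplies `count` pairs of doubles and returns the elapsed time.
    // The running sum is handed back through `sink` so the JIT cannot discard the loop.
    public static TimeSpan TimeMultiplications(int count, out double sink)
    {
        double a = 1.000001, b = 0.999999, sum = 0.0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++)
        {
            sum += a * b;
            a += 1e-9; // vary one operand so the product is not loop-invariant
        }
        sw.Stop();
        sink = sum;
        return sw.Elapsed;
    }

    // Runs one untimed warm-up pass (so JIT compilation is excluded), then reports a timed pass.
    public static void Report()
    {
        double sink;
        TimeMultiplications(1000000, out sink);
        TimeSpan elapsed = TimeMultiplications(10000000, out sink);
        Console.WriteLine("10,000,000 multiplications: {0:F2} ms (sink={1})",
            elapsed.TotalMilliseconds, sink);
    }
}
```

Calling MultiplyTiming.Report() from a Release build, outside the debugger, gives the most representative figure.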
That entirely depends on the size of the factors. I can do single-digit multiplication (e.g. 7×9) in my head in a fraction of a second, whereas it would take me a few minutes to compute 365286×475201.
Modern Intel CPUs perform tens of billions of floating-point multiplies per second. I wouldn't worry about 1000 if I were you.
Intel doc showing FLOP performance of their CPUs
This depends on various things, like the CPU you are using, the other processes currently running, what the JIT does, and so on.
The only reliable way to answer this question is to use a profiler and measure the effect of your optimization.
I've been trying to get a deep understanding of how these concepts relate. Let me give a simple example and explain what I'm thinking so that you can correct it.
Let's say I want to try to sort two arrays
int[] A = { ... }; // very large, very unsorted
int[] B = { ... }; // very large, very unsorted
by sorting each of them "as parallel as my system will allow me to sort them." I take advantage of the fact that a Parallel.ForEach does a lot of stuff under the hood, and I simply write
var arrays = new List<int[]> { A, B };
Parallel.ForEach(arrays, (arr) => { Array.Sort(arr); });
Now let's say I compile and run it on machines with the following specs:
1 processor, 1 core
1 processor, multiple cores
2 processors, at least one core on each
In case 1, there is absolutely no possibility of a performance gain. It sorts A, then sorts B, just like it would in a regular foreach loop.
In case 2, there is also no performance gain because unless you have multiple processors then your machine can not literally "do more than 1 thing at once." Even if it ends up sorting them in different threads, the CPU that controls the threads does a little sorting of A, a little sorting of B, a little more of A, etc., which can't be more efficient than just sorting all of A and then all of B.
Case 3 is the only one with a possibility of a performance gain, for the reason mentioned in the previous case.
Can someone critique my understanding? How right or wrong is this? (I didn't major in computer science. So please grade me on a curve.)
In case 1... It sorts A, then sorts B
That is not how threading works. The OS rapidly context-switches between the two threads. On Windows that happens by default 64/3 times per second (roughly 21 times). The interleaving makes it look like A and B get sorted at the same time. That is not easy to observe: the debugger would have to give you a look inside Array.Sort(), and it won't. It is not faster, of course, but the slowdown is fairly minor. It is the cheap kind of context switch, with no need to reload the page mapping tables since the threads belong to the same process. You only pay for the possibly trashed cache, and adding ~5 microseconds per 3/64 second (about 0.01% slower) is quite hard to measure accurately.
In case 2, ...then your machine can not literally "do more than 1 thing at once"
It can: each core can execute Sort() concurrently. That is largely the point of multi-core processors. They do, however, have to share a single resource, the memory bus. What matters a great deal is the size of the arrays and the speed of the RAM chips. Large arrays don't fit in the processor caches, so it is technically possible for the memory bus to get saturated by requests from the processor cores. The element type does not help in this case: comparing two int values is very fast since it takes only a single CPU instruction. The expectation is a 2x speed-up, but if you observe it taking longer then you know that the RAM is the bottleneck.
Case 3 is the only one with a possibility of a performance gain
Not likely. Multiple-processor machines often have a NUMA architecture, giving each processor its own memory bus. The interconnect between them might be used to shovel data from one bus to another. But such processors also have multiple cores. It is the OS's job to figure out how to use them effectively. And since the threads belong to the same process, and so share data, it will strongly favor scheduling the threads on the cores of the same processor and avoid putting a load on the interconnect. So the expectation is that it will perform the same as case 2.
These are rough guidelines, modern machine design demands that you actually measure.
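Since the answer comes down to measuring, here is a minimal sketch of how the two approaches from the question could be timed against each other. The class name, the helper methods and the seeded random test data are assumptions made for illustration; each run sorts copies so both variants see the same unsorted input.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

static class SortTiming
{
    // Fills an array with pseudo-random values from a fixed seed (hypothetical test data).
    public static int[] MakeData(int length, int seed)
    {
        var rng = new Random(seed);
        var data = new int[length];
        for (int i = 0; i < data.Length; i++) data[i] = rng.Next();
        return data;
    }

    // Sorts copies of both arrays one after the other.
    public static TimeSpan TimeSequential(int[] a, int[] b)
    {
        var x = (int[])a.Clone();
        var y = (int[])b.Clone();
        var sw = Stopwatch.StartNew();
        Array.Sort(x);
        Array.Sort(y);
        sw.Stop();
        return sw.Elapsed;
    }

    // Sorts copies of both arrays with Parallel.ForEach, as in the question.
    public static TimeSpan TimeParallel(int[] a, int[] b)
    {
        var x = (int[])a.Clone();
        var y = (int[])b.Clone();
        var sw = Stopwatch.StartNew();
        Parallel.ForEach(new[] { x, y }, arr => Array.Sort(arr));
        sw.Stop();
        return sw.Elapsed;
    }
}
```

Running both methods several times on arrays of realistic size, and comparing the results on the actual target machine, shows which of the three cases you are really in.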
I'm using a parallel for loop in my code to run a long running process on a large number of entities (12,000).
The process parses a string, goes through a number of input files (I've read that given the number of IO based things the benefits of threading could be questionable, but it seems to have sped things up elsewhere) and outputs a matched result.
Initially, the process goes quite quickly - however it ends up slowing to a crawl. It's possible that it's just hit a number of particularly tricky input data, but this seems unlikely looking closer at things.
Within the loop, I added some debug code that prints "Started Processing: " and "Finished Processing: " when it begins/ends an iteration and then wrote a program that pairs a start and a finish, initially in order to find which ID was causing a crash.
However, looking at the number of unmatched IDs, it looks like the program is processing in excess of 400 different entities at once. With the large amount of IO, this seems like it could be the source of the issue.
So my question(s) is(are) this(these):
Am I interpreting the unmatched IDs properly, or is there some clever stuff going on behind the scenes that I'm missing, or even something obvious?
If you'd agree what I've spotted is correct, how can I limit the number it spins off and does at once?
I realise this is perhaps a somewhat unorthodox question and may be tricky to answer given there is no code, but any help is appreciated and if there's any more info you'd like, let me know in the comments.
Without seeing some code, I can guess at the answers to your questions:
Unmatched IDs indicate to me that the thread that is processing that data is being de-prioritized. This could be due to IO or the thread pool trying to optimize, however it seems like if you are strongly IO bound then that is most likely your issue.
I would take a look at Parallel.For, specifically using ParallelOptions.MaxDegreeOfParallelism to limit the maximum number of concurrent tasks to a reasonable number. I would suggest trial and error to determine the optimum degree, starting around the number of processor cores you have.
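As a rough sketch of that suggestion, the helper below caps a Parallel.For loop with ParallelOptions.MaxDegreeOfParallelism and also reports the highest number of iterations it saw running at once, which is a more direct check than pairing log lines. The class name, method name and the peak-tracking logic are illustrative assumptions, not part of the original code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class BoundedLoop
{
    // Runs `body` for ids 0..count-1 with at most `maxParallel` iterations in flight,
    // and returns the highest number of iterations observed running at once.
    public static int Run(int count, int maxParallel, Action<int> body)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };
        int running = 0, peak = 0;
        Parallel.For(0, count, options, i =>
        {
            int now = Interlocked.Increment(ref running);
            // Raise the recorded peak with a compare-and-swap loop.
            int seen;
            while (now > (seen = Volatile.Read(ref peak)))
                Interlocked.CompareExchange(ref peak, now, seen);
            try { body(i); }
            finally { Interlocked.Decrement(ref running); }
        });
        return peak;
    }
}
```

A reasonable starting point is BoundedLoop.Run(12000, Environment.ProcessorCount, ProcessEntity), where ProcessEntity is a placeholder for the real per-entity work, and then adjusting the limit based on measured run times.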
Good luck!
Let me start by confirming that it is indeed a very bad idea to read 2 files at the same time from a hard drive (at least until the majority of HDs out there are SSDs), let alone whichever number your whole thing is using.
The use of parallelism serves to optimize processing using an actually parallelizable resource, which is CPU power. If your parallelized process reads from a hard drive then you're losing most of the benefit.
And even then, CPU power is not open to infinite parallelization. A normal desktop CPU can run up to about 10 threads at the same time (it depends on the model, obviously, but that's the order of magnitude).
So two things
First, I am going to assume that your entities use all your files, but that your files are not too big to be loaded into memory. If that's the case, you should read your files into objects (i.e. into memory), then parallelize the processing of your entities using those objects. If not, you're basically relying on your hard drive's cache to not reread your files every time you need them, and your hard drive's cache is far smaller than your memory (1000-fold).
Second, be careful running an unbounded Parallel.For over 12,000 items. Parallel.For does not create one thread per item, but when iterations block on IO the thread pool keeps adding threads, so you can end up with far more than 10 iterations in flight at once. That is actually worse than 10 threads, because of the big overhead that the extra threads create, and the fact that your CPU will not benefit from them at all since it cannot run more than about 10 threads at a time.
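The first point could be sketched roughly as follows: read every input file once, sequentially, and let the parallel part work only on the in-memory copy. The class, the method names and the simple "count lines containing a string" matching are placeholders; the real matching logic is not shown in the question.

```csharp
using System.Collections.Generic;
using System.IO;

static class InputCache
{
    // Reads every file once, sequentially, and keeps its lines in memory.
    public static Dictionary<string, string[]> LoadAll(IEnumerable<string> paths)
    {
        var cache = new Dictionary<string, string[]>();
        foreach (string path in paths)
            cache[path] = File.ReadAllLines(path);
        return cache;
    }

    // CPU-only lookup against the cached lines; stands in for the real matching logic.
    public static int CountMatches(Dictionary<string, string[]> cache, string needle)
    {
        int count = 0;
        foreach (string[] lines in cache.Values)
            foreach (string line in lines)
                if (line.Contains(needle)) count++;
        return count;
    }
}
```

Once LoadAll has finished and nothing writes to the dictionary any more, many threads can call CountMatches on it at the same time, so the per-entity work inside the parallel loop no longer touches the disk.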
You should probably use a more efficient method, which is the IEnumerable<T>.AsParallel() extension (it comes with .NET 4.0). At runtime, it determines the optimal number of threads to run, then divides your enumerable into as many batches. Basically, it does the job for you, but it creates a big overhead too, so it's only useful if the processing of one element is actually costly for the CPU.
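A minimal PLINQ sketch of that idea is shown below. Process is a trivial stand-in for the expensive per-entity work, and the explicit WithDegreeOfParallelism call is optional: without it, PLINQ picks a degree itself.

```csharp
using System.Linq;

static class PlinqExample
{
    // Stand-in for the costly per-entity computation (hypothetical).
    public static int Process(int entityId)
    {
        return entityId * 2;
    }

    // Processes all ids in parallel and returns the results in input order.
    public static int[] ProcessAll(int[] ids, int degree)
    {
        return ids.AsParallel()
                  .AsOrdered()                      // keep results aligned with the input
                  .WithDegreeOfParallelism(degree)  // upper bound on concurrent workers
                  .Select(id => Process(id))
                  .ToArray();
    }
}
```

For example, PlinqExample.ProcessAll(ids, Environment.ProcessorCount) would use at most one worker per logical core.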
From my experience, using anything parallel should always be evaluated against not using it in real-life, i.e. by actually profiling your application. Don't assume it's going to work better.
I have a nested for loop.
I have replaced the first for with a Parallel.For() and the speed of calculation increased.
My question is about replacing the second for (the inner one) with a Parallel.For(). Will it increase the speed, make no difference, or make it slower?
Edit:
Since the number of cores is limited (usually 2 to 8) and the outer loop already runs in parallel, I'm not sure how changing the inner for to a Parallel.For() as well would affect performance and speed.
From "Too fine-grained, too coarse-grained" subsection, "Anti-patterns" section in "Patterns of parallel programming" book by .NET parallel computing team:
The answer is that the best balance is found through performance
testing. If the overheads of parallelization are minimal as compared
to the work being done, parallelize as much as possible: in this case,
that would mean parallelizing both loops. If the overheads of
parallelizing the inner loop would degrade performance on most
systems, think twice before doing so, as it’ll likely be best only to
parallelize the outer loop.
Take a look at that subsection; it is self-contained, with detailed examples from a parallel ray tracing application. Its suggestion of flattening the loops to get a better degree of parallelism may be helpful for you too.
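Flattening can be sketched like this: instead of nesting one Parallel.For inside another, run a single Parallel.For over rows * cols indices and recover the two loop indices from the flat one. The class name and the per-cell formula are placeholders for illustration, not taken from the book.

```csharp
using System.Threading.Tasks;

static class FlattenedLoop
{
    // Fills a rows x cols grid using ONE parallel loop over rows * cols indices.
    public static double[,] Compute(int rows, int cols)
    {
        var result = new double[rows, cols];
        Parallel.For(0, rows * cols, index =>
        {
            int i = index / cols; // recover the outer loop index
            int j = index % cols; // recover the inner loop index
            result[i, j] = i * 10.0 + j; // placeholder for the real per-cell work
        });
        return result;
    }
}
```

This gives the scheduler rows * cols independent units to balance across cores, which helps most when the outer loop alone has fewer iterations than there are cores.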
Again, it depends on several factors:
The number of parallel threads your CPU can run.
The number of iterations.
If your CPU is a single-core processor, you will not get any benefit.
If the number of iterations is large, you will get some improvement.
If there are just a few iterations, it will be slower because of the extra overhead.
It depends a lot on the data and functions you use inside the for loop, and on the machine. I have been experimenting lately with Parallel.For and Parallel.ForEach and found that they made my apps even slower (on a 4-core machine; a 24-core server is probably another story).
I think that managing the threads adds too much overhead.
Even MS, in their documentation (here is a very long PDF on MSDN about it: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=19222), admits it doesn't always make apps run faster. You have to try every time; if it works, great, and if not, bad luck.
You should try with both the outer and the inner for, but at least in the apps I tried, neither made the app faster. Outer or inner didn't matter much; I was just getting the same execution times or even worse.
Maybe if you use concurrent collections too, you will get better performance. But again, without trying there is no way to tell.
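For reference, this is roughly what using a concurrent collection inside a parallel loop looks like. The class and method are made up for illustration; whether this is faster than a lock around a List<T> in a given app is exactly the kind of thing that has to be measured.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

static class CollectResults
{
    // Gathers the even numbers from `source`; the bag can be added to from many threads at once.
    public static ConcurrentBag<int> EvenNumbers(IEnumerable<int> source)
    {
        var bag = new ConcurrentBag<int>();
        Parallel.ForEach(source, n =>
        {
            if (n % 2 == 0) bag.Add(n); // no explicit lock needed
        });
        return bag;
    }
}
```

Note that a ConcurrentBag does not preserve the order in which the items were produced.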
EDIT:
I just found a nice link on MSDN that proved to be very useful (in my case) for improving Parallel.ForEach performance:
http://msdn.microsoft.com/en-us/library/dd560853.aspx
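One technique commonly used when the loop body is very small is to hand each worker a whole range of indices via Partitioner.Create, so that the delegate is invoked once per chunk rather than once per element. Whether this is the technique the linked page describes is an assumption on my part; the class and method below are only a sketch.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

static class ChunkedLoop
{
    // Computes i * i for 0 <= i < n, processing the indices in chunks.
    public static double[] Squares(int n)
    {
        if (n <= 0) return new double[0]; // Partitioner.Create rejects empty ranges
        var output = new double[n];
        // Each `range` is a Tuple<int, int> covering [Item1, Item2).
        Parallel.ForEach(Partitioner.Create(0, n), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
                output[i] = (double)i * i;
        });
        return output;
    }
}
```

The inner for loop is an ordinary sequential loop, so the per-iteration overhead of the parallel machinery is paid once per chunk.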
I am about to start a project which will take blocks of text and parse a lot of data out of them into some sort of object which can then be serialized and stored, and from which statistics / data can be gleaned. This needs to be as fast as possible, as I have > 10,000,000 blocks of text that I need to start on and will be getting hundreds of thousands a day.
I am running this on a system with 12 Xeon cores + hyper-threading. I also have access to / know a bit about CUDA programming, but for string stuff I think that it's not appropriate. From each string I need to parse a lot of data; some of it I know the exact positions of, some I don't and need to use regexes / something smart.
So consider something like this:
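using System.Threading.Tasks;

object[] ParseAll(string[] stringsToParse)
{
    var results = new object[stringsToParse.Length];
    // Each iteration writes only its own slot, so no locking is needed.
    Parallel.For(0, stringsToParse.Length, n =>
    {
        results[n] = Parse(stringsToParse[n]);
    });
    return results;
}

object Parse(string s)
{
    // Try to use exact positions / Substring etc. here instead of regexes.
    return null; // placeholder for the parsed object
}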
So my questions are:
How much slower is using regexes compared to substr?
Is .NET going to be significantly slower than other languages?
What sort of optimizations (if any) can I do to maximize parallelism?
Anything else I haven't considered?
Thanks for any help! Sorry if this is long winded.
How much slower is using regexes compared to substr?
If you are looking for an exact string, substr will be faster. Regular expressions, however, are highly optimized. With RegexOptions.Compiled they are compiled to IL, and you can even store these compiled versions in a separate assembly using Regex.CompileToAssembly. See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more information.
What you really need to do is perform measurements. Using something like Stopwatch is by far the easiest way to verify whether one code construct or the other works faster.
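A measurement along those lines might be sketched as follows. The input layout "id=NNNN;..." is a made-up example, and so are the class and method names; the point is only the shape of the comparison, with one untimed warm-up call before the timed loop.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class ParseTiming
{
    // Times `iterations` calls of `action` after one untimed warm-up call.
    public static TimeSpan Time(Action action, int iterations)
    {
        action();
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.Elapsed;
    }

    static readonly Regex IdPattern = new Regex(@"id=(\d+)", RegexOptions.Compiled);

    // Extracts the digits after "id=" with a compiled regular expression.
    public static string IdViaRegex(string s)
    {
        return IdPattern.Match(s).Groups[1].Value;
    }

    // Assumes the hypothetical layout "id=NNNN;...": digits start at index 3
    // and run up to the first ';'.
    public static string IdViaSubstring(string s)
    {
        return s.Substring(3, s.IndexOf(';') - 3);
    }
}
```

Comparing ParseTiming.Time(() => ParseTiming.IdViaRegex(sample), 1000000) with the same call for IdViaSubstring, on a representative sample string, shows how large the difference is for your data.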
What sort of optimizations (if any) can I do to maximize parallelism?
With Task.Factory.StartNew, you can schedule tasks to run on the thread pool. You may also have a look at the TPL (Task Parallel Library, of which Task is a part). This has lots of constructs that help you parallelize work and allows constructs like Parallel.ForEach() to execute an iteration on multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more information.
Anything else I haven't considered?
One of the things that will hurt you with this volume of data is memory management. A few things to take into account:
Limit memory allocation: try to re-use the same buffers for a single document instead of copying them when you only need a part. Say you need to work on a range from char 1000 to 2000: don't copy that range into a new buffer, but construct your code to work only in that range. This will make your code more complex, but it saves you memory allocations;
StringBuilder is an important class. If you don't know of it yet, have a look.
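As a small illustration of working inside a range instead of copying it, the sketch below reads an integer directly out of a larger string by index, so no Substring (and therefore no new string) is created. The class and method are hypothetical, and the code assumes the range contains only ASCII digits and that the value fits in an int.

```csharp
static class RangeParsing
{
    // Parses the non-negative decimal integer in s[start .. start + length)
    // without allocating a substring. Assumes the range holds only ASCII digits.
    public static int ParseInt(string s, int start, int length)
    {
        int value = 0;
        for (int i = start; i < start + length; i++)
            value = value * 10 + (s[i] - '0');
        return value;
    }
}
```

The same pattern (pass the original string plus a start and a length) applies to other fields at known positions.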
I don't know what kind of processing you're doing here, but if you're talking hundreds of thousands of strings per day, it seems like a pretty small number. Let's assume that you get 1 million new strings to process every day, and you can fully task 10 of those 12 Xeon cores. That's 100,000 strings per core per day. There are 86,400 seconds in a day, so we're talking 0.864 seconds per string. That's a lot of parsing.
I'll echo the recommendations made by @Pieter, especially where he suggests making measurements to see how long it takes to do your processing. Your best bet is to get something up and working, then figure out how to make it faster if you need to. I think you'll be surprised at how often you don't need to do any optimization. (I know that's heresy to the optimization wizards, but processor time is cheap and programmer time is expensive.)
How much slower is using regexes compared to substr?
That depends entirely on how complex your regexes are. As @Pieter said, if you're looking for a single string, String.Contains will probably be faster. You might also consider using String.IndexOfAny if you're looking for any of a set of constant characters. Regular expressions aren't necessary unless you're looking for patterns that can't be represented as constant strings.
Is .NET going to be significantly slower than other languages?
In processor-intensive applications, .NET can be slower than native apps. Sometimes. If so, it's typically in the range of 5 to 20 percent, and most often between 7 and 12 percent. That's just the code executing in isolation. You have to take into account other factors like how long it takes you to build the program in that other language and how difficult it is to share data between the native app and the rest of your system.
Google recently announced its internal text processing language (which seems like a Python/Perl subset made for heavily parallel processing).
http://code.google.com/p/szl/ - Sawzall.
If you want to do fast string parsing in C#, you might want to have a look at the new NLib project. It contains string extensions for rapidly searching strings in various ways, such as IndexOfAny(string[]) and IndexOfNotAny. They have overloads with a StringComparison argument too.
I am developing a program in C#, and thanks to the MATLAB .NET builder, I am using the MATLAB Mapping Toolbox function "polybool", which in one of its options calculates the difference of 2 polygons in 2-D.
The problem is that the function takes about 0.01 seconds to finish, which is bad for me because I call it a lot.
And this doesn't make sense at all, because the polygons are 5 points each, so there is no way that it should take 0.01 seconds to find the results.
Does anyone have any ideas?
How are you computing the 0.01 seconds? If this is total operational time, it may very well be the marshaling in and out of the toolbox functionality, which will take some time. The actual routine may be running quickly, but getting your data from C# into the routine, and the results back, will have some overhead involved with the process.
Granted, this overhead probably scales well: since it's most likely (mostly) constant, if you start dealing with larger polygons, you'll probably see your overall efficiency scale very well.