Multithreading BigInteger operations - C#

Very complex BigInteger operations are very slow, e.g.
BigInteger.Pow(new BigInteger(2), 3231233282348);
I was wondering if there is any way I could multithread any of these basic math functions.

It depends on the math function, but I really can't see how you could speed up basic math functions. For these sorts of calculations, the next step in the process typically depends on the previous step. Threading only really helps where you have portions of a calculation that can be computed independently; these can then be combined in a final step to produce the result. You would need to break the calculation up yourself into the portions that can run concurrently.
For example, if you had a formula like 2 * 3 + 3 * 4, you could run two threads: the first calculating 2 * 3 and the second 3 * 4. You could then pull the results together at the end and sum them. You would need to work out how to break your calculation down into smaller pieces and then thread those accordingly.
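For illustration, a minimal sketch of that idea using Task.Run (the products are trivial here, so any real gain would need much heavier work per task):

using System;
using System.Threading.Tasks;

class PartialProducts
{
    static void Main()
    {
        // Compute the two independent products concurrently...
        Task<int> left = Task.Run(() => 2 * 3);
        Task<int> right = Task.Run(() => 3 * 4);

        // ...then combine the partial results in a final step.
        Console.WriteLine(left.Result + right.Result); // 18
    }
}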
In your power example you could work out the following in 4 threads and then combine the results at the end by multiplying them together:
BigInteger.Pow(new BigInteger(2), 807808320587);
BigInteger.Pow(new BigInteger(2), 807808320587);
BigInteger.Pow(new BigInteger(2), 807808320587);
BigInteger.Pow(new BigInteger(2), 807808320587);
This would not save you any time at all, because all 4 cores would be thrashing around working out the same thing, and you would just multiply the results by each other at the end, which is what a single-threaded solution would do anyway. It would even be much slower on some processors, as they will often speed up one core when the others are idle. I broke this up the same way you would break 2^5 into 2^2 * 2^3.

The answer to
BigInteger.Pow(new BigInteger(2), 3231233282348);
will contain
Log(2) / Log(10) * 3231233282348 ≈ 9.727e11
digits, so it would take roughly 900 GB just to write the answer out. That's why it is so slow.
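For illustration, a quick sketch of that size estimate, computed from the logarithm alone without ever touching BigInteger:

using System;

class SizeEstimate
{
    static void Main()
    {
        long exponent = 3231233282348;
        // Decimal digits of 2^exponent ≈ exponent * log10(2)
        double digits = exponent * Math.Log10(2);
        Console.WriteLine($"{digits:E3} digits");
        Console.WriteLine($"{digits / 1e9:F0} GB of decimal text at one byte per digit");
    }
}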

If you're using .NET 4.5, read about async/await:
http://blog.stephencleary.com/2012/02/async-and-await.html

Related

TPL Dataflow controlling max degree of parallelism between nested blocks

Let's say I have an action block A that perform some work in parallel with a max degree of parallelism of 4.
Say I have a case where action block A is doing work X in some cases and work Y in others. X is some small piece of work; Y is some larger piece of work that needs to be split into smaller chunks, and therefore I need to parallelise those too.
Inside work Y I therefore need to parallelise the chunks to a max degree of 4, but at this point I might have 4 A blocks executing in parallel, which could lead, for example, to "A-X, A-X, A-Y, A-Y" running in parallel. This would result in 1 + 1 + 4 + 4 parallel tasks, which is too many for my purpose, as I want to keep the overall limit at 4 (or any other chosen number).
Is there a way to control the maximum degree of parallelism including nested blocks?
While creating a block in TPL Dataflow, you can specify a custom scheduler for the block via its options.
An easy way to limit the number of concurrent tasks and the overall concurrency level is to use a ConcurrentExclusiveSchedulerPair, constructed with the parameters you need.
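For example, a rough sketch of wiring a shared scheduler pair into a dataflow block so the overall concurrency stays capped (the block name and the work body are placeholders):

using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

class Example
{
    static void Main()
    {
        // One scheduler pair whose concurrent scheduler allows at most 4 tasks in flight.
        var schedulerPair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, 4);

        var options = new ExecutionDataflowBlockOptions
        {
            TaskScheduler = schedulerPair.ConcurrentScheduler,
            MaxDegreeOfParallelism = 4
        };

        var blockA = new ActionBlock<int>(item =>
        {
            // work X or work Y; any nested parallel chunks should also be scheduled
            // on schedulerPair.ConcurrentScheduler so they count against the same cap
        }, options);

        for (int i = 0; i < 100; i++) blockA.Post(i);
        blockA.Complete();
        blockA.Completion.Wait();
    }
}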

Arbitrary precision arithmetic with very big factorials

This is a mathematical exercise, not a program meant to be useful!
I want to compute factorials of very big numbers (10^n where n > 6).
I got as far as arbitrary precision, which is very helpful for tasks like 1000!, but it obviously dies (StackOverflowException :) ) at much higher values. I'm not looking for a direct answer, just some clues on how to proceed further.
static BigInteger factorial(BigInteger i)
{
    if (i < 1)
        return 1;
    else
        return i * factorial(i - 1);
}

static void Main(string[] args)
{
    long z = (long)Math.Pow(10, 12);
    Console.WriteLine(factorial(z));
    Console.Read();
}
Would I have to give up on System.Numerics.BigInteger? I was thinking of some way of storing the necessary data in files, since RAM will obviously run out. Optimization is very important at this point. So what would you recommend?
Also, I need the values to be as precise as possible. I forgot to mention that I don't need all of those digits, just the last 20 or so.
As other answers have shown, the recursion is easily removed. Now the question is: can you store the result in a BigInteger, or are you going to have to go to some sort of external storage?
The number of bits you need to store n! is roughly proportional to n log n. (This is a weak form of Stirling's Approximation.) So let's look at some sizes: (Note that I made some arithmetic errors in an earlier version of this post, which I am correcting here.)
(10^6)! takes order of 2 x 10^6 bytes = a few megabytes
(10^12)! takes order of 3 x 10^12 bytes = a few terabytes
(10^21)! takes order of 10^22 bytes = ten billion terabytes
A few megs will fit into memory. A few terabytes is easily within your grasp but you'll need to write a memory manager probably. Ten billion terabytes will take the combined resources of all the technology companies in the world, but it is doable.
Now consider the computation time. Suppose we can perform a million multiplications per second per machine and that we can parallelize the work out to multiple machines somehow.
(10^6)! takes order of one second on one machine
(10^12)! takes order of 10^6 seconds on one machine =
10 days on one machine =
a few minutes on a thousand machines.
(10^21)! takes order of 10^15 seconds on one machine =
30 million years on one machine =
3 years on 10 million machines
1 day on 10 billion machines (each with a TB drive.)
So (10^6)! is within your grasp. (10^12)! you are going to have to write your own memory manager and math library, and it will take you some time to get an answer. (10^21)! you will need to organize all the resources of the world to solve this problem, but it is doable.
Or you could find another approach.
The solution is easy: Calculate the factorials without using recursion, and you won't blow out your stack.
I.e. you're not getting this error because the numbers are too large, but because you have too many levels of function calls. And fortunately, for factorials there's no reason to calculate them recursively.
Once you've solved your stack problem, you can worry about whether your number format can handle your "very big" factorials. Since you don't need the exact values, use one of the many efficient numeric approximations (which you can count on to get all of the most significant digits right). The most common one is Stirling's approximation:
n! ≈ n^n * e^(-n) * sqrt(2 * π * n)
The formula is from this page, where you'll find discussion and a second, more accurate formula (although "in most cases the difference is quite small", they say). Of course this number is still too large for you to store, but now you can work with logarithms and drop the unimportant digits before you extract the number. Or use the Wikipedia version of the approximation, which is already expressed as a logarithm.
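For illustration, a small sketch of working with the logarithmic form rather than the value itself (the choice of n = 10^12 matches the question):

using System;

class StirlingSketch
{
    static void Main()
    {
        double n = 1e12;

        // ln(n!) ≈ n*ln(n) - n + 0.5*ln(2*pi*n)  (Stirling), converted to base 10.
        double log10Factorial = (n * Math.Log(n) - n + 0.5 * Math.Log(2 * Math.PI * n)) / Math.Log(10);

        // The magnitude alone tells you (10^12)! has roughly 1.16e13 decimal digits,
        // far too many to ever materialize as an exact integer.
        Console.WriteLine($"log10((10^12)!) ~ {log10Factorial:E4}");
    }
}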
Unroll recursion:
static BigInteger factorial(BigInteger n)
{
    BigInteger res = 1;
    for (BigInteger i = 2; i <= n; ++i)
        res *= i;
    return res;
}

C# Linq slower than PHP? Solving riddle #236A

I'm practicing solving olympiad-style programming puzzles on one site.
I have provided two solutions:
- C#
http://ideone.com/exF1HJ
- PHP
http://ideone.com/WbaPHY
I was confused when the online judge showed that the PHP version was faster!
Why?
C#: 109 ms 3000 Kb
PHP: 45 ms 0 Kb
How could it be?
Given the programs shown, the execution time of the important bit of the program - finding the unique characters - would definitely not take 109 ms. It sounds like whatever "online judgement" is involved is measuring total execution time, including process startup, JITting in the case of .NET, etc.
It's a bit like asking which car gets out of a garage faster, and thinking that represents the speed of the car.
Now it's entirely possible that PHP's array_unique function really is very fast, possibly faster than LINQ... but basically you can't get any useful information out of the benchmark results. You should be looking for benchmarks which execute for seconds rather than milliseconds, and which don't include startup/warm-up time, unless that's what you're particularly interested in.
Your C# version creates three arrays that you don't seem to need. You could replace it with:
string input = Console.ReadLine();
int charCount = input.Distinct().Count();
if(charCount % 2 == 0) ...
The following is probably quicker still:
int charCount = new HashSet<char>(input).Count;

Running maximum threads: Automatic performance adjustment

I'm developing an app that scans thousands of copies of a struct (~1 GB of RAM). Speed is important.
ParallelScan(_from, _to); //In a new thread
I manually adjust the thread count:
if (myStructs.Count == 0) { threads = 0; }
else if (myStructs.Count < 1 * Number.Thousand) { threads = 1; }
else if (myStructs.Count < 3 * Number.Thousand) { threads = 2; }
else if (myStructs.Count < 5 * Number.Thousand) { threads = 4; }
else if (myStructs.Count < 10 * Number.Thousand) { threads = 8; }
else if (myStructs.Count < 20 * Number.Thousand) { threads = 12; }
else if (myStructs.Count < 30 * Number.Thousand) { threads = 20; }
else if (myStructs.Count < 50 * Number.Thousand) { threads = 30; }
else threads = 40;
I just wrote it from scratch, and I would need to modify it for a different CPU, etc. I think I could write smarter code that dynamically starts a new thread when the CPU has spare capacity at that moment:
If the CPU is not at 100%, start N threads
Measure CPU or thread processing time & adjust/estimate N
Loop until the whole struct array has been scanned
Does anyone think "I did something similar" or "I have a better idea"?
UPDATE: The solution
Parallel.For(0, myStructs.Count, x =>
{
    // The second bound of Parallel.For is exclusive, so pass Count rather than Count - 1.
    ParallelScan(x, x); // Will be ParallelScan(x);
});
I did trim tons of code. Thanks people!
UPDATE 2: Results
Scan time for 10K templates
1 Thread: 500 ms
10 Threads: 300 ms
40 Threads: 600 ms
Tasks: 100 ms
The standard answer: use Tasks (TPL), not Threads. Tasks require .NET 4.
Your ParallelScan could just use Parallel.ForEach(...) or PLINQ (.AsParallel()).
The TPL framework includes a scheduler, and ForEach() uses a partitioner, to adapt to CPU cores and load. Your problem is most likely solved with the standard components, but you can write custom schedulers and partitioners.
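For example, a rough sketch along those lines (the struct and the scan body are placeholders for your own):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ScanSketch
{
    // Placeholder struct standing in for the poster's data.
    struct MyStruct { public int Value; }

    static void Main()
    {
        var myStructs = new List<MyStruct>(new MyStruct[10000]);

        // Let the TPL's scheduler and partitioner decide how many workers to use.
        Parallel.ForEach(myStructs, s =>
        {
            // ... per-item scan work ...
        });

        // Equivalent PLINQ form, with an optional explicit cap:
        var hits = myStructs.AsParallel()
                            .WithDegreeOfParallelism(Environment.ProcessorCount)
                            .Where(s => s.Value > 0)
                            .ToList();

        Console.WriteLine(hits.Count);
    }
}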
Actually, you won't get much benefit from spawning 50 threads if your CPU only has two cores (even if each of them supports hyperthreading). It will actually run slower due to the context switching that has to occur every once in a while.
That means you should go for the Task Parallel Library (.NET 4), which takes care that all available cores are used efficiently.
Apart from that, improving the asymptotic complexity of your search algorithm might prove more valuable for large quantities of data, regardless of Moore's law.
[Edit]
If you are unable/unwilling to use the .NET 4 TPL, you can start by getting the number of logical processors in the system (use Environment.ProcessorCount or check this answer for detailed info). Based on that number, you can partition your data and spawn a fixed number of threads. That is much simpler than checking the CPU utilization, and it prevents creating unnecessary threads which are starved anyway.
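A rough sketch of that fixed partitioning, again with a placeholder struct and scan body:

using System;
using System.Threading;

class FixedPartitionSketch
{
    struct MyStruct { }   // placeholder for the poster's struct

    static void Main()
    {
        var myStructs = new MyStruct[50000];

        // One worker thread per logical processor; each gets a contiguous slice.
        int workers = Environment.ProcessorCount;
        int chunk = (myStructs.Length + workers - 1) / workers;
        var threads = new Thread[workers];

        for (int w = 0; w < workers; w++)
        {
            int from = w * chunk;
            int to = Math.Min(from + chunk, myStructs.Length);
            threads[w] = new Thread(() =>
            {
                for (int i = from; i < to; i++)
                {
                    // ... scan myStructs[i] ...
                }
            });
            threads[w].Start();
        }

        foreach (var t in threads) t.Join();
    }
}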
OK, sorry to keep going on but first to compile my comments:
Unless you have a very, very, very good reason to think that scanning these structs will take any more than a handful of microseconds, and that this really, really, really matters, it's not a good idea to do this kind of optimisation. If you really want to do it, you should have one thread per core. But really - don't. If it's just 50,000 structs and you're doing something simple with them, don't bother.
FYI, starting a new thread takes a good amount of time (a measurable part of a second, several milliseconds).
How long does this operation take? It's very unlikely that optimizing the multithreading like this is useful for you; it will give you the smallest improvement. Better improvement will be gained by a better algorithm, or by not having to depend on this hand-rolled multithreading scheme.
I'm confused about your performance fixation, partly because you say you're looking through 50,000 structs (a very quick and easy operation) and partly because you're using structs. Without boxing, a struct is a value type, and if you're passing structs around between threads you're copying data rather than references, i.e. using more memory. My point is that that's a lot of data/memory unless the structs are small - in which case, what kind of processing can you possibly be doing on them that takes so long as to justify thinking about 40+ threads in parallel?
If performance is truly incredibly important and your goal, and you're not simply trying to do this as a nice engineering exercise, please share information about what kind of processing you're doing.

Fastest way to calculate primes in C#?

I actually have an answer to my question, but it is not parallelized, so I am interested in ways to improve the algorithm. Anyway, it might be useful as-is for some people.
int Until = 20000000;
BitArray PrimeBits = new BitArray(Until, true);

/*
 * Sieve of Eratosthenes
 * PrimeBits is a simple BitArray where each bit represents an integer,
 * and we mark composite numbers as false
 */
PrimeBits.Set(0, false); // You don't actually need this, just
PrimeBits.Set(1, false); // reminding you that 2 is the smallest prime

for (int P = 2; P < (int)Math.Sqrt(Until) + 1; P++)
    if (PrimeBits.Get(P))
        // These are going to be the multiples of P if it is a prime
        for (int PMultiply = P * 2; PMultiply < Until; PMultiply += P)
            PrimeBits.Set(PMultiply, false);

// We use this to store the actual prime numbers
List<int> Primes = new List<int>();
for (int i = 2; i < Until; i++)
    if (PrimeBits.Get(i))
        Primes.Add(i);
Maybe I could use multiple BitArrays and BitArray.And() them together?
You might save some time by cross-referencing your bit array with a doubly-linked list, so you can more quickly advance to the next prime.
Also, in eliminating later composites once you hit a new prime p for the first time - the first composite multiple of p remaining will be p*p, since everything before that has already been eliminated. In fact, you only need to multiply p by all the remaining potential primes that are left after it in the list, stopping as soon as your product is out of range (larger than Until).
There are also some good probabilistic algorithms out there, such as the Miller-Rabin test. The Wikipedia page is a good introduction.
Parallelisation aside, you don't want to be calculating sqrt(Until) on every iteration. You can also skip multiples of 2, 3 and 5 and only consider candidates with N % 6 in {1, 5} or N % 30 in {1, 7, 11, 13, 17, 19, 23, 29}.
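For example, a sketch of those two tweaks, reusing Until and PrimeBits from the code in the question:

// Compute the bound once instead of calling Math.Sqrt on every loop test.
int limit = (int)Math.Sqrt(Until) + 1;
for (int P = 2; P < limit; P++)
    if (PrimeBits.Get(P))
        for (int PMultiply = P * 2; PMultiply < Until; PMultiply += P)
            PrimeBits.Set(PMultiply, false);

// Collect primes by visiting only candidates of the form 6k ± 1 (2 and 3 added explicitly).
List<int> Primes = new List<int> { 2, 3 };
for (int i = 5; i < Until; i += 6)
{
    if (PrimeBits.Get(i)) Primes.Add(i);                          // 6k - 1
    if (i + 2 < Until && PrimeBits.Get(i + 2)) Primes.Add(i + 2); // 6k + 1
}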
You should be able to parallelize the factoring algorithm quite easily, since the Nth stage only depends on the sqrt(n)th result, so after a while there won't be any conflicts. But that's not a good algorithm, since it requires lots of division.
You should also be able to parallelize the sieve algorithms, if you have writer work packets which are guaranteed to complete before a read. Mostly the writers shouldn't conflict with the reader - at least once you've done a few entries, they should be working at least N above the reader, so you only need a synchronized read fairly occasionally (when N exceeds the last synchronized read value). You shouldn't need to synchronize the bool array across any number of writer threads, since write conflicts don't arise (at worst, more than one thread will write a true to the same place).
The main issue would be to ensure that any worker being waited on to write has completed. In C++ you'd use a compare-and-set to switch to the worker which is being waited for at any point. I'm not a C# wonk so don't know how to do it that language, but the Win32 InterlockedCompareExchange function should be available.
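For reference, the C# counterpart is Interlocked.CompareExchange; a minimal sketch:

using System;
using System.Threading;

class CasSketch
{
    static int _state;

    static void Main()
    {
        // Atomically set _state to 1, but only if it is still 0.
        // The return value is whatever was actually observed in _state.
        int observed = Interlocked.CompareExchange(ref _state, 1, 0);
        Console.WriteLine(observed == 0 ? "swap happened" : "another thread got there first");
    }
}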
You also might try an actor-based approach, since that way you can schedule the actors working with the lowest values, which may make it easier to guarantee that you're reading valid parts of the sieve without having to lock the bus on each increment of N.
Either way, you have to ensure that all workers have got above entry N before you read it, and the cost of doing that is where the trade-off between parallel and serial is made.
Without profiling we cannot tell which bit of the program needs optimizing.
If you were in a large system, then one would use a profiler to find that the prime number generator is the part that needs optimizing.
Profiling a loop with a dozen or so instructions in it is not usually worthwhile - the overhead of the profiler is significant compared to the loop body, and about the only way to improve a loop that small is to change the algorithm to do fewer iterations. So IME, once you've eliminated any expensive functions and have a known target of a few lines of simple code, you're better off changing the algorithm and timing an end-to-end run than trying to improve the code by instruction-level profiling.
#DrPizza Profiling only really helps improve an implementation; it doesn't reveal opportunities for parallel execution or suggest better algorithms (unless you have experience to the contrary, in which case I'd really like to see your profiler).
I've only single core machines at home, but ran a Java equivalent of your BitArray sieve, and a single threaded version of the inversion of the sieve - holding the marking primes in an array, and using a wheel to reduce the search space by a factor of five, then marking a bit array in increments of the wheel using each marking prime. It also reduces storage to O(sqrt(N)) instead of O(N), which helps both in terms of the largest N, paging, and bandwidth.
For medium values of N (1e8 to 1e12), the primes up to sqrt(N) can be found quite quickly, and after that you should be able to parallelise the subsequent search on the CPU quite easily. On my single core machine, the wheel approach finds primes up to 1e9 in 28s, whereas your sieve (after moving the sqrt out of the loop) takes 86s - the improvement is due to the wheel; the inversion means you can handle N larger than 2^32 but makes it slower. Code can be found here. You could parallelise the output of the results from the naive sieve after you go past sqrt(N) too, as the bit array is not modified after that point; but once you are dealing with N large enough for it to matter the array size is too big for ints.
You also should consider a possible change of algorithms.
Consider that it may be cheaper to simply add the elements to your list, as you find them.
Perhaps preallocating space for your list, will make it cheaper to build/populate.
Are you trying to find new primes? This may sound stupid, but you might be able to load up some sort of data structure with known primes. I am sure someone out there has a list. It might be a much easier problem to look up existing numbers than to calculate new ones.
You might also look at Microsoft's Parallel FX Library for making your existing code multi-threaded to take advantage of multi-core systems. With minimal code changes you can make your for loops multi-threaded.
There's a very good article about the Sieve of Eratosthenes: The Genuine Sieve of Eratosthenes
It's in a functional setting, but most of the optimizations also apply to a procedural implementation in C#.
The two most important optimizations are to start crossing out at P^2 instead of 2*P and to use a wheel for the next prime numbers.
For concurrency, you can process all numbers till P^2 in parallel to P without doing any unnecessary work.
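Applied to the earlier C# sieve, the first of those optimizations looks roughly like this (reusing the question's variables):

int limit = (int)Math.Sqrt(Until) + 1;
for (int P = 2; P < limit; P++)
    if (PrimeBits.Get(P))
        // Start at P * P: every smaller multiple of P has a smaller prime factor
        // and has therefore already been crossed out by an earlier prime's pass.
        for (int PMultiply = P * P; PMultiply < Until; PMultiply += P)
            PrimeBits.Set(PMultiply, false);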
void PrimeNumber(long number)
{
    // Numbers below 2 are not prime; 2 is the only even prime.
    if (number < 2)
    {
        MessageBox.Show("No It is not a Prime Number");
        return;
    }
    if (number == 2)
    {
        MessageBox.Show("Yes Prime Number");
        return;
    }
    if (number % 2 == 0)
    {
        MessageBox.Show("No It is not a Prime Number");
        return;
    }

    bool isPrimeNumber = true;
    long limit = (long)Math.Sqrt(number);

    // Trial division by odd candidates up to the square root.
    for (long i = 3; i <= limit; i += 2)
    {
        if (number % i == 0)
        {
            MessageBox.Show("It is divisible by " + i);
            isPrimeNumber = false;
            break;
        }
    }

    if (isPrimeNumber)
    {
        MessageBox.Show("Yes Prime Number");
    }
    else
    {
        MessageBox.Show("No It is not a Prime Number");
    }
}
