Efficient powerset algorithm for subsets of minimal length - C#

I am using the following C# function to get a powerset limited to subsets of a minimal length:
string[] PowerSet(int min_len, string set)
{
    IEnumerable<IEnumerable<string>> seed =
        new List<IEnumerable<string>>() { Enumerable.Empty<string>() };

    return set.Replace(" ", "")
        .Split(',')
        .Aggregate(seed, (a, b) => a.Concat(a.Select(x => x.Concat(new[] { b }))))
        .Where(subset => subset.Count() >= min_len)
        .Select(subset => string.Join(",", subset))
        .ToArray();
}
The problem is that when the original set is large, the algorithm has to work very hard even if the minimal length is large as well. For example:
PowerSet(27, "1,11,12,17,22,127,128,135,240,254,277,284,292,296,399,309,322,326,333,439,440,442,447,567,580,590,692,697");
should be very easy, but takes far too long with the above function. I am looking for a concise modification of my function which could efficiently handle these cases.

Taking a quick look at your method, one of the inefficiencies is that every possible subset is created, regardless of whether it has enough members to warrant inclusion in the result set.
Consider implementing the following extension method instead. This method can trim out some unnecessary subsets based on their count to avoid excess computation.
public static List<List<T>> PowerSet<T>(this List<T> startingSet, int minSubsetSize)
{
    List<List<T>> subsetList = new List<List<T>>();

    // The set bits of each intermediate value represent unique
    // combinations from the startingSet.
    // We can start checking for combinations at (1 << minSubsetSize) - 1, since
    // values less than that cannot yield large enough subsets.
    int iLimit = 1 << startingSet.Count;
    for (int i = (1 << minSubsetSize) - 1; i < iLimit; i++)
    {
        // Get the number of 1's in this 'i'.
        int setBitCount = NumberOfSetBits(i);

        // Only include this subset if it will have at least minSubsetSize members.
        if (setBitCount >= minSubsetSize)
        {
            List<T> subset = new List<T>(setBitCount);
            for (int j = 0; j < startingSet.Count; j++)
            {
                // If the j'th bit in i is set,
                // then add the j'th element of the startingSet to this subset.
                if ((i & (1 << j)) != 0)
                {
                    subset.Add(startingSet[j]);
                }
            }
            subsetList.Add(subset);
        }
    }
    return subsetList;
}
The number of set bits in each incremental i tells you how many members will be in the subset. If there are not enough set bits, then there is no point in doing the work of creating the subset represented by the bit combination. NumberOfSetBits can be implemented a number of ways. See How to count the number of set bits in a 32-bit integer? for various approaches, explanations and references. Here is one example taken from that SO question.
public static int NumberOfSetBits(int i)
{
    i = i - ((i >> 1) & 0x55555555);
    i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
    return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}
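For reference, here is a sketch of how the extension method might be invoked with the question's input (the string handling shown is illustrative; the expected count follows from choosing 27 or 28 elements out of 28):
List<string> set = "1,11,12,17,22,127,128,135,240,254,277,284,292,296,399,309,322,326,333,439,440,442,447,567,580,590,692,697"
    .Split(',')
    .ToList();

// Only subsets of size 27 and 28 survive: C(28,27) + C(28,28) = 29.
List<List<string>> subsets = set.PowerSet(27);
Console.WriteLine(subsets.Count); // 29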
Now, while this solution works for your example, I think you will run into long runtimes and memory issues if you lower the minimum subset size too far or continue to grow the size of the startingSet. Without specific requirements posted in your question, I can't judge if this solution will work for you and/or is safe for your range of expected input cases.
If you find that this solution is still too slow, the operations can be split up for parallel computation, perhaps using PLINQ features.
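For instance, a hedged PLINQ sketch of that idea (the method name and partitioning are illustrative, not a tested implementation; it reuses the NumberOfSetBits helper above):
public static List<List<T>> PowerSetParallel<T>(this List<T> startingSet, int minSubsetSize)
{
    int first = (1 << minSubsetSize) - 1;
    int count = (1 << startingSet.Count) - first;

    // Candidate bit patterns are filtered and materialized in parallel.
    return ParallelEnumerable.Range(first, count)
        .Where(i => NumberOfSetBits(i) >= minSubsetSize)
        .Select(i => Enumerable.Range(0, startingSet.Count)
            .Where(j => (i & (1 << j)) != 0)
            .Select(j => startingSet[j])
            .ToList())
        .ToList();
}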
Lastly, if you would like to dress up the extension method with LINQ, it would look like the following. However, as written, I think you will see slower performance without some changes to it.
public static IEnumerable<List<T>> PowerSet<T>(this List<T> startingSet, int minSubsetSize)
{
    var startingSetIndexes = Enumerable.Range(0, startingSet.Count).ToList();
    int first = (1 << minSubsetSize) - 1;

    // Note: Enumerable.Range takes (start, count), not (start, end).
    var candidates = Enumerable.Range(first, (1 << startingSet.Count) - first)
        .Where(p => NumberOfSetBits(p) >= minSubsetSize)
        .ToList();

    foreach (int p in candidates)
    {
        yield return startingSetIndexes.Where(setInd => (p & (1 << setInd)) != 0)
            .Select(setInd => startingSet[setInd])
            .ToList();
    }
}

Related

Group List into Ranges By Specific Amount

Let's say I have a List of items which look like this:
Number  Amount
1       10
2       12
5       5
6       9
9       4
10      3
11      1
I need the method to take in any number (even a decimal) and use it to group the list into ranges. So let's say my number was 1; the following output would be...
Ranges  Total
1-2     22
5-6     14
9-11    8
Because it basically grouped the numbers that are 1 away from each other into ranges. What's the most efficient way I can convert my list to look like the output?
There are a couple of approaches to this. Either you can partition the data and then sum on the partitions, or you can roll the whole thing into a single method.
Since partitioning is based on the gaps between the Number values you won't be able to work on unordered lists. Building the partition list on the fly isn't going to work if the list isn't ordered, so make sure you sort the list on the partition field before you start.
Partitioning
Once the list is ordered (or if it was pre-ordered) you can partition. I use this kind of extension method fairly often for breaking up ordered sequences into useful blocks, such as when I need to grab sequences of entries from a log file.
public static partial class Ext
{
    public static IEnumerable<T[]> PartitionStream<T>(this IEnumerable<T> source, Func<T, T, bool> partitioner)
    {
        var partition = new List<T>();
        T prev = default;
        foreach (var next in source)
        {
            if (partition.Count > 0 && !partitioner(prev, next))
            {
                yield return partition.ToArray();
                partition.Clear();
            }
            partition.Add(prev = next);
        }
        if (partition.Count > 0)
            yield return partition.ToArray();
    }
}
The partitioner parameter compares two objects and returns true if they belong in the same partition. The extension method just collects all the members of the partition together and returns them as an array once it finds something for the next partition.
From there you can just do simple summing on the partition arrays:
var source = new (int n, int v)[] { (1,10), (2,12), (5,5), (6,9), (9,4), (10,3), (11,1) };
var maxDifference = 2;
var aggregate =
    from part in source.PartitionStream((l, r) => (r.n - l.n) <= maxDifference)
    let low = part.Min(g => g.n)
    let high = part.Max(g => g.n)
    select new { Ranges = $"{low}-{high}", Total = part.Sum(g => g.v) };
This gives the same output as your example.
Stream Aggregation
The second option is both simpler and more efficient, since it does barely any memory allocation. The downside - if you can call it that - is that it's a lot less generic.
Rather than partitioning and then aggregating over the partitions, this just walks through the list and aggregates as it goes, emitting a result each time a partition boundary is reached:
IEnumerable<(string Ranges, int Total)> GroupSum(IEnumerable<(int n, int v)> source, int maxDistance)
{
    int low = int.MaxValue;
    int high = 0;
    int total = 0;
    foreach (var (n, v) in source)
    {
        // check partition boundary
        if (n < low || (n - high) > maxDistance)
        {
            if (n > low)
                yield return ($"{low}-{high}", total);
            low = high = n;
            total = v;
        }
        else
        {
            high = n;
            total += v;
        }
    }
    if (total > 0)
        yield return ($"{low}-{high}", total);
}
(Using ValueTuple so I don't have to declare types.)
Output is the same here, but with a lot less going on in the background to slow it down. No allocated arrays, etc.
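A quick usage sketch against the question's sample data (maxDistance = 1 matches the example):
var source = new (int n, int v)[] { (1,10), (2,12), (5,5), (6,9), (9,4), (10,3), (11,1) };
foreach (var (ranges, total) in GroupSum(source, 1))
    Console.WriteLine($"{ranges} {total}");
// 1-2 22
// 5-6 14
// 9-11 8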

Removing masked entries from an array

The task is to keep an array of objects untouched if the input is null and, otherwise, remove the elements at the positions specified by the input. I've got it working, but I'm vastly dissatisfied with the code quality.
List<Stuff> stuff = new List<Stuff> { new Stuff(1), new Stuff(2), new Stuff(3) };
String input = "5";

if (input == null)
    return stuff;

int mask = Int32.Parse(input);
for (int i = stuff.Count - 1; i >= 0; i--)
    if ((mask & (int)Math.Pow(2, i)) == 0)
        stuff.RemoveAt(i);
return stuff;
Actually obtaining the input, and the fact that e.g. String.Empty will cause problems, need not be considered here. Let's assume those are handled.
How can I make the code more efficient?
How can I make the syntax more compact and graspable?
Instead of the backwards-running loop, you could use LINQ with the following statement (note the added ToList call, since Where returns a lazy IEnumerable rather than a List):
stuff = stuff.Where((iStuff, idx) => (mask & (int)Math.Pow(2, idx)) != 0).ToList();
Or, even cleaner, using a bitwise shift:
stuff = stuff.Where((_, index) => (mask >> index & 1) == 1).ToList();
It uses an overload of Where that exposes the element's position in the sequence. For similar tasks, there is also an overload of Select that gives access to the index.
Untested, but you could make an extension method that iterates the collection and filters, returning matching elements as it goes. Repeatedly bit-shifting the mask and checking the 0th bit seems the easiest to follow - for me at least.
static IEnumerable<T> TakeMaskedItemsByIndex<T>(this IEnumerable<T> collection, ulong mask)
{
    foreach (T item in collection)
    {
        if ((mask & 1) == 1)
            yield return item;
        mask >>= 1;
    }
}
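A hypothetical usage, matching the question's input of "5" (binary 101, which keeps the elements at indexes 0 and 2):
var stuff = new List<Stuff> { new Stuff(1), new Stuff(2), new Stuff(3) };
ulong mask = UInt64.Parse("5");
List<Stuff> kept = stuff.TakeMaskedItemsByIndex(mask).ToList(); // Stuff(1) and Stuff(3)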

Fast algorithm for pandigital check

I'm working on a project for which I need a very fast algorithm for checking whether a supplied number is pandigital. Though the logic seems sound, I'm not particularly happy with the performance of the methods described below.
I can check up to one million 9-digit numbers in about 520ms, 600ms and 1600ms respectively. I'm working on a low-latency application and in production I'll have a dataset of about 9 or 9.5 billion 7- to 9-digit numbers that I'll need to check.
I have three candidates right now (well, really two) that use the following logic:
Method 1: I take an input N, split it into a byte array of its constituent digits, sort it with Array.Sort, and iterate over the array with a for loop, checking each element against the counter:
byte[] Digits = SplitDigits(N);
int len = NumberLength(N);
Array.Sort(Digits);

for (int i = 0; i <= len - 1; i++)
{
    if (i + 1 != Digits[i])
        return false;
}
Method 2: This method is based on a bit of dubious logic, but I split the input N into a byte array of constituent digits and then make the following test:
if (N * (N + 1) * 0.5 == DigitSum(N) && Factorial(len) == DigitProduct(N))
    return true;
Method 3: I dislike this method, so it's not a real candidate, but I convert the int to a string and then use String.Contains to determine whether the number is pandigital.
The second and third methods have fairly stable runtimes, though the first method bounces around a lot - it can go as high as 620ms at times.
So, ideally, I would really like to reduce the runtime for the million 9-digit mark to under 10ms. Any thoughts?
I'm running this on a Pentium 6100 laptop at 2GHz.
PS - is the mathematical logic of the second method sound?
Method 1
Pre-compute a sorted list of the 362880 (= 9!) 9-digit pandigital numbers. This takes only a few milliseconds. Then, for each request, first check whether the number is divisible by 9: it must be, since the digits 1 through 9 sum to 45. If it is, use a binary search to check whether it is in your pre-computed list.
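A minimal sketch of that approach (the permutation helper and the names here are illustrative, not the answerer's code):
static readonly int[] Pandigitals = BuildPandigitals();

static int[] BuildPandigitals()
{
    var results = new List<int>(362880);
    Permute(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 }, 0, results);
    results.Sort(); // Array.BinarySearch below requires sorted input
    return results.ToArray();
}

static void Permute(int[] digits, int k, List<int> results)
{
    if (k == digits.Length)
    {
        int value = 0;
        foreach (int d in digits) value = value * 10 + d;
        results.Add(value);
        return;
    }
    for (int i = k; i < digits.Length; i++)
    {
        int tmp = digits[k]; digits[k] = digits[i]; digits[i] = tmp;
        Permute(digits, k + 1, results);
        tmp = digits[k]; digits[k] = digits[i]; digits[i] = tmp;
    }
}

static bool IsPandigitalByLookup(int n)
{
    // Every 1-9 pandigital number has digit sum 45, hence is divisible by 9.
    return n % 9 == 0 && Array.BinarySearch(Pandigitals, n) >= 0;
}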
Method 2
Again, check whether the number is divisible by 9. Then use a bit vector to track the presence of digits, and replace each division with a multiply-and-shift by a precomputed reciprocal.
static bool IsPandigital(int n)
{
    // Divisibility-by-9 test: 0x1c71c71d is a fixed-point reciprocal of 9.
    if (n != 9 * (int)((0x1c71c71dL * n) >> 32))
        return false;

    int flags = 0;
    while (n > 0)
    {
        int q = (int)((0x1999999aL * n) >> 32); // q = n / 10
        flags |= 1 << (n - q * 10);             // set the bit for digit n % 10
        n = q;
    }
    return flags == 0x3fe; // bits 1..9 all set
}
Method 1 comes in at 15ms/1M. Method 2 comes in at 5.5ms/1M on my machine. This is C# compiled to x64 on an i7 950.
Just a thought (following the definition of pandigital from Wikipedia):
int n = 1234567890;
int Flags = 0;
int Base = 10;
while (n != 0)
{
    Flags |= 1 << (n % Base);
    n /= Base;
}
bool bPanDigital = Flags == ((1 << Base) - 1);

Interpolation in C# - performance problem

I need to resample big sets of data (a few hundred spectra, each containing a few thousand points) using simple linear interpolation.
I have created an interpolation method in C#, but it seems to be really slow for huge datasets.
How can I improve the performance of this code?
public static List<double> interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    double[] interpolated = new double[breaks.Count];
    int id = 1;
    int x = 0;

    // left border case - uphold the value
    while (breaks[x] < xItems[0])
    {
        interpolated[x] = yItems[0];
        x++;
    }

    double p, w;
    for (int i = x; i < breaks.Count; i++)
    {
        while (breaks[i] > xItems[id])
        {
            id++;
            if (id > xItems.Count - 1)
            {
                id = xItems.Count - 1;
                break;
            }
        }
        System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));
        if (id <= xItems.Count - 1)
        {
            if (id == xItems.Count - 1 && breaks[i] > xItems[id])
            {
                interpolated[i] = yItems[yItems.Count - 1];
            }
            else
            {
                w = xItems[id] - xItems[id - 1];
                p = (breaks[i] - xItems[id - 1]) / w;
                interpolated[i] = yItems[id - 1] + p * (yItems[id] - yItems[id - 1]);
            }
        }
        else // right border case - uphold the value
        {
            interpolated[i] = yItems[yItems.Count - 1];
        }
    }
    return interpolated.ToList();
}
Edit
Thanks, guys, for all your responses. What I wanted when I wrote this question were some general ideas about where I could improve performance. I didn't expect ready-made solutions, only some ideas, and you gave me what I wanted, thanks!
Before writing this question I thought about rewriting the code in C++, but after reading the comments on Will's answer it seems that the gain may be smaller than I expected.
Also, the code is so simple that there are no mighty code tricks to apply here. Thanks to Petar for his attempt to optimize the code.
It seems it all comes down to finding a good profiler, checking every line and subroutine, and trying to optimize from there.
Thank you again for all the responses and for taking part in this discussion!
public static List<double> Interpolate(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    var a = xItems.ToArray();
    var b = yItems.ToArray();
    var aLimit = a.Length - 1;
    var bLimit = b.Length - 1;
    var interpolated = new double[breaks.Count];
    var total = 0;
    var initialValue = a[0];
    while (breaks[total] < initialValue)
    {
        total++;
    }

    // left border case - pad with the first y value
    for (int k = 0; k < total; k++)
    {
        interpolated[k] = b[0];
    }

    int id = 1;
    for (int i = total; i < breaks.Count; i++)
    {
        var breakValue = breaks[i];
        while (breakValue > a[id])
        {
            id++;
            if (id > aLimit)
            {
                id = aLimit;
                break;
            }
        }
        double value = b[bLimit];
        if (id <= aLimit)
        {
            var currentValue = a[id];
            var previousValue = a[id - 1];
            if (id != aLimit || breakValue <= currentValue)
            {
                var w = currentValue - previousValue;
                var p = (breakValue - previousValue) / w;
                value = b[id - 1] + p * (b[id] - b[id - 1]);
            }
        }
        interpolated[i] = value;
    }
    return interpolated.ToList();
}
I've cached some values, copied the ILists to arrays, and hoisted the left-border fill out as a simple loop, but I think these are micro-optimizations that the compiler already performs in Release mode. However, you can try this version and see if it beats the original code.
Instead of
interpolated.ToList()
which copies the whole array, you could compute the interpolated values directly in the final list (or return the array itself). This matters especially if the array/list is big enough to qualify for the large object heap.
Unlike the ordinary heap, the LOH is not compacted by the GC, which means that short-lived large objects are far more harmful than small ones.
Then again: 7000 doubles are approx. 56,000 bytes, which is below the large object threshold of 85,000 bytes.
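A minimal sketch of the idea (hypothetical signature; the per-break computation is elided and unchanged):
public static double[] InterpolateToArray(IList<double> xItems, IList<double> yItems, IList<double> breaks)
{
    var interpolated = new double[breaks.Count];
    // ... fill interpolated[i] exactly as in the original method ...
    return interpolated; // no ToList(), no second large allocation
}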
Looks to me like you've created an O(n^2) algorithm: you search for the interval, which is O(n), and you do it roughly n times. You'll get a quick and cheap speed-up by taking advantage of the fact that the items are already ordered in the list. Use BinarySearch(); that's O(log n).
If that is still not enough, you should be able to do something speedier in the outer loop: whatever interval you found previously makes it easier to find the next one. But that code isn't in your snippet.
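A sketch of the binary-search lookup for a single break (assuming xItems is sorted ascending; the border handling mirrors the original "uphold the value" behavior):
static double InterpolateOne(double[] xs, double[] ys, double b)
{
    if (b <= xs[0]) return ys[0];                          // left border: hold the first value
    if (b >= xs[xs.Length - 1]) return ys[ys.Length - 1];  // right border: hold the last value

    int id = Array.BinarySearch(xs, b);
    if (id < 0) id = ~id; // index of the first element greater than b

    double w = xs[id] - xs[id - 1];
    double p = (b - xs[id - 1]) / w;
    return ys[id - 1] + p * (ys[id] - ys[id - 1]);
}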
I'd say profile the code and see where it spends its time; then you have somewhere to focus on.
ANTS is popular, but EQATEC is free, I think.
A few suggestions:
As others suggested, use a profiler to understand better where the time is spent.
The loop
while (breaks[x] < xItems[0])
could cause an exception if x grows bigger than the number of items in the breaks list. You should use something like
while (x < breaks.Count && breaks[x] < xItems[0])
But you might not need that loop at all. Why treat the first item as a special case? Just start with id = 0 and handle the first point in the for(i) loop. I understand that id might start from 0 in that case and [id - 1] would be a negative index, but see if you can do something there.
If you optimize for speed, you usually sacrifice memory, and vice versa; you cannot normally have both unless you devise a really clever algorithm. In this case, that means calculating as much as you can outside the loops, storing those values in variables (extra memory), and using them later. For example, instead of always saying:
id = xItems.Count - 1;
You could say:
int lastXItemsIndex = xItems.Count - 1;
...
id = lastXItemsIndex;
This is the same suggestion Petar Petrov made with aLimit, bLimit, and so on.
Next point: your loop (or the one Petar Petrov suggested):
while (breaks[i] > xItems[id])
{
    id++;
    if (id > xItems.Count - 1)
    {
        id = xItems.Count - 1;
        break;
    }
}
could probably be reduced to:
double currentBreak = breaks[i];
while (id <= lastXItemsIndex && currentBreak > xItems[id]) id++;
The last point I would add: check whether there is some property of your samples that is special to your problem. For example, if xItems represent time and you are sampling at regular intervals, then
w = xItems[id] - xItems[id - 1];
is constant and does not have to be recalculated on every loop iteration. This is probably not often the case, but maybe your problem has some other property you could use to improve performance.
Another idea: maybe you do not need double precision; float is probably faster because it is smaller.
Good luck!
System.Diagnostics.Debug.WriteLine(string.Format("i: {0}, id {1}", i, id));
I hope it's a Release build without DEBUG defined? Otherwise the Debug.WriteLine in the loop will run on every iteration.
Beyond that, it might depend on what exactly those IList parameters are. It may be useful to store the Count value instead of accessing the property every time.
This is the kind of problem where you need to move over to native code.

What is the simplest way to initialize an Array of N numbers following a simple pattern?

Let's say the first N integers divisible by 3, starting with 9.
I'm sure there is some one-line solution using lambdas; I just don't know that area of the language well enough yet.
Just to be different (and to avoid using a Where clause), you could also do:
var numbers = Enumerable.Range(0, n).Select(i => i * 3 + 9);
Update: This also has the benefit of not running out of numbers.
Using LINQ:
int[] numbers =
    Enumerable.Range(9, 10000)
        .Where(x => x % 3 == 0)
        .Take(20)
        .ToArray();
It is also easily parallelizable using PLINQ if you need:
int[] numbers =
    Enumerable.Range(9, 10000)
        .AsParallel() // added this line
        .Where(x => x % 3 == 0)
        .Take(20)
        .ToArray();
const int __N = 100;
const int __start = 9;
const int __divisibleBy = 3;

var array = Enumerable.Range(__start, __N * __divisibleBy)
    .Where(x => x % __divisibleBy == 0)
    .Take(__N)
    .ToArray();
int n = 10; // Take the first 10 that meet the criteria
int[] ia = Enumerable
    .Range(0, 999)
    .Where(a => a % 3 == 0 && a.ToString()[0] == '9')
    .Take(n)
    .ToArray();
I want to see how this solution stacks up against the above LINQ solutions. The trick here is replacing the predicate with a closed form: when the starting value s is divisible by m, the nth value of the set {q >= s : q % m == 0} is simply s + m*n. In our case s = 9 and m = 3 (and since s % m == 0, the baseVal % modVal term below is zero).
The only problem with this solution is that it has the side effect of making your implementation depend on the specific pattern you choose (and not every pattern has a suitable closed form). But it has the advantage of:
Always running in exactly n iterations
Never failing like the above proposed solutions (with respect to their limited Range)
Besides, no matter what pattern you choose, you will always need to adapt the formula, so you might as well make it mathematically efficient:
static int[] givemeN(int n)
{
    const int baseVal = 9;
    const int modVal = 3;
    int i = 0;
    return Array.ConvertAll<int, int>(
        new int[n],
        new Converter<int, int>(
            x => baseVal + (baseVal % modVal) + ((i++) * modVal)));
}
Edit: I just want to illustrate how you could use this method with a delegate to improve code reuse:
static int[] givemeN(int n, Func<int, int> func)
{
    int i = 0;
    return Array.ConvertAll<int, int>(new int[n], new Converter<int, int>(a => func(i++)));
}
You can use it with givemeN(5, i => 9 + 3 * i). Again, note that I replaced the predicate with a generator function, but you can do this with most simple patterns too.
I can't say whether this is any good; I'm not a C# expert and I just whacked it out, but I think it's probably a canonical example of the use of yield.
internal IEnumerable<int> Answer(int N)
{
    int n = 0;
    int i = 9;
    while (true)
    {
        if (i % 3 == 0)
        {
            n++;
            yield return i;
        }
        if (n >= N) yield break;
        i++;
    }
}
You could iterate from 0 (or 1) to N and add the values by hand. Or you could create a function f(int n) and cache the results in session state or a global hashtable or dictionary.
Pseudocode, where ht is a global Hashtable or Dictionary (I strongly recommend the latter, because it is strongly typed):
public int f(int n)
{
    if (ht.ContainsKey(n))
        return ht[n];
    else
    {
        // do calculation
        ht[n] = result;
        return result;
    }
}
Just a side note: if you do this type of functional programming all the time, you might want to check out F#, or maybe even IronRuby or Python.
