Possible to do this in better than O(n^2) time? - c#

The problem I'm trying to solve gives me a matrix like
10101
11100
11010
00101
where the rows are supposed to represent topics that a person knows; e.g. Person 1, represented by 10101, knows topics 1, 3 and 5, but not 2 or 4. I need to find the maximum number of topics that a 2-person team could know; e.g. the team of Person 1 and Person 3 knows all the topics, because between 10101 and 11010 there are 1s at every index.
I have an O(n^2) solution
string[] topic = new string[n];
for (int topic_i = 0; topic_i < n; topic_i++)
{
    topic[topic_i] = Console.ReadLine();
}
IEnumerable<int> teamTopics =
    from t1 in topic
    from t2 in topic
    where !Object.ReferenceEquals(t1, t2)
    select t1.Zip(t2, (c1, c2) => c1 == '1' || c2 == '1').Sum(b => b ? 1 : 0);
int max = teamTopics.Max();
Console.WriteLine(max);
which is passing all the test cases it doesn't time out on. I suspect the reason it's not fast enough has to do with the time complexity rather than the overhead of the LINQ machinery. But I can't think of a better way to do it.
I thought that maybe I could map the indices of topics to the persons who know them, like
1 -> {1,2,3}
2 -> {2,3}
3 -> {1,2,4}
4 -> {3}
5 -> {1,4}
but I can't think of where to go from there.
Can you supply me with a "hint"?

Let's say we have n people and m topics.
I would argue that your algorithm is O(n^2 * m), where n is the number of people, because:
from t1 in topic gets you O(n)
from t2 in topic gets you to O(n^2)
t1.Zip(t2 ... get you to O(n^2 * m)
An optimisation that I see is first to modify strings a bit:
s1 = '0101', where the i-th element shows whether person i knows the 1st topic
s2 = '1111', where the i-th element shows whether person i knows the 2nd topic.
etc...
Then you analyse string s1. You pick all possible pairs of 1s (O(n^2) of them), which give you the pairs of people that together know the 1st topic. Then pick a pair from that list and check whether they know the 2nd topic as well, and so on. When they don't, delete the pair from the list and move on to another one.
Unfortunately this looks to be O(n^2 * m) as well, but it should be quicker in practice. For a very sparse matrix it should be close to O(n^2), and for dense matrices it should find a pair pretty soon.

Thoughts:
as a speculative optimization: you could do an O(n) sweep to find the individual with the highest number of skills (largest hamming weight); note them, and stop if they have everything: pair them with anyone, it doesn't matter
you can exclude, without testing, anyone whose skills are all shared with the "best" individual - we already know about everything they can offer, and the "best" worker (who covers their skills) gets tested against everyone anyway; so only test if (newSkills & ~bestSkills) != 0 - meaning: the person being tested has something that the "best" worker didn't have; this leaves m workers with complementary skills plus the "best" worker (you must include them explicitly, as the ~/!=0 test above will fail for them)
now do another O(m) sweep of possible partners - checking to see if the "most skilled" plus any other gives you all the skills (obviously stop earlier if a single member has all the skills); but either way: keep track of best combination for later reference
you can further halve the time by only considering the triangle, not the square - meaning: you compare row 0 to rows 1-(m-1), but row 1 to rows 2-(m-1), row 5 to rows 6-(m-1), etc
you can significantly improve things by using integer bit math along with an efficient "hamming weight" algorithm (to count the set bits) rather than strings and summing - see the sketch after this list
get rid of the LINQ
short-circuit if you get all ones (compare to ~((~0)<<k), where k is the number of bits being tested for)
remember to compare any result to the "best" combination we found against the most skilled worker
This is still O(n) + O(m^2) where m <= n is the number of people with skills different to the most skilled worker
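To make the bit-math, triangle and short-circuit points concrete, here is a minimal sketch (hypothetical names; it assumes the rows have already been parsed into ulong masks, one bit per topic, with k <= 64 topics, and it omits the "best worker" pre-filter):

// Hypothetical sketch: rows already parsed into bit masks, one bit per topic (k <= 64).
static int MaxTopics(ulong[] masks, int k)
{
    ulong all = k == 64 ? ~0UL : ~(~0UL << k); // all k topic bits set
    int best = 0;
    for (int i = 0; i < masks.Length; i++)
    {
        for (int j = i + 1; j < masks.Length; j++) // triangle only, not the square
        {
            int count = PopCount(masks[i] | masks[j]);
            if (count > best) best = count;
            if (count == k) return k;              // short-circuit: all topics covered
        }
    }
    return best;
}

// Classic SWAR popcount; on .NET Core 3.0+ System.Numerics.BitOperations.PopCount does this.
static int PopCount(ulong v)
{
    v = v - ((v >> 1) & 0x5555555555555555UL);
    v = (v & 0x3333333333333333UL) + ((v >> 2) & 0x3333333333333333UL);
    return (int)((((v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FUL) * 0x0101010101010101UL) >> 56);
}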
Pathological but technically correct answer:
insert a Thread.Sleep(FourYears) - all solutions are now essentially O(1)

Your solution is asymptotically as efficient as it gets, because you need to examine all pairs to arrive at the maximum. You can make your code more efficient by replacing strings with BitArray objects, like this:
var topic = new List<BitArray>();
string line;
while ((line = Console.ReadLine()) != null) {
    topic.Add(new BitArray(line.Select(c => c == '1').ToArray()));
}
var res =
    (from t1 in topic
     from t2 in topic
     // clone t1 first: BitArray.Or mutates in place, and Count is the bit length,
     // not the number of set bits, so count the true bits explicitly
     select new BitArray(t1).Or(t2).Cast<bool>().Count(b => b)).Max();
Console.WriteLine(res);

Related

Best performing algorithm for unique trip selection using arrays?

I am given three arrays of equal length. Each array represents the distances to a specific type of attraction (i.e. the first array is only theme parks, the second is only museums, the third is only beaches) on a road trip I am taking. I want to determine all possible trips stopping at one of each type of attraction on each trip, never driving backwards, and never visiting the same attraction twice.
IE if I have the following three arrays:
[29 50]
[61 37]
[37 70]
The function would return 3 because the possible combinations would be: (29,61,70)(29,37,70)(50,61,70)
What I've got so far:
public int test(int[] A, int[] B, int[] C) {
    int firstStop = 0;
    int secondStop = 0;
    int thirdStop = 0;
    List<List<int>> possibleCombinations = new List<List<int>>();
    for (int i = 0; i < A.Length; i++)
    {
        firstStop = A[i];
        for (int j = 0; j < B.Length; j++)
        {
            if (firstStop < B[j])
            {
                secondStop = B[j];
                for (int k = 0; k < C.Length; k++)
                {
                    if (secondStop < C[k])
                    {
                        thirdStop = C[k];
                        possibleCombinations.Add(new List<int> { firstStop, secondStop, thirdStop });
                    }
                }
            }
        }
    }
    return possibleCombinations.Count();
}
This works for the following test cases:
Example test: ([29, 50], [61, 37], [37, 70])
OK Returns 3
Example test: ([5], [5], [5])
OK Returns 0
Example test: ([61, 62], [37, 38], [29, 30])
FAIL Returns 0
What is the correct algorithm to calculate this correctly?
What is the best performing algorithm?
How can I tell this algorithm's time complexity (i.e. is it O(N*log(N)))?
UPDATE: The question has been rewritten with new details and still is completely unclear and self-contradictory; attempts to clarify the problem with the original poster have been unsuccessful, and the original poster admits to having started coding before understanding the problem themselves. The solution below is correct for the problem as it was originally stated; what the solution to the real problem looks like, no one can say, because no one can say what the real problem is. I'll leave this here for historical purposes.
Let's re-state the problem:
We are given three arrays of distances to attractions along a road.
We wish to enumerate all sequences of possible stops at attractions that do not backtrack. (NOTE: The statement of the problem is to enumerate them; the wrong algorithm given counts them. These are completely different problems. Counting them can be extremely fast. Enumerating them is extremely slow! If the problem is to count them then clarify the problem.)
No other constraints are given in the problem. (For example, it is not given in the problem that we stop at no more than one beach, or that we must stop at one of every kind, or that we must go to a beach before we go to a museum. If those are constraints then they must be stated in the problem)
Suppose there are a total of n attractions. For each attraction either we visit it or we do not. It might seem that there are 2^n possibilities. However, there's a problem. Suppose we have two museums, M1 and M2, both 5 km down the road. The possible routes are:
(Start, End) -- visit no attractions on your road trip
(Start, M1, End)
(Start, M2, End)
(Start, M1, M2, End)
(Start, M2, M1, End)
There are five non-backtracking possibilities, not four.
The algorithm you want is:
Partition the attractions by distance, so that all the partitions contain the attractions that are at the same distance.
For each partition, generate a set of all the possible orderings of all the subsets within that partition. Do not forget that "skip all of them" is a possible ordering.
The combinations you want are the Cartesian product of all the partition ordering sets.
That should give you enough hints to make progress. You have several problems to solve here: partitioning, permuting within a partition, and then taking the cross product of arbitrarily many sets. I and many others have written articles on all of these subjects, so do some research if you do not know how to solve these sub-problems yourself.
As for the asymptotic performance: As noted above, the problem given is to enumerate the solutions. The best possible case is, as noted before, 2^n for cases where there are no attractions at the same distance, so we are at least exponential. If there are collisions then it becomes a product of many factorials; I leave it to you to work it out, but it's big.
Again: if the problem is to work out the number of solutions, that's much easier. You don't have to enumerate them to know how many solutions there are! Just figure out the number of orderings at each partition and then multiply all the counts together. I leave figuring out the asymptotic performance of partitioning, working out the number of orderings, and multiplying them together as an exercise.
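As a hedged illustration of that counting approach (not the author's code): within a partition of k attractions at the same distance, the number of ways to visit an ordered subset of them is the sum over j of k!/(k-j)!, and the per-partition counts multiply. A minimal sketch, assuming the usual using System.Linq and System.Numerics directives:

// Hypothetical sketch: count (rather than enumerate) the non-backtracking trips,
// given the distance of every attraction from the start.
static System.Numerics.BigInteger CountTrips(IEnumerable<int> distances)
{
    System.Numerics.BigInteger total = 1;
    foreach (var partition in distances.GroupBy(d => d))
    {
        int k = partition.Count();
        // ordered subsets of a k-element partition: sum over j of k!/(k-j)!
        System.Numerics.BigInteger orderings = 0, perm = 1;
        for (int j = 0; j <= k; j++)
        {
            orderings += perm;   // perm == k * (k-1) * ... * (k-j+1)
            perm *= k - j;
        }
        total *= orderings;
    }
    return total;
}

For the two-museum example above this gives 1 + 2 + 2 = 5 orderings for that partition, matching the five routes listed.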
Your solution runs in O(n^3). But if you need to generate all possible combinations and the distances are sorted row- and column-wise, i.e.
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
all solutions will degrade to O(n^3), as they require computing all possible combinations.
If the input has lots of data and the distances are spread relatively far apart, then a sort + binary search + recursive solution might be faster.
static List<List<int>> answer = new List<List<int>>();
static void findPaths(List<List<int>> distances, List<int> path, int rowIndex = 0, int previousValue = -1)
{
    if (rowIndex == distances.Count)
    {
        answer.Add(path);
        return;
    }
    previousValue = previousValue == -1 ? distances[0][0] : previousValue;
    int startIndex = distances[rowIndex].BinarySearch(previousValue);
    startIndex = startIndex < 0 ? Math.Abs(startIndex) - 1 : startIndex;
    // No further destination can be added
    if (startIndex == distances[rowIndex].Count)
        return;
    for (int i = startIndex; i < distances[rowIndex].Count; ++i)
    {
        var temp = new List<int>(path);
        int currentValue = distances[rowIndex][i];
        temp.Add(currentValue);
        findPaths(distances, temp, rowIndex + 1, currentValue);
    }
}
The majority of the savings in this solution comes from the fact that, since the data is already sorted, we need not look at destinations in the next row whose distance is less than the previous value we have.
For smaller and more closely spaced distances this might be overkill, with the additional sorting and binary-search overhead making it slower than the straightforward brute-force approach.
Ultimately I think this comes down to what your data looks like; you can try out both approaches and see which one is faster for you.
Note: This solution does not assume strictly increasing distances, i.e. [29, 37, 37] is valid here. If you do not want such solutions you'll have to change the binary search to do an upper bound as opposed to a lower bound.
Use dynamic programming with state. As there are only 3 arrays, there are only 2*2*2 = 8 states.
Combine the arrays and sort them: [29, 37, 37, 50, 61, 70]. Then we make a 2d array dp[0..6][0..7]. There are 8 states:
001 means we have chosen 1st array.
010 means we have chosen 2nd array.
011 means we have chosen 1st and 2nd array.
.....
111 means we have chosen 1st, 2nd, 3rd array.
The complexity is O(n*8)=O(n)
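One hedged way to realize the combine-and-sort idea (a sketch under the same increasing-distance interpretation as the answers above, keeping running counts instead of the full dp table):

// Hypothetical sketch: count triples a < b < c with a from A, b from B, c from C.
// After sorting, the sweep itself is O(n). Assumes using System.Linq.
static long CountIncreasingTrips(int[] A, int[] B, int[] C)
{
    // tag each value with its source array and sort by value;
    // on ties, process C before B before A so equal values never count as increasing
    var items = A.Select(v => (v, src: 0))
                 .Concat(B.Select(v => (v, src: 1)))
                 .Concat(C.Select(v => (v, src: 2)))
                 .OrderBy(t => t.v).ThenByDescending(t => t.src);
    long fromA = 0, fromAB = 0, total = 0;
    foreach (var (v, src) in items)
    {
        if (src == 0) fromA++;              // an A value can start a trip
        else if (src == 1) fromAB += fromA; // each smaller A value extends to an (A,B) pair
        else total += fromAB;               // each smaller (A,B) pair extends to a full trip
    }
    return total;
}

For ([29, 50], [61, 37], [37, 70]) this returns 3, matching the first test case.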

Java / C# - Array[][] complexity task [duplicate]

This question already has answers here:
Algorithm: how to find a column in matrix filled with all 1, time complexity O(n)?
I'm dealing with a complexity question from my university:
Program input: An n x n Array[][] filled with either 0 or 1.
DEFINITION: Define k as a SINK if in the k-th row all the values are 0, and in the k-th column all the values are 1 (except [k][k] itself, which needs to be 0).
Program output: Is there a number k that is a SINK? If so, return k, else return -1.
Example :
On Arr A, k=3 is a SINK; on Arr B there is no SINK, so -1 is returned.
The main problem with this task is that the complexity of the program must be below O(n^2). I have managed to solve it at that complexity, going over the diagonal and summing the rows and columns, but I haven't found a way to solve it in O(log n) or O(n). Also, the task prevents you from using another array (due to memory complexity). Can anyone shed some light on the matter? Thanks in advance!
Just to make explicit the answer harold links to in the OP's comments: start yourself off with a list of all n indices, S = {0, 1, .., n-1}. These are our candidates for sinks. At each step, we're going to eliminate one of them.
Consider the first two elements of S, say i and j.
Check whether A[i, j] is 1.
If it is, remove i from S (because the i-th row isn't all 0s, so i can't be our sink)
If it isn't, remove j from S (because the j-th column isn't all 1s, so j can't be our sink)
If there are still two or more elements in S, go back to Step 1.
When we get to the last element, say k, check whether the k-th row is all zeros and the k-th column (other than A[k,k]) is all ones.
If they are, k is a sink and you can return it.
If they aren't, the matrix does not have a sink and you can return -1.
There are n elements in S to begin with, each step eliminates one of them and each step takes constant time, so it's O(n) overall.
You mention you don't want to use a second array. If that really is strict, you can just use two integers instead, one representing the "survivor" from the last step and one representing how far into the sequence 0, 1, .., n-1 you are.
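A minimal C# sketch of that elimination (hypothetical helper name; assumes A is a square int[,] of 0s and 1s):

// Hedged sketch of the O(n) elimination: `candidate` plays the "survivor" and
// `next` walks the remaining indices, so no extra array is needed.
static int FindSink(int[,] A)
{
    int n = A.GetLength(0);
    int candidate = 0;
    for (int next = 1; next < n; next++)
    {
        // if A[candidate, next] == 1, candidate's row isn't all 0s, so eliminate candidate;
        // otherwise next's column has a 0, so eliminate next (by keeping candidate)
        if (A[candidate, next] == 1)
            candidate = next;
    }
    // verify the surviving candidate really is a sink
    for (int i = 0; i < n; i++)
    {
        if (i == candidate) continue;
        if (A[candidate, i] != 0 || A[i, candidate] != 1)
            return -1;
    }
    return candidate;
}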
I've never seen this algorithm before and I'm quite impressed with its simplicity. Cheers.

Bitwise operation in large list as fast as possible in c#

I have a list of 10,000 long values
and I want to compare that data with 100,000 other long values.
The comparison is a bitwise operation:
if ((a & b) == a) count++;
Which algorithm can I use to get the best performance?
If I understand your question correctly, you want to check a against each b whether some predicate is true. So a naive solution to your problem would be as follows:
var result = aList.Sum(a => bList.Count(b => (a & b) == a));
I'm not sure this can really be sped up for an arbitrary predicate, because you can't get around checking each a against each b. What you could try is to run the query in parallel:
var result = aList.AsParallel().Sum(a => bList.Count(b => (a & b) == a));
Example:
aList: 10,000 random long values; bList: 100,000 random long values.
without AsParallel: 00:00:13.3945187
with AsParallel: 00:00:03.8190386
Put all of your a's into a trie data structure, where the first level of the tree corresponds to the first bit of the number, the second to the second bit, and so on. Then, for each b, walk down the trie; if a bit is 1 in b, count both branches, and if it is 0 in b, count only the 0 branch of the trie. I think this should be O(n+m), but I haven't thought about it very hard.
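A hedged sketch of such a trie (hypothetical types, not the poster's code; assumes 64-bit keys walked from the most significant bit, with counts stored at the leaves):

// Hypothetical sketch of the bit-trie idea.
class BitTrieNode
{
    public BitTrieNode[] Child = new BitTrieNode[2];
    public int Count; // number of a-values ending at this leaf
}

static void Insert(BitTrieNode root, ulong a)
{
    var node = root;
    for (int bit = 63; bit >= 0; bit--)
    {
        int d = (int)((a >> bit) & 1);
        node = node.Child[d] ?? (node.Child[d] = new BitTrieNode());
    }
    node.Count++;
}

// counts the a-values whose set bits are a subset of b, i.e. (a & b) == a
static long CountSubsets(BitTrieNode node, ulong b, int bit = 63)
{
    if (node == null) return 0;
    if (bit < 0) return node.Count;
    long result = CountSubsets(node.Child[0], b, bit - 1);   // a may always have a 0 here
    if (((b >> bit) & 1) != 0)
        result += CountSubsets(node.Child[1], b, bit - 1);   // a may have a 1 only where b does
    return result;
}

You would build the trie from the 10,000 a values once and then compute bList.Sum(b => CountSubsets(root, (ulong)b)). Note that the walk can still visit many branches when b has many 1 bits, so the O(n+m) estimate is optimistic in the worst case.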
You can probably get the same semantics but with better cache characteristics by sorting the list of a's and using the sorted list in much the same way as the trie. This is going to be slightly worse in terms of the number of operations - because you'll have to search for stuff a lot of the time - but the respect for the CPU cache might more than make up for it.
N.B. I haven't thought about correctness much harder than I've thought about big-O notation, which is to say probably not enough.

Find the Discrete Pairs {x,y} that Satisfy Inequality Constraints

I have a few inequalities regarding {x,y}; the pair must satisfy the following:
x>=0
y>=0
f(x,y)=x^2+y^2>=100
g(x,y)=x^2+y^2<=200
Note that x and y must be integers.
Graphically it can be represented as follows; the blue region is the region that satisfies the above inequalities:
The question now is, is there any function in Matlab that finds every admissible pair of {x,y}? If there is an algorithm to do this kind of thing I would be glad to hear about it as well.
Of course, one approach we can always use is brute force approach where we test every possible combination of {x,y} to see whether the inequalities are satisfied. But this is the last resort, because it's time consuming. I'm looking for a clever algorithm that does this, or in the best case, an existing library that I can use straight-away.
The x^2+y^2>=100 and x^2+y^2<=200 are just examples; in reality f and g can be any polynomial functions of any degree.
Edit: C# code is welcome as well.
This is surely not possible to do in general for a general set of polynomial inequalities, by any method other than enumerative search, even if there are a finite number of solutions. (Perhaps I should say not trivial, as it is possible. Enumerative search will work, subject to floating point issues.) Note that the domain of interest need not be simply connected for higher order inequalities.
Edit: The OP has asked about how one might proceed to do a search.
Consider the problem
x^3 + y^3 >= 1e12
x^4 + y^4 <= 1e16
x >= 0, y >= 0
Solve for all integer solutions of this system. Note that integer programming in ANY form will not suffice here, since ALL integer solutions are requested.
Use of meshgrid here would force us to look at points in the domain (0:10000)X(0:10000). So it would force us to sample a set of 1e8 points, testing every point to see if they satisfy the constraints.
A simple loop can potentially be more efficient than that, although it will still require some effort.
% Note that I will store these in a cell array,
% since I cannot preallocate the results.
tic
xmax = 10000;
xy = cell(1,xmax);
for x = 0:xmax
    % solve for y, given x. This requires us to
    % solve for those values of y such that
    %   y^3 >= 1e12 - x.^3
    %   y^4 <= 1e16 - x.^4
    % These are simple expressions to solve for.
    y = ceil((1e12 - x.^3).^(1/3)):floor((1e16 - x.^4).^0.25);
    n = numel(y);
    if n > 0
        xy{x+1} = [repmat(x,1,n);y];
    end
end
% flatten the cell array
xy = cell2mat(xy);
toc
The time required was...
Elapsed time is 0.600419 seconds.
Of the 100020001 combinations that we might have tested for, how many solutions did we find?
size(xy)
ans =
2 4371264
Admittedly, the exhaustive search is simpler to write.
tic
[x,y] = meshgrid(0:10000);
k = (x.^3 + y.^3 >= 1e12) & (x.^4 + y.^4 <= 1e16);
xy = [x(k),y(k)];
toc
I ran this on a 64 bit machine, with 8 gig of ram. But even so the test itself was a CPU hog.
Elapsed time is 50.182385 seconds.
Note that floating point considerations will sometimes cause a different number of points to be found, depending on how the computations are done.
Finally, if your constraint equations are more complex, you might need to use roots in the expression for the bounds on y, to help identify where the constraints are satisfied. The nice thing here is it still works for more complicated polynomial bounds.
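Since the question also welcomes C#, here is a hedged sketch of the same loop-over-x idea for the example constraints x^3 + y^3 >= 1e12 and x^4 + y^4 <= 1e16 (boundary rounding can differ slightly because of floating point, as noted above):

// Hypothetical C# version of the per-x bound computation, 0 <= x, y <= 10000.
var solutions = new List<(int x, int y)>();
for (int x = 0; x <= 10000; x++)
{
    double x3 = (double)x * x * x;
    double x4 = x3 * x;
    int yMin = (int)Math.Ceiling(Math.Pow(Math.Max(0.0, 1e12 - x3), 1.0 / 3.0));
    int yMax = (int)Math.Floor(Math.Pow(Math.Max(0.0, 1e16 - x4), 0.25));
    for (int y = yMin; y <= yMax; y++)
        solutions.Add((x, y));
}
Console.WriteLine(solutions.Count);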

Improving set comparisons by hashing them (being overly clever..?)

After my last, failed, attempt at asking a question here I'm trying a more precise question this time:
What I have:
1) A huge dataset (finite, but I wasted days of multi-core processing time to compute it before...) of ISet<Point>.
2) A list of input values between 0 and 2^n, n ≤ 17
What I need:
3) A table of [1], [2] where I map every value of [2] to a value of [1]
The processing:
For this computation I have a formula that takes a bit value (from [2]) and a set of positions (from [1]) and creates a new ISet<Point>. I need to find out which of the original sets is equal to the resulting set (i.e. the "cell" in the table at "A7" might point to "B").
The naive way:
Compute the new ISet<Point> and use .Contains(mySet) or something similar on the list of values from [1]. I did that in previous versions of this proof of concept/pet project and it was dead slow when I started feeding huge numbers. Yes, I used a profiler. No, this wasn't the only slow part of the system, but I wasted a considerable amount of time in this naive lookup/mapping.
The question, finally:
Since I basically just need to remap to the input, I thought about creating a list of hashed values for the list of ISet<Point>, doing the same for my result from the processing, and thereby avoiding comparing whole sets.
Is this a good idea? Would you call this premature optimization (I know that the naive way above is too slow, but should I implement something less clever first? Performance is really important here, think days of runtime again)? Any other suggestions to ease the burden here, or ideas on what I should read up on?
Update: Sorry for not providing a better explanation or a sample right away.
Sample for [1] (Note: These are real possible datapoints, but obviously it's limited) :
new List<ISet<Point>>() {
    new HashSet<Point>() { new Point(0,0) },
    new HashSet<Point>() { new Point(0,0), new Point(2,1) },
    new HashSet<Point>() { new Point(0,1), new Point(3,1) }
}
[2] is just a boolean vector of length n. For n = 2 it's
0,0
0,1
1,0
1,1
I can do that one by using an int or long, basically.
Now I have a function that takes a vector and an ISet<Point> and returns a new ISet<Point>. It's not a 1:1 transformation: a set of 5 might result in a set of 11 or whatever. The resulting ISet<Point> is, however, guaranteed to be part of the input.
Using letters for a set of points and numbers for the bit vectors, I'm starting with this
A B C D E F
1
2
3
4
5
6
7
What I need to have at the end is
A B C D E F
1 - C A E - -
2 B C E F A -
3 ................
4 ................
5 ................
6 F C B A - -
7 E - C A - D
There are several costly operations in this project; one is the preparation of the sets of points ([1]). But this question is about the matching now: I can easily (more or less, not that important now) compute a target ISet for a given bit vector and a source ISet. Now I need to match/find that in the original set.
The whole beast is going to be a state machine, where the set of points is a valid state. Later I don't care about the individual states, I can actually refer to them by anything (a letter, an index, whatever). I just need to keep the associations:
1, B => C
Update: Eric asked if a HashSet would be possible. The answer is yes, but only if the dataset stays small enough. My question (hashing) is: Might it be possible/a good idea to employ a hashing algorithm for this hashset? My idea is this:
Walk the (lazily generated) list/sequence of ISet<Point> (I could change this type, I just want to stress that it is a mathematical set of points, no duplicates).
Create a simpler representation of the input (a hash?) and store it (in a hashset?)
Compute all target sets for this input, but only store again a simple representation (see above)
Discard the set
Fix up the mapping (equal hash = equal state)
Good idea? Problems with this? One that I could come up with is a collision (how probable is that?) - and I wouldn't know a good hashing function to begin with..
OK, I think I understand the problem at least now. Let me see if I can rephrase.
Let's start by leaving sets out of it. We'll keep it abstract.
You have a large list L, containing instances of reference type S (for "set"). Such a list is of course logically a mapping from natural numbers N onto S.
L: N --> S
S has the property that two instances can be compared for both reference equality and value equality. That is, there can be two instances of S which are not reference equals, but logically represent the same value.
You have a function F which takes a value of type V (for "vector") and an instance of type S and produces another instance of type S.
F: (V, S) --> S
Furthermore, you know that if F is given an instance of S from L then the resulting instance of S will be value equals to something on the list, but not necessarily reference equals.
The problem you face is: given an instance s of S which is the result of a call to F, which member L(n) is value-equals to s?
Yes?
The naive method -- going down L(1), L(2), ... testing set equality along the way -- will be dead slow. It'll be at least linear in the size of L.
I can think of several different ways to proceed. The easiest is your initial thought: make L something other than a list. Can you make it a HashSet<S> instead of List<S>? If you implement a hashing algorithm and equality method on S then you can build a fast lookup table (see the sketch below).
If that doesn't suit then we'll explore other options.
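A minimal sketch of that idea, assuming S is ISet<Point> and Point exposes integer X and Y: an order-independent hash plus value equality lets a Dictionary do the lookup in expected constant time. The comparer below is illustrative, not a prescribed implementation:

// Hypothetical sketch: value equality and an order-independent hash for ISet<Point>.
class PointSetComparer : IEqualityComparer<ISet<Point>>
{
    public bool Equals(ISet<Point> a, ISet<Point> b) => a.SetEquals(b);

    public int GetHashCode(ISet<Point> set)
    {
        // XOR is commutative, so the result does not depend on enumeration order
        int hash = 0;
        foreach (var p in set)
            hash ^= unchecked(p.X * 397) ^ p.Y;
        return hash;
    }
}

// Build the lookup once from the big list L, mapping each set to its index:
//   var lookup = new Dictionary<ISet<Point>, int>(new PointSetComparer());
//   for (int i = 0; i < L.Count; i++) lookup[L[i]] = i;
// then lookup[F(v, s)] finds the matching original entry without a linear scan.

Hash collisions only cost a fall-back to SetEquals, so they hurt speed, not correctness.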
UPDATE:
OK, so I can see two basic ways to deal with your memory problem. (1) Keep everything in memory using data structures that are much smaller than your current implementation, or (2) change how you store stuff on disk so that you can store an "index" in memory and rapidly go to the right "page" of the disk file to extract the information you need.
You could be representing a point as a single short where the top byte is x and the bottom byte is y, instead of representing it as two ints; a savings of 75%.
A set of points could be implemented as a sorted array of shorts, which is pretty compact and easy to write a hash algorithm for.
That's probably the approach I'd go for since your data are so compressible.
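As a hedged illustration of that packing (assuming each coordinate fits in a byte, per the suggestion above; the names are hypothetical):

// Hypothetical sketch: pack a point into one ushort (top byte = x, bottom byte = y)
// and hash a set stored as a sorted ushort[].
static ushort Pack(int x, int y) => (ushort)((x << 8) | (y & 0xFF));

static int HashSortedSet(ushort[] sortedPoints)
{
    // FNV-style running hash; the array is sorted, so equal sets hash equally
    unchecked
    {
        int hash = (int)2166136261;
        foreach (ushort p in sortedPoints)
            hash = (hash * 16777619) ^ p;
        return hash;
    }
}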
