This question already has answers here:
Algorithm: how to find a column in matrix filled with all 1, time complexity O(n)?
(5 answers)
Closed 9 years ago.
I'm dealing with a tricky complexity question from my university:
Program input : An n x n Array[][] that is filled with either 0 or 1.
DEFINITION: Define k as a SINK if all the values in row k are 0 and all the values in column k are 1 (except [k][k] itself, which needs to be 0).
Program output : Is there a number k that is a SINK? If so, return k, else return -1.
Example :
On Arr A k=3 is a SINK, on Arr B there is no SINK, so -1 is returned.
The main problem with this task is that the complexity of the program must be below O(n^2). I have managed to solve it in O(n^2) by walking along the diagonal and summing the rows and columns, but I haven't found a way to solve it in O(log n) or O(n). The task also forbids using another Array[] (because of memory complexity). Can anyone shed any light on the matter? Thanks in advance!
Just to make explicit the answer harold links to in the OP's comments: start yourself off with a list of all n indices, S = {0, 1, .., n-1}. These are our candidates for sinks. At each step, we're going to eliminate one of them.
1. Consider the first two elements of S, say i and j.
2. Check whether A[i, j] is 1.
a. If it is, remove i from S (because the i-th row isn't all 0s, so i can't be our sink).
b. If it isn't, remove j from S (because the j-th column isn't all 1s, so j can't be our sink).
3. If there are still two or more elements in S, go back to Step 1.
4. When we get to the last element, say k, check whether the k-th row is all zeros and the k-th column (other than A[k,k]) is all ones.
a. If they are, k is a sink and you can return it.
b. If they aren't, the matrix does not have a sink and you can return -1.
There are n elements in S to begin with, each step eliminates one of them and each step takes constant time, so it's O(n) overall.
You mention you don't want to use a second array. If that really is strict, you can just use two integers instead, one representing the "survivor" from the last step and one representing how far into the sequence 0, 1, .., n-1 you are.
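The two-integer variant described above could look roughly like the following. This is a minimal sketch, not the answerer's code: the class and method names (SinkFinder, FindSink) and the use of a rectangular int[,] are assumptions made for illustration.

```csharp
using System;

static class SinkFinder
{
    // Returns the index k of the sink, or -1 if the matrix has none.
    // Uses O(n) time and two integers of extra state (candidate, next).
    public static int FindSink(int[,] a)
    {
        int n = a.GetLength(0);
        if (n == 0) return -1;

        // Elimination pass: "candidate" is the survivor so far,
        // "next" is how far into 0, 1, .., n-1 we have got.
        int candidate = 0;
        for (int next = 1; next < n; next++)
        {
            // A 1 in the candidate's row rules the candidate out;
            // a 0 in next's column rules next out instead.
            if (a[candidate, next] == 1) candidate = next;
        }

        // Verification pass: row must be all 0s (including [k][k]),
        // column must be all 1s apart from [k][k].
        for (int i = 0; i < n; i++)
        {
            if (a[candidate, i] != 0) return -1;
            if (i != candidate && a[i, candidate] != 1) return -1;
        }
        return candidate;
    }
}
```

Both passes touch at most 3n cells, so the whole thing stays linear. Note that the returned index is zero-based.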
I've never seen this algorithm before and I'm quite impressed with its simplicity. Cheers.
Suppose I am given three arrays of equal length. Each array represents the distances to one type of attraction (i.e. the first array is only theme parks, the second is only museums, the third is only beaches) on a road trip I am taking. I want to determine all possible trips that stop at one of each type of attraction, never driving backwards and never visiting the same attraction twice.
For example, if I have the following three arrays:
[29 50]
[61 37]
[37 70]
The function would return 3 because the possible combinations would be: (29,61,70)(29,37,70)(50,61,70)
What I've got so far:
public int test(int[] A, int[] B, int[] C) {
int firstStop = 0;
int secondStop = 0;
int thirdStop = 0;
List<List<int>> possibleCombinations = new List<List<int>>();
for(int i = 0; i < A.Length; i++)
{
firstStop = A[i];
for(int j = 0; j < B.Length; j++)
{
if(firstStop < B[j])
{
secondStop = B[j];
for(int k = 0; k < C.Length; k++)
{
if(secondStop < C[k])
{
thirdStop = C[k];
possibleCombinations.Add(new List<int>{firstStop, secondStop, thirdStop});
}
}
}
}
}
return possibleCombinations.Count();
}
This works for the following test cases:
Example test: ([29, 50], [61, 37], [37, 70])
OK Returns 3
Example test: ([5], [5], [5])
OK Returns 0
Example test: ([61, 62], [37, 38], [29, 30])
FAIL Returns 0
What is the correct algorithm to calculate this correctly?
What is the best performing algorithm?
How can I determine this algorithm's time complexity (i.e. is it O(N*log(N)))?
UPDATE: The question has been rewritten with new details and still is completely unclear and self-contradictory; attempts to clarify the problem with the original poster have been unsuccessful, and the original poster admits to having started coding before understanding the problem themselves. The solution below is correct for the problem as it was originally stated; what the solution to the real problem looks like, no one can say, because no one can say what the real problem is. I'll leave this here for historical purposes.
Let's re-state the problem:
We are given three arrays of distances to attractions along a road.
We wish to enumerate all sequences of possible stops at attractions that do not backtrack. (NOTE: The statement of the problem is to enumerate them; the wrong algorithm given counts them. These are completely different problems. Counting them can be extremely fast. Enumerating them is extremely slow! If the problem is to count them then clarify the problem.)
No other constraints are given in the problem. (For example, it is not given in the problem that we stop at no more than one beach, or that we must stop at one of every kind, or that we must go to a beach before we go to a museum. If those are constraints then they must be stated in the problem)
Suppose there are a total of n attractions. For each attraction either we visit it or we do not. It might seem that there are 2^n possibilities. However, there's a problem. Suppose we have two museums, M1 and M2, both 5 km down the road. The possible routes are:
(Start, End) -- visit no attractions on your road trip
(Start, M1, End)
(Start, M2, End)
(Start, M1, M2, End)
(Start, M2, M1, End)
There are five non-backtracking possibilities, not four.
The algorithm you want is:
Partition the attractions by distance, so that all the partitions contain the attractions that are at the same distance.
For each partition, generate a set of all the possible orderings of all the subsets within that partition. Do not forget that "skip all of them" is a possible ordering.
The combinations you want are the Cartesian product of all the partition ordering sets.
That should give you enough hints to make progress. You have several problems to solve here: partitioning, permuting within a partition, and then taking the cross product of arbitrarily many sets. I and many others have written articles on all of these subjects, so do some research if you do not know how to solve these sub-problems yourself.
As for the asymptotic performance: As noted above, the problem given is to enumerate the solutions. The best possible case is, as noted before, 2^n for cases where there are no attractions at the same distance, so we are at least exponential. If there are collisions then it becomes a product of many factorials; I leave it to you to work it out, but it's big.
Again: if the problem is to work out the number of solutions, that's much easier. You don't have to enumerate them to know how many solutions there are! Just figure out the number of orderings at each partition and then multiply all the counts together. I leave figuring out the asymptotic performance of partitioning, working out the number of orderings, and multiplying them together as an exercise.
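For the counting variant, one possible sketch is shown below. It assumes the problem as restated in this answer (any subset of attractions, no backtracking, attractions at the same distance may be visited in any order). The names RouteCounter, OrderingsOfSubsets and CountRoutes are made up for illustration. For a partition of k attractions, the number of orderings of all its subsets is the sum over j = 0..k of k!/(k-j)!, which gives 5 for the two-museum example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Numerics;

static class RouteCounter
{
    // Number of ordered selections (including "skip all of them")
    // from k distinct attractions: sum over j of k! / (k - j)!.
    static BigInteger OrderingsOfSubsets(int k)
    {
        BigInteger total = 0;
        BigInteger term = 1; // k! / (k - j)! for j = 0
        for (int j = 0; j <= k; j++)
        {
            total += term;
            term *= (k - j); // advance to the term for j + 1
        }
        return total;
    }

    // Partition by distance, count orderings per partition, multiply.
    public static BigInteger CountRoutes(IEnumerable<int> distances) =>
        distances.GroupBy(d => d)
                 .Select(g => OrderingsOfSubsets(g.Count()))
                 .Aggregate(BigInteger.One, (acc, x) => acc * x);
}
```

With no two attractions at the same distance every partition contributes a factor of 2, so the count collapses to 2^n as described above.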
Your solution runs in O(n^3). But if you need to generate all possible combinations and the distances are sorted row-wise and column-wise, e.g.
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
every solution degrades to O(n^3), because it has to produce every possible subsequence.
If the input is large and the distances are relatively far apart, a sort + binary search + recursive solution might be faster.
static List<List<int>> answer = new List<List<int>>();
static void findPaths(List<List<int>> distances, List<int> path, int rowIndex = 0, int previousValue = -1)
{
if(rowIndex == distances.Count)
{
answer.Add(path);
return;
}
previousValue = previousValue == -1 ? distances[0][0] : previousValue;
int startIndex = distances[rowIndex].BinarySearch(previousValue);
startIndex = startIndex < 0 ? Math.Abs(startIndex) - 1 : startIndex;
// No further destination can be added
if (startIndex == distances[rowIndex].Count)
return;
for(int i=startIndex; i < distances[rowIndex].Count; ++i)
{
var temp = new List<int>(path);
int currentValue = distances[rowIndex][i];
temp.Add(currentValue);
findPaths(distances, temp, rowIndex + 1, currentValue);
}
}
Most of the savings in this solution come from the fact that, since the data is already sorted, we need not look at distances in the next destinations that are less than the previous value we have.
For smaller and more closely spaced distances this might be overkill, with the additional sorting and binary search overhead making it slower than the straightforward brute-force approach.
Ultimately I think this comes down to what your data looks like; you can try both approaches and see which one is faster for you.
Note: This solution does not assume strictly increasing distances, i.e. [29, 37, 37] is valid here. If you do not want such solutions, you'll have to change the binary search to find an upper bound as opposed to a lower bound.
Use dynamic programming with state. As there are only 3 arrays, there are only 2*2*2 = 8 states.
Combine the arrays and sort the result: [29, 37, 37, 50, 61, 70]. Then make a 2D array dp[0..6][0..7]. There are 8 states:
001 means we have chosen 1st array.
010 means we have chosen 2nd array.
011 means we have chosen 1st and 2nd array.
.....
111 means we have chosen 1st, 2nd, 3rd array.
The complexity is O(n*8)=O(n)
I am having trouble understanding what the for loop is doing.
To me, I see it as:
int i = 0; //Declaring i to become 0. i is the value in myArray?
i < myArray.Length; //When i is less than any value in myArray keep looping?
i++; //Every time this loop goes through increase i by 1?
//Making an array called myArray that contains 20,5,7,2,55
int[] myArray = { 20, 5, 7, 2, 55 };
//Using the built in feature, Array.Sort(); to sort out myArray
Array.Sort(myArray);
for (int i = 0; i < myArray.Length; i++)
{
Console.WriteLine(myArray[i]);
}
I'm going to make some assumptions about your knowledge of programming, so forgive me if this explanation covers topics you're already familiar with, but they are all important for understanding what a for loop does, what its use is, and what the semantics are going to be when someone comes behind you and reads your code. Your question demonstrates that you're super close to understanding it, so hopefully it'll hit you like a ton of bricks once you have a good explanation.
Consider an array of strings of length 5. You would initialize it in C# like so:
string[] arr = new string[5];
What this means is that you have an array that has allocated 5 slots for strings. The names of these slots are the indexes of the array. Unfortunately for those who are new to programming, like yourself, indexes start at 0 (this is called zero-indexing) instead of 1. What that means is that the first slot in our new string[] has the name or index of 0, the second of 1, the third of 2 and so on. That means that the length of the array will always be a number equal to the index of the final slot plus one; to put it another way, because arrays are 0 indexed and the first (1st) slot's index is 0, we know that the index of any given slot is n - 1 where n is what folks who are not programmers (or budding programmers!) would typically consider to be the position of that slot in the array as a whole.
We can use the index to pick out the value from an array in the slot that corresponds to the index. Using your example:
int[] myArray = { 20, 5, 7, 2, 55 };
bool first = myArray[0] == 20; //=> true
bool second = myArray[1] == 5; //=> true
bool third = myArray[2] == 7; //=> true
// and so on...
So you see that the number we are passing into the indexer (MSDN) (the square brackets []) corresponds to the location in the array that we are trying to access.
for loops in C syntax languages (C# being one of them along with C, C++, Java, JavaScript, and several others) generally follow the same convention for the "parameters":
for (index_initializer; condition; index_incrementer)
To understand the intended use of these fields it's important to understand what indexes are. Indexes can be thought of as the names or locations for each of the slots in the array (or list or anything that is list-like).
So, to explain each of the parts of the for loop, let's go through them one by one:
Index Initializer
Because we're going to use the index to access the slots in the array, we need to initialize it to a starting value for our for loop. Everything before the first semicolon in the for loop statement is going to run exactly once before anything else in the for loop is run. We call the variable initialized here the index as it keeps track of the current index we're on in the scope of the for loop's life. It is typical (and therefore good practice) to name this variable i for index with nested loops using the subsequent letters of the Latin alphabet. Like I said, this initializing statement happens exactly once so we assign 0 to i to represent that we want to start looping on the first element of the array.
Condition
The next thing that happens when you declare a for loop is that the condition is checked. This check will be the first thing that is run each time the loop runs and the loop will immediately stop if the check returns false. This condition can be anything as long as it results in a bool. If you have a particularly complicated for loop, you might delegate the condition to a method call:
for (int i = 0; ShouldContinueLooping(i); i++)
In the case of your example, we're checking against the length of the array. What we are saying here from an idiomatic standpoint (and what most folks will expect when they see that as the condition) is that you're going to do something with each of the elements of the array. We only want to continue the loop so long as our i is within the "bounds" of the array, which is always defined as 0 through length - 1. Remember how the last index of an array is equal to its length minus 1? That's important here because the first time this condition is going to be false (that is, i will not be less than the length) is when it is equal to the length of the array and therefore 1 greater than the final slot's index. We need to stop looping because the next part of the for statement increases i by one and would cause us to try to access an index outside the bounds of our array.
Index incrementer
The final part of the for loop is executed once as the last thing that happens each time the loop runs. Your comment for this part is spot on.
To recap the order in which things happen:
1. Index initializer
2. Conditional check ("break out" or stop looping if the check returns false)
3. Body of loop
4. Index incrementer
5. Repeat from step 2
To make this clearer, here's your example with a small addition to make things a little more explicit:
// Making an array called myArray that contains 20,5,7,2,55
int[] myArray = { 20, 5, 7, 2, 55 };
// Using the built in feature, Array.Sort(); to sort out myArray
Array.Sort(myArray);
// Array is now [2, 5, 7, 20, 55]
for (int i = 0; i < myArray.Length; i++)
{
int currentNumber = myArray[i];
Console.WriteLine($"Index {i}; Current number {currentNumber}");
}
The output of running this will be:
Index 0; Current number 2
Index 1; Current number 5
Index 2; Current number 7
Index 3; Current number 20
Index 4; Current number 55
I am having trouble understanding what the for loop is doing.
Then let's take a big step back.
When you see
for (int i = 0; i < myArray.Length; i++)
{
Console.WriteLine(myArray[i]);
}
what you should mentally think is:
int i = 0;
while (i < myArray.Length)
{
Console.WriteLine(myArray[i]);
i++;
}
Now we have rewritten the for in terms of while, which is simpler.
Of course, this requires that you understand "while". We can understand while by again, breaking it down into something simpler. When you see while, think:
int i = 0;
START:
if (i < myArray.Length)
goto BODY;
else
goto END;
BODY:
Console.WriteLine(myArray[i]);
i++;
goto START;
END:
// the rest of your program here.
Now we have broken down your loop into its fundamental parts and the control flow is laid bare to your understanding. Walk through it.
We start with i equal to 0. Suppose the length of the array is 3.
Is 0 less than 3? Yes. So we go to BODY next. We write the 0th element of the array and increment i to 1. Now we go back to START.
Is 1 less than 3? Yes. So we go to BODY next. We write the 1th element of the array and increment i to 2. Now we go back to START.
Is 2 less than 3? Yes. So we go to BODY next. We write the 2th element of the array and increment i to 3. Now we go back to START.
Is 3 less than 3? No. So we go to END, and the rest of your program executes.
Now, you probably have noticed that the "goto" form is incredibly ugly and hard to read and reason about. That's why we invented while and for loops, so that you don't have to write awful code that uses gotos. But you can always reason about simple control flow by going back to the goto form mentally.
i < myArray.Length;
This is not testing against the values inside myArray but against the length (how many items the array contains). Therefore it means: When i is less than the length of the array.
So the loop keeps going, adding 1 to i (as you correctly said) each time it loops. When i is equal to the length of the array, meaning it has gone through all the values, it exits the loop.
As Nicolás Straub pointed out, i is the index into the array, meaning the location of an item in the array. You have initialised it with the value 0, which is correct because the first value in an array has an index of 0.
To directly answer your question about for loops:
A for loop executes lines of code iteratively (multiple times); how many times depends on its control statement:
for (int i = 0; i < myArray.Length; i++)
For loops are generally pre-condition (the condition to loop is before the code) and have loop counters, being i (i is actually a counter but can be seen as the index because you are going through every element, if you wanted to skip some then i would only be a counter). For is great for when you know how many times you want to loop before you start looping.
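To illustrate the point about i being a counter rather than strictly "the index of every element", here is a small sketch in which the counter advances by 2, so some slots are skipped. The names LoopDemo and EveryOther are invented for this example.

```csharp
using System;
using System.Collections.Generic;

static class LoopDemo
{
    // Collects every second element of the array.
    // i is a counter that steps by 2; it is only used as an index
    // for the slots we actually want to visit.
    public static List<int> EveryOther(int[] values)
    {
        var picked = new List<int>();
        for (int i = 0; i < values.Length; i += 2)
        {
            picked.Add(values[i]);
        }
        return picked;
    }
}
```

Called with the sorted array from the question, { 2, 5, 7, 20, 55 }, this visits indexes 0, 2 and 4 and so collects 2, 7 and 55. The condition i < values.Length still guards against running past the end.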
Your thinking is correct, apart from what the others have pointed out. Think of an array as a sequence of data. You can even use the Reverse() method to reverse that sequence. I would research arrays further so that you understand the different things you can do with them and, most importantly, how to read or write them on the console, in a listbox, or in a gridview from a text or CSV file.
I suggest you add:
Console.ReadLine();
When you do this the console window waits for input, so you can read the output:
2
5
7
20
55
I have two lists. The first one contains entries like
RB Leipzig vs SV Darmstadt 98
Hertha Berlin vs Hoffenheim
..
and the second contains basically the same entries, but they could be written in different forms. For example:
Hertha BSC vs TSG Hoffenheim
RB Leipzig vs Darmstadt 98
..
and so on. Both lists represent the same sport games but they can use alternate team names and don't appear in the same order.
My goal (hehe pun) is to unify both lists to one and match the same entries and discard entries which don't appear in both lists.
I already tried using Levenshtein distance and fuzzy search.
I thought about using machine learning but have no idea how to start with that.
Would appreciate any help and ideas!
You can solve this problem using Linear Programming combined with the Levenshtein Distance you already mentioned. Linear Programming is a commonly used optimization technique for solving optimization problems, like this one. Check this link to find out an example how to use Solver Foundation in C#. This example isn't related with the specific problem you have, but is a good example how the library works.
Hints:
You need to build a matrix of distances between each pair of teams/strings between 2 lists. Let's say both lists have N elements. In i-th row of the matrix you will have N values, the j-th value will indicate the Levenshtein Distance between i-th element from the first and j-th element from the second list. Then, you need to set the constraints. The constraints would be:
The sum in each row needs to equal 1
The sum in each column equals 1
Each coefficient (matrix entry) needs to be either 0 or 1
The cost function would be the sum:
sum(coef[i][j] * dist[i][j] for i in [1, n] and for j in [1, n])
You want to minimize this function, because you want the overall "distance" between the 2 sets after the mapping to be as low as possible.
I solved the same problem a couple of months ago and this approach worked perfectly for me.
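The formulation above is the classic assignment problem. The sketch below does not use Solver Foundation or any LP solver. It swaps in an exact bitmask dynamic program, which minimizes the same cost function under the same row and column constraints but is only practical for small lists (roughly 15 entries or fewer). The names NameMatcher, Levenshtein and Match are assumptions for illustration, and both lists are assumed to have the same length.

```csharp
using System;
using System.Numerics;

static class NameMatcher
{
    // Standard two-row dynamic-programming Levenshtein distance.
    public static int Levenshtein(string s, string t)
    {
        var prev = new int[t.Length + 1];
        var cur = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;
        for (int i = 1; i <= s.Length; i++)
        {
            cur[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                cur[j] = Math.Min(Math.Min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            (prev, cur) = (cur, prev);
        }
        return prev[t.Length];
    }

    // Returns match[i] = index in "second" assigned to first[i],
    // minimizing the total Levenshtein distance over all pairs.
    public static int[] Match(string[] first, string[] second)
    {
        int n = first.Length; // assumption: second.Length == n and n is small
        var dist = new int[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                dist[i, j] = Levenshtein(first[i], second[j]);

        int full = 1 << n;
        var best = new int[full];   // best[mask]: min cost using the columns in mask
        var choice = new int[full]; // column assigned last on the best path to mask
        Array.Fill(best, int.MaxValue);
        best[0] = 0;
        for (int mask = 0; mask < full; mask++)
        {
            int row = BitOperations.PopCount((uint)mask); // next row to assign
            if (row == n) continue;
            for (int j = 0; j < n; j++)
            {
                if ((mask & (1 << j)) != 0) continue; // column already used
                int next = mask | (1 << j);
                int cost = best[mask] + dist[row, j];
                if (cost < best[next]) { best[next] = cost; choice[next] = j; }
            }
        }

        // Walk back from the full mask to recover the assignment.
        var match = new int[n];
        int m = full - 1;
        for (int row = n - 1; row >= 0; row--)
        {
            match[row] = choice[m];
            m &= ~(1 << match[row]);
        }
        return match;
    }
}
```

This always pairs every entry with some partner. To discard games that appear in only one list, you would still need to reject pairs whose distance exceeds a threshold of your choosing, which this sketch leaves out.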
You can use a BK-tree (I googled C# implementations and found two: 1, 2). Use the Levenshtein distance as the metric. Optionally, delete the all-uppercase substrings from the names in the lists in order to improve the metric (just be careful that this doesn't accidentally leave you with empty strings for names).
1. Put the names from the first list in the BK-tree
2. Look up the names from the second list in the BK-tree
a. Assign an integer token to the name pair, stored in a Map<Integer, Tuple<String, String>>
b. Replace each team name with the token
3. Sort each token pair (so [8 vs 4] becomes [4 vs 8])
4. Sort each list by its first token in the token pair,
then by the second token in the token pair (so the list
would look like [[1 vs 2], [1 vs 4], [2 vs 4]])
Now you just iterate through the two lists
int i1 = 0;
int i2 = 0;
// Both lists are sorted by (first, second), so walk them in lockstep.
while (i1 < list1.length && i2 < list2.length) {
    if (list1[i1].first == list2[i2].first && list1[i1].second == list2[i2].second) {
        // match
        i1++;
        i2++;
    } else if (list1[i1].first < list2[i2].first) {
        i1++;
    } else if (list1[i1].first > list2[i2].first) {
        i2++;
    } else if (list1[i1].second < list2[i2].second) {
        i1++;
    } else {
        i2++;
    }
}
The problem I'm trying to solve gives me a matrix like
10101
11100
11010
00101
where the rows are supposed to represent topics that a person knows; e.g. Person 1, represented by 10101, knows topics 1, 3 and 5, but not 2 or 4. I need to find the maximum number of topics that a 2-person team could know; e.g. the team that is Person 1 and 3 knows all the topics because between 10101 and 11010 there are 1s at every index.
I have an O(n^2) solution
string[] topic = new string[n];
for(int topic_i = 0; topic_i < n; topic_i++)
{
topic[topic_i] = Console.ReadLine();
}
IEnumerable<int> teamTopics =
from t1 in topic
from t2 in topic
where !Object.ReferenceEquals(t1, t2)
select t1.Zip(t2, (c1, c2) => c1 == '1' || c2 == '1').Sum(b => b ? 1 : 0);
int max = teamTopics.Max();
Console.WriteLine(max);
which is passing all the test cases it doesn't time out on. I suspect the reason it's not fast enough has to do with the time complexity rather than the overhead of the LINQ machinery. But I can't think of a better way to do it.
I thought that maybe I could map the indices of topics to the persons who know them, like
1 -> {1,2,3}
2 -> {2,3}
3 -> {1,2,4}
4 -> {3}
5 -> {1,4}
but I can't think of where to go from there.
Can you supply me with a "hint"?
Let's say we have n people and m topics.
I would argue that your algorithm is O(n^2 * m), where n is number of people, because:
from t1 in topic gets you O(n)
from t2 in topic gets you to O(n^2)
t1.Zip(t2 ... get you to O(n^2 * m)
An optimisation that I see is first to modify strings a bit:
s1 = '0101', where i-th element shows whether a person i knows 1st topic
s2 = '1111', where i-th element shows whether a person i knows 2nd topic.
etc...
Then you analyse string s1. You pick all possible pairs of 1s (O(n^2) elements) that show pairs of people that together know 1st topic. Then go pick a pair from that list and check whether they know 2nd topic as well and so on. When they don't, delete it from the list and move on to another pair.
Unfortunately this looks to be O(n^2 * m) as well, but it should be quicker in practice. For a very sparse matrix it should be close to O(n^2), and for dense matrices it should find a pair pretty soon.
Thoughts:
as a speculative optimization: you could do an O(n) sweep to find the individual with the highest number of skills (largest hamming weight); note them, and stop if they have everything: pair them with anyone, it doesn't matter
you can exclude, without testing, anyone who only has skills shared with the "best" individual - we already know about everything they can offer and have tested against everyone; so only test if (newSkills & ~bestSkills) != 0 - meaning: the person being tested has something that the "best" worker didn't have; this leaves m workers with complementary skills plus the "best" worker (you must include them explicitly, as the ~/!=0 test above will fail for them)
now do another O(m) sweep of possible partners - checking to see if the "most skilled" plus any other gives you all the skills (obviously stop earlier if a single member has all the skills); but either way: keep track of best combination for later reference
you can further halve the time by only considering the triangle, not the square - meaning: you compare row 0 to rows 1-(m-1), but row 1 to rows 2-(m-1), row 5 to 6-(m-1), etc
you can significantly improve things by using integer bit math along with an efficient "hamming weight" algorithm (to count the set bits) rather than strings and summing
get rid of the LINQ
short-circuit if you get all ones (compare to ~((~0)<<k), where k is the number of bits being tested for)
remember to compare any result to the "best" combination we found against the most skilled worker
This is still O(n) + O(m^2) where m <= n is the number of people with skills different to the most skilled worker
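A minimal sketch of the "integer bit math plus hamming weight" and "triangle, not square" ideas from the list above follows. It assumes at most 64 topics so that each person fits in one ulong, and it assumes .NET Core 3.0 or later for System.Numerics.BitOperations. The names TeamTopics and MaxTopics are invented here, and the "most skilled worker" pruning is left out for brevity.

```csharp
using System;
using System.Numerics;

static class TeamTopics
{
    // people[i] is a string of '0'/'1' characters, one per topic.
    // Assumption: every string has the same length, at most 64.
    public static int MaxTopics(string[] people)
    {
        var masks = new ulong[people.Length];
        for (int p = 0; p < people.Length; p++)
        {
            masks[p] = Convert.ToUInt64(people[p], 2); // parse as base 2
        }

        int best = 0;
        // Only the upper triangle: each unordered pair is tested once.
        for (int i = 0; i < masks.Length; i++)
        {
            for (int j = i + 1; j < masks.Length; j++)
            {
                // OR merges the two skill sets; PopCount is the hamming weight.
                int known = BitOperations.PopCount(masks[i] | masks[j]);
                if (known > best) best = known;
            }
        }
        return best;
    }
}
```

For the matrix in the question this returns 5, since 10101 | 11010 has every bit set. With more than 64 topics the same idea works with an array of ulong words per person.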
Pathological but technically correct answer:
insert a Thread.Sleep(FourYears) - all solutions are now essentially O(1)
Your solution is asymptotically as efficient as it gets, because you need to examine all pairs to arrive at the maximum. You can make your code more efficient by replacing strings with BitArray objects, like this:
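var topic = new List<BitArray>();
string line;
while ((line = Console.ReadLine()) != null) {
    topic.Add(new BitArray(line.Select(c => c == '1').ToArray()));
}
var res =
    (from t1 in topic
     from t2 in topic
     // Or() mutates its receiver, so OR into a copy; BitArray.Count is the
     // total length, so the set bits have to be counted explicitly.
     select new BitArray(t1).Or(t2).Cast<bool>().Count(b => b)).Max();
Console.WriteLine(res);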
Given a start and an end, both of type long, I'd like to produce a randomly ordered list of the numbers between them.
At the moment, I'm using a for loop to populate a list:
for (var i = idStart; i < idEnd; i++){ list.Add(i); }
Then I'm shuffling the list using an extension method. However, when the difference between start and end is large (millions), the for loop causes out-of-memory exceptions.
Is there a more efficient, sleeker method for producing a randomly sequenced list of longs, where each number only appears once?
Is there a more efficient, sleeker method for producing a randomly sequenced list of longs, where each number only appears once?
Yes, if you eliminate the requirement that the sequence be truly random. Use the following technique.
Without loss of generality let us suppose that you wish to generate numbers from 0 through n-1 for some n. Clearly you can see how to generate numbers between x and y; just generate numbers from 0 through y-x and then add x to each.
Find a randomly generated number z that is coprime to n. Doing so is left as an exercise to the reader. It will help if the number is pretty large modulo n; the pattern will be easy to notice if z is small modulo n.
Find a randomly generated number m that is between 0 and n-1.
Now generate the sequence (m) * z % n, (m + 1) * z % n, (m + 2) * z % n, and so on. The sequence repeats at (m + n) * z % n; it does not repeat before that. Again, determining why it does not repeat is left as an exercise.
It is easy to see that this is not a true shuffle because there are fewer than n squared possible sequences generated, not the n factorial sequences that are possible with a true shuffle. But it might be good enough for your purposes; if you are using something like System.Random to do randomization you are already abandoning a true shuffle.
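The technique can be sketched as a lazy iterator that never materializes the list. This is an illustration under stated assumptions, not the answerer's code: the names PseudoShuffle and Range are invented, Random.NextInt64 requires .NET 6 or later, the range is half-open [start, end) to match the loop in the question, and n is assumed to be below 2^62 so that the running sum cannot overflow.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;

static class PseudoShuffle
{
    static long Gcd(long a, long b)
    {
        while (b != 0) { (a, b) = (b, a % b); }
        return a;
    }

    // Yields every value in [start, end) exactly once, in a scrambled
    // (but not truly shuffled) order, using constant memory.
    public static IEnumerable<long> Range(long start, long end, Random random)
    {
        long n = end - start;
        if (n <= 0) yield break;
        if (n == 1) { yield return start; yield break; }

        // Pick a random z in [1, n) that is coprime to n by trial.
        long z;
        do { z = random.NextInt64(1, n); } while (Gcd(z, n) != 1);

        // Pick a random starting multiplier m in [0, n).
        long m = random.NextInt64(0, n);

        // (m * z) % n, computed in BigInteger to avoid overflow.
        long current = (long)((BigInteger)m * z % n);
        for (long i = 0; i < n; i++)
        {
            yield return start + current;
            // (m + i + 1) * z % n equals (previous + z) % n.
            current = (current + z) % n;
        }
    }
}
```

A caller would write something like foreach (var id in PseudoShuffle.Range(idStart, idEnd, new Random())) and consume the values one at a time. As the answer notes, small values of z make the pattern easy to spot, so you may want to reject them as well.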
I note also that many of the comments suggest that there should be no problem with a large allocation. These comments forget that (1) the relevant measure is not amount of RAM in the box but rather size of the largest contiguous user mode address space block, and that can easily be less than a hundred million bytes in a 32 bit process, (2) that the list data structure intentionally over-allocates, that (3) when the list gets full a copy of the underlying array must be allocated to copy the old list into the new list, which more than doubles the actual memory load of the list, temporarily, and that (4) a user who naively attempts to allocate one hundred-million-byte structure may well attempt to allocate a dozen of them throughout the program. You should always avoid such large allocations; if you have data structures that require large amounts of storage then put them on disk.