Determining how close an array is to the target array - c#

I'm running a little experiment to increase my knowledge, and I have reached a part that I feel could really be optimized, but I'm not quite sure how to do it.
I have many arrays of numbers. (For simplicity, let's say each array has 4 numbers: 1, 2, 3, and 4.)
The target is to have all of the numbers in ascending order (i.e., 1-2-3-4), but the numbers are all scrambled in the different arrays.
A higher weight is placed upon larger numbers.
I need to sort all of these arrays in order of how close they are to
the target.
I.e., 4-3-2-1 would be the worst possible case.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2 (even though two numbers match in 1-4-3-2 (the 1 and the 3), the biggest number is closer to its spot in 2-3-4-1).
So the big numbers always take precedence over the smaller numbers. Here is my attempt:
var tmp = from m in moves
          let mx = m.Max()
          let ranking = m.IndexOf(s => s == mx)
          orderby ranking descending
          select m;
return tmp.ToArray();
P.S. IndexOf in the above example is an extension I wrote that takes an array and an expression, and returns the index of the element that satisfies the expression. It is needed because the real situation is a little more complicated; I'm simplifying it with my example.
The problem with my attempt, though, is that it only sorts by the biggest number and ignores all of the other numbers. It SHOULD rank by the biggest number first, then by the second largest, then by the third.
Also, since it will be doing this operation over and over again for several minutes, it should be as efficient as possible.
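The ranking described reads like a lexicographic comparison: compare where each array's largest value sits, and fall back to the next largest only on a tie. A minimal sketch of such a comparer (an assumption, not the asker's actual code: values are taken to be exactly 1..N, and Array.IndexOf stands in for the custom IndexOf extension):

// Ranks two arrays by how far the largest value is from its target slot,
// breaking ties with the next largest value, and so on down.
static int CompareByLargestFirst(int[] a, int[] b)
{
    for (int value = a.Length; value >= 1; value--)
    {
        int targetPos = value - 1; // value v belongs at index v - 1
        int distA = Math.Abs(Array.IndexOf(a, value) - targetPos);
        int distB = Math.Abs(Array.IndexOf(b, value) - targetPos);
        if (distA != distB)
            return distA.CompareTo(distB); // closer to target sorts first
    }
    return 0;
}

// Hypothetical usage: Array.Sort(moves, CompareByLargestFirst);

On the sample cases this puts 3-4-2-1 ahead of 4-3-2-1 and 2-3-4-1 ahead of 1-4-3-2, matching the expected ordering.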

You could implement a bubble sort, and count the number of times you have to move data around. The number of data moves will be large on arrays that are far away from the sorted ideal.
int GetUnorderedness<T>(T[] data) where T : IComparable<T>
{
    data = (T[])data.Clone(); // don't modify the input data,
                              // we weren't asked to actually sort.
    int swapCount = 0;
    bool isSorted;
    do
    {
        isSorted = true;
        for (int i = 1; i < data.Length; i++)
        {
            if (data[i - 1].CompareTo(data[i]) > 0)
            {
                T temp = data[i];
                data[i] = data[i - 1];
                data[i - 1] = temp;
                swapCount++;
                isSorted = false;
            }
        }
    } while (!isSorted);
    return swapCount; // the original snippet was missing this return
}
From your sample data, this will give slightly different results than you specified.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2
3-4-2-1 will take 5 swaps to sort, 4-3-2-1 will take 6, so that works.
2-3-4-1 will take 3, 1-4-3-2 will also take 3, so this doesn't match up with your expected results.
This algorithm doesn't treat the largest number as the most important, which it seems you want; all numbers are treated equally. From your description, you'd consider 2-1-3-4 as much better than 1-2-4-3, because the first one has both the largest and second largest numbers in their proper place. This algorithm would consider those two equal, because each requires only 1 swap to sort the array.
This algorithm does have the advantage that it's not just a comparison algorithm, each input has a discrete output, so you only need to run the algorithm once for each input array.

I hope this helps:
var i = 0;
var temp = (from m in moves select m).ToArray();
do
{
    temp = (from m in temp
            orderby m[i] descending
            select m).ToArray();
} while (++i < moves[0].Length);


How do you do this in C# without using List?

I am new to C#. The following code was a solution I came up with to solve a challenge. I am unsure how to do this without using List, since my understanding is that you can't push to an array in C# because they are of fixed size.
Is my understanding of what I said so far correct?
Is there a way to do this that doesn't involve creating a new array every time I need to add to an array? If there is no other way, how would I create a new array when the size of the array is unknown before my loop begins?
Return a sorted array of all non-negative numbers less than the given n which are divisible by both 3 and 4. For n = 30, the output should be
threeAndFour(n) = [0, 12, 24].
int[] threeAndFour(int n) {
    List<int> l = new List<int>() { 0 };
    for (int i = 12; i < n; ++i)
        if (i % 12 == 0)
            l.Add(i);
    return l.ToArray();
}
EDIT: I have since refactored this code to be:
int[] threeAndFour(int n) {
    List<int> l = new List<int>() { 0 };
    for (int i = 12; i < n; i += 12)
        l.Add(i);
    return l.ToArray();
}
A. Lists are OK
If you want to use a for loop to find the numbers, then List is the appropriate data structure for collecting them as you discover them.
B. Use more maths
static int[] threeAndFour(int n) {
    // (n + 11) / 12 is the ceiling of n / 12; the original (n / 12) + 1
    // left a trailing zero whenever n was itself a multiple of 12.
    var a = new int[(n + 11) / 12];
    for (int i = 12; i < n; i += 12) a[i / 12] = i;
    return a;
}
C. Generator pattern with IEnumerable<int>
I know that this doesn't return an array, but it does avoid a list.
static IEnumerable<int> threeAndFour(int n) {
    yield return 0;
    for (int i = 12; i < n; i += 12)
        yield return i;
}
D. Twist and turn to avoid a list
The code could loop twice: first to figure out the size of the array, and then to fill it.
int[] threeAndFour(int n) {
    // Version: A list is really undesirable, arrays are great.
    int size = 1;
    for (int i = 12; i < n; i += 12)
        size++;
    var a = new int[size];
    a[0] = 0;
    int counter = 1;
    for (int i = 12; i < n; i += 12) a[counter++] = i;
    return a; // the original snippet was missing this return
}
if (i % 12 == 0)
So you have figured out that the numbers divisible by both 3 and 4 are precisely those divisible by 12.
Can you figure out how many such numbers there are below a given n? Can you do so without counting them one by one? If so, there is no need for a dynamically growing container; you can just initialize the container to the correct size.
Once you have your array, just keep track of the next index to fill, as in the sketch below.
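A minimal sketch of that counting idea (the (n + 11) / 12 expression is integer ceiling division, i.e. the number of multiples of 12 in [0, n)):

static int[] threeAndFour(int n) {
    var a = new int[(n + 11) / 12]; // ceil(n / 12) multiples of 12 below n, counting 0
    int next = 0;                   // next index to fill
    for (int i = 0; i < n; i += 12)
        a[next++] = i;
    return a;
}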
You could use LINQ and the Enumerable.Range method for this purpose. For example:
int[] threeAndFour(int n)
{
    return Enumerable.Range(0, n).Where(x => x % 12 == 0).ToArray();
}
Enumerable.Range generates a sequence of integral numbers within a specified range, which is then filtered on the condition (x%12==0) to retrieve the desired result.
Since you know this goes in steps of 12 and you know how many there are before you start, you can do:
// (n + 11) / 12 rather than n / 12 + 1, which would wrongly include n itself when n is an exact multiple of 12
Enumerable.Range(0, (n + 11) / 12).Select(x => x * 12).ToArray();
I am unsure how to do this without using List since my understanding is that you can't push to an array in C# since they are of fixed size.
It is correct that arrays cannot grow. List was invented as a wrapper around an array that automagically grows whenever needed. Note that you can give List an integer via the constructor, which tells it the minimum size it should expect. It will allocate at least that much the first time, which can limit growth-related overhead.
And dictionaries are just a variation of the list mechanics, with hash-table key lookup speed.
There is only one other collection I know of that can grow. However, it is rarely mentioned outside of theory and some very specific cases:
Linked lists. The linked list has unbeatable growth performance and the lowest risk of running into OutOfMemory exceptions due to fragmentation. Unfortunately, its random access times are the worst as a result. Unless you can process the collection strictly sequentially from the start (or sometimes the end), its performance will be abysmal. Only stacks and queues are likely to use it. There is, however, an implementation you could use in .NET: https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.linkedlist-1
Your code holds some potential too:
for (int i = 12; i < n; ++i)
    if (i % 12 == 0)
        l.Add(i);
It would be way more effective to count up by 12 every iteration; you are only interested in every 12th number, after all. You may have to change the loop, but I think a do...while would do. Also, the array/minimum List size is easily predicted: just divide n by 12 and add 1. But I assume that is mostly mock-up code and it is not actually that deterministic.
List generally works pretty well; as I understand your question, you have challenged yourself to solve the problem without using the List class. An array (or List) uses a contiguous block of memory to store elements. Arrays are of fixed size. List will dynamically expand to accept new elements, but still keeps everything in a single block of memory.
You can use a linked list https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.linkedlist-1?view=netframework-4.8 to simulate an array. A linked list allocates additional memory for each element (node) that is used to point to the next (and possibly the previous) one. This lets you add elements without large block allocations, but you pay a space cost (increased use of memory) for each element added. The other problem with linked lists is that you can't quickly access random elements: to get to element 5, you have to go through elements 0 through 4. There's a reason arrays and array-like structures are favored for many tasks, but it's always interesting to try to do common things in a different way, as sketched below.
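A minimal sketch of that linked-list version of the challenge (threeAndFourLinked is a hypothetical name; the final ToArray still performs one copy at the end):

// using System.Collections.Generic;
// using System.Linq;
static int[] threeAndFourLinked(int n)
{
    var nodes = new LinkedList<int>(); // grows one node at a time, no block reallocation
    for (int i = 0; i < n; i += 12)
        nodes.AddLast(i);              // O(1) append
    return nodes.ToArray();            // one copy into a fixed-size array at the end
}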

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:
// testSamples and trainSamples consist of about 20k vectors each, with 25 dimensions
// trainClasses contains 0 or 1, signifying the corresponding class for each sample in trainSamples
// (declared IList<int> here; the original IList<int[]> would not compile against the comparison below)
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);
    var testResults = new int[testSamples.Count];
    var testNumber = testSamples.Count;
    var trainNumber = trainSamples.Count;
    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }
    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });
        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);
        // Do a 'majority vote' to classify test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }
    return testResults;
}
// Calculates and returns square of Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid, i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this, while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without these, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind in reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here) which may be hitting you in the "Parallel.For" callback.
Similarly, there is a small performance penalty for indexing through an array using its IList interface. You might consider declaring the arguments to "GetDistance()" as explicit arrays.
How large is K compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by using a SortedSet and replacing the largest element when the set size exceeds K, as sketched below.
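For illustration, a sketch of that selection idea (AddCandidate is a hypothetical helper, not the asker's code; value tuples compare lexicographically, so ties in distance are broken by index and duplicate distances don't collide in the set):

// Keeps only the K nearest (distance, index) pairs seen so far, evicting the
// current worst once the set is full: O(n log K) overall instead of the
// O(n log n) full sort.
static void AddCandidate(SortedSet<(double Dist, int Index)> best, int k, double dist, int index)
{
    if (best.Count < k)
        best.Add((dist, index));
    else if (dist < best.Max.Dist)
    {
        best.Remove(best.Max); // drop the farthest of the current K
        best.Add((dist, index));
    }
}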

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like (cool way to get random selection from a list)
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those and, if the total is less, slowly shift the numbers up (to the next available number). If the total is more, I might shift the numbers down. I'm still not sure how to implement this, or if there is a better design readily available. Any help would be much appreciated.
OK, start with a little extension I got from this answer:
public static IEnumerable<IEnumerable<T>> Combinations<T>(
    this IEnumerable<T> source,
    int k)
{
    if (k == 0)
    {
        return new[] { Enumerable.Empty<T>() };
    }
    return source.SelectMany((e, i) =>
        source.Skip(i + 1).Combinations(k - 1)
            .Select(c => (new[] { e }).Concat(c)));
}
This gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
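For instance, a hypothetical quick check of what it yields:

// All 2-element combinations of { 1m, 2m, 3m }:
// { 1, 2 }, { 1, 3 }, { 2, 3 }: no repetition, source order preserved.
var pairs = new[] { 1m, 2m, 3m }.Combinations(2);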
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
    IEnumerable<decimal> allValues,
    int numberToGet,
    decimal targetValue,
    CancellationToken canceller)
{
    IList<decimal> currentBest = null;
    var currentBestGap = decimal.MaxValue;
    var locker = new object();
    allValues.Combinations(numberToGet)
        .AsParallel()
        .WithCancellation(canceller)
        .TakeWhile(c => currentBestGap != decimal.Zero)
        .ForAll(c =>
        {
            var gap = Math.Abs(c.Sum() - targetValue);
            if (gap < currentBestGap)
            {
                lock (locker)
                {
                    if (gap < currentBestGap) // re-check; another thread may have won the race
                    {
                        currentBestGap = gap;
                        currentBest = c.ToList();
                    }
                }
            }
        }); // the original snippet was missing this closing parenthesis
    return currentBest;
}
I have an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit. The benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
Say there are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
1. Iterate over all the matching subsets.
2. Select one.
Step 1 is potentially a big task; you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using the clever streaming random selection algorithm sketched below.
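That streaming selection is usually called reservoir sampling; a minimal sketch with a reservoir of size one (the names here are assumptions):

// Scans the solutions once, replacing the current pick with probability 1/seen;
// every solution ends up chosen with equal probability overall.
static List<decimal> PickUniformly(IEnumerable<List<decimal>> solutions, Random rng)
{
    List<decimal> chosen = null;
    int seen = 0;
    foreach (var s in solutions)
    {
        seen++;
        if (rng.Next(seen) == 0) // true with probability 1/seen
            chosen = s;
    }
    return chosen;
}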
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of a group without visiting its members one by one; then you can choose which group to use, weighted by its size, and you've reduced the problem to selecting one member of that group at random.
If you don't need to select with uniform probability, then the problem might become easier. In the best case, if you don't care about the distribution at all, you can return the first subset-sum solution you find; whether you'd call that "at random" is another matter...

C# - Help optimizing a loop

I have a piece of code that in principle looks like the below. The issue is that I am triggering this code tens of thousands of times and need it to be more optimized. Any suggestions would be welcome.
// This array is in reality enormous, and this lookup is triggered loads of times in my code
int[] someArray = { 1, 631, 632, 800, 801, 1600, 1601, 2211, 2212, 2601, 2602 };
// I need to know where in the array a certain value is located.
// 806 is located between entries 801 and 1600, so I want the array index of 801 to be returned (4).
int id = 806;
// Since my arrays are very large, this operation takes far too long
for (int i = 0; i < someArray.Length; i++)
{
    if (someArray[i] <= id)
        return i;
}
Edit: Sorry, I got the condition wrong. It should return the index when 806 is greater than 801. Hope you can make sense of it.
The array values look sorted. If that’s indeed the case, use binary search:
int result = Array.BinarySearch(someArray, id);
return result < 0 ? (~result - 1) : result;
If the searched value does not appear in the array, Array.BinarySearch will return the bitwise complement of the next greater value’s index. This is why I am testing for negative numbers and using the bitwise complement operator in the code above. The result should then be the same as in your code.
Binary search has logarithmic running time instead of linear. That is, in the worst case only about log2 n entries have to be examined instead of n (where n is the array's size).
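A quick sanity check against the question's sample data (a hypothetical test):

int[] someArray = { 1, 631, 632, 800, 801, 1600, 1601, 2211, 2212, 2601, 2602 };
int result = Array.BinarySearch(someArray, 806);
// 806 is absent, so result is the bitwise complement of index 5 (where 1600 sits):
// result == ~5 == -6, and ~result - 1 == 4, the index of 801, as the question expects.
Console.WriteLine(result < 0 ? (~result - 1) : result); // prints 4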
Provided someArray's content is sorted, use binary search; see also Array.BinarySearch.
Note: In your example the condition in if (someArray[i] <= id) return i; will trigger whenever id >= 1. I doubt that's what you want to do.

Series calculation

I have some random integers like
99 20 30 1 100 400 5 10
I have to find a sum from any combination of these integers that is closest (equal or more, but not less) to a given number like
183
What is the fastest and most accurate way of doing this?
If your numbers are small, you can use a simple dynamic programming (DP) technique. Don't let the name scare you; the technique is fairly understandable. Basically, you break the larger problem into subproblems.
Here we define the problem as can[number]: if number can be constructed from the integers in your input, then can[number] is true, otherwise it is false. It is obvious that 0 is constructable by not using any numbers at all, so can[0] is true. Now you try to use every number from the input. We try to see if the sum j is achievable: if (an already achieved sum) + (the current number we try) == j, then j is clearly achievable. If you want to keep track of which numbers made a particular sum, use an additional prev array, which stores the last number used to make the sum. See the code below for an implementation of this idea:
// UPPER_BOUND is the largest sum you can construct: the total of all the numbers
int UPPER_BOUND = numbers.Sum();
var can = new bool[UPPER_BOUND + 1]; // can[s] is true if sum s can be constructed
can[0] = true;                       // 0 is always achievable by not using any number
var prev = new int[UPPER_BOUND + 1]; // prev[s] is the last number used to achieve sum s
for (int i = 0; i < numbers.Length; i++) // try to use every number (numbers[i]) from the input
{
    for (int j = UPPER_BOUND; j >= 1; j--) // try to see if j is an achievable sum
    {
        if (can[j]) continue; // already an achieved sum, so go to the next j
        if (j - numbers[i] >= 0 && can[j - numbers[i]]) // (already achievable sum) + numbers[i] == j
        {
            can[j] = true;
            prev[j] = numbers[i]; // to achieve j we used numbers[i]
        }
    }
}
int CLOSEST_SUM = -1;
for (int i = SUM; i <= UPPER_BOUND; i++) // SUM is the given target (183 in the example)
    if (can[i])
    {
        // the closest achievable number to SUM (larger than or equal to SUM) is i
        CLOSEST_SUM = i;
        break;
    }
int currentSum = CLOSEST_SUM;
do
{
    int usedNumber = prev[currentSum];
    Console.WriteLine(usedNumber);
    currentSum -= usedNumber;
} while (currentSum > 0);
This seems to be a Knapsack-like problem, where the value of your integers would be the "weight" of each item, the "profit" of each item is 1, and you are looking for the least number of items to exactly sum to the maximum allowable weight of the knapsack.
This is a variant of the SUBSET-SUM problem, and is also NP-Hard like SUBSET-SUM.
But if the numbers involved are small, pseudo-polynomial time algorithms exist. Check out:
http://en.wikipedia.org/wiki/Subset_sum_problem
OK, more details.
The following problem is NP-Hard:
Given an array of integers and integers a, b, is there some subset whose sum lies in the interval [a, b]?
This is so because we can solve subset-sum by choosing a=b=0.
Now this problem easily reduces to your problem and so your problem is NP-Hard too.
Now you can use the polynomial time approximation algorithm mentioned in the wiki link above.
Given an array of N integers, a target S and an approximation threshold c, there is a polynomial time approximation algorithm (with running time polynomial in 1/c) which tells whether there is a subset sum in the interval [(1-c)S, S].
You can use this repeatedly (by some form of binary search) to find the best approximation to S you need. Note you can also use this on intervals of the form [S, (1+c)S], while the knapsack will only give you a solution <= S.
Of course there might be better algorithms, in fact I can bet on it. There should be plenty of literature on the web. Some search terms you can use: approximation algorithms for subset-sum, pseudo-polynomial time algorithms, dynamic programming algorithm etc.
A simple brute-force method would be to read the text in, parse it into numbers, and then go through all combinations until you find the required sum.
A quicker solution would be to sort the numbers, then...
Add the largest number to your sum. Is it too big? If so, take it off and try the next smallest.
If the sum is too small, add the next largest number and repeat.
Continue adding numbers, not letting the sum exceed the target. Finish when you hit the target.
Note that when you backtrack, you may need to backtrack more than one level. Sounds like a good case for recursion, as sketched below...
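A compact sketch of that recursion (FindClosest is a hypothetical signature; a branch stops descending as soon as it meets the target, since adding more can only overshoot further):

// Returns the smallest achievable sum >= target, or -1 if none exists.
// Each element is either taken or skipped, which enumerates all subsets;
// branches that already reach the target are not extended.
static int FindClosest(int[] nums, int index, int current, int target, int best)
{
    if (current >= target)
        return (best == -1 || current < best) ? current : best;
    if (index == nums.Length)
        return best;
    best = FindClosest(nums, index + 1, current + nums[index], target, best); // take nums[index]
    return FindClosest(nums, index + 1, current, target, best);               // skip it
}

// Hypothetical check: FindClosest(new[] { 99, 20, 30, 1, 100, 400, 5, 10 }, 0, 0, 183, -1) returns 199.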
If the numbers are large you can turn this into an Integer Programme. Using Mathematica's solver, it might look something like this:
nums = {99, 20, 30, 1, 100, 400, 5, 10};
vars = a /@ Range@Length@nums;
Minimize[(vars.nums - 183)^2, vars, Integers]
You can sort the list of values, find the first value that's greater than the target, and start concentrating on the values that are less than the target. Find the sum that's closest to the target without going over, then compare that to the first value greater than the target. If the difference between the closest sum and the target is less than the difference between the first value greater than the target and the target, then you have the sum that's closest.
Kinda hokey, but I think the logic hangs together.
