Better algorithm for a date comparison task - C#

I would like some help making this comparison faster (sample below). The sample walks a comparison variable hour by hour and checks whether the array contains a value for that hour. If there is no matching value, it adds a placeholder to a second array (which is concatenated with the first later).
if (ticks.TypeOf == Period.Hour)
    while (compareAt <= endAt)
    {
        if (range.Where(d => d.time.AddMinutes(-d.time.Minute) == compareAt).Count() < 1)
            gaps.Add(new SomeValue() { ...some dummy values.. });
        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
This execution is too expensive as soon as we get to hourly resolution: there are at most 365 * 24 = 8760 values in the array. In the future there will also be minutes/seconds per month (60 * 24 * 31 = 44640 values), which makes this approach unusable.
If the array were usually complete (meaning no gaps/empty slots), the check could easily be bypassed with if (range.Count() == (hours per day * days)). Unfortunately, that will rarely be the case.
How can I solve this more efficiently?
One idea: if there are 7800 values in the array, about 950 are missing, right? Could I find just the endpoints of the gaps and create only the missing values? That would make the complexity depend on the number of gaps rather than the number of values.
A more efficient loop would also be a welcome answer.
[Edit]
Sorry for my bad English; I have tried my best to describe the problem.

Your performance is low because the range lookup is not using any indexing and rechecks the entire range every time.
One way to do this a lot quicker:
if (ticks.TypeOf == Period.Hour)
{
    // fill a HashSet with the range's unique hourly values
    var rangehs = new HashSet<DateTime>();
    foreach (var r in range)
    {
        rangehs.Add(r.time.AddMinutes(-r.time.Minute));
    }

    // walk all the hours
    while (compareAt <= endAt)
    {
        // quickly check if it's a gap
        if (!rangehs.Contains(compareAt))
            gaps.Add(new SomeValue() { ...some dummy values.. });

        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
}

Related

Find the closest previous DateTime from a list of DateTime

Is there a faster way of obtaining the closest previous (past) DateTime from a list of DateTimes when compared to a specific time? (the list comes from a SQL database)
public DateTime? GetClosestPreviousDateTime(List<DateTime> dateTimes, DateTime specificTime)
{
    // returns null when no date in the list precedes specificTime
    DateTime? ret = null;
    var lowestDifference = TimeSpan.MaxValue;
    foreach (var date in dateTimes)
    {
        if (date >= specificTime)
            continue;
        var difference = specificTime - date;
        if (difference < lowestDifference)
        {
            lowestDifference = difference;
            ret = date;
        }
    }
    return ret;
}
The source list will be sorted since the dates in the list come from a SQL database where they are written consecutively.
It depends what you mean by "faster". The algorithm you show is O(N), so if by "faster" you mean a way to avoid iterating over all the dates, then no, in general you won't do better than that.
But if you mean shaving off a few microseconds with code that doesn't emit quite as many op codes, then yes, of course. But is that really the issue here?
The answer will also change based on the size of the list, how accurate you need the answer to be, whether we can make any assumptions on the data (e.g. is it already sorted).
// note: this only works if specificTime itself is present in the list
dateTimes.Sort();
var closest = dateTimes[dateTimes.IndexOf(specificTime) - 1];
Your problem is a classic search problem, and binary search might suit you.
Sort the list: dateTimes.Sort();
Then apply the binary search algorithm with logic similar to that in your foreach statement.
dateTimes.Where(x => x < specificTime).Max()
or if you want to handle the case where none exist:
dateTimes.Where(x => x < specificTime).DefaultIfEmpty().Max()
Later edit: you have now introduced new information that the List<> is already sorted. That was not in the question before.
With a sorted List<>, your algorithm is wasteful, since it keeps iterating even after it has passed the threshold specificTime. You can use BinarySearch instead (assuming the List<> is sorted in ascending order and contains no duplicates):
static DateTime GetClosestPreviousDateTime(List<DateTime> dateTimes, DateTime specificTime)
{
    var search = dateTimes.BinarySearch(specificTime);
    var index = (search < 0 ? ~search : search) - 1;
    if (index == -1)
        throw new InvalidOperationException("Not found");
    return dateTimes[index];
}
If you want to do it faster, just ask the database for the value; it knows how to find the answer quickly. Do not fetch the entire List<> into memory first. Use SQL or LINQ against the database.
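For example, a minimal LINQ-to-database sketch of that idea (the db context, Events table and Timestamp column are assumptions, not from the question):
// roughly equivalent to: SELECT TOP 1 Timestamp FROM Events WHERE Timestamp < @specificTime ORDER BY Timestamp DESC
var closestPrevious = db.Events
    .Where(e => e.Timestamp < specificTime)
    .OrderByDescending(e => e.Timestamp)
    .Select(e => e.Timestamp)
    .FirstOrDefault();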

Subset Sum algorithm efficiency

We have a number of payments (Transaction) that come into our business each day. Each Transaction has an ID and an Amount. We have the requirement to match a number of these transactions to a specific amount. Example:
Transaction Amount
1 100
2 200
3 300
4 400
5 500
If we wanted to find the transactions that add up to 600 you would have a number of sets (1,2,3),(2,4),(1,5).
I found an algorithm, which I have adapted, that works as shown below. For 30 transactions it takes 15 ms. But the number of transactions averages around 740 and has a maximum close to 6000. Is there a more efficient way to perform this search?
sum_up(TransactionList, remittanceValue, ref MatchedLists);

private static void sum_up(List<Transaction> transactions, decimal target, ref List<List<Transaction>> matchedLists)
{
    sum_up_recursive(transactions, target, new List<Transaction>(), ref matchedLists);
}

private static void sum_up_recursive(List<Transaction> transactions, decimal target, List<Transaction> partial, ref List<List<Transaction>> matchedLists)
{
    decimal s = 0;
    foreach (Transaction x in partial) s += x.Amount;

    if (s == target)
    {
        matchedLists.Add(partial);
    }
    if (s > target)
        return;

    for (int i = 0; i < transactions.Count; i++)
    {
        List<Transaction> remaining = new List<Transaction>();
        Transaction n = new Transaction(0, transactions[i].ID, transactions[i].Amount);
        for (int j = i + 1; j < transactions.Count; j++) remaining.Add(transactions[j]);

        List<Transaction> partial_rec = new List<Transaction>(partial);
        partial_rec.Add(new Transaction(n.MatchNumber, n.ID, n.Amount));
        sum_up_recursive(remaining, target, partial_rec, ref matchedLists);
    }
}
With Transaction defined as:
class Transaction
{
    public int ID;
    public decimal Amount;
    public int MatchNumber;

    public Transaction(int matchNumber, int id, decimal amount)
    {
        ID = id;
        Amount = amount;
        MatchNumber = matchNumber;
    }
}
As already mentioned, your problem can be solved by a pseudo-polynomial algorithm in O(n*G), with n the number of items and G your target sum.
The first part of the question is: can the target sum G be reached at all? The following Python code solves it (I have no C# on my machine):
def subsum(values, target):
    reached = [False] * (target + 1)  # initially no sums are reachable
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for val in values:
        for s in reversed(range(target + 1)):  # for target, target-1, ..., 0
            # if the sub sum s can be reached, we can add the current value to it and reach a new sum
            if reached[s] and s + val <= target:
                reached[s + val] = True
    return reached[target]
What is the idea? Let's consider values [1,2,3,6] and target sum 7:
We start with an empty set - the possible sum is obviously 0.
Now we look at the first element, 1, and have two options: take it or not. That leaves us with possible sums {0,1}.
Looking at the next element, 2, leads to possible sums {0,1} (not taking it) + {2,3} (taking it).
So far this is not much different from your approach, but for element 3 we get the possible sums a) {0,1,2,3} for not taking it and b) {3,4,5,6} for taking it, resulting in {0,1,2,3,4,5,6}. The difference from your approach is that there are two ways to reach 3, and your recursion would be started twice from there (which is not needed). Calculating basically the same stuff over and over again is the problem with your approach, and it is why the proposed algorithm is better.
As the last step we consider 6 and get {0,1,2,3,4,5,6,7} as possible sums.
But you also need the subset that leads to the target sum; for this we just remember which element was taken to achieve each sub sum. This version returns a subset that adds up to the target, or None otherwise:
def subsum(values, target):
    reached = [False] * (target + 1)
    val_ids = [-1] * (target + 1)
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for (val_id, val) in enumerate(values):
        for s in reversed(range(target + 1)):  # for target, target-1, ..., 0
            # record only the first way each sum is reached, so the
            # reconstruction below never reuses the same element twice
            if reached[s] and s + val <= target and not reached[s + val]:
                reached[s + val] = True
                val_ids[s + val] = val_id

    # reconstruct the subset for target:
    if not reached[target]:
        return None  # not possible
    else:
        result = []
        current = target
        while current != 0:  # search backwards, jumping from predecessor to predecessor
            val_id = val_ids[current]
            result.append(val_id)
            current -= values[val_id]
        return result
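Since the question is in C#, here is a minimal sketch of the same dynamic program translated to C#. The Transaction type is the one from the question; the assumption that amounts can be converted to non-negative integers (e.g. by scaling to cents) is mine:
static List<Transaction> FindOneSubset(List<Transaction> transactions, int target)
{
    var reached = new bool[target + 1];
    var lastItem = new int[target + 1]; // index of the transaction that first reached each sum
    for (int s = 0; s <= target; s++) lastItem[s] = -1;
    reached[0] = true; // with 0 elements we can only achieve sum = 0

    for (int i = 0; i < transactions.Count; i++)
    {
        int amount = (int)transactions[i].Amount; // assumes integral amounts
        for (int s = target - amount; s >= 0; s--)
        {
            if (reached[s] && !reached[s + amount])
            {
                reached[s + amount] = true;
                lastItem[s + amount] = i;
            }
        }
    }

    if (!reached[target])
        return null; // no subset adds up to the target

    // walk backwards from the target to reconstruct one subset
    var result = new List<Transaction>();
    int current = target;
    while (current != 0)
    {
        var t = transactions[lastItem[current]];
        result.Add(t);
        current -= (int)t.Amount;
    }
    return result;
}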
As another approach, you could use memoization to speed up your current solution, remembering for each state (subsum, number_of_elements_not_considered) whether the target sum can still be achieved. But I would say the standard dynamic programming approach is a less error-prone option here.
Yes.
I can't provide full code at the moment, but instead of iterating over the lists of transactions again and again until matches are found (quadratic behaviour), try this concept:
set up a hash table whose keys are the individual transaction amounts as well as the sums of each pair of transactions, assuming each target value is made up of at most two transactions (e.g. weekend credit card processing).
for each target total, look it up in the hash table - the sets of transactions stored in that slot are the matching transactions.
Once the table is built, each lookup is effectively constant time instead of a quadratic scan, which would make a noticeable difference in speed.
Good luck!
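A minimal sketch of that idea, assuming at most two transactions per target (the dictionary shape and names are mine, not the answerer's):
// Precompute singles and pair sums once, then answer each target with a dictionary lookup.
static Dictionary<decimal, List<List<Transaction>>> BuildSumIndex(List<Transaction> transactions)
{
    var index = new Dictionary<decimal, List<List<Transaction>>>();

    void Add(decimal sum, List<Transaction> set)
    {
        if (!index.TryGetValue(sum, out var sets))
            index[sum] = sets = new List<List<Transaction>>();
        sets.Add(set);
    }

    for (int i = 0; i < transactions.Count; i++)
    {
        Add(transactions[i].Amount, new List<Transaction> { transactions[i] });
        for (int j = i + 1; j < transactions.Count; j++)
            Add(transactions[i].Amount + transactions[j].Amount,
                new List<Transaction> { transactions[i], transactions[j] });
    }
    return index;
}
// Usage: var index = BuildSumIndex(transactions);
//        var matches = index.TryGetValue(600m, out var sets) ? sets : new List<List<Transaction>>();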
Dynamic programming can solve this problem efficiently:
Assume you have n transactions and the maximum target amount is m.
We can solve it in O(n*m) time.
Read up on the Knapsack problem.
For this problem, define dp[i][sum] as the number of subsets of the first i transactions that add up to sum.
The recurrence:
for i = 1 to n:
    dp[i][sum] = dp[i - 1][sum] + dp[i - 1][sum - amount_i]
dp[n][sum] is the number you need, and you will need some extra bookkeeping to recover what the subsets actually are.
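A minimal C# sketch of that recurrence, using a 1-D rolling array instead of the full dp[i][sum] table (integral, non-negative amounts are an assumption; scale decimals to cents first if needed):
// Counts the subsets that add up to target.
static long CountSubsets(int[] amounts, int target)
{
    // dp[s] = number of subsets of the items processed so far that sum to s
    var dp = new long[target + 1];
    dp[0] = 1; // the empty subset
    foreach (var amount in amounts)
        for (int s = target; s >= amount; s--)
            dp[s] += dp[s - amount]; // dp[i][s] = dp[i-1][s] + dp[i-1][s - amount_i], rolled into one array
    return dp[target];
}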
You have a couple of practical assumptions here that would make brute force with smartish branch pruning feasible:
items are unique, hence you wouldn't get a combinatorial blow-up of valid subsets (e.g. (1,1,1,1,1,1,1,1,1,1,1,1,1) adding up to 3)
if the number of resulting feasible sets is still huge, you would run out of memory collecting them before running into total runtime issues.
ordering the input ascending allows an easy early-stop check - if your remaining sum is smaller than the current element, then none of the yet unexamined items could possibly be in a result (as the current and subsequent items only get bigger)
keeping running sums speeds up each step, as you wouldn't be recalculating them over and over again
Here's a bit of code:
public static List<T[]> SubsetSums<T>(T[] items, int target, Func<T, int> amountGetter)
{
    Stack<T> unusedItems = new Stack<T>(items.OrderByDescending(amountGetter));
    Stack<T> usedItems = new Stack<T>();
    List<T[]> results = new List<T[]>();
    SubsetSumsRec(unusedItems, usedItems, target, results, amountGetter);
    return results;
}

public static void SubsetSumsRec<T>(Stack<T> unusedItems, Stack<T> usedItems, int targetSum, List<T[]> results, Func<T, int> amountGetter)
{
    if (targetSum == 0)
        results.Add(usedItems.ToArray());
    if (targetSum < 0 || unusedItems.Count == 0)
        return;

    var item = unusedItems.Pop();
    int currentAmount = amountGetter(item);
    if (targetSum >= currentAmount)
    {
        // case 1: use current element
        usedItems.Push(item);
        SubsetSumsRec(unusedItems, usedItems, targetSum - currentAmount, results, amountGetter);
        usedItems.Pop();
        // case 2: skip current element
        SubsetSumsRec(unusedItems, usedItems, targetSum, results, amountGetter);
    }
    unusedItems.Push(item);
}
I've run it against a 100k-element input that yields around 1k results in under 25 ms, so it should be able to handle your 740 case with ease.

How to best implement K-nearest neighbours in C# for large number of dimensions?

I'm implementing the K-nearest neighbours classification algorithm in C# for a training and testing set of about 20,000 samples each, and 25 dimensions.
There are only two classes, represented by '0' and '1' in my implementation. For now, I have the following simple implementation:
// testSamples and trainSamples consists of about 20k vectors each with 25 dimensions
// trainClasses contains 0 or 1 signifying the corresponding class for each sample in trainSamples
static int[] TestKnnCase(IList<double[]> trainSamples, IList<double[]> testSamples, IList<int> trainClasses, int K)
{
    Console.WriteLine("Performing KNN with K = " + K);

    var testResults = new int[testSamples.Count()];
    var testNumber = testSamples.Count();
    var trainNumber = trainSamples.Count();

    // Declaring these here so that I don't have to 'new' them over and over again in the main loop,
    // just to save some overhead
    var distances = new double[trainNumber][];
    for (var i = 0; i < trainNumber; i++)
    {
        distances[i] = new double[2]; // Will store both distance and index in here
    }

    // Performing KNN ...
    for (var tst = 0; tst < testNumber; tst++)
    {
        // For every test sample, calculate distance from every training sample
        Parallel.For(0, trainNumber, trn =>
        {
            var dist = GetDistance(testSamples[tst], trainSamples[trn]);
            // Storing distance as well as index
            distances[trn][0] = dist;
            distances[trn][1] = trn;
        });

        // Sort distances and take top K (?What happens in case of multiple points at the same distance?)
        var votingDistances = distances.AsParallel().OrderBy(t => t[0]).Take(K);

        // Do a 'majority vote' to classify the test sample
        var yea = 0.0;
        var nay = 0.0;
        foreach (var voter in votingDistances)
        {
            if (trainClasses[(int)voter[1]] == 1)
                yea++;
            else
                nay++;
        }
        if (yea > nay)
            testResults[tst] = 1;
        else
            testResults[tst] = 0;
    }
    return testResults;
}
// Calculates and returns the square of the Euclidean distance between two vectors
static double GetDistance(IList<double> sample1, IList<double> sample2)
{
    var distance = 0.0;
    // assume sample1 and sample2 are valid, i.e. same length
    for (var i = 0; i < sample1.Count; i++)
    {
        var temp = sample1[i] - sample2[i];
        distance += temp * temp;
    }
    return distance;
}
This takes quite a bit of time to execute. On my system it takes about 80 seconds to complete. How can I optimize this while ensuring that it will also scale to a larger number of data samples? As you can see, I've tried using PLINQ and parallel for loops, which did help (without them, it was taking about 120 seconds). What else can I do?
I've read about KD-trees being efficient for KNN in general, but every source I read stated that they're not efficient for higher dimensions.
I also found this stackoverflow discussion about this, but it seems like this is 3 years old, and I was hoping that someone would know about better solutions to this problem by now.
I've looked at machine learning libraries in C#, but for various reasons I don't want to call R or C code from my C# program, and some other libraries I saw were no more efficient than the code I've written. Now I'm just trying to figure out how I could write the most optimized code for this myself.
Edited to add - I cannot reduce the number of dimensions using PCA or something. For this particular model, 25 dimensions are required.
Whenever you are attempting to improve the performance of code, the first step is to analyze the current performance to see exactly where it is spending its time. A good profiler is crucial for this. In my previous job I was able to use the dotTrace profiler to good effect; Visual Studio also has a built-in profiler. A good profiler will tell you exactly where your code is spending time, method by method or even line by line.
That being said, a few things come to mind on reading your implementation:
You are parallelizing some inner loops. Could you parallelize the outer loop instead? There is a small but nonzero cost associated with a delegate call (see here or here), which may be hitting you in the Parallel.For callback.
Similarly, there is a small performance penalty for indexing through an array via its IList interface. You might consider declaring the array arguments to GetDistance() explicitly as double[].
How large is K compared to the size of the training array? You are completely sorting the "distances" array and taking the top K, but if K is much smaller than the array size it might make sense to use a partial sort / selection algorithm, for instance by keeping the K best candidates in a SortedSet and removing the largest element whenever the set size exceeds K.
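A rough sketch of that bounded selection, keeping only the K smallest (distance, index) pairs instead of sorting everything (the tuple shape and method name are mine, not from the question):
// Keep only the K nearest (distance, trainIndex) pairs; avoids a full O(n log n) sort per test sample.
static List<(double Distance, int Index)> KSmallest(double[] distances, int k)
{
    // the index is part of the key so equal distances don't collapse into one entry
    var best = new SortedSet<(double Distance, int Index)>();
    for (int i = 0; i < distances.Length; i++)
    {
        best.Add((distances[i], i));
        if (best.Count > k)
            best.Remove(best.Max); // drop the current worst of the kept candidates
    }
    return best.ToList(); // ascending by distance
}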

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool, something like this (a cool way to get a random selection from a list):
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those and, if the total is less, slowly shift the numbers up (to the next available number). If the total is more, I might shift the numbers down. I'm still not sure how to implement this or whether there is a better design readily available. Any help would be much appreciated.
Ok, start with a little extension I got from this answer,
public static IEnumerable<IEnumerable<T>> Combinations<T>(
    this IEnumerable<T> source,
    int k)
{
    if (k == 0)
    {
        return new[] { Enumerable.Empty<T>() };
    }
    return source.SelectMany((e, i) =>
        source.Skip(i + 1).Combinations(k - 1)
              .Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
    IEnumerable<decimal> allValues,
    int numberToGet,
    decimal targetValue,
    CancellationToken canceller)
{
    IList<decimal> currentBest = null;
    var currentBestGap = decimal.MaxValue;
    var locker = new object();

    allValues.Combinations(numberToGet)
        .AsParallel()
        .WithCancellation(canceller)
        .TakeWhile(c => currentBestGap != decimal.Zero)
        .ForAll(c =>
        {
            var gap = Math.Abs(c.Sum() - targetValue);
            if (gap < currentBestGap)
            {
                lock (locker)
                {
                    // re-check inside the lock: another thread may have found a better gap meanwhile
                    if (gap < currentBestGap)
                    {
                        currentBestGap = gap;
                        currentBest = c.ToList();
                    }
                }
            }
        });

    return currentBest;
}
I had an idea that you could sort the initial list and stop iterating the combinations at the point where the sum must exceed the target. On reflection, it's not trivial to identify that point, and the cost of checking it may exceed the benefit. That benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible, but I also think that this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
iterate over all the matching subsets
select one
Step 1 is potentially a big task; you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one, or (if that might take too much memory) by using a streaming random-selection algorithm such as reservoir sampling.
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...
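For step 2, the streaming selection mentioned above is usually called reservoir sampling; here is a minimal sketch, assuming step 1 yields the matching subsets one at a time (the method name and types are mine):
// Pick one element uniformly at random from a stream without materializing the whole stream.
static List<decimal> PickUniformly(IEnumerable<List<decimal>> solutions, Random rng)
{
    List<decimal> chosen = null;
    int seen = 0;
    foreach (var solution in solutions)
    {
        seen++;
        // keep the i-th item with probability 1/i; each item ends up chosen with probability 1/N overall
        if (rng.Next(seen) == 0)
            chosen = solution;
    }
    return chosen; // null if the stream was empty
}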

Fastest way to check a List<T> for a date

I have a list of dates that a machine has worked on, but it doesn't include the dates the machine was down. I need to create a list of days worked and not worked. I am not sure of the best way to do this. I have started by stepping through all the days in a range and checking whether each date is in the list by iterating through the entire list every time. I am looking for a more efficient means of finding the dates.
class machineday
{
    public DateTime WorkingDay;
}

class machinedaycollection : List<machineday>
{
    public List<machineday> GetAllByCat(string cat)
    {
        _CategoryCode = cat; // field defined elsewhere in the real class
        List<machineday> li = this.FindAll(delegate(machineday d) { return true; });
        li.Sort(sortDate);
        return li;
    }

    int sortDate(machineday event1, machineday event2)
    {
        // ascending by date
        int returnValue = -1;
        if (event1.WorkingDay > event2.WorkingDay)
        {
            returnValue = 1;
        }
        else if (event1.WorkingDay == event2.WorkingDay)
        {
            returnValue = 0;
        }
        return returnValue;
    }
}
Sort the dates, then iterate over the resulting list while incrementing a date counter in parallel. Whenever the counter does not match the current list element, you've found a date missing from the list.
List<DateTime> days = ...;
days.Sort();
DateTime dt = days[0].Date;
for (int i = 0; i < days.Count; dt = dt.AddDays(1))
{
    if (dt == days[i].Date)
    {
        Console.WriteLine("Worked: {0}", dt);
        i++;
    }
    else
    {
        Console.WriteLine("Not Worked: {0}", dt);
    }
}
(This assumes there are no duplicate days in the list.)
Build a list of valid dates and subtract your machine day collection from it using LINQ's Enumerable.Except extension method. Something like this:
IEnumerable<DateTime> dates = get_candidate_dates();
var holidays = dates.Except(machinedays.Select(m => m.WorkingDay));
The get_candidate_dates() method could even be an iterator that generates all dates within a range on the fly, rather than a pre-stored list of all dates.
Enumerable's methods are reasonably smart and will usually do a decent job on the performance side of things, but if you want the fastest possible algorithm, it will depend on how you plan to consume the result.
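For instance, a minimal sketch of such an iterator (the name matches the placeholder above; the range parameters are my assumption):
// Lazily generates every calendar day in the range instead of pre-storing a list of all dates.
static IEnumerable<DateTime> get_candidate_dates(DateTime start, DateTime end)
{
    for (var day = start.Date; day <= end.Date; day = day.AddDays(1))
        yield return day;
}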
Sorry dudes, but I don't much like your solutions.
I think you should build a hash table of your dates. You can do this by iterating over the working days only once.
Then you iterate over the full range of days and, for each one, ask the hash table whether the date is in it or not, using
myHashTable.ContainsKey(day); // this is efficient
Simple, elegant and fast.
I think your current solution takes quadratic time, while this one is linear (which is actually a good thing).
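In C# that would typically be a HashSet<DateTime> rather than a Hashtable; a minimal sketch (workedDays, startDate and endDate stand in for the question's data and are my names):
// One pass to build the set, then an O(1) membership test per day in the range.
var workedSet = new HashSet<DateTime>(workedDays.Select(d => d.Date));
for (var day = startDate.Date; day <= endDate.Date; day = day.AddDays(1))
{
    bool worked = workedSet.Contains(day);
    Console.WriteLine("{0:d} {1}", day, worked ? "Worked" : "Not worked");
}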
Assuming the list is sorted and the machine was "working" most of the time, you may be able to avoid iterating through all the dates by checking the dates in chunks and skipping chunks that contain no gaps. Something like this (you'll need to clean it up):
int chunksize = 60; // adjust depending on data
machineday currentDay = myMachinedaycollection[0];
for (int i = 0; i + chunksize < myMachinedaycollection.Count; i += chunksize)
{
    // if the chunk spans more calendar days than it has entries, it contains at least one gap
    if (currentDay.WorkingDay.AddDays(chunksize) != myMachinedaycollection[i + chunksize].WorkingDay)
    {
        // write code to iterate through the current chunk and collect the non-working days
    }
    currentDay = myMachinedaycollection[i + chunksize];
}
I doubt you want a list of days working and not working.
Your question title suggests that you want to know whether the system was up on a particular date. It also seems reasonable to calculate % uptime. Neither of these requires building a list of all time instants in the interval.
Sort the service times. For the first question, do a BinarySearch for the date you care about and check whether the preceding entry was the system being taken offline for maintenance or put back into service. For % uptime, take the (down for maintenance, service restored) entries pair-wise, use subtraction to find the duration of each maintenance window, and add those up. Then use subtraction to find the length of the total interval.
If your question didn't actually mean you were keeping track of maintenance intervals (or equivalently usage intervals) then you can ignore this answer.
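A minimal sketch of that % uptime calculation, assuming the log has already been reduced to sorted (downAt, upAt) pairs (that data shape is my assumption, not the question's):
// Fraction of [intervalStart, intervalEnd] during which the machine was up, given its maintenance windows.
static double UptimeFraction(List<(DateTime DownAt, DateTime UpAt)> maintenance,
                             DateTime intervalStart, DateTime intervalEnd)
{
    TimeSpan down = TimeSpan.Zero;
    foreach (var (downAt, upAt) in maintenance)
        down += upAt - downAt; // duration of each maintenance window

    TimeSpan total = intervalEnd - intervalStart;
    return (total - down).TotalSeconds / total.TotalSeconds;
}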
