Finding similar records in LINQ - C#

I have the following LINQ query, which will be used to find any consignments that are 'similar':
from c in cons
group c by new { c.TripDate.Value, c.DeliveryPostcode, c.DeliveryName } into cg
let min = cg.Min(a => a.DeliverFrom)
let max = cg.Max(a => a.DeliverFrom)
let span = max - min
where span.TotalMinutes <= 59
select cg;
The main thing is the min, max and span. Basically, any consignments that are in the 'group', that have a DeliverFrom datetime within 59 minutes of any other one in the group, will be returned in the group.
The code above originally looked good to me, but on further inspection there is a problem. If there are more than 2 records in the group (say 2 with DeliverFrom dates within 59 minutes of each other, and one with a DeliverFrom date not within 59 minutes of either), the query would not return that group, because it takes the min and the max of the whole group and sees that the difference is more than 59 minutes. What I want to happen is to see that there are 2 consignments in the group with DeliverFrom dates close enough, and just select a group containing those two.
How would I go about doing this?
EDIT: Doh, another constraint has been added to this. There's a field called 'Weight' and one called 'Spaces'; each group can have a maximum of 26 Weight and 26 Spaces.

If I'm not mistaken, what you are looking for is a statistical problem called cluster identification, and if so it is a far more complex problem than you might think.
As a thought exercise, imagine if you had 3 entries, at 1:00, 1:30, and 2:00. How would you want to group these? Either the first two or the last two would work as a group (less than 59 minutes apart), but all 3 would not.
If you just want to keep chaining items together into a group as long as they are within 59 minutes of any other item in the group, you'd need to keep iterating until you stop finding new items to add to any cluster.
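If that simpler chaining behaviour is what you want, a minimal sketch could look like the following (this is my own sketch, assuming DeliverFrom is a non-nullable DateTime and that it runs over each group after sorting):
static IEnumerable<List<Consignment>> ChainClusters(IEnumerable<Consignment> group)
{
    // Sort by DeliverFrom; for sorted data, "within 59 minutes of any item already
    // in the cluster" reduces to "within 59 minutes of the last item added".
    var cluster = new List<Consignment>();
    foreach (var c in group.OrderBy(x => x.DeliverFrom))
    {
        if (cluster.Count == 0 ||
            (c.DeliverFrom - cluster[cluster.Count - 1].DeliverFrom).TotalMinutes <= 59)
        {
            cluster.Add(c);
        }
        else
        {
            yield return cluster;                // close the current cluster
            cluster = new List<Consignment> { c };
        }
    }
    if (cluster.Count > 0)
        yield return cluster;                    // emit the final cluster
}
Note that this only yields one of the possible groupings in ambiguous cases like the 1:00/1:30/2:00 example above; that ambiguity is exactly the cluster-identification issue.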

I'd group the consignments with the same logic you use, but use this overload of GroupBy instead, which allows projecting each group of consignments into another type. That type would here be an enumerable sequence of groups of consignments, each element of which represents consignments that not only were in the same group to begin with, but also should all be delivered within the space of an hour. So the signature of resultSelector would be
Func<anontype, IEnumerable<Consignment>, IEnumerable<IEnumerable<Consignment>>>
At this point it becomes clear that it would probably be a good idea to define a type for the grouping key so that you can get rid of the anonymous type in the above signature; otherwise you'd be forced to define your resultSelector as a lambda.
Within resultSelector, you need to first of all sort the incoming group of consignments by DeliverFrom and then return sub-groups based on that time. So it might look like this:
IEnumerable<IEnumerable<Consignment>> Partitioner(ConsignmentGroupKey key, IEnumerable<Consignment> cg)
{
    cg = cg.OrderBy(c => c.DeliverFrom);
    var startTime = cg.First().DeliverFrom;
    var subgroup = new List<Consignment>();
    foreach (var cons in cg)
    {
        if ((cons.DeliverFrom - startTime).TotalMinutes < 60)
        {
            subgroup.Add(cons);
        }
        else
        {
            yield return subgroup;
            startTime = cons.DeliverFrom;
            subgroup = new List<Consignment>() { cons };
        }
    }
    if (subgroup.Count > 0)
    {
        yield return subgroup;
    }
}
I haven't tried this, but as far as I can tell it should work.
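For completeness, wiring it up might look roughly like this (ConsignmentGroupKey is the hypothetical key type mentioned above; it would need Equals and GetHashCode overridden so GroupBy can compare keys):
var similar = cons
    .GroupBy(
        c => new ConsignmentGroupKey(c.TripDate.Value, c.DeliveryPostcode, c.DeliveryName),
        (key, group) => Partitioner(key, group))
    .SelectMany(subgroups => subgroups)         // flatten to one sequence of sub-groups
    .Where(subgroup => subgroup.Count() > 1);   // optional: keep only sub-groups with at least two 'similar' consignments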

Related

Algorithm for "consolidating" N items into K

I was wondering whether there's a known algorithm for doing the following, and also wondering how it would be implemented in C#. Maybe this is a known type of problem.
Example:
Suppose I have a class
class GoldMine
{
    public int TonsOfGold { get; set; }
}
and a List of N=3 such items
var mines = new List<GoldMine>() {
    new GoldMine() { TonsOfGold = 10 },
    new GoldMine() { TonsOfGold = 12 },
    new GoldMine() { TonsOfGold = 5 }
};
Then consolidating the mines into K=2 mines would be the consolidations
{ {mines[0],mines[1]}, {mines[2]} }, // { 22 tons, 5 tons }
{ {mines[0],mines[2]}, {mines[1]} }, // { 15 tons, 12 tons }
{ {mines[1],mines[2]}, {mines[0]} }  // { 17 tons, 10 tons }
and consolidating into K=1 mines would be the single consolidation
{ mines[0],mines[1],mines[2] } // { 27 tons }
What I'm interested in is the algorithm for the consolidation process.
If I'm not mistaken, the problem you're describing is Number of k-combinations for all k
I found a code snippet which I believe addresses your use case, but I just can't remember where I got it from. It must have been from StackOverflow. If anyone recognizes this particular piece of code, please let me know and I'll make sure to credit it.
So here's the extension method:
public static class ListExtensions
{
    public static List<ILookup<int, TItem>> GroupCombinations<TItem>(this List<TItem> items, int count)
    {
        var keys = Enumerable.Range(1, count).ToList();
        var indices = new int[items.Count];
        var maxIndex = items.Count - 1;
        var nextIndex = maxIndex;
        indices[maxIndex] = -1;
        var groups = new List<ILookup<int, TItem>>();
        while (nextIndex >= 0)
        {
            indices[nextIndex]++;
            if (indices[nextIndex] == keys.Count)
            {
                indices[nextIndex] = 0;
                nextIndex--;
                continue;
            }
            nextIndex = maxIndex;
            if (indices.Distinct().Count() != keys.Count)
            {
                continue;
            }
            var group = indices.Select((keyIndex, valueIndex) =>
                new
                {
                    Key = keys[keyIndex],
                    Value = items[valueIndex]
                })
                .ToLookup(x => x.Key, x => x.Value);
            groups.Add(group);
        }
        return groups;
    }
}
And a little utility method that prints the output:
public void PrintGoldmineCombinations(int count, List<GoldMine> mines)
{
    Debug.WriteLine("count = " + count);
    var groupNumber = 0;
    foreach (var group in mines.GroupCombinations(count))
    {
        groupNumber++;
        Debug.WriteLine("group " + groupNumber);
        foreach (var set in group)
        {
            Debug.WriteLine(set.Key + ": " + set.Sum(m => m.TonsOfGold) + " tons of gold");
        }
    }
}
You would use it like so:
var mines = new List<GoldMine>
{
    new GoldMine {TonsOfGold = 10},
    new GoldMine {TonsOfGold = 12},
    new GoldMine {TonsOfGold = 5}
};
PrintGoldmineCombinations(1, mines);
PrintGoldmineCombinations(2, mines);
PrintGoldmineCombinations(3, mines);
Which will produce the following output:
count = 1
group 1
1: 27 tons of gold
count = 2
group 1
1: 22 tons of gold
2: 5 tons of gold
group 2
1: 15 tons of gold
2: 12 tons of gold
group 3
1: 10 tons of gold
2: 17 tons of gold
group 4
2: 10 tons of gold
1: 17 tons of gold
group 5
2: 15 tons of gold
1: 12 tons of gold
group 6
2: 22 tons of gold
1: 5 tons of gold
count = 3
group 1
1: 10 tons of gold
2: 12 tons of gold
3: 5 tons of gold
group 2
1: 10 tons of gold
3: 12 tons of gold
2: 5 tons of gold
group 3
2: 10 tons of gold
1: 12 tons of gold
3: 5 tons of gold
group 4
2: 10 tons of gold
3: 12 tons of gold
1: 5 tons of gold
group 5
3: 10 tons of gold
1: 12 tons of gold
2: 5 tons of gold
group 6
3: 10 tons of gold
2: 12 tons of gold
1: 5 tons of gold
Note: this does not take into account duplicates by the contents of the sets and I'm not sure if you actually want those filtered out or not.
Is this what you need?
EDIT
Actually, looking at your comment it seems you don't want the duplicates and you also want the lower values of k included, so here is a minor modification that takes out the duplicates (in a really ugly way, I apologize) and gives you the lower values of k per group:
public static List<ILookup<int, TItem>> GroupCombinations<TItem>(this List<TItem> items, int count)
{
    var keys = Enumerable.Range(1, count).ToList();
    var indices = new int[items.Count];
    var maxIndex = items.Count - 1;
    var nextIndex = maxIndex;
    indices[maxIndex] = -1;
    var groups = new List<ILookup<int, TItem>>();
    while (nextIndex >= 0)
    {
        indices[nextIndex]++;
        if (indices[nextIndex] == keys.Count)
        {
            indices[nextIndex] = 0;
            nextIndex--;
            continue;
        }
        nextIndex = maxIndex;
        var group = indices.Select((keyIndex, valueIndex) =>
            new
            {
                Key = keys[keyIndex],
                Value = items[valueIndex]
            })
            .ToLookup(x => x.Key, x => x.Value);
        if (!groups.Any(existingGroup => group.All(grouping1 => existingGroup.Any(grouping2 => grouping2.Count() == grouping1.Count() && grouping2.All(item => grouping1.Contains(item))))))
        {
            groups.Add(group);
        }
    }
    return groups;
}
It produces the following output for k = 2:
group 1
1: 27 tons of gold
group 2
1: 22 tons of gold
2: 5 tons of gold
group 3
1: 15 tons of gold
2: 12 tons of gold
group 4
1: 10 tons of gold
2: 17 tons of gold
This is actually the problem of enumerating all K-partitions of a set of N objects, often described as enumerating the ways to place N labelled objects into K unlabelled boxes.
As is almost always the case, the easiest way to solve a problem involving enumeration of unlabelled or unordered alternatives is to create a canonical ordering and then figure out how to generate only canonically-ordered solutions. In this case, we assume that the objects have some total ordering so that we can refer to them by integers between 1 and N, and then we place the objects in order into the partitions, and order the partitions by the index of the first object in each one. It's pretty easy to see that this ordering cannot produce duplicates and that every partitioning must correspond to some canonical ordering.
We can then represent a given canonical ordering by a sequence of N integers, where each integer is the number of the partition for the corresponding object. Not every sequence of N integers will work, however; we need to constrain the sequences so that the partitions are in the canonical order (sorted by the index of the first element). The constraint is simple: each element in the sequence must either be some integer which previously appeared in the sequence (an object placed into an already present partition) or it must be the index of the next partition, which is one more than the index of the last partition already present. In summary:
The first entry in the sequence must be 1 (because the first object can only be placed into the first partition); and
Each subsequent entry is at least 1 and no greater than one more than the largest entry preceding that point.
(These two criteria could be combined if we interpret "the largest entry preceding" the first entry as 0.)
That's not quite enough, since it doesn't restrict the sequence to exactly K partitions. If we wanted to find all of the partitions, that would be fine, but if we want all the partitions whose size is precisely K then we need the maximum value in the sequence to be exactly K. Since each element can exceed the previous maximum by at most one, the running maximum after position i must be at least K−(N−i), as well as not allowing any element to be greater than K:
The element at position i must be no greater than min(M+1, K), where M is the largest entry before position i; and if M is less than K+i−N, the element must be exactly M+1 so that the running maximum keeps pace.
Generating sequences according to a simple set of constraints like the above can easily be done recursively. We start with an empty sequence, and then successively add each possible next element, calling this procedure recursively to fill in the entire sequence. As long as it is simple to produce the list of possible next elements, the recursive procedure will be straightforward. In this case, we need three pieces of information to produce this list: N, K, and the maximum value generated so far.
That leads to the following pseudo-code:
GenerateAllSequencesHelper(N, K, M, Prefix):
    if length(Prefix) is N:
        Prefix is a valid sequence; handle it
    else:
        # [See Note 1]
        pos = length(Prefix) + 1
        if M >= K + pos - N:
            low = 1          # the running maximum is already large enough
        else:
            low = M + 1      # the next element must raise the running maximum
        for i from low up to min(M + 1, K):
            Append i to Prefix
            GenerateAllSequencesHelper(N, K, max(M, i), Prefix)
            Pop i off of Prefix

GenerateAllSequences(N, K):
    GenerateAllSequencesHelper(N, K, 0, [])
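A fairly direct C# rendering of that recursion might look like this (the handler delegate and the method names are mine):
// Sketch of the recursive generator; each completed sequence gives, for every
// object, the 1-based number of the partition it belongs to.
static void GenerateAllSequences(int n, int k, Action<IReadOnlyList<int>> handle)
{
    GenerateAllSequencesHelper(n, k, 0, new List<int>(), handle);
}

static void GenerateAllSequencesHelper(int n, int k, int m, List<int> prefix, Action<IReadOnlyList<int>> handle)
{
    if (prefix.Count == n)
    {
        handle(prefix);
        return;
    }
    int pos = prefix.Count + 1;
    // If the running maximum is falling behind, the next element must raise it.
    int low = m >= k + pos - n ? 1 : m + 1;
    int high = Math.Min(m + 1, k);
    for (int i = low; i <= high; i++)
    {
        prefix.Add(i);
        GenerateAllSequencesHelper(n, k, Math.Max(m, i), prefix, handle);
        prefix.RemoveAt(prefix.Count - 1);
    }
}
Calling GenerateAllSequences(3, 2, seq => Console.WriteLine(string.Join(",", seq))) prints 1,1,2 then 1,2,1 then 1,2,2, matching the three K=2 consolidations in the question.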
Since the recursion depth will be extremely limited for any practical application of this procedure, the recursive solution should be fine. However, it is also quite simple to produce an iterative solution even without using a stack. This is an instance of a standard enumeration algorithm for constrained sequences:
Start with the lexicographically smallest possible sequence
While possible:
Scan backwards to find the last element which could be increased. ("Could be" means that increasing that element would still result in the prefix of some valid sequence.)
Increment that element to the next largest possible value
Fill in the rest of the sequence with the smallest possible suffix.
In the iterative algorithm, the backwards scan might involve checking O(N) elements, which apparently makes it slower than the recursive algorithm. However, in most cases they will have the same computational complexity, because in the recursive algorithm each generated sequence also incurs the cost of the recursive calls and returns required to reach it. If each (or, at least, most) recursive calls produce more than one alternative, the recursive algorithm will still be O(1) per generated sequence.
But in this case, it is likely that the iterative algorithm will also be O(1) per generated sequence, as long as the scan step can be performed in O(1); that is, as long as it can be performed without examining the entire sequence.
In this particular case, computing the maximum value of the sequence up to a given point is not O(1), but we can produce an O(1) iterative algorithm by also maintaining the vector of cumulative maxima. (In effect, this vector corresponds to the stack of M arguments in the recursive procedure above.)
It's easy enough to maintain the M vector; once we have it, we can easily identify "incrementable" elements in the sequence: element i is incrementable if i>0, the element is no greater than M[i−1] (equivalently, M[i] equals M[i−1]), and the element is less than K. [Note 2]
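Roughly, the iterative version might look like this in C#, carrying the a[] sequence and the M[] vector of cumulative maxima (an untested sketch written from the description above, assuming 1 <= K <= N):
static IEnumerable<int[]> KPartitionSequences(int n, int k)
{
    var a = new int[n];   // a[i] = partition number of object i + 1
    var m = new int[n];   // m[i] = max of a[0..i] (the cumulative maxima vector)

    // Fill positions j..n-1 with the smallest values that can still reach a maximum of k.
    void FillSuffix(int j)
    {
        for (; j < n; j++)
        {
            int prevMax = j > 0 ? m[j - 1] : 0;
            a[j] = prevMax >= k + (j + 1) - n ? 1 : prevMax + 1;
            m[j] = Math.Max(prevMax, a[j]);
        }
    }

    FillSuffix(0);  // lexicographically smallest valid sequence
    while (true)
    {
        yield return (int[])a.Clone();

        // Scan backwards for the last incrementable element.
        int i = n - 1;
        while (i >= 0)
        {
            int prevMax = i > 0 ? m[i - 1] : 0;
            if (a[i] < Math.Min(prevMax + 1, k)) break;
            i--;
        }
        if (i < 0) yield break;  // nothing can be incremented: enumeration is complete

        a[i]++;
        m[i] = Math.Max(i > 0 ? m[i - 1] : 0, a[i]);
        FillSuffix(i + 1);       // smallest possible suffix after the increment
    }
}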
Notes
If we wanted to produce all partitions, we would replace the for loop above with the rather simpler:
for i from 1 to M+1:
This answer is largely based on this answer, but that question asked for all partitions; here, you want to generate the K-partitions. As indicated, the algorithms are very similar.

Subset Sum algorithm efficiency

We have a number of payments (Transaction) that come into our business each day. Each Transaction has an ID and an Amount. We have the requirement to match a number of these transactions to a specific amount. Example:
Transaction Amount
1 100
2 200
3 300
4 400
5 500
If we wanted to find the transactions that add up to 600 you would have a number of sets (1,2,3),(2,4),(1,5).
I found an algorithm that I have adapted, and it works as defined below. For 30 transactions it takes 15 ms, but the number of transactions averages around 740 and has a maximum close to 6000. Is there a more efficient way to perform this search?
sum_up(TransactionList, remittanceValue, ref MatchedLists);
private static void sum_up(List<Transaction> transactions, decimal target, ref List<List<Transaction>> matchedLists)
{
    sum_up_recursive(transactions, target, new List<Transaction>(), ref matchedLists);
}

private static void sum_up_recursive(List<Transaction> transactions, decimal target, List<Transaction> partial, ref List<List<Transaction>> matchedLists)
{
    decimal s = 0;
    foreach (Transaction x in partial) s += x.Amount;
    if (s == target)
    {
        matchedLists.Add(partial);
    }
    if (s > target)
        return;
    for (int i = 0; i < transactions.Count; i++)
    {
        List<Transaction> remaining = new List<Transaction>();
        Transaction n = new Transaction(0, transactions[i].ID, transactions[i].Amount);
        for (int j = i + 1; j < transactions.Count; j++) remaining.Add(transactions[j]);
        List<Transaction> partial_rec = new List<Transaction>(partial);
        partial_rec.Add(new Transaction(n.MatchNumber, n.ID, n.Amount));
        sum_up_recursive(remaining, target, partial_rec, ref matchedLists);
    }
}
With Transaction defined as:
class Transaction
{
    public int ID;
    public decimal Amount;
    public int MatchNumber;

    public Transaction(int matchNumber, int id, decimal amount)
    {
        ID = id;
        Amount = amount;
        MatchNumber = matchNumber;
    }
}
As already mentioned, your problem can be solved by a pseudo-polynomial algorithm in O(n*G), with n the number of items and G your targeted sum.
The first question is whether it is possible to achieve the targeted sum G at all. The following pseudo/Python code solves it (I have no C# on my machine):
def subsum(values, target):
    reached = [False] * (target + 1)  # initialize as no sums reached at all
    reached[0] = True                 # with 0 elements we can only achieve the sum 0
    for val in values:
        for s in reversed(xrange(target + 1)):   # for target, target-1, ..., 0
            if reached[s] and s + val <= target:  # if subsum s can be reached, then adding the current value builds a new reachable sum
                reached[s + val] = True
    return reached[target]
What is the idea? Let's consider values [1,2,3,6] and target sum 7:
We start with an empty set - the possible sum is obviously 0.
Now we look at the first element, 1, and have two options: take it or not. That leaves us with possible sums {0,1}.
Now looking at the next element, 2: this leads to possible sums {0,1} (not taking) + {2,3} (taking).
Until now there is not much difference from your approach, but now for element 3 we have possible sums a. for not taking {0,1,2,3} and b. for taking {3,4,5,6}, resulting in {0,1,2,3,4,5,6} as possible sums. The difference from your approach is that there are two ways to get to 3, and your recursion would be started twice from there (which is not needed). Calculating basically the same stuff over and over again is the problem with your approach and is why the proposed algorithm is better.
As the last step we consider 6 and get {0,1,2,3,4,5,6,7} as possible sums.
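Since the question is in C#, the reachability pass above might translate to something like this (a sketch, assuming the decimal amounts are first scaled to integers, e.g. whole pence):
// Returns true if some subset of values adds up exactly to target.
// Assumes the amounts have already been converted to non-negative integers.
static bool SubSum(IList<int> values, int target)
{
    var reached = new bool[target + 1];
    reached[0] = true;  // the empty subset sums to 0
    foreach (var val in values)
    {
        // Iterate downwards so each value is used at most once.
        for (int s = target - val; s >= 0; s--)
        {
            if (reached[s])
                reached[s + val] = true;
        }
    }
    return reached[target];
}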
But you also need the subset which leads to the targeted sum, for this we just remember which element was taken to achieve the current sub sum. This version returns a subset which results in the target sum or None otherwise:
def subsum(values, target):
    reached = [False] * (target + 1)
    val_ids = [-1] * (target + 1)
    reached[0] = True  # with 0 elements we can only achieve the sum 0
    for (val_id, val) in enumerate(values):
        for s in reversed(xrange(target + 1)):  # for target, target-1, ..., 0
            if reached[s] and s + val <= target:
                reached[s + val] = True
                val_ids[s + val] = val_id

    # reconstruct the subset for target:
    if not reached[target]:
        return None  # means not possible
    else:
        result = []
        current = target
        while current != 0:  # search backwards, jumping from predecessor to predecessor
            val_id = val_ids[current]
            result.append(val_id)
            current -= values[val_id]
        return result
As another approach you could use memoization to speed up your current solution, remembering for each state (subsum, number_of_elements_not_considered) whether it is possible to achieve the target sum. But I would say the standard dynamic programming is a less error-prone option here.
Yes.
I can't provide full code at the moment, but instead of iterating over the list of transactions repeatedly until finding matches (O(n^2)), try this concept:
Set up a hashtable whose entries are the individual transaction amounts as well as the sum of each pair of transactions, assuming each value is made up of at most two transactions (weekend credit card processing).
For each total, look it up in the hashtable; the sets of transactions in that slot are the matching transactions.
Instead of O(n^2), you can get it down to roughly 4n, which would make a noticeable difference in speed.
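A rough sketch of that idea (the method and type names are mine, and it assumes the at-most-two-transactions rule holds):
// Build a lookup from every achievable sum (singles and pairs) to the
// transaction sets that produce it.
static Dictionary<decimal, List<List<Transaction>>> BuildSumLookup(IList<Transaction> txns)
{
    var lookup = new Dictionary<decimal, List<List<Transaction>>>();
    void Add(decimal sum, List<Transaction> set)
    {
        if (!lookup.TryGetValue(sum, out var sets))
            lookup[sum] = sets = new List<List<Transaction>>();
        sets.Add(set);
    }
    for (int i = 0; i < txns.Count; i++)
    {
        Add(txns[i].Amount, new List<Transaction> { txns[i] });  // single transactions
        for (int j = i + 1; j < txns.Count; j++)                 // every pair of transactions
            Add(txns[i].Amount + txns[j].Amount, new List<Transaction> { txns[i], txns[j] });
    }
    return lookup;
}
Each incoming total then becomes a single lookup.TryGetValue(total, out var matches) call.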
Good luck!
Dynamic programming can solve this problem efficiently:
Assume you have n transactions and the target amount is m.
We can solve it with a complexity of O(nm).
See the Knapsack problem.
For this problem we can define dp[i][sum] as the number of subsets of the first i transactions that add up to sum.
The recurrence:
for i from 1 to n:
    dp[i][sum] = dp[i - 1][sum] + dp[i - 1][sum - amount_i]
dp[n][sum] is the count you need, and you need to add some bookkeeping to recover what the subsets actually are.
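For illustration, a compact version of that table (using a one-dimensional array, and assuming the decimal amounts are scaled to integers first) could look like:
// Counts the subsets that add up exactly to target; dp[s] is the number of
// subsets summing to s seen so far.
static long CountSubsets(IList<int> amounts, int target)
{
    var dp = new long[target + 1];
    dp[0] = 1;  // one way to make 0: the empty subset
    foreach (var amount in amounts)
        for (int s = target; s >= amount; s--)  // downwards: each transaction used at most once
            dp[s] += dp[s - amount];
    return dp[target];
}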
You have a couple of practical assumptions here that would make brute force with smartish branch pruning feasible:
items are unique, hence you wouldn't be getting a combinatorial blow-up of valid subsets (e.g. (1,1,1,1,1,1,1,1,1,1,1,1,1) adding up to 3)
if the number of resulting feasible sets is still huge, you would run out of memory collecting them before running into total runtime issues.
ordering the input ascending would allow for an easy early-stop check - if your remaining sum is smaller than the current element, then none of the yet unexamined items could possibly be in a result (as the current and subsequent items would only get bigger)
keeping a running sum would speed up each step, as you wouldn't be recalculating it over and over again
Here's a bit of code:
public static List<T[]> SubsetSums<T>(T[] items, int target, Func<T, int> amountGetter)
{
    Stack<T> unusedItems = new Stack<T>(items.OrderByDescending(amountGetter));
    Stack<T> usedItems = new Stack<T>();
    List<T[]> results = new List<T[]>();
    SubsetSumsRec(unusedItems, usedItems, target, results, amountGetter);
    return results;
}

public static void SubsetSumsRec<T>(Stack<T> unusedItems, Stack<T> usedItems, int targetSum, List<T[]> results, Func<T, int> amountGetter)
{
    if (targetSum == 0)
        results.Add(usedItems.ToArray());
    if (targetSum < 0 || unusedItems.Count == 0)
        return;
    var item = unusedItems.Pop();
    int currentAmount = amountGetter(item);
    if (targetSum >= currentAmount)
    {
        // case 1: use current element
        usedItems.Push(item);
        SubsetSumsRec(unusedItems, usedItems, targetSum - currentAmount, results, amountGetter);
        usedItems.Pop();
        // case 2: skip current element
        SubsetSumsRec(unusedItems, usedItems, targetSum, results, amountGetter);
    }
    unusedItems.Push(item);
}
I've run it against 100k input that yields around 1k results in under 25 millis, so it should be able to handle your 740 case with ease.
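Usage against your Transaction class might look something like this (the scaling to whole pence is my assumption, since Amount is a decimal and the method works in int):
// Hypothetical usage; amounts and the target are expressed in integer pence.
var matches = SubsetSums(transactions.ToArray(), 60000, t => (int)(t.Amount * 100));
foreach (var match in matches)
    Console.WriteLine(string.Join(", ", match.Select(t => t.ID)));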

Take pages and combine to list of pages

I have a list, let's say it contains 1000 items. I want to end up with a list of 10 times 100 items with something like:
myList.Select(x => x.y).Take(100) (until list is empty)
So I want Take(100) to run ten times, since the list contains 1000 items, and end up with a list containing 10 lists, each of which contains 100 items.
You need to Skip the number of records you have already taken; you can keep track of this number and use it when you query:
var alreadyTaken = 0;
while (alreadyTaken < 1000)
{
    var pagedList = myList.Select(x => x.y).Skip(alreadyTaken).Take(100);
    ...
    alreadyTaken += 100;
}
This can be achieved with a simple paging extension method.
public static List<T> GetPage<T>(this List<T> dataSource, int pageIndex, int pageSize = 100)
{
    return dataSource.Skip(pageIndex * pageSize)
                     .Take(pageSize)
                     .ToList();
}
Of course, you can extend it to accept and/or return any kind of IEnumerable<T>.
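For instance, an IEnumerable<T> flavour might look like this (an untested sketch along the same lines):
// Same idea, but lazily over any sequence instead of materializing a List<T>.
public static IEnumerable<T> GetPage<T>(this IEnumerable<T> dataSource, int pageIndex, int pageSize = 100)
{
    return dataSource.Skip(pageIndex * pageSize)
                     .Take(pageSize);
}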
As already posted, you can use a loop with Skip and Take, creating a new query in every iteration. But a problem arises if you also want to go through each of those queries, because this will be very inefficient. Let's assume you have just 50 entries and you want to go through your list ten elements at a time. You will have 5 loops doing
.Skip(0).Take(10)
.Skip(10).Take(10)
.Skip(20).Take(10)
.Skip(30).Take(10)
.Skip(40).Take(10)
Two problems arise here.
Skipping elements can still lead to computation. In your first query you calculate just the needed 10 elements, but in your second loop you calculate 20 elements and throw 10 away, and so on. If you sum all 5 loops together you have already computed 10 + 20 + 30 + 40 + 50 = 150 elements even though you only had 50. This results in O(n^2) performance.
Not every IEnumerable behaves like that. Some IEnumerables, like a database query, can optimize a Skip, for example by using an OFFSET clause (e.g. in MySQL) in the generated SQL. But that still doesn't solve the problem: you still create 5 different queries and execute all 5 of them, and those five queries will now take most of the time, because even a simple query to a database is a lot slower than skipping some in-memory elements or doing some computation.
Because of all these problems it makes sense not to use a loop with multiple .Skip(x).Take(y) calls if you also want to evaluate every query in every iteration. Instead your algorithm should go through your IEnumerable only once, executing the query once, and on the first iteration return the first 10 elements. The next iteration returns the next 10 elements, and so on, until it runs out of elements.
The following Extension Method does exactly this.
public static IEnumerable<IReadOnlyList<T>> Combine<T>(this IEnumerable<T> source, int amount) {
var combined = new List<T>();
var counter = 0;
foreach ( var entry in source ) {
combined.Add(entry);
if ( ++counter >= amount ) {
yield return combined;
combined = new List<T>();
counter = 0;
}
}
if ( combined.Count > 0 )
yield return combined;
}
With this you can just do
someEnumerable.Combine(100)
and you get a new IEnumerable<IReadOnlyList<T>> that goes through your enumeration just once, slicing everything into chunks of at most 100 elements.
Just to show how much difference the performance could be:
var numberCount = 100000;
var combineCount = 100;
var nums = Enumerable.Range(1, numberCount);
var count = 0;
// Benchmark with Combine() Extension
var swCombine = Stopwatch.StartNew();
var sumCombine = 0L;
var pages = nums.Combine(combineCount);
foreach ( var page in pages ) {
sumCombine += page.Sum();
count++;
}
swCombine.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Combine: {2}", count, sumCombine, swCombine.Elapsed);
// Doing it with .Skip(x).Take(y)
var swTakes = Stopwatch.StartNew();
count = 0;
var sumTaken = 0L;
var alreadyTaken = 0;
while ( alreadyTaken < numberCount ) {
sumTaken += nums.Skip(alreadyTaken).Take(combineCount).Sum();
alreadyTaken += combineCount;
count++;
}
swTakes.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Takes: {2}", count, sumTaken, swTakes.Elapsed);
The usage with the Combine() extension method runs in 3 milliseconds on my computer (i5 @ 4 GHz) while the loop already needs 178 milliseconds.
If you have a lot more elements or the slicing is smaller, it gets even worse. For example if combineCount is set to 10 instead of 100, the runtime changes to 4 milliseconds and 1800 milliseconds (1.8 seconds).
Now you could say that you don't have that many elements or that your slicing never gets that small. But remember, in this example I just generated a sequence of numbers with nearly zero computation time. The whole overhead, from 4 milliseconds to 178 milliseconds, is caused only by the re-evaluation and skipping of values. If you have something more complex going on behind the scenes, the skipping creates the most overhead, and if the IEnumerable can implement Skip, like a database as explained above, the example still gets worse, because the biggest overhead becomes the execution of the query itself.
And the number of queries can grow really fast. With 100,000 elements and a slicing/chunking size of 100 you will already execute 1,000 queries. The Combine extension provided above, on the other hand, will always execute your query only once, and will never suffer from any of the problems described above.
All of that doesn't mean that Skip and Take should be avoided. They have their place. But if you really plan to go through every element, you should avoid using Skip and Take to do your slicing.
If the only thing you want is to slice everything into pages of 100 elements and you just want to fetch, say, the third page, you should simply calculate how many elements you need to Skip.
var pageCount = 100;
var pageNumberToGet = 3;
var thirdPage = yourEnumerable.Skip(pageCount * (pageNumberToGet - 1)).Take(pageCount);
In this way you will get the elements from 200 to 300 in a single query. An IEnumerable backed by a database can also optimize that, and you end up with just one query. So, if you only want a specific range of elements from your IEnumerable, then you should use Skip and Take as above instead of the Combine extension method that I provided.

Sum up every x values in a row

I have a column calendar week and a column amount. Now I want to sum up the amount for every 4 calendar weeks starting from the first calendar week in April. E.g. if I have 52 (rows) calendar weeks in my initial table I would have 13 (rows) weeks in my final table (I talk about table here since I will try to bind the outcome later to a DGV).
I am using LINQ-to-DataSet and tried different sources to get a hint how to solve this, but GroupBy and Aggregate couldn't help me. Maybe there are some applications of them that I don't understand, so I hope one of you LINQ experts can help me.
Usually I post code, but I can only give you the backbone since I have no starting point.
_dataset = New Dataset1
_adapter.Fill(_dataset.Table1)
Dim query = From dt In _dataset.Table1.AsEnumerable()
Divide the week by four (using integer division to truncate the result) to generate a value that you can group on.
You can group by anything you like. You could, theoretically, use a simple incrementing number for that:
var groups = dt
.Select((row, i) => Tuple.Create(row, i / 4))
.GroupBy(t => t.Item2);
(C# notation)
Then you can calculate the sum for each group:
var sums = groups
.Select(g => g.Sum(t => t.Item1.Amount));
You mention you want to start at a certain month, e.g. April. You can skip rows by using Skip or Where:
dt.Skip(12).Select(...)
i will always start at 0, making sure your first group contains 4 weeks. However, to know exactly how many weeks to skip or where to start you need more calendar information. I presume you have some fields in your rows that mention the corresponding week's start date:
dt.Where(row => row.StartDate >= firstOfApril).Select(...)
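Putting those pieces together, the whole query might look roughly like this (StartDate, Week and Amount are assumed property/column names, so adjust them to your dataset):
// Rough combination of the fragments above.
var fourWeekSums = dt
    .Where(row => row.StartDate >= firstOfApril)     // start at the first week in April
    .OrderBy(row => row.Week)
    .Select((row, i) => new { row, Block = i / 4 })  // integer division: 4 weeks per block
    .GroupBy(x => x.Block)
    .Select(g => new { Block = g.Key, Total = g.Sum(x => x.row.Amount) });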
First convert the data into a lookup keyed by the 4-week block, with the amount as the value. Note your columns need to be called CalendarWeek and Amount. Then convert to a dictionary and sum the values per key:
var grouped = dt.AsEnumerable().ToLookup(o => Math.Ceiling((double) o.Field<int>("CalendarWeek")/4), o => o.Field<double>("Amount"));
var results = grouped.ToDictionary(result => result.Key, result => result.Sum());

Better algorithm for a date comparison task

I would like some help making this comparison faster (sample below). The sample compares each value in an array against a comparison variable that is advanced one hour at a time; if there is no matching value, it adds a value to a second array (which is concatenated later).
if (ticks.TypeOf == Period.Hour)
    while (compareAt <= endAt)
    {
        if (range.Where(d => d.time.AddMinutes(-d.time.Minute) == compareAt).Count() < 1)
            gaps.Add(new SomeValue() {
                ...some dummy values.. });
        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
This execution is too expensive when it comes to, for example, hours: there are at most 365 * 24 = 8760 values in the array. In the future there will also be minutes per month, 60*24*31 = 44640 values, which makes it unusable.
If the array were usually complete (meaning no gaps/empty slots), the check could easily be bypassed with if (range.Count() == (hours/day * days)). That day, though, will not be today.
How would I solve this more effectively?
One example: if there are 7800 values in the array, we are missing about 950, right? Could I find just the endings of the gaps and create only the missing values? That would make the complexity depend on the number of gaps, not the number of values.
Another welcome answer would simply be a more efficient loop.
[Edit]
Sorry for the bad English, I am trying my best to describe it.
Your performance is low because the range lookup is not using any indexing and rechecks the entire range every time.
One way to do this a lot quicker:
if (ticks.TypeOf == Period.Hour)
{
    // fill a hashset with the range's unique hourly values
    var rangehs = new HashSet<DateTime>();
    foreach (var r in range)
    {
        rangehs.Add(r.time.AddMinutes(-r.time.Minute));
    }
    // walk all the hours
    while (compareAt <= endAt)
    {
        // quickly check if it's a gap
        if (!rangehs.Contains(compareAt))
            gaps.Add(new SomeValue() { ...some dummy values..});
        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
}
