Remove oldest n Items from List using C#

I am working on a dynamic list of scores that is frequently updated. Ultimately it is used to produce an overall rating, so older entries (older based on some parameters, not time) need to be removed to prevent them from weighting the overall score too heavily. Multiple values are added at once from a separate enumeration.
List<int> scoreList = new List<int>();
foreach (Item x in Items)
{
    scoreList.Add(x.score);
}

// what I need help with:
if (scoreList.Count > (Items.Count() * 3))
{
    // I need to remove the oldest set (first in, first out) of values,
    // of size Items.Count(), from the list
}
If anyone can help it would be much appreciated :) I had to make the code a bit generic because the original is written rather cryptically (I didn't write those methods).

Use List<T>.RemoveRange - something like this:
// The number to remove is the difference between the current length
// and the maximum length you want to allow.
var count = scoreList.Count - (Items.Count() * 3);
if (count > 0)
{
    // remove that number of items from the start of the list
    scoreList.RemoveRange(0, count);
}
You remove from the start of the list, because when you Add items they go to the end - so the oldest are at the start.

Try this
scoreList.RemoveAt(scoreList.Count - 1);
And here is the MSDN Article

Instead of using a List<int> I would recommend using a Queue<int>. That will give you the FIFO behavior you're looking for.
See http://msdn.microsoft.com/en-us/library/7977ey2c.aspx for more information on Queues.
Queue<int> scoreList = new Queue<int>();
foreach (Item x in Items)
{
    scoreList.Enqueue(x.score);
}

// Or you can eliminate the foreach by doing the following:
// Queue<int> scoreList = new Queue<int>(Items.Select(i => i.score));

// Note that Count is a property on Queue<T>
while (scoreList.Count > (Items.Count() * 3))
{
    scoreList.Dequeue();
}

I didn't understand your question very well; I hope this is what you want.
scoreList.RemoveRange(Items.Count() * 3, scoreList.Count - Items.Count() * 3);

A simple way to get the last N elements from a list with LINQ:
scoreList.Skip(Math.Max(0, scoreList.Count() - N)).Take(N)
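In this question's context you could use it to keep only the newest N scores, e.g. (a small usage sketch, assuming N is the number of scores to keep):
scoreList = scoreList.Skip(Math.Max(0, scoreList.Count - N)).ToList();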

I toyed around and looked at the method suggested above (scoresList.RemoveAt()), but it wasn't suited to the situation. What did end up working:
if (...)
{
    scoresList.RemoveRange(0, scores.Count);
}
Thanks for the help guys

Related

Get sample set from large dataset

I have an in-memory dataset from which I'm trying to get an evenly distributed sample using LINQ. From what I've seen, nothing does this out of the box, so I'm trying to come up with some kind of composition or extension that will perform the sampling.
What I'm hoping for is something that I can use like this:
var sample = dataset.Sample(100);
var smallSample = smallDataset.Sample(100);
Assert.IsTrue(dataset.Count() > 100);
Assert.IsTrue(sample.Count() == 100);
Assert.IsTrue(smallDataset.Count() < 100);
Assert.IsTrue(smallSample.Count() == smallDataset.Count());
The composition I started with, which only works some of the time, is this:
var sample = dataset
    .Select((v, i) => new Tuple<string, int>(v, i))
    .Where(t => t.Item2 / (double)(dataset.Count() / SampleSize) % 1 != 0)
    .Select(t => t.Item1);
This works when the dataset size and the sample size share a common divisor and the sample size is greater than 50% of the dataset size. Or something like that.
Any help would be excellent!
Update: So I have the following non-LINQ logic that works, but I'm trying to figure out if this can be "LINQ'd" somehow.
var sample = new List<T>();
double sampleRatio = dataset.Count() / sampleSize;
for (var i = 0; i < dataset.Count(); i++)
{
    if ((sample.Count() * sampleRatio) <= i)
        sample.Add(dataset.Skip(i).FirstOrDefault());
}
I can't find a satisfactory LINQ solution, mainly because LINQ statements that iterate are not aware of the length of the sequence they work on -- which is OK: it totally fits LINQ's deferred-execution and streaming approach. Of course it's possible to store the length in a variable and use it in a Where statement, but that's not in line with LINQ's functional (stateless) paradigm, so I always try to avoid that.
The Aggregate statement can be stateless and length-aware, but I tend to find solutions using Aggregate rather contrived and hard to read. It's nothing but a covert stateful loop; for and foreach take a few more lines but are far easier to follow.
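For what it's worth, here is roughly what your update's loop looks like when forced into Aggregate (a sketch only, assuming dataset is an IEnumerable<string> as in your first attempt and sampleSize is the requested count); the accumulator is just the loop state in disguise:
double sampleRatio = (double)dataset.Count() / sampleSize;
var sample = dataset
    .Select((item, i) => new { item, i })
    .Aggregate(new List<string>(), (acc, t) =>
    {
        // acc.Count * sampleRatio is the index at which the next pick is due,
        // mirroring the explicit for loop above
        if (acc.Count * sampleRatio <= t.i)
            acc.Add(t.item);
        return acc;
    });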
I can offer you an extension method that does what you want:
public static IEnumerable<T> TakeProrated<T>(this IEnumerable<T> sequence, int numberOfItems)
{
    var local = sequence.ToList();
    var n = Math.Min(local.Count, numberOfItems);
    var dist = (decimal)local.Count / n;
    for (int i = 0; i < n; i++)
    {
        var index = (int)Math.Ceiling(i * dist);
        yield return local[index];
    }
}
The idea is that the required distance between items is calculated first. Then the requested number of items is returned, each time skipping roughly this distance, sometimes more, sometimes less, but evenly distributed. The choice between Math.Ceiling and Math.Floor is arbitrary; they bias toward items later or earlier in the sequence, respectively.
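For example, a quick usage sketch with made-up data:
var letters = new List<string> { "a", "b", "c", "d", "e", "f", "g", "h", "i", "j" };
// dist = 10 / 4 = 2.5, so indices ceil(0), ceil(2.5), ceil(5), ceil(7.5) = 0, 3, 5, 8
var picked = letters.TakeProrated(4).ToList(); // "a", "d", "f", "i"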
I think I understand what you're looking for: returning only a certain quantity of entities from a dataset. As my comment on your original post asks, have you tried using the Take operator? What you're looking for is something like this.
// .Skip is optional, but can be combined with .Take.
// The key is to use .Take(quantity) rather than .FirstOrDefault().
var sample = dataSet.Skip(amt).Take(dataSet.Count() / desiredSampleSize);

Exclude items of one list from another with C#

I have a rather specific question about how to exclude items of one list from another. Common approaches such as Except() won't do and here is why:
If the duplicate within a list has an "even" index - I need to remove THIS element and the NEXT element AFTER it.
If the duplicate within a list has an "odd" index - I need to remove THIS element AND one element BEFORE it.
There might be many appearances of the same duplicate within a list, i.e. one might be at an "even" index, another at an "odd" one.
I'm not asking for a solution, since I've created one myself. However, after this method is called many times, ANTS Performance Profiler shows that it accounts for 75% of the whole execution time (30 seconds out of 40). The question is: is there a faster way to perform the same operation? I've tried to optimize my current code, but it still lacks performance. Here it is:
private void removedoubles(List<int> exclude, List<int> listcopy)
{
    for (int j = 0; j < exclude.Count; j++)
    {
        for (int i = 0; i < listcopy.Count; i++)
        {
            if (listcopy[i] == exclude[j])
            {
                if (i % 2 == 0) // even: remove this element and the next
                {
                    listcopy.RemoveAt(i);
                    listcopy.RemoveAt(i);
                    i = i - 1;
                }
                else // odd: remove this element and the one before it
                {
                    listcopy.RemoveAt(i - 1);
                    listcopy.RemoveAt(i - 1);
                    i = i - 2;
                }
            }
        }
    }
}
where:
exclude - a list that contains duplicates only. It might contain up to 30 elements.
listcopy - the list that should be checked for duplicates. If a duplicate from "exclude" is found, the removal operation is performed. It might contain up to 2000 elements.
I think LINQ might be of some help, but I don't understand its syntax well.
A faster way (O(n)) would be to do the following:
go through the exclude list and make it into a HashSet (O(n))
in the checks, test whether the element is in the set; the whole pass is again O(n), since a presence test on a HashSet is O(1).
Maybe you can even change your algorithms so that the exclude collection is a HashSet from the very beginning; this way you can omit step 1 and gain even more speed.
(Your current way is O(n^2).)
Edit:
Another idea is the following: you are perhaps creating a copy of some list and making this method modify it? (A guess based on the parameter names.) Then you can change it as follows: pass the original list to the method, and have the method allocate a new list and return it (your method signature would then be something like private List<int> getWithoutDoubles(HashSet<int> exclude, List<int> original)).
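A minimal sketch of that variant, assuming the list length is even; note (as the next edit points out) that the even/odd rule always removes an aligned (even, odd) pair of indices, so the list can be walked in pairs:
private List<int> getWithoutDoubles(HashSet<int> exclude, List<int> original)
{
    var result = new List<int>(original.Count);
    // Walk the list in aligned pairs and keep a pair only if neither member
    // is excluded; each Contains is O(1), so the whole pass is O(n).
    for (int i = 0; i + 1 < original.Count; i += 2)
    {
        if (!exclude.Contains(original[i]) && !exclude.Contains(original[i + 1]))
        {
            result.Add(original[i]);
            result.Add(original[i + 1]);
        }
    }
    return result;
}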
Edit:
It could be even faster if you reorganized the input data in the following way: as the items are always removed in pairs (an even index plus the following odd index), you should pair them in advance! Your list of ints then becomes a list of pairs of ints, and the method might look something like this:
private List<Tuple<int, int>> getWithoutDoubles(
    HashSet<int> exclude, List<Tuple<int, int>> original)
{
    return original.Where(xy => !exclude.Contains(xy.Item1) &&
                                !exclude.Contains(xy.Item2))
                   .ToList();
}
(you remove the pairs where either the first or the second item is in the exclude collection). Instead of Tuple, perhaps you can pack the items into your custom type.
Here is yet another way to get the results.
var a = new List<int> {1, 2, 3, 4, 5};
var b = new List<int> {1, 2, 3};
var c = (from i in a let found = b.Any(j => j == i) where !found select i).ToList();
c will contain 4,5
Reverse your loops so they start at .Count - 1 and go to 0, so you don't have to change i in one of the cases and Count is only evaluated once per collection.
Can you convert the List to a LinkedList and give it a try? List.RemoveAt() is more expensive than LinkedList.Remove().

Better algorithm for a date comparison task

I would like some help making this comparison faster (sample below). The sample steps a comparison variable forward one hour at a time and checks the array for a matching value; if there is none, it adds a placeholder to a second array (the two are concatenated later).
if (ticks.TypeOf == Period.Hour)
    while (compareAt <= endAt)
    {
        if (range.Where(d => d.time.AddMinutes(-d.time.Minute) == compareAt).Count() < 1)
            gaps.Add(new SomeValue() { ...some dummy values.. });
        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
This execution is too expensive once it comes to hours: there are at most 365 * 24 = 8760 values in the array. In the future there will also be minutes per month (60 * 24 * 31 = 44,640), which makes this approach unusable.
If the array were usually complete (meaning no gaps/empty slots), the work could easily be bypassed with if (range.Count() == (hours/day * days)). But that day will not be today.
How would I solve this more efficiently?
One example: if there are 7800 values in the array, we miss about 950, right? Can I find just the gap endpoints and create only the missing values? That would make the O-notation depend on the number of gaps, not the number of values.
A more efficient loop would also be a welcome answer.
[Edit]
Sorry for my bad English; I'm trying my best to describe the problem.
Your performance is low because the range lookup is not using any indexing and rechecks the entire range every time.
One way to do this a lot quicker:
if (ticks.TypeOf == Period.Hour)
{
    // fill a HashSet with the range's unique hourly values
    var rangehs = new HashSet<DateTime>();
    foreach (var r in range)
    {
        rangehs.Add(r.time.AddMinutes(-r.time.Minute));
    }

    // walk all the hours
    while (compareAt <= endAt)
    {
        // quickly check whether it's a gap
        if (!rangehs.Contains(compareAt))
            gaps.Add(new SomeValue() { ...some dummy values.. });
        compareAt = compareAt.AddTicks(ticks.Ticks);
    }
}

C# fastest intersection of 2 sets of sorted numbers

I'm calculating intersection of 2 sets of sorted numbers in a time-critical part of my application. This calculation is the biggest bottleneck of the whole application so I need to speed it up.
I've tried a bunch of simple options and am currently using this:
foreach (var index in firstSet)
{
    if (secondSet.BinarySearch(index) < 0)
        continue;
    // do stuff
}
Both firstSet and secondSet are of type List<int>.
I've also tried using LINQ:
var intersection = firstSet.Where(t => secondSet.BinarySearch(t) >= 0).ToList();
and then looping through intersection.
But as both of these sets are sorted I feel there's a better way to do it. Note that I can't remove items from sets to make them smaller. Both sets usually consist of about 50 items each.
Please help me guys as I don't have a lot of time to get this thing done. Thanks.
NOTE: I'm doing this about 5.3 million times. So every microsecond counts.
If you have two sets which are both sorted, you can implement a faster intersection than anything provided out of the box with LINQ.
Basically, keep two IEnumerator<T> cursors open, one for each set. At any point, advance whichever has the smaller value. If they match at any point, advance them both, and so on until you reach the end of either iterator.
The nice thing about this is that you only need to iterate over each set once, and you can do it in O(1) memory.
Here's a sample implementation - untested, but it does compile :) It assumes that both of the incoming sequences are duplicate-free and sorted, both according to the comparer provided (pass in Comparer<T>.Default):
(There's more text at the end of the answer!)
static IEnumerable<T> IntersectSorted<T>(this IEnumerable<T> sequence1,
                                         IEnumerable<T> sequence2,
                                         IComparer<T> comparer)
{
    using (var cursor1 = sequence1.GetEnumerator())
    using (var cursor2 = sequence2.GetEnumerator())
    {
        if (!cursor1.MoveNext() || !cursor2.MoveNext())
        {
            yield break;
        }
        var value1 = cursor1.Current;
        var value2 = cursor2.Current;

        while (true)
        {
            int comparison = comparer.Compare(value1, value2);
            if (comparison < 0)
            {
                if (!cursor1.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
            }
            else if (comparison > 0)
            {
                if (!cursor2.MoveNext())
                {
                    yield break;
                }
                value2 = cursor2.Current;
            }
            else
            {
                yield return value1;
                if (!cursor1.MoveNext() || !cursor2.MoveNext())
                {
                    yield break;
                }
                value1 = cursor1.Current;
                value2 = cursor2.Current;
            }
        }
    }
}
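Usage with the question's two sorted List<int>s would look something like this:
var intersection = firstSet.IntersectSorted(secondSet, Comparer<int>.Default).ToList();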
EDIT: As noted in comments, in some cases you may have one input which is much larger than the other, in which case you could potentially save a lot of time using a binary search for each element from the smaller set within the larger set. This requires random access to the larger set, however (it's just a prerequisite of binary search). You can even make it slightly better than a naive binary search by using the match from the previous result to give a lower bound to the binary search. So suppose you were looking for values 1000, 2000 and 3000 in a set with every integer from 0 to 19,999. In the first iteration, you'd need to look across the whole set - your starting lower/upper indexes would be 0 and 19,999 respectively. After you'd found a match at index 1000, however, the next step (where you're looking for 2000) can start with a lower index of 2000. As you progress, the range in which you need to search gradually narrows. Whether this is worth the extra implementation cost is a different matter, however.
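A minimal sketch of that narrowing idea (untested, names illustrative), assuming both inputs are sorted and duplicate-free and the larger one is a List<T>:
static IEnumerable<T> IntersectBySearch<T>(IEnumerable<T> smallSorted,
                                           List<T> largeSorted,
                                           IComparer<T> comparer)
{
    int lower = 0;
    foreach (var item in smallSorted)
    {
        if (lower >= largeSorted.Count)
            yield break;
        // search only the tail of the large list we haven't passed yet
        int pos = largeSorted.BinarySearch(lower, largeSorted.Count - lower, item, comparer);
        if (pos >= 0)
        {
            yield return item;
            lower = pos + 1; // the next match must lie beyond this one
        }
        else
        {
            lower = ~pos; // complement of the insertion point
        }
    }
}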
Since both lists are sorted, you can arrive at the solution by iterating over them at most once (you may also get to skip part of one list, depending on the actual values they contain).
This solution keeps a "pointer" to the part of list we have not yet examined, and compares the first not-examined number of each list between them. If one is smaller than the other, the pointer to the list it belongs to is incremented to point to the next number. If they are equal, the number is added to the intersection result and both pointers are incremented.
var firstCount = firstSet.Count;
var secondCount = secondSet.Count;
int firstIndex = 0, secondIndex = 0;
var intersection = new List<int>();

while (firstIndex < firstCount && secondIndex < secondCount)
{
    var comp = firstSet[firstIndex].CompareTo(secondSet[secondIndex]);
    if (comp < 0)
    {
        ++firstIndex;
    }
    else if (comp > 0)
    {
        ++secondIndex;
    }
    else
    {
        intersection.Add(firstSet[firstIndex]);
        ++firstIndex;
        ++secondIndex;
    }
}
The above is a textbook C-style approach to solving this particular problem, and given the simplicity of the code I would be surprised to see a faster solution.
You're using a rather inefficient LINQ method for this sort of task; opt for Intersect as a starting point.
var intersection = firstSet.Intersect(secondSet);
Try this. If you measure it for performance and still find it unwieldy, cry for further help (or perhaps follow Jon Skeet's approach).
I was using Jon's approach, but needed to execute this intersect hundreds of thousands of times for a bulk operation on very large sets, and needed more performance. The case I was running into was heavily imbalanced list sizes (e.g. 5 and 80,000), and I wanted to avoid iterating the entire large list.
I found that detecting the imbalance and switching to an alternate algorithm gave me huge benefits on specific data sets:
public static IEnumerable<T> IntersectSorted<T>(this List<T> sequence1,
                                                List<T> sequence2,
                                                IComparer<T> comparer)
{
    List<T> smallList = null;
    List<T> largeList = null;

    if (sequence1.Count < Math.Log(sequence2.Count, 2))
    {
        smallList = sequence1;
        largeList = sequence2;
    }
    else if (sequence2.Count < Math.Log(sequence1.Count, 2))
    {
        smallList = sequence2;
        largeList = sequence1;
    }

    if (smallList != null)
    {
        foreach (var item in smallList)
        {
            if (largeList.BinarySearch(item, comparer) >= 0)
            {
                yield return item;
            }
        }
    }
    else
    {
        // Use Jon's method
    }
}
I'm still unsure about the point at which you break even; I need to do some more testing.
Try
firstSet.Intersect(secondSet).ToList()
or
firstSet.Join(secondSet, o => o, id => id, (o, id) => o)

Divide List in Equal Parts

I have a big List that may have 50,000 or more items, and I have to perform an operation against each item. Each operation takes some time X, so doing it the conventional, sequential way will take about X * 50,000 on average.
I planned to optimize and save some time, and decided to use BackgroundWorkers, since there is no dependency among the items. The plan was to divide the List into 4 parts and process each part in a separate BackgroundWorker.
I want to ask:
1. Is this method dumb?
2. Is there any better method?
3. Can you suggest a nice and clean way to divide the List into 4 equal parts?
Thanks
If you can use .NET 4.0, then use the Task Parallel Library and have a look at
Parallel.ForEach()
and the Parallel ForEach how-to on MSDN.
Everything is basically the same as a traditional for loop, but you work with parallelism implicitly.
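A minimal sketch of what that looks like, assuming DoWork stands in for your per-item operation and the items share no mutable state:
using System.Collections.Generic;
using System.Threading.Tasks;

class Worker
{
    static void Process(List<int> items)
    {
        // The library partitions the list and schedules the chunks across
        // cores for you; no manual split into 4 parts is needed.
        Parallel.ForEach(items, item =>
        {
            DoWork(item); // hypothetical per-item operation
        });
    }

    static void DoWork(int item) { /* the expensive per-item operation */ }
}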
You can also actually split it into groups.
I didn't see a built-in sequence method for this, so here's the low-level way. Please point out any blunders; I am learning.
static List<T[]> groups<T>(IList<T> original, uint n)
{
    Debug.Assert(n > 0);
    var listlist = new List<T[]>();
    var list = new List<T>();

    for (int i = 0; i < original.Count; i++)
    {
        list.Add(original[i]);

        // flush a completed group, or the final partial group
        if ((i + 1) % n == 0 || i == original.Count - 1)
        {
            listlist.Add(list.ToArray());
            list.Clear();
        }
    }
    return listlist;
}
Another version, based on LINQ:
public static List<T[]> groups<T>(IList<T> original, uint n)
{
    var almostGrouped = original.Select((row, i) => new { Item = row, GroupIndex = i / n });
    var groups = almostGrouped.GroupBy(a => a.GroupIndex, a => a.Item);
    var grouped = groups.Select(a => a.ToArray()).ToList();
    return grouped;
}
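Note that in both versions n is the size of each group, not the number of groups. To split into 4 roughly equal parts you'd compute the group size first; a usage sketch (bigList is a stand-in name):
// 4 parts: each part holds ceiling(count / 4) items; the last may be smaller.
var parts = groups(bigList, (uint)Math.Ceiling(bigList.Count / 4.0));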
This is a good method for optimizing similar, independent operations on a large collection. However, you should look at the Parallel.For method in .NET 4.0. It does all the heavy lifting for you:
http://msdn.microsoft.com/en-us/library/system.threading.tasks.parallel.for.aspx
