Removing list items from another list - C#

I have a list with some elements and I want to remove elements from another list. An item should be removed if its value Contains (not equals) the value from another list.
One of the ways is to do this:
var MyList = new List<string> { ... };
var ToRemove = new List<string> { ... };
MyList.RemoveAll(_ => ToRemove.Any(_.Contains));
It works...
but I have a LOT of lists (>1 million), and since ToRemove can be sorted, it would make sense to use that to speed up the process.
It's easy to make a loop that does it, but is there a way to do this with the sorted collections?
Update:
On 20k iterations on a text with our forbidden list, I get this:
Forbidden list as List -> 00:00:07.1993364
Forbidden list as HashSet -> 00:00:07.9749997
It's consistent across multiple runs, so the HashSet is actually slower here.

Since this is a removal of strings that contain strings from another list, a HashSet wouldn't be much help. Actually, not much would be, unless you were looking for exact full matches or maintained an index of all substrings (expensive, and AFAIK only SQL Server does this semi-efficiently outside the big-data realm).
If all you cared about was whether an item starts with one of the items in ToRemove, sorting could help. Sort MyList, then for each string in ToRemove do a custom binary search to find the first string starting with it, and remove entries from that index until items no longer start with it, as sketched below.
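A minimal sketch of that idea, assuming the condition is really a prefix match (StartsWith) rather than an arbitrary substring match, and using ordinal ordering so strings sharing a prefix sit in one contiguous block (RemoveRange stands in for the RemoveAt loop described above):
static void RemoveByPrefix(List<string> myList, IEnumerable<string> prefixes)
{
    myList.Sort(StringComparer.Ordinal);
    foreach (var prefix in prefixes)
    {
        // BinarySearch returns the bitwise complement of the insertion point
        // when the prefix itself is absent; either way the block of matching
        // strings starts at that position.
        int index = myList.BinarySearch(prefix, StringComparer.Ordinal);
        if (index < 0) index = ~index;
        int count = 0;
        while (index + count < myList.Count &&
               myList[index + count].StartsWith(prefix, StringComparison.Ordinal))
            count++;
        if (count > 0) myList.RemoveRange(index, count);
    }
}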

Well, sorting ToRemove may be beneficial because of binary search's O(log n) complexity (you would need to rewrite _ => ToRemove.Any(_.Contains) accordingly).
But using a HashSet<string> instead of a List<string> for ToRemove will be much faster, because finding an element in a hash set (using Contains) is an O(1) operation.
Also, using a LinkedList<string> for MyList can potentially be beneficial, since removing an item from a linked list avoids the element shifting that an array-based list performs on every removal.

Related

Check for duplicates before adding, or just select distinct when finished?

Problem: building a List<string> by adding strings one by one, iteratively. The end result needs to be a List<string> with no duplicates.
1) When adding strings to the list, check if the string is already there using myList.Contains(myString) and if not then myList.Add(myString). The end result will have no duplicates but the list is checked every time.
2) Just myList.Add(myString) without any care, without checking the list each time, then when needed, use some technique to effectively SELECT DISTINCT from the list. e.g. https://stackoverflow.com/a/7572073/1061602 - that answer would be tweaked as the end result would need to be just a List<string>
Q: Which would be the best approach in terms of efficiency and readability (1, 2 or something else?).
The List<string> would not be excessively large, but could end up containing around 10 strings, with around 200 checks. This could scale up to around 30 strings with around 600 checks.
The best approach would be to use a HashSet<string> instead, which will "eat up" duplicates transparently. When using a set the "check if duplicate" operation is not only automatic but also much faster than when using a list (constant time vs linear).
If your string consumer can consume IEnumerable<string> that's all you will need to do; otherwise, use Enumerable.ToList to convert the set to a list.
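A minimal sketch of that (incomingStrings is a hypothetical stand-in for wherever the strings come from; note that a HashSet does not guarantee any particular ordering):
var unique = new HashSet<string>();
foreach (var s in incomingStrings)
    unique.Add(s); // Add silently ignores duplicates, O(1) on average
List<string> result = unique.ToList(); // only if a List<string> is really required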
In your scenario you should use a HashSet, as it gives you O(1) containment checks, whereas a List does it in O(n) unless it is sorted.

Get certain item from each Tuple from List

What is the correct way to go about creating a list of, say, the first item of each Tuple in a List of Tuples?
If I have a List<Tuple<string,string>>, how would I get a List<string> of the first string in each Tuple?
A little Linq will do the trick:
var myStringList = myTupleList.Select(t => t.Item1).ToList();
As an explanation, since Tim posted pretty much the same answer, Select() creates a 1:1 "projection"; it takes each input element of the Enumerable, and for each of them it evaluates the lambda expression and returns the result as an element of a new Enumerable having the same number of elements. ToList() will then spin through the Enumerable produced by Select(), and add each element one at a time to a new List<T> instance.
Tim has a good point on the memory-efficiency; ToList() will create a list and add the elements one at a time, which will cause the List to keep resizing its underlying array, doubling it each time to ensure it has the proper capacity. For a big list, that could cause OutOfMemoryExceptions, and it will cause the CLR to allocate more memory than necessary to the List unless the number of elements happens to be a power of 2.
List<string> list = tuples.Select(t => t.Item1).ToList();
or, potentially less memory expensive:
List<string> list = new List<String>(tuples.Count);
list.AddRange(tuples.Select(t => t.Item1));
because it avoids the doubling algorithm of List.Add in ToList.
If you have a List<Tuple<string, string>> listoftuples, you can use the List's implementation of the Select method to take the first string from each Tuple.
It would look like this:
List<string> firstelements = listoftuples.Select(t => t.Item1).ToList();
Generalised variant:
for selecting a particular item where the length of the collection's tuples is unknown, i.e. 2, 3, 4, ...:
static IEnumerable<object> TupleListSelectQuery<T>(IEnumerable<T> lst, int index) where T : IStructuralEquatable, IStructuralComparable, IComparable
{
    // "Item1", "Item2", ... are properties on System.Tuple, resolved via reflection.
    return lst.Select(t => typeof(T).GetProperty("Item" + index).GetValue(t, null)).ToList();
}
where index's value corresponds to the way tuple items are numbered, i.e. 1, 2, 3, ... (not 0, 1, 2, ...).
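Hypothetical usage, pulling Item2 from a list of 3-tuples:
var triples = new List<Tuple<string, string, int>>
{
    Tuple.Create("a", "b", 1),
    Tuple.Create("c", "d", 2)
};
var seconds = TupleListSelectQuery(triples, 2); // yields "b", "d"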

Set allowing quick insert/deletion and random selection in C#

What data structure could I use in C# to allow quick insertion/deletion as well as uniform random selection? A List has slow deletion by element (since it needs to find the index of the element each time), while a HashSet does not seem to allow random selection of an element (without copying to a list.)
The data structure will be updated continuously, so insertion and deletion need to be online procedures. It seems as if there should be a way to make insertion, deletion, and random selection all O(log n).
A binary search tree with arbitrary integer keys assigned to the objects would solve all of these problems, but I can't find the appropriate class in the C# standard library. Is there a canonical way to solve this without writing a custom binary search tree?
There is already a BST in the .NET BCL: it's called SortedDictionary<TKey, TValue>. If you don't want key/value pairs but single items, you can use SortedSet<T> (new in .NET 4.0).
It sounds like from your example you'd want a SortedDictionary<int, WhateverValueType>. Though I'm not sure exactly what you are after when you say "uniform random selection".
Of course, the Dictionary<TKey, TValue> is O(1) which is much faster. So unless you have a need for sorted order of the keys, I'd use that.
UPDATE: From the sounds of your needs, you're going to have a catch-22 on efficiency. To be able to jump to a random contiguous index in the data structure, how often will you be inserting/deleting? If not often, you could use an array and just Sort() after inserts (O(n log n)), or always insert/delete in order (O(n)).
Or, you could wrap a Dictionary<int, YourType> and keep a parallel List<int> and update it after every Add/Delete:
_dictionary.Add(newIndex, newValue);
_indexes.Add(newIndex);
And then just access a random index from the list on lookups. The nice thing is that with this method the Add() will be ~O(1) (unless the List resizes, but you can set an initial capacity to avoid some of that), but you would incur an O(n) cost on removes.
I'm afraid the problem is you'll sacrifice time either on the lookups or on the deletes/inserts. The problem is that all the best access-time containers are non-contiguous. With the dual List<int>/Dictionary<int, YourValue> combo, though, you'd have a pretty good mix.
UPDATE 2: It sounds from our continued discussion like, if absolute performance is your requirement, you may have better luck rolling your own. It was fun to think about though; I'll update if I think of anything else.
Binary search trees and derived structures, like SortedDictionary or SortedSet, operate by comparing keys.
Your objects are not comparable by themselves, but they offer object identity and a hash value. Therefore, a HashSet is the right data structure. Note: a Dictionary<int, YourType> is not appropriate, because removal becomes a linear search (O(n)) and doesn't solve the random-selection problem after removals.
Insert is O(1)
Remove is O(1)
RandomElement is O(n). It can easily be implemented, e.g.
set.ElementAt(random.Next(set.Count))
No copying to an intermediate list is necessary.
I realize that this question is over 3 years old, but just for people who come across this page:
If you don't need to keep the items in the data set sorted, you can just use a List<ItemType>.
Insertion and random selection are O(1). You can do deletion in O(1) by just moving the last item to the position of the item you want to delete and removing it from the end.
Code:
using System; // For the Random
using System.Collections.Generic; // The List
// List:
List<ItemType> list = new List<ItemType>();
// Add x:
ItemType x = ...; // The item to insert into the list
list.Add( x );
// Random selection
Random r = ...; // Probably get this from somewhere else
int index = r.Next( list.Count );
ItemType y = list[index];
// Remove item at index
list[index] = list[list.Count - 1]; // Copy last item to index
list.RemoveAt( list.Count - 1 ); // Remove from end of list
EDIT: Of course, to remove an element from the List<ItemType> you'll need to know its index. If you want to remove a random element, you can use a random index (as done in the example above). If you want to remove a given item, you can keep a Dictionary<ItemType,int> which maps the items to their indices. Adding, removing and updating these indices can all be done in O(1) (amortized).
Together this results in a complexity of O(1) (amortized) for all operations.
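A sketch of the combined structure described in the edit (the class and member names are illustrative, not from any library):
class RandomizedSet<T>
{
    private readonly List<T> _items = new List<T>();
    private readonly Dictionary<T, int> _indexOf = new Dictionary<T, int>();
    private readonly Random _rng = new Random();

    public bool Add(T item)
    {
        if (_indexOf.ContainsKey(item)) return false;
        _indexOf[item] = _items.Count;
        _items.Add(item);
        return true; // amortized O(1)
    }

    public bool Remove(T item)
    {
        int index;
        if (!_indexOf.TryGetValue(item, out index)) return false;
        int lastIndex = _items.Count - 1;
        T last = _items[lastIndex];
        _items[index] = last;   // move the last item into the hole
        _indexOf[last] = index; // and fix up its recorded index
        _items.RemoveAt(lastIndex);
        _indexOf.Remove(item);
        return true; // O(1)
    }

    public T RandomElement()
    {
        return _items[_rng.Next(_items.Count)]; // O(1)
    }
}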

Finding all lines in a list that contain x or y?

Can I do this without looping through the whole list?
List<string> responseLines = new List<string>();
the list is then filled with around 300 lines of text.
Next I want to search the list and create a second list of all lines that either start with "abc" or contain "xyz".
I know I can do a foreach, but is there a better/faster way?
You could use LINQ. This is no different performance-wise from using foreach (that's pretty much what it does behind the scenes), but you might prefer the syntax:
var query = responseLines.Where(s => s.StartsWith("abc") || s.Contains("xyz"))
                         .ToList();
(If you're happy dealing with an IEnumerable<string> rather than List<string> then you can omit the final ToList call.)
var newList = (from line in responseLines
               where line.StartsWith("abc") || line.Contains("xyz")
               select line).ToList();
Try this:
List<string> responseLines = new List<string>();
List<string> myLines = responseLines.Where(line => line.StartsWith("abc", StringComparison.InvariantCultureIgnoreCase) || line.Contains("xyz")).ToList();
The StartsWith and Contains short-circuit: Contains will only be evaluated if StartsWith is not satisfied. This still iterates the whole list (there is no way to avoid that if you want to check every line), but it saves you from typing a foreach.
Use LINQ:
List<string> list = responseLines.Where(x => x.StartsWith("abc") || x.Contains("xyz")).ToList();
Unless you need all the text for some reason, it would be quicker to inspect each line at the time when you were generating the List and discard the ones that don't match without ever adding them.
This depends on how the List is loaded as well - that code is not shown. This would be effective if you were reading from a text file since then you could just use your LINQ query to operate directly on the input data using File.ReadLines as the source instead of the final List<string>.
var query = File.ReadLines("input.txt")
    .Where(s => s.StartsWith("abc") || s.Contains("xyz"))
    .ToList();
LINQ works well as far as offering you improved syntax for this sort of thing (See LukeH's answer for a good example), but it isn't any faster than iterating over it by hand.
If you need to do this operation often, you might want to come up with some kind of indexed data structure that watches for all "abc" or "xyz" strings as they come into the list, and can thereby use a faster algorithm for serving them up when asked, rather than iterating through the whole list.
If you don't have to do it often, it's probably a "premature optimization."
Quite simply, there is no algorithm that can guarantee you will never have to iterate through every item in the list. However, it is possible to improve the average number of items you need to iterate through: sort the list before you begin your search. Then the only time you would have to iterate through the entire list would be when every item matches.
Assuming that it's not practical for you to have a pre-sorted list by the time you need to search through it, then the only way to improve the speed of your search would be to use a different data structure than a list - for example, a binary search tree.

What .NET collection provides the fastest search

I have 60k items that need to be checked against a 20k lookup list. Is there a collection object (like List, HashTable) that provides an exceptionally fast Contains() method? Or will I have to write my own? In other words, does the default Contains() method just scan each item, or does it use a better search algorithm?
foreach (Record item in LargeCollection)
{
if (LookupCollection.Contains(item.Key))
{
// Do something
}
}
Note. The lookup list is already sorted.
In the most general case, consider System.Collections.Generic.HashSet as your default "Contains" workhorse data structure, because it takes constant time to evaluate Contains.
The actual answer to "What is the fastest searchable collection" depends on your specific data size, ordered-ness, cost-of-hashing, and search frequency.
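Applied to the question's loop, that looks something like this (lookupKeys is a hypothetical stand-in for the source of the 20k lookup values):
var lookup = new HashSet<string>(lookupKeys); // build once, O(m)

foreach (Record item in LargeCollection)
{
    if (lookup.Contains(item.Key)) // O(1) on average per check
    {
        // Do something
    }
}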
If you don't need ordering, try HashSet<Record> (new to .Net 3.5)
If you do, use a List<Record> and call BinarySearch.
Have you considered List.BinarySearch(item)?
You said that your lookup collection is already sorted, so this seems like the perfect opportunity? A hash would definitely be the fastest, but it brings its own problems and requires a lot more storage overhead.
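For instance, a sketch assuming the lookup list is a sorted List<string> of keys:
// BinarySearch returns a non-negative index on a hit and the
// bitwise complement of the insertion point on a miss.
bool found = sortedLookupList.BinarySearch(item.Key) >= 0;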
You should read this blog that speed tested several different types of collections and methods for each using both single and multi-threaded techniques.
According to the results, a BinarySearch on a List and a SortedList were the top performers, constantly running neck and neck when looking up something as a "value".
When using a collection that allows "keys", the Dictionary, ConcurrentDictionary, HashSet, and Hashtable performed the best overall.
I've put a test together:
First: generate all possible 3-character combinations of A-Z and 0-9.
Then: fill each of the collections mentioned here with those strings.
Finally: search and time each collection for a random string (the same string for each collection).
This test simulates a lookup when there is guaranteed to be a result.
Then I changed the initial collection from all possible combinations to only 10,000 random 3-character combinations. This should give about a 1 in 4.6 hit rate for a random 3-char lookup, making it a test where a result is not guaranteed, and I ran the test again.
IMHO Hashtable, although the fastest, isn't always the most convenient, since you're working with objects. But a HashSet is so close behind that it's probably the one to recommend.
Just for fun (you know, FUN) I ran it with 1.68M rows (4 characters).
Keep both lists x and y in sorted order.
If x = y, do your action; if x < y, advance x; if y < x, advance y, until either list is empty.
The run time of this intersection is linear in the combined size of the lists: each element is visited at most once.
Don't run a .Contains() loop: that is proportional to size(x) * size(y), which is much worse.
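A sketch of that merge walk over two ordinally sorted lists (names are illustrative):
static void IntersectSorted(List<string> x, List<string> y, Action<string> onMatch)
{
    int i = 0, j = 0;
    while (i < x.Count && j < y.Count) // stops when either list is exhausted
    {
        int cmp = string.CompareOrdinal(x[i], y[j]);
        if (cmp == 0) { onMatch(x[i]); i++; j++; }
        else if (cmp < 0) i++; // advance x
        else j++;              // advance y
    }
}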
If it's possible to sort your items then there is a much faster way to do this than doing key lookups into a hashtable or b-tree. Though if your items aren't sortable you can't really put them into a b-tree anyway.
Anyway, if they are sortable, sort both lists; then it's just a matter of walking the lookup list in order.
Walk the lookup list:
    While the next check-list item <= the current lookup-list item, consume it;
        if it = the lookup-list item, do something.
    Move to the next lookup-list item.
If you're using .Net 3.5, you can make cleaner code using:
foreach (Record item in LookupCollection.Intersect(LargeCollection))
{
//dostuff
}
I don't have .Net 3.5 here, so this is untested. It relies on an extension method. Note that LookupCollection.Intersect(LargeCollection) is probably not the same as LargeCollection.Intersect(LookupCollection) ... the latter is probably much slower.
This assumes LookupCollection is a HashSet
If you aren't worried about squeezing out every last bit of performance, the suggestion to use a HashSet or binary search is solid. Your datasets just aren't large enough for this to be a problem 99% of the time.
But if this just one of thousands of times you are going to do this and performance is critical (and proven to be unacceptable using HashSet/binary search), you could certainly write your own algorithm that walked the sorted lists doing comparisons as you went. Each list would be walked at most once and in the pathological cases wouldn't be bad (once you went this route you'd probably find that the comparison, assuming it's a string or other non-integral value, would be the real expense and that optimizing that would be the next step).
