My C# program generates random strings from a given pattern. These strings are stored in a list. Since no duplicates are allowed, I'm doing it like this:
List<string> myList = new List<string>();
for (int i = 0; i < total; i++) {
    string random_string = GetRandomString(pattern);
    if (!myList.Contains(random_string)) myList.Add(random_string);
}
As you can imagine, this works fine for several hundred entries. But now I need to generate several million strings, and with each added string the duplicate check gets slower and slower.
Are there any faster ways to avoid duplicates?
Use a data structure that can much more efficiently determine if an item exists, namely a HashSet. It can determine if an item is in the set in constant time, regardless of the number of items in the set.
If you really need the items in a List instead, or you need the items in the resulting list to be in the order they were generated, then you can store the data in both a list and a hashset; adding the item to both collections if it doesn't currently exist in the HashSet.
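For example, a minimal sketch of that combined approach, reusing the GetRandomString call from the question:

HashSet<string> seen = new HashSet<string>();
List<string> myList = new List<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    // HashSet<T>.Add returns false when the item is already present,
    // so the membership test and the insert happen in a single O(1) call.
    if (seen.Add(random_string))
    {
        myList.Add(random_string);
    }
}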
The easiest way is to use this:
myList = myList.Distinct().ToList();
Although this requires building the list first and then creating a second one from it. A better approach might be to write the generator as an iterator method up front:
public IEnumerable<string> GetRandomStrings(int total, string pattern)
{
    for (int i = 0; i < total; i++)
    {
        yield return GetRandomString(pattern);
    }
}
...
myList = GetRandomStrings(total, pattern).Distinct().ToList();
Of course, if you don't need to access items by index, you could probably improve efficiency even more by dropping the ToList and just using an IEnumerable.
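For instance, a rough sketch of that lazy variant (note that Distinct over a random generator may yield fewer than total items if the generator happens to repeat itself):

IEnumerable<string> uniqueStrings = GetRandomStrings(total, pattern).Distinct();
foreach (string s in uniqueStrings)
{
    // Items are produced lazily; no intermediate list is materialized.
    Console.WriteLine(s);
}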
Don't use List<>. Use Dictionary<> or HashSet<> instead!
You could use a HashSet<string> if order is not important:
HashSet<string> myHashSet = new HashSet<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    myHashSet.Add(random_string);
}
The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
MSDN
Or, if you need the elements kept in sorted order, I'd recommend a SortedSet<string> (available since .NET 4.0).
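A minimal sketch of the SortedSet variant, again assuming the GetRandomString call from the question:

SortedSet<string> sortedStrings = new SortedSet<string>();
for (int i = 0; i < total; i++)
{
    // Duplicates are silently ignored; enumeration yields sorted order.
    sortedStrings.Add(GetRandomString(pattern));
}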
Not a good way, but as a kind of quick fix: use a bool flag to check whether the key is already anywhere in the list before adding it.
public void AddKey(string newKey)
{
    bool containsKey = false;
    foreach (string key in MyKeys)
    {
        if (key == newKey)
        {
            containsKey = true;
            break;
        }
    }
    if (!containsKey)
    {
        MyKeys.Add(newKey);
    }
}
A Hashtable would be a faster way to check if an item exists than a list.
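A rough sketch of that idea, using the string itself as the key and ignoring the value:

Hashtable seen = new Hashtable();
List<string> myList = new List<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    if (!seen.ContainsKey(random_string))
    {
        seen[random_string] = null;   // key lookup is roughly constant time
        myList.Add(random_string);
    }
}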
Have you tried:
myList = myList.Distinct().ToList();
For now, the best I could think of is:
bool oneMoreTime = true;
while (oneMoreTime)
{
    ItemType toDelete = null;
    oneMoreTime = false;
    foreach (ItemType item in collection)
    {
        if (ShouldBeDeleted(item))
        {
            toDelete = item;
            break;
        }
    }
    if (toDelete != null)
    {
        collection.Remove(toDelete);
        oneMoreTime = true;
    }
}
I know that I have at least one extra variable here, but I included it to improve the readability of the algorithm.
The "RemoveAll" method is best.
Another common technique is:
var itemsToBeDeleted = collection.Where(i=>ShouldBeDeleted(i)).ToList();
foreach(var itemToBeDeleted in itemsToBeDeleted)
collection.Remove(itemToBeDeleted);
Another common technique is to use a "for" loop, but make sure you go backwards:
for (int i = collection.Count - 1; i >= 0; --i)
if (ShouldBeDeleted(collection[i]))
collection.RemoveAt(i);
Another common technique is to add the items that are not being removed to a new collection:
var newCollection = new List<whatever>();
foreach (var item in collection.Where(i => !ShouldBeDeleted(i)))
    newCollection.Add(item);
And now you have two collections. A technique I particularly like if you want to end up with two collections is to use immutable data structures. With an immutable data structure, "removing" an item does not change the data structure; it gives you back a new data structure (that re-uses bits from the old one, if possible) that does not have the item you removed. With immutable data structures you are not modifying the thing you're iterating over, so there's no problem:
var newCollection = oldCollection;
foreach (var item in oldCollection.Where(i => ShouldBeDeleted(i)))
    newCollection = newCollection.Remove(item);
or
var newCollection = ImmutableCollection<whatever>.Empty;
foreach (var item in oldCollection.Where(i => !ShouldBeDeleted(i)))
    newCollection = newCollection.Add(item);
And when you're done, you have two collections. The new one has the items removed, the old one is the same as it ever was.
Just as I finished typing, I remembered that there is a lambda way to do it.
collection.RemoveAll(i=>ShouldBeDeleted(i));
Better way?
A forward variation on the backward for loop:
for (int i = 0; i < collection.Count; )
    if (ShouldBeDeleted(collection[i]))
        collection.RemoveAt(i);
    else
        i++;
You cannot delete from a collection inside a foreach loop (unless it is a very special collection having a special enumerator). The BCL collections will throw exceptions if the collection is modified while it is being enumerated.
You could use a for loop to delete individual elements and adjust the index accordingly. However, doing that can be error prone. Depending on the implementation of the underlying collection, it may also be expensive to delete individual elements. For instance, deleting the first element of a List<T> will copy all the remaining elements in the list.
The best solution is often to create a new collection based on the old:
var newCollection = collection.Where(item => !ShouldBeDeleted(item)).ToList();
Use ToList() or ToArray() to create the new collection or initialize your specific collection type from the IEnumerable returned by the Where() clause.
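For example (a sketch, assuming ItemType from the question is the element type and Queue<ItemType> is a hypothetical target collection):

var keptList  = collection.Where(item => !ShouldBeDeleted(item)).ToList();
var keptArray = collection.Where(item => !ShouldBeDeleted(item)).ToArray();
// Most concrete collection types have a constructor that accepts an IEnumerable<T>.
var keptQueue = new Queue<ItemType>(collection.Where(item => !ShouldBeDeleted(item)));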
The lambda way is good. You could also use a regular for loop; unlike a foreach loop, a for loop lets you modify the list you are iterating over within the loop itself.
for (int i = collection.Count - 1; i >= 0; i--)
{
    if (ShouldBeDeleted(collection[i]))
        collection.RemoveAt(i);
}
I am assuming that collection is an ArrayList here; the code might be a bit different if you are using a different data structure.
I have an ArrayList that contains values from different databases, but it stores some duplicate values, so I want to remove the duplicates and keep only unique values in the ArrayList.
How can this be done?
Let's try another approach: instead of removing duplicates, avoid adding them in the first place. This might be more efficient in your environment.
Here's some sample code:
ArrayList uniqueList = new ArrayList();
foreach (string aString in myList)   // myList is the original list that may contain duplicates
{
    if (!uniqueList.Contains(aString))
    {
        uniqueList.Add(aString);
    }
}
You can replace your ArrayList with a HashSet. From the documentation:
The HashSet<T> class provides high performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
If it's absolutely necessary to use an ArrayList, you could use some LINQ to remove duplicates with the Distinct command (Cast is needed because ArrayList is non-generic; this assumes it holds strings):
var distinctItems = arrayList.Cast<string>().Distinct();
You can use this code when working with an ArrayList:
ArrayList arrayList = new ArrayList();

// Add some members :)
arrayList.Add("ali");
arrayList.Add("hadi");
arrayList.Add("ali");

// Remove duplicates from the ArrayList
for (int i = 0; i < arrayList.Count; i++)
{
    for (int j = i + 1; j < arrayList.Count; j++)
    {
        if (arrayList[i].ToString() == arrayList[j].ToString())
        {
            arrayList.RemoveAt(j);
            j--;   // stay at the same index after removing an element
        }
    }
}
If you can, you should use a HashSet or any other set class; it is much more efficient for this kind of operation. The main drawback of a HashSet is that the ordering of the elements is not guaranteed to remain the same as in your original list (which may or may not be a problem, depending on your requirements).
Otherwise, if you need to keep the ordering but only need the duplicates removed when you enumerate the values, you can use the Distinct method from LINQ. Just be careful not to run this query and copy the result every time you modify your ArrayList, as that would hurt performance.
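A short sketch of that Distinct approach, assuming the ArrayList holds strings:

// Cast is needed because ArrayList is non-generic; run this once and keep the result.
var uniqueValues = arrayList.Cast<string>().Distinct().ToList();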
Hashtable ht = new Hashtable();
foreach (string item in originalArray)
{
    // set a key in the hashtable for our arraylist value - leaving the hashtable value empty
    ht[item] = null;
}
// now grab the keys from that hashtable into another arraylist
ArrayList distinctArray = new ArrayList(ht.Keys);
If you must use ArrayList, use the Sort method.
Here's a good link: the Sort method of ArrayList.
After the list is sorted, then use an algorithm to iterate/compare all your elements and remove the duplicates.
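A rough sketch of that sort-then-scan idea, assuming the ArrayList holds strings (after sorting, duplicates sit next to each other, so one backwards pass removes them):

arrayList.Sort();
for (int i = arrayList.Count - 1; i > 0; i--)
{
    // Equal values are adjacent after sorting.
    if (arrayList[i].ToString() == arrayList[i - 1].ToString())
    {
        arrayList.RemoveAt(i);
    }
}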
Have fun,
Tommy Kwee
I'm trying to write a function that takes a list, sorts it, and deletes the duplicates.
It sorts fine, but it doesn't delete the duplicates.
What's the problem?
void sort_del(List<double> slist)
{
    // here I sort slist
    // at this point slist is sorted but still contains duplicates

    List<double> rlist = new List<double>();
    int new_i = 0;
    rlist.Add(slist[0]);
    for (int i = 0; i < slist.Count; i++)
    {
        if (slist[i] != rlist[new_i])
        {
            rlist.Add(slist[i]);
            new_i++;
        }
    }
    slist = new List<double>(rlist);
    // here I expect slist to be free of duplicates
}
It does not work because slist is passed by value. Assigning rlist to it has no effect on the caller's end. Your algorithm for detecting duplicates seems fine. If you do not want to use a more elegant LINQ way suggested in the other answer, change the method to return your list:
List<double> sort_del(List<double> slist){
// Do your stuff
return rlist;
}
With double you can just use Distinct():
slist = new List<double>(rlist.Distinct());
or maybe:
slist = slist.Distinct().OrderBy(x => x).ToList();
You're not modifying the underlying list. You're trying to add to a new collection, and you're not checking if the new one contains one of the old ones correctly.
If you're required to do this for homework (which seems likely, as there are data structures and easy ways to do this with LINQ that others have pointed out), you should break the sorting and the removal of duplicates into two separate methods. The method that removes duplicates should accept a list as a parameter (as this one does) and return a new list instance without duplicates.
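A hedged sketch of what that separation might look like (the method name is just illustrative):

static List<double> RemoveDuplicates(List<double> sorted)
{
    List<double> result = new List<double>();
    foreach (double value in sorted)
    {
        // The input is sorted, so a duplicate can only be the previous element.
        if (result.Count == 0 || result[result.Count - 1] != value)
        {
            result.Add(value);
        }
    }
    return result;
}

// usage: sort first, then strip duplicates
// slist.Sort();
// slist = RemoveDuplicates(slist);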
Hi I'm working on some legacy code that goes something along the lines of
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.Remove(results[i]);
    }
}
To me it seems like bad practice to be removing the elements while still iterating through the loop because you'll be modifying the indexes.
Is this a correct assumption?
Is there a better way of doing this? I would like to use LINQ but I'm in 2.0 Framework
The removal is actually OK, since you are going downwards to zero; only the indexes that you have already passed are affected. One thing to watch out for: the loop must start at results.Count - 1, not results.Count, since array indexes start at 0.
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.RemoveAt(i);
    }
}
Edit:
As was pointed out - you actually must be dealing with a List of some sort in your pseudo-code. In this case they are conceptually the same (since Lists use an Array internally) but if you use an array you have a Length property (instead of a Count property) and you can not add or remove items.
With a list, the solution above is certainly concise, but it might not be easy to understand for someone who has to maintain the code (especially the backwards iteration). An alternative is to first identify the items to remove, and then remove them in a second pass.
Just substitute MyType with the actual type you are dealing with:
List<MyType> removeItems = new List<MyType>();
foreach (MyType item in results)
{
    if (someCondition)
    {
        removeItems.Add(item);
    }
}
foreach (MyType item in removeItems)
    results.Remove(item);
It doesn't seem like the Remove should work at all. The IList implementation should fail if we're dealing with a fixed-size array, see here.
That being said, if you're dealing with a resizable list (e.g. List<T>), why call Remove instead of RemoveAt? Since you're already navigating the indices in reverse, you don't need to "re-find" the item.
May I suggest a somewhat more functional alternative to your current code:
Instead of modifying the existing array one item at a time, you could derive a new one from it and then replace the whole array as an "atomic" operation once you're done:
The easy way (no LINQ, but very similar):
Predicate<T> filter = delegate(T item) { return !someCondition; };
results = Array.FindAll(results, filter);
// with LINQ: results = results.Where(item => !someCondition).ToArray();
where T is the type of the items in your results array.
A somewhat more explicit alternative:
var newResults = new List<T>();
foreach (T item in results)
{
    if (!someCondition)
    {
        newResults.Add(item);
    }
}
results = newResults.ToArray();
Usually you wouldn't remove elements as such, you would create a new array from the old without the unwanted elements.
If you do go the route of removing elements from an array/list your loop should count down rather than up. (as yours does)
a couple of options:
List<int> indexesToRemove = new List<int>();
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        indexesToRemove.Add(i);
    }
}
// indexesToRemove holds descending indexes, so RemoveAt is safe here
foreach (int i in indexesToRemove)
{
    results.RemoveAt(i);
}
or alternatively, you could make a copy of the existing list and remove from the original list instead:
// temp is a copy of results
for (int i = temp.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.Remove(temp[i]);
    }
}
I'm having issues finding the most efficient way to remove duplicates from a list of strings (List<string>).
My current implementation is a dual foreach loop that checks that the instance count of each object is only 1, and otherwise removes the second occurrence.
I know there are MANY other questions out there, but all the best solutions require something newer than .NET 2.0, which is the current build environment I'm working in. (GM and Chrysler are very resistant to changes ... :) )
This limits the possible solutions by not allowing any LINQ or HashSet.
The code I'm using is Visual C++, but a C# solution will work just fine as well.
Thanks!
This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...
Do you have control over this? If so, all you'd need to do is a myList.Contains(currentItem) call before you add the item and you're set
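Something like this minimal sketch, assuming you control the code that adds the strings (GetNextValue is a hypothetical source):

string currentItem = GetNextValue();
if (!myList.Contains(currentItem))
{
    myList.Add(currentItem);
}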
You could do the following.
List<string> list = GetTheList();
Dictionary<string, object> map = new Dictionary<string, object>();
int i = 0;
while (i < list.Count) {
    string current = list[i];
    if (map.ContainsKey(current)) {
        list.RemoveAt(i);
    } else {
        i++;
        map.Add(current, null);
    }
}
This has the overhead of building a Dictionary<TKey,TValue> object which will duplicate the list of unique values in the list. But it's fairly efficient speed wise.
I'm no Comp Sci PhD, but I'd imagine using a dictionary, with the items in your list as the keys would be fast.
Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of iteration.
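A quick sketch of that idea on .NET 2.0 (the value type doesn't matter, only the keys are used; note the key order is not guaranteed to match the original list):

Dictionary<string, bool> keys = new Dictionary<string, bool>();
foreach (string s in list)
{
    keys[s] = true;   // the indexer silently overwrites, so duplicates collapse
}
List<string> unique = new List<string>(keys.Keys);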
Just remember, when using a custom class, to override the Equals() method so that Contains() works as required.
Example
List<CustomClass> clz = new List<CustomClass>();

public class CustomClass
{
    public override bool Equals(object obj)
    {
        // Put the equality comparison here...
        // (and keep GetHashCode consistent with Equals)
    }
}
If you're going the route of "just don't add duplicates", then checking List.Contains before adding an item works, but it's O(n^2), where n is the number of strings you want to add. It's no different from your current solution using two nested loops.
You'll have better luck using a hashset to store items you've already added, but since you're using .NET 2.0, a Dictionary can substitute for a hash set:
static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}
This runs in O(n) and uses O(2n) space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings -- if you really need maximum performance, you can exploit more powerful data structures like tries to make inserts even faster.