I'm having issues finding the most efficient way to remove duplicates from a list of strings (List<string>).
My current implementation is a pair of nested foreach loops that checks that each object occurs only once, removing the second instance when it doesn't.
I know there are MANY other questions out there, but the best solutions all require something above .NET 2.0, which is the build environment I'm working in. (GM and Chrysler are very resistant to change ... :) )
This rules out LINQ and HashSet.
The code I'm using is Visual C++, but a C# solution will work just fine as well.
Thanks!
This probably isn't what you're looking for, but if you have control over this, the most efficient way would be to not add them in the first place...
Do you have control over this? If so, all you'd need is a myList.Contains(currentItem) check before you add the item, and you're set.
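A minimal sketch of that guard, using the names from the sentence above:

if (!myList.Contains(currentItem))
{
    myList.Add(currentItem); // only add when not already present
}

Keep in mind Contains is a linear scan, so this is O(n) per insert; that's fine for small lists.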
You could do the following.
List<string> list = GetTheList();
Dictionary<string, object> map = new Dictionary<string, object>();
int i = 0;
while (i < list.Count)
{
    string current = list[i];
    if (map.ContainsKey(current))
    {
        list.RemoveAt(i);
    }
    else
    {
        i++;
        map.Add(current, null);
    }
}
This has the overhead of building a Dictionary<TKey,TValue> object, which duplicates the set of unique values in the list, but it's fairly efficient speed-wise. (Each RemoveAt is itself O(n), though, so with many duplicates it's cheaper to build a new list instead.)
I'm no Comp Sci PhD, but I'd imagine using a dictionary with the items in your list as the keys would be fast.
Since a dictionary doesn't allow duplicate keys, you'd only have unique strings at the end of the iteration.
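For example, a minimal .NET 2.0 sketch (the object value is just a dummy):

Dictionary<string, object> unique = new Dictionary<string, object>();
foreach (string s in list)
{
    unique[s] = null; // the indexer overwrites, so duplicates are silently collapsed
}
List<string> result = new List<string>(unique.Keys);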
Just remember, when using a custom class, to override the Equals() method so that Contains() functions as required.
Example
List<CustomClass> clz = new List<CustomClass>();

public class CustomClass
{
    public string Name;

    public override bool Equals(object obj)
    {
        CustomClass other = obj as CustomClass;
        return other != null && other.Name == this.Name; // your equality logic here
    }

    public override int GetHashCode() { return Name == null ? 0 : Name.GetHashCode(); } // keep consistent with Equals
}
If you're going the route of "just don't add duplicates", then checking List.Contains before adding an item works, but it's O(n^2) where n is the number of strings you want to add. It's no different from your current solution using two nested loops.
You'll have better luck using a hash set to store items you've already added, and since you're using .NET 2.0, a Dictionary can substitute for a hash set:
static List<T> RemoveDuplicates<T>(List<T> input)
{
    List<T> result = new List<T>(input.Count);
    Dictionary<T, object> hashSet = new Dictionary<T, object>();
    foreach (T s in input)
    {
        if (!hashSet.ContainsKey(s))
        {
            result.Add(s);
            hashSet.Add(s, null);
        }
    }
    return result;
}
This runs in O(n) and uses O(n) extra space; it will generally work very well for up to 100K items. Actual performance depends on the average length of the strings. If you really need maximum performance, you can exploit more powerful data structures like tries to make inserts even faster.
Related
I need to have a property that will be an array that can hold both ints and strings.
If I set the property to an array of ints, it should hold ints so that searching through the array is fast; at odd times this property will also contain strings, for which the search will be slow.
Is there any other way, other than the following, to have a list that contains native types?
two properties, one for ints and one for strings
use List<object>
UPDATE:
The use-case is as follows. I have a database field [ReferenceNumber] that holds the values (integers and strings) and another field [SourceID] (used for other things) that can be used to determine whether a record holds an int or a string.
I will be fetching collections of these records based on the source id; depending on the source, the list will hold either integers or strings. Then I will go through this collection looking for certain reference numbers: if they don't already exist, I add them. I will be pre-fetching a lot of records instead of hitting the database over and over.
So, for example, if I get a list for SourceID = 1, the values are ints, and when searching I want the underlying list to be int so the search is fast. If SourceID is 2, the values are strings; that case is rare, and a slow search is acceptable because there aren't many such records.
I will go through this collection looking for certain reference numbers: if they don't already exist, I add them.
It sounds to me like you don't need a List<>, but rather a HashSet<>. Simply use a HashSet<object>, and Add() all the items, and the collection will ignore duplicate items. It will be super-fast, regardless of whether you're dealing with ints or strings.
On my computer, the following code shows that it takes about 50 milliseconds to populate an initial 400,000 unique strings in the hashset, and about 2 milliseconds to add an additional 10,000 random strings:
var sw = new Stopwatch();
var initial = Enumerable.Range(1, 400000).Select(i => i.ToString()).ToList();
sw.Start();
var set = new HashSet<object>(initial);
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);

var random = new Random();
var additional = Enumerable.Range(1, 10000).Select(i => random.Next(1000000).ToString()).ToList();
sw.Restart();
foreach (var item in additional)
{
    set.Add(item);
}
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Also, in case it's important: HashSet<> does not guarantee insertion order. If you need the original order, keep a parallel List<> alongside the set.
The only other thing I would suggest is a custom object that implements IComparable:
class Multitype : IComparable
{
    public int? Number { get; set; }
    public string Words { get; set; }

    public int CompareTo(object obj)
    {
        Multitype other = obj as Multitype;
        if (Number != null && other != null && other.Number != null)
        {
            return Number.Value.CompareTo(other.Number.Value); // compare as numbers
        }
        else
        {
            // fall back to comparing the string parts (define whatever ordering you need)
            return string.Compare(Words, other == null ? null : other.Words);
        }
    }
}
There will be some extra comparison steps between numbers, but not as much as string parsing.
Are you storing a ton of data? Is that performance difference really going to matter?
It's possible to use generics if you implement them on the class. Not sure if this solves your problem. Would be interested to hear a real-world example of a property that can have different types.
class Foo<T>
{
    public List<T> GenericList { get; set; }

    public Foo()
    {
        this.GenericList = new List<T>();
    }
}
If by "use List" you mean the object primitive or provided System.Object, that is an option, but I think it would behoove you to make your own wrapper object -- IntString or similar -- that would handle everything for you. It should implement IComparable, as the other gal mentioned.
You can increase the efficiency of sorting your object in collections by writing a CompareTo method that does exactly what you need it to. Writing a good CompareTo method is a whole can of worms in itself, so you should probably start a new question for that, if that's what you want.
If you're looking for a property that is strongly typed as a List<int> or List<string> at instantiation, but can change afterwards, then you want an interface. IList<T> exists, but won't help you, since it must also be strongly typed upon declaration. You should probably make something like an IIntStringList that can only be a List<int> or a List<string>.
Sorry this answer doesn't have that many details (I need to leave the office now), but I hope I've set you on the right track.
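For what it's worth, here is a minimal sketch of the kind of interface I mean (IIntStringList, IntList, and StringList are all hypothetical names, not an existing API):

public interface IIntStringList
{
    int Count { get; }
    bool Contains(object value);
}

public class IntList : List<int>, IIntStringList
{
    // explicit implementation: only matches when the boxed value really is an int
    bool IIntStringList.Contains(object value)
    {
        return value is int && Contains((int)value);
    }
}

public class StringList : List<string>, IIntStringList
{
    bool IIntStringList.Contains(object value)
    {
        return value is string && Contains((string)value);
    }
}

The property can then be typed as IIntStringList and assigned either concrete list depending on SourceID.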
My C# program generates random strings from a given pattern. These strings are stored in a list. As no duplicates are allowed I'm doing it like this:
List<string> myList = new List<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    if (!myList.Contains(random_string)) myList.Add(random_string);
}
As you can imagine this works fine for several hundreds of entries. But I'm facing the situation to generate several million strings. And with each added string checking for duplicates gets slower and slower.
Are there any faster ways to avoid duplicates?
Use a data structure that can determine membership much more efficiently, namely a HashSet. It can check whether an item is in the set in constant time, regardless of the number of items in the set.
If you really need the items in a List instead, or you need the items in the resulting list to be in the order they were generated, then you can store the data in both a list and a hashset; adding the item to both collections if it doesn't currently exist in the HashSet.
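A sketch of that combination, reusing the names from the question:

HashSet<string> seen = new HashSet<string>();
List<string> myList = new List<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    if (seen.Add(random_string)) // Add returns false if the item was already present
    {
        myList.Add(random_string);
    }
}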
The easiest way is to use this:
myList = myList.Distinct().ToList();
This does require building the list once and then creating a new one from it, though. A better way might be to write the generator up front:
public IEnumerable<string> GetRandomStrings(int total, string pattern)
{
    for (int i = 0; i < total; i++)
    {
        yield return GetRandomString(pattern);
    }
}
...
myList = GetRandomStrings(total, pattern).Distinct().ToList();
Of course, if you don't need to access items by index, you could probably improve efficiency even more by dropping the ToList and just using an IEnumerable.
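For example (a sketch; note that the Distinct sequence is lazy and re-evaluated each time it is enumerated):

IEnumerable<string> unique = GetRandomStrings(total, pattern).Distinct();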
Don't use List<>. Use Dictionary<> or HashSet<> instead!
You could use a HashSet<string> if order is not important:
HashSet<string> myHashSet = new HashSet<string>();
for (int i = 0; i < total; i++)
{
    string random_string = GetRandomString(pattern);
    myHashSet.Add(random_string);
}
The HashSet class provides high-performance set operations. A set is a collection that contains no duplicate elements, and whose elements are in no particular order.
MSDN
Or, if you need the elements kept in sorted order, I'd recommend using a SortedSet (.NET 4.0 and later).
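A minimal sketch, assuming .NET 4.0 or later is available:

SortedSet<string> sorted = new SortedSet<string>();
sorted.Add("b");
sorted.Add("a");
sorted.Add("a"); // returns false; the duplicate is ignored
// enumerating yields "a", "b"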
Not a good way, but as a kind of quick fix:
use a bool to check whether the whole list already contains the new entry.
public void AddKey(string newKey)
{
    bool containsKey = false;
    foreach (string key in MyKeys)
    {
        if (key == newKey)
        {
            containsKey = true;
            break; // no need to keep scanning once found
        }
    }
    if (!containsKey)
    {
        MyKeys.Add(newKey);
    }
}
A Hashtable would be a faster way to check if an item exists than a list.
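For instance, a minimal sketch (the method and parameter names are just stand-ins):

static void AddIfNew(Hashtable seen, List<string> myList, string newKey)
{
    if (!seen.ContainsKey(newKey))
    {
        seen.Add(newKey, null); // the value is unused; the table acts as a set
        myList.Add(newKey);
    }
}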
Have you tried:
myList = myList.Distinct().ToList();
(Distinct() alone returns a lazy IEnumerable<string>; you need ToList() to get a List<string> back.)
I'm trying to make a function for a list.
It should sort the list and delete duplicates.
It sorts fine, but doesn't delete the duplicates.
What's the problem?
void sort_del(List<double> slist)
{
    // here I sort slist
    // now slist is sorted, but still has duplicates
    List<double> rlist = new List<double>();
    int new_i = 0;
    rlist.Add(slist[0]);
    for (int i = 0; i < slist.Count; i++)
    {
        if (slist[i] != rlist[new_i])
        {
            rlist.Add(slist[i]);
            new_i++;
        }
    }
    slist = new List<double>(rlist);
    // here I should get the list without duplicates
}
It does not work because the slist reference is passed by value: assigning rlist to it has no effect at the caller's end. Your algorithm for detecting duplicates in the sorted list is fine. If you do not want the more elegant LINQ way suggested in the other answer, change the method to return the list:
List<double> sort_del(List<double> slist)
{
    // Do your stuff
    return rlist;
}
With double you can just use Distinct():
slist = new List<double>(rlist.Distinct());
or maybe:
slist = slist.Distinct().ToList();
slist.Sort();
You're not modifying the underlying list: you're building a new collection, and the result never makes it back to the caller.
If you're required to do this for homework (which seems likely, as there are data structures and easy ways to do this with LINQ that others have pointed out), you should break the sort piece and the removal of duplication into two separate methods. The methods that removes duplicates should accept a list as a parameter (as this one does), and return the new list instance without duplicates.
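A sketch of that split (the method names are hypothetical):

static List<double> Sorted(List<double> input)
{
    List<double> copy = new List<double>(input);
    copy.Sort();
    return copy;
}

static List<double> WithoutDuplicates(List<double> sorted)
{
    List<double> result = new List<double>();
    foreach (double d in sorted)
    {
        // in a sorted list, duplicates are always adjacent
        if (result.Count == 0 || result[result.Count - 1] != d)
        {
            result.Add(d);
        }
    }
    return result;
}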
Hi, I'm working on some legacy code that goes something along the lines of:
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.Remove(results[i]);
    }
}
To me it seems like bad practice to be removing the elements while still iterating through the loop because you'll be modifying the indexes.
Is this a correct assumption?
Is there a better way of doing this? I would like to use LINQ, but I'm on the 2.0 Framework.
The removal is actually OK, since you are going downwards to zero: only the indexes that you have already passed will be modified. One improvement, though: since you already have the index, use RemoveAt instead of Remove, which has to search the list for the item again:
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.RemoveAt(i);
    }
}
Edit:
As was pointed out, you must actually be dealing with a List of some sort in your pseudo-code. Conceptually the two are similar (a List uses an array internally), but if you use a raw array you have a Length property (instead of Count) and you cannot add or remove items.
With a list, the solution above is certainly concise, but it might not be easy to understand for someone who has to maintain the code (especially the backwards iteration). An alternative is to first identify the items to remove, then remove them in a second pass.
Just substitute MyType with the actual type you are dealing with:
List<MyType> removeItems = new List<MyType>();
foreach (MyType item in results)
{
    if (someCondition)
    {
        removeItems.Add(item);
    }
}

foreach (MyType item in removeItems)
    results.Remove(item);
It doesn't seem like the Remove should work at all: the IList implementation throws NotSupportedException if you're dealing with a fixed-size array.
That being said, if you're dealing with a resizable list (e.g. List<T>), why call Remove instead of RemoveAt? Since you're already navigating the indices in reverse, you don't need to "re-find" the item.
May I suggest a somewhat more functional alternative to your current code:
Instead of modifying the existing array one item at a time, you could derive a new one from it and then replace the whole array as an "atomic" operation once you're done:
The easy way (no LINQ, but very similar):
Predicate<T> filter = delegate(T item) { return !someCondition; };
results = Array.FindAll(results, filter);
// with LINQ, you'd have written: results = results.Where(item => !someCondition).ToArray();
where T is the type of the items in your results array.
A somewhat more explicit alternative:
List<T> newResults = new List<T>();
foreach (T item in results)
{
    if (!someCondition)
    {
        newResults.Add(item);
    }
}
results = newResults.ToArray();
Usually you wouldn't remove elements as such, you would create a new array from the old without the unwanted elements.
If you do go the route of removing elements from an array/list your loop should count down rather than up. (as yours does)
A couple of options:
List<int> indexesToRemove = new List<int>();
for (int i = results.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        indexesToRemove.Add(i);
    }
}

// the indexes were collected in descending order, so removing by index is safe
foreach (int i in indexesToRemove)
{
    results.RemoveAt(i);
}
or alternatively, you could make a copy of the existing list, and instead remove from the original list.
// temp is a copy of results
for (int i = temp.Count - 1; i >= 0; i--)
{
    if (someCondition)
    {
        results.Remove(temp[i]);
    }
}
I have an ArrayList that contains Room items. Each Room has a room type, such as kitchen, reception, etc.
I want to check the ArrayList to see if any rooms of that type exist before adding a room to the list.
Can anyone recommend a neat way of doing this without the need for multiple foreach loops?
(.NET 2.0)
I haven't got access to LINQ as I'm running on .NET 2.0. I should have stated that in the question.
Apologies
I would not use ArrayList here; since you have .NET 2.0, use List<T> and all becomes simple:
List<Room> rooms = ...
string roomType = "lounge";
bool exists = rooms.Exists(delegate(Room room) { return room.Type == roomType; });
Or with C# 3.0 (still targetting .NET 2.0)
bool exists = rooms.Exists(room => room.Type == roomType);
Or with C# 3.0 and either LINQBridge or .NET 3.5:
bool exists = rooms.Any(room => room.Type == roomType);
(the Any usage will work with more types, not just List<T>)
if (!rooms.Any(r => r.RoomType == typeToFind /* kitchen, ... */))
    // add it or whatever
From your question it's not 100% clear to me whether you want to enforce the rule that there may be only one room of a given type, or whether you simply want to know if one exists.
If you have the invariant that no collection of Rooms may have more than one of the same Room type, you might try using a Dictionary<Type, Room>.
This has the benefit of not performing a linear search on add.
You would add a room using the following operations:
if (rooms.ContainsKey(room.GetType()))
{
    // Can't add a second room of the same type
    ...
}
else
{
    rooms.Add(room.GetType(), room);
}
Without using lambda expressions:
void AddRoom(Room r, IList<Room> rooms, IDictionary<string, bool> roomTypes)
{
    if (!roomTypes.ContainsKey(r.RoomType))
    {
        rooms.Add(r);
        roomTypes.Add(r.RoomType, true);
    }
}
It doesn't actually matter what the type of the value in the dictionary is, because the only thing you're ever looking at is the keys.
Another way is to sort the array, then walk the elements until you find a pair of adjacent duplicates. Make it to the end, and the array is dupe-free.
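A quick sketch of that idea, using a list of room-type names as a stand-in for the real collection:

List<string> types = new List<string>(); // one entry per room's type
types.Sort();
bool hasDuplicate = false;
for (int i = 1; i < types.Count; i++)
{
    if (types[i] == types[i - 1]) // after sorting, duplicates sit next to each other
    {
        hasDuplicate = true;
        break;
    }
}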
I thought using lists and doing Exists was an operation that takes O(n) time.
Using a Dictionary instead is O(1) and is preferred if memory is not a problem.
If you do not need the sequential List I would try using a Dictionary like this:
Dictionary<Type, List<Room>> rooms = new Dictionary<Type, List<Room>>();

void Main()
{
    KitchenRoom kr = new KitchenRoom();
    DummyRoom dr = new DummyRoom();
    RoomType1 rt1 = new RoomType1();
    ...
    AddRoom(kr);
    AddRoom(dr);
    AddRoom(rt1);
    ...
}

void AddRoom(Room r)
{
    Type roomtype = r.GetType();
    if (!rooms.ContainsKey(roomtype)) // if the type is new, add it with an empty list
    {
        rooms.Add(roomtype, new List<Room>());
    }
    // and of course add the room
    rooms[roomtype].Add(r);
}
You basically get a list of rooms per room type. This solution is only OK if you don't need the ArrayList itself, but for large lists it will be the fastest.
I once had a solution with a List<string> of 300,000+ items. Comparing each element with another list of almost the same size took a humongous 12 hours. Switching the logic to a Dictionary brought it down to 12 minutes. For larger lists I always go with Dictionary<mytype, bool>, where the bool is just a dummy that is never used.