C# distinct List<string> by substring - c#

I want to remove duplicates from a list of strings. I do this by using distinct, but i want to ignore the first char when comparing.
I already have a working code that deletes the duplicates, but my code also delete the first char of every string.
List<string> mylist = new List<string>();
List<string> newlist =
mylist.Select(e => e.Substring(1, e.Length - 1)).Distinct().ToList();
Input:
"1A","1B","2A","3C","4D"
Output:
"A","B","C","D"
Right Output:
"1A","2B","3C","4D" it doesn't matter if "1A" or "2A" will be deleted
I guess I am pretty close but.... any input is highly appreciated!
As always a solution should work as fast as possible ;)

You can implement an IEqualityComparer<string> that will compare your strings by ignoring the first letter. Then pass it to Distinct method.
myList.Distinct(new MyComparer());
There is also an example on MSDN that shows you how to implement and use a custom comparer with Distinct.

You can GroupBy all but the first character and take the first of every group:
List<string> result= mylist.GroupBy(s => s.Length < 2 ? s : s.Substring(1))
.Select(g => g.First())
.ToList();
Result:
Console.Write(string.Join(",", result)); // 1A,1B,3C,4D
it doesn't matter if "1A" or "2A" will be deleted
If you change your mind you have to replace g.First() with the new logic.
However, if performance really matters and it is never important which duplicate you want to delete you should prefer Selman's approach which suggests to write a custom IEqualityComparer<string>. That will be more efficient than my GroupBy approach if it's GetHashCode is implemented like:
return (s.Length < 2 ? s : s.Substring(1)).GetHashCode();

I'm going to suggest a simple extension that you can reuse in similar situations
public static IEnumerable<T> DistinctBy<T, U>(this IEnumerable<T> This, Func<T, U> keySelector)
{
var set = new HashSet<U>();
foreach (var item in This)
{
if (set.Add(keySelector(item)))
yield return item;
}
}
This is basically how Distinct is implemented in Linq.
Usage:
List<string> newlist =
mylist.DistinctBy(e => e.Substring(1, e.Length - 1)).ToList();

I realise the answer has already been given, but since I was working on this answer anyway I'm still going to post it, in case it's any use.
If you really want the fastest solution for large lists, then something like this might be optimal. You would need to do some accurate timings to be sure, though!
This approach does not make any additional string copies when comparing or computing the hash codes:
using System;
using System.Collections.Generic;
using System.Linq;
namespace Demo
{
internal static class Program
{
static void Main()
{
var myList = new List<string>
{
"1A",
"1B",
"2A",
"3C",
"4D"
};
var newList = myList.Distinct(new MyComparer());
Console.WriteLine(string.Join("\n", newList));
}
sealed class MyComparer: IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
if (x.Length != y.Length)
return false;
if (x.Length == 0)
return true;
return (string.Compare(x, 1, y, 1, x.Length) == 0);
}
public int GetHashCode(string s)
{
if (s.Length <= 1)
return 0;
int result = 17;
unchecked
{
bool first = true;
foreach (char c in s)
{
if (first)
first = false;
else
result = result*23 + c;
}
}
return result;
}
}
}
}

Related

search List<string> for string .StartsWith()

I have a
List<string>
with 1500 strings. I am now using the following code to pull out only string that start with the string prefixText.
foreach(string a in <MYLIST>)
{
if(a.StartsWith(prefixText, true, null))
{
newlist.Add(a);
}
}
This is pretty fast, but I'm looking for google fast. Now my question is if I arrange the List in alphabetical order, then compare char by char can I make this faster? Or any other suggestions on making this faster?
Thus 1500 is not really a huge number binary search on sorted list would be enough probably.
Nevertheless most efficient algorithms for prefix search are based on the data structure named Trie or Prefix Tree. See: http://en.wikipedia.org/wiki/Trie
Following picture demonstrates the idea very briefly:
For c# implementation see for instance .NET DATA STRUCTURES FOR PREFIX STRING SEARCH AND SUBSTRING (INFIX) SEARCH TO IMPLEMENT AUTO-COMPLETION AND INTELLI-SENSE
You can use PLINQ (Parallel LINQ) to make the execution faster:
var newList = list.AsParallel().Where(x => x.StartsWith(prefixText)).ToList()
If you have the list in alpabetical order, you can use a variation of binary search to make it a lot faster.
As a starting point, this will return the index of one of the strings that match the prefix, so then you can look forward and backward in the list to find the rest:
public static int BinarySearchStartsWith(List<string> words, string prefix, int min, int max) {
while (max >= min) {
int mid = (min + max) / 2;
int comp = String.Compare(words[mid].Substring(0, prefix.Length), prefix);
if (comp < 0) {
min = mid + 1;
} else if (comp > 0) {
max = mid - 1;
} else {
return mid;
}
}
return -1;
}
int index = BinarySearchStartsWith(theList, "pre", 0, theList.Count - 1);
if (index == -1) {
// not found
} else{
// found
}
Note: If you use a prefix that is longer than any of the strings that are compared, it will break, so you might need to figure out how you want to handle that.
So many approches were analyzed to achive minimum data capacity and high performance. The first place is: all prefixes are stored in dictionary: key - prefix, values - items appropriate for prefix.
Here simple implementation of this algorithm:
public class Trie<TItem>
{
#region Constructors
public Trie(
IEnumerable<TItem> items,
Func<TItem, string> keySelector,
IComparer<TItem> comparer)
{
this.KeySelector = keySelector;
this.Comparer = comparer;
this.Items = (from item in items
from i in Enumerable.Range(1, this.KeySelector(item).Length)
let key = this.KeySelector(item).Substring(0, i)
group item by key)
.ToDictionary( group => group.Key, group => group.ToList());
}
#endregion
#region Properties
protected Dictionary<string, List<TItem>> Items { get; set; }
protected Func<TItem, string> KeySelector { get; set; }
protected IComparer<TItem> Comparer { get; set; }
#endregion
#region Methods
public List<TItem> Retrieve(string prefix)
{
return this.Items.ContainsKey(prefix)
? this.Items[prefix]
: new List<TItem>();
}
public void Add(TItem item)
{
var keys = (from i in Enumerable.Range(1, this.KeySelector(item).Length)
let key = this.KeySelector(item).Substring(0, i)
select key).ToList();
keys.ForEach(key =>
{
if (!this.Items.ContainsKey(key))
{
this.Items.Add(key, new List<TItem> { item });
}
else if (this.Items[key].All(x => this.Comparer.Compare(x, item) != 0))
{
this.Items[key].Add(item);
}
});
}
public void Remove(TItem item)
{
this.Items.Keys.ToList().ForEach(key =>
{
if (this.Items[key].Any(x => this.Comparer.Compare(x, item) == 0))
{
this.Items[key].RemoveAll(x => this.Comparer.Compare(x, item) == 0);
if (this.Items[key].Count == 0)
{
this.Items.Remove(key);
}
}
});
}
#endregion
}
1500 is usually too few:
you could search it in parallel with a simple divide and conquer of the problem. Search each half of the list in two (or divide into three, four, ..., parts) different jobs/threads.
Or store the strings in a (not binary) tree instead. Will be O(log n).
sorted in alphabetical order you can do a binary search (sort of the same as the previous one)
You can accelerate a bit by comparing the first character before invoking StartsWith:
char first = prefixText[0];
foreach(string a in <MYLIST>)
{
if (a[0]==first)
{
if(a.StartsWith(prefixText, true, null))
{
newlist.Add(a);
}
}
}
I assume that the really fastest way would be to generate a dictionary with all possible prefixes from your 1500 strings, effectively precomputing the results for all possible searches that will return non-empty. Your search would then be simply a dictionary lookup completing in O(1) time. This is a case of trading memory (and initialization time) for speed.
private IDictionary<string, string[]> prefixedStrings;
public void Construct(IEnumerable<string> strings)
{
this.prefixedStrings =
(
from s in strings
from i in Enumerable.Range(1, s.Length)
let p = s.Substring(0, i)
group s by p
).ToDictionary(
g => g.Key,
g => g.ToArray());
}
public string[] Search(string prefix)
{
string[] result;
if (this.prefixedStrings.TryGetValue(prefix, out result))
return result;
return new string[0];
}
Have you tried implementing a Dictionary and comparing the results? Or, if you do put the entries in alphabetical order, try a binary search.
The question to me is whether or not you'll need to do this one time or multiple times.
If you only find the StartsWithPrefix list one time, you can't get faster then leaving the original list as is and doing myList.Where(s => s.StartsWith(prefix)). This looks at every string one time so it's O(n)
If you need to find the StartsWithPrefix list several times, or maybe you're going to want to add or remove strings to the original list and update the StartsWithPrefix list then you should sort the original list and use binary search. But this will be sort time + search time = O(n log n) + 2 * O(log n)
If you did the binary search method, you would find the indexes of the first occurrence of your prefix and the last occurrence via search. Then do mySortedList.Skip(n).Take(m-n) where n is first index and m is last index.
Edit:
Wait a minute, we're using the wrong tool for the job. Use a Trie! If you put all your strings into a Trie instead of the list, all you have to do is walk down the trie with your prefix and grab all the words underneath that node.
I would go with using Linq:
var query = list.Where(w => w.StartsWith("prefixText")).Select(s => s).ToList();

Query Nested Dictionary

I was curious if anyone had a good way to solving this problem efficiently. I currently have the following object.
Dictionary<int, Dictionary<double, CustomStruct>>
struct CustomStruct
{
double value1;
double value2;
...
}
Given that I know the 'int' I want to access, I need to know how to return the 'double key' for the dictionary that has the lowest sum of (value1 + value2). Any help would be greatly appreciated. I was trying to use Linq, but any method would be appreciated.
var result = dict[someInt].MinBy(kvp => kvp.Value.value1 + kvp.Value.value2).Key;
using the MinBy Extension Method from the awesome MoreLINQ project.
Using just plain LINQ:
Dictionary<int, Dictionary<double, CustomStruct>> dict = ...;
int id = ...;
var minimum =
(from kvp in dict[id]
// group the keys (double) by their sums
group kvp.Key by kvp.Value.value1 + kvp.Value.value2 into g
orderby g.Key // sort group keys (sums) in ascending order
select g.First()) // select the first key (double) in the group
.First(); // return first key in the sorted collection of keys
Whenever you want to get the minimum or maximum item using plain LINQ, you usually have to do it using ith a combination of GroupBy(), OrderBy() and First()/Last() to get it.
A Dictionary<TKey,TValue> is also a sequence of KeyValuePair<TKey,TValue>. You can select the KeyValuePair with the least sum of values and and get its key.
Using pure LINQ to Objects:
dict[someInt].OrderBy(item => item.Value.value1 + item.Value.value2)
.FirstOrDefault()
.Select(item => item.Key);
Here is the non LINQ way. It is not shorter than its LINQ counterparts but it is much more efficient because it does no sorting like most LINQ solutions which may turn out expensive if the collection is large.
The MinBy solution from dtb is a good one but it requires an external library. I do like LINQ a lot but sometimes you should remind yourself that a foreach loop with a few local variables is not archaic or an error.
CustomStruct Min(Dictionary<double, CustomStruct> input)
{
CustomStruct lret = default(CustomStruct);
double lastSum = double.MaxValue;
foreach (var kvp in input)
{
var other = kvp.Value;
var newSum = other.value1 + other.value2;
if (newSum < lastSum)
{
lastSum = newSum;
lret = other;
}
}
return lret;
}
If you want to use the LINQ method without using an extern library you can create your own MinBy like this one:
public static class Extensions
{
public static T MinBy<T>(this IEnumerable<T> coll, Func<T,double> criteria)
{
T lret = default(T);
double last = double.MaxValue;
foreach (var v in coll)
{
var newLast = criteria(v);
if (newLast < last)
{
last = newLast;
lret = v;
}
}
return lret;
}
}
It is not as efficient as the first one but it does the job and is more reusable and composable as the first one. Your solution with Aggregate is innovative but requires recalculation of the sum of the current best match for every item the current best match is compared to because you carry not enough state between the aggregate calls.
Thanks for all the help guys, found out this way too:
dict[int].Aggregate(
(seed, o) =>
{
var v = seed.Value.TotalCut + seed.Value.TotalFill;
var k = o.Value.TotalCut + o.Value.TotalFill;
return v < k ? seed : o;
}).Key;

compare multiple arraylist lengths to find longest one

I have 6 array lists and I would like to know which one is the longest without using a bunch of IF STATEMENTS.
"if arraylist.count > anotherlist.count Then...." <- Anyway to do this other than this?
Examples in VB.net or C#.Net (4.0) would be helpfull.
arraylist1.count
arraylist2.count
arraylist3.count
arraylist4.count
arraylist5.count
arraylist6.count
DIM longest As integer = .... 'the longest arraylist should be stored in this variable.
Thanks
Is 1 if statement acceptable?
public ArrayList FindLongest(params ArrayList[] lists)
{
var longest = lists[0];
for(var i=1;i<lists.Length;i++)
{
if(lists[i].Length > longest.Length)
longest = lists[i];
}
return longest;
}
You could use Linq:
public static ArrayList FindLongest(params ArrayList[] lists)
{
return lists == null
? null
: lists.OrderByDescending(x => x.Count).FirstOrDefault();
}
If you just want the length of the longest list, it's even simpler:
public static int FindLongestLength(params ArrayList[] lists)
{
return lists == null
? -1 // here you could also return (int?)null,
// all you need to do is adjusting the return type
: lists.Max(x => x.Count);
}
If you store everything in a List of Lists like for example
List<List<int>> f = new List<List<int>>();
Then a LINQ like
List<int> myLongest = f.OrderBy(x => x.Count).Last();
will yield the list with the most number of items. Of course you will have to handle the case when there is tie for the longest list
SortedList sl=new SortedList();
foreach (ArrayList al in YouArrayLists)
{
int c=al.Count;
if (!sl.ContainsKey(c)) sl.Add(c,al);
}
ArrayList LongestList=(ArrayList)sl.GetByIndex(sl.Count-1);
If you just want the length of the longest ArrayList:
public int FindLongest(params ArrayList[] lists)
{
return lists.Max(item => item.Count);
}
Or if you don't want to write a function and just want to in-line the code, then:
int longestLength = (new ArrayList[] { arraylist1, arraylist2, arraylist3,
arraylist4, arraylist5, arraylist6 }).Max(item => item.Count);

How can I split an IEnumerable<String> into groups of IEnumerable<string> [duplicate]

This question already has answers here:
Split List into Sublists with LINQ
(34 answers)
Closed 10 years ago.
I have an IEnumerable<string> which I would like to split into groups of three so if my input had 6 items i would get a IEnumerable<IEnumerable<string>> returned with two items each of which would contain an IEnumerable<string> which my string contents in it.
I am looking for how to do this with Linq rather than a simple for loop
Thanks
var result = sequence.Select((s, i) => new { Value = s, Index = i })
.GroupBy(item => item.Index / 3, item => item.Value);
Note that this will return an IEnumerable<IGrouping<int,string>> which will be functionally similar to what you want. However, if you strictly need to type it as IEnumerable<IEnumerable<string>> (to pass to a method that expects it in C# 3.0 which doesn't support generics variance,) you should use Enumerable.Cast:
var result = sequence.Select((s, i) => new { Value = s, Index = i })
.GroupBy(item => item.Index / 3, item => item.Value)
.Cast<IEnumerable<string>>();
This is a late reply to this thread, but here is a method that doesn't use any temporary storage:
public static class EnumerableExt
{
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> input, int blockSize)
{
var enumerator = input.GetEnumerator();
while (enumerator.MoveNext())
{
yield return nextPartition(enumerator, blockSize);
}
}
private static IEnumerable<T> nextPartition<T>(IEnumerator<T> enumerator, int blockSize)
{
do
{
yield return enumerator.Current;
}
while (--blockSize > 0 && enumerator.MoveNext());
}
}
And some test code:
class Program
{
static void Main(string[] args)
{
var someNumbers = Enumerable.Range(0, 10000);
foreach (var block in someNumbers.Partition(100))
{
Console.WriteLine("\nStart of block.");
foreach (int number in block)
{
Console.Write(number);
Console.Write(" ");
}
}
Console.WriteLine("\nDone.");
Console.ReadLine();
}
}
However, do note the comments below for the limitations of this approach:
If you change the foreach in the test code to
foreach (var block in someNumbers.Partition(100).ToArray())
then it doesn't work any more.
It isn't threadsafe.
I know this has already been answered, but if you plan on taking slices of IEnumerables often, then I recommend making a generic extension method like this:
public static IEnumerable<IEnumerable<T>> Split<T>(this IEnumerable<T> source, int chunkSize)
{
return source.Where((x,i) => i % chunkSize == 0).Select((x,i) => source.Skip(i * chunkSize).Take(chunkSize));
}
Then you can use sequence.Split(3) to get what you want.
(you can name it something else like 'slice', or 'chunk' if you don't like that 'split' has already been defined for strings. 'Split' is just what I happened to call mine.)
Inspired By #dicegiuy30's implementation, I wanted to create a version that only iterates over the source once and doesn't build the whole result set in memory to compensate. Best I've come up with is this:
public static IEnumerable<IEnumerable<T>> Split2<T>(this IEnumerable<T> source, int chunkSize) {
var chunk = new List<T>(chunkSize);
foreach(var x in source) {
chunk.Add(x);
if(chunk.Count <= chunkSize) {
continue;
}
yield return chunk;
chunk = new List<T>(chunkSize);
}
if(chunk.Any()) {
yield return chunk;
}
}
This way I build each chunk on demand. I wish I should avoid the List<T> as well and just stream that that as well, but haven't figured that out yet.
using Microsoft.Reactive you can do this pretty simply and you will iterate only one time through the source.
IEnumerable<string> source = new List<string>{"1", "2", "3", "4", "5", "6"};
IEnumerable<IEnumerable<string>> splited = source.ToObservable().Buffer(3).ToEnumerable();
We can improve #Afshari's solution to do true lazy evaluation. We use a GroupAdjacentBy method that yields groups of consecutive elements with the same key:
sequence
.Select((x, i) => new { Value = x, Index = i })
.GroupAdjacentBy(x=>x.Index/3)
.Select(g=>g.Select(x=>x.Value))
Because the groups are yielded one-by-one, this solution works efficiently with long or infinite sequences.
Mehrdad Afshari's answer is excellent. Here is the an extension method that encapsulates it:
using System.Collections.Generic;
using System.Linq;
public static class EnumerableExtensions
{
public static IEnumerable<IEnumerable<T>> GroupsOf<T>(this IEnumerable<T> enumerable, int size)
{
return enumerable.Select((v, i) => new {v, i}).GroupBy(x => x.i/size, x => x.v);
}
}
I came up with a different approach. It uses a while iterator alright but the results are cached in memory like a regular LINQ until needed.
Here's the code.
public IEnumerable<IEnumerable<T>> Paginate<T>(this IEnumerable<T> source, int pageSize)
{
List<IEnumerable<T>> pages = new List<IEnumerable<T>>();
int skipCount = 0;
while (skipCount * pageSize < source.Count) {
pages.Add(source.Skip(skipCount * pageSize).Take(pageSize));
skipCount += 1;
}
return pages;
}

C# - Finding the common members of two List<T>s - Lambda Syntax

So I wrote this simple console app to aid in my question asking. What is the proper way to use a lambda expression on line 3 of the method to get the common members. Tried a Join() but couldn't figure out the correct syntax. As follow up... is there a non-LINQ way to do this in one line that I missed?
class Program
{
static void Main(string[] args)
{
List<int> c = new List<int>() { 1, 2, 3 };
List<int> a = new List<int>() { 5, 3, 2, 4 };
IEnumerable<int> j = c.Union<int>(a);
// just show me the Count
Console.Write(j.ToList<int>().Count.ToString());
}
}
You want Intersect():
IEnumerable<int> j = c.Intersect(a);
Here's an OrderedIntersect() example based on the ideas mentioned in the comments. If you know your sequences are ordered it should run faster — O(n) rather than whatever .Intersect() normally is (don't remember off the top of my head). But if you don't know they are ordered, it likely won't return correct results at all:
public static IEnumerable<T> OrderedIntersect<T>(this IEnumerable<T> source, IEnumerable<T> other) where T : IComparable
{
using (var xe = source.GetEnumerator())
using (var ye = other.GetEnumerator())
{
while (xe.MoveNext())
{
while (ye.MoveNext() && ye.Current.CompareTo(xe.Current) < 0 )
{
// do nothing - all we care here is that we advanced the y enumerator
}
if (ye.Current.Equals(xe.Current))
yield return xe.Current;
else
{ // y is now > x, so get x caught up again
while (xe.MoveNext() && xe.Current.CompareTo(ye.Current) < 0 )
{ } // again: just advance, do do anything
if (xe.Current.Equals(ye.Current)) yield return xe.Current;
}
}
}
}
If you by lambda syntax mean a real LINQ query, it looks like this:
IEnumerable<int> j =
from cItem in c
join aitem in a on cItem equals aItem
select aItem;
A lambda expression is when you use the => operator, like in:
IEnumerable<int> x = a.Select(y => y > 5);
What you have with the Union method really is a non-LINQ way of doing it, but I suppose that you mean a way of doing it without extension methods. There is hardly a one-liner for that. I did something similar using a Dictionary yesterday. You could do like this:
Dictaionary<int, bool> match = new Dictaionary<int, bool>();
foreach (int i in c) match.Add(i, false);
foreach (int i in a) {
if (match.ContainsKey(i)) {
match[i] = true;
}
}
List<int> result = new List<int>();
foreach (KeyValuePair<int,bool> pair in match) {
if (pair.Value) result.Add(pair.Key);
}

Categories