search List<string> for string .StartsWith() - c#

I have a
List<string>
with 1500 strings. I am now using the following code to pull out only the strings that start with the string prefixText.
foreach(string a in <MYLIST>)
{
if(a.StartsWith(prefixText, true, null))
{
newlist.Add(a);
}
}
This is pretty fast, but I'm looking for google fast. Now my question is if I arrange the List in alphabetical order, then compare char by char can I make this faster? Or any other suggestions on making this faster?

Since 1500 is not really a huge number, a binary search on a sorted list would probably be enough.
Nevertheless, the most efficient algorithms for prefix search are based on a data structure called a Trie or Prefix Tree. See: http://en.wikipedia.org/wiki/Trie
[Picture in the original answer: a brief illustration of the trie idea]
For a C# implementation, see for instance .NET DATA STRUCTURES FOR PREFIX STRING SEARCH AND SUBSTRING (INFIX) SEARCH TO IMPLEMENT AUTO-COMPLETION AND INTELLI-SENSE
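To make the idea concrete, here is a minimal sketch of a trie. It is not the implementation from the linked article; the names PrefixTrie, Add and StartingWith are made up for this illustration, and the lookup is case-sensitive, unlike the StartsWith call in the question.

```csharp
using System.Collections.Generic;

// Minimal trie sketch (hypothetical names): each node maps a character to a child node.
public sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord; // true if a stored string ends at this node
    }

    private readonly Node root = new Node();

    // Adds a string; adding the same string twice stores it only once.
    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
        node.IsWord = true;
    }

    // Returns every stored string that starts with the given prefix (case-sensitive).
    public List<string> StartingWith(string prefix)
    {
        var results = new List<string>();
        Node node = root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                return results; // no stored string has this prefix
        }
        Collect(node, prefix, results);
        return results;
    }

    // Depth-first walk that gathers all words below the given node.
    private static void Collect(Node node, string current, List<string> results)
    {
        if (node.IsWord)
            results.Add(current);
        foreach (KeyValuePair<char, Node> child in node.Children)
            Collect(child.Value, current + child.Key, results);
    }
}
```

Walking down to the prefix node costs one dictionary lookup per prefix character, independent of how many strings are stored; collecting the matches is then proportional to the size of the subtree below that node.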

You can use PLINQ (Parallel LINQ) to make the execution faster:
var newList = list.AsParallel().Where(x => x.StartsWith(prefixText)).ToList();

If you have the list in alphabetical order, you can use a variation of binary search to make it a lot faster.
As a starting point, this will return the index of one of the strings that match the prefix, so then you can look forward and backward in the list to find the rest:
public static int BinarySearchStartsWith(List<string> words, string prefix, int min, int max) {
while (max >= min) {
int mid = (min + max) / 2;
int comp = String.Compare(words[mid].Substring(0, prefix.Length), prefix);
if (comp < 0) {
min = mid + 1;
} else if (comp > 0) {
max = mid - 1;
} else {
return mid;
}
}
return -1;
}
int index = BinarySearchStartsWith(theList, "pre", 0, theList.Count - 1);
if (index == -1) {
// not found
} else{
// found
}
Note: If you use a prefix that is longer than any of the strings that are compared, it will break, so you might need to figure out how you want to handle that.
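One possible way to handle both the short-string case and the "look forward and backward" step is to binary-search for the two ends of the matching range directly. The sketch below is an assumption-laden variation, not the code above: the names PrefixSearch, Bound and StartingWith are invented for the example, and it requires the list to be sorted with StringComparer.Ordinal.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixSearch
{
    // Compares at most prefix.Length leading characters of word against prefix, ordinally.
    // Unlike Substring, this does not throw when word is shorter than prefix.
    private static int ComparePrefix(string word, string prefix)
    {
        return string.CompareOrdinal(word, 0, prefix, 0, prefix.Length);
    }

    // Returns the index of the first element that compares >= prefix
    // (or > prefix when strict is true).
    private static int Bound(List<string> sorted, string prefix, bool strict)
    {
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            int comp = ComparePrefix(sorted[mid], prefix);
            if (comp < 0 || (strict && comp == 0))
                lo = mid + 1;
            else
                hi = mid;
        }
        return lo;
    }

    // Assumption: the list was sorted with StringComparer.Ordinal.
    public static List<string> StartingWith(List<string> sorted, string prefix)
    {
        int first = Bound(sorted, prefix, false); // first match
        int end = Bound(sorted, prefix, true);    // one past the last match
        return sorted.GetRange(first, end - first);
    }
}
```

Usage would look like words.Sort(StringComparer.Ordinal) once, followed by PrefixSearch.StartingWith(words, "pre") per query. For the case-insensitive matching in the question, both the sort and the comparison would have to switch to an ignore-case ordinal comparison together.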

Many approaches were analyzed to achieve minimal memory use and high performance. The winner: all prefixes are stored in a dictionary, where the key is a prefix and the value is the list of items matching that prefix.
Here is a simple implementation of this algorithm:
public class Trie<TItem>
{
#region Constructors
public Trie(
IEnumerable<TItem> items,
Func<TItem, string> keySelector,
IComparer<TItem> comparer)
{
this.KeySelector = keySelector;
this.Comparer = comparer;
this.Items = (from item in items
from i in Enumerable.Range(1, this.KeySelector(item).Length)
let key = this.KeySelector(item).Substring(0, i)
group item by key)
.ToDictionary( group => group.Key, group => group.ToList());
}
#endregion
#region Properties
protected Dictionary<string, List<TItem>> Items { get; set; }
protected Func<TItem, string> KeySelector { get; set; }
protected IComparer<TItem> Comparer { get; set; }
#endregion
#region Methods
public List<TItem> Retrieve(string prefix)
{
return this.Items.ContainsKey(prefix)
? this.Items[prefix]
: new List<TItem>();
}
public void Add(TItem item)
{
var keys = (from i in Enumerable.Range(1, this.KeySelector(item).Length)
let key = this.KeySelector(item).Substring(0, i)
select key).ToList();
keys.ForEach(key =>
{
if (!this.Items.ContainsKey(key))
{
this.Items.Add(key, new List<TItem> { item });
}
else if (this.Items[key].All(x => this.Comparer.Compare(x, item) != 0))
{
this.Items[key].Add(item);
}
});
}
public void Remove(TItem item)
{
this.Items.Keys.ToList().ForEach(key =>
{
if (this.Items[key].Any(x => this.Comparer.Compare(x, item) == 0))
{
this.Items[key].RemoveAll(x => this.Comparer.Compare(x, item) == 0);
if (this.Items[key].Count == 0)
{
this.Items.Remove(key);
}
}
});
}
#endregion
}

1500 is usually too few to need optimizing, but:
You could search it in parallel with a simple divide and conquer of the problem: search each half of the list (or divide it into three, four, ... parts) in different jobs/threads.
Or store the strings in a (not binary) tree instead. The lookup will be O(log n).
If the list is sorted in alphabetical order, you can do a binary search (sort of the same as the previous one).

You can accelerate a bit by comparing the first character before invoking StartsWith:
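// StartsWith below ignores case, so the first-character check must as well
char first = char.ToUpperInvariant(prefixText[0]);
foreach(string a in <MYLIST>)
{
// a.Length > 0 guards against empty strings
if (a.Length > 0 && char.ToUpperInvariant(a[0]) == first)
{
if(a.StartsWith(prefixText, true, null))
{
newlist.Add(a);
}
}
}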

I assume that the fastest way would be to generate a dictionary with all possible prefixes from your 1500 strings, effectively precomputing the results for all possible searches that will return non-empty. Your search would then be simply a dictionary lookup completing in O(1) time. This is a case of trading memory (and initialization time) for speed.
private IDictionary<string, string[]> prefixedStrings;
public void Construct(IEnumerable<string> strings)
{
this.prefixedStrings =
(
from s in strings
from i in Enumerable.Range(1, s.Length)
let p = s.Substring(0, i)
group s by p
).ToDictionary(
g => g.Key,
g => g.ToArray());
}
public string[] Search(string prefix)
{
string[] result;
if (this.prefixedStrings.TryGetValue(prefix, out result))
return result;
return new string[0];
}

Have you tried implementing a Dictionary and comparing the results? Or, if you do put the entries in alphabetical order, try a binary search.

The question to me is whether or not you'll need to do this one time or multiple times.
If you only find the StartsWithPrefix list one time, you can't get faster than leaving the original list as is and doing myList.Where(s => s.StartsWith(prefix)). This looks at every string one time, so it's O(n).
If you need to find the StartsWithPrefix list several times, or maybe you're going to want to add or remove strings to the original list and update the StartsWithPrefix list, then you should sort the original list and use binary search. But this will be sort time + search time = O(n log n) + 2 * O(log n).
If you did the binary search method, you would find the indexes of the first occurrence of your prefix and the last occurrence via search. Then do mySortedList.Skip(n).Take(m - n + 1), where n is the first index and m is the last index (both inclusive).
Edit:
Wait a minute, we're using the wrong tool for the job. Use a Trie! If you put all your strings into a Trie instead of the list, all you have to do is walk down the trie with your prefix and grab all the words underneath that node.

I would go with using Linq:
var query = list.Where(w => w.StartsWith(prefixText)).ToList();

Related

Balancing oriented graph

I have a set of edges looking like this:
public class Edge<T>
{
public T From { get; set; }
public T To { get; set; }
}
Now I would like to check if my graph is balanced. By "balanced" I mean that every vertex has an equal number of incoming and outgoing edges. My current code is:
public static bool IsGraphBalanced<T>(List<Edge<T>> edges)
{
var from = new Dictionary<T, int>();
var to = new Dictionary<T, int>();
foreach (var edge in edges)
{
if (!from.ContainsKey(edge.From))
from.Add(edge.From, 0);
if (!to.ContainsKey(edge.To))
to.Add(edge.To, 0);
from[edge.From] += 1;
to[edge.To] += 1;
}
foreach (var kv in from)
{
if (!to.ContainsKey(kv.Key))
return false;
if (to[kv.Key] != kv.Value)
return false;
}
// mirrored check with foreach on "to" dictionary
return true;
}
Can I replace it with Linq?
P.S. The number of edges is under 100-150 items, so I care about readability rather than performance.
Here is a more concise implementation utilizing Enumerable class ToLookup, All, Count and Any extension methods (I'll let you decide whether it's more readable or not):
public static bool IsGraphBalanced<T>(List<Edge<T>> edges)
{
var from = edges.ToLookup(e => e.From);
var to = edges.ToLookup(e => e.To);
return from.All(g => g.Count() == to[g.Key].Count())
&& to.All(g => from[g.Key].Any());
}
The ToLookup method is similar to GroupBy, but creates a reusable data structure (because we'll need 2 passes).
Then from.All(g => g.Count() == to[g.Key].Count()) checks if every From has corresponding To and their counts match. Note that in case the key doesn't exist, the ILookup<TKey, TElement> indexer does not throw exception or return null, but returns an empty IEnumerable<TElement>, which allows us to combine the checks.
Finally the to.All(g => from[g.Key].Any()) checks if every To has corresponding From. There is no need to check the counts here because they have been checked in the previous step.
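The missing-key behaviour of ILookup that the explanation relies on can be checked in isolation. This is a small sketch with an invented helper (LookupDemo.CountFor) and invented sample data, not part of the answer's code:

```csharp
using System.Linq;

public static class LookupDemo
{
    // Counts the elements stored under a key. For a key that is not present,
    // the lookup indexer yields an empty sequence, so the result is 0.
    public static int CountFor(string key)
    {
        ILookup<string, int> lookup =
            new[] { 1, 2, 3 }.ToLookup(n => n % 2 == 0 ? "even" : "odd");
        return lookup[key].Count();
    }
}
```

Here CountFor("odd") is 2, CountFor("even") is 1, and CountFor("missing") is 0 rather than an exception, which is what lets the count comparison and the existence check share one expression.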

Return duplicate items from Custom List<T> not in order C#

I have a collection of (6) number sequences in a custom List
Example of List
1. 1,2,3,4,5,6
2. 2,1,8,9,8,4
3. 6,5,4,3,2,1
Basically, I need to get the same groups of numbers which aren't necessarily in the same order. So for the example above, I would need to return either 1,2,3,4,5,6 or 6,5,4,3,2,1
I have the following which works for single numbers, but not 6 number groups.
var dupes = numCol.GroupBy(x => x)
.Where(x => x.Count() > 1)
.Select(x => x.Key)
.ToList();
What's the best and most efficient way to do this?
Thanks.
Edit:
Sample of my custom list structure is below..
public class Numbers
{
private int _First;
public int First
{
get { return _First; }
set { _First = value; }
}
private int _Second;
public int Second
{
get { return _Second; }
set { _Second = value; }
}
...
If the order is not important, use a Set instead of a List (HashSet<T> in C#).
A List is an ordered sequence of elements, whereas a Set is an unordered collection of distinct elements.
If you really need a List, you will need to iterate over the list to do that. There is no more efficient way to do it. Linq functions do that under the hood; they are just shortcuts for developers.
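As a rough sketch of how the GroupBy from the question could be adapted, each sequence can be grouped by an order-independent key built from its sorted numbers. The example assumes the sequences are available as int[] (the Numbers class in the question is truncated, so it is not used here), and the names SequenceDuplicates, Key and FindDuplicates are made up:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SequenceDuplicates
{
    // Builds an order-independent key, e.g. {6,5,4,3,2,1} -> "1,2,3,4,5,6".
    private static string Key(IEnumerable<int> numbers)
    {
        return string.Join(",", numbers.OrderBy(n => n));
    }

    // Returns one representative (the first one seen) of every group of
    // sequences that contain the same numbers, regardless of order.
    public static List<int[]> FindDuplicates(IEnumerable<int[]> sequences)
    {
        return sequences
            .GroupBy(s => Key(s))
            .Where(g => g.Count() > 1)
            .Select(g => g.First())
            .ToList();
    }
}
```

For the three example sequences in the question, this would return the single sequence 1,2,3,4,5,6, since 6,5,4,3,2,1 produces the same key.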

C# distinct List<string> by substring

I want to remove duplicates from a list of strings. I do this by using Distinct, but I want to ignore the first char when comparing.
I already have working code that deletes the duplicates, but my code also deletes the first char of every string.
List<string> mylist = new List<string>();
List<string> newlist =
mylist.Select(e => e.Substring(1, e.Length - 1)).Distinct().ToList();
Input:
"1A","1B","2A","3C","4D"
Output:
"A","B","C","D"
Right Output:
"1A","1B","3C","4D" (it doesn't matter whether "1A" or "2A" is deleted)
I guess I am pretty close but.... any input is highly appreciated!
As always a solution should work as fast as possible ;)
You can implement an IEqualityComparer<string> that will compare your strings by ignoring the first letter. Then pass it to Distinct method.
myList.Distinct(new MyComparer());
There is also an example on MSDN that shows you how to implement and use a custom comparer with Distinct.
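The answer does not show the comparer itself, so here is one way it might look. This is only a sketch: the class name IgnoreFirstCharComparer and the Tail helper are hypothetical, and strings shorter than two characters are compared as a whole, which is an assumption about the desired behaviour.

```csharp
using System.Collections.Generic;

// Hypothetical comparer: two strings are equal when they match after the first character.
public sealed class IgnoreFirstCharComparer : IEqualityComparer<string>
{
    // Everything after the first character; strings shorter than 2 are used as-is (assumption).
    private static string Tail(string s)
    {
        return s.Length < 2 ? s : s.Substring(1);
    }

    public bool Equals(string x, string y)
    {
        if (x == null || y == null)
            return x == y; // equal only if both are null
        return Tail(x) == Tail(y);
    }

    // Must agree with Equals: equal tails produce equal hash codes.
    public int GetHashCode(string s)
    {
        return s == null ? 0 : Tail(s).GetHashCode();
    }
}
```

It would be used as mylist.Distinct(new IgnoreFirstCharComparer()).ToList(). The Substring calls allocate a new string per comparison, so this favours brevity over raw speed.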
You can GroupBy all but the first character and take the first of every group:
List<string> result= mylist.GroupBy(s => s.Length < 2 ? s : s.Substring(1))
.Select(g => g.First())
.ToList();
Result:
Console.Write(string.Join(",", result)); // 1A,1B,3C,4D
it doesn't matter if "1A" or "2A" will be deleted
If you change your mind you have to replace g.First() with the new logic.
However, if performance really matters and it is never important which duplicate you want to delete, you should prefer Selman's approach, which suggests writing a custom IEqualityComparer<string>. That will be more efficient than my GroupBy approach if its GetHashCode is implemented like:
return (s.Length < 2 ? s : s.Substring(1)).GetHashCode();
I'm going to suggest a simple extension that you can reuse in similar situations
public static IEnumerable<T> DistinctBy<T, U>(this IEnumerable<T> This, Func<T, U> keySelector)
{
var set = new HashSet<U>();
foreach (var item in This)
{
if (set.Add(keySelector(item)))
yield return item;
}
}
This is basically how Distinct is implemented in Linq.
Usage:
List<string> newlist =
mylist.DistinctBy(e => e.Substring(1, e.Length - 1)).ToList();
I realise the answer has already been given, but since I was working on this answer anyway I'm still going to post it, in case it's any use.
If you really want the fastest solution for large lists, then something like this might be optimal. You would need to do some accurate timings to be sure, though!
This approach does not make any additional string copies when comparing or computing the hash codes:
using System;
using System.Collections.Generic;
using System.Linq;
namespace Demo
{
internal static class Program
{
static void Main()
{
var myList = new List<string>
{
"1A",
"1B",
"2A",
"3C",
"4D"
};
var newList = myList.Distinct(new MyComparer());
Console.WriteLine(string.Join("\n", newList));
}
sealed class MyComparer: IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
if (x.Length != y.Length)
return false;
if (x.Length == 0)
return true;
// Ordinal comparison, to stay consistent with the ordinal hash in GetHashCode
return (string.CompareOrdinal(x, 1, y, 1, x.Length) == 0);
}
public int GetHashCode(string s)
{
if (s.Length <= 1)
return 0;
int result = 17;
unchecked
{
bool first = true;
foreach (char c in s)
{
if (first)
first = false;
else
result = result*23 + c;
}
}
return result;
}
}
}
}

LINQ: Collapsing a series of strings into a set of "ranges"

I have an array of strings similar to this (shown on separate lines to illustrate the pattern):
{ "aa002","aa003","aa004","aa005","aa006","aa007", // note that aa008 is missing
"aa009",
"ba023","ba024","ba025",
"bb025",
"ca002","ca003",
"cb004",
...}
...and the goal is to collapse those strings into this comma-separated string of "ranges":
"aa002-aa007,aa009,ba023-ba025,bb025,ca002-ca003,cb004, ... "
I want to collapse them so I can construct a URL. There are hundreds of elements, but I can still convey all the information if I collapse them this way - putting them all into a URL "longhand" (it has to be a GET, not a POST) isn't feasible.
I've had the idea to separate them into groups using the first two characters as the key - but does anyone have any clever ideas for collapsing those sequences (without gaps) into ranges? I'm struggling with it, and everything I've come up with looks like spaghetti.
So the first thing that you need to do is parse the strings. It's important to have the alphabetic prefix and the integer value separately.
Next you want to group the items on the prefix.
For each of the items in that group, you want to order them by number, and then group items while the previous value's number is one less than the current item's number. (Or, put another way, while the previous item plus one is equal to the current item.)
Once you've grouped all of those items you want to project that group out to a value based on that range's prefix, as well as the first and last number. No other information from these groups is needed.
We then flatten the list of strings for each group into just a regular list of strings, since once we're all done there is no need to separate out ranges from different groups. This is done using SelectMany.
When that's all said and done, that, translated into code, is this:
public static IEnumerable<string> Foo(IEnumerable<string> data)
{
return data.Select(item => new
{
Prefix = item.Substring(0, 2),
Number = int.Parse(item.Substring(2))
})
.GroupBy(item => item.Prefix)
.SelectMany(group => group.OrderBy(item => item.Number)
.GroupWhile((prev, current) =>
prev.Number + 1 == current.Number)
.Select(range =>
RangeAsString(group.Key,
range.First().Number,
range.Last().Number)));
}
The GroupWhile method can be implemented like so:
public static IEnumerable<IEnumerable<T>> GroupWhile<T>(
this IEnumerable<T> source, Func<T, T, bool> predicate)
{
using (var iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
yield break;
List<T> list = new List<T>() { iterator.Current };
T previous = iterator.Current;
while (iterator.MoveNext())
{
if (!predicate(previous, iterator.Current))
{
yield return list;
list = new List<T>();
}
list.Add(iterator.Current);
previous = iterator.Current;
}
yield return list;
}
}
And then the simple helper method to convert each range into a string:
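private static string RangeAsString(string prefix, int start, int end)
{
// "D3" restores the leading zeros dropped by int.Parse; assumes three-digit numbers as in the question
if (start == end)
return prefix + start.ToString("D3");
else
return string.Format("{0}{1:D3}-{0}{2:D3}", prefix, start, end);
}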
Here's a LINQ version without the need to add new extension methods:
var data2 = data.Skip(1).Zip(data, (d1, d0) => new
{
value = d1,
jump = d1.Substring(0, 2) == d0.Substring(0, 2)
? int.Parse(d1.Substring(2)) - int.Parse(d0.Substring(2))
: -1,
});
var agg = new { f = data.First(), t = data.First(), };
var query2 =
data2
.Aggregate(new [] { agg }.ToList(), (a, x) =>
{
var last = a.Last();
if (x.jump == 1)
{
a.RemoveAt(a.Count() - 1);
a.Add(new { f = last.f, t = x.value, });
}
else
{
a.Add(new { f = x.value, t = x.value, });
}
return a;
});
var query3 =
from q in query2
select (q.f) + (q.f == q.t ? "" : "-" + q.t);

compare multiple arraylist lengths to find longest one

I have 6 array lists and I would like to know which one is the longest without using a bunch of IF STATEMENTS.
"if arraylist.count > anotherlist.count Then...." <- Anyway to do this other than this?
Examples in VB.net or C#.Net (4.0) would be helpful.
arraylist1.count
arraylist2.count
arraylist3.count
arraylist4.count
arraylist5.count
arraylist6.count
DIM longest As integer = .... 'the longest arraylist should be stored in this variable.
Thanks
Is 1 if statement acceptable?
public ArrayList FindLongest(params ArrayList[] lists)
{
var longest = lists[0];
for(var i=1;i<lists.Length;i++)
{
if(lists[i].Count > longest.Count)
longest = lists[i];
}
return longest;
}
You could use Linq:
public static ArrayList FindLongest(params ArrayList[] lists)
{
return lists == null
? null
: lists.OrderByDescending(x => x.Count).FirstOrDefault();
}
If you just want the length of the longest list, it's even simpler:
public static int FindLongestLength(params ArrayList[] lists)
{
return lists == null
? -1 // here you could also return (int?)null,
// all you need to do is adjusting the return type
: lists.Max(x => x.Count);
}
If you store everything in a List of Lists like for example
List<List<int>> f = new List<List<int>>();
Then a LINQ like
List<int> myLongest = f.OrderBy(x => x.Count).Last();
will yield the list with the largest number of items. Of course, you will have to handle the case when there is a tie for the longest list.
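One way to deal with a tie, sketched here under the assumption that returning all tied lists is acceptable, is to compute the maximum length first and then keep every list of that length. The names LongestLists and FindAllLongest are invented for the example:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LongestLists
{
    // Returns every list whose Count equals the maximum Count.
    // An empty input gives an empty result instead of an exception from Max.
    public static List<List<int>> FindAllLongest(List<List<int>> lists)
    {
        if (lists.Count == 0)
            return new List<List<int>>();

        int max = lists.Max(x => x.Count);
        return lists.Where(x => x.Count == max).ToList();
    }
}
```

This also avoids sorting: it makes two linear passes over the outer list instead of an O(n log n) OrderBy.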
SortedList sl=new SortedList();
foreach (ArrayList al in YouArrayLists)
{
int c=al.Count;
if (!sl.ContainsKey(c)) sl.Add(c,al);
}
ArrayList LongestList=(ArrayList)sl.GetByIndex(sl.Count-1);
If you just want the length of the longest ArrayList:
public int FindLongest(params ArrayList[] lists)
{
return lists.Max(item => item.Count);
}
Or if you don't want to write a function and just want to in-line the code, then:
int longestLength = (new ArrayList[] { arraylist1, arraylist2, arraylist3,
arraylist4, arraylist5, arraylist6 }).Max(item => item.Count);
