Fastest way to search in a string collection - c#

Problem:
I have a text file of around 120,000 users (strings) which I would like to store in a collection and later to perform a search on that collection.
The search will occur every time the user changes the text of the TextBox, and the result should be the strings that contain the text in the TextBox.
I don't have to change the list, just pull the results and put them in a ListBox.
What I've tried so far:
I tried two different collections/containers, into which I load the string entries from an external text file (once, of course):
List<string> allUsers;
HashSet<string> allUsers;
With the following LINQ query:
allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
My search event (fires when the user changes the search text):
private void textBox_search_TextChanged(object sender, EventArgs e)
{
if (textBox_search.Text.Length > 2)
{
listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
}
else
{
listBox_choices.DataSource = null;
}
}
Results:
Both gave me a poor response time (around 1-3 seconds between each key press).
Question:
Where do you think my bottleneck is? The collection I've used? The search method? Both?
How can I get better performance and more fluent functionality?

You could consider doing the filtering task on a background thread which would invoke a callback method when it's done, or simply restart filtering if input is changed.
The general idea is to be able to use it like this:
public partial class YourForm : Form
{
private readonly BackgroundWordFilter _filter;
public YourForm()
{
InitializeComponent();
// setup the background worker to return no more than 10 items,
// and to set ListBox.DataSource when results are ready
_filter = new BackgroundWordFilter
(
items: GetDictionaryItems(),
maxItemsToMatch: 10,
callback: results =>
this.Invoke(new Action(() => listBox_choices.DataSource = results))
);
}
private void textBox_search_TextChanged(object sender, EventArgs e)
{
// this will update the background worker's "current entry"
_filter.SetCurrentEntry(textBox_search.Text);
}
}
A rough sketch would be something like:
public class BackgroundWordFilter : IDisposable
{
private readonly List<string> _items;
private readonly AutoResetEvent _signal = new AutoResetEvent(false);
private readonly Thread _workerThread;
private readonly int _maxItemsToMatch;
private readonly Action<List<string>> _callback;
private volatile bool _shouldRun = true;
private volatile string _currentEntry = null;
public BackgroundWordFilter(
List<string> items,
int maxItemsToMatch,
Action<List<string>> callback)
{
_items = items;
_callback = callback;
_maxItemsToMatch = maxItemsToMatch;
// start the long-lived background thread
_workerThread = new Thread(WorkerLoop)
{
IsBackground = true,
Priority = ThreadPriority.BelowNormal
};
_workerThread.Start();
}
public void SetCurrentEntry(string currentEntry)
{
// set the current entry and signal the worker thread
_currentEntry = currentEntry;
_signal.Set();
}
void WorkerLoop()
{
while (_shouldRun)
{
// wait here until there is a new entry
_signal.WaitOne();
if (!_shouldRun)
return;
var entry = _currentEntry;
var results = new List<string>();
// if there is nothing to process,
// return an empty list
if (string.IsNullOrEmpty(entry))
{
_callback(results);
continue;
}
// do the search in a loop to allow
// early termination when the current entry
// is changed on a different thread
foreach (var i in _items)
{
// if matched, add to the list of results
if (i.Contains(entry))
results.Add(i);
// check if the current entry was updated in the meantime,
// or we found enough items
if (entry != _currentEntry || results.Count >= _maxItemsToMatch)
break;
}
if (entry == _currentEntry)
_callback(results);
}
}
public void Dispose()
{
// we are using AutoResetEvent and a background thread
// and therefore must dispose it explicitly
Dispose(true);
}
private void Dispose(bool disposing)
{
if (!disposing)
return;
// shutdown the thread
if (_workerThread.IsAlive)
{
_shouldRun = false;
_currentEntry = null;
_signal.Set();
_workerThread.Join();
}
// if targeting .NET 3.5 or older, we have to
// use the explicit IDisposable implementation
(_signal as IDisposable).Dispose();
}
}
Also, you should actually dispose the _filter instance when the parent Form is disposed. This means you should open and edit your Form's Dispose method (inside the YourForm.Designer.cs file) to look something like:
// inside "xxxxxx.Designer.cs"
protected override void Dispose(bool disposing)
{
if (disposing)
{
if (_filter != null)
_filter.Dispose();
// this part is added by Visual Studio designer
if (components != null)
components.Dispose();
}
base.Dispose(disposing);
}
On my machine, it works pretty quickly, so you should test and profile this before going for a more complex solution.
That being said, a "more complex solution" would possibly be to store the last couple of results in a dictionary, and then only filter them if it turns out that the new entry differs only by the first or last character.

I've done some testing, and searching a list of 120,000 items and populating a new list with the entries takes a negligible amount of time (about a 1/50th of a second even if all strings are matched).
The problem you're seeing must therefore be coming from the populating of the data source, here:
listBox_choices.DataSource = ...
I suspect you are simply putting too many items into the listbox.
Perhaps you should try limiting it to the first 20 entries, like so:
listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text))
.Take(20).ToList();
Also note (as others have pointed out) that you are accessing the TextBox.Text property for each item in allUsers. This can easily be fixed as follows:
string target = textBox_search.Text;
listBox_choices.DataSource = allUsers.Where(item => item.Contains(target))
.Take(20).ToList();
However, I timed how long it takes to access TextBox.Text 500,000 times and it only took 0.7 seconds, far less than the 1 - 3 seconds mentioned in the OP. Still, this is a worthwhile optimisation.

Use a suffix tree as an index. Or rather, just build a sorted dictionary that associates every suffix of every name with the list of corresponding names.
For input:
Abraham
Barbara
Abram
The structure would look like:
a -> Barbara
abraham -> Abraham
abram -> Abram
aham -> Abraham
am -> Abraham, Abram
ara -> Barbara
arbara -> Barbara
bara -> Barbara
barbara -> Barbara
braham -> Abraham
bram -> Abram
ham -> Abraham
m -> Abraham, Abram
ra -> Barbara
raham -> Abraham
ram -> Abram
rbara -> Barbara
Search algorithm
Assume user input "bra".
Bisect the dictionary keys on the user input to find the input itself or the position where it would go. This way we find "barbara" - the last key lower than "bra", i.e. its lower bound. The search takes logarithmic time.
Iterate from the position just after the found key, as long as the keys still start with the user input. This gives "braham" -> Abraham and "bram" -> Abram.
Concatenate the iteration results (Abraham, Abram) and output them.
Such structures are designed for quick substring searches. Their lookup performance is close to O(log n). I believe this approach will be fast enough to be used by the GUI thread directly. Moreover, it will work faster than a threaded solution due to the absence of synchronization overhead.
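A rough sketch of the suffix-dictionary idea, assuming the whole list fits in memory (it trades memory for speed: the index holds roughly one entry per character of every name; the variable and method names here are placeholders):

// build once: map every suffix of every name to the names containing it
var index = new SortedList<string, List<string>>(StringComparer.OrdinalIgnoreCase);
foreach (var name in allUsers)
{
    for (int i = 0; i < name.Length; i++)
    {
        string suffix = name.Substring(i);
        if (!index.TryGetValue(suffix, out var list))
            index.Add(suffix, list = new List<string>());
        list.Add(name);
    }
}

// search: bisect the sorted keys to the lower bound of the input,
// then walk forward while the keys still start with the input
IEnumerable<string> Find(string input)
{
    IList<string> keys = index.Keys;
    int lo = 0, hi = keys.Count;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        if (string.Compare(keys[mid], input, StringComparison.OrdinalIgnoreCase) < 0)
            lo = mid + 1;
        else
            hi = mid;
    }

    var seen = new HashSet<string>();   // the same name can appear under several suffixes
    for (int i = lo; i < keys.Count && keys[i].StartsWith(input, StringComparison.OrdinalIgnoreCase); i++)
        foreach (var name in index.Values[i])
            if (seen.Add(name))
                yield return name;
}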

You need either a text search engine (like Lucene.Net) or a database (you may consider an embedded one like SQL CE, SQLite, etc.). In other words, you need an indexed search. Hash-based search isn't applicable here, because you are searching for a substring, while hash-based search works well for exact values.
Otherwise it will be an iterative search that loops through the whole collection.
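As an illustration of the embedded-database route, here is a minimal sketch using SQLite through the Microsoft.Data.Sqlite package (the users.db file, Users table and Name column are made-up names). Note that a LIKE '%...%' query still scans the table; for a genuinely indexed substring/word search you would want something like SQLite's FTS extension or Lucene.Net:

using System.Collections.Generic;
using Microsoft.Data.Sqlite;

// returns at most 20 user names containing the search text
static List<string> SearchUsers(string searchText)
{
    var results = new List<string>();
    using var connection = new SqliteConnection("Data Source=users.db");
    connection.Open();

    using var command = connection.CreateCommand();
    command.CommandText = "SELECT Name FROM Users WHERE Name LIKE $pattern LIMIT 20";
    command.Parameters.AddWithValue("$pattern", "%" + searchText + "%");

    using var reader = command.ExecuteReader();
    while (reader.Read())
        results.Add(reader.GetString(0));
    return results;
}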

It might also be useful to have a "debounce" type of event. This differs from throttling in that it waits a period of time (for example, 200 ms) for changes to finish before firing the event.
See Debounce and Throttle: a visual explanation for more information about debouncing. I appreciate that this article is JavaScript-focused rather than C#, but the principle applies.
The advantage of this is that it doesn't search when you're still entering your query. It should then stop trying to perform two searches at once.
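A minimal sketch of debouncing the TextChanged event in WinForms, assuming a 200 ms quiet period (YourForm and the control names mirror the earlier sketch; RunSearch is a placeholder for the actual filtering):

// System.Windows.Forms.Timer, created once per form
private readonly Timer _debounceTimer = new Timer { Interval = 200 };

public YourForm()
{
    InitializeComponent();
    _debounceTimer.Tick += (s, e) =>
    {
        _debounceTimer.Stop();
        RunSearch(textBox_search.Text);   // placeholder: run the actual search here
    };
}

private void textBox_search_TextChanged(object sender, EventArgs e)
{
    // restart the countdown on every keystroke; the search only fires
    // after the user has stopped typing for 200 ms
    _debounceTimer.Stop();
    _debounceTimer.Start();
}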

Run the search on another thread, and show some loading animation or a progress bar while that thread is running.
You may also try to parallelize the LINQ query.
var queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList();
Here is a simple benchmark comparing the sequential and parallel versions:
{
IEnumerable<string> queryResults;
bool useParallel = true;
var strings = new List<string>();
for (int i = 0; i < 2500000; i++)
strings.Add(i.ToString());
var stp = new Stopwatch();
stp.Start();
if (useParallel)
queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList();
else
queryResults = strings.Where(item => item.Contains("1")).ToList();
stp.Stop();
Console.WriteLine("useParallel: {0}\r\nTime Elapsed: {1}", useParallel, stp.ElapsedMilliseconds);
}

Update:
I did some profiling.
(Update 3)
List content: numbers generated from 0 to 2,499,999
Filter text: 123 (20,477 results)
Core i5-2500, Win7 64bit, 8GB RAM
VS2012 + JetBrains dotTrace
The initial test run for 2,500,000 records took 20,000 ms.
The number one culprit is the call to textBox_search.Text inside the Where lambda. This makes a call to the expensive get_WindowText method of the textbox for each element. Simply changing the code to:
var text = textBox_search.Text;
listBox_choices.DataSource = allUsers.Where(item => item.Contains(text)).ToList();
reduced the execution time to 1,858 ms.
Update 2:
The other two significant bottlenecks are now the call to string.Contains (about 45% of the execution time) and the update of the listbox elements in set_DataSource (30%).
We could make a trade-off between speed and memory usage by creating a suffix tree as Basilevs has suggested, to reduce the number of necessary comparisons and push some processing time from the search after a key-press to the loading of the names from file, which might be preferable for the user.
To increase the performance of loading the elements into the listbox, I would suggest loading only the first few elements and indicating to the user that further elements are available. This gives the user feedback that results exist, so they can refine their search by entering more letters, or load the complete list with the press of a button.
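A rough sketch of that idea: fetch one extra row so you can tell whether the result list was truncated (the limit of 50 and the labelMoreResults control are arbitrary placeholders):

const int Limit = 50;
string text = textBox_search.Text;

// take one item more than the limit to detect truncation
var matches = allUsers.Where(u => u.Contains(text)).Take(Limit + 1).ToList();

bool truncated = matches.Count > Limit;
if (truncated)
    matches.RemoveAt(Limit);

listBox_choices.DataSource = matches;
labelMoreResults.Visible = truncated;   // e.g. "More results available, keep typing..."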
Using BeginUpdate and EndUpdate made no change in the execution time of set_Datasource.
As others have noted here, the LINQ query itself runs quite fast. I believe your bottle-neck is the updating of the listbox itself. You could try something like:
if (textBox_search.Text.Length > 2)
{
listBox_choices.BeginUpdate();
listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();
listBox_choices.EndUpdate();
}
I hope this helps.

Assuming you are only matching by prefixes, the data structure you are looking for is called a trie, also known as a "prefix tree". The IEnumerable.Where method that you're using now has to iterate through all the items in your collection on every call.
This thread shows how to create a trie in C#.
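For reference, a minimal prefix-trie sketch (insert and prefix lookup only; case handling, culture and result limiting are left out):

using System;
using System.Collections.Generic;

public class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public readonly List<string> Words = new List<string>();   // words passing through this node
}

public class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TrieNode();
            node = child;
            node.Words.Add(word);   // each node remembers every word below it
        }
    }

    public IReadOnlyList<string> StartingWith(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out node))
                return Array.Empty<string>();   // no word starts with this prefix
        }
        return node.Words;
    }
}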

The WinForms ListBox control really is your enemy here. It will be slow to load the records and the ScrollBar will fight you to show all 120,000 records.
Try using an old-fashioned DataGridView data-sourced to a DataTable with a single column [UserName] to hold your data:
private DataTable dt;
public Form1() {
InitializeComponent();
dt = new DataTable();
dt.Columns.Add("UserName");
for (int i = 0; i < 120000; ++i){
DataRow dr = dt.NewRow();
dr[0] = "user" + i.ToString();
dt.Rows.Add(dr);
}
dgv.AutoSizeColumnsMode = DataGridViewAutoSizeColumnsMode.Fill;
dgv.AllowUserToAddRows = false;
dgv.AllowUserToDeleteRows = false;
dgv.RowHeadersVisible = false;
dgv.DataSource = dt;
}
Then use a DataView in the TextChanged event of your TextBox to filter the data:
private void textBox1_TextChanged(object sender, EventArgs e) {
DataView dv = new DataView(dt);
dv.RowFilter = string.Format("[UserName] LIKE '%{0}%'", textBox1.Text);
dgv.DataSource = dv;
}

First I would change how the ListControl sees your data source: you're converting the resulting IEnumerable<string> to a List<string>. Especially when the user has typed only a few characters, this may be inefficient (and unneeded). Do not make expensive copies of your data.
I would wrap the .Where() result in a collection that implements only what is required from IList (search). This saves you from creating a new big list for each character typed.
As an alternative I would avoid LINQ and write something more specific (and optimized). Keep your list in memory and build an array of matched indices; reuse the array so you do not have to reallocate it for each search.
The second step is to avoid searching the big list when a smaller one is enough. When the user has typed "ab" and then adds "c", you do not need to search the big list again; searching the already-filtered list is enough (and faster). Refine the search whenever possible; do not perform a full search each time (see the sketch below).
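A rough sketch of that refinement, remembering the previous query and its results (the field names are placeholders; allUsers is the full list from the question):

private string _lastQuery = "";
private List<string> _lastResults;

private List<string> Search(string query)
{
    // if the new query merely extends the previous one, every match must already
    // be in the previous result set, so filter that smaller list instead
    IEnumerable<string> source =
        _lastResults != null && _lastQuery.Length > 0 && query.StartsWith(_lastQuery)
            ? _lastResults
            : allUsers;

    var results = source.Where(u => u.Contains(query)).ToList();

    _lastQuery = query;
    _lastResults = results;
    return results;
}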
The third step may be harder: keep the data organized so it can be searched quickly. This requires changing the structure you use to store your data. Imagine a tree like this:
A      B       C
Add    Better  Ceil
Above  Bone    Contour
This may simply be implemented with an array (if you're working with ANSI names; otherwise a dictionary would be better). Build the list like this (for illustration purposes, it matches the beginning of the string):
var dictionary = new Dictionary<char, List<string>>();
foreach (var user in users)
{
char letter = user[0];
if (dictionary.ContainsKey(letter))
dictionary[letter].Add(user);
else
{
var newList = new List<string>();
newList.Add(user);
dictionary.Add(letter, newList);
}
}
Search will be then done using first character:
char letter = textBox_search.Text[0];
if (dictionary.ContainsKey(letter))
{
listBox_choices.DataSource =
new MyListWrapper(dictionary[letter].Where(x => x.Contains(textBox_search.Text)));
}
Please note I used MyListWrapper() as suggested in the first step (but I omitted my 2nd suggestion for brevity; if you choose the right size for the dictionary key you may keep each list short and fast enough to - maybe - avoid anything else). Moreover, note that you may try to use the first two characters for your dictionary (more lists, each shorter). If you extend this you'll have a tree (but I don't think you have such a big number of items).
There are many different algorithms for string searching (with related data structures), just to mention few:
Finite state automaton based search: in this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes stored search string. These are expensive to construct—they are usually created using the powerset construction—but are very quick to use.
Stubs: Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix, Boyer–Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates' approach.
Index methods: faster search algorithms are based on preprocessing of the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly.
Other variants: some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
A few words about parallel search. It's possible, but it's seldom trivial, because the overhead of making the search parallel can easily be much higher than the search itself. I wouldn't perform the search itself in parallel (partitioning and synchronization soon become too expensive and maybe complex), but I would move the search to a separate thread. If the main thread isn't busy, your users won't feel any delay while they're typing (they won't notice if the list appears 200 ms later, but they will feel it if typing itself lags). Of course the search itself must be fast enough; in this case you don't use threads to speed up the search but to keep your UI responsive. Please note that a separate thread will not make your query faster; it won't hang the UI, but if your query was slow it will still be slow on a separate thread (moreover, you have to handle multiple sequential requests too).

You could try using PLINQ (Parallel LINQ).
This does not guarantee a speed boost, though; you need to find that out by trial and error.

I doubt you'll be able to make it faster, but for sure you should:
a) Use the AsParallel LINQ extension method
b) Use some kind of timer to delay filtering
c) Put the filtering method on another thread
Keep some kind of string previousTextBoxValue somewhere. Make a timer with a delay of 1000 ms that fires the search on tick if previousTextBoxValue is the same as your textbox.Text value. If not, reassign previousTextBoxValue to the current value and reset the timer. Start the timer in the textbox changed event, and it'll make your application smoother. Filtering 120,000 records in 1-3 seconds is OK, but your UI must remain responsive.

You can also try using the BindingSource.Filter property. I have used it and it works like a charm for filtering a bunch of records; just update this property with the text being searched each time. Another option would be to use AutoCompleteSource for the TextBox control.
Hope it helps!
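A minimal sketch of the BindingSource approach; note that Filter only works when the underlying list supports it (for example a DataTable/DataView), so the names are loaded into a single-column DataTable here (control names follow the question):

private readonly DataTable _table = new DataTable();
private readonly BindingSource _binding = new BindingSource();

private void LoadUsers(IEnumerable<string> users)
{
    _table.Columns.Add("UserName");
    foreach (var user in users)
        _table.Rows.Add(user);

    _binding.DataSource = _table;
    listBox_choices.DataSource = _binding;
    listBox_choices.DisplayMember = "UserName";
}

private void textBox_search_TextChanged(object sender, EventArgs e)
{
    // remember to escape quotes and LIKE wildcards in real code
    _binding.Filter = string.Format("UserName LIKE '%{0}%'", textBox_search.Text);
}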

I would try sorting the collection, searching to match only the start of the string, and limiting the search to some maximum number of results.
So on initialization:
allUsers.Sort();
and the search becomes:
allUsers.Where(item => item.StartsWith(textBox_search.Text))
Maybe you can add some caching.

Use Parallel LINQ. PLINQ is a parallel implementation of LINQ to Objects. PLINQ implements the full set of LINQ standard query operators as extension methods in the System.Linq namespace and has additional operators for parallel operations. PLINQ combines the simplicity and readability of LINQ syntax with the power of parallel programming. Just like code that targets the Task Parallel Library, PLINQ queries scale in the degree of concurrency based on the capabilities of the host computer.
Introduction to PLINQ
Understanding Speedup in PLINQ
Also you can use Lucene.Net
Lucene.Net is a port of the Lucene search engine library, written in
C# and targeted at .NET runtime users. The Lucene search library is
based on an inverted index. Lucene.Net has three primary goals:

From what I have seen, I agree with the suggestion to sort the list.
However, sorting after the list is constructed will be very slow; sort while building it and you will get a better execution time.
Otherwise, if you don't need to display the list or keep the order, use a hash map.
The hash map will hash your string and look it up at the exact offset. It should be faster, I think.

Try the BinarySearch method; it should work faster than the Contains method.
Contains is O(n).
BinarySearch is O(log n).
I think a sorted collection should be faster to search and slower when adding new elements, but as I understand it you only have a search performance problem.

Related

C# Fastest way to determine if a string contains all elements of a list

Quick background. I have a string of words - I separate out those words into a List (I've tried HashSet; it doesn't make any difference - and you lose the ordered nature of a List).
I then manipulate the original words in many dull ways and create thousands of "new strings" - all of these strings are in a StringBuilder which has had .ToString() called on it.
At the end of the manipulation, I want to QC those new strings - and be sure that every word that was in the original set - is still somewhere in those new strings and I have not accidentally lost a word.
That original string, can run to hundreds of individual words.
Short Example:
List<string> uniqueWords = new List<string> { "two", "three", "weather sunday" };
string final = "two and tomorrow\n\rtwo or wednesday\n\rtwo with thursday\n\rtwo without friday\n\rthree gone tomorrow\n\rthree weather saturday\n\rthree timely sunday";
The output string can run to tens of millions of characters, millions of words, 200,000+ rows of data (when split). You may notice that there are words that are actually two words separated by a space - so I cannot simply split out the individual words by splitting on the space as comparing them to the original would fail, and I need to confirm the words are exactly as they appeared originally - having weather somewhere and sunday somewhere - is not the same as having 'weather sunday' - for my purposes.
The code I have tried so far and have benchmarked:
First attempt:
var allWords = uniqueWords.Where(substring => final.Contains(substring, StringComparison.CurrentCultureIgnoreCase)).ToList();
Second Attempt:
List<string> removeableList = new(uniqueWords);
foreach (var item in uniqueWords)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(item))
{
removeableList.Remove(item);
}
}
Third Attempt:
List<string> removeableList = new(uniqueWords);
for (int i = uniqueWords.Count - 1; i >= 0; i--)
{
if (removeableList.Count == 0)
{
break;
}
if (final.Contains(uniqueWords[i]))
{
removeableList.Remove(uniqueWords[i]);
}
}
These are the results:
These results are repeatable, though I will say that the First Attempt tends to fluctuate quite a lot while the Second and Third Attempts tend to remain at about the same level - the Third Attempt does seem to do better than the Second rather consistently.
Are there any options that I am missing?
I have tried it using a Regex Matches collection into a HashSet - oh that was bad, 4 times worse than the First Attempt.
If there is a way to improve the performance on this task I would love to find it.
Your attempt #1 uses CurrentCultureIgnoreCase which will be slow. But even after removing that, you are adding to the list, rather than removing, and therefore the list might need to be resized.
You are also measuring two different things: option #1 is getting the list of words which are in final, the others get the list of words which are not.
Further options include:
Use List.RemoveAll
List<string> remainingWords = new(uniqueWords);
remainingWords.RemoveAll(final.Contains); // use delegate directly, without anonymous delegate
Use a pre-sized list and use Linq
List<string> remainingWords = new(uniqueWords.Count);
remainingWords.AddRange(uniqueWords.Where(s => !final.Contains(s)));
Each of these two options can be flipped depending on what result you are trying to achieve, as mentioned.
List<string> words = new(uniqueWords);
words.RemoveAll(s => !final.Contains(s));
List<string> words = new(uniqueWords.Count);
words.AddRange(uniqueWords.Where(final.Contains)); // use delegate directly, without anonymous delegate
@Charlieface, thanks for that - I tried those; I think you have a point about adding to a list, as that appears much slower. For me it doesn't matter whether it is adding or removing; the result is a True/False return - whether the list is empty or of the size of the original list.
Sixth Attempt:
List<string> removeableList = new(uniqueWords.Count);
removeableList.AddRange(uniqueWords.Where(s => !parsedTermsComplete!.Contains(s)));
Seventh Attempt:
List<string> removeableList = new(uniqueWords);
removeableList.RemoveAll(parsedTermsComplete!.Contains);
Results in comparison to Third Attempt (fastest generally):
The adding does appear slower - and memory is a little higher for the RemoveAll but timing is consistent - bearing in mind it fluctuates depending on what Windows decides to do at any given moment...
Here is an interesting implementation of the AhoCorasickTree method - which I saw mentioned on this site somewhere else.
My knowledge of this is extremely limited, so this may not be a good implementation at all - I am not saying it is a good implementation, just that it works. This comes from a NuGet package, but I am unsure of SO's policy on NuGet package links, so I won't link it for now. In testing, creating an array was faster than creating a list.
Eighth Attempt:
var wordArray = uniqueWords.ToArray();
// walk the copy backwards so removals from uniqueWords keep the indices aligned
for (int i = wordArray.Length - 1; i >= 0; i--)
{
var keyWords = new AhoCorasickTree(new[] { wordArray[i] });
if (keyWords.Contains(parsedTermsComplete))
{
uniqueWords.RemoveAt(i);
}
}
I noticed in testing that creating a "removableList" was actually slower than creating a removableArray (found this out implementing the above Aho run). I updated the Third Attempt to incorporate this:
var removeableArray = uniqueWords.ToArray();
for (int i = removeableArray.Length -1; i >= 0; i--)
{
if (!uniqueWords.Any())
{
break;
}
if (parsedTermsComplete!.Contains(removeableArray[i]))
{
uniqueWords.RemoveAt(i);
}
}
The Benchmarks come out like this, the Third Attempt is updated to an array, the Seventh Attempt is the AhoCorasick implementation on a list, and the Eighth Attempt is the AhoCorasick implementation on an Array.
The ToArray - does seem faster than List, which is good to know.
My only issue with the AhoCorasick is that in practice - in a WASM application - it is actually much slower, so not a good option for me. But I put it here because it does seem to be much faster in benchmarks (it may be using multiple threads, where WASM is limited to one) and doesn't appear to allocate any memory, so it might be useful to someone. It is interesting that the Third Attempt also appears to allocate no memory when using an array implementation, whereas with a list it did allocate.

Iterate over strings that ".StartsWith" without using LINQ

I'm building a custom textbox to enable mentioning people in a social media context. This means that I detect when somebody types "#" and search a list of contacts for the string that follows the "#" sign.
The easiest way would be to use LINQ, with something along the lines of Members.Where(x => x.Username.StartsWith(str)). The problem is that the number of potential results can be extremely high (up to around 50,000), and performance is extremely important in this context.
What alternative solutions do I have? Is there anything similar to a dictionary (a hashtable-based solution) that would allow me to use Key.StartsWith without iterating over every single entry? If not, what would be the fastest and most efficient way to achieve this?
Do you have to show a dropdown of 50000? If you can limit your dropdown, you can for example just display the first 10.
var filteredMembers = new List<MemberClass>();
foreach (var member in Members)
{
if (member.Username.StartsWith(str)) filteredMembers.Add(member);
if (filteredMembers.Count >= 10) break;
}
Alternatively:
You can try storing all your members' usernames in a Trie, in addition to your collection. That should give you better performance than looping through all 50,000 elements.
Assuming your usernames are unique, you can store your member information in a dictionary and use the usernames as the key.
This is a tradeoff of memory for performance of course.
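A tiny sketch of that dictionary idea (it only helps for exact lookups, not for StartsWith; Members and Username follow the question):

// build once; afterwards exact-name lookups are O(1)
var membersByName = Members.ToDictionary(
    m => m.Username,
    StringComparer.OrdinalIgnoreCase);

if (membersByName.TryGetValue(str, out var member))
{
    // exact match found
}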
It is not really clear where the data is stored in the first place. Are all the names in memory or in a database?
In case you store them in database, you can just use the StartsWith approach in the ORM, which would translate to a LIKE query on the DB, which would just do its job. If you enable full text on the column, you could improve the performance even more.
Now suppose all the names are already in memory. Remember that the CPU is extremely fast, so even looping through 50,000 entries takes just a few moments.
StartsWith method is optimized and it will return false as soon as it encounters a non-matching character. Finding the ones that actually match should be pretty fast. But you can still do better.
As others suggest, you could build a trie to store all the names and be able to search for matches pretty fast, but there is a disadvantage - building the trie requires you to read all the names and create the whole data structure, which is complex. Also you would be restricted to a given set of characters, and an unexpected character would have to be dealt with separately.
You can, however, group the names into "buckets". First start with the first character and create a dictionary with the character as the key and a list of names as the value. Now you have effectively narrowed every following search approximately 26 times (assuming the English alphabet). But you don't have to stop there - you can do this at another level, for the second character in each group. And then the third, and so on.
With each level you are effectively narrowing each group significantly and the search will be much faster afterwards. But there is of course the up-front cost of building the data structure, so you always have to find the right trade-off for you. More work up-front = faster search, less work = slower search.
Finally, when the user types, each new letter narrows the target group. Hence, you can always maintain the set of relevant names for the current input and cut it down with each successive keystroke. This prevents you from having to start from the beginning each time and will improve efficiency significantly.
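A rough sketch of the first level of such bucketing (MemberClass is borrowed from the earlier answer; deeper levels would nest the same idea):

// build once, after loading the usernames
var buckets = Members
    .Where(m => !string.IsNullOrEmpty(m.Username))
    .GroupBy(m => char.ToLowerInvariant(m.Username[0]))
    .ToDictionary(g => g.Key, g => g.ToList());

// on each keystroke only the matching bucket is scanned
List<MemberClass> FindMentions(string str)
{
    if (string.IsNullOrEmpty(str) ||
        !buckets.TryGetValue(char.ToLowerInvariant(str[0]), out var bucket))
        return new List<MemberClass>();

    return bucket
        .Where(m => m.Username.StartsWith(str, StringComparison.OrdinalIgnoreCase))
        .Take(10)
        .ToList();
}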
Use BinarySearch
This is a pretty normal case, assuming that the data are stored in-memory, and here is a pretty standard way to handle it.
Use a normal List<string>. You don't need a HashTable or a SortedList. However, an IEnumerable<string> won't work; it has to be a list.
Sort the list beforehand (using LINQ, e.g. OrderBy( s => s)), e.g. during initialization or when retrieving it. This is the key to the whole approach.
Find the index of the best match using BinarySearch. Because the list is sorted, a binary search can find the best match very quickly and without scanning the whole list like Select/Where might.
Take the first N entries after the found index. Optionally you can truncate the list if not all N entries are a decent match, e.g. if someone typed "AZ" and there are only one or two items before "BA."
Example:
public static IEnumerable<string> Find(List<string> list, string firstFewLetters, int maxHits)
{
var startIndex = list.BinarySearch(firstFewLetters);
//If negative, there is no exact match. Take the bitwise complement to get the index of the closest match.
if (startIndex < 0)
{
startIndex = ~startIndex;
}
//Take maxHits items, or go till end of list
var endIndex = Math.Min(
startIndex + maxHits - 1,
list.Count-1
);
//Enumerate matching items
for ( int i = startIndex; i <= endIndex; i++ )
{
var s = list[i];
if (!s.StartsWith(firstFewLetters)) break; //This line is optional
yield return s;
}
}
Click here for a working sample on DotNetFiddle.

Is there a faster way to print to a TextBox than AppendText?

I have a background thread that is reading messages from a device, formatting them and passing them to the 'Consumer' node of a Producer/Consumer collection, which then prints all of the messages to the screen. The problem I am running into is that the logging lags a little bit behind the end of the process, so I am trying to find a faster way to print to the screen. Here is my consumer method:
private void DisplayMessages(BlockingCollection<string[]> messages)
{
try
{
foreach (var item in messages.GetConsumingEnumerable(_cancellationTokenSource.Token))
{
this.Invoke((MethodInvoker)delegate
{
outputTextBox.AppendText(String.Join(Environment.NewLine, item) + Environment.NewLine);
});
}
}
catch (OperationCanceledException)
{
//TODO:
}
}
I have done some benchmark tests on my producer methods, and even on logging to the console, and it does appear that writing to this TextBox is the slow part. During each run, I am logging ~61,000 lines that are about 60 characters long.
I have read that it is better to use .AppendText() than textBox.Text += newText, as the latter resets the entire TextBox's text. I am looking for a solution that may include a faster way to print to a TextBox (or a UI element better suited for quick logging?), or whether using String.Join(Environment.NewLine, item) is inefficient and could be sped up in any way.
Code with O(n^2) performance is expected to be slow.
If you can't use String.Join to construct the whole output once, then consider using list-oriented controls or even grids. If you have a very large number of rows, most list controls and grids have a "virtual" mode where text is actually requested on demand.
String.Join isn't inefficient. The way it is called, though, causes it to be called N times (once for each message array in the collection) instead of just once. You could avoid this by flattening the collection with SelectMany. SelectMany takes an IEnumerable of collections as input and returns each individual item from them:
var allLines = messages.SelectMany(msg => msg);
var bigText = String.Join(Environment.NewLine, allLines);
Working with such a big text box though will be very difficult. You should consider using a virtualized grid or listbox, adding filters and search functionality to make it easier for users to find the messages they want.
Another option may be to simply assign the lines to the TextBox.Lines property without creating a string:
var allLines = messages.SelectMany(msg => msg);
outputTextBox.Lines = allLines.ToArray();

Why does my object take a long time to be created?

I'm writing code that scans large sections of text and performs some basic statistics on it, such as number of upper and lower case characters, punctuation characters etc.
Originally my code looked like this:
foreach (var character in stringToCount)
{
if (char.IsControl(character))
{
controlCount++;
}
if (char.IsDigit(character))
{
digitCount++;
}
if (char.IsLetter(character))
{
letterCount++;
} //etc.
}
And then from there I was creating a new object like this, which simply reads the local variables and passes them to the constructor:
var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
lowercaseCount, upperCaseCount, tempDictionary);
However a user over on Code Review Stack Exchange pointed out that I can just do the following. Great, I've saved myself a load of code which is good.
var result = new CharacterCountResult(stringToCount.Count(char.IsControl),
stringToCount.Count(char.IsHighSurrogate), stringToCount.Count(char.IsLowSurrogate),
stringToCount.Count(char.IsWhiteSpace), stringToCount.Count(char.IsSymbol),
stringToCount.Count(char.IsPunctuation), stringToCount.Count(char.IsSeparator),
stringToCount.Count(char.IsLetter), stringToCount.Count(char.IsDigit),
stringToCount.Count(char.IsNumber), stringToCount.Count(char.IsLetterOrDigit),
stringToCount.Count(char.IsLower), stringToCount.Count(char.IsUpper), tempDictionary);
However creating the object the second way takes approximately (on my machine) an extra ~200ms.
How can this be? While it might not seem a significant amount of extra time, it soon adds up when I've left it running processing text.
What should I be doing differently?
You are using method groups (syntactic sugar hiding a lambda or delegate) and iterating over the characters many times, whereas you could get it done with one pass (as in your original code).
I remember your previous question, and I recall seeing the recommendation to use the method group and string.Count(char.IsLetterOrDigit) and thinking "yeh that looks pretty but won't perform well", so it was amusing to actually see that you found exactly that.
If performance is important, I would just do it without delegates, period, and use one giant loop with a single pass - the traditional way, without delegates or multiple iterations. You can tune it even further by organizing the logic so that any check that excludes other checks is evaluated lazily. For example, if you know a character is whitespace, then don't check for digit or alpha, etc. Or if you know it is letter-or-digit, then nest the digit and alpha checks inside that condition.
Something like:
foreach(var ch in stringToCount) {
if(char.IsWhiteSpace(ch)) {
...
}
else {
if(char.IsLetterOrDigit(ch)) {
letterOrDigit++;
if(char.IsDigit(ch)) digit++;
if(char.IsLetter(ch)) letter++;
}
}
}
If you REALLY want to micro-optimize, write a program to pre-calculate all of the options and emit a huge switch statement which does table lookups.
switch(ch) {
case 'A':
isLetter++;
isUpper++;
isLetterOrDigit++;
break;
case 'a':
isLetter++;
isLower++;
isLetterOrDigit++;
break;
case '!':
isPunctuation++;
...
}
Now if you want to get REALLY crazy, organize the switch statement according to the real-life frequency of occurrence, and put the most common letters at the top of the "tree", and so forth. Of course, if you care that much about speed, it might be a job for plain C.
But I've wandered a bit far afield from your original question. :)
The old way, you walked through the text once, increasing all of your counters as you went. The new way, you walk through the text 13 times (once for each call to stringToCount.Count()) and only update one counter per pass.
However, this kind of problem is the perfect situation for Parallel.ForEach. You can walk through the text with multiple threads (being sure your increments are thread safe) and get your totals faster.
Parallel.ForEach(stringToCount, character =>
{
if (char.IsControl(character))
{
//Interlocked.Increment gives you a thread safe ++
Interlocked.Increment(ref controlCount);
}
if (char.IsDigit(character))
{
Interlocked.Increment(ref digitCount);
}
if (char.IsLetter(character))
{
Interlocked.Increment(ref letterCount);
} //etc.
});
var result = new CharacterCountResult(controlCount, highSurrogatecount, lowSurrogateCount, whiteSpaceCount,
symbolCount, punctuationCount, separatorCount, letterCount, digitCount, numberCount, letterAndDigitCount,
lowercaseCount, upperCaseCount, tempDictionary);
It still walks through the text once, but many workers will be walking through various parts of the text at the same time.

Remove duplicates from array of struct

I can't figure out how to remove duplicate entries from an array of structs.
I have this struct:
public struct stAppInfo
{
public string sTitle;
public string sRelativePath;
public string sCmdLine;
public bool bFindInstalled;
public string sFindTitle;
public string sFindVersion;
public bool bChecked;
}
I have changed the stAppInfo struct to a class here, thanks to Jon Skeet.
The code is like this: (short version)
stAppInfo[] appInfo = new stAppInfo[listView1.Items.Count];
int i = 0;
foreach (ListViewItem item in listView1.Items)
{
appInfo[i].sTitle = item.Text;
appInfo[i].sRelativePath = item.SubItems[1].Text;
appInfo[i].sCmdLine = item.SubItems[2].Text;
appInfo[i].bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false;
appInfo[i].sFindTitle = item.SubItems[4].Text;
appInfo[i].sFindVersion = item.SubItems[5].Text;
appInfo[i].bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false;
i++;
}
I need the appInfo array to be unique on the sTitle and sRelativePath members; the other members can be duplicated.
EDIT:
Thanks to all for the answers, but this application is "portable": I just need the .exe file, and I don't want to add other files such as referenced *.dll files, so please no external references. This app is intended to be used from a pen drive.
All data comes from a *.ini file. What I do is (pseudocode):
ReadFile()
FillDataFromFileInAppInfoArray()
DeleteDuplicates()
FillListViewControl()
When I want to save that data into a file I have these options:
Using the ListView data
Using the appInfo array (is this faster?)
Any other?
EDIT2:
Big thanks to: Jon Skeet, Michael Hays thanks for your time guys!!
Firstly, please don't use mutable structs. They're a bad idea in all kinds of ways.
Secondly, please don't use public fields. Fields should be an implementation detail - use properties.
Thirdly, it's not at all clear to me that this should be a struct. It looks rather large, and not particularly "a single value".
Fourthly, please follow the .NET naming conventions so your code fits in with all the rest of the code written in .NET.
Fifthly, you can't remove items from an array, as arrays are created with a fixed size... but you can create a new array with only unique elements.
LINQ to Objects will let you do that already using GroupBy as shown by Albin, but a slightly neater (in my view) approach is to use DistinctBy from MoreLINQ:
var unique = appInfo.DistinctBy(x => new { x.sTitle, x.sRelativePath })
.ToArray();
This is generally more efficient than GroupBy, and also more elegant in my view.
Personally I generally prefer using List<T> over arrays, but the above will create an array for you.
Note that with this code there can still be two items with the same title, and there can still be two items with the same relative path - there just can't be two items with the same relative path and title. If there are duplicate items, DistinctBy will always yield the first such item from the input sequence.
EDIT: Just to satisfy Michael, you don't actually need to create an array to start with, or create an array afterwards if you don't need it:
var query = listView1.Items
.Cast<ListViewItem>()
.Select(item => new stAppInfo
{
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
bFindInstalled = item.SubItems[3].Text == "Sí",
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = item.SubItems[6].Text == "Sí"
})
.DistinctBy(x => new { x.sTitle, x.sRelativePath });
That will give you an IEnumerable<appInfo> which is lazily streamed. Note that if you iterate over it more than once, however, it will iterate over listView1.Items the same number of times, performing the same uniqueness comparisons each time.
I prefer this approach over Michael's as it makes the "distinct by" columns very clear in semantic meaning, and removes the repetition of the code used to extract those columns from a ListViewItem. Yes, it involves building more objects, but I prefer clarity over efficiency until benchmarking has proved that the more efficient code is actually required.
What you need is a Set. It ensures that the items entered into it are unique (based on some qualifier which you will set up). Here is how it is done:
First, change your struct to a class. There is really no getting around that.
Second, provide an implementation of IEqualityComparer<stAppInfo>. It may be a hassle, but it is the thing that makes your set work (which we'll see in a moment):
public class AppInfoComparer : IEqualityComparer<stAppInfo>
{
public bool Equals(stAppInfo x, stAppInfo y) {
if (ReferenceEquals(x, y)) return true;
if (x == null || y == null) return false;
return Equals(x.sTitle, y.sTitle) && Equals(x.sRelativePath,
y.sRelativePath);
}
// this part is a pain, but this one is already written
// specifically for your question.
public int GetHashCode(stAppInfo obj) {
unchecked {
return ((obj.sTitle != null
? obj.sTitle.GetHashCode() : 0) * 397)
^ (obj.sRelativePath != null
? obj.sRelativePath.GetHashCode() : 0);
}
}
}
Then, when it is time to make your set, do this:
var appInfoSet = new HashSet<stAppInfo>(new AppInfoComparer());
foreach (ListViewItem item in listView1.Items)
{
var newItem = new stAppInfo {
sTitle = item.Text,
sRelativePath = item.SubItems[1].Text,
sCmdLine = item.SubItems[2].Text,
bFindInstalled = (item.SubItems[3].Text.Equals("Sí")) ? true : false,
sFindTitle = item.SubItems[4].Text,
sFindVersion = item.SubItems[5].Text,
bChecked = (item.SubItems[6].Text.Equals("Sí")) ? true : false};
appInfoSet.Add(newItem);
}
appInfoSet now contains a collection of stAppInfo objects with unique Title/Path combinations, as per your requirement. If you must have an array, do this:
stAppInfo[] appInfo = appInfoSet.ToArray();
Note: I chose this implementation because it looks like the way you are already doing things. It has an easy-to-read for-loop (though I do not need the counter variable). It does not involve LINQ (which can be troublesome if you aren't familiar with it). It requires no external libraries outside of what the .NET Framework provides. And finally, it provides an array just as you've asked. As for reading the data in from an INI file, hopefully you can see that the only thing that will change is your foreach loop.
Update
Hash codes can be a pain. You might have been wondering why you need to compute them at all. After all, couldn't you just compare the values of the title and relative path after each insert? Well sure, of course you could, and that's exactly how another set, called SortedSet works. SortedSet makes you implement IComparer in the same way that I implemented IEqualityComparer above.
So, in this case, AppInfoComparer would look like this:
private class AppInfoComparer : IComparer<stAppInfo>
{
// return -1 if x < y, 1 if x > y, or 0 if they are equal
public int Compare(stAppInfo x, stAppInfo y)
{
var comparison = x.sTitle.CompareTo(y.sTitle);
if (comparison != 0) return comparison;
return x.sRelativePath.CompareTo(y.sRelativePath);
}
}
And then the only other change you need to make is to use SortedSet instead of HashSet:
var appInfoSet = new SortedSet<stAppInfo>(new AppInfoComparer());
It's so much easier in fact, that you are probably wondering what gives? The reason that most people choose HashSet over SortedSet is performance. But you should balance that with how much you actually care, since you'll be maintaining that code. I personally use a tool called Resharper, which is available for Visual Studio, and it computes these hash functions for me, because I think computing them is a pain, too.
(I'll talk about the complexity of the two approaches, but if you already know it, or are not interested, feel free to skip it.)
SortedSet has a complexity of O(log n): each time you enter a new item, it will effectively go to the halfway point of your set and compare. If it doesn't find your entry, it will go to the halfway point between its last guess and the group to the left or right of that guess, quickly whittling down the places for your element to hide. For a million entries, this takes about 20 attempts. Not bad at all.
But, if you've chosen a good hashing function, then HashSet can do the same job, on average, in one comparison, which is O(1). And before you think 20 is not really that big a deal compared to 1 (after all, computers are pretty quick), remember that you had to insert those million items, so while HashSet took about a million attempts to build that set up, SortedSet took several million attempts.
But there is a price - HashSet breaks down (very badly) if you choose a poor hashing function. If the hash numbers for lots of items are not unique, then they will collide in the HashSet, which will then have to try again and again. If lots of items collide with the exact same number, then they will retrace each other's steps, and you will be waiting a long time. The millionth entry will take a million times a million attempts - HashSet has devolved into O(n^2).
What's important with those big-O notations (which is what O(1), O(log n), and O(n^2) are, in fact) is how quickly the number in parentheses grows as you increase n. Slow growth or no growth is best. Quick growth is sometimes unavoidable. For a dozen or even a hundred items, the difference may be negligible - but if you can get into the habit of programming efficient functions as easily as the alternatives, then it's worth conditioning yourself to do so, as problems are cheapest to correct closest to the point where you created them.
Use LINQ2Objects, group by the things that should be unique and then select the first item in each group.
var noDupes = appInfo.GroupBy(
x => new { x.sTitle, x.sRelativePath })
.Select(g => g.First()).ToArray();
!!! Array of structs (value type) + sorting or any kind of search ==> a lot of unboxing operations.
I would suggest to stick with recommendations of Jon and Henk, so make it as a class and use generic List<T>.
Use LINQ GroupBy or DistinctBy - for me it is much simpler to use the built-in GroupBy, but it is also interesting to take a look at the other popular library; perhaps it gives you some insights.
BTW, also take a look at the LambdaComparer; it will make your life easier each time you need this kind of in-place sorting/searching, etc...
