Fastest way to intersect lists of strings in C#

I am using a HashSet<string> with LINQ Intersect() and Count() to find the intersection of two lists of strings.
Code being used:
private HashSet<string> Words { get; }
public Sentence(IEnumerable<string> words)
{
Words = words.ToHashSet();
}
public int GetSameWordCount(Sentence sentence)
{
return Words.Intersect(sentence.Words).Count();
}
The GetSameWordCount method is taking > 90% of the program runtime, as there are millions of Sentences to compare with each other.
Is there any faster way to do this?
I am using .net core 3.1.1 / C# 8 so any recent features can be used.
More info:
Input data is coming from a text file (e.g. a book excerpt, articles from the web).
Sentences are then unaccented, lowercased and split into words by a whitespace regex.
Short words (< 3 characters) are ignored.
I am creating groups of sentences which have N words in common and ordering these groups by the number of shared words.

The code below uses the HashSet<T>.Contains method directly, which avoids the intermediate set that LINQ's Intersect() allocates; HashSet<T>.Contains runs in O(1) time.
public int GetSameWordCount(Sentence sentence)
{
int count = 0;
foreach(var word in sentence.Words)
{
if(Words.Contains(word))
count++;
}
return count;
}
Note
If the word lists are sorted, you can use the merge-style approach below.
var enumerator1 = set1.GetEnumerator();
var enumerator2 = set2.GetEnumerator();
var count = 0;
if (enumerator1.MoveNext() && enumerator2.MoveNext())
{
while (true)
{
var value = enumerator1.Current.CompareTo(enumerator2.Current);
if (value == 0)
{
count++;
if (!enumerator1.MoveNext() || !enumerator2.MoveNext())
break;
}
else if (value < 0)
{
if (!enumerator1.MoveNext())
break;
}
else
{
if (!enumerator2.MoveNext())
break;
}
}
}
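A possibly tidier restatement of the merge loop above, wrapped as a method (a sketch with illustrative names; it assumes both inputs are already sorted with the same ordinal ordering, e.g. SortedSet<string> built with StringComparer.Ordinal):

```csharp
static int CountCommonSorted(IEnumerable<string> sorted1, IEnumerable<string> sorted2)
{
    using (var e1 = sorted1.GetEnumerator())
    using (var e2 = sorted2.GetEnumerator())
    {
        bool has1 = e1.MoveNext(), has2 = e2.MoveNext();
        int count = 0;
        while (has1 && has2)
        {
            int cmp = string.CompareOrdinal(e1.Current, e2.Current);
            if (cmp == 0)
            {
                // same word in both sequences; advance both sides
                count++;
                has1 = e1.MoveNext();
                has2 = e2.MoveNext();
            }
            else if (cmp < 0) has1 = e1.MoveNext(); // left is behind
            else has2 = e2.MoveNext();              // right is behind
        }
        return count;
    }
}
```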

Related

Algorithm for grouping consecutive numbers

I am trying to build an efficient algorithm that can process thousands of rows of data containing zip codes of customers. I would then want to cross-check those zip codes against a grouping of around 1000 zip codes, but I have about 100 columns of 1000 zip codes. A lot of these zip codes are consecutive numbers, but there are also a lot of random zip codes thrown in there. So what I would like to do is group consecutive zip codes together so that I can just check whether a zip code falls within a range instead of checking it against every single zip code.
Sample data -
90001
90002
90003
90004
90005
90006
90007
90008
90009
90010
90012
90022
90031
90032
90033
90034
90041
This should be grouped as follows:
{ 90001-90010, 90012, 90022, 90031-90034, 90041 }
Here's my idea for the algorithm:
public struct gRange {
public int start, end;
public gRange(int a, int? b) {
start = a;
end = b ?? a;
}
}
public List<gRange> GroupZips(string[] zips) {
List<gRange> zipList = new List<gRange>();
int currZip, prevZip, startRange, endRange;
startRange = 0;
bool inRange = false;
for(int i = 1; i < zips.Length; i++) {
currZip = Convert.ToInt32(zips[i]);
prevZip = Convert.ToInt32(zips[i-1]);
if(currZip - prevZip == 1 && inRange == false) {
inRange = true;
startRange = prevZip;
continue;
}
else if(currZip - prevZip == 1 && inRange == true) continue;
else if(currZip - prevZip != 1 && inRange == true) {
inRange = false;
endRange = prevZip;
zipList.Add(new gRange(startRange, endRange));
continue;
}
else if(currZip - prevZip != 1 && inRange == false) {
zipList.Add(new gRange(prevZip, prevZip));
}
//not sure how to handle the last case when i == zips.Length-1
}
return zipList;
}
So as of now, I am unsure of how to handle the last case, but looking at this algorithm, it doesn't strike me as efficient. Is there a better/easier way to be sorting a group of numbers like this?
Here is an O(n) solution, even if your zip codes are not guaranteed to be in order.
If you need the output groupings to be sorted, you can't do any better than O(n*log(n)) because somewhere you'll have to sort something, but if grouping the zip codes is your only concern and sorting the groups isn't required, then I'd use an algorithm like this. It makes good use of a HashSet, a Dictionary, and a doubly linked list. To my knowledge this algorithm is O(n), because I believe HashSet.Add() and HashSet.Contains() run in constant time.
Here is a working dotnetfiddle
// I'm assuming zipcodes are ints... convert if desired
// jumbled up your sample data to show that the code would still work
var zipcodes = new List<int>
{
90012,
90033,
90009,
90001,
90005,
90004,
90041,
90008,
90007,
90031,
90010,
90002,
90003,
90034,
90032,
90006,
90022,
};
// facilitate constant-time lookups of whether zipcodes are in your set
var zipHashSet = new HashSet<int>();
// lookup zipcode -> linked list node to remove item in constant time from the linked list
var nodeDictionary = new Dictionary<int, DoublyLinkedListNode<int>>();
// linked list for iterating and grouping your zip codes in linear time
var zipLinkedList = new DoublyLinkedList<int>();
// initialize our datastructures from the initial list
foreach (int zipcode in zipcodes)
{
zipLinkedList.Add(zipcode);
zipHashSet.Add(zipcode);
nodeDictionary[zipcode] = zipLinkedList.Last;
}
// object to store the groupings (ex: "90001-90010", "90022")
var groupings = new HashSet<string>();
// iterate through the linked list, but skip nodes if we group it with a zip code
// that we found on a previous iteration of the loop
var node = zipLinkedList.First;
while (node != null)
{
var bottomZipCode = node.Element;
var topZipCode = bottomZipCode;
// find the lowest zip code in this group
while (zipHashSet.Contains(bottomZipCode - 1))
{
var nodeToDel = nodeDictionary[bottomZipCode - 1];
// delete node from linked list so we don't observe any node more than once
if (nodeToDel.Previous != null)
{
nodeToDel.Previous.Next = nodeToDel.Next;
}
if (nodeToDel.Next != null)
{
nodeToDel.Next.Previous = nodeToDel.Previous;
}
// see if previous zip code is in our group, too
bottomZipCode--;
}
// get string version zip code bottom of the range
var bottom = bottomZipCode.ToString();
// find the highest zip code in this group
while (zipHashSet.Contains(topZipCode + 1))
{
var nodeToDel = nodeDictionary[topZipCode + 1];
// delete node from linked list so we don't observe any node more than once
if (nodeToDel.Previous != null)
{
nodeToDel.Previous.Next = nodeToDel.Next;
}
if (nodeToDel.Next != null)
{
nodeToDel.Next.Previous = nodeToDel.Previous;
}
// see if next zip code is in our group, too
topZipCode++;
}
// get string version zip code top of the range
var top = topZipCode.ToString();
// add grouping in correct format
if (top == bottom)
{
groupings.Add(bottom);
}
else
{
groupings.Add(bottom + "-" + top);
}
// onward!
node = node.Next;
}
// print results
foreach (var grouping in groupings)
{
Console.WriteLine(grouping);
}
** a small refactoring of the common linked list node deletion logic is in order
If Sorting is Required
A O(n*log(n)) algorithm is much simpler, because once you sort your input list the groups can be formed in one iteration of the list with no additional data structures.
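As a sketch of that simpler sort-first approach (hypothetical names; it assumes the zip codes parse as integers and dedupes them before grouping):

```csharp
// Sort, dedupe, then form ranges in a single pass over the sorted list.
static List<string> GroupSortedZips(IEnumerable<int> zips)
{
    var sorted = zips.Distinct().OrderBy(z => z).ToList();
    var groups = new List<string>();
    int i = 0;
    while (i < sorted.Count)
    {
        int start = sorted[i], end = start;
        // extend the range while the next value is consecutive
        while (i + 1 < sorted.Count && sorted[i + 1] == end + 1)
        {
            end = sorted[++i];
        }
        groups.Add(start == end ? start.ToString() : start + "-" + end);
        i++;
    }
    return groups;
}
```

On the sample data above this would produce 90001-90010, 90012, 90022, 90031-90034, 90041.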
I believe you are overthinking this one. Just using Linq against an IEnumerable can search 80,000+ records in less than 1/10 of a second.
I used the free CSV zip code list from here: http://federalgovernmentzipcodes.us/free-zipcode-database.csv
using System;
using System.IO;
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;
using System.Linq;
using System.Text;
namespace ZipCodeSearchTest
{
struct zipCodeEntry
{
public string ZipCode { get; set; }
public string City { get; set; }
}
class Program
{
static void Main(string[] args)
{
List<zipCodeEntry> zipCodes = new List<zipCodeEntry>();
string dataFileName = "free-zipcode-database.csv";
using (FileStream fs = new FileStream(dataFileName, FileMode.Open, FileAccess.Read))
using (StreamReader sr = new StreamReader(fs))
while (!sr.EndOfStream)
{
string line = sr.ReadLine();
string[] lineVals = line.Split(',');
zipCodes.Add(new zipCodeEntry { ZipCode = lineVals[1].Trim(' ', '\"'), City = lineVals[3].Trim(' ', '\"') });
}
bool terminate = false;
while (!terminate)
{
Console.WriteLine("Enter zip code:");
var userEntry = Console.ReadLine();
if (userEntry.ToLower() == "x" || userEntry.ToLower() == "q")
terminate = true;
else
{
DateTime dtStart = DateTime.Now;
foreach (var arrayVal in zipCodes.Where(z => z.ZipCode == userEntry.PadLeft(5, '0')))
Console.WriteLine(string.Format("ZipCode: {0}", arrayVal.ZipCode).PadRight(20, ' ') + string.Format("City: {0}", arrayVal.City));
DateTime dtStop = DateTime.Now;
Console.WriteLine();
Console.WriteLine("Lookup time: {0}", dtStop.Subtract(dtStart).ToString());
Console.WriteLine("\n\n");
}
}
}
}
}
In this particular case, it is quite possible that a hash will be faster. However, the range-based solution will use a lot less memory, so it would be appropriate if your lists were very large (and I'm not convinced that there are enough possible zipcodes for any list of zipcodes to be large enough.)
Anyway, here's a simpler logic for making the range list and finding if a target is in a range:
Make ranges a simple list of integers (or even zipcodes), and push the first element of zip as its first element.
For each element of zip except the last one, if that element plus one is not the same as the next element, add both that element plus one and the next element to ranges.
Push one more than the last element of zip at the end of ranges.
Now, to find out if a zipcode is in ranges, do a binary search into ranges for the smallest element which is greater than the target zipcode. [Note 1] If the index of that element is odd, then the target is in one of the ranges, otherwise it isn't.
Notes:
AIUI, the BinarySearch method on a C# list returns the index of the element found or the complement of the index of the first larger element. To get the result needed by the suggested algorithm, you could use something like index >= 0 ? index + 1 : ~index, but it might be simpler to just search for the zipcode one less than the target and then use the complement of the low-order bit of the result.
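A sketch of the boundary-list idea in C# (illustrative names; zip must be sorted and distinct, as the steps above assume):

```csharp
// Build the boundary list: consecutive entries alternately open and
// close ranges, so each [even, odd) index pair brackets one run.
static List<int> BuildRanges(List<int> zip)
{
    var ranges = new List<int> { zip[0] };
    for (int i = 0; i < zip.Count - 1; i++)
    {
        if (zip[i] + 1 != zip[i + 1])
        {
            ranges.Add(zip[i] + 1);   // one past the end of the current run
            ranges.Add(zip[i + 1]);   // start of the next run
        }
    }
    ranges.Add(zip[zip.Count - 1] + 1);
    return ranges;
}

// A target is inside some range iff the index of the smallest element
// greater than it is odd, using the BinarySearch adjustment noted above.
static bool InRanges(List<int> ranges, int target)
{
    int index = ranges.BinarySearch(target);
    int firstGreater = index >= 0 ? index + 1 : ~index;
    return (firstGreater & 1) == 1;
}
```

For the sample data, BuildRanges yields [90001, 90011, 90012, 90013, 90022, 90023, 90031, 90035, 90041, 90042]; querying 90005 finds 90011 at odd index 1 (in a range), while 90011 itself finds 90012 at even index 2 (not in a range).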

Optimizing counting characters within a string

I just created a simple method to count occurrences of each character within a string, without taking caps into account.
static List<int> charactercount(string input)
{
char[] characters = "abcdefghijklmnopqrstuvwxyz".ToCharArray();
input = input.ToLower();
List<int> counts = new List<int>();
foreach (char c in characters)
{
int count = 0;
foreach (char c2 in input) if (c2 == c)
{
count++;
}
counts.Add(count);
}
return counts;
}
Is there a cleaner way to do this (i.e. without creating a character array to hold every character in the alphabet) that would also take into account numbers, other characters, caps, etc?
Conceptually, I would prefer to return a Dictionary<string,int> of counts. Assuming it's OK to know by omission, rather than by an explicit count of 0, that a character occurs zero times, you can do it via LINQ. @Oded has given you a good start on how to do that; all you would need to do is replace the Select() with ToDictionary( k => k.Key, v => v.Count() ). See my comment on his answer about doing the case-insensitive grouping. Note: you should decide whether you care about cultural differences in characters and adjust the ToLower method accordingly.
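For illustration, the ToDictionary version described above might look like this (a sketch; lowercasing before grouping stands in for the case-insensitive comparer, and culture handling should be adjusted as noted):

```csharp
var counts = input
    .GroupBy(c => char.ToLowerInvariant(c))
    .ToDictionary(g => g.Key.ToString(), g => g.Count());
```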
You can also do this without LINQ:
public static Dictionary<string, int> CountCharacters(string input)
{
var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
foreach (var c in input)
{
var key = c.ToString();
int count;
counts.TryGetValue(key, out count);
counts[key] = count + 1;
}
return counts;
}
Note that if you wanted a Dictionary<char,int>, you could easily do that by creating a case-invariant character comparer and using it as the IEqualityComparer<T> for a dictionary of the required type. I've used string keys for simplicity in the example.
Again, adjust the type of the comparer to be consistent with how you want to handle culture.
Using GroupBy and Select:
aString.GroupBy(c => c).Select(g => new { Character = g.Key, Num = g.Count() })
The returned anonymous type list will contain each character and the number of times it appears in the string.
You can then filter it in any way you wish, using the static methods defined on Char.
Your code is kind of slow because you are looping through the range a-z instead of just looping through the input.
If you only need to count letters (like your code suggests), the fastest way to do it would be:
int[] CountCharacters(string text)
{
var counts = new int[26];
for (var i = 0; i < text.Length; i++)
{
var charIndex = text[i] - 'a';
counts[charIndex] = counts[charIndex] + 1;
}
return counts;
}
Note that you need to add some thing like verify the character is in the range, and convert it to lowercase when needed, or this code might throw exceptions. I'll leave those for you to add. :)
Based on @Ran's answer, here is a version that avoids an IndexOutOfRangeException:
static readonly int differ = 'a';
int[] CountCharacters(string text) {
text = text.ToLower();
var counts = new int[26];
for (var i = 0; i < text.Length; i++) {
var charIndex = text[i] - differ;
// only count chars between 'a' and 'z'
if(charIndex >= 0 && charIndex < 26)
counts[charIndex] += 1;
}
return counts;
}
Using a Dictionary and/or LINQ is not as fast as counting characters with a low-level array.

How to get the string that is most repeated in a list

I have a lot of lists like the following:
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[1]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[2]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[3]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[3]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[4]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[4]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[5]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[5]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[6]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[7]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[8]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[8]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[9]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[9]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[11]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[12]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[12]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[13]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[13]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[14]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[14]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[15]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[15]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[16]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[16]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[17]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[18]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[18]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[19]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[19]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[20]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[20]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[21]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[2]/div[1]/div[6]/div[1]/div[2]/ul[2]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[22]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[22]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[23]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[23]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[24]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[24]/div[2]/div[4]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[25]/div[2]/h4[1]
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[25]/div[2]/div[4]
And I need to extract the portion that is most repeated in each line, which in this case is
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li
What's the best way to do this?
I'm using C#/.net
thanks!
If I understand your question correctly, what you want is the longest common prefix of all lines. You could obtain it by doing something like that:
void Main()
{
string path = @"D:\tmp\so5670107.txt";
string[] lines = File.ReadAllLines(path);
string prefix = LongestCommonPrefix(lines);
Console.WriteLine(prefix);
}
static string LongestCommonPrefix(string a, string b)
{
int length = 0;
for (int i = 0; i < a.Length && i < b.Length; i++)
{
if (a[i] == b[i])
length++;
else
break;
}
return a.Substring(0, length);
}
static string LongestCommonPrefix(IEnumerable<string> strings)
{
return strings.Aggregate(LongestCommonPrefix);
}
The result is:
/html[1]/body[1]/div[5]/div[1]/div[2]/div[
(the expected result you give in the question seems incorrect, since there are lines that don't match it)
I chose a naive approach for the sake of simplicity, but of course there are more efficient ways of finding the longest common prefix between two strings (using a dichotomic search for instance)
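For the curious, a binary-search ("dichotomic") variant of the two-string LCP might look like the sketch below: it binary-searches the prefix length instead of scanning char by char, exploiting the fact that prefix equality is monotone in the length.

```csharp
static string LongestCommonPrefixBinary(string a, string b)
{
    int lo = 0, hi = Math.Min(a.Length, b.Length);
    while (lo < hi)
    {
        int mid = (lo + hi + 1) / 2;
        // If the first mid chars match, every shorter prefix matches too,
        // so the answer lies in [mid, hi]; otherwise it lies in [lo, mid-1].
        if (string.CompareOrdinal(a, 0, b, 0, mid) == 0)
            lo = mid;
        else
            hi = mid - 1;
    }
    return a.Substring(0, lo);
}
```

This mainly pays off when the strings (and their common prefixes) are very long; for short paths the naive scan is simpler and perfectly adequate.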
You could do this with a loop. Assumption is that your list of strings is in a collection called paths:
var countByPath = new Dictionary<string, int>();
foreach (var path in paths)
{
if (!countByPath.ContainsKey(path))
{
countByPath[path] = 1;
}
else
{
countByPath[path]++;
}
}
The longest substring that is repeated in the list? Assumption is that your list of strings is in a collection called paths:
var currentChoice = "";
foreach (var path in paths)
{
for (int i = path.Length; i > 0; i--)
{
var candidate = path.Substring(0, i);
if (i > currentChoice.Length &&
paths.Count(p => p.StartsWith(candidate)) > 1)
currentChoice = candidate;
else
break;
}
}
Console.WriteLine(currentChoice);
The result is then
/html[1]/body[1]/div[5]/div[1]/div[2]/div[3]/div[1]/div[3]/div[1]/div[2]/div[3]/ul[1]/li[10]
since it is repeated twice
There is already an algorithm for this. I can't remember what it's called, but here is a language-independent description. It works in the following way:
Read the first line.
Read the second line. If the second line is the same as the first line, increase a counter by one; otherwise keep the counter at zero.
Carry on reading lines: if three lines are the same (i.e. repeat), your counter will be 2. If the next line is different to the previous three, decrease the counter by 1.
E.g.
String1 - Counter: 0
String1 - Counter: 1 (Store String1 in a variable)
String1 - Counter: 2 (Store String1 in same variable)
String2 - Counter: 1 (Still store String1 in variable)
I hope this makes sense. I did this at uni a few years ago. I can't remember the mathematician who came up with the algorithm, but it's fairly old.
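The description above appears to match what is usually called the Boyer-Moore majority vote algorithm; a sketch follows (note it only yields the correct answer when some line really does occur in a strict majority, so in general a second verification pass over the list is needed):

```csharp
static string FindMajorityCandidate(IEnumerable<string> lines)
{
    string candidate = null;
    int counter = 0;
    foreach (var line in lines)
    {
        if (counter == 0) { candidate = line; counter = 1; } // adopt new candidate
        else if (line == candidate) counter++;               // same line: reinforce
        else counter--;                                      // different line: cancel
    }
    return candidate; // verify with a second counting pass before trusting it
}
```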

Fast string suffix checking in C# (.NET 4.0)?

What is the fastest method of checking string suffixes in C#?
I need to check each string in a large list (anywhere from 5000 to 100000 items) for a particular term. The term is guaranteed never to be embedded within the string. In other words, if the string contains the term, it will be at the end of the string. The string is also guaranteed to be longer than the suffix. Cultural information is not important.
These are how different methods performed against 100000 strings (half of them have the suffix):
1. Substring Comparison - 13.60ms
2. String.Contains - 22.33ms
3. CompareInfo.IsSuffix - 24.60ms
4. String.EndsWith - 29.08ms
5. String.LastIndexOf - 30.68ms
These are average times. [Edit] Forgot to mention that the strings also get put into separate lists, but this is not important. It does add to the running time though.
On my system substring comparison (extracting the end of the string using the String.Substring method and comparing it to the suffix term) is consistently the fastest when tested against 100000 strings. The problem with using substring comparison though is that Garbage Collection can slow it down considerably (more than the other methods) because String.Substring creates new strings. The effect is not as bad in .NET 4.0 as it was in 3.5 and below, but it is still noticeable. In my tests, String.Substring performed consistently slower on sets of 12000-13000 strings. This will obviously differ between systems and implementations.
[EDIT]
Benchmark code:
http://pastebin.com/smEtYNYN
[EDIT]
FlyingStreudel's code runs fast, but Jon Skeet's recommendation of using EndsWith in conjunction with StringComparison.Ordinal appears to be the best option.
If that's the time taken to check 100,000 strings, does it really matter?
Personally I'd use string.EndsWith on the grounds that it's the most descriptive: it says exactly what you're trying to test.
I'm somewhat suspicious of the fact that it appears to be performing worst though... if you could post your benchmark code, that would be very useful. (In particular, it really shouldn't have to do as much work as string.Contains.)
Have you tried specifying an ordinal match? That may well make it significantly faster:
if (x.EndsWith(y, StringComparison.Ordinal))
Of course, you shouldn't do that unless you want an ordinal comparison - are you expecting culturally-sensitive matches? (Developers tend not to consider this sort of thing, and I very firmly include myself in that category.)
Jon is absolutely right; this is potentially not an apples-to-apples comparison because different string methods have different defaults for cultural sensitivity. Be very sure that you are getting the comparison semantics you intend in each one.
In addition to Jon's answer, I'd add that the relevant question is not "which is fastest?" but rather "which is too slow?" What's your performance goal for this code? The slowest method still finds the result in less time than it takes a movie projector to advance to the next frame, and obviously that is not noticeable by humans. If your goal is that the search appears instantaneous to the user, then you're done; any of those methods works. If your goal is that the search take less than a millisecond, then none of those methods works; they are all orders of magnitude too slow. What's the budget?
I took a look at your benchmark code and frankly, it looks dodgy.
You are measuring all kinds of extraneous things along with what it is you want to measure; you're measuring the cost of the foreach and the adding to a list, both of which might have costs of the same order of magnitude as the thing you are attempting to test.
Also, you are not throwing out the first run; remember, the JIT compiler is going to jit the code that you call the first time through the loop, and it is going to be hot and ready to go the second time, so your results will therefore be skewed; you are averaging one potentially very large thing with many small things. In the past when I have done this I have discovered situations where the jit time actually dominated the time of everything else. Is that realistic? Do you mean to measure the jit time, or should it be not considered as part of the average?
I dunno how fast this is, but this is what I would do:
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
edit: OOPS x2
edit: So I wrote my own little benchmark... does this count? It runs 25 trials of evaluating one million strings and takes the average of the difference in performance. The handful of times I ran it it was consistently outputting that CharCompare was faster by ~10-40ms over one million records. So that is a hugely unimportant increase in efficiency (.000000001s/call) :) All in all I doubt it will matter which method you implement.
class Program
{
volatile static List<string> strings;
static double[] results = new double[25];
static void Main(string[] args)
{
strings = new List<string>();
Random r = new Random();
for (int rep = 0; rep < 25; rep++)
{
Console.WriteLine("Run " + rep);
strings.Clear();
for (int i = 0; i < 1000000; i++)
{
string temp = "";
for (int j = 0; j < r.Next(3, 101); j++)
{
temp += Convert.ToChar(
Convert.ToInt32(
Math.Floor(26 * r.NextDouble() + 65)));
}
if (i % 4 == 0)
{
temp += "abc";
}
strings.Add(temp);
}
OrdinalWorker ow = new OrdinalWorker(strings);
CharWorker cw = new CharWorker(strings);
if (rep % 2 == 0)
{
cw.Run();
ow.Run();
}
else
{
ow.Run();
cw.Run();
}
Thread.Sleep(1000);
results[rep] = ow.finish.Subtract(cw.finish).TotalMilliseconds;
}
double tDiff = 0;
for (int i = 0; i < 25; i++)
{
tDiff += results[i];
}
double average = tDiff / 25;
if (average < 0)
{
average = average * -1;
Console.WriteLine("Char compare faster by {0}ms average",
average.ToString().Substring(0, 4));
}
else
{
Console.WriteLine("EndsWith faster by {0}ms average",
average.ToString().Substring(0, 4));
}
}
}
class OrdinalWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public OrdinalWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() => {
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (list[i].EndsWith(suffix, StringComparison.Ordinal)) ?
count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
}
class CharWorker
{
List<string> list;
int count;
public Thread t;
public DateTime finish;
public CharWorker(List<string> l)
{
list = l;
}
public void Run()
{
t = new Thread(() =>
{
string suffix = "abc";
for (int i = 0; i < list.Count; i++)
{
count = (HasSuffix(list[i], suffix)) ? count + 1 : count;
}
finish = DateTime.Now;
});
t.Start();
}
static bool HasSuffix(string check, string suffix)
{
int offset = check.Length - suffix.Length;
for (int i = 0; i < suffix.Length; i++)
{
if (check[offset + i] != suffix[i])
{
return false;
}
}
return true;
}
}
Did you try direct access?
I mean, you can make a loop that compares characters in place; it could be faster than making a substring while having the same behaviour.
foreach (string testing in lists)
{
int offset = testing.Length - PATTERN.Length;
if (offset < 0)
continue;
bool ok = true;
for (int j = 0; j < PATTERN.Length; j++)
{
if (testing[offset + j] != PATTERN[j])
{
ok = false;
break;
}
}
if (ok) return testing;
}
Moreover, if the strings are big, you could try using hashes.
I don't profess to be an expert in this area, however I felt compelled to at least profile this to some extent (knowing full well that my fictitious scenario will differ substantially from your own) and here is what I came up with:
It seems, at least for me, EndsWith takes the lead with LastIndexOf consistently coming in second, some timings are:
SubString: 00:00:00.0191877
Contains: 00:00:00.0201980
CompareInfo: 00:00:00.0255181
EndsWith: 00:00:00.0120296
LastIndexOf: 00:00:00.0133181
These were gleaned from processing 100,000 strings where the desired suffix appeared in all strings, which to me simply echoes Jon's answer (where the benefit is both speed and descriptiveness). And the code used to come to these results:
class Program
{
class Profiler
{
private Stopwatch Stopwatch = new Stopwatch();
public TimeSpan Elapsed { get { return Stopwatch.Elapsed; } }
public void Start()
{
Reset();
Stopwatch.Start();
}
public void Stop()
{
Stopwatch.Stop();
}
public void Reset()
{
Stopwatch.Reset();
}
}
static string suffix = "_sfx";
static Profiler profiler = new Profiler();
static List<string> input = new List<string>();
static List<string> output = new List<string>();
static void Main(string[] args)
{
GenerateSuffixedStrings();
FindStringsWithSuffix_UsingSubString(input, suffix);
Console.WriteLine("SubString: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingContains(input, suffix);
Console.WriteLine("Contains: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingCompareInfo(input, suffix);
Console.WriteLine("CompareInfo: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingEndsWith(input, suffix);
Console.WriteLine("EndsWith: {0}", profiler.Elapsed);
FindStringsWithSuffix_UsingLastIndexOf(input, suffix);
Console.WriteLine("LastIndexOf: {0}", profiler.Elapsed);
Console.WriteLine();
Console.WriteLine("Press any key to exit...");
Console.ReadKey();
}
static void GenerateSuffixedStrings()
{
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() + suffix);
}
}
static void FindStringsWithSuffix_UsingSubString(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if(s.Substring(s.Length - 4) == suffix)
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingContains(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.Contains(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingCompareInfo(IEnumerable<string> strings, string suffix)
{
var ci = CompareInfo.GetCompareInfo("en-GB");
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (ci.IsSuffix(s, suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingEndsWith(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.EndsWith(suffix))
output.Add(s);
}
profiler.Stop();
}
static void FindStringsWithSuffix_UsingLastIndexOf(IEnumerable<string> strings, string suffix)
{
output.Clear();
profiler.Start();
foreach (var s in strings)
{
if (s.LastIndexOf(suffix) == s.Length - 4)
output.Add(s);
}
profiler.Stop();
}
}
EDIT:
As commented, I attempted this again with only some of the strings having a suffix applied and these are the results:
SubString: 00:00:00.0079731
Contains: 00:00:00.0243696
CompareInfo: 00:00:00.0334056
EndsWith: 00:00:00.0196668
LastIndexOf: 00:00:00.0229599
The string generator method was updated as follows, to produce the strings:
static void GenerateSuffixedStrings()
{
var rnd = new Random();
for (var i = 0; i < 100000; i++)
{
input.Add(Guid.NewGuid().ToString() +
(rnd.Next(0, 2) == 0 ? suffix : string.Empty));
}
}
Further, this trend continues if none of the strings have a suffix:
SubString: 00:00:00.0055584
Contains: 00:00:00.0187089
CompareInfo: 00:00:00.0228983
EndsWith: 00:00:00.0114227
LastIndexOf: 00:00:00.0199328
However, this gap shortens again when assigning a quarter of the inputs a suffix (the first quarter, then sorting to randomise the coverage):
SubString: 00:00:00.0302997
Contains: 00:00:00.0305685
CompareInfo: 00:00:00.0306335
EndsWith: 00:00:00.0351229
LastIndexOf: 00:00:00.0322899
Conclusion? IMO, and agreeing with Jon, EndsWith seems the way to go (based on this limited test, anyway).
Further Edit:
To cure Jon's curiosity I ran a few more tests on EndsWith, with and without Ordinal string comparison...
On 100,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.0795617
OrdinalEndsWith: 00:00:00.0240631
On 1,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:00.5460591
OrdinalEndsWith: 00:00:00.2807860
On 10,000,000 strings with a quarter of them suffixed:
EndsWith: 00:00:07.5889581
OrdinalEndsWith: 00:00:03.3248628
Note that I only ran the last test once, as generating the strings proved this laptop is in need of a replacement.
There's a lot of good information here. I wanted to note that if your suffix is short, it could be even faster to look at the last few characters individually. My modified version of the benchmark code in question is here: http://pastebin.com/6nNdbEvW. It gives these results:
Last char equality: 1.52 ms (50000)
Last 2 char equality: 1.56 ms (50000)
EndsWith using StringComparison.Ordinal: 3.75 ms (50000)
Contains: 11.10 ms (50000)
LastIndexOf: 14.85 ms (50000)
IsSuffix: 11.30 ms (50000)
Substring compare: 17.69 ms (50000)
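The "last few characters" idea above can be sketched as a plain loop; the method name here is my own and is not part of the benchmark code:

```csharp
// Compares the last suffix.Length characters directly, one ordinal
// char comparison at a time, with no culture-aware machinery at all.
static bool EndsWithOrdinalManual(string s, string suffix)
{
    if (s.Length < suffix.Length)
        return false;
    for (int i = 0; i < suffix.Length; i++)
    {
        if (s[s.Length - suffix.Length + i] != suffix[i])
            return false;
    }
    return true;
}
```

For a fixed short suffix this is essentially what the "last char equality" rows in the table measure.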

Testing for repeated characters in a string

I'm doing some work with strings, and I have a scenario where I need to determine if a string (usually a small one < 10 characters) contains repeated characters.
`ABCDE` // does not contain repeats
`AABCD` // does contain repeats, ie A is repeated
I can loop through the string.ToCharArray() and test each character against every other character in the char[], but I feel like I am missing something obvious.... maybe I just need coffee. Can anyone help?
EDIT:
The string will be sorted, so order is not important, e.g. ABCDA => AABCD.
The frequency of repeats is also important, so I need to know whether the repeat is a pair or a triplet, etc.
If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.
Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)
Bonus: here's a slick LINQ solution to implement Jon's functionality:
int longestRun =
s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();
So, OK, it's not very fast! You got a problem with that?!
:-)
If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.
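That set-based idea can be sketched like this (my code, not Jon's), relying on HashSet&lt;char&gt;.Add returning false when the element is already present:

```csharp
using System.Collections.Generic;

// Returns true if no character occurs twice; stops at the first repeat.
static bool AllCharactersUnique(string text)
{
    var seen = new HashSet<char>();
    foreach (char c in text)
    {
        if (!seen.Add(c)) // Add returns false if c was already in the set
            return false;
    }
    return true;
}
```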
EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:
public static bool IsSortedNoRepeats(string text)
{
if (text.Length == 0)
{
return true;
}
char current = text[0];
for (int i=1; i < text.Length; i++)
{
char next = text[i];
if (next <= current)
{
return false;
}
current = next;
}
return true;
}
A shorter alternative if you don't mind repeating the indexer use:
public static bool IsSortedNoRepeats(string text)
{
for (int i=1; i < text.Length; i++)
{
if (text[i] <= text[i-1])
{
return false;
}
}
return true;
}
EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.
First a string-specific version:
public static int LongestRun(string text)
{
if (text.Length == 0)
{
return 0;
}
char current = text[0];
int currentRun = 1;
int bestRun = 0;
for (int i=1; i < text.Length; i++)
{
if (current != text[i])
{
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = text[i];
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Now we can also do this as a general extension method on IEnumerable<T>:
public static int LongestRun<T>(this IEnumerable<T> source)
{
bool first = true;
T current = default(T);
int currentRun = 0;
int bestRun = 0;
foreach (T element in source)
{
if (first || !EqualityComparer<T>.Default.Equals(element, current))
{
first = false;
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = element;
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Then you can call "AABCD".LongestRun() for example.
This will tell you very quickly if a string contains duplicates:
bool containsDups = s.Length != s.Distinct().Count(); // e.g. s = "ABCDEA"
It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...
Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.
Since you're using 3.5, you could do this in one LINQ query:
var results = stringInput
.ToCharArray() // not actually needed, I've left it here to show what's actually happening
.GroupBy(c=>c)
.Where(g=>g.Count()>1)
.Select(g=>new {Letter=g.First(),Count=g.Count()})
;
For each character that appears more than once in the input, this will give you the character and the count of occurrences.
I think the easiest way to achieve that is to use this simple regex
bool foundMatch = Regex.IsMatch(yourString, @"(\w)\1");
If you need more information about the match (start, length etc)
Match match = null;
string testString = "ABCDE AABCD";
match = Regex.Match(testString, #"(\w)\1+?");
if (match.Success)
{
string matchText = match.Value; // AA
int matchIndnex = match.Index; // 6
int matchLength = match.Length; // 2
}
How about something like:
string strString = "AA BRA KA DABRA";
var grp = from c in strString.ToCharArray()
group c by c into m
select new { Key = m.Key, Count = m.Count() };
foreach (var item in grp)
{
Console.WriteLine(
string.Format("Character:{0} Appears {1} times",
item.Key.ToString(), item.Count));
}
Update: now that frequency matters, you'd need an array of counters rather than single bits to maintain a count.
Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. The mapping between bit array index and character set is up to you to decide. Break if you see that a particular bit is on already.
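A minimal sketch of that bit-vector approach, assuming the legal character set is uppercase A-Z (the index mapping is my choice; the answer leaves it open):

```csharp
// One bit per letter in a 32-bit int; returns true as soon as a bit
// is already set, i.e. the character has been seen before.
static bool HasRepeatedLetter(string text)
{
    int bits = 0;
    foreach (char c in text)
    {
        int bit = 1 << (c - 'A'); // assumes input is limited to 'A'..'Z'
        if ((bits & bit) != 0)
            return true;
        bits |= bit;
    }
    return false;
}
```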
/(.).*\1/
(or whatever the equivalent is in your regex library's syntax)
Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...
I started looking for some info on the net and I got to the following solution.
string input = "aaaaabbcbbbcccddefgg";
char[] chars = input.ToCharArray();
Dictionary<char, int> dictionary = new Dictionary<char,int>();
foreach (char c in chars)
{
if (!dictionary.ContainsKey(c))
{
dictionary[c] = 1; // first occurrence of this letter
}
else
{
dictionary[c]++;
}
}
foreach (KeyValuePair<char, int> combo in dictionary)
{
if (combo.Value > 1) // If the value of the key is greater than 1, the letter is repeated
{
Console.WriteLine("Letter " + combo.Key + " " + "is repeated " + combo.Value.ToString() + " times");
}
}
I hope it helps. I had a job interview in which the interviewer asked me to solve this, and I understand it is a common question.
When there is no order to rely on, you could use a dictionary to keep the counts:
String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
if (!result.ContainsKey(c))
{
result[c] = 0; // initialize the counter in the result
}
result[c]++;
}
foreach (var charCombo in result)
{
Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);
}
The hash solution Jon was describing is probably the best. You could use a HybridDictionary, since that works well with both small and large data sets, with the letter as the key and the frequency as the value. (Update the frequency every time the add fails, i.e. whenever .Contains(key) returns true.)
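A rough sketch of that HybridDictionary idea; the method name is mine, and the non-generic indexer means the counts are boxed ints:

```csharp
using System.Collections.Specialized;

// HybridDictionary uses a ListDictionary while small and switches to a
// Hashtable as the element count grows, so it suits both cases here.
static HybridDictionary CountCharacters(string text)
{
    var counts = new HybridDictionary();
    foreach (char c in text)
    {
        if (counts.Contains(c))
            counts[c] = (int)counts[c] + 1; // repeated letter: bump frequency
        else
            counts[c] = 1; // first occurrence
    }
    return counts;
}
```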
