I know I can get whether 2 strings are equal in content, but I need to be able to get the number of characters that differ in the result of comparing 2 string values.
For instance:
"aaaBaaaCaaaDaaaEaaa"
"aaaXaaaYaaaZaaaEaaa"
so the answer is 3 for this case.
Is there an easy way to do this, using regex, linq or any other way?
EDIT: Also the strings are VERY long. Say 10k+ characters.
In case there are inserts and deletes:
Levenshtein distance
and here's the C# implementation
You can use LINQ:
string a = "aaaBaaaCaaaDaaaEaaa";
string b = "aaaXaaaYaaaZaaaEaaa";
int result = a.Zip(b, (x, y) => x == y).Count(z => !z)
+ Math.Abs(a.Length - b.Length);
A solution with a loop is probably more efficient though.
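Hey, look at this: http://en.wikipedia.org/wiki/Hamming_distance
Hamming distance counts the positions at which two equal-length strings differ, so it only covers replacements. If you want to count deletions and insertions as well, you need an edit distance such as Levenshtein.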
I would simply loop over the character arrays, adding up a counter for each difference.
This will not account for strings with different lengths, however.
If both strings have the same length and do not have complicated Unicode characters like surrogates, you can loop through each character and increment a counter if the characters at that index in each string are different.
It is theoretically impossible to do it any faster. (You need to check every single character)
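A minimal sketch of the loop described above. The class and method names are made up for illustration, and it assumes that any extra trailing characters in the longer string each count as one difference (the same convention as the LINQ answer). It compares UTF-16 code units, so surrogate pairs are not treated specially.

```csharp
using System;

public static class StringDiff
{
    // Counts the positions at which a and b differ. Characters beyond the
    // end of the shorter string are each counted as one difference.
    public static int CountDifferences(string a, string b)
    {
        int common = Math.Min(a.Length, b.Length);
        int count = Math.Abs(a.Length - b.Length);
        for (int i = 0; i < common; i++)
        {
            if (a[i] != b[i]) count++;
        }
        return count;
    }
}
```

For the two strings in the question, StringDiff.CountDifferences returns 3. It is a single pass, so 10k+ characters should not be a problem.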
I have a list in which each element is a string that contains a date and an integer in a specific format: yyyyMMdd_number.
List<string> listStr = new List<string> { "20170822_10", "20170821_1", "20170823_4", "20170821_10", "20170822_11", "20170822_5",
"20170822_2", "20170821_3", "20170823_6", "20170823_21", "20170823_20", "20170823_2"};
When I call listStr.Sort(); the result is as below:
20170821_1
20170821_10
20170821_3
20170822_10
20170822_11
20170822_2
20170822_5
20170823_2
20170823_20
20170823_21
20170823_4
20170823_6
Expected Output:
20170821_1
20170821_3
20170821_10
20170822_2
20170822_5
20170822_10
20170822_11
20170823_2
20170823_4
20170823_6
20170823_20
20170823_21
My idea: split every string (date_number) at the underscore, then compare and sort by the number.
But please suggest a LINQ solution or a better way to sort in this case.
Since the dates are in the format that can be ordered lexicographically, you could sort by the date prefix using string ordering, and resolve ties by parsing the integer:
var sorted = listStr
.OrderBy(s => s.Split('_')[0])
.ThenBy(s => int.Parse(s.Split('_')[1]));
I imagine any numeric ordering would first require converting the value to a numeric type. So you could split on the underscore, sort by the first value, then by the second value. Something like this:
list.OrderBy(x => x.Split('_')[0]).ThenBy(x => int.Parse(x.Split('_')[1]))
You could improve this, if necessary, by creating a class which takes the string representation on its constructor and provides the numeric representations (and the original string representation) as properties. Then .Select() into a list of that class and sort. That class could internally do type checking, range checking, etc.
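A rough sketch of that class-based idea follows. DatedEntry and DatedEntrySorter are hypothetical names, and the validation shown (exactly one underscore, a parsable integer) is only one possible choice. Each string is split once instead of once per key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper that parses "yyyyMMdd_number" once in its constructor.
public sealed class DatedEntry
{
    public string Original { get; }
    public string Date { get; }
    public int Number { get; }

    public DatedEntry(string text)
    {
        string[] parts = text.Split('_');
        if (parts.Length != 2)
            throw new FormatException("Expected yyyyMMdd_number: " + text);
        Original = text;
        Date = parts[0];
        Number = int.Parse(parts[1]); // throws FormatException on bad input
    }
}

public static class DatedEntrySorter
{
    // Sorts by the date prefix (ordinal string order), then by the number.
    public static List<string> Sort(IEnumerable<string> items)
    {
        return items.Select(s => new DatedEntry(s))
                    .OrderBy(e => e.Date, StringComparer.Ordinal)
                    .ThenBy(e => e.Number)
                    .Select(e => e.Original)
                    .ToList();
    }
}
```

Calling DatedEntrySorter.Sort(listStr) on the list from the question should produce the expected output shown there.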
The answers above are much easier to follow / understand, but purely as an alternative for academic interest, you could do the following:
var sorted = listStr.OrderBy(x => Convert.ToInt32(x.Split('_')[0])*100 + Convert.ToInt32(x.Split('_')[1]));
It works on the premise that the suffix part after the underscore is going to be less than 100, and turns the two elements of the string into an integer with the relative 'magnitude' preserved, that can then be sorted.
The other two methods are much, much easier to follow, but one thing going for my alternative is that it sorts on a single integer key, so it might be a bit faster (although I doubt it is going to matter for any real-world scenario).
Is there any function in C# that checks the % of similarity of two strings?
For example I have:
var string1="Hello how are you doing";
var string2= " hi, how are you";
and the
function(string1, string2)
will return a similarity ratio because the words "how", "are" and "you" are present in both strings.
Or even better, return 60% similarity because "how", "are", "you" make up 3/5 of string1.
Does any function exist in C# that does that?
A common measure for similarity of strings is the so-called Levenshtein distance or edit distance. In this approach, a set of edit operations is defined. The Levenshtein distance is the minimum number of edit steps necessary to obtain the second string from the first. Closely related is the Damerau-Levenshtein distance, which uses a different set of edit operations.
Algorithmically, the Levenshtein distance can be calculated using Dynamic programming, which can be considered efficient. However, note that this approach does not actually take single words into account and cannot directly express the similarity in percent.
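For reference, a compact sketch of the dynamic-programming calculation, keeping only two rows of the table. The names are illustrative. The SimilarityPercent helper is an assumption on my part: dividing by the longer length is one common convention for getting a percentage, not part of the definition of the distance.

```csharp
using System;

public static class EditDistance
{
    // Levenshtein distance: minimum number of single-character insertions,
    // deletions and substitutions needed to turn s into t.
    public static int Levenshtein(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) prev[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1,   // insertion
                             prev[j] + 1),      // deletion
                    prev[j - 1] + cost);        // substitution or match
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the two rows
        }
        return prev[t.Length];
    }

    // Assumed convention: 100% for identical strings, scaled by the longer length.
    public static double SimilarityPercent(string s, string t)
    {
        int longest = Math.Max(s.Length, t.Length);
        if (longest == 0) return 100.0;
        return (1.0 - (double)Levenshtein(s, t) / longest) * 100.0;
    }
}
```

Time is proportional to the product of the two lengths, which is fine for short sentences but worth keeping in mind for very long strings.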
Now I am going to risk a -1 here for my suggestion, but in situations where you want something close without the complexity, there are a lot of simpler solutions than the Levenshtein distance, which is perfect if you need exact results and have time to code it.
If you are a bit looser concerning the accuracy, then I would follow these simple rules:
1. compare literal first (strSearch == strReal) - if match exit
2. convert search string and real string to lowercase
3. remove vowels and other chars from strings [aeiou-"!]
   now you have two converted strings. your search string:
   mths dhlgrn mtbrn
   and your real string to compare to
   rstrnt mths dhlgrn
4. compare the converted strings, if they match exit
5. split only the search string by its words, either with a simple split function or using the regular expression \W+
6. calculate the virtual value (weight) of one part by dividing 100 by the number of parts - in this case 33
7. compare each part of the search string with the real string, if it is contained, and add the value for each match to your total weight. In this case we have three elements and two matches so the result is 66 - so 66% match
This method is simple and extendable to go more and more in detail, actually you could use steps 1-7 and if step 7 returns anything above 50% then you figure you have a match, and otherwise you use more complex calculations.
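The steps could be sketched roughly like this. LazyMatch, Normalize and Score are names I made up, the character class mirrors [aeiou-"!] from the list, and returning 100 on an exact match is my reading of "if match exit".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LazyMatch
{
    // Lowercase the text, then strip vowels and the characters - " !
    public static string Normalize(string text)
    {
        return Regex.Replace(text.ToLowerInvariant(), "[aeiou\"!-]", "");
    }

    // Returns a rough match percentage of search against real.
    public static int Score(string search, string real)
    {
        if (search == real) return 100;                 // literal comparison

        string s = Normalize(search);
        string r = Normalize(real);
        if (s == r) return 100;                         // converted strings match

        // Split only the search string into its words.
        string[] parts = Regex.Split(s, @"\W+")
                              .Where(p => p.Length > 0)
                              .ToArray();
        if (parts.Length == 0) return 0;

        int weight = 100 / parts.Length;                // e.g. 33 for three parts
        int matches = parts.Count(p => r.Contains(p));  // substring containment
        return matches * weight;
    }
}
```

With the converted strings from the example, LazyMatch.Score("mths dhlgrn mtbrn", "rstrnt mths dhlgrn") gives 66.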
ok, now don't -1 me too fast, because other answers are perfect, this is just a solution for lazy developers and might be of value there, where the result fulfills the expectations.
You can create a function that splits both strings into arrays, and then iterate over one of them to check if the word exists in the other one.
If you want percentage of it you would have to count total amount of words and see how many are similar and create a number based on that.
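Something along these lines, as a sketch only. WordOverlap is a hypothetical name, the separator list is a guess at what counts as a word boundary, and the comparison is case-insensitive, which the question does not specify.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlap
{
    // Splits on spaces and some common punctuation (an assumed separator set).
    private static string[] Words(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', ',', '.', ';', '!', '?', '\t' },
                          StringSplitOptions.RemoveEmptyEntries);
    }

    // Percentage of the words in first that also occur in second.
    public static double Percent(string first, string second)
    {
        string[] firstWords = Words(first);
        if (firstWords.Length == 0) return 0.0;

        var secondWords = new HashSet<string>(Words(second));
        int shared = firstWords.Count(w => secondWords.Contains(w));
        return 100.0 * shared / firstWords.Length;
    }
}
```

For the strings in the question, "how", "are" and "you" are 3 of the 5 words in string1, so WordOverlap.Percent(string1, string2) comes out at 60.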
I have to write a C# Forms program that loads a file which looks something like this:
100ACTGGCTTACACTAATCAAG
101TTAAGGCACAGAAGTTTCCA
102ATGGTATAAACCAGAAGTCT
...
120GCATCAGTACGTACCCGTAC
There are 20 lines, each made up of a number (ID) and 20 letters (DNA); the other file looks like this:
TGCAACGTGTACTATGGACC
In short, this is a game in which a murder has been committed and there are 20 people; I have to load and split the letters, compare them, and in the end find the best match.
I have no idea how to do that: I don't know how to load the letters into an array, how to split them, or how to compare them.
What you want to do here, is use something like a calculation of the Levenshtein distance between the strings.
In simple terms, that provides a count of how many single letters you have to change for a string to become equal to another. In the context of DNA or Proteins, this can be interpreted as representing the number of mutations between two individuals or samples. A shorter distance will therefore indicate a closer relationship between the two.
The algorithm can be fairly heavy computationally, but will give you a good answer. It's also quite fun and enlightening to implement. You can find a couple of ways of implementing it in the Wikipedia article.
If you find it challenging to understand how it works, I recommend you set up an example grid by hand, with one short string horizontally along the top, and one vertically along the left side, and try going through the calculations manually, just to understand the concept properly (it can be confusing at first, but is really not that difficult).
This is a simple match function. It might not be of the complexity your game requires. This solution does not require an explicit split on the strings in order to get an array of DNA "letters". The DNA is compared in place.
Compare each "suspect" entry to the "evidence" one.
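int idLength = 3;
string evidence = //read from file
List<string> suspects = //read from file
List<double> matchScores = new List<double>();
foreach (string suspect in suspects)
{
    int count = 0;
    // i indexes the evidence; the suspect's DNA starts after its ID prefix.
    for (int i = 0; i < evidence.Length && i + idLength < suspect.Length; i++)
    {
        if (suspect[i + idLength] == evidence[i]) count++;
    }
    // 100.0 forces floating-point division so the percentage is not truncated.
    matchScores.Add(count * 100.0 / evidence.Length);
}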
The matchScores list now contains all the individual match scores. I did not save the maximum match score in a separate variable as there can be several "suspects" with the same score. To find out which suspect has the best match, just iterate the matchScores list. The index of the best match is the index of the suspect in the suspects list.
Optimization notes:
you could check each "suspect" string to see where (i.e. at what index does) the DNA sequence starts, as it could be variable;
a dictionary could be used here, instead of two lists, with the "suspect string" as key and the match score as value
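A sketch of the dictionary variant. DnaMatcher, Score and BestIds are hypothetical names. As a small departure from the note above, it keys the dictionary by the ID prefix rather than the whole suspect string, and it assumes every line is at least idLength characters long.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DnaMatcher
{
    // Maps each suspect ID to the percentage of evidence positions it matches.
    public static Dictionary<string, double> Score(
        IEnumerable<string> suspects, string evidence, int idLength = 3)
    {
        var scores = new Dictionary<string, double>();
        foreach (string suspect in suspects)
        {
            string id = suspect.Substring(0, idLength);
            string dna = suspect.Substring(idLength);
            int length = Math.Min(dna.Length, evidence.Length);

            int count = 0;
            for (int i = 0; i < length; i++)
            {
                if (dna[i] == evidence[i]) count++;
            }
            scores[id] = evidence.Length == 0 ? 0.0 : count * 100.0 / evidence.Length;
        }
        return scores;
    }

    // Returns every ID that shares the highest score, since ties are possible.
    public static List<string> BestIds(Dictionary<string, double> scores)
    {
        if (scores.Count == 0) return new List<string>();
        double best = scores.Values.Max();
        return scores.Where(kv => kv.Value == best)
                     .Select(kv => kv.Key)
                     .ToList();
    }
}
```

The suspect lines could be loaded with File.ReadAllLines and passed straight in; BestIds then lists the closest suspect or suspects.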
This seems so trivial but I'm not finding an answer with Google.
I'm after a high value for a string for a semaphore at the end of a sorted list of strings.
It seems to me that char.highest.ToString() should do it--but this compares low, not high.
Obviously it's not truly possible to create a highest possible string because it would always be lower than the same thing + more data but the strings I'm sorting are all valid pathnames and thus the symbols used are constrained.
In response to the comments:
In the pre-unicode days in Delphi I would simply have used #255. I simply want a string that will compare higher than any possible pathname. This should be trivial--why isn't it??
Response #2:
It's not the sorting that requires the sentinel, it's the processing afterwards. I have multiple lists that I am sort-of merging (a simplistic merge won't do the job.) and either I duplicate code or I have dummy values that always compare high.
A string representation of the highest character will only be one character long.
Why don't you just append it as a semaphore after sorting, rather than trying to make it something that will sort afterwards?
Alternatively, you could specify your own comparator that sorts your token after any other string, and calls the default comparator otherwise.
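A sketch of such a comparator. SentinelLastComparer is a made-up name, and the default of ordinal comparison for ordinary strings is an assumption; pass whatever comparer the lists are actually sorted with.

```csharp
using System;
using System.Collections.Generic;

// Treats one chosen string as greater than everything else and defers to
// an inner comparer for all other pairs.
public sealed class SentinelLastComparer : IComparer<string>
{
    private readonly string sentinel;
    private readonly IComparer<string> inner;

    public SentinelLastComparer(string sentinel, IComparer<string> inner = null)
    {
        this.sentinel = sentinel;
        this.inner = inner ?? StringComparer.Ordinal; // assumed default
    }

    public int Compare(string x, string y)
    {
        bool xIsSentinel = string.Equals(x, sentinel, StringComparison.Ordinal);
        bool yIsSentinel = string.Equals(y, sentinel, StringComparison.Ordinal);

        if (xIsSentinel && yIsSentinel) return 0;
        if (xIsSentinel) return 1;   // sentinel sorts after any other string
        if (yIsSentinel) return -1;
        return inner.Compare(x, y);
    }
}
```

The sentinel can then be any string that cannot be a real pathname (for example one containing a character that is invalid in paths), because its position no longer depends on its characters.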
I had the same problem when trying to put null values at the bottom of a list in a LINQ OrderBy() statement. I ended up using...
Char.ConvertFromUtf32(0x10ffff)
...which worked a treat.
Something like this?
public static String Highest(this String value)
{
    Char highest = '\0';
    foreach (Char c in value)
    {
        // Math.Max has no Char overload, so the result must be cast back.
        highest = (Char)Math.Max(c, highest);
    }
    return new String(new Char[] { highest });
}
I have a 2-dimensional array of objects (predominantly, but not exclusively strings) that I want to filter by a string (sSearch) using LINQ. The following query works, but isn't as fast as I would like.
I have changed Count to Any, which led to a significant increase in speed, and replaced Contains by a regular expression that ignores case, thereby eliminating the call to ToLower. Combined, this has more than halved the execution time.
What is now very noticeable is that increasing the length of the search term from 1 to 2 letters triples the execution time and there is another jump from 3 to 4 letters (~50% increase in execution time). While this is obviously not surprising I wonder whether there is anything else that could be done to optimise the matching of strings?
Regex rSearch = new Regex(sSearch, RegexOptions.IgnoreCase);
rawData.Where(row => row.Any(column => rSearch.IsMatch(column.ToString())));
In this case the dataset has about 10k rows and 50 columns, but the size could vary fairly significantly.
Any suggestions on how to optimise this would be greatly appreciated.
One optimisation is to use Any instead of Count - that way as soon as one matching column has been found, the row can be returned.
rawData.Where(row => row.Any(column => column.ToString()
.ToLower().Contains(sSearch)))
You should also be aware that ToLower is culture-sensitive. It may not be a problem in your case, but it's worth being aware of. ToLowerInvariant may be a better option for you. It's a shame there isn't an overload for Contains which lets you specify that you want a case-insensitive match...
EDIT: You're using a regular expression now - have you tried RegexOptions.Compiled? It may or may not help...
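One more option to measure, as a sketch: String.IndexOf has an overload that takes a StringComparison, which gives a case-insensitive "contains" without ToLower or a regex. RowFilter is a hypothetical name, and treating the data as a sequence of object[] rows is an assumption about how rawData is shaped. Whether this beats the regex will depend on the data, so it is worth timing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowFilter
{
    // Keeps the rows in which at least one cell contains the search text,
    // ignoring case. Null cells are skipped.
    public static IEnumerable<object[]> Filter(
        IEnumerable<object[]> rows, string search)
    {
        return rows.Where(row => row.Any(cell =>
            cell != null &&
            cell.ToString().IndexOf(search, StringComparison.OrdinalIgnoreCase) >= 0));
    }
}
```

Unlike new Regex(sSearch, ...), this treats the search term as plain text, so characters such as . or ( in the term are matched literally.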