Find number of differences in 2 strings - c#

int n = string.numDifferences("noob", "newb"); // 2
??

The number you are trying to find is called the edit distance. Wikipedia lists several algorithms you might want to use; the Hamming distance is a very common way of finding the edit distance between two strings of the same length (it's often used in error-correcting codes); the Levenshtein distance is similar, but also takes insertions and deletions into account. Wikipedia, of course, lists several others (e.g. the Damerau-Levenshtein distance, which also includes transpositions); I don't know which you want, as I'm no expert and the choice is domain-specific. One of these, though, should do the trick.

Assuming that you only want to compare characters at the same indices, the following C# solution (using methods provided by LINQ) should do the trick:
var count = s1.Zip(s2, (c1, c2) => c1 == c2 ? 0 : 1).Sum();
This "zips" the two strings, and then returns 0 for each index where the characters are the same and 1 for each index where they differ. Then we simply sum the numbers and we get the result.

You already got excellent answers if you mean "edit distance". If you just mean "number of characters that differ" (for two strings of the same length), in Python, the simplest approach would be:
sum(c1!=c2 for c1, c2 in zip(s1, s2))
and if you also want to add the length difference, append
+ abs(len(s1) - len(s2))
Of course, if you do want edit distances, this approach would be far too simplistic;-).

import java.util.*;

class AnagramStringDifference
{
    // Counts how many characters differ between two strings when both are
    // treated as multisets of characters (i.e. ignoring order).
    public int AnagramStringDifference(String A, String B)
    {
        int diff = 0, Ai = 0, Bi = 0;
        char[] Aa = A.toCharArray();
        char[] Bb = B.toCharArray();
        Arrays.sort(Aa);
        Arrays.sort(Bb);
        while (Ai < Aa.length && Bi < Bb.length)
        {
            int c = Character.compare(Aa[Ai], Bb[Bi]);
            if (c < 0)
            {
                diff++;
                Ai++;
            }
            else if (c > 0)
            {
                diff++;
                Bi++;
            }
            else
            {
                Ai++;
                Bi++;
            }
        }
        // Whatever remains in the longer array is also counted as differences.
        diff += Math.abs((Aa.length - Ai) - (Bb.length - Bi));
        return diff;
    }
}
P.S. I was asked a similarly difficult question on a Codility online test for a job application, with only about a two-hour limit for four hard questions. I wonder how overcrowded the IT industry is, or how much unreasonable pressure management places on IT workers, if temp agency recruiters can get away with asking such a difficult question to screen for an entry-level-pay technical support job.

import math

def differences(s1, s2):
    count = 0
    for i in range(len(s1)):
        count += int(s1[i] != s2[i])
    # Add this line if the two strings may have different lengths and you want
    # the surplus characters of the longer string counted as differences:
    # count += math.sqrt((len(s1) - len(s2)) ** 2)
    return count
Hope this helps

Related

Loop through every possible combination of values in a BitArray

I'm trying to solve a larger problem. As part of this, I have created a BitArray to represent a series of binary decisions taken sequentially. I know that all valid decision series will have half of all decisions true, and half of all false, but I don't know the order:
ttttffff
[||||||||]
Or:
tftftftf
[||||||||]
Or:
ttffttff
[||||||||]
Or any other combination where half of all bits are true, and half false.
My BitArray is quite a bit longer than this, and I need to move through each set of possible decisions (each possible combination of half true, half false), making further checks on their validity. I'm struggling to conceptually work out how to do this with a loop, however. It seems like it should be simple, but my brain is failing me.
EDIT: Because the BitArray wasn't massive, I used usr's suggestion and implemented a bitshift loop. Based on some of the comments and answers, I re-googled the problem with the key-word "permutations" and found this Stack Overflow question which is very similar.
I'd do this using a recursive algorithm. Each level sets the next bit. You keep track of how many zeroes and ones have been decided already. If one of those counters goes above N / 2 you abort the branch and backtrack. This should give quite good performance because it will tend to cut off infeasible branches quickly. For example, after setting tttt only f choices are viable.
A simpler, less well-performing, version would be to just loop through all possible N-bit integers using a for loop and discarding the ones that do not fulfill the condition. This is easy to implement for up to 63 bits. Just have a for loop from 0 to 1 << 63. Clearly, with high bitcounts this is too slow.
You are looking for all permutations of N / 2 zeroes and N / 2 ones. There are algorithms for generating those. If you can find one implemented this should give the best possible performance. I believe those algorithms use clever math tricks to only visit viable combinations.
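For illustration, here is a minimal sketch of that recursive backtracking idea (my own code, not usr's; it assumes the pattern length is even and hands each valid pattern to a callback, which sees the shared working buffer):
using System;
using System.Linq;

static class HalfTruePatterns
{
    // Set one bit per level, counting ones and zeroes as we go;
    // abandon any branch where either counter exceeds N / 2.
    public static void Generate(bool[] bits, int index, int ones, int zeroes, Action<bool[]> onPattern)
    {
        int half = bits.Length / 2;
        if (ones > half || zeroes > half) return;              // prune infeasible branches early
        if (index == bits.Length) { onPattern(bits); return; } // a complete valid pattern

        bits[index] = true;
        Generate(bits, index + 1, ones + 1, zeroes, onPattern);
        bits[index] = false;
        Generate(bits, index + 1, ones, zeroes + 1, onPattern);
    }

    static void Main()
    {
        // Prints every 8-bit pattern with exactly four bits set.
        Generate(new bool[8], 0, 0, 0,
            p => Console.WriteLine(string.Concat(p.Select(b => b ? 't' : 'f'))));
    }
}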
If you're OK with using the bits in an integer instead of a BitArray, this is a general solution to generate all patterns with some constant number of bits set.
Start with the lowest valid value, which is with all the ones at the right side of the number, which you can calculate as low = ~(-1 << k) (doesn't work for k=32, but that's not an issue in this case).
Then take Gosper's Hack (also shown in this answer), which is a way to generate the next highest integer with equally many bits set, and keep applying it until you reach the highest valid value, low << k in this case.
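A rough sketch of that loop (my own, using the HAKMEM formulation of Gosper's hack; it assumes the bit count n stays below 63 so a ulong is enough):
using System;
using System.Collections.Generic;

static class GosperPatterns
{
    // Enumerates every n-bit value with exactly k bits set, lowest first.
    public static IEnumerable<ulong> WithKBitsSet(int n, int k)
    {
        ulong v = ~(~0UL << k);            // k ones at the right: the lowest valid value
        ulong last = v << (n - k);         // k ones at the left: the highest valid value
        while (true)
        {
            yield return v;
            if (v == last) yield break;
            ulong c = v & (~v + 1);        // lowest set bit
            ulong r = v + c;               // bump that block of ones up by one place
            v = r | (((v ^ r) / c) >> 2);  // refill the freed positions at the bottom
        }
    }

    static void Main()
    {
        foreach (var p in WithKBitsSet(8, 4))
            Console.WriteLine(Convert.ToString((long)p, 2).PadLeft(8, '0'));
    }
}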
This will result in duplicates, but you could check for duplicates before adding to the List if you want to.
static void Main(string[] args)
{
    // Set your bits here:
    bool[] bits = { true, true, false };
    BitArray original_bits = new BitArray(bits);
    permuteBits(original_bits, 0, original_bits.Length - 1);
    foreach (BitArray ba in permutations)
    {
        // You can check validity here
        foreach (bool i in ba)
        {
            Console.Write(Convert.ToInt32(i));
        }
        Console.WriteLine();
    }
}

static List<BitArray> permutations = new List<BitArray>();

static void permuteBits(BitArray bits, int minIndex, int maxIndex)
{
    int current_index;
    if (minIndex == maxIndex)
    {
        permutations.Add(new BitArray(bits));
    }
    else
    {
        for (current_index = minIndex; current_index <= maxIndex; current_index++)
        {
            swap(bits, minIndex, current_index);
            permuteBits(bits, minIndex + 1, maxIndex);
            swap(bits, minIndex, current_index);
        }
    }
}

private static void swap(BitArray bits, int i, int j)
{
    bool temp = bits[i];
    bits[i] = bits[j];
    bits[j] = temp;
}
If you want to understand how to find all the permutations of a string containing duplicate entries (i.e. zeros and ones), you can read this article.
It uses a recursive solution to solve the problem, and the explanation is good as well.

System of the Unicode Box Drawing table

I'm implementing a function in C# that, given which kind of line enters from each side, returns one character from the Unicode Box Drawing table (0x2500-0x257F). However, I have so far failed to find a system in the way these characters are positioned in the table that would make for a significantly simpler function than assigning every possible input to an output in one enormous if-then-else block.
I've noted that there are 9 different line styles in that table (thin, double, thick, double-dashed, triple-dashed, quadruple-dashed, thick double-dashed, ...), which together with the "no line" state makes 10 possible states per side; over the four directions that gives up to 9999 different combinations, not counting the "no line on any side" case, which in my case would be a space character.
The easiest way I've found to implement this is to make one freakin' huge array containing all 10000 possible outcomes (where the first digit denotes North, the second East, then South and West), but I believe this is actually the second-worst approach I've found and that there is a much more elegant solution. (BTW, this would be hilarious if you're not planning to implement it this way; that is how I feel about it anyway.)
This question is probably not suitable here, but considering the size of this task, I even take that risk:
Is there a system how the Box Drawing table arranges the characters, and/or is there a simpler algorithm that does the exact same I would like to do?
The simplest/shortest solution I can see needs an array/list of 128 elements.
You declare a struct/class like this:
// I use consts instead of an enum to shorten the code below
const int thin = 1;
const int dbl = 2;   // "double" is a reserved word in C#, so use a different name
const int thick = 3;
... // other line styles

struct BoxDrawingChar{
    public int UpLine, DownLine, LeftLine, RightLine;
    public BoxDrawingChar(int UpLine, int DownLine, int LeftLine, int RightLine)
    { ... }
};
Then you describe appearance of each character:
BoxDrawingChar[] BoxDrawingCharList =
{
    new BoxDrawingChar(0, 0, thin, thin),   // 0x2500
    new BoxDrawingChar(0, 0, thick, thick), // 0x2501
    ...
    new BoxDrawingChar(...),                // 0x257F
};
Then your function will be quite simple:
int GetCharCode(int UpLine, int DownLine, int LeftLine, int RightLine)
{
    for (int i = 0; i < BoxDrawingCharList.Length; ++i){
        BoxDrawingChar ch = BoxDrawingCharList[i];
        if (ch.UpLine == UpLine && ch.DownLine == DownLine && ...)
            return i + 0x2500;
    }
    return 0;
}
Of course you can add diagonal lines, rounded angles etc and refactor the code in many ways. I gave only a general idea.
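If the linear scan ever becomes a concern, one variation (my own suggestion, not part of the answer above; it reuses the BoxDrawingChar struct from that answer) is to build a dictionary once, keyed by the four side styles packed into a single int, so each lookup is a constant-time dictionary access:
using System.Collections.Generic;

static class BoxDrawingLookup
{
    // Packs the four side styles (each 0..9) into one key, NESW as four decimal digits.
    static int Key(int up, int right, int down, int left)
    {
        return ((up * 10 + right) * 10 + down) * 10 + left;
    }

    // Built once from the same BoxDrawingCharList used in the answer above.
    static Dictionary<int, char> BuildIndex(BoxDrawingChar[] list)
    {
        var byKey = new Dictionary<int, char>();
        for (int i = 0; i < list.Length; i++)
        {
            var ch = list[i];
            byKey[Key(ch.UpLine, ch.RightLine, ch.DownLine, ch.LeftLine)] = (char)(0x2500 + i);
        }
        return byKey;
    }
}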

How can I get a random x number of decimals from a list of unique decimals that total up to y?

Say I have a sorted list of 1000 or so unique decimals, arranged by value.
List<decimal> decList
How can I get a random x number of decimals from a list of unique decimals that total up to y?
private List<decimal> getWinningValues(int xNumberToGet, decimal yTotalValue)
{
}
Is there any way to avoid a long processing time on this? My idea so far is to take xNumberToGet random numbers from the pool. Something like (cool way to get random selection from a list)
foreach (decimal d in decList.OrderBy(x => randomInstance.Next()).Take(xNumberToGet))
{
}
Then I might check the total of those, and if the total is less, I might slowly shift the numbers up (to the next available number); if the total is more, I might shift the numbers down. I'm still not sure how to implement this, or whether there is a better design readily available. Any help would be much appreciated.
OK, start with a little extension method I got from this answer:
public static IEnumerable<IEnumerable<T>> Combinations<T>(
this IEnumerable<T> source,
int k)
{
if (k == 0)
{
return new[] { Enumerable.Empty<T>() };
}
return source.SelectMany((e, i) =>
source.Skip(i + 1).Combinations(k - 1)
.Select(c => (new[] { e }).Concat(c)));
}
this gives you a pretty efficient method to yield all the combinations with k members, without repetition, from a given IEnumerable. You could make good use of this in your implementation.
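For instance, a quick check of what it yields (my example, not from the linked answer; it assumes the extension above sits in a static class and System.Linq is imported):
// The 2-element combinations of { 1, 2, 3 } are { 1, 2 }, { 1, 3 } and { 2, 3 }.
foreach (var combo in new[] { 1, 2, 3 }.Combinations(2))
    Console.WriteLine(string.Join(", ", combo));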
Bear in mind, if the IEnumerable and k are sufficiently large this could take some time, i.e. much longer than you have. So, I've modified your function to take a CancellationToken.
private static IEnumerable<decimal> GetWinningValues(
    IEnumerable<decimal> allValues,
    int numberToGet,
    decimal targetValue,
    CancellationToken canceller)
{
    IList<decimal> currentBest = null;
    var currentBestGap = decimal.MaxValue;
    var locker = new object();

    allValues.Combinations(numberToGet)
        .AsParallel()
        .WithCancellation(canceller)
        .TakeWhile(c => currentBestGap != decimal.Zero)
        .ForAll(c =>
        {
            var gap = Math.Abs(c.Sum() - targetValue);
            if (gap < currentBestGap)
            {
                lock (locker)
                {
                    currentBestGap = gap;
                    currentBest = c.ToList();
                }
            }
        });

    return currentBest;
}
I have an idea that you could sort the initial list and quit iterating the combinations at a certain point, when the sum must exceed the target. After some consideration, it's not trivial to identify that point, and the cost of checking may exceed the benefit. The benefit would have to be balanced against some function of the target value and the mean of the set.
I still think further optimization is possible but I also think that this work has already been done and I'd just need to look it up in the right place.
There are k such subsets of decList (k might be 0).
Assuming that you want to select each one with uniform probability 1/k, I think you basically need to do the following:
iterate over all the matching subsets
select one
Step 1 is potentially a big task, you can look into the various ways of solving the "subset sum problem" for a fixed subset size, and adapt them to generate each solution in turn.
Step 2 can be done either by making a list of all the solutions and choosing one or (if that might take too much memory) by using the clever streaming random selection algorithm.
If your data is likely to have lots of such subsets, then generating them all might be incredibly slow. In that case you might try to identify groups of them at a time. You'd have to know the size of the group without visiting its members one by one, then you can choose which group to use weighted by its size, then you've reduced the problem to selecting one of that group at random.
If you don't need to select with uniform probability then the problem might become easier. At the best case, if you don't care about the distribution at all then you can return the first subset-sum solution you find -- whether you'd call that "at random" is another matter...
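The "clever streaming random selection algorithm" mentioned in step 2 is usually reservoir sampling; here is a minimal sketch of my own, assuming the matching subsets arrive as an IEnumerable:
using System;
using System.Collections.Generic;

static class Reservoir
{
    // Picks one element uniformly at random from a stream without buffering it all:
    // the i-th element replaces the current pick with probability 1/i.
    public static T PickOne<T>(IEnumerable<T> items, Random rng)
    {
        T chosen = default(T);
        int seen = 0;
        foreach (var item in items)
        {
            seen++;
            if (rng.Next(seen) == 0)
                chosen = item;
        }
        if (seen == 0) throw new InvalidOperationException("The sequence was empty.");
        return chosen;
    }
}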

Determining how close an array is to the target array

I'm running a little experiment to increase my knowledge, and I have reached a part where I feel I could really optimize it, but am not quite sure how.
I have many arrays of numbers. (For simplicity, let's say each array has 4 numbers: 1, 2, 3, and 4.)
The target is to have all of the numbers in ascending order (ie,
1-2-3-4), but the numbers are all scrambled in the different arrays.
A higher weight is placed upon larger numbers.
I need to sort all of these arrays in order of how close they are to
the target.
Ie, 4-3-2-1 would be the worst possible case.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2 (even though two numbers (1 and 3) match in the second,
the biggest number is closer to its spot in the first).
So the big numbers always take precedence over the smaller numbers. Here is my attempt:
var tmp = from m in moves
let mx = m.Max()
let ranking = m.IndexOf(s => s == mx)
orderby ranking descending
select m;
return tmp.ToArray();
P.S. IndexOf in the above example is an extension I wrote that takes an array and an expression and returns the index of the element that satisfies the expression. It is needed because the real situation is a little more complicated; I'm simplifying it with my example.
The problem with my attempt here, though, is that it only sorts by the biggest number and ignores all of the other numbers. It SHOULD rank by the biggest number first, then by the second largest, then by the third.
Also, since it will be doing this operation over and over again for several minutes, it should be as efficient as possible.
You could implement a bubble sort, and count the number of times you have to move data around. The number of data moves will be large on arrays that are far away from the sorted ideal.
int GetUnorderedness<T>(T[] data) where T : IComparable<T>
{
    data = (T[])data.Clone(); // don't modify the input data,
                              // we weren't asked to actually sort.
    int swapCount = 0;
    bool isSorted;
    do
    {
        isSorted = true;
        for (int i = 1; i < data.Length; i++)
        {
            if (data[i - 1].CompareTo(data[i]) > 0)
            {
                T temp = data[i];
                data[i] = data[i - 1];
                data[i - 1] = temp;
                swapCount++;
                isSorted = false;
            }
        }
    } while (!isSorted);
    return swapCount; // the number of swaps measures how far from sorted the input was
}
From your sample data, this will give slightly different results than you specified.
Some example cases:
3-4-2-1 is better than 4-3-2-1
2-3-4-1 is better than 1-4-3-2
3-4-2-1 will take 5 swaps to sort, 4-3-2-1 will take 6, so that works.
2-3-4-1 will take 3, 1-4-3-2 will also take 3, so this doesn't match up with your expected results.
This algorithm doesn't treat the largest number as the most important, which it seems you want; all numbers are treated equally. From your description, you'd consider 2-1-3-4 as much better than 1-2-4-3, because the first one has both the largest and second largest numbers in their proper place. This algorithm would consider those two equal, because each requires only 1 swap to sort the array.
This algorithm does have the advantage that it's not just a comparison algorithm, each input has a discrete output, so you only need to run the algorithm once for each input array.
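If the largest numbers really must dominate, a hedged alternative (my own sketch, not part of the answer above) is to compare two arrays by how far each value sits from its sorted position, most significant value first; with this comparer, 2-1-3-4 ranks ahead of 1-2-4-3, and 2-3-4-1 ranks ahead of 1-4-3-2:
using System;
using System.Linq;

static class Closeness
{
    // Assumes both arrays are permutations of the same distinct values.
    // Returns negative if a is closer to sorted order than b, by the rule that
    // the largest value's displacement matters most, then the second largest, etc.
    public static int Compare(int[] a, int[] b)
    {
        var sorted = a.OrderBy(x => x).ToArray();   // target position of each value
        for (int i = sorted.Length - 1; i >= 0; i--)
        {
            int value = sorted[i];
            int da = Math.Abs(Array.IndexOf(a, value) - i);
            int db = Math.Abs(Array.IndexOf(b, value) - i);
            if (da != db) return da.CompareTo(db);
        }
        return 0;
    }
}

// Usage: Array.Sort(moves, Closeness.Compare); // closest to 1-2-3-4 first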
I hope this helps
var i = 0;
var temp = (from m in moves select m).ToArray();
do
{
    temp = (from m in temp
            orderby m[i] descending
            select m).ToArray();
}
while (++i < moves[0].Length);

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method, _Probability(), is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
    var _map = new List<Tuple<char, char, char, char>>();
    for (int i = 0; i < src.Length - 3; i++)
    {
        int j = i + 1;
        int k = i + 2;
        int l = i + 3;
        _map.Add(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
    }
    return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
    var subset_x0 = map.Where(x => x.Item1 == x0);
    var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
    var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
    var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);

    int count_of_x0 = subset_x0.Count();
    int count_of_x1_following = subset_x0_x1_following.Count();
    int count_of_x2_following = subset_x0_x2_following.Count();
    int count_of_x3_following = subset_x0_x3_following.Count();

    decimal p1;
    decimal p2;
    decimal p3;

    if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
    {
        p1 = e;
        p2 = e;
        p3 = e;
    }
    else
    {
        p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
        p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
        p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
        p1 = (p1 * 100) + e;
        p2 = (p2 * 100) + e;
        p3 = (p3 * 100) + e;
    }

    //more calculations omitted
    return _final;
}
}
EDIT: I'm providing more details to clear things up.
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to simply divide the instances of [t][h][a][t] in the text by the total number of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue; moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
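For reference, here is a minimal sketch of that tallying idea (my own reconstruction, not the asker's actual code; the key is simply the 4-character substring):
using System.Collections.Generic;

static class QuadCounter
{
    // One pass over the text, counting how often each 4-character window occurs.
    public static Dictionary<string, int> Count(string src)
    {
        var counts = new Dictionary<string, int>();
        for (int i = 0; i + 4 <= src.Length; i++)
        {
            string quad = src.Substring(i, 4);
            int c;
            counts.TryGetValue(quad, out c);  // c stays 0 if the quad is new
            counts[quad] = c + 1;
        }
        return counts;
    }
}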
In your probability method you iterate the map 8 times: each of your Wheres iterates the entire list, and so does each Count. Adding a .ToList() at the end would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the inserts (which shouldn't be slowed down too much) you're done, or you could, as the code below does, do a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter/word of a sentence and the first letter of a word give clear indications of what language a given text is written in, so by taking punctuation characters and whitespace as part of your distribution you include those characteristics of the sample data. We did that some years back, and we showed that using just three characters was almost as accurate as using more (we tested up to 7): we had no failures with three on our test data, and "almost as accurate" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result. But the speed of three letters made that the best choice.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
    private Node Parse(string src){
        var top = new Node(null);
        for (int i = 0; i < src.Length - 3; i++){
            var first = src[i];
            var second = src[i+1];
            var third = src[i+2];
            var fourth = src[i+3];
            var firstLevelNode = top.AddChild(first);
            var secondLevelNode = firstLevelNode.AddChild(second);
            var thirdLevelNode = secondLevelNode.AddChild(third);
            thirdLevelNode.AddChild(fourth);
        }
        return top;
    }
}

public class Node{
    private readonly Node _parent;
    private readonly Dictionary<char,Node> _children
        = new Dictionary<char, Node>();
    private int _count;

    public Node(Node parent){
        _parent = parent;
    }

    public Node AddChild(char value){
        if (!_children.ContainsKey(value))
        {
            _children.Add(value, new Node(this));
        }
        var levelNode = _children[value];
        levelNode._count++;
        return levelNode;
    }

    public decimal Probability(string substring){
        var node = this;
        foreach (var c in substring){
            if(!node.Contains(c))
                return 0m;
            node = node[c];
        }
        return ((decimal) node._count)/node._parent._children.Count;
    }

    public Node this[char value]{
        get { return _children[value]; }
    }

    private bool Contains(char c){
        return _children.ContainsKey(c);
    }
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient, since you would be accessing each "level" (Item1...Item4) while calculating... and you would cache the result in the innermost Dictionary so that next time you don't have to calculate it at all.
OK, I don't have time to work out the details, but this really calls for
neural classifier nets (just take any off the shelf; even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int, and then use that int to index into the body of text when you need the character values? Your tuple-based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck for performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the LINQ statements you're using? The .Where will be doing memory allocation (which is a slow operation), and the LINQ statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
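A sketch of how those keys could be packed (my own illustration; it assumes the letters have already been mapped to 0..25, with the * 32 multipliers leaving headroom for a few extra symbols):
// Packs up to four letter codes (each 0..31) into one int key for the dictionaries above.
static int QuadKey(int c1, int c2, int c3, int c4)
{
    return c1 + c2 * 32 + c3 * 32 * 32 + c4 * 32 * 32 * 32;
}

// e.g. for lower-case text: QuadKey(ch1 - 'a', ch2 - 'a', ch3 - 'a', ch4 - 'a')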
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and the denominator, and then do the division as late as possible.
The best approach here is to use sparse storage and to prune, for example after every 10000 characters. The best storage structure in this case is a prefix tree; it allows fast probability calculation, updating, and sparse storage. You can find more theory in this javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html
