Index independent character comparison within text blocks

Index independent character comparison within text blocks - c#

I have the following task:developing a program where there is a block of sample text which should be typed by user.Any typos the user does during the test are registered.Basically I can compare each typed char with the sample char based on caret index position of the input.But there is one significant flow in such a "naive" approach- if the user typed mistakenly more letters than a whole string has, or inserted more white spaces between the string than should be , in such a case , there rest of comparisons will be wrong because of the index offsets added by the additional wrong insertions.I have thought of designing some kind of parser where each string (or even a char ) is tokenized and the comparisons are made "char-wise" and not "index-wise".But that seems to me like an overkill for such a task.I would like to get a reference to possibly existing algorithms which can be helpful in solving this kind of problems.
Also ,I am not sure if this question better matches the spirit of "Programmers" site so I posted it there too.
UPDATE
Another important detail I forgot to mention is that the evalution must be done on each input and not at the end of the task as it includes typing time recording etc...

If the user is retyping existing text then it should be as easy as doing index based comparisons for simple text (one char representing one letter/number etc.) or you could employ the StringInfo.GetTextElementEnumerator as example below.
If you were to ask the user to type in the text first and then check how many errors were made, you could employ Levenshtein distance (see here for implementation and explanation: http://www.dotnetperls.com/levenshtein). The Levenshtein distance is essentially the amount of edits required to make one string look like another.
Please note the implementation would be technically incomplete if you were tasked to support unicode with combining characters and surrogate pairs where 2 or more characters are required to represent one letter. In the latter case, you could modify that implementation to use StringInfo.GetTextElementEnumerator to produce and compare text elements as follows:
using System;
using System.Collections.Generic;
using System.Globalization;
/// <summary>
/// Contains approximate string matching
/// </summary>
internal static class LevenshteinDistance
{
/// <summary>
/// Compute the distance between two strings using text elements.
/// </summary>
public static int ComputeByTextElements(string s, string t) {
string[] strA = GetTextElements(s),
strB = GetTextElements(t);
int n = strA.Length;
int m = strB.Length;
var d = new int[n + 1,m + 1];
// Step 1
if (n == 0) {
return m;
}
if (m == 0) {
return n;
}
// Step 2
for (int i = 0; i <= n; d[i, 0] = i++) {}
for (int j = 0; j <= m; d[0, j] = j++) {}
// Step 3
for (int i = 1; i <= n; i++) {
//Step 4
for (int j = 1; j <= m; j++) {
// Step 5
int cost = (strB[j - 1] == strA[i - 1]) ? 0 : 1;
// Step 6
d[i, j] = Math.Min(
Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
d[i - 1, j - 1] + cost);
}
}
// Step 7
return d[n, m];
}
private static string[] GetTextElements(string str) {
if (string.IsNullOrEmpty(str)) {
return new string[] {};
}
var result = new List<string>();
var enumerator = StringInfo.GetTextElementEnumerator(str);
while (enumerator.MoveNext()) {
result.Add(enumerator.Current.ToString());
}
return result.ToArray();
}
}
Also be aware that the above solution is case sensitive, so you might want to make your inputs all upper or lower case if you require case insensitive Levenshtein distance calculations.
Regarding your edit
I wouldn't settle for (character) index based comparison because it doesn't take into account combining characters and surrogate pairs. I would build up an array of text elements representing the text to be typed out and then per key stroke create an array (or subset for comparison via some caching implementation to optimize for speed) of text elements of the typed input and compare them.

Related

Can't figure out how to correct my array?

Here is my code. I get a red line under StockPrices saying that it can not implicitly convert type decimal to int. Which I understand since StockPrices Array is set as a decimal. I can't figure out how to convert it. (If you find any other issues, please call it out. I'm still learning!)
public int FindNumTimesNegativePriceChange()
{
int difference = 0;
decimal[] negativeChange = new decimal[StockPrices];
for (int i = 0; i < StockPrices.Length - 1; ++i)
{
difference = (int)(StockPrices[i + 1] - StockPrices[i]);
if (difference < 0)
{
negativeChange++;
}
}
return negativeChange;
Currently no result is returned.

If you want a new array with the same length as an existing array, use the Length property for the source array, not the array itself:
new decimal[StockPrices.Length];
But I'm not sure that is what you are looking for at all.
You want a counter, so difference only needs to be an int in this case.
The next issue is that you are explicitly casting decimal values to an int which means you will lose precision. Other data types would throw an exception in this case but decimal allows it and will truncate the values, not round them.
For stock prices, commonly the changes are less than 1, so in this business domain precision is usually important.
If it is your intention to only count whole integer losses then you should include a comment in the code, mainly because explicit casting like this is a common code mistake, comments are a great way to prevent future code reviewers from editing your logic to correct what looks like a mistake.
Depending on your source code management practises, it can be a good idea to include a reference to the documentation / task / change request that is the authority for this logic.
public int FindNumTimesNegativePriceChange()
{
int difference = 0;
int negativeChange = 0;
for (int i = 0; i < StockPrices.Length - 1; ++i)
{
// #11032: Only counting whole dollar changes
difference = (int)(StockPrices[i + 1] - StockPrices[i]);
if (difference < 0)
{
negativeChange++;
}
}
return negativeChange;
}
A final peer review item, this method processes a single input, but currently that input needs to be managed outside of the scope of this method. In this case StockPrices must be declared at the member level, but this logic is easier to isolate and test if you refactor it to pass through the source array:
public int FindNumTimesNegativePriceChange(decimal[] stockPrices)
{
decimal difference = 0;
int negativeChange = 0;
for (int i = 0; i < stockPrices.Length - 1; ++i)
{
difference = stockPrices[i + 1] - stockPrices[i];
if (difference < 0)
{
negativeChange++;
}
}
return negativeChange;
}
This version also computes the difference (delta) as a decimal
Fiddle: https://dotnetfiddle.net/XAyFnm

Selection sort with strings

Okay, I've been using this code to do a selection sort on integers:
public void selectSort(int [] arr)
{
//pos_min is short for position of min
int pos_min,temp;
for (int i=0; i < arr.Length-1; i++)
{
pos_min = i; //set pos_min to the current index of array
for (int j=i+1; j < arr.Length; j++)
{
if (arr[j] < arr[pos_min])
{
//pos_min will keep track of the index that min is in, this is needed when a swap happens
pos_min = j;
}
}
//if pos_min no longer equals i than a smaller value must have been found, so a swap must occur
if (pos_min != i)
{
temp = arr[i];
arr[i] = arr[pos_min];
arr[pos_min] = temp;
}
}
}
but now I want to run the same algorithm on a string list instead.
How could that be accomplished? It feels really awkward and like you would need additional loops to compare multiple chars of different strings..?
I tried a lot, but I couldn't come up with anything useful. :/
Note:
I know, selection sort isn't very efficient. This is for learning purposes only. I'm not looking for alternative algorithms or classes that are already part of C#. ;)

IComparable is an interface that gives us a function called CompareTo, which is a comparison operator. This operator works for all types which implement the IComparable interface, which includes both integers and strings.
// Forall types A where A is a subtype of IComparable
public void selectSort<A>(A[] arr)
where A : IComparable
{
//pos_min is short for position of min
int pos_min;
A temp;
for (int i=0; i < arr.Length-1; i++)
{
pos_min = i; //set pos_min to the current index of array
for (int j=i+1; j < arr.Length; j++)
{
// We now use 'CompareTo' instead of '<'
if (arr[j].CompareTo(arr[pos_min]) < 0)
{
//pos_min will keep track of the index that min is in, this is needed when a swap happens
pos_min = j;
}
}
//if pos_min no longer equals i than a smaller value must have been found, so a swap must occur
if (pos_min != i)
{
temp = arr[i];
arr[i] = arr[pos_min];
arr[pos_min] = temp;
}
}
}

The System.String class has a static int Compare(string, string) method that returns a negative number if the first string is smaller than the second, zero if they are equal, and a positive integer if the first is larger.
By "smaller" I mean that it comes before the other in the lexical order and by larger that it comes after the other in lexical order.
Therefore you can compare String.Compare(arr[j], arr[pos_min]) < 0 instead of just arr[j] < arr[pos_min] for integers.

I am writing the code in python 3.6
First import sys module for use of various features in our syntax.
import sys
Consider an array of string data type items.
A = ['Chuck', 'Ana', 'Charlie', 'Josh']
for i in range(0, len(A)):
min_val = i
for j in range(i+1, len(A)):
if A[min_val] > A[j]:
min_val = j
Swapping the indexed and minimum value here.
(A[min_val], A[i]) = (A[i], A[min_val])
print("Sorted Array is :")
for i in range(len(A)):
print("%s" % A[i])
This works perfectly fine for an array of string datatype and sorts out the input data in an alphabetical way out.
As in the input 'Charlie' and 'Chuck' are being compared according to their alphabetical preference till the 3rd place and arranged accordingly.
The output of this program on python console is
Sorted Array is :
Ana
Charlie
Chuck
Josh

Most evenly distribute letters of the alphabet across sequence

I'm wondering if there is a sweet way I can do this in LINQ or something but I'm trying to most evenly distribute the letters of the alphabet across X parts where X is a whole number > 0 && <= 26. For example here might be some possible outputs.
X = 1 : 1 partition of 26
X = 2 : 2 partitions of 13
X = 3 : 2
partitions of 9 and one partition of 8
etc....
Ultimately I don't want to have any partitions that didn't end up getting at least one and I'm aiming to have them achieve the most even distribution that the range of difference between partition sizes is as small as posssible.
This is the code I tried orginally:
char[] alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ".ToCharArray();
int pieces = (int)Math.Round((double)alphabet.Count() / numberOfParts);
for (int i = 0; i < numberOfParts.Count; ++i) {
char[] subset = i == numberOfParts.Count - 1 ? alphabet.Skip(i * pieces).ToArray()
: alphabet.Skip(i * pieces).Take(pieces).ToArray();
... // more code following
This seemed to be working fine at first but I realized in testing that there is a problem when X is 10. Based on this logic I'm getting 8 groups of 3 and one group of 2, leaving the 10th group 0 which is no good as I'm going for the most even distribution.
The most ideal distribution for 10 in this case would be 6 groupings of 3 and 4 groupings of 2. Any thoughts on how this might be implemented?

In general, the easiest way to implement the logic is using the modulo operator, %. Get familiar with this operator; it's very useful for the situations where it helps. There are a number of ways to write the actual code to do the distribution of letters, using arrays or not as you wish etc., but this short expression should give you an idea:
"ABCDEFGHIJKLMNOPQRSTUVWXYZ".IndexOf(letter) % partitionCount
This expression gives the zero-based index of the partition in which to deposit an upper-case letter. The string is just shown for convenience, but could be an array or some other way of representing the alphabet. You could loop over the alphabet, using logic similar to the above to choose where to deposit each letter. Up to you would be where to put the logic: inside a loop, in a method, etc.
There's nothing magical about modular arithmetic; it just "wraps around" after the end of the set of usable numbers is reached. A simple context in which we've all encountered this is in division; the % operator is essentially just giving a division remainder. Now that you understand what the % operator is doing, you could easily write your own code to do the same thing, in any language.
Putting this all together, you could write a utility, class or extension method like this one--note the % to calculate the remainder, and that simple integer division discards it:
/// <summary>
/// Returns partition sized which are as close as possible to equal while using the indicated total size available, with any extra distributed to the front
/// </summary>
/// <param name="totalSize">The total number of elements to partition</param>
/// <param name="partitionCount">The number of partitions to size</param>
/// <param name="remainderAtFront">If true, any remainder will be distributed linearly starting at the beginning; if false, backwards from the end</param>
/// <returns>An int[] containing the partition sizes</returns>
public static int[] GetEqualizedPartitionSizes(int totalSize, int partitionCount, bool remainderAtFront = true)
{
if (totalSize < 1)
throw new ArgumentException("Cannot partition a non-positive number (" + totalSize + ")");
else if (partitionCount < 1)
throw new ArgumentException("Invalid partition count (" + partitionCount + ")");
else if (totalSize < partitionCount)
throw new ArgumentException("Cannot partition " + totalSize + " elements into " + partitionCount + " partitions");
int[] partitionSizes = new int[partitionCount];
int basePartitionSize = totalSize / partitionCount;
int remainder = totalSize % partitionCount;
int remainderPartitionSize = basePartitionSize + 1;
int x;
if (remainderAtFront)
{
for (x = 0; x < remainder; x++)
partitionSizes[x] = remainderPartitionSize;
for (x = remainder; x < partitionCount; x++)
partitionSizes[x] = basePartitionSize;
}
else
{
for (x = 0; x < partitionCount - remainder; x++)
partitionSizes[x] = basePartitionSize;
for (x = partitionCount - remainder; x < partitionCount; x++)
partitionSizes[x] = remainderPartitionSize;
}
return partitionSizes;
}

I feel like the simplest way to achieve this is to perform a round robin distribution on each letter. Cycle through each letter of the alphabet and add to it, then repeat. Have a running count that determines what letter you will be putting your item in, then when it hits >26, reset it back to 0!

What I did in one app I had to distribute things in groups was something like this
var numberOfPartitions = GetNumberOfPartitions();
var numberOfElements = GetNumberOfElements();
while (alphabet.Any())
{
var elementsInCurrentPartition = Math.Ceil((double)numberOfPartitions / numberOfElements)
for (int i = 0; i < elementsInCurrentPartition; i++)
{
//fill your partition one element at a time and remove the element from the alphabet
numberOfElements--;
}
numberOfPartitions--;
}
This won't give you the exact result you expected (i.e. ideal result for 10 partitions) but it's pretty close.
p.s. this isn't tested :)

A pseudocode algorithm I have now tested:
Double count = alphabet.Count()
Double exact = count / numberOfParts
Double last = 0.0
Do Until last >= count
Double next = last + exact
ranges.Add alphabet, from:=Round(last), to:=Round(next)
last = next
Loop
ranges.Add can ignore empty ranges :-)
Here is a LinqPad VB.NET implementation of this algorithm.
So a Linq version of this would be something like
Double count = alphabet.Count();
Double exact = count / numberOfParts;
var partitions = Enumerable.Range(0, numberOfParts + 1).Select(n => Round((Double)n * exact));
Here is a LinqPad VB.NET implementation using this Linq query.

(sorry for formatting, mobile)
First, you need something like a batch method:
public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int groupSize)
{
var tempSource = source.Select(n => n);
while (tempSource.Any())
{
yield return tempSource.Take(groupSize);
tempSource = tempSource.Skip(groupSize);
}
}
Then, just call it like this:
var result = alphabet.Batch((int)Math.Ceiling(x / 26.0));

How to make this code faster?

I have a function which works a very slow for my task (it must be 10-100 times faster)
Here is code
public long Support(List<string[]> sequences, string[] words)
{
var count = 0;
foreach (var sequence in sequences)
{
for (int i = 0; i < sequence.Length - words.Length + 1; i++)
{
bool foundSeq = true;
for (int j = 0; j < words.Length; j++)
{
foundSeq = foundSeq && sequence[i + j] == words[j];
}
if (foundSeq)
{
count++;
break;
}
}
}
return count;
}
public void Support(List<string[]> sequences, List<SequenceInfo> sequenceInfoCollection)
{
System.Threading.Tasks.Parallel.ForEach(sequenceInfoCollection.Where(x => x.Support==null),sequenceInfo =>
{
sequenceInfo.Support = Support(sequences, sequenceInfo.Sequence);
});
}
Where List<string[]> sequences is a array of array of words. This array usually contains 250k+ rows. Each row is about 4-7 words. string[] words is a array of words(all words contains in sequences at least one time) which we trying to count.
The problem is foundSeq = foundSeq && sequence[i + j] == words[j];. This code take most of all execution time(Enumerable.MoveNext at second place). I want to hash all words in my array. Numbers compares faster then strings, right? I think it can help me to get 30%-80% of perfomance. But i need 10x! What can i to do? If you want to know it's a part of apriory algorithm.
Support function check if the words sequence is a part any sequence in the sequences list and count how much times.

Knuth–Morris–Pratt algorithm
In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.
From Wikipedia: https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
This is one of the improvements that you should make. With a small difference: a "word" in your code is a "characters" in the terminology of the algorithm; your "words" array is what is a word in KMP.
The idea is that when you search for "abc def ghi jkl", and have matched "abc def ghi" already, but the next word does not match, you can jump three positions.
Search: abc def ghi jkl
Text: abc def ghi klm abc def ghi jkl
i=0: abc def ghi jkl?
skip 2: XXX XXX <--- you save two iterations here, i += 2
i=2: abc?
i=3: abc? ...

The first optimisation I would make is an early-fail. Your inner loop continues through the whole sequence even if you know it's failed, and you're doing an unnecessary bit of boolean logic. Your code is this:
for (int j = 0; j < words.Length; j++)
{
foundSeq = foundSeq && sequence[i + j] == words[j];
}
Instead, just do this (or equivalent):
for (int j = 0; j < words.Length; j++)
{
if (sequence[i + j] != words[j])
{
foundSeq = false;
break;
}
}
This will save you the majority of your comparisons (you will drop out at the first word if it doesn't match, instead of continuing to compare when you know the result is false). It could even make the tenfold difference you're looking for if you expect the occurrence of the individual words in each sequence to be low (if, say, you are finding sentences in a page of English text).

Theoretically, you could concatenate each sequence and using substring matching. I don't have a compiler at hand now, so I can't test whether it will actually improve performance, but this is the general idea:
List<string> sentences = sequences.Select(seq => String.Join(" ", seq));
string toMatch = String.Join(" ", words);
return sentences.Count(sentence => sentence.Contains(toMatch));

Returning arrays

main()
{
....
i = index;
while (i < j)
{
if (ip[i] == "/")
{
ip[i - 1] = (double.Parse(ip[i - 1]) / double.Parse(ip[i + 1])).ToString();
for (int k = i; k < (ip.Length - 2); k++)
{
ip[k] = ip[k + 2];
}
Array.Resize(ref ip, ip.Length - 2);
j = j - 2;
i--;
}
i++;
}
}
For the above code I wanted to apply Oop's concepts.
This pattern repeats almost 5 times (for div,mul,add,sub,pow) in main program, with four identical lines .
To decrease the no of lines and there by to increase efficiency of code, I wrote the same like this.
i = index;
while (i < j)
{
if (ip[i] == "/")
{
ip[i - 1] = (double.Parse(ip[i - 1]) / double.Parse(ip[i + 1])).ToString();
ext.Resize(ip, i, j);
}
i++;
}
class ext
{
public static void Resize(string [] ip, int i, int j)
{
for (int k = i; k < (ip.Length - 2); k++) { ip[k] = ip[k + 2]; }
Array.Resize(ref ip, ip.Length - 2);
j=j-2; i--;
return ;
}
}
Code got compiled successfully. But the problem is the changes in array and variables that took place in called function are not reflecting in main program. The array and variables are remaining unchanged in main program.
I am unable to understand where I went wrong.
Plz guide me.
Thank You.

I don't think you understand what ref parameters are for - once you understand those (and the fact that arrays themselves can't change in size), you'll see why Array.Resize takes a ref parameter. Have a look at my parameter passing article for details.
You can fix your code by changing it like this:
public static void Resize(ref string [] ip, ref int i)
{
for (int k = i; k < (ip.Length - 2); k++)
{
ip[k] = ip[k + 2];
}
Array.Resize(ref ip, ip.Length - 2);
j = j - 2;
i--;
}
and calling it like this:
ext.Resize(ref ip, ref i);
However, I suspect that using a more appropriate data structure would make your code clearer. Is there any reason you can't use a List<string> instead?

You're removing items from the middle of a sequence, so shrinking its length. So using arrays is just making it difficult.
If ip was a List<string> instead of string[]:
i = index;
while (i < j)
{
if (ip[i] == "/")
{
ip[i - 1] = (double.Parse(ip[i - 1]) / double.Parse(ip[i + 1])).ToString();
ip.RemoveAt(i);
ip.RemoveAt(i);
j = j - 2;
i--;
}
i++;
}
It looks like you're parsing an arithmetic expression. However, you might want to allow for parentheses to control the order of evaluation, and that's going to be tricky with this structure.
Update: What your code says is: You are going to scan through a sequence of strings. Anywhere in that sequence you may find a division operator symbol: /. If so, you assume that the things on either side of it can be parsed with double.Parse. But:
( 5 + 4 ) / ( 6 - 2 )
The tokens on either side of the / in this example are ) and ( so double.Parse isn't going to work.
So I'm really just checking that you have another layer of logic outside this that deals with parentheses. For example, perhaps you are using recursive descent first, and then only running the piece of code you posted on sequences that contain no parentheses.
If you want the whole thing to be more "objecty", you could treat the problem as one of turning a sequence of tokens into a tree. Each node in the tree can be evaluated. The root node's value is the value of the whole expression. A number is a really simple node that evaluates to the number value itself. An operator has two child nodes. Parenthesis groups would just disappear from the tree - they are used to guide how you build it. If I have some time later I could develop this into a short example.
And another question: how are you splitting the whole string into tokens?

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Index independent character comparison within text blocks - c#

Related

Can't figure out how to correct my array?

Selection sort with strings

Most evenly distribute letters of the alphabet across sequence

How to make this code faster?

Returning arrays

Categories

Resources