Binary searching to find an unknown number - c#

I have a function called Slice. It takes two values: a first node value and a last node value. I'm trying to find the value in [first, last], inclusive, that makes a function G(x) equal zero, or as near to zero as possible to 2 decimal places. I can solve this iteratively by starting at the first number and incrementing by .01, but that could take a very long time.
I'd like it to run in logarithmic time, like a binary search. The tricky part is that I don't know which slice to take after I find the midpoint. Any tips or suggestions on how to continue would be appreciated.
public double Slice(decimal first, decimal last)
{
double firstNodeValue = FinalOptionDecider(first);
double midNodeValue = FinalOptionDecider((last + first) / 2);
double lastNodeValue = FinalOptionDecider(last);
}

Binary search relies on sorted data to decide which half to search next. So unless the function you pass your data through keeps the data in sorted order (that is, the function is monotonic), you cannot use binary search.
You have two options:
- Give up on binary search and just do a linear search.
- Process all the inputs, then sort and binary search the outputs (you can pair each input with its output in a dictionary to retrieve the input).

As a general problem for an arbitrary function, this can be very difficult to solve. It becomes a lot easier if you can make certain assumptions. The algorithm you've sort of started to stub out is called a bisection algorithm.
First you need to bracket the value that you're looking for. So if you want FinalOptionDecider(x) to return zero, your firstNodeValue and lastNodeValue must have opposite signs: one positive, the other negative. If they're both positive or both negative, you've failed to bracket the value that you want and you can't guarantee that searching between first and last will find an answer. You also won't be able to guarantee that you can make the decision described in the next paragraph. So check for that first.
That condition is basically your answer... when you get the midNodeValue you need to check and see if your desired value is between firstNodeValue and midNodeValue or if it's between midNodeValue and lastNodeValue. Depending on which it is, you need to do a slice on that interval again. Repeat until you reach the desired precision.
If your function has multiple zeroes (like g(x) = x^2 - 1 does) then you will only find one of the zeroes.
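When several zeroes are expected, one common workaround is to scan the interval in coarse steps and record every sub-interval where the sign changes, then bisect each of those separately. The following is only a sketch: `BracketFinder`, `FindBrackets` and the `step` parameter are made-up names, and it assumes the function is continuous and that `step` is small enough that no two zeroes fall inside the same step.

```csharp
using System;
using System.Collections.Generic;

public static class BracketFinder
{
    // Walks [first, last] in fixed steps and returns each sub-interval (left, right)
    // where f changes sign, or where f is exactly zero at the right end point.
    public static List<KeyValuePair<decimal, decimal>> FindBrackets(
        Func<decimal, double> f, decimal first, decimal last, decimal step)
    {
        if (step <= 0)
            throw new ArgumentOutOfRangeException("step", "step must be positive.");

        var brackets = new List<KeyValuePair<decimal, decimal>>();
        decimal left = first;
        double fLeft = f(left);

        // A zero sitting exactly on the first end point is reported as a degenerate bracket.
        if (fLeft == 0)
            brackets.Add(new KeyValuePair<decimal, decimal>(first, first));

        while (left < last)
        {
            decimal right = Math.Min(left + step, last);
            double fRight = f(right);
            if (fRight == 0 || fLeft * fRight < 0)
                brackets.Add(new KeyValuePair<decimal, decimal>(left, right));
            left = right;
            fLeft = fRight;
        }
        return brackets;
    }
}
```

For g(x) = x^2 - 1 on [-2, 2] with a step of 0.5 this reports two brackets, one around each zero, and each bracket can then be handed to a bisection routine.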

You're just looking for a root-finding algorithm. The one you've started implementing here is the bisection method. Bisection isn't the fastest, but it has the major advantage that it's simple and guaranteed to converge to a root in the specified interval, provided the function is continuous and changes sign over the interval (by the intermediate value theorem, a root is then guaranteed to exist in the interval). Here's a simple, somewhat generic implementation of the bisection method:
public static decimal FindRoot( decimal first, decimal last, Func<decimal, double> f, double value, decimal tolerance = 0.01m )
{
    // Solve f( x ) == value by looking for a root of f( x ) - value.
    double fa = f( first ) - value;
    double fb = f( last ) - value;
    if( fa * fb > 0 )
    {
        throw new ArgumentException( "Interval not guaranteed to contain root." );
    }
    else if( fa == 0 )
    {
        return first;
    }
    else if( fb == 0 )
    {
        return last;
    }
    while( Math.Abs( first - last ) > tolerance )
    {
        decimal mid = ( first + last ) / 2;
        double fc = f( mid ) - value;
        if( fc * fb < 0 )
        {
            // The sign change lies in [mid, last].
            first = mid;
            fa = fc;
        }
        else if( fc * fa < 0 )
        {
            // The sign change lies in [first, mid].
            last = mid;
            fb = fc;
        }
        else
        {
            // fc is exactly zero.
            return mid;
        }
    }
    // Refine the answer by linear interpolation between the final end points.
    return ( first - last ) * (decimal) ( fa / ( fb - fa ) ) + first;
}

Binary search is all about narrowing down an interval until the interval consists of only one value. I don't know what you mean by random, but a binary search can only be performed on a sorted dataset. Remember that for a one-off lookup it is almost always faster to search the unsorted data directly than to sort and then search!

Related

multiplicative persistence - recursion?

I'm working on this:
Write a function, persistence, that takes in a positive parameter num
and returns its multiplicative persistence, which is the number of
times you must multiply the digits in num until you reach a single
digit.
For example:
persistence(39) == 3 // because 3*9 = 27, 2*7 = 14, 1*4=4
// and 4 has only one digit
persistence(999) == 4 // because 9*9*9 = 729, 7*2*9 = 126,
// 1*2*6 = 12, and finally 1*2 = 2
persistence(4) == 0 // because 4 is already a one-digit number
This is what I tried:
public static int Persistence(long n)
{
List<long> listofints = new List<long>();
while (n > 0)
{
listofints.Add(n % 10);
n /= 10;
}
listofints.Reverse();
// list of a splited number
int[] arr = new int[listofints.Count];
for (int i = 0; i < listofints.Count; i++)
{
arr[i] = (int)listofints[i];
}
//list to array
int pro = 1;
for (int i = 0; i < arr.Length; i++)
{
pro *= arr[i];
}
// multiply each number
return pro;
}
I have a problem with understanding recursion - there is probably a place to use it here. Can someone give me advice, not a solution, on how to deal with that?
It looks like you've got the complete function to process one iteration. Now all you need to do is add the recursion. At the end of the function call Persistence again with the result of the first iteration as the parameter.
Persistence(pro);
This will recursively call your function passing the result of each iteration as the parameter to the next iteration.
Finally, you need to add some code to determine when you should stop the recursion, so you only want to call Persistence(pro) if your condition is true. This way, when your condition becomes false you'll stop the recursion.
if (some stop condition is true)
{
Persistence(pro);
}
Let me take a stab at explaining when you should consider using a recursive method.
Example of Factorial: Factorial of n is found by multiplying 1*2*3*4*..*n.
Suppose you want to find out what the factorial of a number is. To find the answer, you can write a loop that keeps multiplying a number by the next smaller number, and the next, until it reaches 0. Once you reach 0, you are done and you return your result.
Instead of using loops, you can use Recursion because the process at "each" step is the same. Multiply the first number with the result of the next, result of the next is found by multiplying that next number with the result of the next and so on.
5 * (result of rest)
4 * (result of rest )
3 * (result of rest)
...
1 (factorial of 0 is 1).---> Last Statement.
In this case, if we are doing recursion, we have a terminator of the sequence, the last statement where we know for a fact that factorial of 0 = 1. So, we can write this like,
FactorialOf(5) = return 5 * FactorialOf(4) = 120 (5 * 24)
FactorialOf(4) = return 4 * FactorialOf(3) = 24 (4 * 6)
FactorialOf(3) = return 3 * FactorialOf(2) = 6 (3 * 2)
FactorialOf(2) = return 2 * FactorialOf(1) = 2 (2 * 1)
FactorialOf(1) = return 1 * FactorialOf(0) = 1 (1 * 1)
FactorialOf(0) = Known -> 1.
So, it would make sense to use the same method over and over and once we get to our terminator, we stop and start going back up the tree. Each statement that called the FactorialOf would start returning numbers until it reaches all the way to the top. At the top, we will have our answer.
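Written out in C#, the factorial walk-through above could look like the sketch below. `FactorialDemo` is a made-up class name, and `long` is an assumption that only holds results up to 20!.

```csharp
using System;

public static class FactorialDemo
{
    // Base case: the factorial of 0 is known to be 1.
    // Every other case multiplies n by the result for the next smaller number.
    public static long FactorialOf(int n)
    {
        if (n < 0)
            throw new ArgumentOutOfRangeException("n", "n must not be negative.");
        if (n == 0)
            return 1;
        return n * FactorialOf(n - 1);
    }
}
```

Calling `FactorialDemo.FactorialOf(5)` descends to the base case and multiplies on the way back up, giving 120.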
Your case of Persistence
It calls for recursive method as well as you are taking the result and doing the same process on it each time.
Persistence(39) (not single) = return 1 + Persistence(3 * 9 = 27) = 3
Persistence(27) (not single) = return 1 + Persistence(2 * 7 = 14) = 2
Persistence(14) (not single) = return 1 + Persistence(1 * 4 = 4) = 1
Persistence(4) (single digit) = Known -> 0 // Terminator.
At the end of the day, if you have same process performed after each calculation / processing with a termination, you can most likely find a way to use recursion for that process.
You definitely can invoke your multiplication call recursively.
You will need an initial state (0 multiplications) and then keep calling your method until you reach your stop condition. Then you return the iteration count you've got up to as your result and pass it all the way back up:
int persistence(int input, int count = 0) {} // this is how I would define the method
// and this is how I see the control flowing
var result = persistence(input: 39, count: 0) {
//write a code that derives 27 out of 39
//then keep calling persistence() again, incrementing the iteration count with each invocation
return persistence(input: 27, count: 1) {
return persistence(input: 14, count: 2) {
return persistence(input: 4, count: 3) {
return 3
}
}
}
}
The above is obviously not real code, but I'm hoping it illustrates the point well enough for you to explore it further.
Designing a simple recursive solution usually involves two steps:
- Identify the trivial base case to which you can calculate the answer easily.
- Figure out how to turn a complex case to a simpler one, in a way that quickly approaches the base case.
In your problem:
- Any single-digit number has a simple solution, which is persistence = 0.
- Multiplying all digits of a number produces a smaller number, and we know that the persistence of the bigger number is greater than the persistence of the smaller number by exactly one.
That should bring you to your solution. All you need to do is understand the above and write that in C#. There are only a few modifications that you need to make in your existing code. I won't give you a ready solution as that kinda defeats the purpose of the exercise, doesn't it. If you encounter technical problems with codifying your solution into C#, you're welcome to ask another question.
public int PerRec(int n)
{
string numS = n.ToString();
if(numS.Length == 1)
return 0;
var number = numS.ToArray().Select(x => int.Parse(x.ToString())).Aggregate((a,b) => a*b);
return PerRec(number) + 1;
}
For every recursion, you should have a stop condition(a single digit in this case).
The idea here is taking your input and convert it to string to calculate that length. If it is 1 then you return 0
Then you need to do your transformation. Take all the digits from the string representation (in this case from the char array), parse each of them and, once you have the IEnumerable<int>, multiply the digits together to calculate the next parameter for your recursive call.
The final result is the new recursion call + 1 (which represents the previous transformation)
You can do this step in different ways:
var number = numS.ToArray().Select(x => int.Parse(x.ToString())).Aggregate((a,b) => a*b);
convert numS into an array of char calling ToArray()
iterate over the collection and convert each char into its integer representation and save it into an array or a list
iterate over the int list multiplying all the digits to have the next number for your recursion
Hope this helps
public static int Persistence(long n)
{
    if (n < 10) // handle the trivial cases - stop condition
    {
        return 0;
    }
    long pro = 1; // int may not be big enough, use long instead
    while (n > 0) // simplify the problem by one level
    {
        pro *= n % 10;
        n /= 10;
    }
    return 1 + Persistence(pro); // 1 = one level solved, call the same function for the rest
}
It is the classic recursion usage. You handle the basic cases, simplify the problem by one level and then use the same function again - that is the recursion.
You can rewrite the recursion into loops if you wish, you always can.
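As an illustration of that last remark, here is one way the same calculation might be written with loops only. `PersistenceLoop` is a made-up class name, and the sketch assumes a non-negative input.

```csharp
public static class PersistenceLoop
{
    // Counts how many times the digits must be multiplied together
    // before a single digit remains. No recursion involved.
    public static int Persistence(long n)
    {
        int steps = 0;
        while (n >= 10) // more than one digit left
        {
            long product = 1;
            while (n > 0) // multiply the digits of n
            {
                product *= n % 10;
                n /= 10;
            }
            n = product;
            steps++;
        }
        return steps;
    }
}
```

The outer loop plays the role of the recursive call and the counter replaces the `1 +` that each recursive level contributed.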

How to compare large string integer values

Currently I am working on a program that processes extremely large integer numbers.
To avoid hitting int.MaxValue, I wrote a script that processes strings as numbers and splits them up into a List<int> as follows.
Entry 0 holds the highest group of the currently known value:
list entry 0: 123 (hundred twenty three million)
list entry 1: 321 (three hundred twenty one thousand)
list entry 2: 777 (seven hundred seventy seven)
Now my question is: how would one check whether the incoming string value can be subtracted from these values?
The start I have made on subtraction is as follows, but I am getting stuck on the subtracting part.
public bool Subtract(string value)
{
string cleanedNumeric = NumericAndSpaces(value);
List<string> input = new List<string>(cleanedNumeric.Split(' '));
// In case 1) the amount is bigger 2) biggest value exceeded by a 10 fold
// 3) biggest value exceeds the value
if (input.Count > values.Count ||
input[input.Count - 1].Length > values[0].ToString().Length ||
FastParseInt(input[input.Count -1]) > values[0])
return false;
// Flip the array for ease of comparison
input.Reverse();
return true;
}
EDIT
The current target for the highest achievable number in this program is a googolplex, and we are limited to .NET 3.5 on Mono.
You should do some testing on this because I haven't run extensive tests but it has worked on the cases I've put it through. Also, it might be worth ensuring that each character in the string is truly a valid integer as this procedure would bomb given a non-integer character. Finally, it expects positive numbers for both subtrahend and minuend.
static void Main(string[] args)
{
// In subtraction, a subtrahend is subtracted from a minuend to find a difference.
string minuend = "900000";
string subtrahend = "900001";
var isSubtractable = IsSubtractable(subtrahend, minuend);
}
public static bool IsSubtractable(string subtrahend, string minuend)
{
minuend = minuend.Trim();
subtrahend = subtrahend.Trim();
// maybe loop through characters and ensure all are valid integers
// check if the original number is longer - clearly subtractable
if (minuend.Length > subtrahend.Length) return true;
// check if original number is shorter - not subtractable
if (minuend.Length < subtrahend.Length) return false;
// at this point we know the strings are the same length, so we'll
// loop through the characters, one by one, from the start, to determine
// if the minuend has a higher value character in a column of the number.
int numberIndex = 0;
while (numberIndex < minuend.Length )
{
Int16 minuendCharValue = Convert.ToInt16(minuend[numberIndex]);
Int16 subtrahendCharValue = Convert.ToInt16(subtrahend[numberIndex]);
if (minuendCharValue > subtrahendCharValue) return true;
if (minuendCharValue < subtrahendCharValue) return false;
numberIndex++;
}
// number are the same
return true;
}
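The validation hinted at in the comments (making sure every character is a digit) might be combined with the comparison roughly as follows. This is a sketch only: `NumericStrings`, `IsDigitsOnly` and `CompareMagnitude` are made-up names, and stripping leading zeros is an extra assumption about what the input may contain.

```csharp
using System;

public static class NumericStrings
{
    // True when s is non-empty and contains only the characters '0'..'9'.
    public static bool IsDigitsOnly(string s)
    {
        if (string.IsNullOrEmpty(s)) return false;
        foreach (char c in s)
        {
            if (c < '0' || c > '9') return false;
        }
        return true;
    }

    // Compares two non-negative decimal strings.
    // The result is negative, zero or positive, in the manner of CompareTo.
    public static int CompareMagnitude(string a, string b)
    {
        if (!IsDigitsOnly(a) || !IsDigitsOnly(b))
            throw new FormatException("Only the digits 0-9 are allowed.");

        // Leading zeros would otherwise distort the length comparison.
        a = a.TrimStart('0');
        b = b.TrimStart('0');

        if (a.Length != b.Length)
            return a.Length.CompareTo(b.Length);

        // Same length: ordinal order of digit characters matches numeric order.
        return string.CompareOrdinal(a, b);
    }
}
```

With this, a subtrahend can be taken from a minuend without going negative whenever `NumericStrings.CompareMagnitude(minuend, subtrahend) >= 0`.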
[BigInteger](https://msdn.microsoft.com/en-us/library/system.numerics.biginteger.aspx) is of arbitrary size.
Run this code if you don't believe me
var foo = new BigInteger(2);
while (true)
{
foo = foo * foo;
}
Things get crazy. My debugger (VS2013) becomes unable to represent the number before it's done. I ran it for a short time and got a number with 1.2 million digits in base 10 from ToString. It is big enough. There is a 2GB limit on objects, which can be overridden in .NET 4.5 with the gcAllowVeryLargeObjects setting.
Now what to do if you are using .NET 3.5? You basically need to reimplement BigInteger (obviously only taking what you need, there is a lot in there).
public class MyBigInteger
{
uint[] _bits; // you need somewhere to store the value to an arbitrary length.
....
You also need to perform maths on these arrays. Here is the Equals method from BigInteger:
public bool Equals(BigInteger other)
{
AssertValid();
other.AssertValid();
if (_sign != other._sign)
return false;
if (_bits == other._bits)
// _sign == other._sign && _bits == null && other._bits == null
return true;
if (_bits == null || other._bits == null)
return false;
int cu = Length(_bits);
if (cu != Length(other._bits))
return false;
int cuDiff = GetDiffLength(_bits, other._bits, cu);
return cuDiff == 0;
}
It basically does cheap sign and length comparisons of the uint arrays, then, if those don't reveal a difference, hands off to GetDiffLength.
internal static int GetDiffLength(uint[] rgu1, uint[] rgu2, int cu)
{
for (int iv = cu; --iv >= 0; )
{
if (rgu1[iv] != rgu2[iv])
return iv + 1;
}
return 0;
}
Which does the expensive check of looping through the arrays looking for a difference.
All your math will have to follow this pattern and can largely be ripped off from the .NET source code.
Googolplex and 2GB:
Here the 2GB limit becomes a problem, because you will need an object size of 3.867×10^90 gigabytes. This is the point where you give up, or get clever and store objects as powers at the cost of not being able to represent a lot of them. *2
If you moderate your expectations, it doesn't actually change the maths of BigInteger to split _bits into multiple jagged arrays *1. You change the cheap checks a bit: rather than checking the size of the array, you check the number of subarrays and then the size of the last one. Then the loop needs to be a bit (but not much) more complex, in that it does an elementwise array comparison for each subarray. There are other changes as well, but it's by no means impossible and it gets you out of the 2GB limit.
*1 Note use jagged arrays[][], not multidimensional arrays [,] which are still subject to the same limit.
*2 I.e. give up on precision and store the mantissa and exponent. If you look at how floating point numbers are implemented, they can't represent all numbers between their max and min (there are uncountably many real numbers in any range, but only finitely many bit patterns). They make a complex trade-off between precision and range. If you want to do this, looking at float implementations will be a lot more useful than talking about integer representations like BigInteger.
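To make the jagged-array idea a little more concrete, a comparison over chunked storage might be sketched like this. Everything here is an assumption for illustration and not taken from the .NET source: the `JaggedMagnitude` name, a layout with the least significant chunk and word first, every chunk except the last being the same fixed size, and no leading zero words.

```csharp
public static class JaggedMagnitude
{
    // Returns -1, 0 or 1 for two magnitudes stored as jagged uint arrays.
    // Assumed layout: least significant chunk (and word) first, all chunks
    // except the last completely filled, no leading zero words.
    public static int Compare(uint[][] a, uint[][] b)
    {
        // Cheap checks first: more chunks, or a longer top chunk, means a larger value.
        if (a.Length != b.Length)
            return a.Length < b.Length ? -1 : 1;
        if (a.Length == 0)
            return 0;
        int top = a.Length - 1;
        if (a[top].Length != b[top].Length)
            return a[top].Length < b[top].Length ? -1 : 1;

        // Expensive check: walk from the most significant word downwards.
        for (int chunk = top; chunk >= 0; chunk--)
        {
            for (int i = a[chunk].Length - 1; i >= 0; i--)
            {
                if (a[chunk][i] != b[chunk][i])
                    return a[chunk][i] < b[chunk][i] ? -1 : 1;
            }
        }
        return 0;
    }
}
```

The structure mirrors Equals and GetDiffLength above: length checks first, then an elementwise scan starting at the most significant end.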

Roman numbers subtract without conversion

Is possible to subtract roman numbers without conversion to decimal numbers?
For Example:
X - III = VII
So in input I have X and III. In output I have VII.
I need algorithm without conversion to decimal number.
Now I don't have an idea.
The simplest algorithm is to create a decrement (--) function for Roman numerals. Subtracting A-B then means repeating A-- and B-- together until nothing is left in B.
But I wanted to do something more efficient.
Roman numerals can be looked at as positional in a very weak way. We'll use that.
Let's make short tables of subtraction:
X-V=V
X-I=IX
IX-I=VIII
VIII-I=VII
VII-I=VI
VI-I=V
V-I=IV
IV-I=III
III-I=II
II-I=I
I-I=_
And addition:
V+I=VI
And the same for the CLX and MDC levels. Of course, you could create just one table and use it on the different levels by substituting letters.
Let's take two numbers, for example A=MMDCVI=2606 and B=CCCXLIII=343.
Let's distribute them into levels (powers of 10). The next several operations happen inside levels only.
A=MM+DC+VI, B=CCC+XL+III
Then subtracting
A-B= MM+(DC-CCC)+(-XL)+(VI-III)
At every level we have three possible letters: units, five-units and ten-units. The combinations (unit, five-units) and (unit, ten-units) are translated into differences.
A-B= MM+(DC-CCC)+(-L+X)+(VI-III)
The normal combinations (where senior symbol is before junior one), will be translated into sums.
A-B= MM+(D+C-C-C-C)+(-L+X)+(V+I-I-I-I)
Shorten the combinations of same symbols
A-B= MM+(D-C-C)+(-L+X)+(V-I-I)
If some level is negative, borrow a unit from the senior level. Of course, the borrow may have to pass through an empty level.
A-B= MM+(D-C-C-C)+(C-L+X)+(V-I-I)
Now, in every level we apply the subtraction table we made, subtracting every minus-signed symbol, starting from the top of the table and repeating until no minus-signed members remain.
A-B= MM+(CD-C-C)+(L+X)+(IV-I)
A-B= MM+(CCC-C)+(L+X)+(III)
A-B= MM+(CC)+(L+X)+(III)
Now, use the addition table
A-B= MM+(CC)+(LX)+(III)
Now, we'll open the parenthesis. If there is '_' in some level, there will be nothing on its place.
A-B=MMCCLXIII =2263
The result is correct.
There is a more elegant solution than simply unrolling the whole roman number. The disadvantage of this would be a complexity in O(n) as opposed to O(log n) where n is the input number.
I found this task quite interesting. It is indeed possible without a conversion. Basically, you just have to look at the last digit of each number. If they match, take them away; if not, replace the bigger one. However, the whole task gets a lot more complicated with numbers like "IV", because you need a lookahead.
Here is the code. Since this is most likely a homework assignment, I took out some code so you have to think for yourself about what the rest should look like.
private static char[] romanLetters = { 'I', 'V', 'X', 'L', 'C', 'D', 'M' };
private static string[] vals = { "IIIII", "VV", "XXXXX", "LL", "CCCCC", "DD" };
static string RomanSubtract(string a, string b)
{
var _a = new StringBuilder(a);
var _b = new StringBuilder(b);
var aIndex = a.Length - 1;
var bIndex = b.Length - 1;
while (_a.Length > 0 && _b.Length > 0)
{
if (characters match)
{
if (lookahead for a finds a smaller char)
{
aIndex = ReplaceRomans(_a, aIndex, aChar);
continue;
}
if (lookahead for b finds a smaller char)
{
bIndex = ReplaceRomans(_b, bIndex, bChar);
continue;
}
_a.Remove(aIndex, 1);
_b.Remove(bIndex, 1);
aIndex--;
bIndex--;
}
else if (aChar > bChar)
{
aIndex = ReplaceRomans(_a, aIndex, aChar);
}
else
{
bIndex = ReplaceRomans(_b, bIndex, bChar);
}
}
return _a.Length > 0 ? _a.ToString() : "-" + _b.ToString();
}
private static int ReplaceRomans(StringBuilder roman, int index, int charIndex)
{
if (index > 0)
{
var beforeChar = Array.IndexOf(romanLetters, roman[index - 1]);
if (beforeChar < charIndex)
{
Replace e.g. IX with VIIII
}
}
Replace e.g. V with IIIII
}
Apart from checking every possible combination of input numbers - assuming the input is bounded - there is no way to do what you're asking. Roman numerals are awful in terms of mathematical operations.
You could write an algorithm that doesn't convert them, but it'd have to use decimal numbers at some point. Or you could normalize them to e.g. "IIIII...", but again you'd need to write some equivalences like "50 chars = L".
Rough idea:
Create a "map" or list of how each roman numeral relates to simpler numerals, for instance IV corresponds to (II + II), while V corresponds to (III + II), and X corresponds to (V + V).
When calculating e.g. X - III, treat this not as a mathematical term, but a string, which can be changed in several steps, where you each time check for something to remove from both sides of the minus operator:
X - III // Nothing to remove
(V + V) - III // Still nothing to remove
(III + II + III + II) - III // NOW we can remove a "III" from both sides
// while still treating these as roman numerals.
Result: III + II + II
Rejoined: V + II = VII.
If you make each number correspond to something as simple as possible in the "map" (e.g. III can correspond to (II + I), so you don't get stuck with left-overs), then I'm pretty sure you can figure out some kind of solution here.
Of course this requires a bunch of string-operations, comparisons, and a map from which your algorithm can "know" how to compare or switch values. Not exactly traditional maths, but then again, I suppose this is how roman numerals do work.
The basic sketch of my idea is to build up simple converters that chain together via either iterators or observables.
So, for instance, on the input side of things you have a CConverter that performs the transformations of the combinations CD, CM, D and M into CCCC, CCCCCCCCC, CCCCC, and CCCCCCCCCC respectively. All other received inputs are passed through unmolested. Then the next converter in line, XConverter, converts XL, XC, L and C into the appropriate number of Xs, and so on until you just have a stream of all Is.
Then you perform the subtraction by consuming both of these streams of Is, in lockstep. If the minuend runs out first, then the answer is 0 or negative, in which case everything has gone wrong. Otherwise, when the subtrahend runs out, you just start emitting all remaining Is from the minuend.
Now you need to convert back. So the first INormalizer queues up Is until it's received five of them, then it emits a V. If it reaches the end of the stream and it received four, then it emits IV. Otherwise it just emits as many Is as it received until the end of the stream, and then ends its own stream.
Next, the VNormalizer queues up Vs until it's received two, and then emits an X. If it receives an IV and it has one queued V then it emits IX, otherwise it emits IV.
And if the stream it's receiving ends or just starts sending Is and it still has a V queued, then it emits that, then whatever else the sending stream wanted to send, and then ends its own stream.
And so on, building back up into the correct roman numerals.
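A much-simplified sketch of this expand, cancel, rebuild idea is shown below. It uses plain string replacement instead of chained iterators or observables, so it is not the streaming design described above, and `RomanTally` with its rule tables is made up for illustration. It assumes well-formed numerals and a minuend at least as large as the subtrahend.

```csharp
using System;
using System.Text;

public static class RomanTally
{
    // Applied in order: subtractive pairs first, then every symbol is
    // replaced by the next smaller one until only I's remain.
    private static readonly string[,] ExpandRules =
    {
        { "CM", "DCCCC" }, { "CD", "CCCC" }, { "XC", "LXXXX" },
        { "XL", "XXXX" }, { "IX", "VIIII" }, { "IV", "IIII" },
        { "M", "DD" }, { "D", "CCCCC" }, { "C", "LL" },
        { "L", "XXXXX" }, { "X", "VV" }, { "V", "IIIII" }
    };

    // The way back: group I's into larger symbols, then restore the
    // subtractive pairs (longer patterns before shorter ones).
    private static readonly string[,] CollapseRules =
    {
        { "IIIII", "V" }, { "VV", "X" }, { "XXXXX", "L" },
        { "LL", "C" }, { "CCCCC", "D" }, { "DD", "M" },
        { "DCCCC", "CM" }, { "CCCC", "CD" }, { "LXXXX", "XC" },
        { "XXXX", "XL" }, { "VIIII", "IX" }, { "IIII", "IV" }
    };

    private static string Rewrite(string s, string[,] rules)
    {
        for (int i = 0; i < rules.GetLength(0); i++)
            s = s.Replace(rules[i, 0], rules[i, 1]);
        return s;
    }

    // Returns minuend - subtrahend as a Roman numeral ("" when they are equal).
    public static string Subtract(string minuend, string subtrahend)
    {
        var a = new StringBuilder(Rewrite(minuend, ExpandRules));
        var b = new StringBuilder(Rewrite(subtrahend, ExpandRules));

        // Consume both tallies in lockstep, one I at a time.
        while (a.Length > 0 && b.Length > 0)
        {
            a.Length--;
            b.Length--;
        }
        if (b.Length > 0)
            throw new ArgumentException("The subtrahend is larger than the minuend.");

        return Rewrite(a.ToString(), CollapseRules);
    }
}
```

For example, `RomanTally.Subtract("X", "III")` expands to ten and three I's, cancels three of them and rebuilds the remaining seven as VII. The tally grows linearly with the value, so this trades efficiency for simplicity.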
Parse the input strings to group the digits in the mixed 5/10 base (M, D, C, L, X, V, I). I.e. MMXVII yields MM||||X|V|II.
Now subtract from right to left, by canceling the digits in pairs. I.e. V|III - II = V|II - I = V|I.
When required, do a borrow, i.e. split the next highest digit (V splits to IIIII, X to VV...). Example: V|I - III = V| - II = IIIII - II = III. Borrows may need to be recursive, like X||I - III = X|| - II = VV| - II = V|IIIII - II = V|III.
The prefix notation (IV, IX, XL, XC...) makes it a little more complicated. An approach is to preprocess the string to remove them on input (substitute with IIII, VIIII, XXXX, LXXXX...) and postprocess to restore them on output.
Example:
XCIX - LVI = LXXXXVIIII - LVI = L|XXXX|V|IIII - L|V|I = L|XXXX|V|III - L|V| = L|XXXX||III - L|| = XXXX||III = XXXXIII = XLIII
Pure character processing, no arithmetic involved.
Digits= "MDCLXVI"
Divided= ["DD", "CCCCC", "LL", "XXXXX", "VV", "IIIII"]
def In(Input):
    return Input.replace("CM", "DCCCC").replace("CD", "CCCC").replace("XC", "LXXXX").replace("XL", "XXXX").replace("IX", "VIIII").replace("IV", "IIII")
def Group(Input):
    Groups= []
    for Digit in Digits:
        # Split after the last digit
        m= Input.rfind(Digit) + 1
        Groups.append(Input[:m])
        Input= Input[m:]
    return Groups
def Decrement(A, i):
    if len(A[i]) == 0:
        # Borrow
        Decrement(A, i - 1)
        A[i]= Divided[i - 1] + A[i]
    A[i]= A[i][:-1]
def Subtract(A, B):
    for i in range(len(Digits) - 1, -1, -1):
        while len(B[i]) > 0:
            Decrement(A, i)
            B[i]= B[i][:-1]
def Out(Input):
    return Input.replace("DCCCC", "CM").replace("CCCC", "CD").replace("LXXXX", "XC").replace("XXXX", "XL").replace("VIIII", "IX").replace("IIII", "IV")
A= Group(In("MMDCVI"))
B= Group(In("CCCXLIII"))
Subtract(A, B)
print(Out("".join(A)))
>>>
MMCCLXIII
How about an Enum?
public enum RomanNumber
{
    I = 1,
    II = 2,
    III = 3,
    IV = 4,
    V = 5,
    VI = 6,
    VII = 7,
    VIII = 8,
    IX = 9,
    X = 10
}
Then using it like this:
int newRomanNumber = (int) RomanNumber.X - (int) RomanNumber.III;
If your input is 'X - III = VII', then you will also have to parse this string.
But I won't do this work for you. ;-)

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I can't simply divide the instances of [t][h][a][t] in the text by the total number of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
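A tally along those lines could be sketched as below. `NGramCounter`, `Count` and `Conditional` are made-up names, and the conditional estimate count(gram) / count(prefix) is a simplification that stands in for the weighting described above rather than reproducing it.

```csharp
using System;
using System.Collections.Generic;

public class NGramCounter
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Tallies every substring of length 1..maxLength starting at each position.
    public NGramCounter(string src, int maxLength)
    {
        for (int i = 0; i < src.Length; i++)
        {
            for (int len = 1; len <= maxLength && i + len <= src.Length; len++)
            {
                string key = src.Substring(i, len);
                int current;
                _counts.TryGetValue(key, out current); // current stays 0 for a new key
                _counts[key] = current + 1;
            }
        }
    }

    // How often the given substring occurs in the source text.
    public int Count(string gram)
    {
        int count;
        return _counts.TryGetValue(gram, out count) ? count : 0;
    }

    // Estimates P(last character | preceding characters) as count(gram) / count(prefix).
    public double Conditional(string gram)
    {
        if (gram == null || gram.Length < 2)
            throw new ArgumentException("gram needs at least two characters.");
        int prefixCount = Count(gram.Substring(0, gram.Length - 1));
        return prefixCount == 0 ? 0.0 : (double)Count(gram) / prefixCount;
    }
}
```

Each lookup is then a couple of hash probes instead of a scan over the whole list, for example `new NGramCounter(text, 4).Conditional("th")`.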
In your probability method you are iterating the map several times over. LINQ's Where is lazily evaluated, so each Count() re-runs its whole chain of filters against the entire list. Adding a .ToList() at the end of each Where would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the purpose of the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much), you're done. Or you could do as the code below does and have a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter or word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution, you include those characteristics of the sample data. We did that some years back. Doing so, we showed that using just three characters was almost as exact as using more (we tested up to 7), and the speed of three letters made that the best choice. We had no failures with three on our test data; "almost as exact" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result.
EDIT
Here's an example of how I think I would do it in C#
using System.Collections.Generic;
class TextParser{
    public static Node Parse(string src){
        var top = new Node(null);
        // Walk every window of four consecutive characters.
        for (int i = 0; i < src.Length - 3; i++){
            var first = src[i];
            var second = src[i+1];
            var third = src[i+2];
            var fourth = src[i+3];
            var firstLevelNode = top.AddChild(first);
            var secondLevelNode = firstLevelNode.AddChild(second);
            var thirdLevelNode = secondLevelNode.AddChild(third);
            thirdLevelNode.AddChild(fourth);
        }
        return top;
    }
}
public class Node{
    private readonly Node _parent;
    private readonly Dictionary<char,Node> _children
        = new Dictionary<char, Node>();
    private int _count;      // how often this node was reached
    private int _childTotal; // sum of the counts of all direct children
    public Node(Node parent){
        _parent = parent;
    }
    public Node AddChild(char value){
        if (!_children.ContainsKey(value))
        {
            _children.Add(value, new Node(this));
        }
        var levelNode = _children[value];
        levelNode._count++;
        _childTotal++;
        return levelNode;
    }
    // Relative frequency of the last character of substring among
    // everything that followed the same prefix in the parsed text.
    public decimal Probability(string substring){
        var node = this;
        foreach (var c in substring){
            if(!node.Contains(c))
                return 0m;
            node = node[c];
        }
        if (node._parent == null) // empty substring: nothing to look up
            return 0m;
        // Divide by the total count of the siblings, not by how many distinct siblings there are.
        return ((decimal) node._count)/node._parent._childTotal;
    }
    public Node this[char value]{
        get { return _children[value]; }
    }
    private bool Contains(char c){
        return _children.ContainsKey(c);
    }
}
The usage would then be:
var top = TextParser.Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..
Ok, I don't have time to work out details, but this really calls for
neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values? Your tuple-based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the LINQ statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the LINQ statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
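A rough sketch of how those dictionaries could be filled in one pass, assuming each letter is first mapped to a code 0..25 so that it fits in 5 bits (which is what the `* 32` in the key comments implies). `PackedNGramCounter`, `Key` and `Count` are hypothetical names, and skipping any window that contains a non-letter is my assumption rather than something the answer specifies.

```csharp
using System.Collections.Generic;

public static class PackedNGramCounter
{
    // Maps 'a'..'z' (case-insensitive) to 0..25; returns -1 for anything else.
    private static int Code(char c)
    {
        c = char.ToLowerInvariant(c);
        return (c >= 'a' && c <= 'z') ? c - 'a' : -1;
    }

    // Packs n letters starting at start into one int, 5 bits per letter
    // (char1 + char2 * 32 + ...). Returns -1 if any character is not a letter.
    public static int Key(string text, int start, int n)
    {
        int key = 0;
        for (int j = 0; j < n; j++)
        {
            int code = Code(text[start + j]);
            if (code < 0) return -1;
            key |= code << (5 * j);
        }
        return key;
    }

    // One pass over the text, tallying every n-gram made only of letters.
    public static SortedDictionary<int, int> Count(string text, int n)
    {
        var counts = new SortedDictionary<int, int>();
        for (int i = 0; i + n <= text.Length; i++)
        {
            int key = Key(text, i, n);
            if (key < 0) continue;
            int current;
            counts.TryGetValue(key, out current); // 0 when the key is new
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

Calling `Count(text, 1)` through `Count(text, 4)` would give the four dictionaries listed above; a plain `Dictionary<int,int>` or a flat array indexed by the key would work the same way if sorted order is not needed.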
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as there is no direct mapping to a CPU-native type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and denominator, and then do the division as late as possible.
The best approach here is to use sparse storage and to prune periodically, for example after every 10000 characters. The best storage structure in this case is a prefix tree: it allows fast calculation of probability, fast updating, and sparse storage. You can find more of the theory in this javadoc http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html

Longest recurring cycle in its decimal fraction - a bug or a misunderstanding?

This is fairly 'math-y' but I'm posting here because it's a Project Euler problem, & I have working code that presumably has bugs in it.
The question Determing longest repeating cycle in a decimal expansion solves the problem using logarithms, but I'm interested in solving with simple brute force. More accurately, I'm interested in understanding why my algorithm and code is not returning the correct solution.
The algorithm is simple:
replicate a 'long division',
at each step record the divisor and the remainder
when a divisor / remainder tuple is repeated, infer that the decimal representation will repeat.
Here are private fields, as requested
private int numerator;
private int recurrence;
private int result;
private int resultRecurrence;
private List<dynamic> digits;
and here is the code:
private void Go()
{
    foreach (var i in primes)
    {
        digits = new List<dynamic>();
        numerator = 1;
        recurrence = 0;
        while (numerator != 0)
        {
            numerator *= 10;
            // quotient
            var q = numerator / i;
            // remainder
            var r = numerator % i;
            digits.Add(new { Divisor = q, Remainder = r });
            // if we've found a repetition then break out
            var m = digits.Where(p => p.Divisor == q && p.Remainder == r).ToList();
            if (m.Count > 1)
            {
                recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
                break;
            }
            numerator = r;
        }
        if (recurrence > resultRecurrence)
        {
            resultRecurrence = recurrence;
            result = i;
        }
    }
}
When testing integers < 10 and < 20 I get the correct result; and I correctly identify the value of i as well. However the decimal representation that I get is incorrect - I calculate i-1 whereas the correct result is far less (something like i-250).
So presumably I either have a programming bug - which I can't find - or a logic bug.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
edit
I'm not going to include my prime number code - it's not relevant, as I explain above I correctly identify the value of i (from memory it is 983) but I'm having problems getting the correct value for resultRecurrence.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
Close.
For all primes except 2 and 5 (which divide 10), the sequence of remainders is formed by starting with 1 and transforming by
remainder = (10 * remainder) % prime
thus the k-th remainder is 10^k (mod prime) and the set of remainders forms a subgroup of the group of nonzero remainders modulo prime [1]. The length of the recurring cycle is the order of that subgroup, which is also known as the order of 10 modulo prime.
The order of the group of nonzero remainders modulo prime is prime-1, and there's a theorem by Lagrange:
Let G be a finite group of order g and H be a subgroup of G. Then the order h of H divides g.
So the length of the cycle is always a divisor of prime-1, and sometimes it's prime-1, e.g. for 7 or 19.
[1] For composite numbers n coprime to 10, that would be the group of remainders modulo n that are coprime to n.
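To make the divisibility claim concrete, here is a small sketch that computes the order of 10 modulo n by repeating the remainder transformation until it returns to 1. The class and method names are invented for illustration; returning 0 when n shares a factor with 10 is just a convention chosen for this sketch.

```csharp
public static class DecimalCycle
{
    // Smallest k >= 1 with 10^k = 1 (mod n), or 0 if n shares a factor with 10
    // (in that case 10 is not invertible modulo n and the order is undefined).
    public static int OrderOfTen(int n)
    {
        if (n < 2 || n % 2 == 0 || n % 5 == 0) return 0;
        int remainder = 10 % n;
        int k = 1;
        while (remainder != 1)
        {
            remainder = (10 * remainder) % n;
            k++;
        }
        return k;
    }
}
```

For the primes 7, 11, 13 and 19 this gives 6, 2, 6 and 18, each of which divides prime-1; only 7 and 19 reach prime-1 itself.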
First off, you don’t need the divisors, you only need the remainders.
Secondly, I would split the function into multiple independent parts instead of having everything in one big method: The long division / finding of the cycle length is independent of the rest (= finding the longest cycle).
Your break on Where coupled with Count is unintuitive. Why not just use a while loop with the condition (!digits.Contains(r))? (This would require putting 0 as a remainder into the digits list before the loop starts.)
This leaves us with much cleaner code that should be straightforward to debug.
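A sketch of what that split might look like, tracking only remainders and seeding the list with 0 as described. `LongDivision`, `CycleLength` and `LongestCycleBelow` are hypothetical names, and this version scans every denominator rather than only primes, which is an assumption of the sketch.

```csharp
using System.Collections.Generic;

public static class LongDivision
{
    // Length of the recurring cycle of 1/d, or 0 if the expansion terminates.
    public static int CycleLength(int d)
    {
        // Seeding with 0 means a terminating division also ends the loop.
        var remainders = new List<int> { 0 };
        int r = 1 % d;
        while (!remainders.Contains(r))
        {
            remainders.Add(r);
            r = (r * 10) % d;
        }
        // r == 0: the division terminated. Otherwise the cycle spans from the
        // first occurrence of r to the end of the list.
        return r == 0 ? 0 : remainders.Count - remainders.IndexOf(r);
    }

    // Denominator below limit whose reciprocal has the longest recurring cycle.
    public static int LongestCycleBelow(int limit)
    {
        int best = 0, bestLength = 0;
        for (int d = 2; d < limit; d++)
        {
            int length = CycleLength(d);
            if (length > bestLength)
            {
                bestLength = length;
                best = d;
            }
        }
        return best;
    }
}
```

`List<int>.Contains` is a linear scan, which is fine for denominators below 1000; a `Dictionary<int,int>` from remainder to position would avoid the repeated scans for larger inputs.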
recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
Surely the value of resultRecurrence is always going to be i-1 ? Since for a fraction of the form 1/n, the decimal starts repeating exactly when the division-in-progress (the ith digit) gives the same quotient-remainder as the very first trial division (1, hence i-1).
(as a side note, may I introduce you to Math.DivRem).
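For reference, a minimal sketch of one long-division step written with Math.DivRem, which returns the quotient and hands back the remainder through an out parameter. The wrapper `NextDigit` and its class are illustrative names only.

```csharp
using System;

public static class DivRemDemo
{
    // Produces the next decimal digit of remainder/divisor and updates
    // remainder in place, using a single Math.DivRem call.
    public static int NextDigit(ref int remainder, int divisor)
    {
        int newRemainder;
        int digit = Math.DivRem(remainder * 10, divisor, out newRemainder);
        remainder = newRemainder;
        return digit;
    }
}
```

Starting from a remainder of 1 with divisor 7, successive calls yield the digits 1, 4, 2, 8, 5, 7 of 1/7.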
