Converting String to Integer without using Multiplication C# - c#

Is there a way to convert string to integers without using Multiplication. The implementation of int.Parse() also uses multiplication. I have other similar questions where you can manually convert string to int, but that also requires mulitiplying the number by its base 10. This was an interview question I had in one of interviews and I cant seem to find any answer regarding this.

If you assume a base-10 number system and substituting the multiplication by bit shifts (see here) this can be a solution for positive integers.
public int StringToInteger(string value)
{
int number = 0;
foreach (var character in value)
number = (number << 1) + (number << 3) + (character - '0');
return number;
}
See the example on ideone.
The only assumption is that the characters '0' to '9' lie directly next to each other in the character set. The digit-characters are converted to their integer value using character - '0'.
Edit:
For negative integers this version (see here) works.
public static int StringToInteger(string value)
{
bool negative = false;
int i = 0;
if (value[0] == '-')
{
negative = true;
++i;
}
int number = 0;
for (; i < value.Length; ++i)
{
var character = value[i];
number = (number << 1) + (number << 3) + (character - '0');
}
if (negative)
number = -number;
return number;
}
In general you should take errors into account like null checks, problems with other non numeric characters, etc.

It depends. Are we talking about the logical operation of multiplication, or how it's actually done in hardware?
For example, you can convert a hexadecimal (or octal, or any other base two multiplier) string into an integer "without multiplication". You can go character by character and keep oring (|) and bitshifting (<<). This avoids using the * operator.
Doing the same with decimal strings is trickier, but we still have simple addition. You can use loops with addition to do the same thing. Pretty simple to do. Or you can make your own "multiplication table" - hopefully you learned how to multiply numbers in school; you can do the same thing with a computer. And of course, if you're on a decimal computer (rather than binary), you can do the "bitshift", just like with the earlier hexadecimal string. Even with a binary computer, you can use a series of bitshifts - (a << 1) + (a << 3) is the same as a * 2 + a * 8 == a * 10. Careful about negative numbers. You can figure out plenty of tricks to make this interesting.
Of course, both of these are just multiplication in disguise. That's because positional numeric systems are inherently multiplicative. That's how that particular numeric representation works. You can have simplifications that hide this fact (e.g. binary numbers only need 0 and 1, so instead of multiplying, you can have a simple condition
- of course, what you're really doing is still multiplication, just with only two possible inputs and two possible outputs), but it's always there, lurking. << is the same as * 2, even if the hardware that does the operation can be simpler and/or faster.
To do away with multiplication entirely, you need to avoid using a positional system. For example, roman numerals are additive (note that actual roman numerals didn't use the compactification rules we have today - four would be IIII, not IV, and it fourteen could be written in any form like XIIII, IIIIX, IIXII, VVIIII etc.). Converting such a string to integer becomes very easy - just go character by character, and keep adding. If the character is X, add ten. If V, add five. If I, add one. I hope you can see why roman numerals remained popular for so long; positional numeric systems are wonderful when you need to do a lot of multiplication and division. If you're mainly dealing with addition and subtraction, roman numerals work great, and require a lot less schooling (and an abacus is a lot easier to make and use than a positional calculator!).
With assignments like this, there's a lot of hit and miss about what the interviewer actually expects. Maybe they just want to see your thought processes. Do you embrace technicalities (<< is not really multiplication)? Do you know number theory and computer science? Do you just plunge on with your code, or ask for clarification? Do you see it as a fun challenge, or as yet another ridiculous boring interview question that doesn't have any relevance to what your job is? It's impossible for us to tell you the answer the interviewer was looking for.
But I hope I at least gave you a glimpse of possible answers :)

Considering it being an interview question, performance might not be a high priority. Why not just:
private int StringToInt(string value)
{
for (int i = int.MinValue; i <= int.MaxValue; i++)
if (i.ToString() == value)
return i;
return 0; // All code paths must return a value.
}
If the passed string is not an integer, the method will throw an overflow exception.

Any multiplication can be replaced by repeated addition. So you can replace any multiply in an existing algorithm with a version that only uses addition:
static int Multiply(int a, int b)
{
bool isNegative = a > 0 ^ b > 0;
int aPositive = Math.Abs(a);
int bPositive = Math.Abs(b);
int result = 0;
for(int i = 0; i < aPositive; ++i)
{
result += bPositive;
}
if (isNegative) {
result = -result;
}
return result;
}
You could go further and write a specialized String to Int using this idea which minimizes the number of additions (negative number and error handling omitted for brevity):
static int StringToInt(string v)
{
const int BASE = 10;
int result = 0;
int currentBase = 1;
for (int digitIndex = v.Length - 1; digitIndex >= 0; --digitIndex)
{
int digitValue = (int)Char.GetNumericValue(v[digitIndex]);
int accum = 0;
for (int i = 0; i < BASE; ++i)
{
if (i == digitValue)
{
result += accum;
}
accum += currentBase;
}
currentBase = accum;
}
return result;
}
But I don't think that's worth the trouble since performance doesn't seem to be a concern here.

Related

C# Problems while calculating Letter Frequency percentage

I am very unskilled in programming and I am trying to finish this task for my class. I've looked all over the Internet but still can't find the answer.
Here I have a piece of my code which prints out letters and the number of times it was spotted in my text file:
for (int i = 0; i < (int)char.MaxValue; i++)
{
if (c[i] > 0 &&
char.IsLetter((char)i))
{
Console.WriteLine("Letter: {0} Frequency: {1}",
(char)i,
c[i]);
}
I've calculated the number of letters in my code using int count = s.Count(char.IsLetter);
Dividing the c[i], obviously, doesn't work. I've tried several other methods but I keep getting errors. I feel like the solution is very simple but I simply can't see it. I hope that you will be willing to help me out :)
You could use a dictionary to store the frequency of each letter. You also shouldn't loop with the constraint i < (int)char.MaxValue. This will put you out of bounds unless c's length is >= char.MaxValue.
var frequency = new Dictionary<char, int>();
for (var i = 0; i < c.Length; i++)
{
var current = (char)c[i];
if (current > 0 && char.IsLetter(current))
{
if (!frequency.ContainsKey(current))
frequency.Add(current);
frequency[current]++;
Console.WriteLine("Letter: {0} Frequency: {1}", current, frequency[current]);
}
}
Maybe you have an integer division when you want a floating point division? In that case, cast either the dividend or the divisor to double (the other one will be converted automatically), for example:
(double)c[i] / count
Edit: Since you write percentage, if you need to multiply by one hundred, you can also make sure that literal is a double, then if you are careful with the precedence of the operators, you can have all casts implicit. Example:
Console.WriteLine($"Letter: {(char)i} Count: {c[i]} Percentage {c[i] * 100.0 / count}");
The multiplication goes first because of left-associativity. The literal 100.0 has type double.

Fast and low-memory-consumption way to read in pair of numbers from file and process them?

Okay, so this is my challenge taken from CodeEval. I have to read numbers from a file that is formatted in a standard way, it has a pair of numbers separated by a comma on each line (x, n). I have to read in the pair values and process them, then print out the smallest multiple of n which is greater than or equal to x, where n is a power of 2.
EXACT REQUIREMENT: Given numbers x and n, where n is a power of 2, print out the smallest multiple of n which is greater than or equal to x. Do not use division or modulo operator.
I have come up with a number of solutions, but none of them satisfy the computer's conditions to let me pass the challenge. I only get a partial completion with scores that vary from 30 to 80 (from 100).
I'm assuming that my solutions do not pass the speed but more likely the memory-usage requirements.
I would greatly appreciate it if anyone can enlighten me and offer some better, more efficient solutions.
Here are two of my solutions:
var filePath = #"C:\Users\myfile.txt";
int x;
int n;
using (var reader = new StreamReader(filePath))
{
string numsFile = string.Empty;
while ((numsFile = reader.ReadLine()) != null)
{
var nums = numsFile.Split(',').ToArray();
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
}
<<<>>>
var fileNums = File.ReadAllLines(filePath);
foreach (var line in fileNums)
{
var nums = line.Split(',').ToArray();
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
Method to check numbers
public static int DangleNumbers(int x, int n)
{
int m = 2;
while ((n * m) < x)
{
m += 2;
}
return m * n;
}
I'm fairly new to C# and programming but these two ways I found to get the best score from several others I have tried. I'm thinking that it's not too optimal for a new string to be created on each iteration, nor do I know how to use a StringBuilder and get the values into an Int from it.
Any pointers in the right direction would be appreciated as I would really like to get this challenge passed.
The smallest multiple of n that is larger or equal to x is likely this:
if(x <= n)
{
return n;
}
else
{
return x % n == 0 ? x : (x/n + 1) * n;
}
As x and n are integers, the result of x/n will be truncated (or effectively rounded down). So the next integer larger than x that is a multiple of n is (x/n + 1) * n
Since you missed the requirements, the modulo version was the most obvious choice. Though you still got your method wrong. m = 2 would not result in the smallest being returned but it could actually be the double of the smallest if n is already larger than x.
x = 7, n = 8 would get you 16 instead of 8.
Also adding 2 to m would result in a similar problem.
x = 5, n = 2 would get you 8 instead of 6.
use the following method instead:
public static int DangleNumbers(int x, int n)
{
int result = n;
while(result < x)
result += n;
return result;
}
Still capable of begin optimized but at least right according to the (now) stated constraints.
I have tried to improve the solution with some suggestions from you guys and take the variables outside the loop and drop the ToArray() call which was redundant.
static void Main(string[] args)
{
var filePath = #"C:\Users\sorin\Desktop\sorvas.txt";
int x;
int n;
string[] nums;
using (var reader = new StreamReader(filePath))
{
string numsFile = string.Empty;
while ((numsFile = reader.ReadLine()) != null)
{
nums = numsFile.Split(',');
x = int.Parse(nums[0]);
n = int.Parse(nums[1]);
Console.WriteLine(DangleNumbers(x, n));
}
}
}
public static int DangleNumbers(int x, int n)
{
int m = 2;
while ((n * m) < x)
{
m += 2;
}
return m * n;
}
So it looks like this. The thing is that even if now the numbers have slightly improved, I got a lower score.
May it be their system to blame ?
Using the first option of reading line by line (rather than reading all lines) is clearly going to use less memory (except potentially in the case where the file is very small (eg "1,1") in which case the overhead of the reader may cause problems but at that point the memory used is probably irrelevant.
Likewise declaring the variables outside the loop is generally better but in this case since the objects are value types I'm not sure it makes a difference.
Lastly the most efficient way of doing your DangleNumbers method is probably using bitwise logic operators and the fact that n is always a power of 2. Here is my attempt:
public static int DangleNumbers3(int x, int n)
{
return ((x-1) & ~(n-1))+n;
}
Essentially it relies on the fact that in binary a power of n is always a 1 followed by zero or more zeros. Thus a multiple of n will always end in that same number of zeros. So if n has M zeros after the one then you can take the binary form of x and if it already ends in M zeros then you have your answer. Otherwise you zero out the last M digits at which point you have the multiple of n that is just under x and then you add 1.
In the code ~(n-1) is a bitmask that has M zeros at the end and the leading digits are all 1. Thus when you AND it with a number it will zero out the trailing digits. I apply this to (x-1) to avoid having to do the check for if it is already the answer and have special cases.
It is important to note that this only works because of the special form of n as a power of 2. This method avoids the need for any loops and thus should run much faster (it has five operations total and no branching at all compared to other looping methods which will tend to have at the very least an operation and a comparison per loop.

What is the sum of the digits of the number 2^1000?

This is a problem from Project Euler, and this question includes some source code, so consider this your spoiler alert, in case you are interested in solving it yourself. It is discouraged to distribute solutions to the problems, and that isn't what I want. I just need a little nudge and guidance in the right direction, in good faith.
The problem reads as follows:
2^15 = 32768 and the sum of its digits is 3 + 2 + 7 + 6 + 8 = 26.
What is the sum of the digits of the number 2^1000?
I understand the premise and math of the problem, but I've only started practicing C# a week ago, so my programming is shaky at best.
I know that int, long and double are hopelessly inadequate for holding the 300+ (base 10) digits of 2^1000 precisely, so some strategy is needed. My strategy was to set a calculation which gets the digits one by one, and hope that the compiler could figure out how to calculate each digit without some error like overflow:
using System;
using System.IO;
using System.Windows.Forms;
namespace euler016
{
class DigitSum
{
// sum all the (base 10) digits of 2^powerOfTwo
[STAThread]
static void Main(string[] args)
{
int powerOfTwo = 1000;
int sum = 0;
// iterate through each (base 10) digit of 2^powerOfTwo, from right to left
for (int digit = 0; Math.Pow(10, digit) < Math.Pow(2, powerOfTwo); digit++)
{
// add next rightmost digit to sum
sum += (int)((Math.Pow(2, powerOfTwo) / Math.Pow(10, digit) % 10));
}
// write output to console, and save solution to clipboard
Console.Write("Power of two: {0} Sum of digits: {1}\n", powerOfTwo, sum);
Clipboard.SetText(sum.ToString());
Console.WriteLine("Answer copied to clipboard. Press any key to exit.");
Console.ReadKey();
}
}
}
It seems to work perfectly for powerOfTwo < 34. My calculator ran out of significant digits above that, so I couldn't test higher powers. But tracing the program, it looks like no overflow is occurring: the number of digits calculated gradually increases as powerOfTwo = 1000 increases, and the sum of digits also (on average) increases with increasing powerOfTwo.
For the actual calculation I am supposed to perform, I get the output:
Power of two: 1000 Sum of digits: 1189
But 1189 isn't the right answer. What is wrong with my program? I am open to any and all constructive criticisms.
For calculating the values of such big numbers you not only need to be a good programmer but also a good mathematician. Here is a hint for you,
there's familiar formula ax = ex ln a , or if you prefer, ax = 10x log a.
More specific to your problem
21000 Find the common (base 10) log of 2, and multiply it by 1000; this is the power of 10. If you get something like 1053.142 (53.142 = log 2 value * 1000) - which you most likely will - then that is 1053 x 100.142; just evaluate 100.142 and you will get a number between 1 and 10; and multiply that by 1053, But this 1053 will not be useful as 53 zero sum will be zero only.
For log calculation in C#
Math.Log(num, base);
For more accuracy you can use, Log and Pow function of Big Integer.
Now rest programming help I believe you can have from your side.
Normal int can't help you with such a large number. Not even long. They are never designed to handle numbers such huge. int can store around 10 digits (exact max: 2,147,483,647) and long for around 19 digits (exact max: 9,223,372,036,854,775,807). However, A quick calculation from built-in Windows calculator tells me 2^1000 is a number of more than 300 digits.
(side note: the exact value can be obtained from int.MAX_VALUE and long.MAX_VALUE respectively)
As you want precise sum of digits, even float or double types won't work because they only store significant digits for few to some tens of digits. (7 digit for float, 15-16 digits for double). Read here for more information about floating point representation, double precision
However, C# provides a built-in arithmetic
BigInteger for arbitrary precision, which should suit your (testing) needs. i.e. can do arithmetic in any number of digits (Theoretically of course. In practice it is limited by memory of your physical machine really, and takes time too depending on your CPU power)
Back to your code, I think the problem is here
Math.Pow(2, powerOfTwo)
This overflows the calculation. Well, not really, but it is the double precision is not precisely representing the actual value of the result, as I said.
A solution without using the BigInteger class is to store each digit in it's own int and then do the multiplication manually.
static void Problem16()
{
int[] digits = new int[350];
//we're doing multiplication so start with a value of 1
digits[0] = 1;
//2^1000 so we'll be multiplying 1000 times
for (int i = 0; i < 1000; i++)
{
//run down the entire array multiplying each digit by 2
for (int j = digits.Length - 2; j >= 0; j--)
{
//multiply
digits[j] *= 2;
//carry
digits[j + 1] += digits[j] / 10;
//reduce
digits[j] %= 10;
}
}
//now just collect the result
long result = 0;
for (int i = 0; i < digits.Length; i++)
{
result += digits[i];
}
Console.WriteLine(result);
Console.ReadKey();
}
I used bitwise shifting to left. Then converting to array and summing its elements. My end result is 1366, Do not forget to add reference to System.Numerics;
BigInteger i = 1;
i = i << 1000;
char[] myBigInt = i.ToString().ToCharArray();
long sum = long.Parse(myBigInt[0].ToString());
for (int a = 0; a < myBigInt.Length - 1; a++)
{
sum += long.Parse(myBigInt[a + 1].ToString());
}
Console.WriteLine(sum);
since the question is c# specific using a bigInt might do the job. in java and python too it works but in languages like c and c++ where the facility is not available you have to take a array and do multiplication. take a big digit in array and multiply it with 2. that would be simple and will help in improving your logical skill. and coming to project Euler. there is a problem in which you have to find 100! you might want to apply the same logic for that too.
Try using BigInteger type , 2^100 will end up to a a very large number for even double to handle.
BigInteger bi= new BigInteger("2");
bi=bi.pow(1000);
// System.out.println("Val:"+bi.toString());
String stringArr[]=bi.toString().split("");
int sum=0;
for (String string : stringArr)
{ if(!string.isEmpty()) sum+=Integer.parseInt(string); }
System.out.println("Sum:"+sum);
------------------------------------------------------------------------
output :=> Sum:1366
Here's my solution in JavaScript
(function (exponent) {
const num = BigInt(Math.pow(2, exponent))
let arr = num.toString().split('')
arr.slice(arr.length - 1)
const result = arr.reduce((r,c)=> parseInt(r)+parseInt(c))
console.log(result)
})(1000)
This is not a serious answer—just an observation.
Although it is a good challenge to try to beat Project Euler using only one programming language, I believe the site aims to further the horizons of all programmers who attempt it. In other words, consider using a different programming language.
A Common Lisp solution to the problem could be as simple as
(defun sum_digits (x)
(if (= x 0)
0
(+ (mod x 10) (sum_digits (truncate (/ x 10))))))
(print (sum_digits (expt 2 1000)))
main()
{
char c[60];
int k=0;
while(k<=59)
{
c[k]='0';
k++;
}
c[59]='2';
int n=1;
while(n<=999)
{
k=0;
while(k<=59)
{
c[k]=(c[k]*2)-48;
k++;
}
k=0;
while(k<=59)
{
if(c[k]>57){ c[k-1]+=1;c[k]-=10; }
k++;
}
if(c[0]>57)
{
k=0;
while(k<=59)
{
c[k]=c[k]/2;
k++;
}
printf("%s",c);
exit(0);
}
n++;
}
printf("%s",c);
}
Python makes it very simple to compute this with an oneliner:
print sum(int(digit) for digit in str(2**1000))
or alternatively with map:
print sum(map(int,str(2**1000)))

Fermat primality test

I have tried to write a code for Fermat primality test, but apparently failed.
So if I understood well: if p is prime then ((a^p)-a)%p=0 where p%a!=0.
My code seems to be OK, therefore most likely I misunderstood the basics. What am I missing here?
private bool IsPrime(int candidate)
{
//checking if candidate = 0 || 1 || 2
int a = candidate + 1; //candidate can't be divisor of candidate+1
if ((Math.Pow(a, candidate) - a) % candidate == 0) return true;
return false;
}
Reading the wikipedia article on the Fermat primality test, You must choose an a that is less than the candidate you are testing, not more.
Furthermore, as MattW commented, testing only a single a won't give you a conclusive answer as to whether the candidate is prime. You must test many possible as before you can decide that a number is probably prime. And even then, some numbers may appear to be prime but actually be composite.
Your basic algorithm is correct, though you will have to use a larger data type than int if you want to do this for non-trivial numbers.
You should not implement the modular exponentiation in the way that you did, because the intermediate result is huge. Here is the square-and-multiply algorithm for modular exponentiation:
function powerMod(b, e, m)
x := 1
while e > 0
if e%2 == 1
x, e := (x*b)%m, e-1
else b, e := (b*b)%m, e//2
return x
As an example, 437^13 (mod 1741) = 819. If you use the algorithm shown above, no intermediate result will be greater than 1740 * 1740 = 3027600. But if you perform the exponentiation first, the intermediate result of 437^13 is 21196232792890476235164446315006597, which you probably want to avoid.
Even with all of that, the Fermat test is imperfect. There are some composite numbers, the Carmichael numbers, that will always report prime no matter what witness you choose. Look for the Miller-Rabin test if you want something that will work better. I modestly recommend this essay on Programming with Prime Numbers at my blog.
You are dealing with very large numbers, and trying to store them in doubles, which is only 64 bits.
The double will do the best it can to hold your number, but you are going to loose some accuracy.
An alternative approach:
Remember that the mod operator can be applied multiple times, and still give the same result.
So, to avoid getting massive numbers you could apply the mod operator during the calculation of your power.
Something like:
private bool IsPrime(int candidate)
{
//checking if candidate = 0 || 1 || 2
int a = candidate - 1; //candidate can't be divisor of candidate - 1
int result = 1;
for(int i = 0; i < candidate; i++)
{
result = result * a;
//Notice that without the following line,
//this method is essentially the same as your own.
//All this line does is keeps the numbers small and manageable.
result = result % candidate;
}
result -= a;
return result == 0;
}

Looking for a way to optimize this algorithm for parsing a very large string

The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probablity() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but its true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to whats described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the next.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionally low scores. Emphasing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
In yoiu probability method you are iterating the map 8 times. Each of your wheres iterates the entire list and so does the count. Adding a .ToList() ad the end would (potentially) speed things. That said I think your main problem is that the structure you've chossen to store the data in is not suited for the purpose of the probability method. You could create a one pass version where the structure you store you're data in calculates the tentative distribution on insert. That way when you're done with the insert (which shouldn't be slowed down too much) you're done or you could do as the code below have a cheap calculation of the probability when you need it.
As an aside you might want to take puntuation and whitespace into account. The first letter/word of a sentence and the first letter of a word gives clear indication on what language a given text is written in by taking punctuaion charaters and whitespace as part of you distribution you include those characteristics of the sample data. We did that some years back. Doing that we shown that using just three characters was almost as exact (we had no failures with three on our test data and almost as exact is an assumption given that there most be some weird text where the lack of information would yield an incorrect result). as using more (we test up till 7) but the speed of three letters made that the best case.
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
return ((decimal) node._count)/node._parent._children.Count;
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..
Ok, I don't have time to work out details, but this really calls for
neural classifier nets (Just take any off the shelf, even the Controllable Regex Mutilator would do the job with way more scalability) -- heuristics over brute force
you could use tries (Patricia Tries a.k.a. Radix Trees to make a space optimized version of your datastructure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 264 matrix would be the most efficent. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary <int,int> count_of_single_letters; // key = single character
SortedDictionary <int,int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary <int,int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary <int,int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
Finally, a note on data types. You're using the decimal type. This is not an efficient type as there is no direct mapping to CPU native type and there is overhead in processing the data. Use a double instead, I think the precision will be sufficient. The most precise way will be to store the probability as two integers, the numerator and denominator and then do the division as late as possible.
The best approach here is to using sparse storage and pruning after each each 10000 character for example. Best storage strucutre in this case is prefix tree, it will allow fast calculation of probability, updating and sparse storage. You can find out more theory in this javadoc http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html

Categories