Is it possible to subtract Roman numerals without converting them to decimal numbers?
For Example:
X - III = VII
So in input I have X and III. In output I have VII.
I need an algorithm that works without converting to decimal numbers.
Right now I have no idea how to approach it.
The simplest algorithm is to write a decrement (--) function for Roman numerals. Subtracting A-B then means applying A-- and B-- together, repeatedly, until nothing is left in B.
But I wanted something more efficient.
The Roman numbers can be looked at as positional in some very weak way. We'll use it.
Let's make short tables of subtraction:
X-V=V
X-I=IX
IX-I=VIII
VIII-I=VII
VII-I=VI
VI-I=V
V-I=IV
IV-I=III
III-I=II
II-I=I
I-I=_
And addition:
V+I=VI
The same tables apply to the C/L/X and M/D/C levels. Of course, you could create only one table and reuse it on the different levels by substituting letters.
Let's take two numbers, for example A=MMDCVI=2606 and B=CCCXLIII=343.
Let's distribute them into levels (powers of 10). The next several operations happen inside levels only.
A=MM+DC+VI, B=CCC+XL+III
Then subtracting
A-B= MM+(DC-CCC)+(-XL)+(VI-III)
At every level we have three possible letters: units, five-units and ten-units. The combinations (unit, five-units) and (unit, ten-units), where the smaller symbol comes first, are translated into differences.
A-B= MM+(DC-CCC)+(-L+X)+(VI-III)
The normal combinations (where the senior symbol comes before the junior one) are translated into sums.
A-B= MM+(D+C-C-C-C)+(-L+X)+(V+I-I-I-I)
Shorten the combinations of same symbols
A-B= MM+(D-C-C)+(-L+X)+(V-I-I)
If some level is negative, borrow a unit from the next higher level. The borrow can also pass through an empty level.
A-B= MM+(D-C-C-C)+(C-L+X)+(V-I-I)
Now, in every level, we apply the subtraction table we have made: subtract every negated symbol, starting from the top of the table, and repeat until no negated symbols remain.
A-B= MM+(CD-C-C)+(L+X)+(IV-I)
A-B= MM+(CCC-C)+(L+X)+(III)
A-B= MM+(CC)+(L+X)+(III)
Now, use the addition table
A-B= MM+(CC)+(LX)+(III)
Now remove the parentheses. If a level contains '_', nothing is written in its place.
A-B=MMCCLXIII =2263
The result is correct.
There is a more elegant solution than simply unrolling the whole Roman number into single units. The disadvantage of unrolling is a complexity of O(n), as opposed to O(log n), where n is the input number.
I found this task quite interesting. It is indeed possible without a conversion. Basically, you just have to look at the last digit of each number. If they match, take them both away; if not, replace the bigger one with smaller letters. However, the whole task gets a lot more complicated with numbers like "IV", because you need a lookahead.
Here is the code. Since this is most likely a homework assignment, I took out some code so you have to think for yourself, how the rest should look like.
private static char[] romanLetters = { 'I', 'V', 'X', 'L', 'C', 'D', 'M' };
private static string[] vals = { "IIIII", "VV", "XXXXX", "LL", "CCCCC", "DD" };
static string RomanSubtract(string a, string b)
{
var _a = new StringBuilder(a);
var _b = new StringBuilder(b);
var aIndex = a.Length - 1;
var bIndex = b.Length - 1;
while (_a.Length > 0 && _b.Length > 0)
{
if (characters match)
{
if (lookahead for a finds a smaller char)
{
aIndex = ReplaceRomans(_a, aIndex, aChar);
continue;
}
if (lookahead for b finds a smaller char)
{
bIndex = ReplaceRomans(_b, bIndex, bChar);
continue;
}
_a.Remove(aIndex, 1);
_b.Remove(bIndex, 1);
aIndex--;
bIndex--;
}
else if (aChar > bChar)
{
aIndex = ReplaceRomans(_a, aIndex, aChar);
}
else
{
bIndex = ReplaceRomans(_b, bIndex, bChar);
}
}
return _a.Length > 0 ? _a.ToString() : "-" + _b.ToString();
}
private static int ReplaceRomans(StringBuilder roman, int index, int charIndex)
{
if (index > 0)
{
var beforeChar = Array.IndexOf(romanLetters, roman[index - 1]);
if (beforeChar < charIndex)
{
Replace e.g. IX with VIIII
}
}
Replace e.g. V with IIIII
}
Apart from checking every possible combination of input numbers - assuming the input is bounded - there is no way to do what you're asking. Roman numerals are awful in terms of mathematical operations.
You could write an algorithm that doesn't convert them, but it'd have to use decimal numbers at some point. Or you could normalize them to e.g. "IIIII...", but again you'd need to write some equivalences like "50 chars = L".
Rough idea:
Create a "map" or list of how each roman numeral relates to simpler numerals, for instance IV corresponds to (II + II), while V corresponds to (III + II), and X corresponds to (V + V).
When calculating e.g. X - III, treat this not as a mathematical term, but a string, which can be changed in several steps, where you each time check for something to remove from both sides of the minus operator:
X - III // Nothing to remove
(V + V) - III // Still nothing to remove
(III + II + III + II) - III // NOW we can remove a "III" from both sides
// while still treating these as roman numerals.
Result: III + II + II
Rejoined: V + II = VII.
If you make each number correspond to something as simple as possible in the "map" (e.g. III can correspond to (II + I), so you don't get stuck with left-overs), then I'm pretty sure you can figure out some kind of solution here.
Of course this requires a bunch of string-operations, comparisons, and a map from which your algorithm can "know" how to compare or switch values. Not exactly traditional maths, but then again, I suppose this is how roman numerals do work.
The basic sketch of my idea is to build up simple converters that chain together via either iterators or observables.
So, for instance, on the input side of things you have a CConverter that performs the transformations of the combinations CD, CM, D and M into CCCC, CCCCCCCCC, CCCCC, and CCCCCCCCCC respectively. All other received inputs are passed through unmolested. Then the next converter in line, XConverter, converts XL, XC, L and X into the appropriate number of Xs, and so on until you just have a stream of all Is.
Then you perform the subtraction by consuming both of these streams of Is, in lockstep. If the minuend runs out first, then the answer is 0 or negative, in which case everything has gone wrong. Otherwise, when the subtrahend runs out, you just start emitting all remaining Is from the minuend.
Now you need to convert back. So the first INormalizer queues up Is until it's received five of them, then it emits a V. If it reaches the end of the stream and it received four, then it emits IV. Otherwise it just emits as many Is as it received until the end of the stream, and then ends its own stream.
Next, the VNormalizer queues up Vs until it's received two, and then emits an X. If it receives an IV and it has one queued V then it emits IX, otherwise it emits IV.
And if the stream it's receiving ends or just starts sending Is and it still has a V queued, then it emits that, then whatever else the sending stream wanted to send, and then ends its own stream.
And so on, building back up into the correct roman numerals.
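The converter chain above can be approximated with plain string rewriting. This is only a rough sketch, not the streaming design itself: the names to_ones, from_ones and subtract are made up for illustration, and whole strings are rewritten instead of chaining iterators. Everything is expanded down to Is, the two piles are consumed in lockstep, and the leftover Is are folded back up level by level.

```python
# Hypothetical helper names; string rewriting stands in for the chained converters.
SUBTRACTIVE = [("CM", "DCCCC"), ("CD", "CCCC"), ("XC", "LXXXX"),
               ("XL", "XXXX"), ("IX", "VIIII"), ("IV", "IIII")]
SPLITS = [("M", "DD"), ("D", "CCCCC"), ("C", "LL"),
          ("L", "XXXXX"), ("X", "VV"), ("V", "IIIII")]

def to_ones(roman):
    """Expand a Roman numeral into a string consisting only of 'I'."""
    for pair, expanded in SUBTRACTIVE:
        roman = roman.replace(pair, expanded)
    for letter, smaller in SPLITS:
        roman = roman.replace(letter, smaller)
    return roman

def from_ones(ones):
    """Fold a string of 'I' back into a normalized Roman numeral."""
    for letter, smaller in reversed(SPLITS):
        ones = ones.replace(smaller, letter)
    for pair, expanded in SUBTRACTIVE:
        ones = ones.replace(expanded, pair)
    return ones

def subtract(minuend, subtrahend):
    a, b = list(to_ones(minuend)), list(to_ones(subtrahend))
    # Consume both piles of Is in lockstep.
    while a and b:
        a.pop()
        b.pop()
    if b:
        raise ValueError("result would be negative")
    return from_ones("".join(a))  # empty string when both numbers are equal

print(subtract("X", "III"))
```

No character is ever converted to a number here; the cost is that the intermediate strings are as long as the value itself.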
Parse the input strings to group the digits in the mixed 5/10 base (M, D, C, L, X, V, I). I.e. MMXVII yields MM||||X|V|II.
Now subtract from right to left, by canceling the digits in pairs. I.e. V|III - II = V|II - I = V|I.
When required, do a borrow, i.e. split the next highest digit (V splits to IIIII, X to VV...). Example: V|I - III = V| - II = IIIII - II = III. Borrows may need to be recursive, like X||I - III = X|| - II = VV| - II = V|IIIII - II = V|III.
The prefix notation (IV, IX, XL, XC...) makes it a little more complicated. An approach is to preprocess the string to remove them on input (substitute with IIII, VIIII, XXXX, LXXXX...) and postprocess to restore them on output.
Example:
XCIX - LVI = LXXXXVIIII - LVI = L|XXXX|V|IIII - L|V|I = L|XXXX|V|III - L|V| = L|XXXX||III - L|| = XXXX||III = XXXXIII = XLIII
Pure character processing, no arithmetic involved.
Digits = "MDCLXVI"
Divided = ["DD", "CCCCC", "LL", "XXXXX", "VV", "IIIII"]

def In(Input):
    # Replace the prefix notation with its purely additive form
    return Input.replace("CM", "DCCCC").replace("CD", "CCCC").replace("XC", "LXXXX").replace("XL", "XXXX").replace("IX", "VIIII").replace("IV", "IIII")

def Group(Input):
    Groups = []
    for Digit in Digits:
        # Split after the last digit
        m = Input.rfind(Digit) + 1
        Groups.append(Input[:m])
        Input = Input[m:]
    return Groups

def Decrement(A, i):
    if len(A[i]) == 0:
        # Borrow: split one digit of the next higher group
        Decrement(A, i - 1)
        A[i] = Divided[i - 1] + A[i]
    A[i] = A[i][:-1]

def Subtract(A, B):
    # Cancel digits in pairs, from right to left
    for i in range(len(Digits) - 1, -1, -1):
        while len(B[i]) > 0:
            Decrement(A, i)
            B[i] = B[i][:-1]

def Out(Input):
    # Restore the prefix notation
    return Input.replace("DCCCC", "CM").replace("CCCC", "CD").replace("LXXXX", "XC").replace("XXXX", "XL").replace("VIIII", "IX").replace("IIII", "IV")

A = Group(In("MMDCVI"))
B = Group(In("CCCXLIII"))
Subtract(A, B)
print(Out("".join(A)))
>>>
MMCCLXIII
How about an Enum?
public enum RomanNumber
{
I = 1,
II = 2,
III = 3,
IV = 4,
V = 5,
VI = 6,
VII = 7,
VIII = 8,
IX = 9,
X = 10
}
Then using it like this:
int newRomanNumber = (int) RomanNumber.X - (int) RomanNumber.III;
If your input is 'X - III = VII', then you will also have to parse this string.
But I won't do this work for you. ;-)
Is there a way to convert a string to an integer without using multiplication? The implementation of int.Parse() also uses multiplication. I have seen other similar questions where you manually convert a string to an int, but that also requires multiplying the number by its base, 10. This was an interview question I was asked, and I can't seem to find any answer regarding this.
If you assume a base-10 number system and substitute the multiplication with bit shifts (see here), this can be a solution for positive integers.
public int StringToInteger(string value)
{
int number = 0;
foreach (var character in value)
number = (number << 1) + (number << 3) + (character - '0');
return number;
}
See the example on ideone.
The only assumption is that the characters '0' to '9' lie directly next to each other in the character set. The digit-characters are converted to their integer value using character - '0'.
Edit:
For negative integers this version (see here) works.
public static int StringToInteger(string value)
{
bool negative = false;
int i = 0;
if (value[0] == '-')
{
negative = true;
++i;
}
int number = 0;
for (; i < value.Length; ++i)
{
var character = value[i];
number = (number << 1) + (number << 3) + (character - '0');
}
if (negative)
number = -number;
return number;
}
In general you should take errors into account like null checks, problems with other non numeric characters, etc.
It depends. Are we talking about the logical operation of multiplication, or how it's actually done in hardware?
For example, you can convert a hexadecimal (or octal, or any other power-of-two base) string into an integer "without multiplication". You can go character by character and keep ORing (|) and bit-shifting (<<). This avoids using the * operator.
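As a small illustration of the shift-and-OR idea for hexadecimal input, here is a sketch in Python. The function name parse_hex and the HEX_DIGITS lookup string are made up for this example, and there is no handling of signs or prefixes such as 0x.

```python
HEX_DIGITS = "0123456789abcdef"  # lookup string, an assumption of this sketch

def parse_hex(text):
    number = 0
    for ch in text.lower():
        digit = HEX_DIGITS.index(ch)    # raises ValueError on a non-hex character
        number = (number << 4) | digit  # make room for 4 bits, then OR the digit in
    return number

print(parse_hex("ff"))  # 255
```

Each hex digit occupies exactly four bits, which is why the shift never collides with the digit being ORed in.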
Doing the same with decimal strings is trickier, but we still have simple addition. You can use loops with addition to do the same thing. Pretty simple to do. Or you can make your own "multiplication table" - hopefully you learned how to multiply numbers in school; you can do the same thing with a computer. And of course, if you're on a decimal computer (rather than binary), you can do the "bitshift", just like with the earlier hexadecimal string. Even with a binary computer, you can use a series of bitshifts - (a << 1) + (a << 3) is the same as a * 2 + a * 8 == a * 10. Careful about negative numbers. You can figure out plenty of tricks to make this interesting.
Of course, both of these are just multiplication in disguise. That's because positional numeric systems are inherently multiplicative. That's how that particular numeric representation works. You can have simplifications that hide this fact (e.g. binary numbers only need 0 and 1, so instead of multiplying, you can have a simple condition; of course, what you're really doing is still multiplication, just with only two possible inputs and two possible outputs), but it's always there, lurking. << is the same as * 2, even if the hardware that does the operation can be simpler and/or faster.
To do away with multiplication entirely, you need to avoid using a positional system. For example, Roman numerals are additive (note that actual Roman numerals didn't use the compactification rules we have today: four would be IIII, not IV, and fourteen could be written in any form like XIIII, IIIIX, IIXII, VVIIII etc.). Converting such a string to an integer becomes very easy: just go character by character, and keep adding. If the character is X, add ten. If V, add five. If I, add one. I hope you can see why Roman numerals remained popular for so long; positional numeric systems are wonderful when you need to do a lot of multiplication and division. If you're mainly dealing with addition and subtraction, Roman numerals work great, and require a lot less schooling (and an abacus is a lot easier to make and use than a positional calculator!).
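A minimal sketch of that character-by-character addition, in Python. It assumes the purely additive notation described above (IIII rather than IV), so letter order does not matter; the name additive_roman_to_int and the table covering letters beyond I, V and X are additions made for this example.

```python
# Value of each letter; only I, V and X are mentioned in the text above.
LETTER_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def additive_roman_to_int(text):
    total = 0
    for ch in text:
        total += LETTER_VALUES[ch]  # plain addition, no positional weighting
    return total

print(additive_roman_to_int("XIIII"))  # 14
```

Because nothing depends on position, IIIIX and VVIIII give the same result as XIIII.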
With assignments like this, there's a lot of hit and miss about what the interviewer actually expects. Maybe they just want to see your thought processes. Do you embrace technicalities (<< is not really multiplication)? Do you know number theory and computer science? Do you just plunge on with your code, or ask for clarification? Do you see it as a fun challenge, or as yet another ridiculous boring interview question that doesn't have any relevance to what your job is? It's impossible for us to tell you the answer the interviewer was looking for.
But I hope I at least gave you a glimpse of possible answers :)
Considering it being an interview question, performance might not be a high priority. Why not just:
private int StringToInt(string value)
{
for (int i = int.MinValue; i <= int.MaxValue; i++)
if (i.ToString() == value)
return i;
return 0; // All code paths must return a value.
}
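If the passed string is not an integer, the loop never finds a match: in a checked context the increment past int.MaxValue throws an OverflowException, while in the default unchecked context i wraps around and the loop never ends.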
Any multiplication can be replaced by repeated addition. So you can replace any multiply in an existing algorithm with a version that only uses addition:
static int Multiply(int a, int b)
{
bool isNegative = a > 0 ^ b > 0;
int aPositive = Math.Abs(a);
int bPositive = Math.Abs(b);
int result = 0;
for(int i = 0; i < aPositive; ++i)
{
result += bPositive;
}
if (isNegative) {
result = -result;
}
return result;
}
You could go further and write a specialized String to Int using this idea which minimizes the number of additions (negative number and error handling omitted for brevity):
static int StringToInt(string v)
{
const int BASE = 10;
int result = 0;
int currentBase = 1;
for (int digitIndex = v.Length - 1; digitIndex >= 0; --digitIndex)
{
int digitValue = (int)Char.GetNumericValue(v[digitIndex]);
int accum = 0;
for (int i = 0; i < BASE; ++i)
{
if (i == digitValue)
{
result += accum;
}
accum += currentBase;
}
currentBase = accum;
}
return result;
}
But I don't think that's worth the trouble since performance doesn't seem to be a concern here.
So, what I'm trying to do is something like this: (example)
a,b,c,d.. etc. aa,ab,ac.. etc. ba,bb,bc, etc.
So, this can essentially be explained as generally increasing and just printing all possible variations, starting at a. So far, I've been able to do it with one letter, starting out like this:
for (int i = 97; i <= 122; i++)
{
item = (char)i;
}
But, I'm unable to eventually add the second letter, third letter, and so forth. Is anyone able to provide input? Thanks.
Since there hasn't been a solution so far that would literally "increment a string", here is one that does:
static string Increment(string s) {
if (s.All(c => c == 'z')) {
return new string('a', s.Length + 1);
}
var res = s.ToCharArray();
var pos = res.Length - 1;
do {
if (res[pos] != 'z') {
res[pos]++;
break;
}
res[pos--] = 'a';
} while (true);
return new string(res);
}
The idea is simple: pretend that letters are your digits, and do an increment the way they teach in an elementary school. Start from the rightmost "digit", and increment it. If you hit a nine (which is 'z' in our system), move on to the prior digit; otherwise, you are done incrementing.
The obvious special case is when the "number" is composed entirely of nines. This is when your "counter" needs to roll to the next size up, and add a "digit". This special condition is checked at the beginning of the method: if the string is composed of N letters 'z', a string of N+1 letter 'a's is returned.
Here is a link to a quick demonstration of this code on ideone.
Each iteration of your for loop completely overwrites what is in "item"; the loop just assigns one character "i" at a time.
If item is a string, use something like this:
item = "";
for (int i = 97; i <= 122; i++)
{
item += (char)i;
}
Something to the effect of:
public string IncrementString(string value)
{
if (string.IsNullOrEmpty(value)) return "a";
var last = value.Last();
if (last.ToByte() == 122) // 122 is 'z'
return value + "a";
// Replace the last character with its successor
return value.Substring(0, value.Length - 1) + (char)(last.ToByte() + 1);
}
You'll probably need to convert the char to a byte. That can be encapsulated in an extension method like static int ToByte(this char c);
StringBuilder is a better choice when building large numbers of strings, so you may want to consider using that instead of string concatenation.
Another way to look at this is that you want to count in base 26. The computer is very good at counting and since it always has to convert from base 2 (binary), which is the way it stores values, to base 10 (decimal--the number system you and I generally think in), converting to different number bases is also very easy.
There's a general base converter here https://stackoverflow.com/a/3265796/351385 which converts an array of bytes to an arbitrary base. Once you have a good understanding of number bases and can understand that code, it's a simple matter to create a base 26 counter that counts in binary, but converts to base 26 for display.
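For the a, b, ..., z, aa, ab, ... sequence specifically, plain base 26 is not quite enough, because there is no zero digit; the sequence corresponds to what is usually called bijective base 26. The following is a small sketch under that assumption (the function name to_letters is made up, and it is not taken from the linked converter):

```python
def to_letters(n):
    """Map 1 -> 'a', 26 -> 'z', 27 -> 'aa', and so on (n must be >= 1)."""
    letters = []
    while n > 0:
        # Shift by one so the remainders 0..25 map onto 'a'..'z'.
        n, remainder = divmod(n - 1, 26)
        letters.append(chr(ord("a") + remainder))
    return "".join(reversed(letters))

# Count in ordinary integers, convert only for display.
print([to_letters(i) for i in range(1, 30)])
```

The counter itself stays an ordinary integer; only the display step deals with letters.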
This is the problem I'm solving (it's a sample problem, not a real problem):
Given N numbers , [N<=10^5] we need to count the total pairs of
numbers that have a difference of K. [K>0 and K<1e9]
Input Format: 1st line contains N & K (integers). 2nd line contains N
numbers of the set. All the N numbers are assured to be distinct.
Output Format: One integer saying the no of pairs of numbers that have
a diff K.
Sample Input #00:
5 2
1 5 3 4 2
Sample Output #00:
3
Sample Input #01:
10 1
363374326 364147530 61825163 1073065718 1281246024 1399469912 428047635 491595254 879792181 1069262793
Sample Output #01:
0
I already have a solution (and I haven't been able to optimize it as well as I had hoped). Currently my solution gets a score of 12/15 when it is run, and I'm wondering why I can't get 15/15 (my solution to another problem wasn't nearly as efficient, but got all of the points). Apparently, the code is run using "Mono 2.10.1, C# 4".
So can anyone think of a better way to optimize this further? The VS profiler says to avoid calling String.Split and Int32.Parse. The calls to Int32.Parse can't be avoided, although I guess I could optimize tokenizing the array.
My current solution:
using System;
using System.Collections.Generic;
using System.Text;
using System.Linq;
namespace KDifference
{
class Solution
{
static void Main(string[] args)
{
char[] space = { ' ' };
string[] NK = Console.ReadLine().Split(space);
int N = Int32.Parse(NK[0]), K = Int32.Parse(NK[1]);
int[] nums = Console.ReadLine().Split(space, N).Select(x => Int32.Parse(x)).OrderBy(x => x).ToArray();
int KHits = 0;
for (int i = nums.Length - 1, j, k; i >= 1; i--)
{
for (j = 0; j < i; j++)
{
k = nums[i] - nums[j];
if (k == K)
{
KHits++;
}
else if (k < K)
{
break;
}
}
}
Console.Write(KHits);
}
}
}
Your algorithm is still O(n^2), even with the sorting and the early-out. And even if you eliminated the O(n^2) bit, the sort is still O(n lg n). You can use an O(n) algorithm to solve this problem. Here's one way to do it:
Suppose the set you have is S1 = { 1, 7, 4, 6, 3 } and the difference is 2.
Construct the set S2 = { 1 + 2, 7 + 2, 4 + 2, 6 + 2, 3 + 2 } = { 3, 9, 6, 8, 5 }.
The answer you seek is the cardinality of the intersection of S1 and S2. The intersection is {6, 3}, which has two elements, so the answer is 2.
You can implement this solution in a single line of code, provided that you have a sequence of integers named sequence and an integer named difference:
int result = sequence.Intersect(from item in sequence select item + difference).Count();
The Intersect method will build an efficient hash table for you that is O(n) to determine the intersection.
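For comparison, the same set-intersection idea can be sketched in Python with built-in sets. The function name count_pairs_by_intersection is made up for this example; like the LINQ version, it relies on the numbers being distinct.

```python
def count_pairs_by_intersection(numbers, difference):
    originals = set(numbers)
    shifted = {x + difference for x in originals}  # S2 in the explanation above
    return len(originals & shifted)                # cardinality of the intersection

print(count_pairs_by_intersection([1, 7, 4, 6, 3], 2))  # 2
```

Building both sets and intersecting them takes expected linear time, since set membership is a hash lookup.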
Try this (note, untested):
1. Sort the array
2. Start two indexes at 0
3. If the difference between the numbers at those two positions is equal to K, increase the count, and increase one of the two indexes (if numbers aren't duplicated, increase both)
4. If the difference is larger than K, increase index #1
5. If the difference is less than K, increase index #2; if that would place it outside the array, you're done
6. Otherwise, go back to 3 and keep going
Basically, try to keep the two indexes apart by K value difference.
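Here is one way the two-index walk could look in Python. It is a sketch that assumes the numbers are distinct and K > 0, as the problem statement guarantees; the name count_pairs_two_pointers is made up, with lo playing the role of index #1 and hi the role of index #2.

```python
def count_pairs_two_pointers(numbers, k):
    nums = sorted(numbers)
    count = 0
    lo = hi = 0
    while hi < len(nums):
        diff = nums[hi] - nums[lo]
        if diff == k:
            count += 1
            lo += 1  # numbers are distinct, so both indexes can move on
            hi += 1
        elif diff > k:
            lo += 1  # gap too large: advance the lower index
        else:
            hi += 1  # gap too small: advance the upper index
    return count

print(count_pairs_two_pointers([1, 5, 3, 4, 2], 2))  # 3
```

The walk itself is linear, so the sort dominates at O(n log n).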
You should write up a series of unit-tests for your algorithm, and try to come up with edge cases.
This would allow you to do it in a single pass. Using hash sets is beneficial if there are many values to parse/check. You might also want to use a bloom filter in combination with hash sets to reduce lookups.
1. Initialize. Let A and B be two empty hash sets. Let c be zero.
2. Parse loop. Parse the next value v. If there are no more values the algorithm is done and the result is in c.
3. Back check. If v exists in A then increment c and jump back to 2.
4. Low match. If v - K > 0 then:
   - insert v - K into A
   - if v - K exists in B then increment c (and optionally remove v - K from B).
5. High match. If v + K < 1e9 then:
   - insert v + K into A
   - if v + K exists in B then increment c (and optionally remove v + K from B).
6. Remember. Insert v into B.
7. Jump back to 2.
// php solution for this k difference
function getEqualSumSubstring($l,$s) {
$s = str_replace(' ','',$s);
$l = str_replace(' ','',$l);
for($i=0;$i<strlen($s);$i++)
{
$array1[] = $s[$i];
}
for($i=0;$i<strlen($s);$i++)
{
$array2[] = $s[$i] + $l[1];
}
return count(array_intersect($array1,$array2));
}
echo getEqualSumSubstring("5 2","1 3 5 4 2");
Actually that's trivial to solve with a hashmap:
First put each number into a hashmap: dict((x, x) for x in numbers) in "pythony" pseudo code ;)
Now you just iterate through every number in the hashmap and check if number + K is in the hashmap. If yes, increase count by one.
The obvious improvement to the naive solution is to ONLY check for the higher (or lower) bound, otherwise you get the double results and have to divide by 2 afterwards - useless.
This is O(N) for creating the hashmap when reading the values in and O(N) when iterating through, i.e. O(N) overall, and about 8 lines of code in Python (and it is correct, I just solved it ;-) )
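Something along these lines, as a sketch rather than the exact code I submitted: it uses a set instead of a dict, takes the whole input as one string, and only checks the upper bound (x + K). The function name count_pairs is made up.

```python
def count_pairs(input_text):
    lines = input_text.strip().splitlines()
    n, k = map(int, lines[0].split())
    numbers = set(map(int, lines[1].split()[:n]))
    # Only look upwards (x + k), so every pair is counted exactly once.
    return sum(1 for x in numbers if x + k in numbers)

print(count_pairs("5 2\n1 5 3 4 2"))  # 3
```

In a real submission the input would come from standard input rather than a string argument.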
Following Eric's answer, I paste the implementation of the Intersect method below; it is O(n):
private static IEnumerable<TSource> IntersectIterator<TSource>(IEnumerable<TSource> first, IEnumerable<TSource> second, IEqualityComparer<TSource> comparer)
{
Set<TSource> set = new Set<TSource>(comparer);
foreach (TSource current in second)
{
set.Add(current);
}
foreach (TSource current2 in first)
{
if (set.Remove(current2))
{
yield return current2;
}
}
yield break;
}
The following class parses through a very large string (an entire novel of text) and breaks it into consecutive 4-character strings that are stored as a Tuple. Then each tuple can be assigned a probability based on a calculation. I am using this as part of a monte carlo/ genetic algorithm to train the program to recognize a language based on syntax only (just the character transitions).
I am wondering if there is a faster way of doing this. It takes about 400ms to look up the probability of any given 4-character tuple. The relevant method _Probability() is at the end of the class.
This is a computationally intensive problem related to another post of mine: Algorithm for computing the plausibility of a function / Monte Carlo Method
Ultimately I'd like to store these values in a 4d-matrix. But given that there are 26 letters in the alphabet that would be a HUGE task. (26x26x26x26). If I take only the first 15000 characters of the novel then performance improves a ton, but my data isn't as useful.
Here is the method that parses the text 'source':
private List<Tuple<char, char, char, char>> _Parse(string src)
{
var _map = new List<Tuple<char, char, char, char>>();
for (int i = 0; i < src.Length - 3; i++)
{
int j = i + 1;
int k = i + 2;
int l = i + 3;
_map.Add
(new Tuple<char, char, char, char>(src[i], src[j], src[k], src[l]));
}
return _map;
}
And here is the _Probability method:
private double _Probability(char x0, char x1, char x2, char x3)
{
var subset_x0 = map.Where(x => x.Item1 == x0);
var subset_x0_x1_following = subset_x0.Where(x => x.Item2 == x1);
var subset_x0_x2_following = subset_x0_x1_following.Where(x => x.Item3 == x2);
var subset_x0_x3_following = subset_x0_x2_following.Where(x => x.Item4 == x3);
int count_of_x0 = subset_x0.Count();
int count_of_x1_following = subset_x0_x1_following.Count();
int count_of_x2_following = subset_x0_x2_following.Count();
int count_of_x3_following = subset_x0_x3_following.Count();
decimal p1;
decimal p2;
decimal p3;
if (count_of_x0 <= 0 || count_of_x1_following <= 0 || count_of_x2_following <= 0 || count_of_x3_following <= 0)
{
p1 = e;
p2 = e;
p3 = e;
}
else
{
p1 = (decimal)count_of_x1_following / (decimal)count_of_x0;
p2 = (decimal)count_of_x2_following / (decimal)count_of_x1_following;
p3 = (decimal)count_of_x3_following / (decimal)count_of_x2_following;
p1 = (p1 * 100) + e;
p2 = (p2 * 100) + e;
p3 = (p3 * 100) + e;
}
//more calculations omitted
return _final;
}
}
EDIT - I'm providing more details to clear things up,
1) Strictly speaking I've only worked with English so far, but it's true that different alphabets will have to be considered. Currently I only want the program to recognize English, similar to what's described in this paper: http://www-stat.stanford.edu/~cgates/PERSI/papers/MCMCRev.pdf
2) I am calculating the probabilities of n-tuples of characters where n <= 4. For instance if I am calculating the total probability of the string "that", I would break it down into these independent tuples and calculate the probability of each individually first:
[t][h]
[t][h][a]
[t][h][a][t]
[t][h] is given the most weight, then [t][h][a], then [t][h][a][t]. Since I am not just looking at the 4-character tuple as a single unit, I wouldn't be able to just divide the instances of [t][h][a][t] in the text by the total no. of 4-tuples in the text.
The value assigned to each 4-tuple can't overfit to the text, because by chance many real English words may never appear in the text and they shouldn't get disproportionately low scores. Emphasizing first-order character transitions (2-tuples) ameliorates this issue. Moving to the 3-tuple and then the 4-tuple just refines the calculation.
I came up with a Dictionary that simply tallies the count of how often the tuple occurs in the text (similar to what Vilx suggested), rather than repeating identical tuples which is a waste of memory. That got me from about ~400ms per lookup to about ~40ms per, which is a pretty great improvement. I still have to look into some of the other suggestions, however.
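The tallying idea can be sketched as follows, in Python for brevity. This is not my actual code: count_ngrams and transition_probability are names invented for this sketch, the weighting of the 2-, 3- and 4-tuples is left out, and a prefix is counted even when it sits at the very end of the text with no character after it.

```python
from collections import Counter

def count_ngrams(text, max_n=4):
    """Tally every substring of length 1..max_n in a single table."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def transition_probability(counts, prefix, next_char):
    """Estimate P(next_char | prefix) from the tallies."""
    seen = counts[prefix]
    if seen == 0:
        return 0.0
    return counts[prefix + next_char] / seen

counts = count_ngrams("thatthat")
print(transition_probability(counts, "t", "h"))  # 0.5
```

Each lookup is then a couple of hash-table reads instead of a scan over the whole list of tuples.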
In your probability method you are iterating the map 8 times. Each of your Where calls iterates the entire list, and so does each Count. Adding a .ToList() at the end would (potentially) speed things up. That said, I think your main problem is that the structure you've chosen to store the data in is not suited to the probability method. You could create a one-pass version where the structure you store your data in calculates the tentative distribution on insert. That way, when you're done with the insert (which shouldn't be slowed down too much), you're done; or you could, as the code below does, have a cheap calculation of the probability when you need it.
As an aside, you might want to take punctuation and whitespace into account. The first letter or word of a sentence and the first letter of a word give a clear indication of what language a given text is written in; by taking punctuation characters and whitespace as part of your distribution, you include those characteristics of the sample data. We did that some years back. Doing that, we showed that using just three characters was almost as exact as using more (we tested up to 7), but the speed of three letters made that the best case. (We had no failures with three on our test data; "almost as exact" is an assumption, given that there must be some weird text where the lack of information would yield an incorrect result.)
EDIT
Here's an example of how I think I would do it in C#
class TextParser{
private Node Parse(string src){
var top = new Node(null);
for (int i = 0; i < src.Length - 3; i++){
var first = src[i];
var second = src[i+1];
var third = src[i+2];
var fourth = src[i+3];
var firstLevelNode = top.AddChild(first);
var secondLevelNode = firstLevelNode.AddChild(second);
var thirdLevelNode = secondLevelNode.AddChild(third);
thirdLevelNode.AddChild(fourth);
}
return top;
}
}
public class Node{
private readonly Node _parent;
private readonly Dictionary<char,Node> _children
= new Dictionary<char, Node>();
private int _count;
public Node(Node parent){
_parent = parent;
}
public Node AddChild(char value){
if (!_children.ContainsKey(value))
{
_children.Add(value, new Node(this));
}
var levelNode = _children[value];
levelNode._count++;
return levelNode;
}
public decimal Probability(string substring){
var node = this;
foreach (var c in substring){
if(!node.Contains(c))
return 0m;
node = node[c];
}
return ((decimal) node._count)/node._parent._children.Count;
}
public Node this[char value]{
get { return _children[value]; }
}
private bool Contains(char c){
return _children.ContainsKey(c);
}
}
the usage would then be:
var top = Parse(src);
top.Probability("test");
I would suggest changing the data structure to make that faster...
I think a Dictionary<char,Dictionary<char,Dictionary<char,Dictionary<char,double>>>> would be much more efficient since you would be accessing each "level" (Item1...Item4) when calculating... and you would cache the result in the innermost Dictionary so next time you don't have to calculate at all..
Ok, I don't have time to work out details, but this really calls for
neural classifier nets (just take any off the shelf; even the Controllable Regex Mutilator would do the job with way more scalability): heuristics over brute force
you could use tries (Patricia tries, a.k.a. radix trees) to make a space-optimized version of your data structure that can be sparse (the Dictionary of Dictionaries of Dictionaries of Dictionaries... looks like an approximation of this to me)
There's not much you can do with the parse function as it stands. However, the tuples appear to be four consecutive characters from a large body of text. Why not just replace the tuple with an int and then use the int to index the large body of text when you need the character values. Your tuple based method is effectively consuming four times the memory the original text would use, and since memory is usually the bottleneck to performance, it's best to use as little as possible.
You then try to find the number of matches in the body of text against a set of characters. I wonder how a straightforward linear search over the original body of text would compare with the linq statements you're using? The .Where will be doing memory allocation (which is a slow operation) and the linq statement will have parsing overhead (but the compiler might do something clever here). Having a good understanding of the search space will make it easier to find an optimal algorithm.
But then, as has been mentioned in the comments, using a 26^4 matrix would be the most efficient. Parse the input text once and create the matrix as you parse. You'd probably want a set of dictionaries:
SortedDictionary<int, int> count_of_single_letters; // key = single character
SortedDictionary<int, int> count_of_double_letters; // key = char1 + char2 * 32
SortedDictionary<int, int> count_of_triple_letters; // key = char1 + char2 * 32 + char3 * 32 * 32
SortedDictionary<int, int> count_of_quad_letters; // key = char1 + char2 * 32 + char3 * 32 * 32 + char4 * 32 * 32 * 32
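The key packing in those comments can be sketched in Python as follows. This is only an illustration under an assumption the answer does not spell out: each character is first mapped to a code below 32 (here lowercase a-z to 0..25), since multiplying raw character codes by 32 would let different n-grams collide. The names `code`, `pack` and `count_ngrams` are hypothetical.

```python
from collections import Counter


def code(ch: str) -> int:
    """Assumed mapping: 'a'..'z' -> 0..25, so every code fits in 5 bits."""
    return ord(ch) - ord("a")


def pack(gram: str) -> int:
    """key = char1 + char2 * 32 + char3 * 32 * 32 + ..."""
    return sum(code(ch) * 32 ** pos for pos, ch in enumerate(gram))


def count_ngrams(text: str, max_n: int = 4) -> list:
    """Parse the text once; counts[n - 1] maps packed key -> count of that n-gram."""
    counts = [Counter() for _ in range(max_n)]
    for i in range(len(text)):
        for n in range(1, max_n + 1):
            if i + n <= len(text):
                counts[n - 1][pack(text[i:i + n])] += 1
    return counts
```

One counter per n-gram length is needed because, for example, "a" and "aa" both pack to 0.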
Finally, a note on data types. You're using the decimal type. This is not an efficient type, as it has no direct mapping to a native CPU type and there is overhead in processing the data. Use a double instead; I think the precision will be sufficient. The most precise way would be to store the probability as two integers, the numerator and the denominator, and then do the division as late as possible.
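To illustrate the numerator/denominator idea, here is a small Python sketch using the standard `fractions.Fraction` type, which stores exactly such an integer pair. The function name `chain_probability` and the sample counts are made up for the example.

```python
from fractions import Fraction


def chain_probability(counts):
    """Multiply probabilities given as (occurrences, total) integer pairs, exactly."""
    p = Fraction(1)
    for occurrences, total in counts:
        p *= Fraction(occurrences, total)
    return p


# Hypothetical counts: 3 out of 7, then 7 out of 9.
exact = chain_probability([(3, 7), (7, 9)])
approx = float(exact)  # the division happens only here, at the very end
```

No rounding occurs until the final conversion to a floating-point value.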
The best approach here is to use sparse storage and to prune periodically, for example after every 10000 characters. The best storage structure in this case is a prefix tree: it allows fast calculation of probabilities, fast updating, and sparse storage. You can find more of the theory in this javadoc: http://alias-i.com/lingpipe/docs/api/com/aliasi/lm/NGramProcessLM.html
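A minimal Python sketch of such a prefix tree with counts is given below. It is not LingPipe's implementation, it leaves out pruning entirely, and the names `TrieNode`, `build_trie`, `count_of` and `probability` are hypothetical.

```python
class TrieNode:
    """One node per stored prefix; only prefixes that occur get a node (sparse)."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def build_trie(text: str, max_n: int = 4) -> TrieNode:
    """Insert every n-gram of length 1..max_n, counting each prefix on the way down."""
    root = TrieNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:i + max_n]:
            node = node.children.setdefault(ch, TrieNode())
            node.count += 1
    return root


def count_of(root: TrieNode, gram: str) -> int:
    """Number of times gram occurs in the text (0 if it never occurs)."""
    node = root
    for ch in gram:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


def probability(root: TrieNode, prefix: str, ch: str) -> float:
    """Simple estimate of P(ch | prefix) = count(prefix + ch) / count(prefix)."""
    total = count_of(root, prefix)
    return count_of(root, prefix + ch) / total if total else 0.0
```

Updating the counts for new text only touches the nodes along each inserted path, which is what makes incremental updates cheap.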
This is fairly 'math-y' but I'm posting here because it's a Project Euler problem, & I have working code that presumably has bugs in it.
The question "Determing longest repeating cycle in a decimal expansion" solves the problem using logarithms, but I'm interested in solving it with simple brute force. More accurately, I'm interested in understanding why my algorithm and code are not returning the correct solution.
The algorithm is simple:
replicate a 'long division',
at each step record the divisor and the remainder
when a divisor / remainder tuple is repeated, infer that the decimal representation will repeat.
Here are private fields, as requested
private int numerator;
private int recurrence;
private int result;
private int resultRecurrence;
private List<dynamic> digits;
and here is the code:
private void Go()
{
    foreach (var i in primes)
    {
        digits = new List<dynamic>();
        numerator = 1;
        recurrence = 0;
        while (numerator != 0)
        {
            numerator *= 10;
            // quotient
            var q = numerator / i;
            // remainder
            var r = numerator % i;
            digits.Add(new { Divisor = q, Remainder = r });
            // if we've found a repetition then break out
            var m = digits.Where(p => p.Divisor == q && p.Remainder == r).ToList();
            if (m.Count > 1)
            {
                recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
                break;
            }
            numerator = r;
        }
        if (recurrence > resultRecurrence)
        {
            resultRecurrence = recurrence;
            result = i;
        }
    }
}
When testing integers < 10 and < 20 I get the correct result, and I correctly identify the value of i as well. However, the decimal representation that I get is incorrect: I calculate i-1, whereas the correct result is far less (something like i-250).
So presumably I either have a programming bug - which I can't find - or a logic bug.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
edit
I'm not going to include my prime number code - it's not relevant, as I explain above I correctly identify the value of i (from memory it is 983) but I'm having problems getting the correct value for resultRecurrence.
I'm confused because it feels like a multiplicative group over p to me, in which there would be p-1 elements. I'm sure I'm missing something, can anyone provide suggestions?
Close.
For all primes except 2 and 5 (which divide 10), the sequence of remainders is formed by starting with 1 and transforming by
remainder = (10 * remainder) % prime
thus the k-th remainder is 10^k (mod prime) and the set of remainders forms a subgroup of the group of nonzero remainders modulo prime [1]. The length of the recurring cycle is the order of that subgroup, which is also known as the order of 10 modulo prime.
The order of the group of nonzero remainders modulo prime is prime-1, and there's a theorem by Lagrange:
Let G be a finite group of order g and H be a subgroup of G. Then the order h of H divides g.
So the length of the cycle is always a divisor of prime-1, and sometimes it's prime-1, e.g. for 7 or 19.
[1] For composite numbers n coprime to 10, that would be the group of remainders modulo n that are coprime to n.
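The argument can be checked numerically with a short Python sketch. The function name `order_of_10` and the list of sample primes are chosen here for illustration only; the function assumes its argument is at least 2 and coprime to 10, and raises otherwise.

```python
from math import gcd


def order_of_10(n: int) -> int:
    """Smallest k >= 1 with 10^k congruent to 1 (mod n); needs gcd(10, n) == 1."""
    if n < 2 or gcd(10, n) != 1:
        raise ValueError("n must be at least 2 and coprime to 10")
    k, r = 1, 10 % n
    while r != 1:
        r = (r * 10) % n  # remainder = (10 * remainder) % n
        k += 1
    return k


# Sample primes other than 2 and 5.
primes = [3, 7, 11, 13, 17, 19, 23, 29]

# The cycle length always divides prime - 1 ...
divides = all((p - 1) % order_of_10(p) == 0 for p in primes)

# ... and only sometimes equals prime - 1.
full_period = [p for p in primes if order_of_10(p) == p - 1]
```

For the composite case in footnote [1], the same function works for any n coprime to 10, for example 21.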
First off, you don’t need the divisors, you only need the remainders.
Secondly, I would split the function into multiple independent parts instead of having everything in one big method: The long division / finding of the cycle length is independent of the rest (= finding the longest cycle).
Your break on Where coupled with Count is unintuitive. Why not just use a while loop with the condition (!digits.Contains(r))? (This would require putting 0 into the digits list as a remainder before the loop starts.)
This leaves us with much cleaner code that should be straightforward to debug.
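One way this restructuring might look, sketched in Python rather than C#: a standalone function that tracks remainders only and is independent of the search for the longest cycle. The names `cycle_length` and `longest_cycle` are hypothetical, and a dictionary of first positions is used here in place of the list with Contains.

```python
def cycle_length(n: int) -> int:
    """Length of the recurring cycle of 1/n (0 if the expansion terminates)."""
    seen = {}            # remainder -> step at which it first appeared
    remainder = 1 % n
    step = 0
    while remainder != 0 and remainder not in seen:
        seen[remainder] = step
        remainder = (remainder * 10) % n
        step += 1
    if remainder == 0:
        return 0         # the long division ended, so nothing repeats
    return step - seen[remainder]


def longest_cycle(limit: int) -> int:
    """The d below limit whose 1/d has the longest recurring cycle."""
    return max(range(2, limit), key=cycle_length)
```

For limit 1000 this yields 983 with a cycle of length 982, that is, i-1 for that prime.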
recurrence = digits.LastIndexOf(m[0]) - digits.IndexOf(m[0]);
Surely the value of resultRecurrence is always going to be i-1? For a fraction of the form 1/n, the decimal starts repeating exactly when the division in progress (the ith digit) gives the same quotient and remainder as the very first trial division (1, hence i-1).
(as a side note, may I introduce you to Math.DivRem).
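Math.DivRem returns the quotient and the remainder of one division step together. The same idea, sketched in Python with the built-in `divmod` (the function name `decimal_digits` is hypothetical):

```python
def decimal_digits(n: int, count: int) -> list:
    """First `count` digits after the decimal point of 1/n, by long division."""
    digits = []
    remainder = 1
    for _ in range(count):
        # One step of long division: quotient digit and new remainder at once.
        digit, remainder = divmod(remainder * 10, n)
        digits.append(digit)
    return digits
```

For example, the first six digits of 1/7 come out as 1, 4, 2, 8, 5, 7.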