How to measure similarity of 2 strings aside from computing distance - c#

I am creating a program that checks if the word is simplified word(txt, msg, etc.) and if it is simplified it finds the correct spelling like txt=text, msg=message. Iam using the NHunspell Suggest Method in c# which suggest all possible results.
The problem is if I inputted "txt" the result is text,tat, tot, etc. I dont know how to select the correct word. I used Levenshtein Distance (C# - Compare String Similarity) but the results still results to 1.
Input: txt
Result: text = 1, ext = 1 tit = 1
Can you help me how to get the meaning or the correct spelling of the simplified words?
Example: msg

I have tested your input with your sample data and only text has a distance of 25 whereas the other have a distance of 33. Here's my code:
string input = "TXT";
string[] words = new[]{"text","tat","tot"};
var levenshtein = new Levenshtein();
const int maxDistance = 30;
var distanceGroups = words
.Select(w => new
{
Word = w,
Distance = levenshtein.iLD(w.ToUpperInvariant(), input)
})
.Where(x => x.Distance <= maxDistance)
.GroupBy(x => x.Distance)
.OrderBy(g => g.Key)
.ToList();
foreach (var topCandidate in distanceGroups.First())
Console.WriteLine("Word:{0} Distance:{1}", topCandidate.Word, topCandidate.Distance);
and here is the levenshtein class:
public class Levenshtein
{
///*****************************
/// Compute Levenshtein distance
/// Memory efficient version
///*****************************
public int iLD(String sRow, String sCol)
{
int RowLen = sRow.Length; // length of sRow
int ColLen = sCol.Length; // length of sCol
int RowIdx; // iterates through sRow
int ColIdx; // iterates through sCol
char Row_i; // ith character of sRow
char Col_j; // jth character of sCol
int cost; // cost
/// Test string length
if (Math.Max(sRow.Length, sCol.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.iLD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sRow.Length, sCol.Length) + "."));
// Step 1
if (RowLen == 0)
{
return ColLen;
}
if (ColLen == 0)
{
return RowLen;
}
/// Create the two vectors
int[] v0 = new int[RowLen + 1];
int[] v1 = new int[RowLen + 1];
int[] vTmp;
/// Step 2
/// Initialize the first vector
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
v0[RowIdx] = RowIdx;
}
// Step 3
/// Fore each column
for (ColIdx = 1; ColIdx <= ColLen; ColIdx++)
{
/// Set the 0'th element to the column number
v1[0] = ColIdx;
Col_j = sCol[ColIdx - 1];
// Step 4
/// Fore each row
for (RowIdx = 1; RowIdx <= RowLen; RowIdx++)
{
Row_i = sRow[RowIdx - 1];
// Step 5
if (Row_i == Col_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
/// Find minimum
int m_min = v0[RowIdx] + 1;
int b = v1[RowIdx - 1] + 1;
int c = v0[RowIdx - 1] + cost;
if (b < m_min)
{
m_min = b;
}
if (c < m_min)
{
m_min = c;
}
v1[RowIdx] = m_min;
}
/// Swap the vectors
vTmp = v0;
v0 = v1;
v1 = vTmp;
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totaly different
///
/// The vectors where swaped one last time at the end of the last loop,
/// that is why the result is now in v0 rather than in v1
//System.Console.WriteLine("iDist=" + v0[RowLen]);
int max = System.Math.Max(RowLen, ColLen);
return ((100 * v0[RowLen]) / max);
}
///*****************************
/// Compute the min
///*****************************
private int Minimum(int a, int b, int c)
{
int mi = a;
if (b < mi)
{
mi = b;
}
if (c < mi)
{
mi = c;
}
return mi;
}
///*****************************
/// Compute Levenshtein distance
///*****************************
public int LD(String sNew, String sOld)
{
int[,] matrix; // matrix
int sNewLen = sNew.Length; // length of sNew
int sOldLen = sOld.Length; // length of sOld
int sNewIdx; // iterates through sNew
int sOldIdx; // iterates through sOld
char sNew_i; // ith character of sNew
char sOld_j; // jth character of sOld
int cost; // cost
/// Test string length
if (Math.Max(sNew.Length, sOld.Length) > Math.Pow(2, 31))
throw (new Exception("\nMaximum string length in Levenshtein.LD is " + Math.Pow(2, 31) + ".\nYours is " + Math.Max(sNew.Length, sOld.Length) + "."));
// Step 1
if (sNewLen == 0)
{
return sOldLen;
}
if (sOldLen == 0)
{
return sNewLen;
}
matrix = new int[sNewLen + 1, sOldLen + 1];
// Step 2
for (sNewIdx = 0; sNewIdx <= sNewLen; sNewIdx++)
{
matrix[sNewIdx, 0] = sNewIdx;
}
for (sOldIdx = 0; sOldIdx <= sOldLen; sOldIdx++)
{
matrix[0, sOldIdx] = sOldIdx;
}
// Step 3
for (sNewIdx = 1; sNewIdx <= sNewLen; sNewIdx++)
{
sNew_i = sNew[sNewIdx - 1];
// Step 4
for (sOldIdx = 1; sOldIdx <= sOldLen; sOldIdx++)
{
sOld_j = sOld[sOldIdx - 1];
// Step 5
if (sNew_i == sOld_j)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
matrix[sNewIdx, sOldIdx] = Minimum(matrix[sNewIdx - 1, sOldIdx] + 1, matrix[sNewIdx, sOldIdx - 1] + 1, matrix[sNewIdx - 1, sOldIdx - 1] + cost);
}
}
// Step 7
/// Value between 0 - 100
/// 0==perfect match 100==totaly different
//System.Console.WriteLine("Dist=" + matrix[sNewLen, sOldLen]);
int max = System.Math.Max(sNewLen, sOldLen);
return (100 * matrix[sNewLen, sOldLen]) / max;
}
}

Not a complete solution, just a hopefully helpful suggestion...
It seems to me that people are unlikely to use simplifications that are as long as the correct word, so you could at least filter out all results whose length <= the input's length.

You really need to implement the SOUNDEX routine that exists in SQL. I've done that in the following code:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
namespace Soundex
{
class Program
{
static char[] ignoreChars = new char[] { 'a', 'e', 'h', 'i', 'o', 'u', 'w', 'y' };
static Dictionary<char, int> charVals = new Dictionary<char, int>()
{
{'b',1},
{'f',1},
{'p',1},
{'v',1},
{'c',2},
{'g',2},
{'j',2},
{'k',2},
{'q',2},
{'s',2},
{'x',2},
{'z',2},
{'d',3},
{'t',3},
{'l',4},
{'m',5},
{'n',5},
{'r',6}
};
static void Main(string[] args)
{
Console.WriteLine(Soundex("txt"));
Console.WriteLine(Soundex("text"));
Console.WriteLine(Soundex("ext"));
Console.WriteLine(Soundex("tit"));
Console.WriteLine(Soundex("Cammmppppbbbeeelll"));
}
static string Soundex(string s)
{
s = s.ToLower();
StringBuilder sb = new StringBuilder();
sb.Append(s.First());
foreach (var c in s.Substring(1))
{
if (ignoreChars.Contains(c)) { continue; }
// if the previous character yields the same integer then skip it
if ((int)char.GetNumericValue(sb[sb.Length - 1]) == charVals[c]) { continue; }
sb.Append(charVals[c]);
}
return string.Join("", sb.ToString().Take(4)).PadRight(4, '0');
}
}
}
See, with this code, the only match out of the examples you gave would be text. Run the console application and you'll see the output (i.e. txt would match text).

One method I think programs like word uses to correct spellings, is to use NLP (Natural Language Processing) techniques to get the order of Nouns/Adjectives used in the context of the spelling mistakes.. then comparing that to known sentence structures they can estimate 70% chance the spelling mistake was a noun and use that information to filter the corrected spellings.
SharpNLP looks like a good library but I haven't had a chance to fiddle with it yet. To build a library of known sentence structures BTW, in uni we applied our algorithms to public domain books.
check out sams simMetrics library I found on SO (download here, docs here) for loads more options for algorithms to use besides Levenshtein distance.

Expanding on my comment, you could use regex to search for a result that is an 'expansion' of the input. Something like this:
private int stringSimilarity(string input, string result)
{
string regexPattern = ""
foreach (char c in input)
regexPattern += input + ".*"
Match match = Regex.Match(result, regexPattern,
RegexOptions.IgnoreCase);
if (match.Success)
return 1;
else
return 0;
}
Ignore the 1 and the 0 - I don't know how similarity valuing works.

Related

How to build the heliocentric algorithm?

Given this problem:
Consider two of the planets in the orbital system: Earth and Mars.
Assume the Earth orbits the Sun in exactly 365 Earth days, and Mars
orbits the Sun in exactly 687 Earth days. Thus the Earth’s orbit
starts at day 0 and continues to day 364, and then starts over at day
0. Mars orbits similarly, but on a 687-day time scale.
We would like to find out how long it will take until both planets are
on day. 0 of their orbits simultaneously. Write a program that can
determine this.
Input Format:
The first line of input contains an integer N indicating the number of
test cases. N lines follow. Each test case contains two integers E and
M. These indicate which days Earth and Mars are at their respective
orbits.
Output Format:
For each case, display the case number followed by the smallest number
of days until the two planets will both be on day 0 of their orbits.
Follow the format of the sample output.
Sample Input 1
0 0
364 686
360 682
0 1
1 0
Sample Output 1
Case 1: 0
Case 2: 1
Case 3: 5
Case 4: 239075
Case 5: 11679
I tried solving the problem using modules but it doesn't seem correct
static string readInput;
static string firstStr = "";
static string secondStr = "";
static int firstInput;
static int secondInput;
static int testCases = 10;
static int caseNumber = 1;
static int outPut;
caseNumber <= testCases
static void Main(string[] args) {
//recall runProcess as long caseNumber is less or equal testCases
while (caseNumber <= testCases) {
runProcess();
Console.WriteLine("Case " + caseNumber + ": " + outPut);
caseNumber++;
}
}
Read input from console:
/// <summary>
/// This is the main process, is extracted to void so we can recall it.
/// </summary>
public static void runProcess() {
readInput = Console.ReadLine();
if (readInput != null) {
for (int i = 0; i < readInput.Length; i++) {
secondStr = secondStr + readInput[i];
if (readInput[i] == ' ') {
firstStr = secondStr;
secondStr = "";
continue;
}
}
}
firstInput = Convert.ToInt32(firstStr);
secondInput = Convert.ToInt32(secondStr);
outPut = atZero(firstInput, secondInput);
}
/// <summary>
/// This method takes the input data from the console to later determine the zero point
/// </summary>
/// <param name="earthDays"></param>
/// <param name="marsDays"></param>
/// <returns></returns>
public static int atZero(int earthDays, int marsDays) {
int earthOrbit = 365;
int marsOrbit = 687;
int modEarth = earthOrbit;
int modMars = marsOrbit;
int earthDistinction = earthOrbit - earthDays;
int marsDistinction = marsOrbit - marsDays;
if ((modInverse(earthDistinction, marsDistinction, modMars)) == 0) {
return (modInverse(marsDistinction, earthDistinction, modEarth)) * marsDistinction;
} else {
return (modInverse(earthDistinction, marsDistinction, modMars)) * earthDistinction;
}
}
mod invert
/// <summary>
/// The method below takes a denominator, numerator and a mod to later invert the mod.
/// </summary>
/// <param name="denominator"></param>
/// <param name="numerator"></param>
/// <param name="mod"></param>
/// <returns>modInverse</returns>
static int modInverse(int denominator, int numerator, int mod) {
int i = mod, outputAll = 0, d = numerator;
while (denominator > 0) {
int divided = i / denominator, x = denominator;
denominator = i % x;
i = x;
x = d;
d = outputAll - divided * x;
outputAll = x;
}
outputAll %= mod;
if (outputAll < 0) outputAll = (outputAll + mod) % mod;
return outputAll;
}
Is there any way to solve the problem without modules?
Thanks.
A straight forward way to calculate a solution could be this method:
private static int DaysTillBothAt0(int currentEarthDay, int currentMarsDay)
{
int result = 0, earth = currentEarthDay, mars = currentMarsDay;
while (earth != 0 || mars != 0)
{
result += 1;
earth = (earth + 1) % 365;
mars = (mars + 1) % 687;
}
return result;
}
This is of course not the fastest alogrithm or mathematically extraordinary elegant, but for the required data range performance doesn't matter here at all. (I don't know what your teacher expects, though).
It simply counts the orbits forward until they meet at 0.
You can use this for your test cases like this:
result = DaysTillBothAt0(0, 0); // 0
result = DaysTillBothAt0(364, 686); // 1
result = DaysTillBothAt0(360, 682); // 5
result = DaysTillBothAt0(0, 1); // 239075
result = DaysTillBothAt0(1, 0); // 11679
One more solution for this problem
private static int DaysTillBothAt0(int currentEarthday, int currentMarsday) {
int count = 365 - currentEarthday;
currentMarsday = (currentMarsday + count) % 687;
while (currentMarsday != 0) {
currentMarsday = (currentMarsday + 365) % 687;
count += 365;
}
return currentMarsday;
}

Match a sequence of bits in a number and then convert the match into zeroes?

My assignment is to search through the binary representation of a number and replace a matched pattern of another binary representation of a number. If I get a match, I convert the matching bits from the first integer into zeroes and move on.
For example the number 469 would be 111010101 and I have to match it with 5 (101). Here's the program I've written so far. Doesn't work as expected.
using System;
namespace Conductors
{
class Program
{
static void Main(string[] args)
{
//this is the number I'm searching for a match in
int binaryTicket = 469;
//This is the pattern I'm trying to match (101)
int binaryPerforator = 5;
string binaryTicket01 = Convert.ToString(binaryTicket, 2);
bool match = true;
//in a 32 bit integer, position 29 is the last one I would
//search in, since I'm searching for the next 3
for (int pos = 0; pos < 29; pos++)
{
for (int j = 0; j <= 3; j++)
{
var posInBinaryTicket = pos + j;
var posInPerforator = j;
int bitInBinaryTicket = (binaryTicket & (1 << posInBinaryTicket)) >> posInBinaryTicket;
int bitInPerforator = (binaryPerforator & (1 << posInPerforator)) >> posInPerforator;
if (bitInBinaryTicket != bitInPerforator)
{
match = false;
break;
}
else
{
//what would be the proper bitwise operator here?
bitInBinaryTicket = 0;
}
}
Console.WriteLine(binaryTicket01);
}
}
}
}
Few things:
Use uint for this. Makes things a hell of a lot easier when dealing with binary numbers.
You aren't really setting anything - you're simply storing information, which is why you're printing out the same number so often.
You should loop the x times where x = length of the binary string (not just 29). There's no need for inner loops
static void Main(string[] args)
{
//this is the number I'm searching for a match in
uint binaryTicket = 469;
//This is the pattern I'm trying to match (101)
uint binaryPerforator = 5;
var numBinaryDigits = Math.Ceiling(Math.Log(binaryTicket, 2));
for (var i = 0; i < numBinaryDigits; i++)
{
var perforatorShifted = binaryPerforator << i;
//We need to mask off the result (otherwise we fail for checking 101 -> 111)
//The mask will put 1s in each place the perforator is checking.
var perforDigits = (int)Math.Ceiling(Math.Log(perforatorShifted, 2));
uint mask = (uint)Math.Pow(2, perforDigits) - 1;
Console.WriteLine("Ticket:\t" + GetBinary(binaryTicket));
Console.WriteLine("Perfor:\t" + GetBinary(perforatorShifted));
Console.WriteLine("Mask :\t" + GetBinary(mask));
if ((binaryTicket & mask) == perforatorShifted)
{
Console.WriteLine("Match.");
//Imagine we have the case:
//Ticket:
//111010101
//Perforator:
//000000101
//Is a match. What binary operation can we do to 0-out the final 101?
//We need to AND it with
//111111010
//To get that value, we need to invert the perforatorShifted
//000000101
//XOR
//111111111
//EQUALS
//111111010
//Which would yield:
//111010101
//AND
//111110000
//Equals
//111010000
var flipped = perforatorShifted ^ ((uint)0xFFFFFFFF);
binaryTicket = binaryTicket & flipped;
}
}
string binaryTicket01 = Convert.ToString(binaryTicket, 2);
Console.WriteLine(binaryTicket01);
}
static string GetBinary(uint v)
{
return Convert.ToString(v, 2).PadLeft(32, '0');
}
Please read over the above code - if there's anything you don't understand, leave me a comment and I can run through it with you.

English Dictionary word matching from a string

I'm trying to get my head around a problem of identifying the best match of English words from a dictionary file to a given string.
For example ("lines" being a List of dictionary words):
string testStr = "cakeday";
for (int x= 0; x<= testStr.Length; x++)
{
string test = testStr.Substring(x);
if (test.Length > 0)
{
string test2 = testStr.Remove(counter);
int count = (from w in lines where w.Equals(test) || w.Equals(test2) select w).Count();
Console.WriteLine("Test: {0} / {1} : {2}", test, test2, count);
}
}
Gives the output:
Test: cakeday / : 0
Test: akeday / c : 1
Test: keday / ca : 0
Test: eday / cak : 0
Test: day / cake : 2
Test: ay / caked : 1
Test: y / cakeda : 1
Obviously "day / cake" is the best fit for the string however if I were to introduce a 3rd word into the string e.g "cakedaynow" it doesnt work so well.
I know the example is primitive, its more a proof of concept and was wondering if anyone had any experience with this type of string analysis?
Thanks!
You'll want to research the class of algorithms appropriate to what you're trying to do. Start with Approximate string matching on Wikipedia.
Also, here's a Levenshtein Edit Distance implementation in C# to get you started:
using System;
namespace StringMatching
{
/// <summary>
/// A class to extend the string type with a method to get Levenshtein Edit Distance.
/// </summary>
public static class LevenshteinDistanceStringExtension
{
/// <summary>
/// Get the Levenshtein Edit Distance.
/// </summary>
/// <param name="strA">The current string.</param>
/// <param name="strB">The string to determine the distance from.</param>
/// <returns>The Levenshtein Edit Distance.</returns>
public static int GetLevenshteinDistance(this string strA, string strB)
{
if (string.IsNullOrEmpty(strA) && string.IsNullOrEmpty(strB))
return 0;
if (string.IsNullOrEmpty(strA))
return strB.Length;
if (string.IsNullOrEmpty(strB))
return strA.Length;
int[,] deltas; // matrix
int lengthA;
int lengthB;
int indexA;
int indexB;
char charA;
char charB;
int cost; // cost
// Step 1
lengthA = strA.Length;
lengthB = strB.Length;
deltas = new int[lengthA + 1, lengthB + 1];
// Step 2
for (indexA = 0; indexA <= lengthA; indexA++)
{
deltas[indexA, 0] = indexA;
}
for (indexB = 0; indexB <= lengthB; indexB++)
{
deltas[0, indexB] = indexB;
}
// Step 3
for (indexA = 1; indexA <= lengthA; indexA++)
{
charA = strA[indexA - 1];
// Step 4
for (indexB = 1; indexB <= lengthB; indexB++)
{
charB = strB[indexB - 1];
// Step 5
if (charA == charB)
{
cost = 0;
}
else
{
cost = 1;
}
// Step 6
deltas[indexA, indexB] = Math.Min(deltas[indexA - 1, indexB] + 1, Math.Min(deltas[indexA, indexB - 1] + 1, deltas[indexA - 1, indexB - 1] + cost));
}
}
// Step 7
return deltas[lengthA, lengthB];
}
}
}
Why not:
Check all the strings inside the search word extracting from current search position to all possible lengths of the string and extract all discovered words. E.g.:
var list = new List<string>{"the", "me", "cat", "at", "theme"};
const string testStr = "themecat";
var words = new List<string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
for(int i = (len - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test) && !words.Contains(test))
{
words.Add(test);
}
}
}
words.ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, the, me, cat, at
Edit
Live Scenario 1:
For instance let's say you want to always choose the longest word in a jumbled sentence, you could read from front forward, reducing the amount of text read till you are through. Using a dictionary makes it much easier, by storing the indexes of the discovered words, we can quickly check to see if we have stored a word containing another word we are evaluating before.
Example:
var list = new List<string>{"the", "me", "cat", "at", "theme", "crying", "them"};
const string testStr = "themecatcryingthem";
var words = new Dictionary<int, string>();
var len = testStr.Length;
for (int x = 0; x < len; x++)
{
int n = len > 28 ? 28 : len;//assuming 28 is the maximum length of an english word
for(int i = (n - 1); i > x; i--)
{
string test = testStr.Substring(x, i - x + 1);
if (list.Contains(test))
{
if (!words.ContainsValue(test))
{
bool found = false;//to check if there's a shorter item starting from same index
var key = testStr.IndexOf(test, x, len - x);
foreach (var w in words)
{
if (w.Value.Contains(test) && w.Key != key && key == (w.Key + w.Value.Length - test.Length))
{
found = true;
}
}
if (!found && !words.ContainsKey(key)) words.Add(key, test);
}
}
}
}
words.Values.ToList().ForEach(n=> Console.WriteLine("{0}, ",n));//spit out current values
Output:
theme, cat, crying, them

How to calculate distance similarity measure of given 2 strings?

I need to calculate the similarity between 2 strings. So what exactly do I mean? Let me explain with an example:
The real word: hospital
Mistaken word: haspita
Now my aim is to determine how many characters I need to modify the mistaken word to obtain the real word. In this example, I need to modify 2 letters. So what would be the percent? I take the length of the real word always. So it becomes 2 / 8 = 25% so these 2 given string DSM is 75%.
How can I achieve this with performance being a key consideration?
I just addressed this exact same issue a few weeks ago. Since someone is asking now, I'll share the code. In my exhaustive tests my code is about 10x faster than the C# example on Wikipedia even when no maximum distance is supplied. When a maximum distance is supplied, this performance gain increases to 30x - 100x +. Note a couple key points for performance:
If you need to compare the same words over and over, first convert the words to arrays of integers. The Damerau-Levenshtein algorithm includes many >, <, == comparisons, and ints compare much faster than chars.
It includes a short-circuiting mechanism to quit if the distance exceeds a provided maximum
Use a rotating set of three arrays rather than a massive matrix as in all the implementations I've see elsewhere
Make sure your arrays slice accross the shorter word width.
Code (it works the exact same if you replace int[] with String in the parameter declarations:
/// <summary>
/// Computes the Damerau-Levenshtein Distance between two strings, represented as arrays of
/// integers, where each integer represents the code point of a character in the source string.
/// Includes an optional threshhold which can be used to indicate the maximum allowable distance.
/// </summary>
/// <param name="source">An array of the code points of the first string</param>
/// <param name="target">An array of the code points of the second string</param>
/// <param name="threshold">Maximum allowable distance</param>
/// <returns>Int.MaxValue if threshhold exceeded; otherwise the Damerau-Leveshteim distance between the strings</returns>
public static int DamerauLevenshteinDistance(int[] source, int[] target, int threshold) {
int length1 = source.Length;
int length2 = target.Length;
// Return trivial case - difference in string lengths exceeds threshhold
if (Math.Abs(length1 - length2) > threshold) { return int.MaxValue; }
// Ensure arrays [i] / length1 use shorter length
if (length1 > length2) {
Swap(ref target, ref source);
Swap(ref length1, ref length2);
}
int maxi = length1;
int maxj = length2;
int[] dCurrent = new int[maxi + 1];
int[] dMinus1 = new int[maxi + 1];
int[] dMinus2 = new int[maxi + 1];
int[] dSwap;
for (int i = 0; i <= maxi; i++) { dCurrent[i] = i; }
int jm1 = 0, im1 = 0, im2 = -1;
for (int j = 1; j <= maxj; j++) {
// Rotate
dSwap = dMinus2;
dMinus2 = dMinus1;
dMinus1 = dCurrent;
dCurrent = dSwap;
// Initialize
int minDistance = int.MaxValue;
dCurrent[0] = j;
im1 = 0;
im2 = -1;
for (int i = 1; i <= maxi; i++) {
int cost = source[im1] == target[jm1] ? 0 : 1;
int del = dCurrent[im1] + 1;
int ins = dMinus1[i] + 1;
int sub = dMinus1[im1] + cost;
//Fastest execution for min value of 3 integers
int min = (del > ins) ? (ins > sub ? sub : ins) : (del > sub ? sub : del);
if (i > 1 && j > 1 && source[im2] == target[jm1] && source[im1] == target[j - 2])
min = Math.Min(min, dMinus2[im2] + cost);
dCurrent[i] = min;
if (min < minDistance) { minDistance = min; }
im1++;
im2++;
}
jm1++;
if (minDistance > threshold) { return int.MaxValue; }
}
int result = dCurrent[maxi];
return (result > threshold) ? int.MaxValue : result;
}
Where Swap is:
static void Swap<T>(ref T arg1,ref T arg2) {
T temp = arg1;
arg1 = arg2;
arg2 = temp;
}
What you are looking for is called edit distance or Levenshtein distance. The wikipedia article explains how it is calculated, and has a nice piece of pseudocode at the bottom to help you code this algorithm in C# very easily.
Here's an implementation from the first site linked below:
private static int CalcLevenshteinDistance(string a, string b)
{
if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
return 0;
}
if (String.IsNullOrEmpty(a)) {
return b.Length;
}
if (String.IsNullOrEmpty(b)) {
return a.Length;
}
int lengthA = a.Length;
int lengthB = b.Length;
var distances = new int[lengthA + 1, lengthB + 1];
for (int i = 0; i <= lengthA; distances[i, 0] = i++);
for (int j = 0; j <= lengthB; distances[0, j] = j++);
for (int i = 1; i <= lengthA; i++)
for (int j = 1; j <= lengthB; j++)
{
int cost = b[j - 1] == a[i - 1] ? 0 : 1;
distances[i, j] = Math.Min
(
Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
distances[i - 1, j - 1] + cost
);
}
return distances[lengthA, lengthB];
}
There is a big number of string similarity distance algorithms that can be used. Some listed here (but not exhaustively listed are):
Levenstein
Needleman Wunch
Smith Waterman
Smith Waterman Gotoh
Jaro, Jaro Winkler
Jaccard Similarity
Euclidean Distance
Dice Similarity
Cosine Similarity
Monge Elkan
A library that contains implementation to all of these is called SimMetrics
which has both java and c# implementations.
I have found that Levenshtein and Jaro Winkler are great for small differences betwen strings such as:
Spelling mistakes; or
ö instead of o in a persons name.
However when comparing something like article titles where significant chunks of the text would be the same but with "noise" around the edges, Smith-Waterman-Gotoh has been fantastic:
compare these 2 titles (that are the same but worded differently from different sources):
An endonuclease from Escherichia coli that introduces single polynucleotide chain scissions in ultraviolet-irradiated DNA
Endonuclease III: An Endonuclease from Escherichia coli That Introduces Single Polynucleotide Chain Scissions in Ultraviolet-Irradiated DNA
This site that provides algorithm comparison of the strings shows:
Levenshtein: 81
Smith-Waterman Gotoh 94
Jaro Winkler 78
Jaro Winkler and Levenshtein are not as competent as Smith Waterman Gotoh in detecting the similarity. If we compare two titles that are not the same article, but have some matching text:
Fat metabolism in higher plants. The function of acyl thioesterases in the metabolism of acyl-coenzymes A and acyl-acyl carrier proteins
Fat metabolism in higher plants. The determination of acyl-acyl carrier protein and acyl coenzyme A in a complex lipid mixture
Jaro Winkler gives a false positive, but Smith Waterman Gotoh does not:
Levenshtein: 54
Smith-Waterman Gotoh 49
Jaro Winkler 89
As Anastasiosyal pointed out, SimMetrics has the java code for these algorithms. I had success using the SmithWatermanGotoh java code from SimMetrics.
Here is my implementation of Damerau Levenshtein Distance, which returns not only similarity coefficient, but also returns error locations in corrected word (this feature can be used in text editors). Also my implementation supports different weights of errors (substitution, deletion, insertion, transposition).
public static List<Mistake> OptimalStringAlignmentDistance(
string word, string correctedWord,
bool transposition = true,
int substitutionCost = 1,
int insertionCost = 1,
int deletionCost = 1,
int transpositionCost = 1)
{
int w_length = word.Length;
int cw_length = correctedWord.Length;
var d = new KeyValuePair<int, CharMistakeType>[w_length + 1, cw_length + 1];
var result = new List<Mistake>(Math.Max(w_length, cw_length));
if (w_length == 0)
{
for (int i = 0; i < cw_length; i++)
result.Add(new Mistake(i, CharMistakeType.Insertion));
return result;
}
for (int i = 0; i <= w_length; i++)
d[i, 0] = new KeyValuePair<int, CharMistakeType>(i, CharMistakeType.None);
for (int j = 0; j <= cw_length; j++)
d[0, j] = new KeyValuePair<int, CharMistakeType>(j, CharMistakeType.None);
for (int i = 1; i <= w_length; i++)
{
for (int j = 1; j <= cw_length; j++)
{
bool equal = correctedWord[j - 1] == word[i - 1];
int delCost = d[i - 1, j].Key + deletionCost;
int insCost = d[i, j - 1].Key + insertionCost;
int subCost = d[i - 1, j - 1].Key;
if (!equal)
subCost += substitutionCost;
int transCost = int.MaxValue;
if (transposition && i > 1 && j > 1 && word[i - 1] == correctedWord[j - 2] && word[i - 2] == correctedWord[j - 1])
{
transCost = d[i - 2, j - 2].Key;
if (!equal)
transCost += transpositionCost;
}
int min = delCost;
CharMistakeType mistakeType = CharMistakeType.Deletion;
if (insCost < min)
{
min = insCost;
mistakeType = CharMistakeType.Insertion;
}
if (subCost < min)
{
min = subCost;
mistakeType = equal ? CharMistakeType.None : CharMistakeType.Substitution;
}
if (transCost < min)
{
min = transCost;
mistakeType = CharMistakeType.Transposition;
}
d[i, j] = new KeyValuePair<int, CharMistakeType>(min, mistakeType);
}
}
int w_ind = w_length;
int cw_ind = cw_length;
while (w_ind >= 0 && cw_ind >= 0)
{
switch (d[w_ind, cw_ind].Value)
{
case CharMistakeType.None:
w_ind--;
cw_ind--;
break;
case CharMistakeType.Substitution:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Substitution));
w_ind--;
cw_ind--;
break;
case CharMistakeType.Deletion:
result.Add(new Mistake(cw_ind, CharMistakeType.Deletion));
w_ind--;
break;
case CharMistakeType.Insertion:
result.Add(new Mistake(cw_ind - 1, CharMistakeType.Insertion));
cw_ind--;
break;
case CharMistakeType.Transposition:
result.Add(new Mistake(cw_ind - 2, CharMistakeType.Transposition));
w_ind -= 2;
cw_ind -= 2;
break;
}
}
if (d[w_length, cw_length].Key > result.Count)
{
int delMistakesCount = d[w_length, cw_length].Key - result.Count;
for (int i = 0; i < delMistakesCount; i++)
result.Add(new Mistake(0, CharMistakeType.Deletion));
}
result.Reverse();
return result;
}
public struct Mistake
{
public int Position;
public CharMistakeType Type;
public Mistake(int position, CharMistakeType type)
{
Position = position;
Type = type;
}
public override string ToString()
{
return Position + ", " + Type;
}
}
public enum CharMistakeType
{
None,
Substitution,
Insertion,
Deletion,
Transposition
}
This code is a part of my project: Yandex-Linguistics.NET.
I wrote some tests and it's seems to me that method is working.
But comments and remarks are welcome.
Here is an alternative approach:
A typical method for finding similarity is Levenshtein distance, and there is no doubt a library with code available.
Unfortunately, this requires comparing to every string. You might be able to write a specialized version of the code to short-circuit the calculation if the distance is greater than some threshold, you would still have to do all the comparisons.
Another idea is to use some variant of trigrams or n-grams. These are sequences of n characters (or n words or n genomic sequences or n whatever). Keep a mapping of trigrams to strings and choose the ones that have the biggest overlap. A typical choice of n is "3", hence the name.
For instance, English would have these trigrams:
Eng
ngl
gli
lis
ish
And England would have:
Eng
ngl
gla
lan
and
Well, 2 out of 7 (or 4 out of 10) match. If this works for you, and you can index the trigram/string table and get a faster search.
You can also combine this with Levenshtein to reduce the set of comparison to those that have some minimum number of n-grams in common.
Here's a VB.net implementation:
Public Shared Function LevenshteinDistance(ByVal v1 As String, ByVal v2 As String) As Integer
Dim cost(v1.Length, v2.Length) As Integer
If v1.Length = 0 Then
Return v2.Length 'if string 1 is empty, the number of edits will be the insertion of all characters in string 2
ElseIf v2.Length = 0 Then
Return v1.Length 'if string 2 is empty, the number of edits will be the insertion of all characters in string 1
Else
'setup the base costs for inserting the correct characters
For v1Count As Integer = 0 To v1.Length
cost(v1Count, 0) = v1Count
Next v1Count
For v2Count As Integer = 0 To v2.Length
cost(0, v2Count) = v2Count
Next v2Count
'now work out the cheapest route to having the correct characters
For v1Count As Integer = 1 To v1.Length
For v2Count As Integer = 1 To v2.Length
'the first min term is the cost of editing the character in place (which will be the cost-to-date or the cost-to-date + 1 (depending on whether a change is required)
'the second min term is the cost of inserting the correct character into string 1 (cost-to-date + 1),
'the third min term is the cost of inserting the correct character into string 2 (cost-to-date + 1) and
cost(v1Count, v2Count) = Math.Min(
cost(v1Count - 1, v2Count - 1) + If(v1.Chars(v1Count - 1) = v2.Chars(v2Count - 1), 0, 1),
Math.Min(
cost(v1Count - 1, v2Count) + 1,
cost(v1Count, v2Count - 1) + 1
)
)
Next v2Count
Next v1Count
'the final result is the cheapest cost to get the two strings to match, which is the bottom right cell in the matrix
'in the event of strings being equal, this will be the result of zipping diagonally down the matrix (which will be square as the strings are the same length)
Return cost(v1.Length, v2.Length)
End If
End Function

How to convert a column number (e.g. 127) into an Excel column (e.g. AA)

How do you convert a numerical number to an Excel column name in C# without using automation getting the value directly from Excel.
Excel 2007 has a possible range of 1 to 16384, which is the number of columns that it supports. The resulting values should be in the form of excel column names, e.g. A, AA, AAA etc.
Here's how I do it:
private string GetExcelColumnName(int columnNumber)
{
string columnName = "";
while (columnNumber > 0)
{
int modulo = (columnNumber - 1) % 26;
columnName = Convert.ToChar('A' + modulo) + columnName;
columnNumber = (columnNumber - modulo) / 26;
}
return columnName;
}
If anyone needs to do this in Excel without VBA, here is a way:
=SUBSTITUTE(ADDRESS(1;colNum;4);"1";"")
where colNum is the column number
And in VBA:
Function GetColumnName(colNum As Integer) As String
Dim d As Integer
Dim m As Integer
Dim name As String
d = colNum
name = ""
Do While (d > 0)
m = (d - 1) Mod 26
name = Chr(65 + m) + name
d = Int((d - m) / 26)
Loop
GetColumnName = name
End Function
You might need conversion both ways, e.g from Excel column adress like AAZ to integer and from any integer to Excel. The two methods below will do just that. Assumes 1 based indexing, first element in your "arrays" are element number 1.
No limits on size here, so you can use adresses like ERROR and that would be column number 2613824 ...
public static string ColumnAdress(int col)
{
if (col <= 26) {
return Convert.ToChar(col + 64).ToString();
}
int div = col / 26;
int mod = col % 26;
if (mod == 0) {mod = 26;div--;}
return ColumnAdress(div) + ColumnAdress(mod);
}
public static int ColumnNumber(string colAdress)
{
int[] digits = new int[colAdress.Length];
for (int i = 0; i < colAdress.Length; ++i)
{
digits[i] = Convert.ToInt32(colAdress[i]) - 64;
}
int mul=1;int res=0;
for (int pos = digits.Length - 1; pos >= 0; --pos)
{
res += digits[pos] * mul;
mul *= 26;
}
return res;
}
Sorry, this is Python instead of C#, but at least the results are correct:
def ColIdxToXlName(idx):
if idx < 1:
raise ValueError("Index is too small")
result = ""
while True:
if idx > 26:
idx, r = divmod(idx - 1, 26)
result = chr(r + ord('A')) + result
else:
return chr(idx + ord('A') - 1) + result
for i in xrange(1, 1024):
print "%4d : %s" % (i, ColIdxToXlName(i))
I discovered an error in my first post, so I decided to sit down and do the the math. What I found is that the number system used to identify Excel columns is not a base 26 system, as another person posted. Consider the following in base 10. You can also do this with the letters of the alphabet.
Space:.........................S1, S2, S3 : S1, S2, S3
....................................0, 00, 000 :.. A, AA, AAA
....................................1, 01, 001 :.. B, AB, AAB
.................................... …, …, … :.. …, …, …
....................................9, 99, 999 :.. Z, ZZ, ZZZ
Total states in space: 10, 100, 1000 : 26, 676, 17576
Total States:...............1110................18278
Excel numbers columns in the individual alphabetical spaces using base 26. You can see that in general, the state space progression is a, a^2, a^3, … for some base a, and the total number of states is a + a^2 + a^3 + … .
Suppose you want to find the total number of states A in the first N spaces. The formula for doing so is A = (a)(a^N - 1 )/(a-1). This is important because we need to find the space N that corresponds to our index K. If I want to find out where K lies in the number system I need to replace A with K and solve for N. The solution is N = log{base a} (A (a-1)/a +1). If I use the example of a = 10 and K = 192, I know that N = 2.23804… . This tells me that K lies at the beginning of the third space since it is a little greater than two.
The next step is to find exactly how far in the current space we are. To find this, subtract from K the A generated using the floor of N. In this example, the floor of N is two. So, A = (10)(10^2 – 1)/(10-1) = 110, as is expected when you combine the states of the first two spaces. This needs to be subtracted from K because these first 110 states would have already been accounted for in the first two spaces. This leaves us with 82 states. So, in this number system, the representation of 192 in base 10 is 082.
The C# code using a base index of zero is
private string ExcelColumnIndexToName(int Index)
{
string range = string.Empty;
if (Index < 0 ) return range;
int a = 26;
int x = (int)Math.Floor(Math.Log((Index) * (a - 1) / a + 1, a));
Index -= (int)(Math.Pow(a, x) - 1) * a / (a - 1);
for (int i = x+1; Index + i > 0; i--)
{
range = ((char)(65 + Index % a)).ToString() + range;
Index /= a;
}
return range;
}
//Old Post
A zero-based solution in C#.
private string ExcelColumnIndexToName(int Index)
{
string range = "";
if (Index < 0 ) return range;
for(int i=1;Index + i > 0;i=0)
{
range = ((char)(65 + Index % 26)).ToString() + range;
Index /= 26;
}
if (range.Length > 1) range = ((char)((int)range[0] - 1)).ToString() + range.Substring(1);
return range;
}
This answer is in javaScript:
function getCharFromNumber(columnNumber){
var dividend = columnNumber;
var columnName = "";
var modulo;
while (dividend > 0)
{
modulo = (dividend - 1) % 26;
columnName = String.fromCharCode(65 + modulo).toString() + columnName;
dividend = parseInt((dividend - modulo) / 26);
}
return columnName;
}
Easy with recursion.
public static string GetStandardExcelColumnName(int columnNumberOneBased)
{
int baseValue = Convert.ToInt32('A');
int columnNumberZeroBased = columnNumberOneBased - 1;
string ret = "";
if (columnNumberOneBased > 26)
{
ret = GetStandardExcelColumnName(columnNumberZeroBased / 26) ;
}
return ret + Convert.ToChar(baseValue + (columnNumberZeroBased % 26) );
}
I'm surprised all of the solutions so far contain either iteration or recursion.
Here's my solution that runs in constant time (no loops). This solution works for all possible Excel columns and checks that the input can be turned into an Excel column. Possible columns are in the range [A, XFD] or [1, 16384]. (This is dependent on your version of Excel)
private static string Turn(uint col)
{
if (col < 1 || col > 16384) //Excel columns are one-based (one = 'A')
throw new ArgumentException("col must be >= 1 and <= 16384");
if (col <= 26) //one character
return ((char)(col + 'A' - 1)).ToString();
else if (col <= 702) //two characters
{
char firstChar = (char)((int)((col - 1) / 26) + 'A' - 1);
char secondChar = (char)(col % 26 + 'A' - 1);
if (secondChar == '#') //Excel is one-based, but modulo operations are zero-based
secondChar = 'Z'; //convert one-based to zero-based
return string.Format("{0}{1}", firstChar, secondChar);
}
else //three characters
{
char firstChar = (char)((int)((col - 1) / 702) + 'A' - 1);
char secondChar = (char)((col - 1) / 26 % 26 + 'A' - 1);
char thirdChar = (char)(col % 26 + 'A' - 1);
if (thirdChar == '#') //Excel is one-based, but modulo operations are zero-based
thirdChar = 'Z'; //convert one-based to zero-based
return string.Format("{0}{1}{2}", firstChar, secondChar, thirdChar);
}
}
Same implementation in Java
public String getExcelColumnName (int columnNumber)
{
int dividend = columnNumber;
int i;
String columnName = "";
int modulo;
while (dividend > 0)
{
modulo = (dividend - 1) % 26;
i = 65 + modulo;
columnName = new Character((char)i).toString() + columnName;
dividend = (int)((dividend - modulo) / 26);
}
return columnName;
}
int nCol = 127;
string sChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string sCol = "";
while (nCol >= 26)
{
int nChar = nCol % 26;
nCol = (nCol - nChar) / 26;
// You could do some trick with using nChar as offset from 'A', but I am lazy to do it right now.
sCol = sChars[nChar] + sCol;
}
sCol = sChars[nCol] + sCol;
Update: Peter's comment is right. That's what I get for writing code in the browser. :-) My solution was not compiling, it was missing the left-most letter and it was building the string in reverse order - all now fixed.
Bugs aside, the algorithm is basically converting a number from base 10 to base 26.
Update 2: Joel Coehoorn is right - the code above will return AB for 27. If it was real base 26 number, AA would be equal to A and the next number after Z would be BA.
int nCol = 127;
string sChars = "0ABCDEFGHIJKLMNOPQRSTUVWXYZ";
string sCol = "";
while (nCol > 26)
{
int nChar = nCol % 26;
if (nChar == 0)
nChar = 26;
nCol = (nCol - nChar) / 26;
sCol = sChars[nChar] + sCol;
}
if (nCol != 0)
sCol = sChars[nCol] + sCol;
..And converted to php:
function GetExcelColumnName($columnNumber) {
$columnName = '';
while ($columnNumber > 0) {
$modulo = ($columnNumber - 1) % 26;
$columnName = chr(65 + $modulo) . $columnName;
$columnNumber = (int)(($columnNumber - $modulo) / 26);
}
return $columnName;
}
Just throwing in a simple two-line C# implementation using recursion, because all the answers here seem far more complicated than necessary.
/// <summary>
/// Gets the column letter(s) corresponding to the given column number.
/// </summary>
/// <param name="column">The one-based column index. Must be greater than zero.</param>
/// <returns>The desired column letter, or an empty string if the column number was invalid.</returns>
public static string GetColumnLetter(int column) {
if (column < 1) return String.Empty;
return GetColumnLetter((column - 1) / 26) + (char)('A' + (column - 1) % 26);
}
Although there are already a bunch of valid answers1, none get into the theory behind it.
Excel column names are bijective base-26 representations of their number. This is quite different than an ordinary base 26 (there is no leading zero), and I really recommend reading the Wikipedia entry to grasp the differences. For example, the decimal value 702 (decomposed in 26*26 + 26) is represented in "ordinary" base 26 by 110 (i.e. 1x26^2 + 1x26^1 + 0x26^0) and in bijective base-26 by ZZ (i.e. 26x26^1 + 26x26^0).
Differences aside, bijective numeration is a positional notation, and as such we can perform conversions using an iterative (or recursive) algorithm which on each iteration finds the digit of the next position (similarly to an ordinary base conversion algorithm).
The general formula to get the digit at the last position (the one indexed 0) of the bijective base-k representation of a decimal number m is (f being the ceiling function minus 1):
m - (f(m / k) * k)
The digit at the next position (i.e. the one indexed 1) is found by applying the same formula to the result of f(m / k). We know that for the last digit (i.e. the one with the highest index) f(m / k) is 0.
This forms the basis for an iteration that finds each successive digit in bijective base-k of a decimal number. In pseudo-code it would look like this (digit() maps a decimal integer to its representation in the bijective base -- e.g. digit(1) would return A in bijective base-26):
fun conv(m)
q = f(m / k)
a = m - (q * k)
if (q == 0)
return digit(a)
else
return conv(q) + digit(a);
So we can translate this to C#2 to get a generic3 "conversion to bijective base-k" ToBijective() routine:
class BijectiveNumeration {
private int baseK;
private Func<int, char> getDigit;
public BijectiveNumeration(int baseK, Func<int, char> getDigit) {
this.baseK = baseK;
this.getDigit = getDigit;
}
public string ToBijective(double decimalValue) {
double q = f(decimalValue / baseK);
double a = decimalValue - (q * baseK);
return ((q > 0) ? ToBijective(q) : "") + getDigit((int)a);
}
private static double f(double i) {
return (Math.Ceiling(i) - 1);
}
}
Now for conversion to bijective base-26 (our "Excel column name" use case):
static void Main(string[] args)
{
BijectiveNumeration bijBase26 = new BijectiveNumeration(
26,
(value) => Convert.ToChar('A' + (value - 1))
);
Console.WriteLine(bijBase26.ToBijective(1)); // prints "A"
Console.WriteLine(bijBase26.ToBijective(26)); // prints "Z"
Console.WriteLine(bijBase26.ToBijective(27)); // prints "AA"
Console.WriteLine(bijBase26.ToBijective(702)); // prints "ZZ"
Console.WriteLine(bijBase26.ToBijective(16384)); // prints "XFD"
}
Excel's maximum column index is 16384 / XFD, but this code will convert any positive number.
As an added bonus, we can now easily convert to any bijective base. For example for bijective base-10:
static void Main(string[] args)
{
BijectiveNumeration bijBase10 = new BijectiveNumeration(
10,
(value) => value < 10 ? Convert.ToChar('0'+value) : 'A'
);
Console.WriteLine(bijBase10.ToBijective(1)); // prints "1"
Console.WriteLine(bijBase10.ToBijective(10)); // prints "A"
Console.WriteLine(bijBase10.ToBijective(123)); // prints "123"
Console.WriteLine(bijBase10.ToBijective(20)); // prints "1A"
Console.WriteLine(bijBase10.ToBijective(100)); // prints "9A"
Console.WriteLine(bijBase10.ToBijective(101)); // prints "A1"
Console.WriteLine(bijBase10.ToBijective(2010)); // prints "19AA"
}
1 This generic answer can eventually be reduced to the other, correct, specific answers, but I find it hard to fully grasp the logic of the solutions without the formal theory behind bijective numeration in general. It also proves its correctness nicely. Additionally, several similar questions link back to this one, some being language-agnostic or more generic. That's why I thought the addition of this answer was warranted, and that this question was a good place to put it.
2 C# disclaimer: I implemented an example in C# because this is what is asked here, but I have never learned nor used the language. I have verified it does compile and run, but please adapt it to fit the language best practices / general conventions, if necessary.
3 This example only aims to be correct and understandable ; it could and should be optimized would performance matter (e.g. with tail-recursion -- but that seems to require trampolining in C#), and made safer (e.g. by validating parameters).
I wanted to throw in my static class I use, for interoping between col index and col Label. I use a modified accepted answer for my ColumnLabel Method
public static class Extensions
{
public static string ColumnLabel(this int col)
{
var dividend = col;
var columnLabel = string.Empty;
int modulo;
while (dividend > 0)
{
modulo = (dividend - 1) % 26;
columnLabel = Convert.ToChar(65 + modulo).ToString() + columnLabel;
dividend = (int)((dividend - modulo) / 26);
}
return columnLabel;
}
public static int ColumnIndex(this string colLabel)
{
// "AD" (1 * 26^1) + (4 * 26^0) ...
var colIndex = 0;
for(int ind = 0, pow = colLabel.Count()-1; ind < colLabel.Count(); ++ind, --pow)
{
var cVal = Convert.ToInt32(colLabel[ind]) - 64; //col A is index 1
colIndex += cVal * ((int)Math.Pow(26, pow));
}
return colIndex;
}
}
Use this like...
30.ColumnLabel(); // "AD"
"AD".ColumnIndex(); // 30
private String getColumn(int c) {
String s = "";
do {
s = (char)('A' + (c % 26)) + s;
c /= 26;
} while (c-- > 0);
return s;
}
Its not exactly base 26, there is no 0 in the system. If there was, 'Z' would be followed by 'BA' not by 'AA'.
if you just want it for a cell formula without code, here's a formula for it:
IF(COLUMN()>=26,CHAR(ROUND(COLUMN()/26,1)+64)&CHAR(MOD(COLUMN(),26)+64),CHAR(COLUMN()+64))
In Delphi (Pascal):
function GetExcelColumnName(columnNumber: integer): string;
var
dividend, modulo: integer;
begin
Result := '';
dividend := columnNumber;
while dividend > 0 do begin
modulo := (dividend - 1) mod 26;
Result := Chr(65 + modulo) + Result;
dividend := (dividend - modulo) div 26;
end;
end;
A little late to the game, but here's the code I use (in C#):
private static readonly string _Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
public static int ColumnNameParse(string value)
{
// assumes value.Length is [1,3]
// assumes value is uppercase
var digits = value.PadLeft(3).Select(x => _Alphabet.IndexOf(x));
return digits.Aggregate(0, (current, index) => (current * 26) + (index + 1));
}
In perl, for an input of 1 (A), 27 (AA), etc.
sub excel_colname {
my ($idx) = #_; # one-based column number
--$idx; # zero-based column index
my $name = "";
while ($idx >= 0) {
$name .= chr(ord("A") + ($idx % 26));
$idx = int($idx / 26) - 1;
}
return scalar reverse $name;
}
Though I am late to the game, Graham's answer is far from being optimal. Particularly, you don't have to use the modulo, call ToString() and apply (int) cast. Considering that in most cases in C# world you would start numbering from 0, here is my revision:
public static string GetColumnName(int index) // zero-based
{
const byte BASE = 'Z' - 'A' + 1;
string name = String.Empty;
do
{
name = Convert.ToChar('A' + index % BASE) + name;
index = index / BASE - 1;
}
while (index >= 0);
return name;
}
More than 30 solutions already, but here's my one-line C# solution...
public string IntToExcelColumn(int i)
{
return ((i<16926? "" : ((char)((((i/26)-1)%26)+65)).ToString()) + (i<2730? "" : ((char)((((i/26)-1)%26)+65)).ToString()) + (i<26? "" : ((char)((((i/26)-1)%26)+65)).ToString()) + ((char)((i%26)+65)));
}
After looking at all the supplied Versions here, I decided to do one myself, using recursion.
Here is my vb.net Version:
Function CL(ByVal x As Integer) As String
If x >= 1 And x <= 26 Then
CL = Chr(x + 64)
Else
CL = CL((x - x Mod 26) / 26) & Chr((x Mod 26) + 1 + 64)
End If
End Function
Refining the original solution (in C#):
public static class ExcelHelper
{
private static Dictionary<UInt16, String> l_DictionaryOfColumns;
public static ExcelHelper() {
l_DictionaryOfColumns = new Dictionary<ushort, string>(256);
}
public static String GetExcelColumnName(UInt16 l_Column)
{
UInt16 l_ColumnCopy = l_Column;
String l_Chars = "0ABCDEFGHIJKLMNOPQRSTUVWXYZ";
String l_rVal = "";
UInt16 l_Char;
if (l_DictionaryOfColumns.ContainsKey(l_Column) == true)
{
l_rVal = l_DictionaryOfColumns[l_Column];
}
else
{
while (l_ColumnCopy > 26)
{
l_Char = l_ColumnCopy % 26;
if (l_Char == 0)
l_Char = 26;
l_ColumnCopy = (l_ColumnCopy - l_Char) / 26;
l_rVal = l_Chars[l_Char] + l_rVal;
}
if (l_ColumnCopy != 0)
l_rVal = l_Chars[l_ColumnCopy] + l_rVal;
l_DictionaryOfColumns.ContainsKey(l_Column) = l_rVal;
}
return l_rVal;
}
}
Here is an Actionscript version:
private var columnNumbers:Array = ['A', 'B', 'C', 'D', 'E', 'F' , 'G', 'H', 'I', 'J', 'K' ,'L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z'];
private function getExcelColumnName(columnNumber:int) : String{
var dividend:int = columnNumber;
var columnName:String = "";
var modulo:int;
while (dividend > 0)
{
modulo = (dividend - 1) % 26;
columnName = columnNumbers[modulo] + columnName;
dividend = int((dividend - modulo) / 26);
}
return columnName;
}
JavaScript Solution
/**
* Calculate the column letter abbreviation from a 1 based index
* #param {Number} value
* #returns {string}
*/
getColumnFromIndex = function (value) {
var base = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.split('');
var remainder, result = "";
do {
remainder = value % 26;
result = base[(remainder || 26) - 1] + result;
value = Math.floor(value / 26);
} while (value > 0);
return result;
};
These my codes to convert specific number (index start from 1) to Excel Column.
public static string NumberToExcelColumn(uint number)
{
uint originalNumber = number;
uint numChars = 1;
while (Math.Pow(26, numChars) < number)
{
numChars++;
if (Math.Pow(26, numChars) + 26 >= number)
{
break;
}
}
string toRet = "";
uint lastValue = 0;
do
{
number -= lastValue;
double powerVal = Math.Pow(26, numChars - 1);
byte thisCharIdx = (byte)Math.Truncate((columnNumber - 1) / powerVal);
lastValue = (int)powerVal * thisCharIdx;
if (numChars - 2 >= 0)
{
double powerVal_next = Math.Pow(26, numChars - 2);
byte thisCharIdx_next = (byte)Math.Truncate((columnNumber - lastValue - 1) / powerVal_next);
int lastValue_next = (int)Math.Pow(26, numChars - 2) * thisCharIdx_next;
if (thisCharIdx_next == 0 && lastValue_next == 0 && powerVal_next == 26)
{
thisCharIdx--;
lastValue = (int)powerVal * thisCharIdx;
}
}
toRet += (char)((byte)'A' + thisCharIdx + ((numChars > 1) ? -1 : 0));
numChars--;
} while (numChars > 0);
return toRet;
}
My Unit Test:
[TestMethod]
public void Test()
{
Assert.AreEqual("A", NumberToExcelColumn(1));
Assert.AreEqual("Z", NumberToExcelColumn(26));
Assert.AreEqual("AA", NumberToExcelColumn(27));
Assert.AreEqual("AO", NumberToExcelColumn(41));
Assert.AreEqual("AZ", NumberToExcelColumn(52));
Assert.AreEqual("BA", NumberToExcelColumn(53));
Assert.AreEqual("ZZ", NumberToExcelColumn(702));
Assert.AreEqual("AAA", NumberToExcelColumn(703));
Assert.AreEqual("ABC", NumberToExcelColumn(731));
Assert.AreEqual("ACQ", NumberToExcelColumn(771));
Assert.AreEqual("AYZ", NumberToExcelColumn(1352));
Assert.AreEqual("AZA", NumberToExcelColumn(1353));
Assert.AreEqual("AZB", NumberToExcelColumn(1354));
Assert.AreEqual("BAA", NumberToExcelColumn(1379));
Assert.AreEqual("CNU", NumberToExcelColumn(2413));
Assert.AreEqual("GCM", NumberToExcelColumn(4823));
Assert.AreEqual("MSR", NumberToExcelColumn(9300));
Assert.AreEqual("OMB", NumberToExcelColumn(10480));
Assert.AreEqual("ULV", NumberToExcelColumn(14530));
Assert.AreEqual("XFD", NumberToExcelColumn(16384));
}
Sorry, this is Python instead of C#, but at least the results are correct:
def excel_column_number_to_name(column_number):
output = ""
index = column_number-1
while index >= 0:
character = chr((index%26)+ord('A'))
output = output + character
index = index/26 - 1
return output[::-1]
for i in xrange(1, 1024):
print "%4d : %s" % (i, excel_column_number_to_name(i))
Passed these test cases:
Column Number: 494286 => ABCDZ
Column Number: 27 => AA
Column Number: 52 => AZ
For what it is worth, here is Graham's code in Powershell:
function ConvertTo-ExcelColumnID {
param (
[parameter(Position = 0,
HelpMessage = "A 1-based index to convert to an excel column ID. e.g. 2 => 'B', 29 => 'AC'",
Mandatory = $true)]
[int]$index
);
[string]$result = '';
if ($index -le 0 ) {
return $result;
}
while ($index -gt 0) {
[int]$modulo = ($index - 1) % 26;
$character = [char]($modulo + [int][char]'A');
$result = $character + $result;
[int]$index = ($index - $modulo) / 26;
}
return $result;
}
Another VBA way
Public Function GetColumnName(TargetCell As Range) As String
GetColumnName = Split(CStr(TargetCell.Cells(1, 1).Address), "$")(1)
End Function
Here's my super late implementation in PHP. This one's recursive. I wrote it just before I found this post. I wanted to see if others had solved this problem already...
public function GetColumn($intNumber, $strCol = null) {
if ($intNumber > 0) {
$intRem = ($intNumber - 1) % 26;
$strCol = $this->GetColumn(intval(($intNumber - $intRem) / 26), sprintf('%s%s', chr(65 + $intRem), $strCol));
}
return $strCol;
}

Categories