Optimizing counting characters within a string - c#

I just created a simple method to count occurences of each character within a string, without taking caps into account.
static List<int> charactercount(string input)
{
char[] characters = "abcdefghijklmnopqrstuvwxyz".ToCharArray();
input = input.ToLower();
List<int> counts = new List<int>();
foreach (char c in characters)
{
int count = 0;
foreach (char c2 in input) if (c2 == c)
{
count++;
}
counts.Add(count);
}
return counts;
}
Is there a cleaner way to do this (i.e. without creating a character array to hold every character in the alphabet) that would also take into account numbers, other characters, caps, etc?

Conceptually, I would prefer to return a Dictionary<string,int> of counts. I'll assume that it's ok to know by omission rather than an explicit count of 0 that a character occurs zero times, you can do it via LINQ. #Oded's given you a good start on how to do that. All you would need to do is replace the Select() with ToDictionary( k => k.Key, v => v.Count() ). See my comment on his answer about doing the case insensitive grouping. Note: you should decide if you care about cultural differences in characters or not and adjust the ToLower method accordingly.
You can also do this without LINQ;
public static Dictionary<string,int> CountCharacters(string input)
{
var counts = new Dictionary<char,int>(StringComparer.OrdinalIgnoreCase);
foreach (var c in input)
{
int count = 0;
if (counts.ContainsKey(c))
{
count = counts[c];
}
counts[c] = counts + 1;
}
return counts;
}
Note if you wanted a Dictionary<char,int>, you could easily do that by creating a case invariant character comparer and using that as the IEqualityComparer<T> for a dictionary of the required type. I've used string for simplicity in the example.
Again, adjust the type of the comparer to be consistent with how you want to handle culture.

Using GroupBy and Select:
aString.GroupBy(c => c).Select(g => new { Character = g.Key, Num = g.Count() })
The returned anonymous type list will contain each character and the number of times it appears in the string.
You can then filter it in any way you wish, using the static methods defined on Char.

Your code is kind of slow because you are looping through the range a-z instead of just looping through the input.
If you only need to count letters (like your code suggests), the fastest way to do it would be:
int[] CountCharacters(string text)
{
var counts = new int[26];
for (var i = 0; i < text.Length; i++)
{
var charIndex - text[index] - (int)'a';
counts[charIndex] = counts[charindex] + 1;
}
return counts;
}
Note that you need to add some thing like verify the character is in the range, and convert it to lowercase when needed, or this code might throw exceptions. I'll leave those for you to add. :)

Based on +Ran's answer to avoiding IndexOutOfRangeException:
static readonly int differ = 'a';
int[] CountCharacters(string text) {
text = text.ToLower();
var counts = new int[26];
for (var i = 0; i < text.Length; i++) {
var charIndex = text[i] - differ;
// to counting chars between 'a' and 'z' we have to do this:
if(charIndex >= 0 && charIndex < 26)
counts[charIndex] += 1;
}
return counts;
}
Actually using Dictionary and/or LINQ is not optimized enough as counting chars and working with a low level array.

Related

C# Type of String Index

I need to access a very large number in the index of the string which int and long can't handle. I had to use ulong but the problem is that the indexer can only handle the type int.
This is my code and I have marked the line where the error is located. Any ideas how to solve this?
string s = Console.ReadLine();
long n = Convert.ToInt64(Console.ReadLine());
var cont = s.Count(x => x == 'a');
Console.WriteLine(cont);
Console.ReadKey();
The main idea of the code is to identify how many 'a's there are in the string. What are some other ways I can do this?
EDIT:
i didn't know that is the string index Capicity cant exceed the int type. and i fixed my for loop by replacing it with this linq line
var cont = s.Count(x => x == 'a');
now since my string can't exceed certain amount. so how i can repeat my string to append its char for 1,000,000,000,000 times rather than using this code
for (int i = 0; i < 20; i++)
{
s += s;
}
since this code is generating random char numbers in the string and if i raised the 20 may cause to overflow so i need to adjust it to repeat itself to make the string[index] = n // the long i declared above.
so for example if my string input is "aba" and n is 10 so the string will be "abaabaabaa" // total chars 10
PS: I Edited the original code
I assume you got a programming assignment or online coding challenge, where the requirement was "Count all instances of the letter 'a' in this > 2 GB file". You solution is to read the file in memory at once, and loop over it with a variable type that allows values over 2GB.
This causes an XY problem. You cannot have an array that large in memory in the first place, so you're not going to reach the point where you need a uint, long or ulong to index into it.
Instead, use a StreamReader to read the file in chunks, as explained in for example Reading large file in chunks c#.
You can repeat your string using an infinite sequence. I haven't added any check for valid arguments, etc.
static void Main(string[] args)
{
long count = countCharacters("aba", 'a', 10);
Console.WriteLine("Count is {0}", count);
Console.WriteLine("Press ENTER to exit...");
Console.ReadLine();
}
private static long countCharacters(string baseString, char c, long limit)
{
long result = 0;
if (baseString.Length == 1)
{
result = baseString[0] == c ? limit : 0;
}
else
{
long n = 0;
foreach (var ch in getInfiniteSequence(baseString))
{
if (n >= limit)
break;
if (ch == c)
{
result++;
}
n++;
}
}
return result;
}
//This method iterates through a base string infinitely
private static IEnumerable<char> getInfiniteSequence(string baseString)
{
int stringIndex = 0;
while (true)
{
yield return baseString[stringIndex++ % baseString.Length];
}
}
For the given inputs, the result is 7
I highly recommend you rethink the way you are doing this, but a quick fix would be to use a foreach loop instead:
foreach(char c in s)
{
if (c == 'a')
cont++;
}
Alternative using Linq:
cont = s.Count(c => c == 'a');
I'm not sure about what n is supposed to do. According to your code it limits the string length but your question never mentions why or to what end.
i need to access a very large number in the index of the string which
int, long can't handle
this statement is not true
c# string's max length is int.Max since string.Length is an integer and it is limited by that. You should be able to do
for (int i = 0; i <= n; i++)
The maximum length of a string cannot exceed the size of an int so there really is no point in using ulong or long to index into the string.
Simply put, you're trying to solve the wrong problem.
If we disregard the fact that the program is likely to cause an out of memory exception when building such a long string, you can simply fix your code by switching to an int instead of a ulong:
for (int i = 0; i <= n; i++)
Having said that you can also use LINQ to do this:
int cont = s.Take(n + 1).Count(c => c == 'a');
Now, in the first sentence of your question you state this:
I need to access a very large number in the index of the string which int and long can't handle.
This is wholly unnecessary because any legal index of a string will fit inside an int.
If you need to do this on some input that's longer than the maximum length of a string in .NET, you'll need to change your approach; use a Stream instead trying to read all input into a string.
char seeking = 'a';
ulong count = 0;
char[] buffer = new char[4096];
using (var reader = new StreamReader(inStream))
{
int length;
while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
{
count += (ulong)buffer.Count(c => c == seeking);
}
}

Efficient Replace Characters in a string from one array for another

The specific problem I have is that I have to replace the numbers in chemical formulae with the equivalent Unicode subscripts, so H2SO4 => H₂SO₄. (Those subscripts are not font adjustments, they are special unicode characters.)
So my initial cut was:
return unit.Replace("2", "₂").
Replace("3", "₃").
Replace("4", "₄").
Replace("5", "₅").
Replace("6", "₆").
Replace("7", "₇");
Which works, but obviously isn't particularly efficient. Any suggestions for a more optimal algorithm?
There are only 10 possible subscript characters that need replacement and most chemical formulas are not too long. For this reason, I think your implementation is not horribly inefficient and I would suggest benchmarking your code before trying to optimize it.
But here's my attempt to create a method that does what you need:
public string ToSubscriptFormula(string input)
{
var characters = input.ToCharArray();
for (var i = 0; i < characters.Length; i++)
{
switch (characters[i])
{
case '2':
characters[i] = '₂';
break;
case '3':
characters[i] = '₃';
break;
// case statements omitted
}
}
return new string(characters);
}
I would recommend avoiding the use of StringBuilder unless you're appending a large amount of strings, as the overhead of creating an instance would actually make your code less efficient. See this post by Jon Skeet for a detailed explanation of when it should be used.
Also, given the limited number of case statements, I personally don't think using a Dictionary<char,char> would add any readability or performance benefit, but under different scenarios it might be useful to consider using one.
But if you really had to super-optimize your method, you could replace the case statement with the following code (thanks to andrew for the suggestion):
public string ToSubscriptFormula(string input)
{
var characters = input.ToCharArray();
const int distance = '₀' - '0'; // distance of subscript from digit
for (var i = 0; i < characters.Length; i++)
{
if(char.IsDigit(characters[i]))
{
characters[i] = (char) (characters[i] + distance);
}
}
return new string(characters);
}
The trick here is that all subscript characters are successive and that casting an int to char will give you the corresponding character.
Finally, as #nwellnhof has suggested in the comments, char.IsDigit() would return true for some non-latin digit characters in the Unicode Nd Category.
If your chemical formula contains such characters, the statement should be replaced with c >= '0' && c<='9'. This will probably be slightly faster than char.IsDigit but I'm not sure if it would make a difference in most practical scenarios.
I would be tempted to do something like this:
public string replace(string input)
{
StringBuilder sb = new StringBuilder();
Dictionary<char, char> map = new Dictionary<char, char>();
map.Add('2', '₂');
map.Add('3', '₃');
map.Add('4', '₄');
map.Add('5', '₅');
map.Add('6', '₆');
map.Add('7', '₇');
char tmp;
foreach(char c in input)
{
if (map.TryGetValue(c, out tmp))
sb.Append(tmp);
else
sb.Append(c);
}
return sb.ToString();
}
The Dictionary is defined inside the method here for simplicity, but should be defined somewhere else in scope.
So, very simply, iterate the input string only once. For every character, find the matching Dictionary entry if it exists, and append either that or the original character to a StringBuilder in order to avoid creating multiple string objects.
My first thought was what about formulae with balancing prefix numbers:
E.g. 2H₂(g) + O₂(g) → 2H₂O(g)
Presumably you don't want this to replace the leading numbers?
Also, I'm not sure why it is mentioned above that only 8 digits (or even only 6 digits) need replacement - aren't all digits required (0-9)? Sure, you don't have 0 and 1 by themselves, but you need them for, e.g., 10.
Anyway, notwithstanding the above (which I didn't attempt to implement since it wasn't the question), avoiding StringBuilder and operating on a char array seemed to make sense, and I preferred to avoid a large switch statement.
public class Program
{
public static void Main()
{
Console.WriteLine(SubscriptNums("C6H12O6"));
}
public static string SubscriptNums(string input)
{
char[] replacementChars = { '₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉' };
int zeroCharIndex = (int)'0';
char[] inputCharArray = input.ToCharArray();
for(int i = 0; i < inputCharArray.Length; i++)
{
if (inputCharArray[i] >= '0' && inputCharArray[i] <= '9')
{
inputCharArray[i] = replacementChars[(int)inputCharArray[i] - zeroCharIndex];
}
}
return new string(inputCharArray);
}
}
Edit 1 - removed magic number for numeric value of '0'.
Edit 2 - removed use of IsDigit.
You could iterate over the string and check each char. If it is to replace, append the according character to the StringBuilder. If not, just add the original character. This way, you only have to iterate over the string once, and not once for each replacement. Furthermore, as strings are immutable, each call of String.Replace() will create a new copy of the string for the result, which will immediately be GC'ed again.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < unit.Length; i++) {
switch(unit[i]) {
case '2': sb.Append('₂'); break;
case '3': sb.Append('₃'); break;
...
default: sb.Append(unit[i]); break;
}
}
output = sb.ToString();
You could also introduce some replacement dictionary, like Abdullah Nehir suggested
StringBuilder sb = new StringBuilder();
Dictionary<char, char> replacements = new Dictionary<char, char>();
//put in the pairs
for (int i = 0; i < unit.Length; i++) {
if (replacements.ContainsKey(unit[i]))
sb.Append(replacement[unit[i]];
else
sb.Append(unit[i]);
}
Instead of accessing the values via index, you can also iterate the string with a foreach loop
foreach (char c in unit) {
if (replacements.ContainsKey(c))
sb.Append(replacements[c]);
else
sb.Append(c);
}
If you were looking for some elegant code where you don't have to type string.Replace for each character, then this would help you:
public static string Replace(string input)
{
char[] inputCharArr = input.ToCharArray();
StringBuilder sb = new StringBuilder();
foreach (var c in inputCharArr)
{
int intC = (int)c;
//If the digit was a number ([0-9] are [48-57] in unicode),
//replace the old char with the new char
//(8272 when added to the unicode of [0-9] gives the desired result)
if (intC > 47 && intC < 58)
sb.Append((char)(intC + 8272));
else sb.Append(c);
}
return sb.ToString();
}
See the edit history if you wonder what the comments are talking about.

How to get index of all matching chars in char array?

I am working on a simple hangman program. I have most of the code working, but I am having trouble figuring out how to get the index of multiple matching chars in a char array. For example, I have a word "sushi" converted to a char array. If the user guesses an "s" then all "s" in the char array should be shown. The way that my code is, I actually have two char arrays of the same length. The first array holds the chars of the word to guess, while the second array holds question marks. My code should iterate through the first array and return the index of each element in the array that matches the user guess. Then, the code inserts the user guess at each specified index and displays the second array for the user. Unfortunately, only the first occurrence is changed in the second array, so only the first index is returned from the match query. The problematic code is as follows:
//check if char array contains the user input
if (guessThis.Contains(Convert.ToChar(textBox1.Text)))
{
//save user input as char
char userGuess = Convert.ToChar(textBox1.Text);
//iterate through first char array
for (int i = 0; i < guessThis.Length - 1; i++ )
{
//check each element in the array
//probably don't need both for and foreach loops
foreach (char c in guessThis)
{
//get index of any element that contains the userinput
var getIndex = Array.IndexOf(guessThis, c);
//check if the element matches the user guess
if (c == userGuess)
{
//insert the userguess into the index
displayAnswer[getIndex] = userGuess;
}
}
}
//update the display label
answerLabel.Text = new string(displayAnswer);
SOLVED:
Working off of the example in the selected answer, I updated my code as:
//check if char array contains the user input
if (guessThis.Contains(Convert.ToChar(textBox1.Text)))
{
//save user input as char
char userGuess = Convert.ToChar(textBox1.Text);
string maybeThis = textBox1.Text;
string tryThis = new string(guessThis);
foreach (Match m in Regex.Matches(tryThis, maybeThis))
{
displayAnswer[m.Index] = userGuess;
}
answerLabel.Text = new string(displayAnswer);
}
Try Regex.Matches to build regex expression and find all matches and their positions.
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = "a*";
string input = "abaabb";
foreach (Match m in Regex.Matches(input, pattern))
Console.WriteLine("'{0}' found at index {1}.",
m.Value, m.Index);
}
}
// The example displays the following output:
// 'a' found at index 0.
// '' found at index 1.
// 'aa' found at index 2.
// '' found at index 4.
// '' found at index 5.
// '' found at index 6.`
http://msdn.microsoft.com/en-gb/library/system.text.regularexpressions.regex.matches(v=vs.110).aspx
Because your:
var getIndex = Array.IndexOf(guessThis, c);
always return the first occurance of character.
Your second for loop and the getIndex is useless, a clearer code could be:
for (int i = 0; i < guessThis.Length - 1; i++ )
{
//check if the element matches the user guess
if (c == userGuess[i])
{
//insert the userguess into the index
displayAnswer[i] = userGuess;
}
}
Also, your code allows user to input multiple characters into the textBox1. I think in this kind of game, only one character should be guessed each time. So I suggest your confine your input.
I think the problem is with the method:
var getIndex = Array.IndexOf(guessThis, c);
because it return the first location.
You can just use the index you have in the for loop
Try Something like this:
char userGuess = Convert.ToChar(textBox1.Text);
char[] displayAnswer = answerLabel.Text.ToCharArray();
for (int n = 0; n < displayAnswer.Length; n++)
{
if (guessThis[n] == userGuess)
{
displayAnswer[n] = userGuess;
}
}
answerLabel.Text = new string(displayAnswer);
A simple approach
You could do something like this (made generic for clarity):
public static IEnumerable<int> AllIndexesOfAny<T>(this IList<T> list, IEnumerable<T> ofAny)
{
return Enumerable.Range(0, list.Count).Where(i => ofAny.Contains(list[i]));
}
If performance becomes an issue, you could replace the IEnumerable<T> ofAny with a HashSet.
Update
Just read your code more closely and realized you are indexing through guessThis, and for each character in guessThis, finding its index in guessThis and checking whether it matches thisGuess. This is unnecessary. The simplest non-generic way to find all character indices in guessThis matching userGuess is probably:
var matches = Enumerable.Range(0, guessThis.Length).Where(i => guessThis[i] == userGuess);
Additional note
By the way, it probably doesn't matter for your application, but some non-ASCII Unicode characters in .Net are actually represented by surrogate pairs of chars. (There are also diacritical combining characters that modify the preceding character.) In an "internationalized" hangman game you might want to handle surrogate pairs by converting them to UTF32 code points:
public static IEnumerable<KeyValuePair<int, int>> Utf32IndexedCodePoints(this string s, int index)
{
for (int length = s.Length; index < length; index++)
{
yield return new KeyValuePair<int, int>(index, char.ConvertToUtf32(s, index));
if (char.IsSurrogatePair(s, index))
index++;
}
}

Expand C# string with control characters

I need to expand the Control Characters (non-Printable) in a Bar Code Scan.
This is what I've got. At the moment I am hung up on how to get the integer index from the substring on my input string.
// A method for expanding ASCII Control Character into Human Readable format
string ExpandCtrlString(string inStr)
{
// String Builder Capacity may need to be lengthened...
StringBuilder outStr = new StringBuilder(128);
// 0 Based Array representing the expansion of ASCII Control Characters
string[] CtrlChars = new string[]
{
"<nul>",
"<soh>",
"<stx>",
"<etx>",
"<eot>",
"<enq>",
"<ack>",
"<bel>",
"<bs>",
"<tab>",
"<lf>",
"<vt>",
"<ff>",
"<cr>",
"<so>",
"<si>",
"<dle>",
"<dc1>",
"<dc2>",
"<dc3>",
"<dc4>",
"<nak>",
"<syn>",
"<etb>",
"<can>",
"<em>",
"<sub>",
"<esc>",
"<fs>",
"<gs>",
"<rs>",
"<us>"
};
for (int n = 0; n < inStr.Length; n++)
{
if (Char.IsControl(inStr, n))
{
//char q = inStr.Substring(n, 1);
int x = (int)inStr[n] ();
outStr.Append(CtrlChars[x]);
}
else
{
outStr.Append(inStr.Substring(n, 1));
}
}
return outStr.ToString();
}
Edited:
The thought came as I was musing...
I double cast the substring and it worked...
That is I casted the cast... :)
for (int n = 0; n < inStr.Length; n++)
{
if (Char.IsControl(inStr, n))
{
int x = (int)(char)inStr[n];
outStr.Append(CtrlChars[x]);
}
else
{
outStr.Append(inStr.Substring(n, 1));
}
}
return outStr.ToString();
And, it works in CF NET 2.0 in Windows Mobile 5
I had thought about using foreach, but, that presented other problems for an old VB6 guy. :)
I can see where you are are coming from in your approach, but I can see some flaws in what you are trying to do as well.
First of all .. I'm doing this without testing .. so buyer beware!
Secondly, from Microsoft's Char.IsControl Method (String, Int32), The Unicode standard assigns code points from \U0000 to \U001F, \U007F, and from \U0080 to \U009F to control characters. Your array doesn't cover all of these ranges, so potentially (depending on the source of your string) your code will actually blow up as you would be supplying an index that is outside the range of the array. This could be fixed by using a dictionary to map from a number to string, rather than the array.
dictionary<int, string> mapping = new dictionary<int, string>() {
{ 0, "<nul>" },
{ 1, "<soh>" },
.. etc filled with every possible control char
}
NOTE that this dictionary really should be defined/constructed in the class that holds your ExpandCtrlString method, and not in the method itself.
class expand
{
private dictionary<int, string> mapping …
public string ExpandCtrlString(string inStr)
{
...
}
}
And to get your char you can simply do:
for (int n = 0; n < inStr.Length; n++)
{
char ch = instr[n];
...
}
Or if you want to be more modern about it you can directly get the character from the string:
foreach(var ch in instr)
{
...
}
Then check if the char is a control char, and if so is it in the dictionary
if (Char.IsControl(ch))
{
// The value is not needed until you know its a control char
int val = (int)Char.GetNumericValue(ch);
if (mapping.ContainsKey(val))
{
// You know what the control char is, so output its name
outStr.Append(mapping[val]);
}
else
{
// You might have missed a control char in your dictionary of names
// So by default output its hex value instead
outStr.AppendFormat("<{0}>", val.ToString("X"));
}
}
else
{
// Char is not a control char
outStr.Append(ch);
}
Then finally return your outStr as you already do.
A char can be used like a number. Here's a simpler loop:
foreach (char c in inStr)
{
// Check to see if c (as a number here) is less than the CtrlChars array size
// If it is, then print the CtrlChars text
if (c < CtrlChars.Length)
{
outStr.Append(CtrlChars[c]);
}
else
{
outStr.Append(c);
}
}

Testing for repeated characters in a string

I'm doing some work with strings, and I have a scenario where I need to determine if a string (usually a small one < 10 characters) contains repeated characters.
`ABCDE` // does not contain repeats
`AABCD` // does contain repeats, ie A is repeated
I can loop through the string.ToCharArray() and test each character against every other character in the char[], but I feel like I am missing something obvious.... maybe I just need coffee. Can anyone help?
EDIT:
The string will be sorted, so order is not important so ABCDA => AABCD
The frequency of repeats is also important, so I need to know if the repeat is pair or triplet etc.
If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.
Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)
Bonus: here's a slick LINQ solution to implement Jon's functionality:
int longestRun =
s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();
So, OK, it's not very fast! You got a problem with that?!
:-)
If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.
EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:
public static bool IsSortedNoRepeats(string text)
{
if (text.Length == 0)
{
return true;
}
char current = text[0];
for (int i=1; i < text.Length; i++)
{
char next = text[i];
if (next <= current)
{
return false;
}
current = next;
}
return true;
}
A shorter alternative if you don't mind repeating the indexer use:
public static bool IsSortedNoRepeats(string text)
{
for (int i=1; i < text.Length; i++)
{
if (text[i] <= text[i-1])
{
return false;
}
}
return true;
}
EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.
First a string-specific version:
public static int LongestRun(string text)
{
if (text.Length == 0)
{
return 0;
}
char current = text[0];
int currentRun = 1;
int bestRun = 0;
for (int i=1; i < text.Length; i++)
{
if (current != text[i])
{
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = text[i];
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Now we can also do this as a general extension method on IEnumerable<T>:
public static int LongestRun(this IEnumerable<T> source)
{
bool first = true;
T current = default(T);
int currentRun = 0;
int bestRun = 0;
foreach (T element in source)
{
if (first || !EqualityComparer<T>.Default(element, current))
{
first = false;
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = element;
}
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Then you can call "AABCD".LongestRun() for example.
This will tell you very quickly if a string contains duplicates:
bool containsDups = "ABCDEA".Length != s.Distinct().Count();
It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...
Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.
Since you're using 3.5, you could do this in one LINQ query:
var results = stringInput
.ToCharArray() // not actually needed, I've left it here to show what's actually happening
.GroupBy(c=>c)
.Where(g=>g.Count()>1)
.Select(g=>new {Letter=g.First(),Count=g.Count()})
;
For each character that appears more than once in the input, this will give you the character and the count of occurances.
I think the easiest way to achieve that is to use this simple regex
bool foundMatch = false;
foundMatch = Regex.IsMatch(yourString, #"(\w)\1");
If you need more information about the match (start, length etc)
Match match = null;
string testString = "ABCDE AABCD";
match = Regex.Match(testString, #"(\w)\1+?");
if (match.Success)
{
string matchText = match.Value; // AA
int matchIndnex = match.Index; // 6
int matchLength = match.Length; // 2
}
How about something like:
string strString = "AA BRA KA DABRA";
var grp = from c in strString.ToCharArray()
group c by c into m
select new { Key = m.Key, Count = m.Count() };
foreach (var item in grp)
{
Console.WriteLine(
string.Format("Character:{0} Appears {1} times",
item.Key.ToString(), item.Count));
}
Update Now, you'd need an array of counters to maintain a count.
Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. A mapping of the bit array index and the character set is upto you to decide. Break if you see that a particular bit is on already.
/(.).*\1/
(or whatever the equivalent is in your regex library's syntax)
Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...
I started looking for some info on the net and I got to the following solution.
string input = "aaaaabbcbbbcccddefgg";
char[] chars = input.ToCharArray();
Dictionary<char, int> dictionary = new Dictionary<char,int>();
foreach (char c in chars)
{
if (!dictionary.ContainsKey(c))
{
dictionary[c] = 1; //
}
else
{
dictionary[c]++;
}
}
foreach (KeyValuePair<char, int> combo in dictionary)
{
if (combo.Value > 1) //If the vale of the key is greater than 1 it means the letter is repeated
{
Console.WriteLine("Letter " + combo.Key + " " + "is repeated " + combo.Value.ToString() + " times");
}
}
I hope it helps, I had a job interview in which the interviewer asked me to solve this and I understand it is a common question.
When there is no order to work on you could use a dictionary to keep the counts:
String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
if (!result.ContainsKey(c))
{
result[c] = 0; // initialize the counter in the result
}
result[c]++;
}
foreach (var charCombo in result)
{
Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);
}
The hash solution Jon was describing is probably the best. You could use a HybridDictionary since that works well with small and large data sets. Where the letter is the key and the value is the frequency. (Update the frequency every time the add fails or the HybridDictionary returns true for .Contains(key))

Categories