Counting Lines, is this Unsafe Method valid? - c#

So I tried to write a fast unsafe method to count lines.
I previously used StringReader, but wanted to see if I could make something faster.
Is this code valid? It seems to work, but it looks a bit confusing,
and I am very new to C# pointers, so I might be doing something wrong.
Original Method:
//Return number of (non Empty) lines
private static int getLineCount(string input)
{
int lines = 0;
string line = null;
//Don't count Empty lines
using (StringReader reader = new StringReader(input))
while ((line = reader.ReadLine()) != null)
if (!string.IsNullOrWhiteSpace(line))
lines++;
return lines;
}
Unsafe Method:
//Return number of (non Empty) lines (fast method using pointers)
private unsafe static int getLineCountUnsafe(string input)
{
int lines = 0;
fixed (char* strptr = input)
{
char* charptr = strptr;
int length = input.Length;
//Don't count Empty lines
for (int i = 0; i < length; i++)
{
char c = *charptr;
//If char is an empty line, look if it's empty
if (c == '\n' || c == '\r')
{
//If char is empty, continue till it's not
while (c == '\n' || c == '\r')
{
if (i >= length)
return lines;
i++;
charptr++;
c = *charptr;
}
//Add a line when line is not just a new line (empty)
lines++;
}
charptr++;
}
return lines;
}
}
Benchmark:
(Looped through 100000, 10 times)
Total Milliseconds used.
Safe(Original) - AVG = 770.10334, MIN = 765.678, MAX = 778.0017 , TOTAL 07.701
Unsafe - AVG = 406.91843, MIN = 405.7931, MAX = 408.5505 , TOTAL 04.069
EDIT:
It seems that the unsafe version isn't always correct:
if the input is a single line, it won't count it. I've been trying to fix that without making it count too many lines ;(

Your second implementation seems okay, but don't spend too much effort on learning unsafe code; neither it nor pointers are widely used in C#, and this is getting close to C++. The time difference between the two approaches more likely comes from the unsafe version not allocating a new string for every line, as StringReader.ReadLine does; the fixed keyword itself only pins the string so that the garbage collector cannot move it while the block runs.
The reason unsafe code should rarely be used is that C# offers a lot of readability and ease of use through its built-in methods, as in your case:
//Return number of (non Empty) lines
private static int getLineCount(string input)
{
return Regex.Matches(input, Environment.NewLine).Count;
}
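which may be even faster because the entire string is evaluated at once, although that is worth measuring. Note that this counts occurrences of Environment.NewLine, not non-empty lines, so it is not an exact replacement for the methods above.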
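For comparison, here is a minimal safe sketch of a single-pass counter that also handles the one-line case mentioned in the question's edit. The class and method names are made up for illustration, and "non-empty" is assumed to mean "contains at least one non-whitespace character", as in the original StringReader version. It has not been benchmarked against the methods above.

```csharp
using System;

public static class LineCounter
{
    // Counts lines that contain at least one non-whitespace character.
    public static int CountNonEmptyLines(string input)
    {
        int lines = 0;
        bool lineHasContent = false;
        foreach (char c in input)
        {
            if (c == '\n' || c == '\r')
            {
                // End of a line: count it only if it had content.
                if (lineHasContent) lines++;
                lineHasContent = false;
            }
            else if (!char.IsWhiteSpace(c))
            {
                lineHasContent = true;
            }
        }
        // The last line may not end with a newline.
        if (lineHasContent) lines++;
        return lines;
    }
}
```

Because the final line is counted after the loop, an input without any newline is still counted as one line.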

Related

C# Extension method slower than chained Replace unless in tight loop. Why?

I have an extension method to remove certain characters from a string (a phone number) which is performing much slower than I think it should compared with chained Replace calls. The weird bit is that in a loop it overtakes the chained Replace calls once the loop runs for around 3000 iterations; below that, chaining Replace is faster. It's as if my code has a fixed overhead that Replace doesn't have. What could this be?
Quick look. When only testing 10 numbers, mine takes about 0.3ms, while Replace takes only 0.01ms. A massive difference! But when running 5 million, mine takes around 1700ms while Replace takes about 2500ms.
Phone numbers will only have 0-9, +, -, (, )
Here's the relevant code:
Building test cases, I'm playing with testNums.
int testNums = 5_000_000;
Console.WriteLine("Building " + testNums + " tests");
Random rand = new Random();
string[] tests = new string[testNums];
char[] letters =
{
'0','1','2','3','4','5','6','7','8','9',
'+','-','(',')'
};
for(int t = 0; t < tests.Length; t++)
{
int length = rand.Next(5, 20);
char[] word = new char[length];
for(int c = 0; c < word.Length; c++)
{
word[c] = letters[rand.Next(letters.Length)];
}
tests[t] = new string(word);
}
Console.WriteLine("Tests built");
string[] stripped = new string[tests.Length];
Using my extension method:
Stopwatch stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].CleanNumberString();
}
stopwatch.Stop();
Console.WriteLine("Clean: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Using chained Replace:
stripped = new string[tests.Length];
stopwatch = Stopwatch.StartNew();
for (int i = 0; i < stripped.Length; i++)
{
stripped[i] = tests[i].Replace(" ", string.Empty)
.Replace("-", string.Empty)
.Replace("(", string.Empty)
.Replace(")", string.Empty)
.Replace("+", string.Empty);
}
stopwatch.Stop();
Console.WriteLine("Replace: " + stopwatch.Elapsed.TotalMilliseconds + "ms");
Extension method in question:
public static string CleanNumberString(this string s)
{
Span<char> letters = stackalloc char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
{
if (s[i] >= '0' && s[i] <= '9')
letters[count++] = s[i];
}
return new string(letters.Slice(0, count));
}
What I've tried:
I've run them the other way around. It makes a tiny difference, but not enough.
Making it a normal static method, which was significantly slower than the extension method. Passing the string as a ref parameter was slightly slower, and as an in parameter was about the same as the extension method.
Aggressive inlining. Doesn't make any real difference. I'm in release mode, so I suspect the compiler inlines it anyway. Either way, not much change.
I have also looked at memory allocations, and that's as I expect. Mine allocates only one string per iteration on the managed heap (the new string at the end), while Replace allocates a new object for each Replace call. So the memory used by the Replace version is much higher. But it's still faster!
Is it calling native C code and doing something crafty there? Is the higher memory usage triggering the GC and slowing it down? (That still doesn't explain the insanely fast time on only one or two iterations.)
Any ideas?
(Yes, I know not to bother optimising things like this, it's just bugging me because I don't know why it's doing this)
After doing some benchmarks, I think I can safely assert that your initial statement is wrong, for the exact reason you mentioned in your deleted answer: the loading time of the method is the only thing that misled you.
Here's the full benchmark on a simplified version of the problem:
static void Main(string[] args)
{
// Build string of n consecutive "ab"
int n = 1000;
Console.WriteLine("N: " + n);
char[] c = new char[n];
for (int i = 0; i < n; i+=2)
c[i] = 'a';
for (int i = 1; i < n; i += 2)
c[i] = 'b';
string s = new string(c);
Stopwatch stopwatch;
// Make sure everything is loaded
s.CleanNumberString();
s.Replace("a", "");
s.UnsafeRemove();
// Tests to remove all 'a' from the string
// Unsafe remove
stopwatch = Stopwatch.StartNew();
string a1 = s.UnsafeRemove();
stopwatch.Stop();
Console.WriteLine("Unsafe remove:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Extension method
stopwatch = Stopwatch.StartNew();
string a2 = s.CleanNumberString();
stopwatch.Stop();
Console.WriteLine("Clean method:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// String replace
stopwatch = Stopwatch.StartNew();
string a3 = s.Replace("a", "");
stopwatch.Stop();
Console.WriteLine("String.Replace:\t" + stopwatch.Elapsed.TotalMilliseconds + "ms");
// Make sure the returned strings are identical
Console.WriteLine(a1.Equals(a2) && a2.Equals(a3));
Console.ReadKey();
}
public static string CleanNumberString(this string s)
{
char[] letters = new char[s.Length];
int count = 0;
for (int i = 0; i < s.Length; i++)
if (s[i] == 'b')
letters[count++] = 'b';
return new string(letters.SubArray(0, count));
}
public static T[] SubArray<T>(this T[] data, int index, int length)
{
T[] result = new T[length];
Array.Copy(data, index, result, 0, length);
return result;
}
// Taken from https://stackoverflow.com/a/2183442/6923568
public static unsafe string UnsafeRemove(this string s)
{
int len = s.Length;
char* newChars = stackalloc char[len];
char* currentChar = newChars;
for (int i = 0; i < len; ++i)
{
char c = s[i];
switch (c)
{
case 'a':
continue;
default:
*currentChar++ = c;
break;
}
}
return new string(newChars, 0, (int)(currentChar - newChars));
}
When run with different values of n, it is clear that your extension method (or at least my somewhat equivalent version of it) is faster than String.Replace(). In fact, it performs better on both small and large strings:
N: 100
Unsafe remove: 0,0024ms
Clean method: 0,0015ms
String.Replace: 0,0021ms
True
N: 100000
Unsafe remove: 0,3889ms
Clean method: 0,5308ms
String.Replace: 1,3993ms
True
I highly suspect that String.Replace() is optimized for replacing strings (as opposed to removing them), and that this is the culprit here. I also added a method from another answer (linked in the code comment above) to have another comparison for character removal. Its timings behave similarly to your method's, but it gets faster at higher values of n (80k+ in my tests).
With all that being said, since your question is based on an assumption that we found was false, if you need more explanation on why the opposite is true (i.e. "Why is String.Replace() slower than my method"), plenty of in-depth benchmarks about string manipulation already do so.
I ran the clean method a couple more times. Interestingly, it is a lot faster than Replace; only the first run was slower. Sorry that I can't explain why it's slower the first time, but once I ran the method more times the results were as expected.
Building 100 tests
Tests built
Replace: 0.0528ms
Clean: 0.4526ms
Clean: 0.0413ms
Clean: 0.0294ms
Replace: 0.0679ms
Replace: 0.0523ms
used dotnet core 2.1
So I've found with help from daehee Kim and Mat below that it's only the first iteration, but it's for the whole first loop. Every loop after there is ok.
I use the following line to force the JIT to do its thing and initialise this method:
RuntimeHelpers.PrepareMethod(typeof(CleanExtension).GetMethod("CleanNumberString", BindingFlags.Public | BindingFlags.Static).MethodHandle);
I find the JIT usually takes about 2-3ms to do its thing here (including Reflection time of about 0.1ms). Note that you should probably not be doing this because you're now getting the Reflection cost as well, and the JIT will be called right after this anyway, but it's probably a good idea for benchmarks to fairly compare.
The more you know!
My benchmark for a loop of 5000 iterations, repeated 5000 times with random strings and averaged is:
Clean: 0.41078ms
Replace: 1.4974ms
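The warm-up idea discussed above can be sketched as a small helper that calls the code once before timing it, so the one-off JIT cost is not counted. The class and method names here are hypothetical, and this is only a rough Stopwatch-based sketch; a dedicated benchmarking library such as BenchmarkDotNet handles warm-up and statistics far more carefully.

```csharp
using System;
using System.Diagnostics;

public static class WarmBenchmark
{
    // Runs the action once untimed (so the JIT cost is paid up front),
    // then returns the average time per call in milliseconds.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        action(); // warm-up call, not measured
        Stopwatch stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        stopwatch.Stop();
        return stopwatch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

It could be used as, for example, WarmBenchmark.AverageMilliseconds(() => "(555) 123-4567".Replace("-", string.Empty), 100000), with the same call shape for the extension method being compared.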

C# Type of String Index

I need to access a very large index in a string, one that int and long can't handle. I had to use ulong, but the problem is that the indexer only accepts the type int.
This is my code and I have marked the line where the error is located. Any ideas how to solve this?
string s = Console.ReadLine();
long n = Convert.ToInt64(Console.ReadLine());
var cont = s.Count(x => x == 'a');
Console.WriteLine(cont);
Console.ReadKey();
The main idea of the code is to identify how many 'a's there are in the string. What are some other ways I can do this?
EDIT:
I didn't know that a string's index can't exceed the range of the int type. I fixed my for loop by replacing it with this LINQ line:
var cont = s.Count(x => x == 'a');
Now, since my string can't exceed a certain length, how can I repeat my string until it is 1,000,000,000,000 characters long, rather than using this code?
for (int i = 0; i < 20; i++)
{
s += s;
}
This code produces an arbitrary number of characters in the string, and raising the 20 may cause an overflow, so I need to repeat the string until its length equals n (the long I declared above).
For example, if my input string is "aba" and n is 10, the string will be "abaabaabaa" (10 characters in total).
PS: I edited the original code.
I assume you got a programming assignment or online coding challenge, where the requirement was "Count all instances of the letter 'a' in this > 2 GB file". Your solution is to read the file into memory at once, and loop over it with a variable type that allows values over 2 GB.
This causes an XY problem. You cannot have an array that large in memory in the first place, so you're not going to reach the point where you need a uint, long or ulong to index into it.
Instead, use a StreamReader to read the file in chunks, as explained in for example Reading large file in chunks c#.
You can repeat your string using an infinite sequence. I haven't added any check for valid arguments, etc.
static void Main(string[] args)
{
long count = countCharacters("aba", 'a', 10);
Console.WriteLine("Count is {0}", count);
Console.WriteLine("Press ENTER to exit...");
Console.ReadLine();
}
private static long countCharacters(string baseString, char c, long limit)
{
long result = 0;
if (baseString.Length == 1)
{
result = baseString[0] == c ? limit : 0;
}
else
{
long n = 0;
foreach (var ch in getInfiniteSequence(baseString))
{
if (n >= limit)
break;
if (ch == c)
{
result++;
}
n++;
}
}
return result;
}
//This method iterates through a base string infinitely
private static IEnumerable<char> getInfiniteSequence(string baseString)
{
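int stringIndex = 0;
while (true)
{
yield return baseString[stringIndex];
//Wrap around explicitly so the index can never overflow on very long sequences
stringIndex = (stringIndex + 1) % baseString.Length;
}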
}
For the given inputs, the result is 7
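For a limit as large as 1,000,000,000,000, iterating character by character would take a long time. A possible shortcut, sketched below under the assumption that the string is simply repeated end to end, is to count the matches in one copy, multiply by the number of full copies, and add the matches in the leftover prefix. The class and method names are made up for this example.

```csharp
using System;
using System.Linq;

public static class RepeatedStringCounter
{
    // Counts occurrences of c in the first `limit` characters of
    // baseString repeated indefinitely, without building the long string.
    public static long CountInRepeated(string baseString, char c, long limit)
    {
        if (string.IsNullOrEmpty(baseString) || limit <= 0)
            return 0;

        long perCopy = baseString.Count(ch => ch == c);
        long fullCopies = limit / baseString.Length;
        // The remainder is smaller than baseString.Length, so it fits in an int.
        int remainder = (int)(limit % baseString.Length);
        long inPartialCopy = baseString.Take(remainder).Count(ch => ch == c);
        return fullCopies * perCopy + inPartialCopy;
    }
}
```

For "aba", 'a' and a limit of 10 this gives 3 full copies with 2 matches each plus 1 match in the leftover "a", which is 7, the same result as the iterator version.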
I highly recommend you rethink the way you are doing this, but a quick fix would be to use a foreach loop instead:
foreach(char c in s)
{
if (c == 'a')
cont++;
}
Alternative using Linq:
cont = s.Count(c => c == 'a');
I'm not sure about what n is supposed to do. According to your code it limits the string length but your question never mentions why or to what end.
i need to access a very large number in the index of the string which
int, long can't handle
This statement is not true.
A C# string's maximum length is int.MaxValue, since string.Length is an int and the length is limited by that. You should be able to do
for (int i = 0; i <= n; i++)
The maximum length of a string cannot exceed the size of an int so there really is no point in using ulong or long to index into the string.
Simply put, you're trying to solve the wrong problem.
If we disregard the fact that the program is likely to cause an out of memory exception when building such a long string, you can simply fix your code by switching to an int instead of a ulong:
for (int i = 0; i <= n; i++)
Having said that you can also use LINQ to do this:
int cont = s.Take(n + 1).Count(c => c == 'a');
Now, in the first sentence of your question you state this:
I need to access a very large number in the index of the string which int and long can't handle.
This is wholly unnecessary because any legal index of a string will fit inside an int.
If you need to do this on some input that's longer than the maximum length of a string in .NET, you'll need to change your approach; use a Stream instead of trying to read all input into a string.
char seeking = 'a';
ulong count = 0;
char[] buffer = new char[4096];
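// inStream is the Stream being read (for example, a FileStream)
using (var reader = new StreamReader(inStream))
{
int length;
while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
{
// Only the first 'length' chars of the buffer were filled by this read
count += (ulong)buffer.Take(length).Count(c => c == seeking);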
}
}

Finding the index of a blank line within a string

Suppose I have a string that contains a text file, carriage returns and tabs and all. How do I find the index of the first blank line (to include lines-containing-only-whitespace) in that string?
What I've tried:
In this case, I have a working function that leverages a bunch of ugly code to find the index of the blank line. There must be a more elegant/readable way to do it than this.
To be clear, the below function returns the section from a string from a supplied 'title' to the index of the first blank line after the title. Supplied in full, since most of it is consumed by the search for that index, and to avoid any 'Why in the WORLD do you need the index of a blank line' questions. Also to counteract the XY Problem, if it's happening here.
The (apparently working, haven't tested all edge cases) code:
// Get subsection indicated by supplied title from supplied section
private static string GetSubSectionText(string section, string subSectionTitle)
{
int indexSubSectionBgn = section.IndexOf(subSectionTitle);
if (indexSubSectionBgn == -1)
return String.Empty;
int indexSubSectionEnd = section.Length;
// Find first blank line after found sub-section
bool blankLineFound = false;
int lineStartIndex = 0;
int lineEndIndex = 0;
do
{
string temp;
lineEndIndex = section.IndexOf(Environment.NewLine, lineStartIndex);
if (lineEndIndex == -1)
temp = section.Substring(lineStartIndex);
else
temp = section.Substring(lineStartIndex, (lineEndIndex - lineStartIndex));
temp = temp.Trim();
if (temp.Length == 0)
{
if (lineEndIndex == -1)
indexSubSectionEnd = section.Length;
else
indexSubSectionEnd = lineEndIndex;
blankLineFound = true;
}
else
{
lineStartIndex = lineEndIndex + 1;
}
} while (!blankLineFound && (lineEndIndex != -1));
if (blankLineFound)
return section.Substring(indexSubSectionBgn, indexSubSectionEnd);
else
return null;
}
FOLLOW-UP EDIT:
The result (based heavily on Konstantin's answer):
// Get subsection indicated by supplied title from supplied section
private static string GetSubSectionText(string section, string subSectionTitle)
{
string[] lines = section.Split(new string[] { Environment.NewLine }, StringSplitOptions.None);
int subsectStart = 0;
int subsectEnd = lines.Length;
// Find subsection start
for (int i = 0; i < lines.Length; i++)
{
if (lines[i].Trim() == subSectionTitle)
{
subsectStart = i;
break;
}
}
// Find subsection end (ie, first blank line)
for (int i = subsectStart; i < lines.Length; i++)
{
if (lines[i].Trim().Length == 0)
{
subsectEnd = i;
break;
}
}
return string.Join(Environment.NewLine, lines, subsectStart, subsectEnd - subsectStart);
}
The primary differences between the result and Konstantin's answer are due to framework version (I'm working with .NET 2.0, and it doesn't support string[].Take), and leveraging Environment.NewLine instead of the hardcoded '\n'. Much, much prettier and more readable than the original pass. Thanks all!
Have you tried using String.Split Method :
string s = "safsadfd\r\ndfgfdg\r\n\r\ndfgfgg";
string[] lines = s.Split('\n');
int i;
for (i = 0; i < lines.Length; i++)
{
if (string.IsNullOrWhiteSpace(lines[i]))
//if (lines[i].Length == 0) //or maybe this suits better..
//if (lines[i].Equals(string.Empty)) //or this
{
Console.WriteLine(i);
break;
}
}
Console.WriteLine(string.Join("\n",lines.Take(i)));
EDIT: responding the OP's edit.
By "blank line", you mean a line that contains only whitespace? Yes, you should use regex; the syntax you're looking for is @"(?<=\r?\n)[ \t]*(\r?\n|$)".
(?<=…) indicates a lookbehind, something that should precede what you're looking for.
\r?\n indicates a newline, supporting both the Unix and Windows conventions.
(?<=\r?\n) is therefore a lookbehind for the preceding newline.
[ \t]* means zero or more space or tab characters; these will match the content (if any) of your blank line.
(\r?\n|$) means newline or end-of-file.
Example:
string source = "Line 1\r\nLine 2\r\n \r\nLine 4\r\n";
Match firstBlankLineMatch = Regex.Match(source, @"(?<=\r?\n)[ \t]*(\r?\n|$)");
int firstBlankLineIndex =
firstBlankLineMatch.Success ? firstBlankLineMatch.Index : -1;
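If a regex feels heavy for this, a plain scan is another option. The sketch below is only an illustration (the class and method names are invented): it walks the string line by line and returns the index at which the first whitespace-only line starts. It uses Trim() rather than String.IsNullOrWhiteSpace so that it would also work on older framework versions, and unlike the pattern above it does not report the empty position after a trailing newline as a blank line.

```csharp
using System;

public static class BlankLineFinder
{
    // Returns the index at which the first whitespace-only line starts,
    // or -1 if no such line exists.
    public static int IndexOfFirstBlankLine(string text)
    {
        int lineStart = 0;
        while (lineStart < text.Length)
        {
            int lineEnd = text.IndexOf('\n', lineStart);
            if (lineEnd == -1) lineEnd = text.Length;

            // Trim() also removes a trailing '\r', so "\r\n" line endings work too.
            string line = text.Substring(lineStart, lineEnd - lineStart);
            if (line.Trim().Length == 0) return lineStart;

            lineStart = lineEnd + 1;
        }
        return -1;
    }
}
```

For the example source string above, this returns 16, the position of the line that contains only a space.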
Just for fun: It seems like you're OK with re-allocating strings once per line. It would be possible then, to write an iterator that would lazily evaluate the string and return each line. For example:
IEnumerable<string> BreakIntoLines(string theWholeThing)
{
int startIndex = 0;
for(;;)
{
int newlineIndex = theWholeThing.IndexOf(Environment.NewLine, startIndex);
if(newlineIndex == -1) //Didn't find a newline
{
//Return the end part of the string and finish
yield return theWholeThing.Substring(startIndex);
yield break;
}
else //Found a newline
{
//Remember to pick up the newline character(s) too!
int endIndex = newlineIndex + Environment.NewLine.Length;
//Return where we're at up to and including the newline
yield return theWholeThing.Substring(startIndex, endIndex - startIndex);
startIndex = endIndex;
}
}
}
You could then wrap that iterator in another one that only returns the lines you care about and discards the others.
IEnumerable<string> GetSubsectionLines(string theWholeThing, string subsectionTitle)
{
bool foundSubsectionTitle = false;
foreach(var line in BreakIntoLines(theWholeThing))
{
if(line.Contains(subsectionTitle))
{
foundSubsectionTitle = true; //Start capturing
}
if(foundSubsectionTitle)
{
yield return line;
} //Implicit "else" - Just discard the line if we haven't found the subsection title yet
if(String.IsNullOrWhiteSpace(line))
{
//This will stop iterating after returning the empty line, if there is one
yield break;
}
}
}
Now, this method (along with some of the others posted) does not do EXACTLY what your original code does. For example, if the text in subsectionTitle happens to span a line break, it won't be found. We'll assume that the spec is written in such a way that this isn't allowed. This code also makes a copy of every line that gets returned, which the original code did too, so that's probably OK.
The only benefit of doing it this way rather than with String.Split is that once you're done returning the subsection, the rest of the string doesn't get evaluated. For most reasonably sized strings you probably don't care. Any "performance gains" are likely to be non-existent. If you really cared about performance, you wouldn't be copying each line in the first place!
The other thing that you get (that actually could be valuable) is code re-use. If you're writing a program that parses documents, it's probably helpful to be able to operate on individual lines.

Is there a better way than String.Replace to remove backspaces from a string?

I have a string read from another source such as "\b\bfoo\bx". In this case, it would translate to the word "fox" as the first 2 \b's are ignored, and the last 'o' is erased, and then replaced with 'x'. Also another case would be "patt\b\b\b\b\b\b\b\b\b\bfoo" should be translated to "foo"
I have come up with something using String.Replace, but it is complex and I am worried it is not working correctly, also it is creating a lot of new string objects which I would like to avoid.
Any ideas?
Probably the easiest is to just iterate over the entire string. Given your inputs, the following code does the trick in one pass:
public string ReplaceBackspace(string hasBackspace)
{
if( string.IsNullOrEmpty(hasBackspace) )
return hasBackspace;
StringBuilder result = new StringBuilder(hasBackspace.Length);
foreach (char c in hasBackspace)
{
if (c == '\b')
{
if (result.Length > 0)
result.Length--;
}
else
{
result.Append(c);
}
}
return result.ToString();
}
The way I would do it is low-tech, but easy to understand.
Create a stack of characters. Then iterate through the string from beginning to end. If the character is a normal character (non-slash), push it onto the stack. If it is a slash, and the next character is a 'b', pop the top of the stack. If the stack is empty, ignore it.
At the end, pop each character in turn, add it to a StringBuilder, and reverse the result.
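The description above can be sketched roughly as follows. Like the description, this version treats the literal two-character sequence backslash followed by 'b' as a backspace (not the single '\b' control character), and the class and method names are invented for the example. Instead of popping into a StringBuilder, it uses Stack<T>.ToArray(), which returns the items in pop order, and reverses the array.

```csharp
using System;
using System.Collections.Generic;

public static class StackBackspace
{
    // Treats the two-character sequence backslash + 'b' as a backspace.
    public static string Apply(string input)
    {
        var stack = new Stack<char>();
        for (int i = 0; i < input.Length; i++)
        {
            bool isBackspace = input[i] == '\\'
                && i + 1 < input.Length
                && input[i + 1] == 'b';
            if (isBackspace)
            {
                if (stack.Count > 0) stack.Pop(); // nothing to erase: ignore
                i++; // skip the 'b' as well
            }
            else
            {
                stack.Push(input[i]);
            }
        }
        // ToArray returns the most recently pushed item first, so reverse it.
        char[] chars = stack.ToArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}
```

To handle the real backspace character instead, the isBackspace test would become input[i] == '\b' and the extra i++ would be dropped.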
Regular expressions version:
var data = @"patt\b\b\b\b\b\b\b\b\b\bfoo";
var regex = new Regex(@"(^|[^\\b])\\b");
while (regex.IsMatch(data))
{
data = regex.Replace(data, "");
}
Optimized version (and this one works with backspace '\b' and not with string "\b"):
var data = "patt\b\b\b\b\b\b\b\b\b\bfoo";
var regex = new Regex(@"[^\x08]\x08", RegexOptions.Compiled);
while (data.Contains('\b'))
{
data = regex.Replace(data.TrimStart('\b'), "");
}
public static string ProcessBackspaces(string source)
{
char[] buffer = new char[source.Length];
int idx = 0;
foreach (char c in source)
{
if (c != '\b')
{
buffer[idx] = c;
idx++;
}
else if (idx > 0)
{
idx--;
}
}
return new string(buffer, 0, idx);
}
EDIT
I've done a quick, rough benchmark of the code posted in answers so far (processing the two example strings from the question, one million times each):
ANSWER | TIME (ms)
------------------------|-----------
Luke (this one) | 318
Alexander Taran | 567
Robert Paulson | 683
Markus Nigbur | 2100
Kamarey (new version) | 7075
Kamarey (old version) | 30902
You could iterate through the string backward, making a character array as you go. Every time you hit a backspace, increment a counter, and every time you hit a normal character, skip it if your counter is non-zero and decrement the counter.
I'm not sure what the best C# data structure is to manage this and then be able to get the string in the right order afterward quickly. StringBuilder has an Insert method but I don't know if it will be performant to keep inserting characters at the start or not. You could put the characters in a stack and hit ToArray() at the end -- that might or might not be faster.
String myString = "patt\b\b\b\b\b\b\b\b\b\bfoo";
List<char> chars = myString.ToCharArray().ToList();
int delCount = 0;
for (int i = chars.Count -1; i >= 0; i--)
{
if (chars[i] == '\b')
{
delCount++;
chars.RemoveAt(i);
} else {
if (delCount > 0) {
chars.RemoveAt(i);
delCount--;
}
}
}
I'd go like this (code is not tested):
char[] result = new char[input.Length];
int r = 0;
for (int i = 0; i < input.Length; i++)
{
if (input[i] == '\b')
{
if (r > 0) r--; // erase the previous character, if there is one
}
else result[r++] = input[i];
}
string resultString = new string(result, 0, r);
Create a StringBuilder and copy over everything but backspace chars.

Testing for repeated characters in a string

I'm doing some work with strings, and I have a scenario where I need to determine if a string (usually a small one < 10 characters) contains repeated characters.
`ABCDE` // does not contain repeats
`AABCD` // does contain repeats, ie A is repeated
I can loop through the string.ToCharArray() and test each character against every other character in the char[], but I feel like I am missing something obvious.... maybe I just need coffee. Can anyone help?
EDIT:
The string will be sorted, so order is not important so ABCDA => AABCD
The frequency of repeats is also important, so I need to know if the repeat is pair or triplet etc.
If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.
Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)
Bonus: here's a slick LINQ solution to implement Jon's functionality:
int longestRun =
s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();
So, OK, it's not very fast! You got a problem with that?!
:-)
If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.
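The hash set idea mentioned above could look something like this in C#. It is a small sketch with invented names, relying on the fact that HashSet<T>.Add returns false when the item is already present.

```csharp
using System.Collections.Generic;

public static class RepeatCheck
{
    // Returns true as soon as a character is seen for the second time.
    public static bool HasRepeats(string text)
    {
        var seen = new HashSet<char>();
        foreach (char c in text)
        {
            // Add returns false when the item is already in the set.
            if (!seen.Add(c)) return true;
        }
        return false;
    }
}
```

This does not require the string to be sorted, but it only answers whether there is a repeat, not how often each character occurs.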
EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:
public static bool IsSortedNoRepeats(string text)
{
if (text.Length == 0)
{
return true;
}
char current = text[0];
for (int i=1; i < text.Length; i++)
{
char next = text[i];
if (next <= current)
{
return false;
}
current = next;
}
return true;
}
A shorter alternative if you don't mind repeating the indexer use:
public static bool IsSortedNoRepeats(string text)
{
for (int i=1; i < text.Length; i++)
{
if (text[i] <= text[i-1])
{
return false;
}
}
return true;
}
EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.
First a string-specific version:
public static int LongestRun(string text)
{
if (text.Length == 0)
{
return 0;
}
char current = text[0];
int currentRun = 1;
int bestRun = 0;
for (int i=1; i < text.Length; i++)
{
if (current != text[i])
{
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = text[i];
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Now we can also do this as a general extension method on IEnumerable<T>:
public static int LongestRun<T>(this IEnumerable<T> source)
{
bool first = true;
T current = default(T);
int currentRun = 0;
int bestRun = 0;
foreach (T element in source)
{
if (first || !EqualityComparer<T>.Default.Equals(element, current))
{
first = false;
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = element;
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Then you can call "AABCD".LongestRun() for example.
This will tell you very quickly if a string contains duplicates:
string s = "ABCDEA";
bool containsDups = s.Length != s.Distinct().Count();
It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...
Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.
Since you're using 3.5, you could do this in one LINQ query:
var results = stringInput
.ToCharArray() // not actually needed, I've left it here to show what's actually happening
.GroupBy(c=>c)
.Where(g=>g.Count()>1)
.Select(g=>new {Letter=g.First(),Count=g.Count()})
;
For each character that appears more than once in the input, this will give you the character and the count of occurrences.
I think the easiest way to achieve that is to use this simple regex
bool foundMatch = false;
foundMatch = Regex.IsMatch(yourString, @"(\w)\1");
If you need more information about the match (start, length etc)
Match match = null;
string testString = "ABCDE AABCD";
match = Regex.Match(testString, @"(\w)\1+?");
if (match.Success)
{
string matchText = match.Value; // AA
int matchIndex = match.Index; // 6
int matchLength = match.Length; // 2
}
How about something like:
string strString = "AA BRA KA DABRA";
var grp = from c in strString.ToCharArray()
group c by c into m
select new { Key = m.Key, Count = m.Count() };
foreach (var item in grp)
{
Console.WriteLine(
string.Format("Character:{0} Appears {1} times",
item.Key.ToString(), item.Count));
}
Update: Now that the frequency matters too, you'd need an array of counters to maintain a count.
Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. The mapping between bit array indexes and the character set is up to you to decide. Break if you see that a particular bit is already on.
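As a rough sketch of the bit array idea, the version below assumes the legal characters are the 26 uppercase letters 'A' to 'Z', so that one bit per character fits in a single uint. That character set and the names used are assumptions made for the example; a larger alphabet would need a BitArray or an array of counters instead.

```csharp
using System;

public static class BitMaskRepeatCheck
{
    // Assumes the legal characters are 'A'..'Z' (26 bits fit in a uint).
    public static bool HasRepeats(string text)
    {
        uint seen = 0;
        foreach (char c in text)
        {
            if (c < 'A' || c > 'Z')
                throw new ArgumentException("Only 'A'..'Z' are supported.", "text");
            uint bit = 1u << (c - 'A');
            if ((seen & bit) != 0) return true; // bit already on: repeat found
            seen |= bit;
        }
        return false;
    }
}
```

Replacing the single bit with a small int[26] of counters gives the frequency of each character as well.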
/(.).*\1/
(or whatever the equivalent is in your regex library's syntax)
Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...
I started looking for some info on the net and I got to the following solution.
string input = "aaaaabbcbbbcccddefgg";
char[] chars = input.ToCharArray();
Dictionary<char, int> dictionary = new Dictionary<char,int>();
foreach (char c in chars)
{
if (!dictionary.ContainsKey(c))
{
dictionary[c] = 1; //
}
else
{
dictionary[c]++;
}
}
foreach (KeyValuePair<char, int> combo in dictionary)
{
if (combo.Value > 1) //If the value of the key is greater than 1 it means the letter is repeated
{
Console.WriteLine("Letter " + combo.Key + " " + "is repeated " + combo.Value.ToString() + " times");
}
}
I hope it helps, I had a job interview in which the interviewer asked me to solve this and I understand it is a common question.
When there is no order to work on you could use a dictionary to keep the counts:
String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
if (!result.ContainsKey(c))
{
result[c] = 0; // initialize the counter in the result
}
result[c]++;
}
foreach (var charCombo in result)
{
Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);
}
The hash solution Jon was describing is probably the best. You could use a HybridDictionary, since that works well with both small and large data sets. The letter is the key and the value is the frequency. (Update the frequency every time the add fails or the HybridDictionary returns true for .Contains(key).)
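A minimal sketch of that suggestion is shown below, using HybridDictionary from System.Collections.Specialized. It is a non-generic collection, so keys and values are stored as object and the count has to be cast back to int. The class and method names are made up for the example.

```csharp
using System.Collections.Specialized;

public static class FrequencyCounter
{
    // Maps each character (key) to the number of times it occurs (value).
    public static HybridDictionary CountCharacters(string input)
    {
        var counts = new HybridDictionary();
        foreach (char c in input)
        {
            if (counts.Contains(c))
                counts[c] = (int)counts[c] + 1; // seen before: bump the count
            else
                counts.Add(c, 1);               // first occurrence
        }
        return counts;
    }
}
```

Any entry with a value greater than 1 is a repeated character, and the value itself tells you whether it is a pair, a triplet, and so on.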
