Is there a performance difference between String.Replace(char, char) and String.Replace(string, string) when I just need to replace one character with another?
Yes, there is: I ran a quick experiment, and it looks like the string version is about 3 times slower.
string a = "quickbrownfoxjumpsoverthelazydog";
DateTime t1 = DateTime.Now;
for (int i = 0; i != 10000000; i++) {
var b = a.Replace('o', 'b');
if (b.Length == 0) {
break;
}
}
DateTime t2 = DateTime.Now;
for (int i = 0; i != 10000000; i++) {
var b = a.Replace("o", "b");
if (b.Length == 0) {
break;
}
}
DateTime te = DateTime.Now;
Console.WriteLine("{0} {1}", t2-t1, te-t2);
1.466s vs 4.583s
This is not surprising, because the overload with strings needs an extra loop to go through all characters of the oldString. This loop runs exactly one time, but the overhead is still there.
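A sketch of the same measurement using System.Diagnostics.Stopwatch, which is generally better suited to timing than DateTime.Now. The class and method names (ReplaceTiming, Measure, Run) are made up for illustration, and the delegate call adds a small overhead to both measurements, so treat the numbers as a rough comparison only:

```csharp
using System;
using System.Diagnostics;

public static class ReplaceTiming
{
    // Times `iterations` calls of `action` with a Stopwatch.
    public static TimeSpan Measure(Func<string> action, int iterations)
    {
        action(); // warm-up call so JIT compilation is not counted
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.Elapsed;
    }

    // Runs both overloads on the same input and prints the elapsed times.
    public static void Run()
    {
        string a = "quickbrownfoxjumpsoverthelazydog";
        const int n = 10000000;
        TimeSpan charTime = Measure(() => a.Replace('o', 'b'), n);
        TimeSpan stringTime = Measure(() => a.Replace("o", "b"), n);
        Console.WriteLine("{0} {1}", charTime, stringTime);
    }
}
```

Calling ReplaceTiming.Run() from Main prints the two durations; the ratio will vary with the runtime version and hardware.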
I would expect string.Replace(char, char) to potentially be faster, as it can allocate exactly the right amount of space. I doubt that it'll make a significant performance difference in many real world apps though.
More importantly, I'd say it's more readable - it's clearer that you really will end up with a string of the same length.
String.Replace(char, char) is faster. The reason is simple:
Char replacement does not need to allocate a string of a different size; string replacement needs to find out the new size first or use a StringBuilder for the replacement.
Char replacement doesn't need to do a check with a range of a string. Imagine you have a string like ABCACABCAC and you want to replace ABC. You need to find out if 3 chars are matching, when working with chars you need only to find one char.
Related
I want to know if I need to do String.Replace/StringBuilder.Replace on my string.
So I have two ways to do that.
The first way:
var myString = new StringBuilder("abcd");
var copyMyString = myString;
myString = myString.Replace("a", "b");
if (!myString.Equals(copyMyString))//If the string Is changed
{
//My Code
}
And the second:
var pos = myString.ToString().IndexOf("a");
if (pos > 0)
{
myString = myString.Replace("a", "b");
//After this line the string is replaced.
//My Code
}
What is a faster way to do this (performance)?
Is there another way to do that?
The string length sometimes can be 1MB and more.
You can speed this up a little by modifying your second method like so:
var pos = myString.ToString().IndexOf("a");
if (pos >= 0) // >= so that a match at index 0 is not skipped
{
myString = myString.Replace("a", "b", pos, myString.Length - pos);
//After this line the string is replaced.
//My Code
}
We now call the overload of StringBuilder.Replace() which specifies a starting index.
Now it doesn't need to search the first part of the string again. This is unlikely to save much time though - but it will save a little.
It depends how often pos > 0 (note that should probably be pos >= 0) is true. .IndexOf() will cycle through each character until it finds what you are looking for so it's O(n), this is a pretty cheap operation since it's only a single search.
The high cost here is the Replace() call. Modifying a large string usually means copying or overwriting much of it, and the larger the string, the more costly that becomes. Replace() can also perform several replacements, since it finds all occurrences.
So unless pos >= 0 is almost always true, the second case will be more efficient. You can drop .ToString() if myString is already a string, where it does nothing; on a StringBuilder it is required, because StringBuilder has no IndexOf(), and it copies the whole buffer.
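A minimal sketch of the check-then-replace idea, wrapped in a helper that also reports whether anything changed. The names StringBuilderReplace and ReplaceIfPresent are hypothetical, and the sketch assumes oldValue is non-empty and that an ordinal (culture-insensitive) search is what you want:

```csharp
using System;
using System.Text;

public static class StringBuilderReplace
{
    // Replaces every occurrence of oldValue and reports whether anything changed.
    // Assumes oldValue is not null or empty.
    public static bool ReplaceIfPresent(StringBuilder sb, string oldValue, string newValue)
    {
        // StringBuilder has no IndexOf, so the search works on a string copy.
        int pos = sb.ToString().IndexOf(oldValue, StringComparison.Ordinal);
        if (pos < 0)
        {
            return false; // not found, nothing to do
        }
        // Start replacing at the first hit so the prefix is not scanned again.
        sb.Replace(oldValue, newValue, pos, sb.Length - pos);
        return true;
    }
}
```

The returned bool takes the place of comparing the builder with a copy of itself afterwards.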
The specific problem I have is that I have to replace the numbers in chemical formulae with the equivalent Unicode subscripts, so H2SO4 => H₂SO₄. (Those subscripts are not font adjustments, they are special Unicode characters.)
So my initial cut was:
return unit.Replace("2", "₂").
Replace("3", "₃").
Replace("4", "₄").
Replace("5", "₅").
Replace("6", "₆").
Replace("7", "₇");
Which works, but obviously isn't particularly efficient. Any suggestions for a more optimal algorithm?
There are only 10 possible subscript characters that need replacement and most chemical formulas are not too long. For this reason, I think your implementation is not horribly inefficient and I would suggest benchmarking your code before trying to optimize it.
But here's my attempt to create a method that does what you need:
public string ToSubscriptFormula(string input)
{
var characters = input.ToCharArray();
for (var i = 0; i < characters.Length; i++)
{
switch (characters[i])
{
case '2':
characters[i] = '₂';
break;
case '3':
characters[i] = '₃';
break;
// case statements omitted
}
}
return new string(characters);
}
I would recommend avoiding the use of StringBuilder unless you're appending a large number of strings, as the overhead of creating an instance would actually make your code less efficient. See this post by Jon Skeet for a detailed explanation of when it should be used.
Also, given the limited number of case statements, I personally don't think using a Dictionary<char,char> would add any readability or performance benefit, but under different scenarios it might be useful to consider using one.
But if you really had to super-optimize your method, you could replace the case statement with the following code (thanks to andrew for the suggestion):
public string ToSubscriptFormula(string input)
{
var characters = input.ToCharArray();
const int distance = '₀' - '0'; // distance of subscript from digit
for (var i = 0; i < characters.Length; i++)
{
if(char.IsDigit(characters[i]))
{
characters[i] = (char) (characters[i] + distance);
}
}
return new string(characters);
}
The trick here is that all subscript characters are successive and that casting an int to char will give you the corresponding character.
Finally, as @nwellnhof has suggested in the comments, char.IsDigit() would return true for some non-Latin digit characters in the Unicode Nd category.
If your chemical formula contains such characters, the statement should be replaced with c >= '0' && c <= '9'. This will probably be slightly faster than char.IsDigit, but I'm not sure if it would make a difference in most practical scenarios.
I would be tempted to do something like this:
public string replace(string input)
{
StringBuilder sb = new StringBuilder();
Dictionary<char, char> map = new Dictionary<char, char>();
map.Add('2', '₂');
map.Add('3', '₃');
map.Add('4', '₄');
map.Add('5', '₅');
map.Add('6', '₆');
map.Add('7', '₇');
char tmp;
foreach(char c in input)
{
if (map.TryGetValue(c, out tmp))
sb.Append(tmp);
else
sb.Append(c);
}
return sb.ToString();
}
The Dictionary is defined inside the method here for simplicity, but should be defined somewhere else in scope.
So, very simply, iterate the input string only once. For every character, find the matching Dictionary entry if it exists, and append either that or the original character to a StringBuilder in order to avoid creating multiple string objects.
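One way the "defined somewhere else" suggestion could look is a static readonly field that is built once and shared by all calls. This is only a sketch: the class name SubscriptMap is made up, and the map is extended to all ten digits, which goes beyond the six entries listed above:

```csharp
using System.Collections.Generic;
using System.Text;

public static class SubscriptMap
{
    // Built once when the class is first used, then reused by every call.
    // Covering all ten digits is an assumption; the original map had 2 to 7 only.
    private static readonly Dictionary<char, char> Map = new Dictionary<char, char>
    {
        { '0', '₀' }, { '1', '₁' }, { '2', '₂' }, { '3', '₃' }, { '4', '₄' },
        { '5', '₅' }, { '6', '₆' }, { '7', '₇' }, { '8', '₈' }, { '9', '₉' }
    };

    public static string Replace(string input)
    {
        // The output has the same length as the input, so size the buffer up front.
        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            char mapped;
            // Append the subscript if the character is mapped, else the character itself.
            sb.Append(Map.TryGetValue(c, out mapped) ? mapped : c);
        }
        return sb.ToString();
    }
}
```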
My first thought was what about formulae with balancing prefix numbers:
E.g. 2H₂(g) + O₂(g) → 2H₂O(g)
Presumably you don't want this to replace the leading numbers?
Also, I'm not sure why it is mentioned above that only 8 digits (or even only 6 digits) need replacement - aren't all digits required (0-9)? Sure, you don't have 0 and 1 by themselves, but you need them for, e.g., 10.
Anyway, notwithstanding the above (which I didn't attempt to implement since it wasn't the question), avoiding StringBuilder and operating on a char array seemed to make sense, and I preferred to avoid a large switch statement.
public class Program
{
public static void Main()
{
Console.WriteLine(SubscriptNums("C6H12O6"));
}
public static string SubscriptNums(string input)
{
char[] replacementChars = { '₀', '₁', '₂', '₃', '₄', '₅', '₆', '₇', '₈', '₉' };
int zeroCharIndex = (int)'0';
char[] inputCharArray = input.ToCharArray();
for(int i = 0; i < inputCharArray.Length; i++)
{
if (inputCharArray[i] >= '0' && inputCharArray[i] <= '9')
{
inputCharArray[i] = replacementChars[(int)inputCharArray[i] - zeroCharIndex];
}
}
return new string(inputCharArray);
}
}
Edit 1 - removed magic number for numeric value of '0'.
Edit 2 - removed use of IsDigit.
You could iterate over the string and check each char. If it needs to be replaced, append the corresponding character to the StringBuilder. If not, just append the original character. This way, you only have to iterate over the string once, and not once for each replacement. Furthermore, as strings are immutable, each call of String.Replace() will create a new copy of the string for the result, which will immediately be GC'ed again.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < unit.Length; i++) {
switch(unit[i]) {
case '2': sb.Append('₂'); break;
case '3': sb.Append('₃'); break;
...
default: sb.Append(unit[i]); break;
}
}
output = sb.ToString();
You could also introduce some replacement dictionary, like Abdullah Nehir suggested
StringBuilder sb = new StringBuilder();
Dictionary<char, char> replacements = new Dictionary<char, char>();
//put in the pairs
for (int i = 0; i < unit.Length; i++) {
if (replacements.ContainsKey(unit[i]))
sb.Append(replacements[unit[i]]);
else
sb.Append(unit[i]);
}
Instead of accessing the values via index, you can also iterate the string with a foreach loop
foreach (char c in unit) {
if (replacements.ContainsKey(c))
sb.Append(replacements[c]);
else
sb.Append(c);
}
If you were looking for some elegant code where you don't have to type string.Replace for each character, then this would help you:
public static string Replace(string input)
{
char[] inputCharArr = input.ToCharArray();
StringBuilder sb = new StringBuilder();
foreach (var c in inputCharArr)
{
int intC = (int)c;
//If the digit was a number ([0-9] are [48-57] in unicode),
//replace the old char with the new char
//(8272 when added to the unicode of [0-9] gives the desired result)
if (intC > 47 && intC < 58)
sb.Append((char)(intC + 8272));
else sb.Append(c);
}
return sb.ToString();
}
See the edit history if you wonder what the comments are talking about.
So, what I'm trying to do is something like this: (example)
a,b,c,d.. etc. aa,ab,ac.. etc. ba,bb,bc, etc.
So, this can essentially be explained as generally increasing and just printing all possible variations, starting at a. So far, I've been able to do it with one letter, starting out like this:
for (int i = 97; i <= 122; i++)
{
item = (char)i;
}
But, I'm unable to eventually add the second letter, third letter, and so forth. Is anyone able to provide input? Thanks.
Since there hasn't been a solution so far that would literally "increment a string", here is one that does:
static string Increment(string s) {
if (s.All(c => c == 'z')) {
return new string('a', s.Length + 1);
}
var res = s.ToCharArray();
var pos = res.Length - 1;
do {
if (res[pos] != 'z') {
res[pos]++;
break;
}
res[pos--] = 'a';
} while (true);
return new string(res);
}
The idea is simple: pretend that letters are your digits, and do an increment the way they teach in an elementary school. Start from the rightmost "digit", and increment it. If you hit a nine (which is 'z' in our system), move on to the prior digit; otherwise, you are done incrementing.
The obvious special case is when the "number" is composed entirely of nines. This is when your "counter" needs to roll to the next size up, and add a "digit". This special condition is checked at the beginning of the method: if the string is composed of N letters 'z', a string of N+1 letter 'a's is returned.
Here is a link to a quick demonstration of this code on ideone.
Each iteration of your for loop completely overwrites what is in "item": the loop just assigns one character at a time.
If item is a string, use something like this:
item = "";
for (int i = 97; i <= 122; i++)
{
item += (char)i;
}
Something to the effect of:
public string IncrementString(string value)
{
    if (string.IsNullOrEmpty(value)) return "a";
    char last = value[value.Length - 1];
    if (last == 'z') // 'z' is character code 122
        return value + "a";
    // Drop the last character and append its successor
    return value.Substring(0, value.Length - 1) + (char)(last + 1);
}
A char converts implicitly to an int, so the arithmetic needs no helper; if you prefer, the conversion can be encapsulated in an extension method like static int ToByte(this char c).
StringBuilder is a better choice when building a large number of strings, so you may want to consider using that instead of string concatenation.
Another way to look at this is that you want to count in base 26. The computer is very good at counting and since it always has to convert from base 2 (binary), which is the way it stores values, to base 10 (decimal--the number system you and I generally think in), converting to different number bases is also very easy.
There's a general base converter here https://stackoverflow.com/a/3265796/351385 which converts an array of bytes to an arbitrary base. Once you have a good understanding of number bases and can understand that code, it's a simple matter to create a base 26 counter that counts in binary, but converts to base 26 for display.
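A minimal sketch of the counting idea, not the linked converter. One caveat: the sequence a..z, aa, ab, ... has no zero digit, so it is the "bijective" variant of base 26 rather than plain base 26, which is why the code subtracts 1 before each division. The names LetterCounter and ToLetters are made up for illustration:

```csharp
using System;
using System.Text;

public static class LetterCounter
{
    // Converts a 1-based counter to letters: 1 -> "a", 26 -> "z", 27 -> "aa".
    public static string ToLetters(int n)
    {
        if (n < 1) throw new ArgumentOutOfRangeException("n", "n must be at least 1");
        var sb = new StringBuilder();
        while (n > 0)
        {
            n--;                                 // shift to 0..25 because there is no zero digit
            sb.Insert(0, (char)('a' + n % 26));  // lowest "digit" goes at the front
            n /= 26;                             // move on to the next position
        }
        return sb.ToString();
    }
}
```

Counting an ordinary int upwards and calling ToLetters on each value then yields a, b, ..., z, aa, ab, and so on.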
I'm doing some work with strings, and I have a scenario where I need to determine if a string (usually a small one < 10 characters) contains repeated characters.
`ABCDE` // does not contain repeats
`AABCD` // does contain repeats, ie A is repeated
I can loop through the string.ToCharArray() and test each character against every other character in the char[], but I feel like I am missing something obvious.... maybe I just need coffee. Can anyone help?
EDIT:
The string will be sorted, so order is not important so ABCDA => AABCD
The frequency of repeats is also important, so I need to know if the repeat is pair or triplet etc.
If the string is sorted, you could just remember each character in turn and check to make sure the next character is never identical to the last character.
Other than that, for strings under ten characters, just testing each character against all the rest is probably as fast or faster than most other things. A bit vector, as suggested by another commenter, may be faster (helps if you have a small set of legal characters.)
Bonus: here's a slick LINQ solution to implement Jon's functionality:
int longestRun =
s.Select((c, i) => s.Substring(i).TakeWhile(x => x == c).Count()).Max();
So, OK, it's not very fast! You got a problem with that?!
:-)
If the string is short, then just looping and testing may well be the simplest and most efficient way. I mean you could create a hash set (in whatever platform you're using) and iterate through the characters, failing if the character is already in the set and adding it to the set otherwise - but that's only likely to provide any benefit when the strings are longer.
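The hash set approach described above could be sketched like this in C#, assuming HashSet<char> is available (.NET 3.5 or later); RepeatCheck and HasRepeats are illustrative names:

```csharp
using System.Collections.Generic;

public static class RepeatCheck
{
    // Returns true as soon as any character is seen a second time.
    public static bool HasRepeats(string text)
    {
        var seen = new HashSet<char>();
        foreach (char c in text)
        {
            // Add returns false when the character is already in the set.
            if (!seen.Add(c))
            {
                return true;
            }
        }
        return false;
    }
}
```

This works on unsorted input too, but it only answers yes or no; it does not report how often a character repeats.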
EDIT: Now that we know it's sorted, mquander's answer is the best one IMO. Here's an implementation:
public static bool IsSortedNoRepeats(string text)
{
if (text.Length == 0)
{
return true;
}
char current = text[0];
for (int i=1; i < text.Length; i++)
{
char next = text[i];
if (next <= current)
{
return false;
}
current = next;
}
return true;
}
A shorter alternative if you don't mind repeating the indexer use:
public static bool IsSortedNoRepeats(string text)
{
for (int i=1; i < text.Length; i++)
{
if (text[i] <= text[i-1])
{
return false;
}
}
return true;
}
EDIT: Okay, with the "frequency" side, I'll turn the problem round a bit. I'm still going to assume that the string is sorted, so what we want to know is the length of the longest run. When there are no repeats, the longest run length will be 0 (for an empty string) or 1 (for a non-empty string). Otherwise, it'll be 2 or more.
First a string-specific version:
public static int LongestRun(string text)
{
if (text.Length == 0)
{
return 0;
}
char current = text[0];
int currentRun = 1;
int bestRun = 0;
for (int i=1; i < text.Length; i++)
{
if (current != text[i])
{
bestRun = Math.Max(currentRun, bestRun);
currentRun = 0;
current = text[i];
}
currentRun++;
}
// It's possible that the final run is the best one
return Math.Max(currentRun, bestRun);
}
Now we can also do this as a general extension method on IEnumerable<T>:
public static int LongestRun<T>(this IEnumerable<T> source)
{
    bool first = true;
    T current = default(T);
    int currentRun = 0;
    int bestRun = 0;
    foreach (T element in source)
    {
        if (first || !EqualityComparer<T>.Default.Equals(element, current))
        {
            first = false;
            bestRun = Math.Max(currentRun, bestRun);
            currentRun = 0;
            current = element;
        }
        currentRun++; // count this element towards the current run
    }
    // It's possible that the final run is the best one
    return Math.Max(currentRun, bestRun);
}
Then you can call "AABCD".LongestRun() for example.
This will tell you very quickly if a string contains duplicates:
string s = "ABCDEA";
bool containsDups = s.Length != s.Distinct().Count();
It just checks the number of distinct characters against the original length. If they're different, you've got duplicates...
Edit: I guess this doesn't take care of the frequency of dups you noted in your edit though... but some other suggestions here already take care of that, so I won't post the code as I note a number of them already give you a reasonably elegant solution. I particularly like Joe's implementation using LINQ extensions.
Since you're using 3.5, you could do this in one LINQ query:
var results = stringInput
.ToCharArray() // not actually needed, I've left it here to show what's actually happening
.GroupBy(c=>c)
.Where(g=>g.Count()>1)
.Select(g=>new {Letter=g.First(),Count=g.Count()})
;
For each character that appears more than once in the input, this will give you the character and the count of occurrences.
I think the easiest way to achieve that is to use this simple regex
bool foundMatch = false;
foundMatch = Regex.IsMatch(yourString, @"(\w)\1");
If you need more information about the match (start, length etc)
Match match = null;
string testString = "ABCDE AABCD";
match = Regex.Match(testString, @"(\w)\1+?");
if (match.Success)
{
string matchText = match.Value; // AA
int matchIndex = match.Index; // 6
int matchLength = match.Length; // 2
}
How about something like:
string strString = "AA BRA KA DABRA";
var grp = from c in strString.ToCharArray()
group c by c into m
select new { Key = m.Key, Count = m.Count() };
foreach (var item in grp)
{
Console.WriteLine(
string.Format("Character:{0} Appears {1} times",
item.Key.ToString(), item.Count));
}
Update Now, you'd need an array of counters to maintain a count.
Keep a bit array, with one bit representing a unique character. Turn the bit on when you encounter a character, and run over the string once. The mapping between bit array index and character set is up to you to decide. Break if you see that a particular bit is on already.
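A small sketch of the bit array idea, under the assumption that the input contains only the letters 'A' to 'Z', so that a single int can hold all the bits. The mapping (bit index = c - 'A') and the names BitRepeatCheck and HasRepeats are choices made for this example:

```csharp
using System;

public static class BitRepeatCheck
{
    // Assumes the text contains only 'A'..'Z'; one bit of `seen` per letter.
    public static bool HasRepeats(string text)
    {
        int seen = 0;
        foreach (char c in text)
        {
            int index = c - 'A';
            if (index < 0 || index > 25)
            {
                throw new ArgumentException("Only the letters A-Z are supported.", "text");
            }
            int bit = 1 << index;
            if ((seen & bit) != 0)
            {
                return true; // this letter's bit is already on
            }
            seen |= bit;     // turn the bit on
        }
        return false;
    }
}
```

A larger character set would need a BitArray or a bool[] instead of a single int.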
/(.).*\1/
(or whatever the equivalent is in your regex library's syntax)
Not the most efficient, since it will probably backtrack to every character in the string and then scan forward again. And I don't usually advocate regular expressions. But if you want brevity...
I started looking for some info on the net and I got to the following solution.
string input = "aaaaabbcbbbcccddefgg";
char[] chars = input.ToCharArray();
Dictionary<char, int> dictionary = new Dictionary<char,int>();
foreach (char c in chars)
{
if (!dictionary.ContainsKey(c))
{
dictionary[c] = 1; // first occurrence of this character
}
else
{
dictionary[c]++;
}
}
foreach (KeyValuePair<char, int> combo in dictionary)
{
if (combo.Value > 1) //If the value of the key is greater than 1 it means the letter is repeated
{
Console.WriteLine("Letter " + combo.Key + " " + "is repeated " + combo.Value.ToString() + " times");
}
}
I hope it helps, I had a job interview in which the interviewer asked me to solve this and I understand it is a common question.
When there is no order to work on you could use a dictionary to keep the counts:
String input = "AABCD";
var result = new Dictionary<Char, int>(26);
var chars = input.ToCharArray();
foreach (var c in chars)
{
if (!result.ContainsKey(c))
{
result[c] = 0; // initialize the counter in the result
}
result[c]++;
}
foreach (var charCombo in result)
{
Console.WriteLine("{0}: {1}",charCombo.Key, charCombo.Value);
}
The hash solution Jon was describing is probably the best. You could use a HybridDictionary, since that works well with both small and large data sets, with the letter as the key and the frequency as the value. (Update the frequency every time the add fails or the HybridDictionary returns true for .Contains(key).)
I find that my program is searching through lots of lengthy strings (20,000+) trying to find a particular unique phrase.
What is the most efficient method for doing this in C#?
Below is the current code which works like this:
The search begins at startPos because the target area is somewhat removed from the start
It loops through the string; at each step it checks if the substring from that point starts with the startMatchString, which is an indicator that the start of the target string has been found. (The length of the target string varies.)
From here it creates a new substring (chopping off the 11 characters that mark the start of the target string) and searches for the endMatchString
I already know that this is a horribly complex and possibly very inefficient algorithm.
What is a better way to accomplish the same result?
string result = string.Empty;
for (int i = startPos; i <= response.Length - 1; i++)
{
if (response.Substring(i).StartsWith(startMatchString))
{
result = response.Substring(i).Substring(11);
for (int j = 0; j <= result.Length - 1; j++)
{
if (result.Substring(j).StartsWith(endMatchString))
{
return result.Remove(j);
}
}
}
}
return result;
You can use String.IndexOf, but make sure you use StringComparison.Ordinal or it may be one order of magnitude slower.
private string Search2(int startPos, string startMatchString, string endMatchString, string response) {
    // Find the start marker, searching ordinally from startPos
    int startMatch = response.IndexOf(startMatchString, startPos, StringComparison.Ordinal);
    if (startMatch != -1) {
        startMatch += startMatchString.Length; // skip past the marker itself
        int endMatch = response.IndexOf(endMatchString, startMatch, StringComparison.Ordinal);
        if (endMatch != -1) { return response.Substring(startMatch, endMatch - startMatch); }
    }
    return string.Empty;
}
Searching 1000 times for a string located about 40% of the way into a 183 KB file took about 270 milliseconds. Without StringComparison.Ordinal it took about 2000 milliseconds.
Searching 1 time with your method took over 60 seconds as it creates a new string (O(n)) each iteration, making your method O(n^2).
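To illustrate where the quadratic cost comes from: response.Substring(i) copies the whole remainder of the string on every iteration just to test a prefix. A sketch of the same test without any copying, using string.CompareOrdinal (the names PrefixCheck and StartsWithAt are hypothetical):

```csharp
using System;

public static class PrefixCheck
{
    // True when `marker` occurs in `text` starting exactly at `index`.
    // Compares in place, so no substring is allocated.
    public static bool StartsWithAt(string text, int index, string marker)
    {
        if (index < 0 || index + marker.Length > text.Length)
        {
            return false; // marker would run past the end of the text
        }
        return string.CompareOrdinal(text, index, marker, 0, marker.Length) == 0;
    }
}
```

In practice IndexOf with StringComparison.Ordinal, as in Search2 above, already does this scan for you; the sketch only shows why the per-position Substring call is the expensive part.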
There are a whole bunch of algorithms:
Boyer-Moore
Sunday
Knuth-Morris-Pratt
Rabin-Karp
I would recommend to use the simplified Boyer-Moore, called Boyer–Moore–Horspool.
The C code appears on Wikipedia.
For the java code look at
http://www.fmi.uni-sofia.bg/fmi/logic/vboutchkova/sources/BoyerMoore_java.html
A nice article about these is available under
http://www.ibm.com/developerworks/java/library/j-text-searching.html
If you want to use built-in stuff go for regular expressions.
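For readers who want to see the Boyer-Moore-Horspool idea recommended above in C#, here is a minimal sketch. It is not taken from the linked sources; the class name Horspool, the Dictionary-based shift table, and the assumption that start lies between 0 and text.Length are choices made for this example:

```csharp
using System;
using System.Collections.Generic;

public static class Horspool
{
    // Returns the index of the first occurrence of pattern in text at or after
    // start, or -1 if there is none. Comparison is ordinal (char by char).
    public static int IndexOf(string text, string pattern, int start)
    {
        int m = pattern.Length;
        if (m == 0)
        {
            return start; // an empty pattern matches immediately
        }

        // Shift table: for every pattern character except the last one,
        // the distance from its last occurrence to the end of the pattern.
        var shift = new Dictionary<char, int>();
        for (int i = 0; i < m - 1; i++)
        {
            shift[pattern[i]] = m - 1 - i;
        }

        int pos = start;
        while (pos <= text.Length - m)
        {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j])
            {
                j--;
            }
            if (j < 0)
            {
                return pos; // every character matched
            }

            // Slide by the shift of the character under the window's last
            // position; characters not in the pattern allow a full jump of m.
            int s;
            pos += shift.TryGetValue(text[pos + m - 1], out s) ? s : m;
        }
        return -1;
    }
}
```

Whether this beats the built-in ordinal IndexOf depends on the pattern length and the runtime, so it is worth measuring before adopting it.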
It depends on what you're trying to find in the string. If you're looking for a specific sequence IndexOf/Contains are fast, but if you're looking for wild card patterns Regex is optimized for this kind of search.
I would try to use a Regular Expression instead of rolling my own string search algorithm. You can precompile the regular expression to make it run faster.
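A sketch of what the precompiled regular expression could look like for the start/end marker problem. The names MarkerRegex, Build, Extract, and the group name "target" are made up for this example; it assumes the markers are literal text, which is why they are passed through Regex.Escape:

```csharp
using System.Text.RegularExpressions;

public static class MarkerRegex
{
    // Builds the regex once; RegexOptions.Compiled trades start-up cost for
    // faster matching, which pays off when the regex is reused many times.
    public static Regex Build(string startMarker, string endMarker)
    {
        string pattern = Regex.Escape(startMarker)
                       + "(?<target>.*?)"       // lazily capture the text in between
                       + Regex.Escape(endMarker);
        // Singleline lets '.' match newlines, in case the target spans lines.
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.Singleline);
    }

    // Returns the text between the markers, searching from startPos,
    // or an empty string when no match is found.
    public static string Extract(Regex regex, string response, int startPos)
    {
        Match match = regex.Match(response, startPos);
        return match.Success ? match.Groups["target"].Value : string.Empty;
    }
}
```

The Regex returned by Build is meant to be created once and kept in a field, then passed to Extract for each response.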
For very long strings you cannot beat the Boyer-Moore search algorithm. It is more complex than I might try to explain here, but the CodeProject site has a pretty good article on it.
You could use a regex; it’s optimized for this kind of searching and manipulation.
You could also try IndexOf ...
string result = string.Empty;
if (startPos >= response.Length)
return result;
int startingIndex = response.IndexOf(startMatchString, startPos);
int rightOfStartIndex = startingIndex + startMatchString.Length;
if (startingIndex > -1 && rightOfStartIndex < response.Length)
{
int endingIndex = response.IndexOf(endMatchString, rightOfStartIndex);
if (endingIndex > -1)
result = response.Substring(rightOfStartIndex, endingIndex - rightOfStartIndex);
}
return result;
Here's an example using IndexOf (beware: written off the top of my head, didn't test it):
int skip = 11;
int start = response.IndexOf(startMatchString, startPos);
if (start >= 0)
{
int end = response.IndexOf(endMatchString, start + skip);
if (end >= 0)
return response.Substring(start + skip, end - start - skip);
else
return response.Substring(start + skip);
}
return string.Empty;
As said before regex is your friend.
You might want to look at RegularExpressions.Group.
This way you can name part of the matched resultset.
Here is an example