I have the code below, which loops through a string and compares everything character by character. It is a very slow process, and I wonder how I can improve this code.
//delete anti-xss junk ")]}'\n" (5 chars);
if (trim)
{
googlejson = googlejson.Substring(5);
}
//pass through result and turn empty elements into nulls
//echo strlen( $googlejson ) . '<br>';
bool instring = false;
bool inescape = false;
string lastchar = "";
string output = "";
for ( int x=0; x< googlejson.Length; x++ ) {
string ch = googlejson.Substring(x, 1);
//toss unnecessary whitespace
if ( !instring && ( Regex.IsMatch(ch, @"/\s/"))) {
continue;
}
//handle strings
if ( instring ) {
if (inescape) {
output += ch;
inescape = false;
} else if ( ch == "\\" ) {
output += ch;
inescape = true;
} else if ( ch == "\"") {
output += ch;
instring = false;
} else {
output += ch;
}
lastchar = ch;
continue;
}
switch ( ch ) {
case "\"":
output += ch;
instring = true;
break;
case ",":
if ( lastchar == "," || lastchar == "[" || lastchar == "{" ) {
output += "null";
}
output += ch;
break;
case "]":
case "}":
if ( lastchar == "," ) {
output += "null";
}
output += ch;
break;
default:
output += ch;
break;
}
lastchar = ch;
}
return output;
This is just amazing.
I changed the following two lines and got a phenomenal performance increase, something like 1000%.
First change this
string ch = googlejson.Substring(x, 1);
to that
string ch = googlejson[x].ToString();
Second, I replaced every += ch with a StringBuilder call:
output.Append(ch);
So those 2 changes had maximum performance impact.
First, you shouldn't use Substrings, when only dealing with single characters. Use
char ch = googlejson[x];
instead.
You could also consider using a StringBuilder for your output variable. When working with strings, always keep in mind that strings are immutable in .NET, so for every
output += ch;
there is a new string instance created.
Use
StringBuilder output = new StringBuilder();
and
output.Append(ch);
instead.
As per the other comments, this code's use of strings as characters and Substring() is pretty dire - in terms of performance.
Also, the use of Regex to check for whitespace is going to be very inefficient.
If you want to operate on characters, use characters (char) not strings.
The for loop is a bit inefficient, but the JIT compiler probably optimises that away. It would be slightly better to use a local variable instead of accessing Length property.
Doing a switch on strings is pretty inefficient too, when a switch on characters is darn fast.
And as MartinStettner suggested, StringBuilder.Append will be better for building the result. (@Tom Squires - This question is all about performance, so yes it does matter, and it isn't more complex - it may be a few more characters but that's not complexity.)
Finally, I would say that if you have performance problems (apart from this dire code), you should consider measuring it with a profiler before getting carried away with optimisation.
PS This looks like an interview question ... tut tut if this is the case, that's not what SO is for.
Why not use a StringReader instead of Substring?
var output = new StringBuilder();
using (var reader = new StringReader(googlejson))
{
var buffer = new char[1];
while (reader.Read(buffer, 0, 1) == 1)
{
var ch = buffer[0];
//your stuff
output.Append(ch);
}
}
return output.ToString();
You could use StringReader.Read() and do all your logic on the integer code value of the character, which would be fast but a little brittle.
What about:
if ( !instring && ( Regex.IsMatch(ch, @"/\s/")))
to
if ( !instring && ch < 33)
or even better:
if ( !instring && Char.IsWhiteSpace(ch))
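Putting the suggestions from this thread together (char instead of one-character strings, StringBuilder instead of +=, Char.IsWhiteSpace instead of Regex, a switch on characters), the loop might look roughly like the sketch below. The class and method names are made up for illustration, and the logic is meant to mirror the question's loop rather than be a verified drop-in replacement.

```csharp
using System;
using System.Text;

static class GoogleJsonCleaner
{
    // Hypothetical helper name; the body follows the question's loop step by step.
    public static string FillEmptyElements(string googlejson)
    {
        var output = new StringBuilder(googlejson.Length);
        bool instring = false;
        bool inescape = false;
        char lastchar = '\0';

        foreach (char ch in googlejson)
        {
            // handle strings: every character inside a string is copied as-is
            if (instring)
            {
                output.Append(ch);
                if (inescape) inescape = false;
                else if (ch == '\\') inescape = true;
                else if (ch == '"') instring = false;
                lastchar = ch;
                continue;
            }

            // toss unnecessary whitespace outside strings
            if (char.IsWhiteSpace(ch)) continue;

            switch (ch)
            {
                case '"':
                    instring = true;
                    break;
                case ',':
                    // an empty element before a comma becomes null
                    if (lastchar == ',' || lastchar == '[' || lastchar == '{')
                        output.Append("null");
                    break;
                case ']':
                case '}':
                    // an empty element before a closing bracket becomes null
                    if (lastchar == ',')
                        output.Append("null");
                    break;
            }
            output.Append(ch);
            lastchar = ch;
        }
        return output.ToString();
    }
}
```

For example, FillEmptyElements("[1,,\"a, b\", ]") should give [1,null,"a, b",null]: the whitespace inside the quoted string is kept, the one outside is dropped.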
Related
I am looking to see if there is a better way to figure out which of two characters appears first in a string.
My current code for this is
string UserInput = Console.ReadLine();
char FirstFound;
if (UserInput.IndexOf('+') > UserInput.IndexOf('-') )
{
FirstFound = '+';
}
else
{
FirstFound = '-';
}
Is there a method that allows more than 1 input so can simplify this? Or anything else to make this shorter?
You can shorten it a little bit by understanding the code effectively has the - character as the default value, because it's the result of the else block. With that in mind, we can do this to remove the else block:
string UserInput = Console.ReadLine();
char FirstFound = '-';
if (UserInput.IndexOf('+') > UserInput.IndexOf('-') )
{
FirstFound = '+';
}
We could also do this, which is not shorter but will perform better:
string UserInput = Console.ReadLine();
char FirstFound = '\0'; // must be initialized, or it cannot be read after the loop
foreach(char c in UserInput)
{
if (c == '+' || c == '-')
{
FirstFound = c;
break;
}
}
Which we can shorten by using the LINQ FirstOrDefault() method:
string UserInput = Console.ReadLine();
char FirstFound = UserInput.FirstOrDefault(c => "-+".Contains(c));
If you want to be able to expand this to allow more than two search targets, you can add the targets to the string like so, with no additional lines of code:
string UserInput = Console.ReadLine();
char FirstFound = UserInput.FirstOrDefault(c => "-+*/x÷".Contains(c));
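On the question of a method that accepts more than one input: string.IndexOfAny takes an array of characters and returns the index of the first occurrence of any of them, or -1 when none is present. That also sidesteps comparing two raw IndexOf results, which is easy to get wrong because a missing character reports -1. A minimal sketch follows; the helper name FindFirst and the choice of '\0' as the "not found" value are assumptions for illustration.

```csharp
using System;

static class FirstOperator
{
    // Hypothetical helper: returns the first character of `input` that is in
    // `targets`, or '\0' when none of them occurs.
    public static char FindFirst(string input, params char[] targets)
    {
        int index = input.IndexOfAny(targets);
        return index >= 0 ? input[index] : '\0';
    }
}
```

With this, FirstOperator.FindFirst("3-4+5", '+', '-') returns '-', and adding more targets only means passing more arguments.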
I use VS2019 in Windows7.
I want to remove the text between each "|" and the following "," in a StringBuilder.
That is, I want to convert the StringBuilder from
"578.552|0,37.986|317,38.451|356,23"
to
"578.552,37.986,38.451,23"
I have tried Substring but failed. What other method could I use to achieve this?
If you have a huge StringBuilder, so that converting it into a String and applying a regular expression is not an option,
you can try implementing a Finite State Machine (FSM):
StringBuilder source = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
int state = 0; // 0 - keep character, 1 - discard character
int index = 0;
for (int i = 0; i < source.Length; ++i) {
char c = source[i];
if (state == 0)
if (c == '|')
state = 1;
else
source[index++] = c;
else if (c == ',') {
state = 0;
source[index++] = c;
}
}
source.Length = index;
StringBuilder isn't really set up for much by way of inspection and mutation in the middle. It would be pretty easy to do once you have a string (probably via a Regex), but StringBuilder? Not so much. In reality, StringBuilder is mostly intended for forwards-only append, so the answer would be:
if you didn't want those characters, why did you add them?
Maybe just use the string version here; then:
var s = "578.552|0,37.986|317,38.451|356,23";
var t = Regex.Replace(s, @"\|.*?(?=,)", ""); // 578.552,37.986,38.451,23
The regex translation here is "pipe (\|), non-greedy anything (.*?), followed by a comma where the following comma isn't part of the match ((?=,))".
If you don't know much about Regex patterns, you can write your own custom method to filter out the data; it's always instructive and good practice:
public static String RemoveDelimitedSubstrings(
this StringBuilder s,
char startDelimitter,
char endDelimitter,
char newDelimitter)
{
var buffer = new StringBuilder(s.Length);
var ignore = false;
for (var i = 0; i < s.Length; i++)
{
var currentChar = s[i];
if (currentChar == startDelimitter && !ignore)
{
ignore = true;
}
else if (currentChar == endDelimitter && ignore)
{
ignore = false;
buffer.Append(newDelimitter);
}
else if (!ignore)
buffer.Append(currentChar);
}
return buffer.ToString();
}
And you'd obviously use it like this:
var buffer = new StringBuilder("578.552|0,37.986|317,38.451|356,23");
var filteredBuffer = buffer.RemoveDelimitedSubstrings('|', ',', ',');
My project: basically, I've written a small encrypting program in the workshop. It takes user input and, for each character, checks whether its position in the loop is even; if so, the character is appended to the first half of the result, otherwise it is prepended to the second half. It looks something like this:
string userInput = "", encodedInput = "", evenChar = "", oddChar = "";
int charCount = 0;
Console.WriteLine("Input your text: ");
userInput = Console.ReadLine();
foreach(char character in userInput)
{
charCount++;
if ((charCount % 2) == 0)
{
evenChar = evenChar + character;
}
else
{
oddChar = character + oddChar;
}
encodedInput = evenChar + oddChar;
}
Console.WriteLine(encodedInput);
Now this works fine: when I type in "hi my name is jeff!" I get "im aei ef!fjs mny h".
Now I'm trying to write a deciphering loop. The method I chose is basically to take the last character of the string and add it to a new empty string, then take the first character of the string and add it to the same string, and then decrement the overall length of the encrypted string and increment the position of the first character.
char lastChar = ' ';
char firstChar = ' ';
StringBuilder decodedInput = new StringBuilder();
int len = encodedInput.Length;
int len2 = 0;
foreach(char character in encodedInput)
{
lastChar = encodedInput[len - 1];
decodedInput.Append(lastChar.ToString());
len--;
firstChar = encodedInput[len2];
len2++;
decodedInput.Append(firstChar.ToString());
}
Console.WriteLine(decodedInput.ToString());
Now this works fine for the most part. It takes the same "im aei ef!fjs mny h" and outputs "hi my name is jeff!!ffej si eman ym ih". It mirrors the string because each iteration of the loop produces two characters, so "hi my name is jeff" turns into 36 characters. I've tried halving the loop, but you still get some mirroring.
I'm well aware that there are better or easier methods for deciphering this, but I want to do it this way for the educational purposes.
Kind regards,
Vocaloidas.
Don't loop over each character of the encoded input as you will end up processing each character twice. You are already counting up and down the string with the len and len2 variables so if you replace the foreach with:
while (len > len2)
this will process each character of the string only once.
You will have to do some special casing when the string is an odd number of characters to deal with the middle character - i.e. when len and len2 are equal. To this end add the following:
if (len == len2)
break;
in middle of the loop so that it becomes:
while (len > len2)
{
lastChar = encodedInput[len - 1];
decodedInput.Append(lastChar.ToString());
len--;
if (len == len2)
break;
firstChar = encodedInput[len2];
len2++;
decodedInput.Append(firstChar.ToString());
}
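To check the corrected loop end to end, here is a small round-trip sketch. The class name EvenOddCipher and the method names are invented for illustration; Encode restates the question's scheme (even positions appended to the front half, odd positions prepended to the back half), and Decode is the while loop above with the indices renamed to front and back.

```csharp
using System;
using System.Text;

static class EvenOddCipher
{
    // Same scheme as the question: positions are counted from 1.
    public static string Encode(string input)
    {
        var even = new StringBuilder();
        var odd = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            if ((i + 1) % 2 == 0) even.Append(input[i]);   // even position: append
            else odd.Insert(0, input[i]);                  // odd position: prepend
        }
        return even.ToString() + odd.ToString();
    }

    // Alternately takes one character from the back and one from the front
    // until the two indices meet.
    public static string Decode(string encoded)
    {
        var decoded = new StringBuilder(encoded.Length);
        int front = 0;
        int back = encoded.Length;
        while (back > front)
        {
            decoded.Append(encoded[--back]);
            if (back == front) break;   // odd length: the middle character is done
            decoded.Append(encoded[front++]);
        }
        return decoded.ToString();
    }
}
```

Encoding "hi my name is jeff!" gives "im aei ef!fjs mny h", and decoding that string returns the original 19 characters without the mirrored tail.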
EDIT : Here's my current code (21233664 chars)
string str = myInput.Text;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_' || c==' ')
{
sb.Append(c);
}
}
output.Text = sb.ToString();
Let's say I have a huge text file which contains special characters and normal expressions with underscores.
Here are a few examples of the strings that I'm looking for :
super_test
test
another_super_test
As you can see, only lower case letters are allowed with underscores.
Now, if I have those strings in a text file that looks like this :
> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È
The problem I'm facing is that some lonely letters are still saved. In the example given above, the output would be :
l super_test t
To get rid of those chars, I must go through the whole file again, but here's my question: how can I know whether a letter is lonely or not?
I'm not sure I understand the possibilities with regex, so if anyone can give me a hint I'd really appreciate it.
You clearly need a regular expression. A simple one would be [a-z_]{2,}, which takes all strings of lowercase a to z letters and underscore that are at least 2 characters long.
Just be careful when you are parsing the big file. Being huge, I imagine you use some sort of buffers. You need to make sure you don't get half of a word in one buffer and the other in the next.
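As a rough illustration of the [a-z_]{2,} suggestion, the sketch below collects every match from a piece of text that is already in memory. The class and method names are hypothetical, and it does not address the buffering concern above: a word split across two buffers would still be seen as two fragments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class WordExtractor
{
    // Hypothetical helper: returns every run of two or more characters
    // drawn from lowercase a-z and underscore.
    public static string[] ExtractWords(string text)
    {
        return Regex.Matches(text, "[a-z_]{2,}")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

For the input "l? super_test D> tI> another_super_test" this yields super_test and another_super_test; the lonely l and t are skipped because each is only one character long.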
You can't treat the space just like the other acceptable characters. In addition to being acceptable, the space also serves as a delimiter for your lonesome characters. (This might be a problem with the proposed regular expressions as well; I couldn't say for sure.) Anyway, this does what (I think) you want:
string str = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
StringBuilder sb = new StringBuilder();
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
int length = sb.Length;
if (firstLetterOfWord != null)
{
// c is the second character of a word
sb.Append(firstLetterOfWord);
sb.Append(c);
firstLetterOfWord = null;
}
else if (length == 0 || sb[length - 1] == ' ')
{
// c is the first character of a word; save for next iteration
firstLetterOfWord = c;
}
else
{
// c is part of a word; we're not first, and prev != space
sb.Append(c);
}
}
else if (c == ' ')
{
// If you want to eliminate multiple spaces in a row,
// this is the place to do so
sb.Append(' ');
firstLetterOfWord = null;
}
else
{
firstLetterOfWord = null;
}
}
Console.WriteLine(sb.ToString());
It works with singletons and full words at both start and end of string.
If your input contains something like one#two, the output will run together (onetwo with no intervening space). Assuming that's not what you want, and also assuming that you have no need for multiple spaces in a row:
StringBuilder sb = new StringBuilder();
bool previousWasSpace = true;
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
if (firstLetterOfWord != null)
{
sb.Append(firstLetterOfWord).Append(c);
firstLetterOfWord = null;
previousWasSpace = false;
}
else if (previousWasSpace)
{
firstLetterOfWord = c;
}
else
{
sb.Append(c);
}
}
else
{
firstLetterOfWord = null;
if (!previousWasSpace)
{
sb.Append(' ');
previousWasSpace = true;
}
}
}
Console.WriteLine(sb.ToString());
How would you normalize all new-line sequences in a string to one type?
I'm looking to make them all CRLF for the purpose of email (MIME documents). Ideally this would be wrapped in a static method, executing very quickly, and not using regular expressions (since the variances of line breaks, carriage returns, etc. are limited). Perhaps there's even a BCL method I've overlooked?
ASSUMPTION: After giving this a bit more thought, I think it's a safe assumption to say that CR's are either stand-alone or part of the CRLF sequence. That is, if you see CRLF then you know all CR's can be removed. Otherwise it's difficult to tell how many lines should come out of something like "\r\n\n\r".
input.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\r\n")
This will work if the input contains only one type of line breaks - either CR, or LF, or CR+LF.
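Since the question asks for a static method, the chained Replace could be wrapped as in the sketch below (the names LineEndings and ToCrLf are made up). Tracing the three passes by hand, input that mixes the styles also seems to come out right, under the assumption that a lone CR is meant as a line break: CRLF collapses to LF first, then every remaining CR becomes LF, and finally every LF expands to CRLF.

```csharp
using System;

static class LineEndings
{
    // Hypothetical wrapper around the three-pass replacement.
    // Assumes a lone '\r' should count as a line break.
    public static string ToCrLf(string input)
    {
        return input.Replace("\r\n", "\n")   // CRLF -> LF
                    .Replace("\r", "\n")     // lone CR -> LF
                    .Replace("\n", "\r\n");  // every LF -> CRLF
    }
}
```

Each Replace allocates a new string and scans the whole input, so this trades some speed for brevity.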
It depends on exactly what the requirements are. In particular, how do you want to handle "\r" on its own? Should that count as a line break or not? As an example, how should "a\n\rb" be treated? Is that one very odd line break, one "\n" break and then a rogue "\r", or two separate linebreaks? If "\r" and "\n" can both be linebreaks on their own, why should "\r\n" not be treated as two linebreaks?
Here's some code which I suspect is reasonably efficient.
using System;
using System.Text;
class LineBreaks
{
static void Main()
{
Test("a\nb");
Test("a\nb\r\nc");
Test("a\r\nb\r\nc");
Test("a\rb\nc");
Test("a\r");
Test("a\n");
Test("a\r\n");
}
static void Test(string input)
{
string normalized = NormalizeLineBreaks(input);
string debug = normalized.Replace("\r", "\\r")
.Replace("\n", "\\n");
Console.WriteLine(debug);
}
static string NormalizeLineBreaks(string input)
{
// Allow 10% as a rough guess of how much the string may grow.
// If we're wrong we'll either waste space or have extra copies -
// it will still work
StringBuilder builder = new StringBuilder((int) (input.Length * 1.1));
bool lastWasCR = false;
foreach (char c in input)
{
if (lastWasCR)
{
lastWasCR = false;
if (c == '\n')
{
continue; // Already written \r\n
}
}
switch (c)
{
case '\r':
builder.Append("\r\n");
lastWasCR = true;
break;
case '\n':
builder.Append("\r\n");
break;
default:
builder.Append(c);
break;
}
}
return builder.ToString();
}
}
Simple variant:
Regex.Replace(input, @"\r\n|\r|\n", "\r\n")
For better performance:
static Regex newline_pattern = new Regex(@"\r\n|\r|\n", RegexOptions.Compiled);
[...]
newline_pattern.Replace(input, "\r\n");
string nonNormalized = "\r\n\n\r";
string normalized = nonNormalized.Replace("\r", "\n").Replace("\n", "\r\n");
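This is a quick way to do it. Note that it treats every \r and every \n as a separate line break, so an existing \r\n pair comes out as two CRLF sequences.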
It does not use an expensive regex function.
It also avoids chaining several replacement calls, each of which loops over the data separately with its own checks and allocations.
Instead, the search is done directly in one for loop. The only other loop is inside Array.Copy, which runs each time the capacity of the result array has to be increased.
In some cases, a larger page size might be more efficient.
public static string NormalizeNewLine(this string val)
{
if (string.IsNullOrEmpty(val))
return val;
const int page = 6;
int a = page;
int j = 0;
int len = val.Length;
char[] res = new char[len];
for (int i = 0; i < len; i++)
{
char ch = val[i];
if (ch == '\r')
{
int ni = i + 1;
if (ni < len && val[ni] == '\n')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else if (ch == '\n')
{
int ni = i + 1;
if (ni < len && val[ni] == '\r')
{
res[j++] = '\r';
res[j++] = '\n';
i++;
}
else
{
if (a == page) // Ensure capacity
{
char[] nres = new char[res.Length + page];
Array.Copy(res, 0, nres, 0, res.Length);
res = nres;
a = 0;
}
res[j++] = '\r';
res[j++] = '\n';
a++;
}
}
else
{
res[j++] = ch;
}
}
return new string(res, 0, j);
}
I know that '\n\r' is not actually used on common platforms. But who would use two types of linebreaks in succession to indicate two linebreaks?
If you want to know that, then you need to take a look before to know if the \n and \r both are used separately in the same document.
Environment.NewLine;
A string containing "\r\n" for non-Unix platforms, or a string containing "\n" for Unix platforms.
str.Replace("\r", "").Replace("\n", "\r\n");
Converts both types of line breaks (\n and \r\n) into CRLFs; note that a lone \r is removed rather than converted.
On .NET 6 it's 35% faster than the regex (benchmarked using BenchmarkDotNet).