Optimising HTML tag removal in C# - c#

I have some code that removes HTML tags from text. I don't care about the content (script, css, text etc), the important thing, at least for now, is that the tags themselves are stripped out.
This may be entering the theatre of micro-optimisation, however this code is among a small number of functions that will be running very often against large amounts of data, so any percentage saving may carry through to a useful saving from the overall application's perspective.
The code at present looks like this:
public static string StripTags(string html)
{
    var currentIndex = 0;
    var insideTag = false;
    var output = new char[html.Length];
    for (int i = 0; i < html.Length; i++)
    {
        var c = html[i];
        if (c == '>')
        {
            insideTag = false;
            continue;
        }
        if (!insideTag)
        {
            if (c == '<')
            {
                insideTag = true;
                continue;
            }
            output[currentIndex] = c;
            currentIndex++;
        }
    }
    return new string(output, 0, currentIndex);
}
Are there any obvious .net tricks I'm missing out on here? For info this is using .net 4.
Many thanks.

In this code you copy chars one by one. You might be able to speed it up considerably by only finding where the current section (inside or outside a tag) ends, and then using Array.Copy to move the whole chunk in one go; this enables lower-level optimisations (for instance, on 64-bit a single move can cover four UTF-16 chars: 4 × 16 bits). The runs of text between the tags are probably quite large, so this could add up.
Also, the StringBuilder documentation mentions somewhere that because it's implemented in the framework and not in C#, it has performance you can't replicate in managed C#. I'm not sure how you could append a chunk with it; you might look into that.
Regards Gert-Jan
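A minimal sketch of that chunk-copying idea (my own variant, not Gert-Jan's exact code; String.CopyTo does the same bulk copy as Array.Copy here): scan for the next '<', copy the whole run of text in one call, then skip ahead past the matching '>'. Note it treats a stray '>' outside a tag as ordinary text, which differs slightly from the original method on malformed input.

```csharp
using System;

static class TagStripper
{
    // Sketch: copy whole text runs between tags in bulk instead of
    // one char at a time. Assumes reasonably well-formed '<'...'>' pairs.
    public static string StripTagsChunked(string html)
    {
        var output = new char[html.Length];
        int outIndex = 0;
        int i = 0;
        while (i < html.Length)
        {
            int lt = html.IndexOf('<', i);
            if (lt < 0) lt = html.Length;       // no more tags: copy the rest
            int runLength = lt - i;
            if (runLength > 0)
            {
                html.CopyTo(i, output, outIndex, runLength); // bulk copy of the text run
                outIndex += runLength;
            }
            if (lt == html.Length) break;
            int gt = html.IndexOf('>', lt);
            if (gt < 0) break;                  // unterminated tag: drop the tail
            i = gt + 1;                         // resume after the closing '>'
        }
        return new string(output, 0, outIndex);
    }
}
```

For well-formed input such as "a<b>c</b>d" this returns "acd", matching the character-by-character version.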

You should take a look at the following library, as it seems to be the best way to interact with HTML files in .NET: http://htmlagilitypack.codeplex.com/

Do not solve a non-existent problem.
How many times will this method be called? Many! How many? Several thousand? Not enough to warrant optimisation.
Can you just do a Parallel.For and speed it up 3-5 times depending on machine? Possibly.
Is your code dependent on lots of other code? Certainly.
Is it possible that you have this:
// Some slow code
StripTags(s); // Super fast version
// Some more slow code here
Will it matter then how fast your StripTags is?
Are you getting them from a file? Are you getting them from a network? Very rarely the bottleneck is your raw CPU power.
Let me repeat myself:
Do not solve a non-existent problem!

You can also encode it:
string encodedString = Server.HtmlEncode(stringToEncode);
Have a look here: http://msdn.microsoft.com/en-us/library/ms525347%28v=vs.90%29.aspx

Googling for remove html from string yields many links that talk about using regular expressions, all similar to the following:
public string Strip(string text)
{
    return Regex.Replace(text, @"<(.|\n)*?>", string.Empty);
}

Related

Variable replacement in string

I am developing an application in C# to replace variables across strings.
Any suggestions on how to do that very efficiently?
private string MyStrTr(string source, string frm, string to)
{
    char[] input = source.ToCharArray();
    bool[] replaced = new bool[input.Length]; // elements default to false
    for (int i = 0; i < frm.Length; i++)
    {
        for (int j = 0; j < input.Length; j++)
        {
            if (!replaced[j] && input[j] == frm[i])
            {
                input[j] = to[i];
                replaced[j] = true;
            }
        }
    }
    return new string(input);
}
The above code works fine, but each variable requires its own traversal of the string, so the number of traversals grows with the variable count.
Exact requirement:
parent.name = am;
parent.number = good;
I {parent.name} a {parent.number} boy.
The output should be: I am a good boy.
Assume the source will be huge. For example, if I have 5 different variables, I need to traverse the full string 5 times.
Any suggestions on how to process all the variables in parallel during the first traversal?
I think you're suffering from premature optimization. Write the simplest thing first and see if it works for you. If you don't suffer a performance problem, then you're done. Don't waste time trying to make it faster when you don't know how fast it is, or if it's the cause of your performance problem.
By the way, Facebook's Terms of Service, the entire HTML page that includes a lot of Javascript, is only 164 kilobytes. That's not especially large.
String.Replace should work quite well, even if you have multiple strings to replace. That is, you can write:
string result = source.Replace("{parent.name}", "am");
result = result.Replace("{parent.number}", "good");
// more replacements here
return result;
That will exercise the garbage collector a little bit, but it shouldn't be a problem unless you have a truly massive page or a whole mess of replacements.
You can potentially save yourself some garbage collection by converting the string to a StringBuilder, and calling StringBuilder.Replace multiple times. I honestly don't know, though, whether that will have any appreciable effect. I don't know how StringBuilder.Replace is implemented.
There is a way to make this faster, by writing code that will parse the string and do all the replacements in a single pass. It's a lot of code, though. You have to build a state machine from the multiple search strings and go through the source text one character at a time. It's doable, but it's a difficult enough task that you probably don't want to do it unless the simple method flat doesn't work quickly enough.
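A simplified single-pass sketch of that idea: rather than a full multi-pattern state machine, this leans on the {key} delimiters of this particular format and looks each key up in a dictionary while scanning once. Names here are illustrative, not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class TemplateExpander
{
    // Single pass: scan for '{', look the enclosed key up in a dictionary,
    // and append either the replacement or the literal text.
    // A general multi-string replacer (no delimiters) would need an
    // Aho-Corasick-style state machine instead.
    public static string Expand(string source, IDictionary<string, string> vars)
    {
        var sb = new StringBuilder(source.Length);
        int i = 0;
        while (i < source.Length)
        {
            int open = source.IndexOf('{', i);
            if (open < 0) { sb.Append(source, i, source.Length - i); break; }
            sb.Append(source, i, open - i);                 // literal text before '{'
            int close = source.IndexOf('}', open + 1);
            if (close < 0) { sb.Append(source, open, source.Length - open); break; }
            string key = source.Substring(open + 1, close - open - 1);
            string value;
            if (vars.TryGetValue(key, out value))
                sb.Append(value);                           // known variable: substitute
            else
                sb.Append(source, open, close - open + 1);  // unknown: keep as-is
            i = close + 1;
        }
        return sb.ToString();
    }
}
```

With { "parent.name" → "am", "parent.number" → "good" }, expanding "I {parent.name} a {parent.number} boy." yields "I am a good boy." in one traversal.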

how to convert part of the string to int/float/vector3 etc. without creating a temp string?

In C#, I have a string like this:
"1 3.14 (23, 23.2, 43,88) 8.27"
I need to convert this string to other types according to the value, like int/float/vector3. Currently I have some code like this:
public static int ReadInt(this string s, ref string op)
{
    s = s.Trim();
    string ss = "";
    int idx = s.IndexOf(" ");
    if (idx > 0)
    {
        ss = s.Substring(0, idx);
        op = s.Substring(idx);
    }
    else
    {
        ss = s;
        op = "";
    }
    return Convert.ToInt32(ss);
}
This reads the first int value out, and I have similar functions to read float, vector3, etc. The problem is that in my application I have to do this a lot: I receive the string from a plugin and need to parse it every single frame. All the temporary strings I create cause a lot of GC pressure, which hurts performance. Is there a way to do this without creating temp strings?
Generation 0 objects such as those created here may well not impact performance too much, as they are relatively cheap to collect. I would change from using Convert to calling int.Parse() with the invariant culture before I started worrying about the GC overhead of the extra strings.
Also, you don't really need to create a new string to accomplish the Trim() behavior. After all, you're scanning and indexing the string anyway. Just do your initial scan for whitespace, and then for the space delimiter between ss and op, so you get just the substrings you need. Right now you're creating 50% more string instances than you really need.
All that said, no...there's not anything built into the basic .NET framework that would parse a substring without actually creating a new string instance. You would have to write your own parsing routines to accomplish that.
You should measure the actual real-world performance impact first, to make sure these substrings really are a significant issue.
I don't know what the "some plugin" is or how you have to handle the input from it, but I would not be surprised to hear that the overhead in acquiring the original input string(s) for this scenario swamps the overhead of the substrings for parsing.
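As a hedged sketch of that "write your own parsing routine" suggestion: digits can be accumulated directly from the characters of the original string, so no substring is allocated at all. This toy version handles only an optional minus sign and decimal digits, with no overflow or culture handling.

```csharp
using System;

static class SubstringFreeParser
{
    // Parse an int starting at 'start' in s without allocating a substring.
    // Returns the value; 'next' is the index just past the parsed digits,
    // so the caller can continue scanning from there.
    public static int ParseIntAt(string s, int start, out int next)
    {
        int i = start;
        while (i < s.Length && char.IsWhiteSpace(s[i])) i++;  // skip leading whitespace
        bool negative = i < s.Length && s[i] == '-';
        if (negative) i++;
        int value = 0;
        while (i < s.Length && s[i] >= '0' && s[i] <= '9')
        {
            value = value * 10 + (s[i] - '0');                // accumulate digits
            i++;
        }
        next = i;
        return negative ? -value : value;
    }
}
```

The same pattern extends to floats and vector components: keep an index into the original string and never call Substring except for whatever final text you actually need to keep.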

Fastest way to search ASCII files in C# for simple keywords?

Right now, I search ASCII files for simple keywords like this:
int SearchInFile(string file, string searchString)
{
    int num = 0;
    using (StreamReader reader = File.OpenText(file)) // ensures the file is closed even on exceptions
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            int count = CountSubstrings(line, searchString);
            if (count != 0)
            {
                num += count;
            }
        }
    }
    return num;
}
Is this the fastest, most memory efficient way to do it? Returning the count is optional if it's going to make a huge difference in the way of searching, but not on its own.
I use it like:
SearchInFile ( "C:\\text.txt", "cool" );
In unmanaged code, the most effective approach performance-wise is to use memory-mapped files instead of reading the file into a buffer. I am sure the best results can be achieved only that way, especially if the file you want to scan could be on remote storage (a file on a server).
I am not sure that using the corresponding .NET 4.0 classes will be exactly as effective in your case.
Just load the text file into a large string using StreamReader's ReadToEnd method and use string.IndexOf():
string test = reader.ReadToEnd();
int index = test.IndexOf("keyword");
If you really want more performance (processing files on the order of hundreds of MB or GB), then instead of doing a line-by-line search, you should read in strings by blocks of perhaps 1k and do searches on them. Despite having to deal with some boundary conditions, this should prove faster.
That being said, you should apply a profiler like ANTS to see if this is actually your bottleneck.
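A hedged sketch of the block-reading idea: read fixed-size chunks and carry the last searchString.Length - 1 characters across each boundary so matches spanning two blocks aren't missed. CountOccurrences is a simple helper added for this sketch, and the count is non-overlapping within a chunk, which is fine for a plain keyword like "cool".

```csharp
using System;
using System.IO;

static class BlockSearcher
{
    // Count occurrences of 'needle' in a stream, reading in blocks and
    // carrying an overlap so boundary-spanning matches are still found.
    public static int CountInReader(TextReader reader, string needle, int blockSize = 1024)
    {
        var buffer = new char[blockSize];
        string carry = "";                       // tail of the previous block
        int total = 0;
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            string chunk = carry + new string(buffer, 0, read);
            total += CountOccurrences(chunk, needle);
            // Keep needle.Length - 1 chars: any match crossing into the
            // next block must start inside this window.
            int keep = Math.Min(needle.Length - 1, chunk.Length);
            carry = chunk.Substring(chunk.Length - keep);
        }
        return total;
    }

    static int CountOccurrences(string haystack, string needle)
    {
        int count = 0, i = 0;
        while ((i = haystack.IndexOf(needle, i, StringComparison.Ordinal)) >= 0)
        {
            count++;
            i += needle.Length; // step past the match (non-overlapping count)
        }
        return count;
    }
}
```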

Fast string parsing in C#

What's the fastest way to parse strings in C#?
Currently I'm just using string indexing (string[index]) and the code runs reasonably, but I can't help but think that the continuous range checking that the index accessor does must be adding something.
So, I'm wondering what techniques I should consider to give it a boost. These are my initial thoughts/questions:
Use methods like string.IndexOf() and IndexOfAny() to find characters of interest. Are these faster than manually scanning a string by string[index]?
Use regex's. Personally, I don't like regex as I find them difficult to maintain, but are these likely to be faster than manually scanning the string?
Use unsafe code and pointers. This would eliminate the index range checking but I've read that unsafe code wont run in untrusted environments. What exactly are the implications of this? Does this mean the whole assembly won't load/run, or will only the code marked unsafe refuse to run? The library could potentially be used in a number of environments, so to be able to fall back to a slower but more compatible mode would be nice.
What else might I consider?
NB: I should say, the strings I'm parsing could be reasonably large (say 30k) and are in a custom format for which there is no standard .NET parser. Also, the performance of this code is not super critical, so this is partly just a theoretical question of curiosity.
30k is not what I would consider to be large. Before getting excited, I would profile. The indexer should be fine for the best balance of flexibility and safety.
For example, to create a 128k string (and a separate array of the same size), fill it with junk (including the time to handle Random) and sum all the character code-points via the indexer takes... 3ms:
var watch = Stopwatch.StartNew();
char[] chars = new char[128 * 1024];
Random rand = new Random(); // fill with junk
for (int i = 0; i < chars.Length; i++)
{
    chars[i] = (char)('a' + rand.Next(26));
}
int sum = 0;
string s = new string(chars);
int len = s.Length;
for (int i = 0; i < len; i++)
{
    sum += s[i]; // string indexer, range check and all
}
watch.Stop();
Console.WriteLine(sum);
Console.WriteLine(watch.ElapsedMilliseconds + "ms");
Console.ReadLine();
For files that are actually large, a reader approach should be used - StreamReader etc.
"Parsing" is quite an inexact term. Since you talk of 30k, it seems you might be dealing with some sort of structured string, which could be covered by creating a parser with a parser-generator tool.
A nice tool to create, maintain and understand the whole process is the GOLD Parsing System by Devin Cook: http://www.devincook.com/goldparser/
This can help you create code which is efficient and correct for many textual parsing needs.
As for your points:
IndexOf()/IndexOfAny() is usually not useful for parsing that goes further than splitting a string.
Regex is better suited if there are no recursions or overly complex rules.
Unsafe code is basically a no-go if you haven't really identified this as a serious problem. The JIT can take care of doing the range checks only when needed, and indeed for simple loops (the typical for loop) this is handled pretty well.

Performance issue: comparing to String.Format

A while back a post by Jon Skeet planted the idea in my head of building a CompiledFormatter class, for using in a loop instead of String.Format().
The idea is the portion of a call to String.Format() spent parsing the format string is overhead; we should be able to improve performance by moving that code outside of the loop. The trick, of course, is the new code should exactly match the String.Format() behavior.
This week I finally did it. I went through using the .Net framework source provided by Microsoft to do a direct adaption of their parser (it turns out String.Format() actually farms the work to StringBuilder.AppendFormat()). The code I came up with works, in that my results are accurate within my (admittedly limited) test data.
Unfortunately, I still have one problem: performance. In my initial tests the performance of my code closely matches that of the normal String.Format(). There's no improvement at all; it's even consistently a few milliseconds slower. At least it's still in the same order (ie: the amount slower doesn't increase; it stays within a few milliseconds even as the test set grows), but I was hoping for something better.
It's possible that the internal calls to StringBuilder.Append() are what actually drive the performance, but I'd like to see if the smart people here can help improve things.
Here is the relevant portion:
private class FormatItem
{
    public int index;     // index of item in the argument list; -1 means it's a literal from the original format string
    public char[] value;  // literal data from the original format string
    public string format; // simple format to use with the supplied argument (ie: {0:X} for hex)

    // for fixed-width format (examples below)
    public int width;     // {0,7} means it should be at least 7 characters
    public bool justify;  // {0,-7} would use the opposite alignment
}
//this data is all populated by the constructor
private List<FormatItem> parts = new List<FormatItem>();
private int baseSize = 0;
private string format;
private IFormatProvider formatProvider = null;
private ICustomFormatter customFormatter = null;
// the code in here very closely matches the code in the String.Format/StringBuilder.AppendFormat methods.
// Could it be faster?
public String Format(params Object[] args)
{
    if (format == null || args == null)
        throw new ArgumentNullException((format == null) ? "format" : "args");
    var sb = new StringBuilder(baseSize);
    foreach (FormatItem fi in parts)
    {
        if (fi.index < 0)
            sb.Append(fi.value);
        else
        {
            //if (fi.index >= args.Length) throw new FormatException(Environment.GetResourceString("Format_IndexOutOfRange"));
            if (fi.index >= args.Length) throw new FormatException("Format_IndexOutOfRange");
            object arg = args[fi.index];
            string s = null;
            if (customFormatter != null)
            {
                s = customFormatter.Format(fi.format, arg, formatProvider);
            }
            if (s == null)
            {
                if (arg is IFormattable)
                {
                    s = ((IFormattable)arg).ToString(fi.format, formatProvider);
                }
                else if (arg != null)
                {
                    s = arg.ToString();
                }
            }
            if (s == null) s = String.Empty;
            int pad = fi.width - s.Length;
            if (!fi.justify && pad > 0) sb.Append(' ', pad);
            sb.Append(s);
            if (fi.justify && pad > 0) sb.Append(' ', pad);
        }
    }
    return sb.ToString();
}
// alternate implementation (for comparative testing)
// my own tests call String.Format() separately; I don't use this, but it's useful to see
// how my Format method fits.
public string OriginalFormat(params Object[] args)
{
    return String.Format(formatProvider, format, args);
}
Additional notes:
I'm wary of providing the source code for my constructor, because I'm not sure of the licensing implications from my reliance on the original .Net implementation. However, anyone who wants to test this can just make the relevant private data public and assign values that mimic a particular format string.
Also, I'm very open to changing the FormatItem class and even the parts list if anyone has a suggestion that could improve build time. Since my primary concern is sequential iteration from front to end, maybe a LinkedList would fare better?
[Update]:
Hmm... something else I can try is adjusting my tests. My benchmarks were fairly simple: composing names to a "{lastname}, {firstname}" format and composing formatted phone numbers from the area code, prefix, number, and extension components. Neither of those have much in the way of literal segments within the string. As I think about how the original state machine parser worked, I think those literal segments are exactly where my code has the best chance to do well, because I no longer have to examine each character in the string.
Another thought:
This class is still useful, even if I can't make it go faster. As long as performance is no worse than the base String.Format(), I've still created a strongly-typed interface which allows a program to assemble its own "format string" at run time. All I need to do is provide public access to the parts list.
Here's the final result:
I changed the format string in a benchmark trial to something that should favor my code a little more:
The quick brown {0} jumped over the lazy {1}.
As I expected, this fares much better compared to the original: 2 million iterations in 5.3 seconds for this code vs 6.1 seconds for String.Format. This is an undeniable improvement. You might even be tempted to start using this as a no-brainer replacement for many String.Format situations. After all, you'll do no worse, and you might even get a small performance boost: as much as 14%, and that's nothing to sneeze at.
Except that it is. Keep in mind, we're still talking less than half a second difference for 2 million attempts, under a situation specifically designed to favor this code. Not even busy ASP.Net pages are likely to create that much load, unless you're lucky enough to work on a top 100 web site.
Most of all, this omits one important alternative: you can create a new StringBuilder each time and manually handle your own formatting using raw Append() calls. With that technique my benchmark finished in only 3.9 seconds. That's a much greater improvement.
In summary, if performance doesn't matter as much, you should stick with the clarity and simplicity of the built-in option. But when in a situation where profiling shows this really is driving your performance, there is a better alternative available via StringBuilder.Append().
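A minimal sketch of that manual-Append alternative, assuming the benchmark's format string (names are illustrative): no format parsing happens at all, which is where the remaining time goes.

```csharp
using System.Text;

static class ManualFormat
{
    // Manual equivalent of
    // String.Format("The quick brown {0} jumped over the lazy {1}.", animal, target):
    // no format string to parse at run time, just raw Append calls.
    public static string Compose(string animal, string target)
    {
        var sb = new StringBuilder(64);
        sb.Append("The quick brown ");
        sb.Append(animal);
        sb.Append(" jumped over the lazy ");
        sb.Append(target);
        sb.Append('.');
        return sb.ToString();
    }
}
```

The trade-off is obvious: the format is now hard-coded in the method body, so you lose the run-time flexibility that motivated the CompiledFormatter in the first place.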
Don't stop now!
Your custom formatter might only be slightly more efficient than the built-in API, but you can add more features to your own implementation that would make it more useful.
I did a similar thing in Java, and here are some of the features I added (besides just pre-compiled format strings):
1) The format() method accepts either a varargs array or a Map (in .NET, it'd be a dictionary). So my format strings can look like this:
StringFormatter f = StringFormatter.parse(
    "the quick brown {animal} jumped over the {attitude} dog"
);
Then, if I already have my objects in a map (which is pretty common), I can call the format method like this:
String s = f.format(myMap);
2) I have a special syntax for performing regular expression replacements on strings during the formatting process:
// After calling obj.toString(), all space characters in the formatted
// object string are converted to underscores.
StringFormatter f = StringFormatter.parse(
    "blah blah blah {0:/\\s+/_/} blah blah blah"
);
3) I have a special syntax that allows the formatter to check the argument for null-ness, applying a different format depending on whether the object is null or non-null.
StringFormatter f = StringFormatter.parse(
    "blah blah blah {0:?'NULL'|'NOT NULL'} blah blah blah"
);
There are a zillion other things you can do. One of the tasks on my todo list is to add a new syntax where you can automatically format Lists, Sets, and other Collections by specifying a formatter to apply to each element as well as a string to insert between all elements. Something like this...
// Wraps each element in single-quote chars, separating
// adjacent elements with a comma.
StringFormatter f = StringFormatter.parse(
    "blah blah blah {0:#['$'][,]} blah blah blah"
);
But the syntax is a little awkward and I'm not in love with it yet.
Anyhow, the point is that your existing class might not be much more efficient than the framework API, but if you extend it to satisfy all of your personal string-formatting needs, you might end up with a very convenient library in the end. Personally, I use my own version of this library for dynamically constructing all SQL strings, error messages, and localization strings. It's enormously useful.
It seems to me that in order to get actual performance improvement, you'd need to factor out any format analysis done by your customFormatter and formattable arguments into a function that returns some data structure that tells a later formatting call what to do. Then you pull those data structures in your constructor and store them for later use. Presumably this would involve extending ICustomFormatter and IFormattable. Seems kinda unlikely.
Have you accounted for the time to do the JIT compile as well? After all, the framework will be ngen'd, which could account for the difference.
The framework provides explicit overrides to the format methods that take fixed-sized parameter lists instead of the params object[] approach to remove the overhead of allocating and collecting all of the temporary object arrays. You might want to consider that for your code as well. Also, providing strongly-typed overloads for common value types would reduce boxing overhead.
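A hedged sketch of that fixed-arity idea (illustrative names, not from the poster's class): generic overloads for small argument counts avoid allocating the params object[] on every call, and defer any boxing of value types until ToString() is actually invoked.

```csharp
static class Fmt
{
    // params overload: every call allocates an object[]
    // (and boxes any value-type arguments).
    public static string Join(params object[] args)
    {
        return string.Concat(args);
    }

    // Fixed-arity generic overloads: overload resolution prefers these
    // for one or two arguments, so no array is allocated.
    public static string Join<T0>(T0 a0)
    {
        return a0 == null ? "" : a0.ToString();
    }

    public static string Join<T0, T1>(T0 a0, T1 a1)
    {
        return (a0 == null ? "" : a0.ToString())
             + (a1 == null ? "" : a1.ToString());
    }
}
```

Calls with one or two arguments bind to the generic overloads; only calls with three or more fall back to the array-allocating params version.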
I gotta believe that spending as much time optimizing data IO would earn exponentially bigger returns!
This is surely a kissin' cousin to YAGNI for this. Avoid Premature Optimization. APO.
