How to replace various string occurrences in multiple strings the fastest way - C#

I have a common problem for which I haven't found a proper solution. I have multiple XML strings with a specific tag (e.g. MIME_SOURCE), and I don't know which XML string contains which value, but I have to replace all occurrences.
On the other hand, I have a dictionary whose keys are all possible values from the XML and whose values are the strings to replace them with. As I said, I don't know what to replace in which XML.
E.g.
Part of first XML
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
</MIME>
Part of second XML:
<MIME>
<MIME_SOURCE>\Web\Bilder klein\5478.jpg</MIME_SOURCE>
</MIME>
Dictionary looks like:
{"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"}
{"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"}
{"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"}
{"\Web\Bilder klein\5478.jpg", "5478.jpg"}
My main problem: if I iterate through the dictionary for each XML string, the effort is the count of XML strings multiplied by the count of dictionary entries (n*m). That is really bad in my case, as there can be around a million XML strings and at least thousands of dictionary entries.
Currently I'm using string.Replace for each key of the dictionary for each XML.
Do you have a good idea how to speed up this process?
Edit:
I've changed the code to the following:
var regex = new Regex(@"<MIME_SOURCE>[\s\S]*?</MIME_SOURCE>");
foreach (Match match in regex.Matches(stringForXml))
{
DoReplacements...
}
This fits the requirements for now, as the replacement is only done for each MIME_SOURCE in the XML. But I will also have a look at the mentioned algorithm.

The most correct way is to properly parse your XML. Then you can go through it in a single pass:
var xml = @"<root>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg</MIME_SOURCE>
</MIME>
<MIME>
<MIME_SOURCE>\Web\Bilder Groß\icon_top.jpg</MIME_SOURCE>
</MIME>
</root>";
var replacements = new Dictionary<string, string>()
{
{@"\Web\Bilder Groß\1509_131_021_01.jpg", "/Web/Bilder Groß/1509_131_021_01.jpg"},
{@"\Web\Bilder Groß\1509_131_021_01_MitWasserzeichen.jpg", "/Web/Bilder Groß/1509_131_021_01_MitWasserzeichen.jpg"},
{@"\Web\Bilder Groß\icon_top.jpg", "icon_top.jpg"},
{@"\Web\Bilder klein\5478.jpg", "5478.jpg"}
};
var doc = XDocument.Parse(xml);
foreach (var source in doc.Root.Descendants("MIME_SOURCE"))
{
if (replacements.TryGetValue(source.Value, out var replacement))
{
source.Value = replacement;
}
}
var result = doc.ToString();
If you can make some assumptions about how your XML is structured (e.g. no whitespace inside the <MIME_SOURCE> tags, no attributes, etc.), then you can use a regex, again allowing you to make a single pass:
var result = Regex.Replace(xml, @"<MIME_SOURCE>([^<]+)</MIME_SOURCE>", match =>
{
if (replacements.TryGetValue(match.Groups[1].Value, out var replacement))
{
return $"<MIME_SOURCE>{replacement}</MIME_SOURCE>";
}
return match.Value;
});
You'll have to benchmark the different approaches yourself on your own data. Use BenchmarkDotNet.
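BenchmarkDotNet is the right tool for rigorous numbers. As a quick self-contained sanity check, a Stopwatch sketch can at least confirm the naive per-key approach and the single-pass approach agree before you measure; the dictionary and XML here are toy stand-ins, not the real data:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

class ReplaceBenchmarkSketch
{
    static void Main()
    {
        // Toy stand-ins for the real dictionary and XML strings.
        var replacements = new Dictionary<string, string>
        {
            { @"\Web\a.jpg", "/Web/a.jpg" },
            { @"\Web\b.jpg", "b.jpg" }
        };
        string xml = @"<MIME><MIME_SOURCE>\Web\a.jpg</MIME_SOURCE></MIME>"
                   + @"<MIME><MIME_SOURCE>\Web\b.jpg</MIME_SOURCE></MIME>";

        // Naive approach: one full scan of the XML per dictionary entry (n*m).
        var sw = Stopwatch.StartNew();
        string naive = xml;
        foreach (var kv in replacements)
            naive = naive.Replace(kv.Key, kv.Value);
        sw.Stop();
        Console.WriteLine($"naive: {sw.Elapsed}");

        // Single-pass approach: one scan, one dictionary lookup per match.
        sw.Restart();
        string singlePass = Regex.Replace(xml, "<MIME_SOURCE>([^<]+)</MIME_SOURCE>",
            m => replacements.TryGetValue(m.Groups[1].Value, out var r)
                ? $"<MIME_SOURCE>{r}</MIME_SOURCE>"
                : m.Value);
        sw.Stop();
        Console.WriteLine($"single pass: {sw.Elapsed}");

        // Both strategies should yield the same result on this sample.
        Console.WriteLine(naive == singlePass);
    }
}
```

Stopwatch timings on toy data are only indicative; on a million real XML strings the n*m versus n+m difference dominates.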

As I already mentioned in a comment above, I used to have a similar problem (see: c# Fastest string search in all files).
Using the Aho–Corasick algorithm suggested to me in the accepted answer, I was able to run the string search fast enough for my problem (going from minutes of execution time to mere seconds).
An implementation of said algorithm can be found here.
Here is a little sample of how to use the implementation linked above (looking for some needles in a haystack):
static bool AnyViaAhoCorasick(string[] needles, string haystack)
{
    // Build the trie once, then scan the haystack in a single pass.
    var trie = new Trie();
    trie.Add(needles);
    trie.Build();
    return trie.Find(haystack).Any(); // requires using System.Linq;
}

Related

How to make xml to csv parsing/conversion faster?

I'm currently using the snippet below to convert XML data (not well formed) to CSV format, after doing some processing in between. It only converts those elements in the XML data that contain an integer from the list testList (List<int> testList), and it only converts and writes to the file once that match has been made. I need to use this algorithm for files that are several GBs in size. Currently it processes a 1 GB file in ~7.5 minutes. Can someone suggest any changes I could make to improve performance? I've fixed everything I could, but it won't get any faster. Any help will be appreciated!
Note: Message.TryParse is an external parsing method that I have to use and can't exclude or change.
Note: StreamElements is just a customized Xmlreader that improves performance.
foreach (var element in StreamElements(p, "XML"))
{
string joined = string.Concat(element.ToString().Split().Take(3)) + string.Join(" ", element.ToString().Split().Skip(3));
List<string> listX = new List<string>();
listX.Add(joined.ToString());
Message msg = null;
if (Message.TryParse(joined.ToString(), out msg))
{
var values = element.DescendantNodes().OfType<XText>()
.Select(v => Regex.Replace(v.Value, "\\s+", " "));
foreach (var val in values)
{
for (int i = 0; i < testList.Count; i++)
{
if (val.ToString().Contains("," + testList[i].ToString() + ","))
{
var line = string.Join(",", values);
sss.WriteLine(line);
}
}
}
}
}
I'm seeing some things you could probably improve:
You're calling .ToString() on joined a couple of times, when joined is already a string.
You may be able to speed up your regex replace by compiling your regex first, outside of the loop.
You're iterating over values multiple times, and each time it has to re-evaluate the LINQ that makes up the definition for values. Try using .ToList() before saving the result of that LINQ statement into values.
But before focusing on stuff like this, you really need to identify what's taking the time in your code. My guess is that it's almost all spent in these two places:
Reading from the XML stream
Writing to sss
If I'm right, then anything else you focus on is going to be premature optimization. Spend some time testing what happens if you comment out various parts of your for loop, to see where all the time is being spent.
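The first three points can be sketched in isolation. The node values below are hypothetical stand-ins for the XText nodes in the real loop; the shape of the fix is what matters:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class LoopFixesSketch
{
    // Compiled once, outside the loop, instead of re-parsing
    // the pattern string on every element.
    static readonly Regex Whitespace = new Regex(@"\s+", RegexOptions.Compiled);

    static void Main()
    {
        // Hypothetical stand-in for the XText node values of one element.
        var textNodes = new[] { "a  b\tc", "d\n e" };

        // ToList() materializes the query once; without it, every later
        // enumeration of 'values' re-runs the Select (and the regex).
        List<string> values = textNodes
            .Select(v => Whitespace.Replace(v, " "))
            .ToList();

        Console.WriteLine(string.Join(",", values));
    }
}
```

Whether these micro-fixes matter at all still depends on the profiling step described above; measure first.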

Dealing with awkward XML layout in c# using XmlTextReader

So I have an XML document I'm trying to import using XmlTextReader in C#. My code works well except for one case: where the tag is not on the same line as the actual text/content, for example with product_name:
<product>
<sku>27939</sku>
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<supplier_number>ALNN1064</supplier_number>
</product>
My code to try to sort the XML document is as such:
while (reader.Read())
{
switch (reader.Name)
{
case "sku":
newEle = new XMLElement();
newEle.SKU = reader.ReadString();
break;
case "product_name":
newEle.ProductName = reader.ReadString();
break;
case "supplier_number":
newEle.SupplierNumber = reader.ReadString();
products.Add(newEle);
break;
}
}
I have tried almost everything I found in the XmlTextReader documentation:
reader.MoveToElement();
reader.MoveToContent();
reader.MoveToNextAttribute();
and a couple of others that made less sense, but none of them seem to be able to consistently deal with this issue. Obviously I could fix this one case, but then it would break the regular cases. So my question is: after I find the "product_name" tag, is there a way to go to the next line that contains text and extract it?
I should have mentioned: I am outputting it to an HTML table afterwards, and the element is coming up blank, so I'm fairly certain it is not reading it correctly.
Thanks in advance!
I think you will find LINQ to XML easier to use:
var xDoc = XDocument.Parse(xmlstring); //or XDocument.Load(filename);
int sku = (int)xDoc.Root.Element("sku");
string name = (string)xDoc.Root.Element("product_name");
string supplier = (string)xDoc.Root.Element("supplier_number");
You can also convert your xml to dictionary
var dict = xDoc.Root.Elements()
.ToDictionary(e => e.Name.LocalName, e => (string)e);
Console.WriteLine(dict["sku"]);
It looks like you may need to remove the carriage returns, line feeds, tabs, and spaces before and after the text in the XML element. In your example, you have
<!-- 1. Original example -->
<product_name>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 2. What it should probably be; if possible, correct the XML generator. -->
<product_name>Sof-Therm Warm-Up Jacket</product_name>
<!-- 3a. If white space is important, then preserve it -->
<product_name xml:space='preserve'>
Sof-Therm Warm-Up Jacket
</product_name>
<!-- 3b. If white space is important, use CDATA -->
<product_name><![CDATA[
Sof-Therm Warm-Up Jacket
]]></product_name>
The XmlTextReader has a WhitespaceHandling property, but when I tested it, it still included the returns and indentation:
reader.WhitespaceHandling = WhitespaceHandling.None;
An option is to use a method to remove the extra characters while you are parsing the document. This method removes the normal white space at the beginning and end of a string:
string TrimCrLf(string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
// Then in your loop...
case "product_name":
// Trim the contents of the 'product_name' element to remove extra returns
newEle.ProductName = TrimCrLf(reader.ReadString());
break;
You can also use this method, TrimCrLf(), with Linq to Xml and the traditional XmlDocument. You can even make it an extension method:
public static class StringExtensions
{
public static string TrimCrLf(this string value)
{
return Regex.Replace(value, @"^[\r\n\t ]+|[\r\n\t ]+$", "");
}
}
// Use it like:
newEle.ProductName = reader.ReadString().TrimCrLf();
Regular expression explanation:
^   = beginning of field
$   = end of field
[]+ = match 1 or more of any of the contained characters
\r  = carriage return (0x0D / 13)
\n  = line feed (0x0A / 10)
\t  = tab (0x09 / 9)
' ' = space (0x20 / 32)
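For what it's worth, plain String.Trim() already strips all four of these characters (and other Unicode whitespace) from both ends, so for this input it should behave the same as the regex-based TrimCrLf; a quick check:

```csharp
using System;

class TrimDemo
{
    static void Main()
    {
        string raw = "\r\n\t  Sof-Therm Warm-Up Jacket  \r\n";
        // Trim() removes leading and trailing Unicode whitespace,
        // which covers \r, \n, \t, and spaces - the same set the regex targets.
        Console.WriteLine(raw.Trim() == "Sof-Therm Warm-Up Jacket");
    }
}
```

The regex version is still useful if you ever need to restrict trimming to exactly those four characters and nothing else.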
I have run into a similar problem before when dealing with text that originated on a Mac platform, due to reversed \r\n in newlines. I suggest you try Ryan's regex solution, but with the following regex:
"^[\r\n]+|[\r\n]+$"

Find/parse server-side <?abc?>-like tags in html document

I guess I need some regex help. I want to find all tags like <?abc?> so that I can replace them with whatever the result is of the code run inside. I just need help regexing the tag/code string, not parsing the code inside :p.
<b><?abc print 'test' ?></b> would result in <b>test</b>
Edit: Not specifically but in general, matching (<?[chars] (code group) ?>)
This will build up a new copy of the string source, replacing <?abc code?> with the result of process(code)
Regex abcTagRegex = new Regex(@"<\?abc(?<code>.*?)\?>");
StringBuilder newSource = new StringBuilder();
int curPos = 0;
foreach (Match abcTagMatch in abcTagRegex.Matches(source)) {
string code = abcTagMatch.Groups["code"].Value;
string result = process(code);
newSource.Append(source.Substring(curPos, abcTagMatch.Index - curPos)); // second arg is a length, not an index
newSource.Append(result);
curPos = abcTagMatch.Index + abcTagMatch.Length;
}
newSource.Append(source.Substring(curPos));
source = newSource.ToString();
N.B. I've not been able to test this code, so some of the functions may be slightly the wrong name, or there may be some off-by-one errors.
var regex = new Regex(@"<\?(\w+) (\w+) (.+?)\?>");
This will take this source
<b><?abc print 'test' ?></b>
and break it up like this:
Value: <?abc print 'test' ?>
SubMatch: abc
SubMatch: print
SubMatch: 'test'
These can then be sent to a method that can handle it differently depending on what the parts are.
If you need more advanced syntax handling you need to go beyond regex I believe.
I designed a template engine using Antlr but thats way more complex ;)
exp = new Regex(@"<\?abc print '(.+)' \?>");
str = exp.Replace(str, "$1");
Something like this should do the trick. Change the regexes how you see fit

Replace Bad words using Regex

I am trying to create a bad-word filter method that I can call before every insert and update, to check the string for any bad words and replace them with "[Censored]".
I have an SQL table with a list of bad words. I want to bring them back, add them to a List or string array, check through the string of text that has been passed in, and if any bad words are found, replace them and return a filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case, cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"s with Turkish lower-case undotted "ı"s. Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad words into a string List or string Array and use System.String.Replace on your incoming message.
Maybe better, use the System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.RemoveEmptyEntries);
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check; badWords is a string array, you load from your bad word sql table and cleanString is your result.
You can use the string.Replace() method or the Regex class.
There is also a nice article about it, which can be found here.
With a little HTML-parsing skill, you can get a large list of swear words from noswear.

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions, including the suggestion that I'm overthinking it and that the existing code is perfectly clear, concise, and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace("  ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return phrase.Split(' ');
// Or use a regex, such as:
// return Regex.Split(phrase, #"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway: "moderated" would be converted to "d", if I'm right. To get a good solution you have to specify your requirements in more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
@"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace("  ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
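One caveat with building the pattern by joining the array: if any word could ever contain a regex metacharacter, run each entry through Regex.Escape first. A sketch, where "c.diff" is a hypothetical list entry (not from the original array) chosen to show the problem:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class EscapeSketch
{
    static void Main()
    {
        // "c.diff" is a hypothetical entry containing a regex metacharacter.
        string[] nonessentials = { "left", "right", "c.diff" };

        // Regex.Escape keeps the "." literal; unescaped, "c.diff" would also
        // match "cXdiff", "c diff", and so on.
        string pattern = @"\b(" + string.Join("|", nonessentials.Select(Regex.Escape)) + @")\b";

        string result = Regex.Replace("left c.diff knee", pattern, "", RegexOptions.IgnoreCase)
            .Replace("  ", " ")
            .Trim();
        Console.WriteLine(result);
    }
}
```

For the word list shown in the question (letters only), escaping changes nothing, so this is purely defensive.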
Yeah, that smells.
I like little state machines for parsing. They can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance, I would flush whole words out to a StringBuilder after hitting a separating character and checking the word against the list (might use a hash set for that).
I would create a hash table of the removed words, parse each word, and if it's in the hash, remove it. That's only one pass through the array, and I believe creating a hash table is O(n).
How does this look?
foreach (string nonEssent in nonessentials)
{
    phrase = phrase.Replace(nonEssent, String.Empty); // Replace returns a new string; assign it back
}
phrase = phrase.Replace("  ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
