I have a list of 200+ words that are not allowed on a website. The string.Replace call below takes ~80ms. If I increase s < 1000 by a factor of 10 to s < 10000, the delay grows to ~834ms, a 10.43× increase. I am worried about the scalability of this function, especially if the list grows. I was told strings are immutable and that text.Replace() is creating 200 new strings in memory. Is there something similar to a StringBuilder for this?
List<string> FilteredWords = new List<string>();
FilteredWords.Add("RED");
FilteredWords.Add("GREEN");
FilteredWords.Add("BLACK");
for (int i = 1; i < 200; i++)
{ FilteredWords.Add("STRING " + i.ToString()); }
string text = "";
//simulate a large dynamically generated html page
for (int s = 1; s < 1000; s++)
{ text += #"Lorem ipsum dolor sit amet, minim BLACK cetero cu nam.
No vix platonem sententiae, pro wisi congue graecis id, GREEN assum interesset in vix.
Eum tamquam RED pertinacia ex."; }
// This is the function I seek to optimize
foreach (string s in FilteredWords)
{ text = text.Replace(s, "[REMOVED]"); }
Use StringBuilder.Replace and try to do it as a batch operation. That is to say, create the StringBuilder only once, since it has some overhead, and run all the replacements against it. It won't necessarily be a lot faster, but it will be much more memory efficient.
You should also probably do this sanitization only once, instead of every time the data is requested. If you're reading the data from a database, consider sanitizing it once when the data is inserted, so there is less work to do when reading and displaying it on the page.
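A minimal sketch of the batch version, reusing the FilteredWords list and text string from the question (assumes using System.Text;):
// One StringBuilder for all 200+ replacements; only one final string is built.
StringBuilder sb = new StringBuilder(text);
foreach (string word in FilteredWords)
{
    sb.Replace(word, "[REMOVED]");
}
text = sb.ToString();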
If you expect most of the text to be relatively clean, then scanning the whole text first for matching words could be a better approach. You can also normalize the text's words at the same time to catch some standard substitutions.
I.e. scan the string by matching individual words (e.g. with a regular expression like "\w+"), then for each detected word look up the (potentially normalized) value in a dictionary of words to replace.
You can either simply scan first to get the list of "words to replace" and then replace each individual word later, or scan and build the resulting string at the same time (using StringBuilder or a StreamWriter, obviously not String.Concat / +).
Note: Unicode provides a large number of look-alike characters, so don't expect your effort to be entirely successful. I.e. try to find "cool" in the following text: "you are сооl" (some of those letters are Cyrillic look-alikes).
Sample code (relying on Regex.Replace for tokenization and building the string, and a HashSet for the matches):
var toFind = new HashSet<string>(FilteredWords);
text = new Regex(@"\w+")
    .Replace(text, m => toFind.Contains(m.Value) ? "[REMOVED]" : m.Value);
There may be a better way, but this is how I would go about solving the problem.
You will need to create a tree structure that contains your dictionary of words to be replaced. The class may be something like:
public class Node
{
public Dictionary<char, Node> Children;
public bool IsWord;
}
Using a dictionary for the Children may not be the best choice, but it provides the easiest example here. Also, you will need a constructor to initialize the Children field. The IsWord field is used to deal with the possibility that a redacted "word" may be the prefix of another redacted "word". For example, if you want to remove both "red" and "redress".
You will build the tree from each character in each of the replacement words. For example:
public void AddWord ( string word )
{
// NOTE: this assumes word is non-null and contains at least one character...
Node currentNode = Root;
for (int iIndex = 0; iIndex < word.Length; iIndex++)
{
if (currentNode.Children.ContainsKey(word[iIndex]))
{
currentNode = currentNode.Children[word[iIndex]];
continue;
}
Node newNode = new Node();
currentNode.Children.Add(word[iIndex], newNode);
currentNode = newNode;
}
// finished, mark the last node as being a complete word..
currentNode.IsWord = true;
}
You'll need to deal with case sensitivity somewhere in there. Also, you only need to build the tree once, afterwards you can use it from any number of threads without worrying about locking because you will be only reading from it. (Basically, I'm saying: store it in a static somewhere.)
Now, when you are ready to remove words from your string you will need to do the following:
Create a StringBuilder instance to store the result
Parse through your source string, looking for the start and stop of a "word". How you define "word" will matter. For simplicity I would suggest starting with Char.IsWhiteSpace as defining word separators.
Once you have determined that a range of characters is a "word", start from the root of the tree and locate the child node associated with the first character in the "word".
If you do not find a child node, the entire word is added to the StringBuilder
If you find a child node, you continue with the next character matching against Children of the current node, until you either run out of characters or out of nodes.
If you reach the end of the "word", check the last node's IsWord field. If true the word is excluded, do not add it to the StringBuilder. If IsWord is false, the word is not replaced and you add it to the StringBuilder
Repeat until you have exhausted the input string.
You will also need to add word separators to the StringBuilder, hopefully that will be obvious as you parse the input string. If you are careful to only use the start and stop indices within the input string, you should be able to parse the entire string without creating any garbage strings.
When all of this is done, use StringBuilder.ToString() to get your final result.
You may also need to consider Unicode surrogate codepoints, but you can probably get away without worrying about it.
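Putting the scanning steps together, a rough sketch of the filtering pass (assuming the Node class above, a populated static Root, and Char.IsWhiteSpace as the separator test; matched words are simply dropped here, but you could append "[REMOVED]" instead):
public static string RemoveWords(string input)
{
    StringBuilder result = new StringBuilder(input.Length);
    int i = 0;
    while (i < input.Length)
    {
        // Copy separators straight through.
        if (char.IsWhiteSpace(input[i])) { result.Append(input[i]); i++; continue; }
        // Find the end of the current "word".
        int start = i;
        while (i < input.Length && !char.IsWhiteSpace(input[i])) i++;
        // Walk the tree over the word's characters; node goes null on a dead end.
        Node node = Root;
        for (int k = start; node != null && k < i; k++)
        {
            Node child;
            node = node.Children.TryGetValue(input[k], out child) ? child : null;
        }
        // Only a full match that ends on a complete word is excluded.
        if (node == null || !node.IsWord)
            result.Append(input, start, i - start);
    }
    return result.ToString();
}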
Beware, I typed this code here directly, so syntax errors, typos and other accidental misdirections are probably included.
The real regular expression solution would be:
var filteredWord = new Regex(@"\b(?:" + string.Join("|", FilteredWords.Select(Regex.Escape)) + @")\b", RegexOptions.Compiled);
text = filteredWord.Replace(text, "[REMOVED]");
I don’t know whether this is faster (but note that it also only replaces whole words).
I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?
A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = #"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split( new [] {"{"} , StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
var fields = row.Split(new [] { "}"}, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**
This regular expression should get you going:
Match a literal {
Match 1 or more digits ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values themselves cannot contain an opening curly brace. If they can, you need to be more clever, e.g. @"\{\d+\}(?:\\{|[^{])+" (there are likely better ways).
Create a Regex instance and have it match against the text. Each "field" will be a separate match
var text = #"{123}abc{456}xyz";
var regex = new Regex(#"\{\d+\}[^{]+", RegexOptions.Compiled);
foreach (var match in regex.Matches(text)) {
Console.WriteLine(match.Groups[0].Value);
}
This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.
Here is how to do it in LINQ without a slow regex:
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
x.Split('{',StringSplitOptions.RemoveEmptyEntries)
.Aggregate(new List<Tuple<string, string>>(),
(l, z) => { var az = z.Split('}');
l.Add(new Tuple<string, string>(az[0], az[1]));
return l; });
LINQPad output: the list of (field, value) tuples, matching the Split() answer above.
I was doing a small 'scalable' C# MVC project, with quite a bit of read/write to a database.
From this, I would need to add/remove the first letter of the input string.
'Removing' the first character is quite easy using the Substring method, something like:
String test = "HHello world";
test = test.Substring(1,test.Length-1);
'Adding' a character efficiently seems to be messy/awkward:
String test = "ello World";
test = "H" + test;
Seeing as this will be done for a lot of records, would this be the most efficient way of doing these operations?
I am also testing whether a string starts with the letter 'T', and adding 'T' if it doesn't, by:
String test = "Hello World";
if(test[0]!='T')
{
test = "T" + test;
}
and would like to know if this approach is suitable.
If you have several records, and to a field of each of them you need to prepend a character, you can use String.Insert with an index of 0: http://msdn.microsoft.com/it-it/library/system.string.insert(v=vs.110).aspx
yourString = yourString.Insert(0, "C");
This will do pretty much the same as what you wrote in your original post, but since it seems you prefer to use a method and not an operator...
If you have to append a character several times to a single string, then you're better off using a StringBuilder: http://msdn.microsoft.com/it-it/library/system.text.stringbuilder(v=vs.110).aspx
Both are equally efficient, I think, since both require a new string to be initialized, string being immutable.
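For example, a small sketch of the StringBuilder variant (assumes using System.Text;):
// Repeated prepends: one StringBuilder, many Inserts, one final string.
var sb = new StringBuilder("ello World");
sb.Insert(0, 'H');   // "Hello World"
sb.Insert(0, 'T');   // "THello World"
string result = sb.ToString();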
When doing this on the same string multiple times, a StringBuilder might come in handy for the adding; it will perform better than repeated string concatenation.
You could also opt to move this operation to the database side if possible. That might increase performance too.
For removing, I would use the Remove method, as it doesn't require knowing the length of the string:
test = test.Remove(0, 1);
You could also treat the string as an array for the Add and use
test = test.Insert(0, "H");
If you are always removing and then adding a character, you can treat the string as a char array and just replace the first character:
var chars = test.ToCharArray(); chars[0] = 'H';
test = new string(chars);
When doing lots of operations to the same string I would use a StringBuilder though, more expensive to create but faster operations on the string.
I have an HTML document and want to filter it against the occurrence of multiple keywords (1k at the moment, later on up to 10k).
I have a precompiled regex which stores my search terms, like:
static Regex r = new Regex(@"keyword1|keyword2|keyword999", RegexOptions.Compiled | RegexOptions.IgnoreCase);
This is my code:
Stopwatch sw = new Stopwatch();
sw.Start();
MatchCollection matches = Cache.r.Matches(doc.DocumentNode.InnerHtml);
string s = "";
if (matches.Count > 0)
{
foreach (Match m in matches)
{
s += m.Value + ",";
}
}
long time = sw.ElapsedMilliseconds;
Console.Write(time + " = "+matches.Count+" -> "+s );
The average run takes about 5-8 seconds, which is way too much.
Is there an efficient way to filter an HTML document against a lot of keywords?
Or maybe there are more efficient algorithms for this...
As lboshuizen pointed out
Creating a regex with 10k keywords seems not the way to go [...]
If you can afford spawning multiple threads, you can scan the document in parallel for occurrences of keywords:
IEnumerable<string> keywords = LoadKeywords();
// Grab the HTML once instead of re-reading the property per keyword.
string html = doc.DocumentNode.InnerHtml;
List<string> found = keywords.AsParallel()
    .Where(keyword => html.Contains(keyword))
    .ToList();
You should use a StringBuilder instead of string concatenation when building s.
Unless you tell us more about what the keywords are, there is hardly any other optimization to suggest.
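A minimal sketch for the loop in the question:
// Appending to a StringBuilder avoids allocating a new string per match.
var sb = new StringBuilder();
foreach (Match m in matches)
{
    sb.Append(m.Value).Append(',');
}
string s = sb.ToString();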
Some of the answers are already pretty good, but I figured I'd throw this in as well ...
I've done the same thing and I used the HTML Agility Pack to help cut down on what I was analyzing for keywords.
http://htmlagilitypack.codeplex.com/
It's very easy to take an HTML fragment, search only for textual nodes and then run your keyword analysis over that space instead of the entire document.
Also it helps get rid of false positives (keywords appearing in javascript comments, alt tags, whatever else).
Just an idea to try and trim down your search space.
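A rough sketch of that idea (assumes the HtmlAgilityPack NuGet package; the XPath selects only text nodes outside scripts):
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var textNodes = htmlDoc.DocumentNode.SelectNodes("//text()[not(ancestor::script)]");
var sb = new StringBuilder();
if (textNodes != null)
    foreach (var node in textNodes)
        sb.AppendLine(node.InnerText);
// Run the keyword scan over this instead of the raw HTML.
string searchSpace = sb.ToString();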
Suggestion:
Creating a regex with 10k keywords seems not the way to go from my POV. A regex is greedy and will try all kinds of redundant matches (= wasting time).
Build regexes with smaller keyword sets instead and run them incrementally over your HTML document.
An optimization can be to remove the matched keywords (and related content) from the document; the document will shrink, and the remaining regexes have much less to do == run faster.
Or
Turn it around; don't use a regex to scan against the document.
Break the document down into words and check them against a dictionary. I doubt that the document will contain all 10k words. (Looping from the smallest set is more efficient than from the largest set.)
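A minimal sketch of that reversed approach (assumes the keywords are already loaded; Regex.Matches does the word-splitting):
var keywordSet = new HashSet<string>(keywords, StringComparer.OrdinalIgnoreCase);
var found = new HashSet<string>();
// Tokenize the document into words and look each one up in the set.
foreach (Match m in Regex.Matches(doc.DocumentNode.InnerHtml, @"\w+"))
{
    if (keywordSet.Contains(m.Value))
        found.Add(m.Value);
}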
Using C# (VS 2010 Express) I read the contents of a text file into a string. The string is rather long but reliably broken up by "\t" for tabs and "\r\n" for carriage returns/newlines.
The tabs indicate a new column of data, and new line indicates a new row of data.
I want to create an array or List of dimensions (X)(Y) such that each spot in the array can hold one row of data from the text file, along with all of the Y columns contained in that row.
To make things simple let's say my text has 10 rows of data, and 2 columns. I'd like to create an array or List or whatever you think is best to store the data. How do I do this? Thanks.
This is the code that I used to read the data in the text file into a string:
// Read the file as one string.
System.IO.StreamReader myFile = new System.IO.StreamReader("f:\\data.txt");
string myString = myFile.ReadToEnd();
Just as is (you already have a string with everything):
str.Split(new string[]{"\r\n"}, StringSplitOptions.None)
.Select(s => s.Split('\t'));
This gives you an IEnumerable<string[]>; producing variants like list-of-lists or array-of-arrays just needs the suitable ToArray() or ToList() calls.
However, if you can deal with each line one at a time, you can be better off with something that lets you do so:
public IEnumerable<string[]> ReadTSV(TextReader tr)
{
using(tr)
for(string line = tr.ReadLine(); line != null; line = tr.ReadLine())
yield return line.Split('\t');
}
Then you only use as much memory as each line needs. We could go further and change the reading to emit each individual cell one at a time, but this is normally enough to read files of several hundred MB in size, with reasonable efficiency.
Edit based on comments on question:
If you really wanted to, you could get a List<string[]> from:
var myFile = new StreamReader("f:\\data.txt");
var list = ReadTSV(myFile).ToList();
Alternatively, change the line yield return line.Split('\t'); to yield return line.Split('\t').ToList(); (and the method's return type to IEnumerable<List<string>>) and you can get a List<List<string>>.
However, if possible then work on the results directly, rather than putting it into a list first:
var myFile = new StreamReader("f:\\data.txt");
var chunks = ReadTSV(myFile);
foreach(var chunk in chunks)
{
DoSometingOnAChunk(chunk[0], chunk[1]);
}
It'll use less memory, and get started faster rather than pausing to read the whole thing first. Code like this can merrily work its way through gigabytes without complaint.
String.Split
http://msdn.microsoft.com/en-us/library/system.string.split.aspx
File.ReadLines(sourceFilePath)
.Select(line => line.Split('\t'))
.ToArray();
This will read the file and create a list of string arrays for you
List<string[]> rows= File.ReadLines("PathToFile")
.Select(line=>line.Split('\t')).ToList();
If you want the string[][] version, simply use ToArray() instead of ToList() at the end.
The TextFieldParser is a fantastic class for dealing with text-based delimited files. You can provide it a file and a delimiter (in this case "\t") and it will provide a method to get the next line of values (as a string array).
It has advantages over a simple Split in the general case as it can handle comments, quoted fields, escaped delimiters, etc. You may or may not have such cases, but having all of those awkward edge cases handled pretty much for free is rather nice.
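A short sketch (TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace, so reference the Microsoft.VisualBasic assembly; the path is the one from the question):
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser("f:\\data.txt"))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters("\t");
    while (!parser.EndOfData)
    {
        string[] fields = parser.ReadFields(); // one row of tab-separated values
        // process fields[0], fields[1], ...
    }
}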
var result = contents.Split("\r\n".ToArray(), StringSplitOptions.RemoveEmptyEntries)
    .Select(s => s.Split('\t').ToList())
    .ToList();
result will be a List<List<String>>.
If a have a string with words and no spaces, how should I parse those words given that I have a dictionary/list that contains those words?
For example, if my string is "thisisastringwithwords" how could I use a dictionary to create an output "this is a string with words"?
I hear that using the trie data structure could help, but maybe someone could help with the pseudocode? For example, I was thinking that maybe you could index the dictionary into a trie structure, then follow each char down the trie; the problem is, I'm unfamiliar with how to do this in (pseudo)code.
I'm assuming that you want an efficient solution, not the obvious one where you repeatedly check if your text starts with a dictionary word.
If the dictionary is small enough, I think you could try and modify the standard KMP algorithm. Basically, build a finite-state machine on your dictionary which consumes the text character by character and yields the constructed words.
EDIT: It appeared that I was reinventing tries.
I already did something similar. You cannot use a simple dictionary lookup alone; the result will be messy. It also depends on whether you only have to do this once or throughout a whole program.
My solution was to:
Connect to a database of working words from a dictionary list (for example an online dictionary)
Filter long and short words in the dictionary and check if you want to trim anything (for example, don't use words with only one character, like 'I')
Start with short words and compare your bigString against the database dictionary.
Now you need to create a "table of possibility", because a lot of words can fit 100% but still be wrong. The longer the word, the more sure you are that it is the right one.
It is CPU intensive, but it can produce precise results.
So let's say you are using a small dictionary of 10,000 words and 3,000 of them have a length of 8 characters: you need to compare your bigString at the start with all 3,000 words, and only if a result is found is it allowed to proceed to the next word. If you have 2,000 characters in your bigString you need about (2,000 chars / 8 average chars) = 250 full comparison loops at minimum.
For me, I also built in a small verification for misspelled words in the comparison.
example of procedure (don't copy paste)
Dim bigString As String = "helloworld.thisisastackoverflowtest!"
Dim dictionary As New List(Of String) 'contains the original words. lets make it case insensitive
dictionary.Add("Hello")
dictionary.Add("World")
dictionary.Add("this")
dictionary.Add("is")
dictionary.Add("a")
dictionary.Add("stack")
dictionary.Add("over")
dictionary.Add("flow")
dictionary.Add("stackoverflow")
dictionary.Add("test")
dictionary.Add("!")
For Each word As String In dictionary
If word.Length < 1 Then dictionary.Remove(word) 'remove short words (will not work with for each in real)
word = word.ToLower 'make it case insensitive
Next
Dim ResultComparer As New Dictionary(Of String, Double) 'String is the dictionary word. Double is a value as percent for a own function to weight result
Dim i As Integer = 0 'start at the beginning
Dim Found As Boolean = False
Do
For Each word In dictionary
If bigString.IndexOf(word, i) >= 0 Then 'note: >= 0, a match can start at position 0
ResultComparer.Add(word, MyWeightOfWord) 'add the word if found, long words are better and will increase the weight value
Found = True
End If
Next
If Found = True Then
i += ResultComparer(BestWordWithBestWeight).Length
Else
i += 1
End If
Loop
I told you that it seems like an impossible task. But you can have a look at this related SO question - it may help you.
If you are sure you have all the words of the phrase in the dictionary, you can use this algorithm:
String phrase = "thisisastringwithwords";
String fullPhrase = "";
Set<String> myDictionary;
do {
foreach(item in myDictionary){
if(phrase.startsWith(item)){
fullPhrase += item + " ";
phrase = phrase.substring(item.length()); // strip the matched prefix
break;
}
}
} while(phrase.length != 0);
There are complications; for example, some items may start the same way, so the code would need to change to use some tree search, a BST or so.
This is the exact problem one has when trying to programmatically parse languages like Chinese where there are no spaces between words. One method that works with those languages is to start by splitting text on punctuation. This gives you phrases. Next you iterate over the phrases and try to break them into words starting with the length of the longest word in your dictionary. Let's say that length is 13 characters. Take the first 13 characters from the phrase and see if it is in your dictionary. If so, take it as a correct word for now, move forward in the phrase and repeat. Otherwise, shorten your substring to 12 characters, then 11 characters, etc.
This works extremely well, but not perfectly because we've accidentally put in a bias towards words that come first. One way to remove this bias and double check your result is to repeat the process starting at the end of the phrase. If you get the same word breaks you can probably call it good. If not, you have an overlapping word segment. For example, when you parse your sample phrase starting at the end you might get (backwards for emphasis)
words with string a Isis th
At first, the word Isis (Egyptian Goddess) appears to be the correct word. When you find that "th" is not in your dictionary, however, you know there is a word segmentation problem nearby. Resolve this by going with the forward segmentation result "this is" for the non-aligned sequence "thisis" since both words are in the dictionary.
A less common variant of this problem is when adjacent words share a sequence which could go either way. If you had a sequence like "archand" (to make something up), should it be "arc hand" or "arch and"? The way to determine is to apply a grammar checker to the results. This should be done to the whole text anyway.
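A minimal C# sketch of the forward longest-match-first pass (assumes a HashSet<string> dictionary and a known maximum word length; the backward pass is symmetric):
static List<string> GreedySegment(string phrase, HashSet<string> dict, int maxLen)
{
    var words = new List<string>();
    int pos = 0;
    while (pos < phrase.Length)
    {
        // Try the longest candidate first, shrinking until a dictionary hit.
        int len = Math.Min(maxLen, phrase.Length - pos);
        while (len > 1 && !dict.Contains(phrase.Substring(pos, len)))
            len--;
        // If nothing matched, len falls through to 1 and we emit a single char,
        // which flags a segmentation problem nearby.
        words.Add(phrase.Substring(pos, len));
        pos += len;
    }
    return words;
}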
Ok, I will make a hand-wavy attempt at this. The perfect(ish) data structure for your problem is, as you've said, a trie made up of the words in the dictionary. A trie is best visualised as a DFA, a nice state machine where you go from one state to the next on every new character. This is really easy to do in code; a Java(ish)-style class for this would be:
class State
{
    String matchedWord;
    Map<Character, State> mapChildren;
}
From hereon, building the trie is easy. It's like having a rooted tree structure with each node having multiple children. Each child is visited on one character transition. The use of a HashMap kind of structure trims down the time to look up character-to-next-State mappings. Alternately, if all you have are the 26 characters of the alphabet, a fixed-size array of 26 would do the trick as well.
Now, assuming all of that made sense, you have a trie, but your problem still isn't fully solved. This is where you start doing things like regular expression engines do: walk down the trie, keep track of states which match a whole word in the dictionary (that's what the matchedWord field in the State structure is for), and use some backtracking logic to jump to a previous match state if the current trail hits a dead end. I know it's general, but given the trie structure the rest is fairly straightforward.
If you have a dictionary of words and need a quick implementation, this can be solved efficiently with dynamic programming in O(n^2) time, assuming the dictionary lookups are O(1). Below is some C# code; the substring extraction and dictionary lookup could be improved.
public static String[] StringToWords(String str, HashSet<string> words)
{
    // bps[j] = start index of the last valid word ending at position j (-1 = none)
    int[] bps = new int[str.Length + 1];
    for (int i = 0; i < bps.Length; i++)
        bps[i] = -1;
    for (int i = 0; i < str.Length; i++)
    {
        // Only extend from positions reachable from the start of the string,
        // so every backpointer chain leads all the way back to 0.
        if (i > 0 && bps[i] == -1)
            continue;
        for (int j = i + 1; j <= str.Length; j++)
        {
            if (bps[j] == -1)
            {
                // Destination cell doesn't have a valid backpointer yet;
                // try with the current substring.
                String s = str.Substring(i, j - i);
                if (words.Contains(s))
                    bps[j] = i;
            }
        }
    }
    // Backtrack to recover the sequence, then reverse it.
    List<String> seg = new List<string>();
    for (int bp = str.Length; bps[bp] != -1; bp = bps[bp])
        seg.Add(str.Substring(bps[bp], bp - bps[bp]));
    seg.Reverse();
    return seg.ToArray();
}
Building a HashSet with the word list from /usr/share/dict/words and testing with
foreach (var s in StringSplitter.StringToWords("thisisastringwithwords", dict))
Console.WriteLine(s);
I get the output "t hi sis a string with words". As others have pointed out, this algorithm will return a valid segmentation (if one exists); however, it may not be the segmentation you expect. The presence of short words reduces the segmentation quality; you might be able to add a heuristic to favour longer words if two valid sub-segmentations reach the same element.
There are more sophisticated methods, like finite state machines and language models, that can generate multiple segmentations and apply probabilistic ranking.