String subset match in C# - c#

I have an autocomplete/string-matching problem to solve. A user types an expression string into a text box, and I need to complete the last fragment of it from a list of suggestions.
More detail:
Expression textbox has string
"Buy some Al"
and client has a list of suggestions given by a server after a fuzzy match which populate a listbox
All Bran, Almonds, Alphabetti Spaghetti
now on the GUI I have a nice intellisense style autocomplete, but I need to wire up the "TAB" action to perform the complete. So if the user presses TAB and "All Bran" was the top suggestion, the string becomes
"Buy some All Bran"
i.e. the string "Al" was replaced with the top match "All Bran"
It's more than a simple string split on the expression to match the suggestions, as the expression text could be this
"Buy some All Bran and Al"
with suggestions
Alphabetti Spaghetti
In which case I'd expect the final Al to be substituted with the top match so the result becomes
"Buy some All Bran and Alphabetti Spaghetti"
I'm wondering how to do this simply in C# (Just the C# string manipulation, not GUI code) without going back to the server and asking for a substitution to be made.

You could do this with regex, but it doesn't seem necessary. The following solution assumes that the suggestion will always be preceded by a space (or start at the beginning of the sentence). If that's not the case then you'll need to share more examples to get the rules down.
string sentence = "Buy some Al";
string selection = "All Bran";
Console.WriteLine(AutoComplete(sentence, selection));
sentence = "Al";
Console.WriteLine(AutoComplete(sentence, selection));
sentence = "Buy some All Bran and Al";
selection = "Alphabetti Spaghetti";
Console.WriteLine(AutoComplete(sentence, selection));
Here is the AutoComplete method:
public string AutoComplete(string sentence, string selection)
{
if (String.IsNullOrWhiteSpace(sentence))
{
throw new ArgumentException("sentence");
}
if (String.IsNullOrWhiteSpace(selection))
{
// alternately, we could return the original sentence
throw new ArgumentException("selection");
}
// TrimEnd might not be needed depending on how your UI / suggestion works
// but in case the user can add a space at the end, and still have suggestions listed
// you would want to get the last index of a space prior to any trailing spaces
int index = sentence.TrimEnd().LastIndexOf(' ');
if (index == -1)
{
return selection;
}
return sentence.Substring(0, index + 1) + selection;
}
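One case the method above does not cover is a typed fragment that itself contains a space, e.g. "Buy some All Br" with the suggestion "All Bran" (it would give "Buy some All All Bran"). Here is a rough sketch of one way to handle that, assuming the fragment is always a case-insensitive prefix of the suggestion and starts at a word boundary. The class and method names are made up for illustration, they are not from the original answer.

```csharp
using System;

public static class AutoCompleteSketch
{
    // Hypothetical helper: replaces the longest trailing fragment of `sentence`
    // that is a case-insensitive prefix of `selection`.
    public static string CompleteByPrefix(string sentence, string selection)
    {
        if (string.IsNullOrEmpty(sentence)) return selection;

        string trimmed = sentence.TrimEnd();
        int maxLen = Math.Min(trimmed.Length, selection.Length);

        // Try the longest candidate fragment first, then shorter ones.
        for (int len = maxLen; len > 0; len--)
        {
            int start = trimmed.Length - len;

            // Assumption: a fragment must begin at the start of a word.
            bool atWordStart = start == 0 || trimmed[start - 1] == ' ';

            if (atWordStart &&
                string.Compare(trimmed, start, selection, 0, len,
                               StringComparison.OrdinalIgnoreCase) == 0)
            {
                return trimmed.Substring(0, start) + selection;
            }
        }

        // No overlap found: fall back to appending the suggestion.
        return trimmed + " " + selection;
    }
}
```

For the examples in the question this gives the same results as the LastIndexOf version; it only differs when the fragment spans more than one word or does not match the suggestion at all.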

Use string.Join(" and ", suggestions) to create your replacement string and then string.Replace() to do substitution.

You can add the listbox items in the array and while traversing through array, once a match is found, break the loop and get out of it and display the output.

Related

Case insensitive text search in C#: How can I keep the original case when highlighting matching phrases?

I have a case insensitive text search (in the controller, I'm doing .ToLower() on both sides of the comparison), and I am highlighting the search phrases in the result text like this:
@Html.Raw(searchPhrase.Length == 0
? item.Description
: (item.Description ?? "") // An item's Description could be NULL
.Replace(searchPhrase, $"<span class='highlight'>{searchPhrase}</span>"))
Items matching the search phrase are displayed, but if the case doesn't match, there won't be any highlighting.
I want matching text to be highlighted even if the case doesn't match, and I want to keep the original case.
E.g.: If I search for "Potato", both "Potato" and "potato" should be highlighted in the search result.
I have seen some similar questions around, but not for C#, and I have not been able to translate any of the solutions to C#.
Well you can't do Replace like that because you are overwriting your source data.
You either need to mark the beginning and end of each match and, starting at the end of the string and going to the beginning, insert your opening and closing brackets, or write a custom replace routine. I threw together a rather basic version of the latter, but it works. It keeps the existing values so all cases will be preserved.
string testData = "This is my fake data for matching this string";
string searchPhrase = "thiS";
string resultSet = "";
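// Walk the whole string; stopping early would drop the trailing characters from the result
for (int i = 0; i < testData.Length; i++)
{
if (i + searchPhrase.Length <= testData.Length && searchPhrase.ToLower() == testData.Substring(i, searchPhrase.Length).ToLower())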
{
resultSet += "<span class='highlight'>" + testData.Substring(i, searchPhrase.Length) + "</span>";
i += searchPhrase.Length -1;
}
else
{
resultSet += testData[i].ToString();
}
}
Console.WriteLine(resultSet);
Granted, this code could probably be made faster with string parsing but I'll leave all that to you if you want to redo it.
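For what it's worth, a faster variant of the same idea could use IndexOf with StringComparison.OrdinalIgnoreCase and a StringBuilder, so the text is copied in chunks rather than one character at a time. This is only a sketch: the Highlighter class and Highlight method are names I made up, and it does not HTML-encode the text, which you would probably want before passing it to Html.Raw.

```csharp
using System;
using System.Text;

public static class Highlighter
{
    // Hypothetical helper: wraps every case-insensitive occurrence of `phrase`
    // in a highlight span while keeping the casing of the original text.
    public static string Highlight(string text, string phrase)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(phrase))
        {
            return text ?? string.Empty;
        }

        var sb = new StringBuilder(text.Length + 64);
        int pos = 0;

        while (true)
        {
            int hit = text.IndexOf(phrase, pos, StringComparison.OrdinalIgnoreCase);
            if (hit < 0) break;

            sb.Append(text, pos, hit - pos);        // text before the match
            sb.Append("<span class='highlight'>");
            sb.Append(text, hit, phrase.Length);    // the match, original casing
            sb.Append("</span>");
            pos = hit + phrase.Length;
        }

        sb.Append(text, pos, text.Length - pos);    // remaining tail
        return sb.ToString();
    }
}
```

Ordinal comparison is used so that a match always has the same length as the search phrase; culture-sensitive comparisons do not guarantee that.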
To elaborate on @stuartd's comment, here is how to use Regex.Replace for the same thing:
var ans = searchPhrase.Length == 0
? (item.Description ?? String.Empty)
: Regex.Replace((item.Description ?? String.Empty), // An item's Description could be NULL
Regex.Escape(searchPhrase),
"<span class='highlight'>$&</span>",
RegexOptions.IgnoreCase);

How to strip a string from the point a hyphen is found within the string C#

I'm currently trying to strip a string of data that may contain a hyphen.
E.g. Basic logic:
string stringin = "test - 9894"; OR Data could be == "test";
if (string contains a hyphen "-"){
Strip stringin;
output would be "test" deleting from the hyphen.
}
Console.WriteLine(stringin);
The current C# code I'm trying to get to work is shown below:
string Details = "hsh4a - 8989";
var regexItem = new Regex("^[^-]*-?[^-]*$");
string stringin;
stringin = Details.ToString();
if (regexItem.IsMatch(stringin)) {
stringin = stringin.Substring(0, stringin.IndexOf("-") - 1); //Strip from the ending chars and - once - is hit.
}
Details = stringin;
Console.WriteLine(Details);
But it throws an error when the string does not contain a hyphen.
How about just doing this?
stringin.Split('-')[0].Trim();
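You could even cap the number of substrings using the Split overload that takes a count (a count of 2 splits only at the first hyphen; a count of 1 would return the whole string unsplit):
stringin.Split(new[] { '-' }, 2)[0].Trim();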
Your regex is asking for "zero or one repetition of -", which means that it matches even if your input does NOT contain a hyphen. Thereafter you do this
stringin.Substring(0, stringin.IndexOf("-") - 1)
Which gives an index out of range exception (There is no hyphen to find).
Make a simple change to your regex and it works with or without - ask for "one or more hyphens":
var regexItem = new Regex("^[^-]*-+[^-]*$");
here -------------------------^
It seems that you want the substring that follows the dash ('-') if the original string contains one, or the original string if it doesn't.
If that's your case:
String Details = "hsh4a - 8989";
Details = Details.Substring(Details.IndexOf('-') + 1);
I wouldn't use regex for this case if I were you, it makes the solution much more complex than it can be.
For a string that I am sure will have no more than a couple of dashes, I would use this code, because it is a one-liner and very simple:
string str= entryString.Split(new [] {'-'}, StringSplitOptions.RemoveEmptyEntries)[0];
If you know that a string might contain a large number of dashes, this approach is not recommended: it creates a large number of strings even though you only want the first one. In that case the solution would look something like this:
int firstDashIndex = entryString.IndexOf("-");
string str = firstDashIndex > -1? entryString.Substring(0, firstDashIndex) : entryString;
You don't need a regex for this. A simple IndexOf call will give you the index of the hyphen, then you can clean it up from there.
This is also a great place to start writing unit tests. They are very good for stuff like this.
Here's what the code could look like:
string inputString = "ho-something";
string outPutString = inputString;
var hyphenIndex = inputString.IndexOf('-');
if (hyphenIndex > -1)
{
outPutString = inputString.Substring(0, hyphenIndex);
}
return outPutString;
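Wrapped up as a method, with a Trim added since the question expects "test" (no trailing space) from "test - 9894", it could look roughly like this. HyphenHelper and BeforeHyphen are hypothetical names, and throwing on null is just one possible choice.

```csharp
using System;

public static class HyphenHelper
{
    // Hypothetical helper: returns the part of `input` before the first hyphen,
    // trimmed, or the whole trimmed string when there is no hyphen.
    public static string BeforeHyphen(string input)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));

        int hyphenIndex = input.IndexOf('-');
        string head = hyphenIndex >= 0 ? input.Substring(0, hyphenIndex) : input;
        return head.Trim();
    }
}
```

A method like this is easy to cover with the unit tests mentioned above: one input with a hyphen, one without, and one that starts with a hyphen.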

Search text file for text above which pattern matches input

I am trying to make my program display the text above the input text which matches a pattern I set.
For example, if user input 'FastModeIdleImmediateCount"=dword:00000000', I should get the closest HKEY above, which is [HKEY_CURRENT_CONFIG\System\CurrentControlSet\Enum\SCSI\Disk&Ven_ATA&Prod_TOSHIBA_MQ01ABD0\4&6a0976b&0&000000] for this case.
[HKEY_CURRENT_CONFIG\System\CurrentControlSet\Enum\SCSI\Disk&Ven_ATA&Prod_TOSHIBA_MQ01ABD0\4&6a0976b&0&000000]
"StandardModeIdleImmediateCount"=dword:00000000
"FastModeIdleImmediateCount"=dword:00000000
[HKEY_CURRENT_CONFIG\System\CurrentControlSet\SERVICES]
[HKEY_CURRENT_CONFIG\System\CurrentControlSet\SERVICES\TSDDD]
[HKEY_CURRENT_CONFIG\System\CurrentControlSet\SERVICES\TSDDD\DEVICE0]
"Attach.ToDesktop"=dword:00000001
Could anyone please show me how I can code something like that? I tried playing around with regular expressions to match text with bracket, but I am not sure how to make it to only search for the text above my input.
I'm assuming your file is a .txt file, although it's most probably not. But the logic is the same.
It is not hard at all, a simple for() loop would do the trick.
Code with the needed description:
string[] lines = File.ReadAllLines(@"d:\test.txt");//replace your directory. We're getting all lines from a text file.
string inputToSearchFor = "\"FastModeIdleImmediateCount\"=dword:00000000"; //that's the string to search for
int indexOfMatchingLine = Array.FindIndex(lines, line => line == inputToSearchFor); //getting the index of the line, which equals the matchcode
string nearestHotKey = String.Empty;
for(int i = indexOfMatchingLine; i >=0; i--) //looping for lines above the matched one to find the hotkey
{
if(lines[i].IndexOf("[HKEY_") == 0) //if we find a line which begins with "[HKEY_" (that means it's a hotkey, right?)
{
nearestHotKey = lines[i]; //we get the line into our hotkey string
break; //breaking the loop
}
}
if(nearestHotKey != String.Empty) //we have actually found a hotkey, so our string is not empty
{
//add code...
}
You could try to split the text into lines, find the index of the line that contains your text (whether an exact match or a regex is used doesn't matter) and then search backwards for the first key. Reversing the order of the lines first might help.
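Another way to look at it is a single forward pass that remembers the most recent "[HKEY_" line and returns it as soon as the value line is found. A rough sketch, assuming the value must match a whole line exactly (after trimming); RegFileSearch and FindOwningKey are made-up names.

```csharp
using System;
using System.Collections.Generic;

public static class RegFileSearch
{
    // Hypothetical helper: returns the closest "[HKEY_..." line above the first
    // line equal to `valueLine`, or null when the value line is not found.
    public static string FindOwningKey(IEnumerable<string> lines, string valueLine)
    {
        string currentKey = null;

        foreach (string raw in lines)
        {
            string line = raw.Trim();

            if (line.StartsWith("[HKEY_", StringComparison.Ordinal))
            {
                currentKey = line;      // remember the latest key header
            }
            else if (line == valueLine)
            {
                return currentKey;      // the key this value sits under
            }
        }

        return null;
    }
}
```

Because it takes any IEnumerable<string>, you could pass File.ReadLines(path) to avoid loading a large export into memory all at once.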

Replace Bad words using Regex

I am trying to create a bad word filter method that I can call before every insert and update to check the string for any bad words and replace them with "[Censored]".
I have an SQL table which has a list of bad words. I want to load them into a List or string array, check the string of text that has been passed in, replace any bad words that are found, and return the filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - " +
"the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
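As a sketch of the "cached and compiled" idea mentioned above, the word list could be folded into one alternation and built once. This assumes the words come from your table as plain text, so each one is passed through Regex.Escape; the Censor class and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class Censor
{
    private const string CensoredText = "[Censored]";

    // Builds one regex for the whole list: whole words only, optional plural "s".
    public static Regex BuildMatcher(IEnumerable<string> badWords)
    {
        string alternation = string.Join("|", badWords.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternation + @")s?\b",
                         RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Replaces every match with the censor marker.
    public static string Filter(string input, Regex matcher)
    {
        return matcher.Replace(input, CensoredText);
    }
}
```

You would build the matcher once (for example when the word list is loaded) and reuse it for every insert and update. It has the same weaknesses as the per-word version when it comes to look-alike characters.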
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"'s with Turkish lower case undotted "ı"'s. Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad word into a string List or string Array and use System.String.Replace on your incoming message.
Maybe better, use System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.None); // None keeps empty entries, so a bad word at the very start or end is still replaced
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check; badWords is a string array, you load from your bad word sql table and cleanString is your result.
You can use the String.Replace() method or the Regex class.
There is also a nice article about it which can be found here
With a little html-parsing skills, you can get a large list with swear words from noswear

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace("  ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return phrase.Split(' ');
// Or use a regex, such as:
// return Regex.Split(phrase, @"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d" if I'm right. To get a good solution you have to specify your requirements in a bit more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
@"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace("  ", " "); // collapse the double spaces left behind by removed words
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)
I would create a hash table of the words to remove, then parse each word of the phrase and drop it if it is in the hash. That is a single pass through the word array, and I believe building the hash table is O(n).
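A rough sketch of the hash-based idea, assuming words are separated by single or repeated spaces and that matching should ignore case. StopWords and Remove are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWords
{
    // Hypothetical helper: drops every whole word found in `stopWords`
    // (case-insensitive) and rejoins the rest with single spaces.
    public static string Remove(string phrase, IEnumerable<string> stopWords)
    {
        // Building the set is one pass over the stop words; lookups are O(1) on average.
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        var kept = phrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !stop.Contains(word));

        return string.Join(" ", kept);
    }
}
```

Because whole words are compared, "moderated" is left alone instead of being cut down to "d". Punctuation attached to a word (such as "left,") would not match here; that would need a smarter tokenizer.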
How does this look?
foreach (string nonEssent in nonessentials)
{
// Strings are immutable, so the result of Replace has to be assigned back
phrase = phrase.Replace(nonEssent, String.Empty);
}
phrase = phrase.Replace("  ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
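Putting those steps together, here is a sketch that also escapes the words (in case any contain regex metacharacters) and tidies up the whitespace left behind. The NonEssentialWords class and its member names are assumptions made for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NonEssentialWords
{
    // Matches runs of two or more whitespace characters.
    private static readonly Regex ExtraWhitespace = new Regex(@"\s{2,}");

    // Build this once and reuse it; the word list rarely changes.
    public static Regex BuildPattern(IEnumerable<string> words)
    {
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?:" + alternation + @")\b", RegexOptions.IgnoreCase);
    }

    // Removes the matched words, then collapses and trims leftover whitespace.
    public static string Remove(string phrase, Regex pattern)
    {
        string stripped = pattern.Replace(phrase, string.Empty);
        return ExtraWhitespace.Replace(stripped, " ").Trim();
    }
}
```

Whether this beats the tokenizing approach is something you would have to measure with your real phrases and word list.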
