auto detect tag within a text - c#

Is there a library or algorithm that can automatically detect tags in a text (ignoring the common words of the chosen language)?
Something like this:
string[] keywords = GetKeyword("Your order is num #0123456789");
and keywords would contain "order" and "#0123456789"?
Does such a thing exist, or does the user have to select all the tags for every document by hand every time?

foreach(string keyword in keywords) { // where keywords is a List<string>
if ("Your order is num #0123456789".Contains(keyword)) {
keywordsPresent.Add(keyword); // where keywordsPresent is a List<string>
}
}
return keywordsPresent;
Note that the code above does not handle your #0123456789 case; for that, add some more logic, such as finding the index of the # character.

Sorry, I misunderstood the question. If you want to look for specific words, the algorithm will depend on your strings. For example, you can use Split() to generate an array of words from a string and then work with that, like this:
string[] words = "Your order is num #0123456789".Split(' ');
string orderNumber = "";
// Requires using System.Linq; for Contains, Any and FirstOrDefault.
if (words.Contains("order") && words.Any(w => w.StartsWith("#")))
{
    orderNumber = words.FirstOrDefault(w => w.StartsWith("#"));
}
This first generates an array of words from "Your order is num #0123456789"; then, if the array contains the word "order", it finds the first word that starts with "#" and selects it.

I think a lot of different algorithms can be used. Some of them are simple, others are very complex. I can suggest the following basic approach:
Split the whole text into an array of words.
Remove stop words from the array. (Search for "stop words list" to get a full list of stop words.)
Walk through the array and count the occurrences of each word.
Sort the words by their 'weight' (count) in the array.
Choose the required number of tags from the top.
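The steps above can be sketched roughly as follows. This is only a minimal illustration: the TagExtractor class, the GetKeywords method and the tiny stop-word list are assumptions made for the example (including "num", which is not a usual stop word), and a real list would be much longer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TagExtractor
{
    // Tiny illustrative stop-word list (an assumption); a real one would be much longer.
    private static readonly HashSet<string> StopWords = new HashSet<string>(
        new[] { "your", "is", "the", "a", "an", "of", "and", "to", "in", "num" },
        StringComparer.OrdinalIgnoreCase);

    public static List<string> GetKeywords(string text, int maxTags)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };
        return text
            .Split(separators, StringSplitOptions.RemoveEmptyEntries) // 1. split into words
            .Where(w => !StopWords.Contains(w))                       // 2. drop stop words
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)        // 3. count each word
            .OrderByDescending(g => g.Count())                        // 4. sort by weight
            .Take(maxTags)                                            // 5. keep the top N
            .Select(g => g.Key)
            .ToList();
    }
}
```

With this stop-word list, GetKeywords("Your order is num #0123456789", 5) yields "order" and "#0123456789". Words with equal counts keep their order of first appearance, because the LINQ sort is stable.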

Related

c# storing richtextbox user input into string array elements with defined length

I have a richtextbox named rtbDisclaimer. It's set to limit user input to 500 characters. I have been trying to figure out a way to take whatever the user types into the rtb and break it into an array of strings, with each string element limited to a certain character length, such as 100. However, if the 100th character is in the middle of a word, I don't want the element to include that word; it should go into the next element so that words are not cut in half. Basically, if a user types in 500 characters, the whole thing would be broken into 5 or 6 string array elements, each limited to 100 characters maximum, without cutting any words in half. I've searched around and haven't been able to find anything that would work, and can't quite figure out how to tackle this problem. Also, the user input can be any length up to 500, so this needs to be flexible enough to allow for an input of anywhere from 0 to 500 characters.
Ex:
Not this:
[0]This is the user input typed into the ri
[1]chtextbox as it would appear in the ar
[2]ray elements
~~~~~~~~~~~~~~~~~~~~~~~~
This:
[0]This is the user input typed into the
[1]richtextbox as it would appear in the
[2]array elements
Something like this maybe?
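var longText = "this morning, bla blabla";
var words = longText.Split(' '); // split on space
var sentences = new List<string>();
var currentSentence = string.Empty;
foreach (var word in words)
{
    // The +1 accounts for the space that separates the word from the text before it.
    if (currentSentence.Length == 0 || currentSentence.Length + 1 + word.Length <= 100)
    {
        currentSentence += (currentSentence.Length > 0 ? " " : "") + word;
    }
    else
    {
        // The word does not fit: close the current element and start a new one.
        sentences.Add(currentSentence);
        currentSentence = word;
    }
}
// Don't forget the last, partially filled element.
if (currentSentence.Length > 0)
{
    sentences.Add(currentSentence);
}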

How to search for specific char in an array and how to then manipulate that index in c#

Okay, so I'm creating a hangman game and everything functions so far, including what I'm TRYING to do in the question.
But it feels like there is a much more efficient method of obtaining the char that is also easier to manipulate the index.
protected static void alphabetSelector(string activeWordAlphabet)
{
char[] activeWord = activeWordAlphabet.ToCharArray();
string activeWordString = new string(activeWord);
Console.WriteLine("If you'd like to guess a letter, enter the letter. \n" +
"If you'd like to guess the word, please type in the word. --- testing answer{0}",
activeWordString);
//Console.WriteLine("For Testing Purposes ONLY");
String chosenLetter = Console.ReadLine();
//Char[] letterFinder = Array.FindAll(activeWord, s => s.Equals(chosenLetter));
//string activeWordString = new string(activeWord);
foreach (char letter in activeWord)
{
if(activeWordString.Contains(chosenLetter))
{
Console.WriteLine("{0}", activeWordString);
Console.ReadLine();
}
else
{
Console.WriteLine("errrr...wrong!");
Console.ReadLine();
}
}
}
I have broken up the code in some areas to prevent the reader from having to scroll sideways. If this is bothersome, please let me know and I'll leave it in the future.
So this code will successfully print out the 'word' whenever I select the correct letter from the random word (I have the console print the actual word so that I can test it successfully each time). It will also print 'wrong' when I choose a letter NOT in the string.
But I feel like I should be able to use the
Array.FindAll(activeWord, ...)
functionality or some other way. But every time I try and reorder the arguments, it gives me all kinds of different errors and tells me to redo my arguments.
So, if you can look at this and find an easier method of searching the actual array for the user-selected 'letter', please help!! Even if it's not using the Array.FindAll method!!
Edit
Okay, it seems like there's some confusion with what I've done and why I've done it.
I'm ONLY printing the word inside that 'if' statement to test and make sure that the foreach{if{}} will actually work to find the char inside the string. But I ultimately need to be able to provide a placeholder for a char that is successfully found, as well as being able to 'cross out' the letter (from the alphabet list not shown here).
It's hangman - surely you guys know what I'm needing it to do. It has to keep track of which letters are left in the word, which letters have been chosen, as well as which letters are left in the entire alphabet.
I'm a 4-day old newb when it comes to programming, so please. . . I'm only doing what I know to do and when I get errors, I comment things out and write more until I find something that works.
Take a look at this demo I put together for you: https://dotnetfiddle.net/eP9TQM
I'd suggest creating a second string for the display string. Use a StringBuilder, and you can replace the characters in it at specific indices while creating the fewest string objects in the process.
string word = "your word or phrase here";
//Initialize a new StringBuilder that will display the word with placeholders.
StringBuilder display = new StringBuilder(word.Length); //You know the display word is the same length as the original word
display.Append('-', word.Length); //Fill it with placeholders.
So now you have your phrase/word, and a string builder full of characters that need to be discovered.
Go ahead and convert the display StringBuilder to a string that you can check on each pass to see if it equals your word:
var displayString = display.ToString();
//Loop until the display string is equal to the word
while (!displayString.Equals(word))
{
//Inside here your logic will follow.
}
So you are basically looping until the person answers here. You could of course go back and add logic to limit the number of attempts, or whatever you desire as an alternate exit strategy.
Inside this logic, you will check if they guessed a letter or a word based on how many characters they entered.
If they guessed a word, the logic is simple. Check if the guessed word is the same as the hidden word. If it is, then you break the loop and they are done. Otherwise, guessing loops back around.
If they guessed a letter, the logic is pretty straightforward, but more involved.
First get the character they guessed, just because it may be easier to work with this way.
char guess = input[0];
Now, look over the word for instances of that character:
//Look for instances of the character in the word.
for (int i = 0; i < word.Length; ++i)
{
//If the current index in the word matches their guess, then update the display.
if (char.ToUpperInvariant(word[i]) == char.ToUpperInvariant(guess))
display[i] = word[i];
}
The comments above should explain the idea here.
Update your displayString at the bottom of the loop so that it will check against the hidden word again:
displayString = display.ToString();
That's really all you need to do here. No fancy Linq needed.
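As a rough sketch, the letter check from this answer can be wrapped in a small helper that also reports how many positions were revealed, so the caller can tell a wrong guess (zero hits) from a right one. The Hangman class and the Reveal method are hypothetical names for this example, not part of the linked demo.

```csharp
using System;
using System.Text;

public static class Hangman
{
    // Copies every occurrence of the guessed letter from word into display
    // (case-insensitive) and returns the number of positions revealed.
    public static int Reveal(string word, StringBuilder display, char guess)
    {
        int hits = 0;
        for (int i = 0; i < word.Length; ++i)
        {
            if (char.ToUpperInvariant(word[i]) == char.ToUpperInvariant(guess))
            {
                display[i] = word[i];
                hits++;
            }
        }
        return hits;
    }
}
```

For example, with a display of six placeholders created by new StringBuilder().Append('-', 6), calling Hangman.Reveal("FooBar", display, 'o') returns 2 and leaves the display as "-oo---".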
Ok your code is really confusing, even with your edit.
First, why do you need these 2 lines of code, since activeWordAlphabet is already a string?
char[] activeWord = activeWordAlphabet.ToCharArray();
string activeWordString = new string(activeWord);
Then you do your foreach.
For the word "FooBar", if the player types 'F', you will print
FooBar
FooBar
FooBar
FooBar
FooBar
FooBar
How does this help you in any way?
I think you need to review your algorithm. The string type has the function you need:
int chosenLetterPosition = activeWordAlphabet.IndexOf(chosenLetter, alreadyFoundPosition);
alreadyFoundPosition is an int giving the index from which the function starts searching for the letter.
IndexOf() returns -1 if the letter is not found, or its zero-based index otherwise.
You can save this position together with the letter in a dictionary, and use it as your new alreadyFoundPosition if the chosenLetter is already in the dictionary.
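A minimal sketch of the IndexOf idea: keep calling IndexOf, each time starting one position past the previous hit, until it returns -1. The LetterSearch class and FindAllPositions method are illustrative names only, and the search here is case-sensitive.

```csharp
using System;
using System.Collections.Generic;

public static class LetterSearch
{
    // Returns the zero-based index of every occurrence of letter in word.
    public static List<int> FindAllPositions(string word, char letter)
    {
        var positions = new List<int>();
        int start = 0;
        while (start < word.Length)
        {
            int found = word.IndexOf(letter, start);
            if (found == -1) break;   // no further occurrences
            positions.Add(found);
            start = found + 1;        // continue just after the last hit
        }
        return positions;
    }
}
```

An empty result means the guess was wrong; otherwise the returned indices are exactly the placeholder positions to fill in.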
This is my answer. Because I don't have a lot of tasks today :)
class Letter
{
public bool ischosen { get; set; }
public char value { get; set; }
}
class LetterList
{
public LetterList(string word)
{
_lst = new List<Letter>();
word.ToList().ForEach(x => _lst.Add(new Letter() { value = x }));
}
public bool FindLetter(char letter)
{
var search = _lst.Where(x => x.value == letter).ToList();
search.ForEach(x=>x.ischosen=true);
return search.Count > 0;
}
public string NotChosen()
{
var res = "";
_lst.Where(x => !x.ischosen).ToList().ForEach(x => { res += x.value; });
return res;
}
List<Letter> _lst;
}
How to use
var abc = new LetterList("abcdefghijklmnopqrstuvwxyz");
var answer = new LetterList("myanswer");
Console.WriteLine("This my question. Why? write your answer please");
char x = Console.ReadLine()[0];
if (answer.FindLetter(x))
{
Console.WriteLine("you are right!");
}
else
{
Console.WriteLine("fail");
}
abc.FindLetter(x);
Console.WriteLine("not chosen abc:{0} answer:{1}", abc.NotChosen(), answer.NotChosen());
At least we used to play this game like that when I was a child.

Extracting values from a string in C#

I have the following string, from which I would like to retrieve some values:
============================
Control 127232:
map #;-
============================
Control 127235:
map $;NULL
============================
Control 127236:
I want to take only the Control numbers. Is there a way to extract them from the string above into an array like [127232, 127235, 127236]?
One way of achieving this is with regular expressions, which does introduce some complexity but will give the answer you want with a little LINQ for good measure.
Start with a regular expression to capture, within a group, the data you want:
var regex = new Regex(@"Control\s+(\d+):");
This will look for the literal string "Control" followed by one or more whitespace characters, followed by one or more digits (within a capture group), followed by a literal ":".
Then capture matches from your input using the regular expression defined above:
var matches = regex.Matches(inputString);
Then, using a bit of LINQ, you can turn this into an array:
var arr = matches.OfType<Match>()
.Select(m => long.Parse(m.Groups[1].Value))
.ToArray();
Now arr is an array of longs containing just the numbers.
Live example here: http://rextester.com/rundotnet?code=ZCMH97137
Try this (assuming your string is named s and each line ends with \n):
List<string> ret = new List<string>();
foreach (string t in s.Split('\n').Where(p => p.StartsWith("Control")))
ret.Add(t.Replace("Control ", "").Replace(":", ""));
The ret.Add(...) part is not elegant, but it works.
EDITED:
If you want an array, use string[] arr = ret.ToArray();
SYNOPSIS:
I see you're really a newbie, so I'll try to explain:
s.Split('\n') creates a string[] (every line in your string)
.Where(...) part extracts from the array only strings starting with Control
foreach part navigates through returned array taking one string at a time
t.Replace(..) cuts unwanted string out
ret.Add(...) finally adds searched items into returning list
Off the top of my head try this (it's quick and dirty), assuming the text you want to search is in the variable 'text':
List<string> numbers = System.Text.RegularExpressions.Regex.Split(text, "[^\\d+]").ToList();
numbers.RemoveAll(item => item == "");
The first line splits out all the numbers into separate items in a list, but it also produces lots of empty strings. The second line removes the empty strings, leaving you with a list of the three numbers. If you want to convert that back to an array, just add the following line to the end:
var numberArray = numbers.ToArray();
Yes, there is a way. I can't recall a simple built-in for it, so the string has to be parsed to extract these values. The algorithm is as follows:
Find the word "Control" in the string, and where it ends.
Find the group of digits after the word.
Extract the number with int.Parse or int.TryParse.
If you are not at the end of the string, go back to step one.
Implementing this algorithm is fairly trivial.
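The steps above could look roughly like this. It is a sketch under the assumption that each number follows the word "Control" after optional whitespace; the ControlParser class and ExtractControlNumbers method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ControlParser
{
    public static List<int> ExtractControlNumbers(string text)
    {
        const string keyword = "Control";
        var numbers = new List<int>();
        int pos = 0;
        // Step 1: find the next occurrence of the keyword.
        while ((pos = text.IndexOf(keyword, pos, StringComparison.Ordinal)) != -1)
        {
            pos += keyword.Length; // move to the end of the keyword
            while (pos < text.Length && char.IsWhiteSpace(text[pos])) pos++;

            // Step 2: collect the run of digits that follows.
            int start = pos;
            while (pos < text.Length && char.IsDigit(text[pos])) pos++;

            // Step 3: convert it; TryParse skips values that do not fit in an int.
            int number;
            if (pos > start && int.TryParse(text.Substring(start, pos - start), out number))
            {
                numbers.Add(number);
            }
            // Step 4: loop again from the current position until no keyword is left.
        }
        return numbers;
    }
}
```

Calling ToArray() on the result gives the array form asked for in the question.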
This is the simplest implementation (your string is str):
int i, number, index = 0;
while ((index = str.IndexOf(':', index)) != -1)
{
i = index - 1;
while (i >= 0 && char.IsDigit(str[i])) i--;
if (++i < index)
{
number = int.Parse(str.Substring(i, index - i));
Console.WriteLine("Number: " + number);
}
index ++;
}
Using LINQ for such a little operation is doubtful.

Replace Bad words using Regex

I am trying to create a bad word filter method that I can call before every insert and update to check the string for any bad words and replace with "[Censored]".
I have an SQL table which has a list of bad words. I want to fetch them into a List or string array, check the string of text that has been passed in, replace any bad words found, and return the filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - " +
"the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
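On the "cached and compiled" point, one possible variation is to build a single compiled regex for the whole word list once and reuse it for every insert and update. This is only a sketch: the WordFilter class is a hypothetical name, it assumes the list contains at least one word, and it is just as easy to subvert as the version above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class WordFilter
{
    private readonly Regex _matcher;

    // Build the pattern once; assumes badWords contains at least one word.
    public WordFilter(IEnumerable<string> badWords)
    {
        // Regex.Escape keeps characters such as '.' or '+' inside a word
        // from being interpreted as pattern syntax.
        string alternatives = string.Join("|", badWords.Select(w => Regex.Escape(w)));
        _matcher = new Regex(@"\b(" + alternatives + @")s?\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }

    // Replaces whole-word matches (and simple plurals) with the marker.
    public string Censor(string input)
    {
        return _matcher.Replace(input, "[Censored]");
    }
}
```

The filter would typically be created once, after loading the words from the table, and kept around, since compiling a regex has a noticeable one-off cost.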
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"s with the Turkish lower-case undotted "ı". Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad words into a string List or string array and use System.String.Replace on your incoming message.
Maybe better, use the System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
// StringSplitOptions.None keeps empty entries, so a bad word at the start or end, or two in a row, is still replaced.
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.None);
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check, badWords is a string array that you load from your bad word SQL table, and cleanString is your result.
You can use the string.Replace() method or the Regex class.
There is also a nice article about it, which can be found here
With a little html-parsing skills, you can get a large list with swear words from noswear

.NET String parsing performance improvement - Possible Code Smell

The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace("  ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return phrase.Split(' ');
// Or use a regex, such as:
// return Regex.Split(phrase, @"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I guess your code is not doing what you want it to do anyway. "moderated" would be converted to "d", if I'm right. To get a good solution you have to specify your requirements in a bit more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
@"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace("  ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)
I would create a hash table of the words to remove and parse each word; if it is in the hash table, remove it. That needs only one pass through the array, and I believe creating the hash table is O(n).
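A minimal sketch of the hash-based idea, flushing whole words into a StringBuilder in a single pass. The StopWordFilter class and RemoveWords method are illustrative names, and splitting on single spaces only is an assumption; punctuation attached to a word would prevent a match.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class StopWordFilter
{
    public static string RemoveWords(string phrase, IEnumerable<string> wordsToRemove)
    {
        // Building the set is O(n) in the number of words; each lookup is O(1) on average.
        var lookup = new HashSet<string>(wordsToRemove, StringComparer.OrdinalIgnoreCase);
        var result = new StringBuilder(phrase.Length);

        // One pass over the words of the phrase.
        foreach (string word in phrase.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (lookup.Contains(word)) continue;       // whole-word match only
            if (result.Length > 0) result.Append(' ');
            result.Append(word);
        }
        return result.ToString();
    }
}
```

Because only whole words are compared, "moderated" is left alone even when "moderate" is in the list, and runs of spaces collapse to single spaces as a side effect.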
How does this look?
foreach (string nonEssent in nonessentials)
{
    // Strings are immutable, so the result of Replace must be assigned back.
    phrase = phrase.Replace(nonEssent, String.Empty);
}
phrase = phrase.Replace("  ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
