I have a requirement to contract a string such as...
"Would you consider becoming a robot? You would be provided with a free annual oil change."
...to something much shorter but yet still humanly identifiable (it will need to be found from a select list - my current solution has users entering an arbitrary title for the sole purpose of selection)
I would like to extract only the portion of the string which forms a question (if possible) and then somehow reduce it to something like
WouldConsiderBecomingRobot
Are there any grammatical algorithms out there that might help me with this? I'm thinking there might be something that allows me to pick out just verbs and nouns.
As this is just to act as a key it doesn't have to be perfect; I'm not seeking to trivialise the inherent complexity of the English language.
Probably too simplistic, but I might be tempted to start with a list of "filler words":
var fillers = new[]{"you","I","am","the","a","are"};
Then extract everything before a question mark (using regex, string mashing, whatever you fancy), yielding you "Would you consider becoming a robot".
Then go through the string removing every word considered a filler.
var sentence = "Would you consider becoming a robot";
var newSentence = String.Join("",sentence.Split(" ").Where(w => !fillers.Contains(w)).ToArray());
// newSentence is "Wouldconsiderbecomingrobot".
Pascal casing each word would result in your desired string - I'll leave that as an exercise for the reader.
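For what it's worth, the whole pipeline including the Pascal-casing step could look roughly like the sketch below. It assumes fillers should be matched case-insensitively and that only the text before the first question mark matters; the names `TitleContractor` and `ToKey` are made up for illustration and the filler list is just the one from above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TitleContractor
{
    // Illustrative filler list; extend it to suit your own titles.
    private static readonly HashSet<string> Fillers = new HashSet<string>(
        new[] { "you", "I", "am", "the", "a", "are" },
        StringComparer.OrdinalIgnoreCase);

    public static string ToKey(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // Keep only the text before the first question mark, if there is one.
        int q = text.IndexOf('?');
        string question = q >= 0 ? text.Substring(0, q) : text;

        var words = question
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(w => !Fillers.Contains(w))
            // Pascal-case: upper-case the first letter, keep the rest as typed.
            .Select(w => char.ToUpperInvariant(w[0]) + w.Substring(1));

        return string.Concat(words);
    }
}
```

Calling `TitleContractor.ToKey(...)` on the robot sentence from the question should give `WouldConsiderBecomingRobot`. Punctuation inside the question (commas, quotes) is not stripped here, so you would probably want to remove that first.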
Create a popular social media website. When users want to join or post comments, prompt them to solve a captcha. The captcha will consist of matching your shortened versions of the long strings to their full versions. Your shortening algorithm will be based on a neural net or genetic algorithm which is trained from the captcha results.
You can also sell advertising on the website.
I ended up creating the following extension method which does work surprisingly well. Thanks to Joe Blow for his excellent and effective suggestions:
public static string Contract(this string e, int maxLength)
{
if(e == null) return e;
int questionMarkIndex = e.IndexOf('?');
if (questionMarkIndex == -1)
questionMarkIndex = e.Length - 1;
int lastPeriodIndex = e.LastIndexOf('.', questionMarkIndex, 0);
string question = e.Substring(lastPeriodIndex != -1 ? lastPeriodIndex : 0, questionMarkIndex + 1).Trim();
var punctuation =
new [] {",", ".", "!", ";", ":", "/", "...", "...,", "-,", "(", ")", "{", "}", "[", "]","'","\""};
question = punctuation.Aggregate(question, (current, t) => current.Replace(t, ""));
IDictionary<string, bool> words = question.Split(' ').ToDictionary(x => x, x => false);
string mash = string.Empty;
while (words.Any(x => !x.Value) && mash.Length < maxLength)
{
int maxWordLength = words.Where(x => !x.Value).Max(x => x.Key.Length);
var pair = words.Where(x => !x.Value).Last(x => x.Key.Length == maxWordLength);
words.Remove(pair);
words.Add(new KeyValuePair<string, bool>(pair.Key, true));
mash = string.Join("", words.Where(x => x.Value)
.Select(x => x.Key.Capitalize())
.ToArray()
);
}
return mash;
}
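With maxLength set to 15, this contracts the following (words are added until the result reaches 15 characters, so the output can run slightly over):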
This does not have any prereqs - write an essay...: PrereqsWriteEssay
You've selected a car: YouveSelectedCar
I don't think there is any algorithm that can identify whether each word of a string is a noun, adjective or whatever. The only solution would be to use a custom dictionary: just create a list of words that can't be identified as verbs or nouns (I, you, they, them, his, hers, of, a, the etc.).
Then you just have to keep all the words before the question mark that are not in the list.
It is just a workaround and, as you said, it is not perfect.
Hope this helps!
Welcome to the wonderful world of natural language processing. If you want to identify nouns and verbs, you will need a part of speech tagger.
Related
I am learning .NET and C# on my own.
How can I find whether a given word exists in a string and, if it does, count how many times it is repeated in that string? And if the word is misspelled, how can I detect that and report it as misspelled?
We can do this with collections or LINQ in C#, but here I used the String class and its Contains method, and I am stuck after that.
If this can be done with LINQ, how?
LINQ works with collections, right?
You need a list in order to use LINQ.
But here we are working with a string (a paragraph).
How can LINQ be used to find a word in a paragraph?
Kindly help.
Here is what I have tried so far.
string str = "Education is a ray of light in the darkness. It certainly is a hope for a good life. Eudcation is a basic right of every Human on this Planet. To deny this right is evil. Uneducated youth is the worst thing for Humanity. Above all, the governments of all countries must ensure to spread Education";
if (str.Contains("Education"))
{
Console.WriteLine("found");
}
else
{
Console.WriteLine("not found");
}
You can make a string a string[] by splitting it by a character/string. Then you can use LINQ:
if(str.Split().Contains("makes"))
{
// note that the default Split without arguments also includes tabs and new-lines
}
If you don't care whether it is a word or just a sub-string, you can use str.Contains("makes") directly.
If you want to compare in a case insensitive way, use the overload of Contains:
if(str.Split().Contains("makes", StringComparer.InvariantCultureIgnoreCase)){}
string str = "money makes many makes things";
var strArray = str.Split(" ");
var count = strArray.Count(x => x == "makes");
The simplest way is to use the Split method to split the string into an array of words.
Here is an example:
var words = str.Split(' ');
if(words.Length > 0)
{
foreach(var word in words)
{
if(word.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1)
{
Console.WriteLine("found");
}
else
{
Console.WriteLine("not found");
}
}
}
Now, since you just want the count of word occurrences, you can use LINQ to do that in a single line like this:
var totalOccurrences = str.Split(' ').Count(x=> x.IndexOf("makes", StringComparison.InvariantCultureIgnoreCase) != -1);
Note that StringComparison.InvariantCultureIgnoreCase is required if you want a case-insensitive comparison.
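One caveat with the snippets above: splitting on spaces leaves punctuation attached ("darkness." is not equal to "darkness"), and IndexOf also matches inside longer words. A minimal sketch of a whole-word, case-insensitive count, assuming a "word" means a run of letters, digits or underscores; the `WordCounter` and `CountWord` names are hypothetical:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCounter
{
    // Counts whole-word, case-insensitive occurrences of `word` in `text`.
    public static int CountWord(string text, string word)
    {
        return Regex.Matches(text, @"\w+")   // runs of letters, digits and underscores
            .Cast<Match>()
            .Count(m => string.Equals(m.Value, word, StringComparison.OrdinalIgnoreCase));
    }
}
```

For example, `WordCounter.CountWord("It is a hope. Hope, hopeful, HOPE!", "hope")` counts "hope", "Hope" and "HOPE" but not "hopeful". It does not attempt to detect misspellings such as "Eudcation"; that would need some form of fuzzy matching on top.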
I am trying to create a bad word filter method that I can call before every insert and update to check the string for any bad words and replace with "[Censored]".
I have an SQL table which has a list of bad words. I want to bring them back, add them to a List or string array, check through the string of text that has been passed in and, if any bad words are found, replace them and return the filtered string.
I am using C# for this.
Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:
http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html
Update
Obviously not foolproof (see article above - this approach is so easy to get around or produce false positives...) or optimized (the regular expressions should be cached and compiled), but the following will filter out whole words (no "clbuttics") and simple plurals of words:
const string CensoredText = "[Censored]";
const string PatternTemplate = @"\b({0})(s?)\b";
const RegexOptions Options = RegexOptions.IgnoreCase;
string[] badWords = new[] { "cranberrying", "chuffing", "ass" };
IEnumerable<Regex> badWordMatchers = badWords.
Select(x => new Regex(string.Format(PatternTemplate, x), Options));
string input = "I've had no cranberrying sleep for chuffing chuffings days - " +
"the next door neighbour is playing classical music at full tilt!";
string output = badWordMatchers.
Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));
Console.WriteLine(output);
Gives the output:
I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!
Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
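As mentioned, the regular expressions should really be cached and compiled. One way that might be done, sketched here under the assumption that the word list is loaded once and reused, is to build a single alternation and keep it in a field. The `BadWordFilter` class name is made up; `Regex.Escape` and `RegexOptions.Compiled` are standard .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public sealed class BadWordFilter
{
    private const string CensoredText = "[Censored]";
    private readonly Regex _matcher; // null when the word list is empty

    public BadWordFilter(IEnumerable<string> badWords)
    {
        // Escape each word so regex metacharacters in the list are taken literally.
        var escaped = badWords.Select(w => Regex.Escape(w)).ToList();
        if (escaped.Count > 0)
        {
            // One pattern for all words: whole word, optional plural "s".
            _matcher = new Regex(
                @"\b(?:" + string.Join("|", escaped) + @")s?\b",
                RegexOptions.IgnoreCase | RegexOptions.Compiled);
        }
    }

    public string Filter(string input)
    {
        return _matcher == null ? input : _matcher.Replace(input, CensoredText);
    }
}
```

You would create one `BadWordFilter` from the SQL word list at start-up and call `Filter` before each insert or update. It has the same weaknesses as the version above and is just as easy to get around.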
Update 2
And to demonstrate a flavour of how this (and in general basic string\pattern matching techniques) can be easily subverted, see the following string:
"I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"
I have replaced the "i"s with Turkish lower case undotted "ı"s. Still looks pretty offensive!
Although I'm a big fan of Regex, I think it won't help you here. You should fetch your bad word into a string List or string Array and use System.String.Replace on your incoming message.
Maybe better, use System.String.Split and .Join methods:
string mayContainBadWords = "... bla bla ...";
string[] badWords = new string[]{"bad", "worse", "worst"};
string[] temp = mayContainBadWords.Split(badWords, StringSplitOptions.RemoveEmptyEntries);
string cleanString = string.Join("[Censored]", temp);
In the sample, mayContainBadWords is the string you want to check; badWords is a string array, you load from your bad word sql table and cleanString is your result.
You can use the String.Replace() method or the Regex class.
There is also a nice article about it which can be found here
With a little HTML parsing, you can get a large list of swear words from noswear
The code below is designed to take a string in and remove any of a set of arbitrary words that are considered non-essential to a search phrase.
I didn't write the code, but need to incorporate it into something else. It works, and that's good, but it just feels wrong to me. However, I can't seem to get my head outside the box that this method has created to think of another approach.
Maybe I'm just making it more complicated than it needs to be, but I feel like this might be cleaner with a different technique, perhaps by using LINQ.
I would welcome any suggestions; including the suggestion that I'm over thinking it and that the existing code is perfectly clear, concise and performant.
So, here's the code:
private string RemoveNonEssentialWords(string phrase)
{
//This array is being created manually for demo purposes. In production code it's passed in from elsewhere.
string[] nonessentials = {"left", "right", "acute", "chronic", "excessive", "extensive",
"upper", "lower", "complete", "partial", "subacute", "severe",
"moderate", "total", "small", "large", "minor", "multiple", "early",
"major", "bilateral", "progressive"};
int index = -1;
for (int i = 0; i < nonessentials.Length; i++)
{
index = phrase.ToLower().IndexOf(nonessentials[i]);
while (index >= 0)
{
phrase = phrase.Remove(index, nonessentials[i].Length);
phrase = phrase.Trim().Replace("  ", " ");
index = phrase.IndexOf(nonessentials[i]);
}
}
return phrase;
}
Thanks in advance for your help.
Cheers,
Steve
This appears to be an algorithm for removing stop words from a search phrase.
Here's one thought: If this is in fact being used for a search, do you need the resulting phrase to be a perfect representation of the original (with all original whitespace intact), but with stop words removed, or can it be "close enough" so that the results are still effectively the same?
One approach would be to tokenize the phrase (using the approach of your choice - could be a regex, I'll use a simple split) and then reassemble it with the stop words removed. Example:
public static string RemoveStopWords(string phrase, IEnumerable<string> stop)
{
var tokens = Tokenize(phrase);
var filteredTokens = tokens.Where(s => !stop.Contains(s));
return string.Join(" ", filteredTokens.ToArray());
}
public static IEnumerable<string> Tokenize(string phrase)
{
return phrase.Split(' ');
// Or use a regex, such as:
// return Regex.Split(phrase, @"\W+");
}
This won't give you exactly the same result, but I'll bet that it's close enough and it will definitely run a lot more efficiently. Actual search engines use an approach similar to this, since everything is indexed and searched at the word level, not the character level.
I suspect your code is not doing what you want it to do anyway: if I'm right, "moderated" would be reduced to "d". To get a good solution you would have to specify your requirements in a bit more detail. I would probably use Replace or regular expressions.
I would use a regular expression (created inside the function) for this task. I think it would be capable of doing all the processing at once without having to make multiple passes through the string or having to create multiple intermediate strings.
private string RemoveNonEssentialWords(string phrase)
{
return Regex.Replace(phrase, // input
@"\b(" + String.Join("|", nonessentials) + @")\b", // pattern
"", // replacement
RegexOptions.IgnoreCase)
.Replace("  ", " ");
}
The \b at the beginning and end of the pattern makes sure that the match is on a boundary between alphanumeric and non-alphanumeric characters. In other words, it will not match just part of the word, like your sample code does.
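To make the difference concrete, here is a small sketch comparing plain substring removal with the \b-anchored pattern. The class and method names are made up, and the word list is a shortened, illustrative version of the one in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    private static readonly string[] Nonessentials = { "moderate", "acute", "left" };

    // Substring removal: also eats the inside of longer words.
    public static string RemoveSubstrings(string phrase)
    {
        foreach (string word in Nonessentials)
            phrase = phrase.Replace(word, "");
        return phrase;
    }

    // Whole-word removal: \b only matches at a word/non-word boundary.
    public static string RemoveWholeWords(string phrase)
    {
        string pattern =
            @"\b(?:" + string.Join("|", Nonessentials.Select(w => Regex.Escape(w))) + @")\b";
        return Regex.Replace(phrase, pattern, "", RegexOptions.IgnoreCase);
    }
}
```

With "moderated pain" as input, `RemoveSubstrings` leaves "d pain" while `RemoveWholeWords` leaves the phrase untouched, since "moderated" is a different word from "moderate".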
Yeah, that smells.
I like little state machines for parsing, they can be self-contained inside a method using lists of delegates, looping through the characters in the input and sending each one through the state functions (which I have return the next state function based on the examined character).
For performance I would flush out whole words to a string builder after I've hit a separating character and checked the word against the list (might use a hash set for that)
I would create a hash table of the words to remove, then parse the phrase word by word and drop each word that is in the hash table. That is only one pass through the array, and I believe that creating a hash table is O(n).
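A rough sketch of that hash-based idea, assuming words are separated by spaces and should be compared case-insensitively; `StopWordRemover` is a made-up name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordRemover
{
    public static string Remove(string phrase, IEnumerable<string> stopWords)
    {
        // Building the set is O(n) in the number of stop words;
        // each lookup afterwards is O(1) on average.
        var stop = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        // Single pass over the words, keeping those that are not in the set.
        var kept = phrase
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !stop.Contains(word));

        return string.Join(" ", kept);
    }
}
```

Because whole words are compared, "moderated" is left alone when "moderate" is in the list. As a side effect, repeated spaces in the input collapse to one.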
How does this look?
foreach (string nonEssent in nonessentials)
{
phrase = phrase.Replace(nonEssent, String.Empty);
}
phrase = phrase.Replace("  ", " ");
If you want to go the Regex route, you could do it like this. If you're going for speed it's worth a try and you can compare/contrast with other methods:
Start by creating a Regex from the array input. Something like:
var regexString = "\\b(" + string.Join("|", nonessentials) + ")\\b";
That will result in something like:
\b(left|right|chronic)\b
Then create a Regex object to do the find/replace:
System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(regexString, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Then you can just do a Replace like so:
string fixedPhrase = regex.Replace(phrase, "");
Is there any .Net library to remove all problematic characters of a string and only leave alphanumeric, hyphen and underscore (or similar subset) in an intelligent way? This is for using in URLs, file names, etc.
I'm looking for something similar to stringex which can do the following:
A simple prelude
"simple English".to_url =>
"simple-english"
"it's nothing at all".to_url =>
"its-nothing-at-all"
"rock & roll".to_url =>
"rock-and-roll"
Let's show off
"$12 worth of Ruby power".to_url =>
"12-dollars-worth-of-ruby-power"
"10% off if you act now".to_url =>
"10-percent-off-if-you-act-now"
You don't even wanna trust Iconv for this next part
"kick it en Français".to_url =>
"kick-it-en-francais"
"rock it Español style".to_url =>
"rock-it-espanol-style"
"tell your readers 你好".to_url =>
"tell-your-readers-ni-hao"
You can try this
string str = phrase.ToLower(); //optional
str = str.Trim();
str = Regex.Replace(str, @"[^a-z0-9\s_]", ""); // invalid chars
str = Regex.Replace(str, @"\s+", " ").Trim(); // convert multiple spaces into one space
str = str.Substring(0, str.Length <= 400 ? str.Length : 400).Trim(); // cut and trim it
str = Regex.Replace(str, @"\s", "-");
Perhaps this question here can help you on your way. It gives you the code Stack Overflow uses to generate its URLs (more specifically, how question names are turned into nice URLs).
Link to Question here, where Jeff Atwood shows their code
From your examples, the closest thing I've found (although I don't think it does everything that you're after) is:
My Favorite String Extension Methods in C#
and also:
ÜberUtils - Part 3 : Strings
Since neither of these solutions will give you exactly what you're after (going from the examples in your question) and assuming that the goal here is to make your string "safe", I'd second Hogan's advice and go with Microsoft's Anti Cross Site Scripting Library, or at least use that as a basis for something that you create yourself, perhaps deriving from the library.
Here's a link to a class that builds a number of string extension methods (like the first two examples) but leverages Microsoft's AntiXSS Library:
Extension Methods for AntiXss
Of course, you can always combine the algorithms (or similar ones) used within the AntiXSS library with the kind of algorithms that are often used in websites to generate "slug" URL's (much like Stack Overflow and many blog platforms do).
Here's an example of a good C# slug generator:
Improved C# Slug Generator
You could use HttpUtility.UrlEncode, but that would encode everything rather than replace or remove problematic characters. So your spaces would become + and ' would be encoded as well. Not a solution, but maybe a starting point.
If the goal is to make the string "safe" I recommend Microsoft's Anti-XSS library
There will be no library capable of what you want since you are stating specific rules that you want applied, e.g. $x => x-dollars, x% => x-percent. You will almost certainly have to write your own method to achieve this. It shouldn't be too difficult. A string extension method using one or more regexes for the replacements would probably be quite a nice, concise way of doing it.
e.g.
public static string ToUrl(this string text)
{
return Regex.Replace(text.Trim(), ..., ...);
}
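To flesh that out a little, here is one possible sketch of such an extension method. The specific rules ($x, x%, &, apostrophes) are taken from the examples in the question; the patterns themselves and the `SlugExtensions` class name are my own assumptions, and non-ASCII letters are simply turned into dashes rather than transliterated.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlugExtensions
{
    public static string ToUrl(this string text)
    {
        if (string.IsNullOrWhiteSpace(text)) return string.Empty;

        string s = text.Trim().ToLowerInvariant();
        s = Regex.Replace(s, @"\$(\d+)", "$1 dollars");  // "$12" -> "12 dollars"
        s = Regex.Replace(s, @"(\d+)%", "$1 percent");   // "10%" -> "10 percent"
        s = s.Replace("&", " and ");
        s = s.Replace("'", "");                          // "it's" -> "its"
        s = Regex.Replace(s, @"[^a-z0-9]+", "-");        // any other run becomes one dash
        return s.Trim('-');
    }
}
```

With this, `"$12 worth of Ruby power".ToUrl()` should come out as `12-dollars-worth-of-ruby-power` and `"rock & roll".ToUrl()` as `rock-and-roll`.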
Something the Ruby version doesn't make clear (but the original Perl version does) is that the algorithm it's using to transliterate non-Roman characters is deliberately simplistic -- "better than nothing" in both senses. For example, while it does have a limited capability to transliterate Chinese characters, this is entirely context-insensitive -- so if you feed it Japanese text then you get gibberish out.
The advantage of this simplistic nature is that it's pretty trivial to implement. You just have a big table of Unicode characters and their corresponding ASCII "equivalents". You could pull this straight from the Perl (or Ruby) source code if you decide to implement this functionality yourself.
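A table-driven transliterator along those lines could be sketched like this. The handful of entries below are purely illustrative (a real table would have thousands), and the `Transliterator` name is hypothetical; this is not the table from the Perl or Ruby source.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Transliterator
{
    // A tiny illustrative table mapping characters to ASCII "equivalents".
    private static readonly Dictionary<char, string> Table = new Dictionary<char, string>
    {
        { 'ç', "c" }, { 'Ç', "C" }, { 'ñ', "n" }, { 'Ñ', "N" },
        { 'é', "e" }, { 'ü', "u" }, { 'ß', "ss" }, { 'æ', "ae" }
    };

    public static string ToAscii(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (c < 128)
                sb.Append(c);                              // already ASCII
            else if (Table.TryGetValue(c, out string mapped))
                sb.Append(mapped);                         // table lookup
            // Characters missing from the table are dropped.
        }
        return sb.ToString();
    }
}
```

For instance, `Transliterator.ToAscii("Français")` gives "Francais". The lookup is per character with no context, which is exactly why this kind of approach produces gibberish for some languages.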
I'm using something like this in my blog.
public class Post
{
public string Subject { get; set; }
public string ResolveSubjectForUrl()
{
return Regex.Replace(Regex.Replace(this.Subject.ToLower(), "[^\\w]", "-"), "[-]{2,}", "-");
}
}
I couldn't find any library that does it, like in Ruby, so I ended up writing my own method. Here it is in case anyone cares:
/// <summary>
/// Turn a string into something that's URL and Google friendly.
/// </summary>
/// <param name="str"></param>
/// <returns></returns>
public static string ForUrl(this string str) {
return str.ForUrl(true);
}
public static string ForUrl(this string str, bool MakeLowerCase) {
// Go to lowercase.
if (MakeLowerCase) {
str = str.ToLower();
}
// Replace accented characters for the closest ones:
char[] from = "ÂÃÄÀÁÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöøùúûüýÿ".ToCharArray();
char[] to = "AAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaceeeeiiiidnoooooouuuuyy".ToCharArray();
for (int i = 0; i < from.Length; i++) {
str = str.Replace(from[i], to[i]);
}
// Thorn http://en.wikipedia.org/wiki/%C3%9E
str = str.Replace("Þ", "TH");
str = str.Replace("þ", "th");
// Eszett http://en.wikipedia.org/wiki/%C3%9F
str = str.Replace("ß", "ss");
// AE http://en.wikipedia.org/wiki/%C3%86
str = str.Replace("Æ", "AE");
str = str.Replace("æ", "ae");
// Esperanto http://en.wikipedia.org/wiki/Esperanto_orthography
from = "ĈĜĤĴŜŬĉĝĥĵŝŭ".ToCharArray();
to = "CXGXHXJXSXUXcxgxhxjxsxux".ToCharArray();
for (int i = 0; i < from.Length; i++) {
str = str.Replace(from[i].ToString(), "{0}{1}".Args(to[i*2], to[i*2+1]));
}
// Currencies.
str = new Regex(@"([¢€£\$])([0-9\.,]+)").Replace(str, @"$2 $1");
str = str.Replace("¢", "cents");
str = str.Replace("€", "euros");
str = str.Replace("£", "pounds");
str = str.Replace("$", "dollars");
// Ands
str = str.Replace("&", " and ");
// More aesthetically pleasing contractions
str = str.Replace("'", "");
str = str.Replace("’", "");
// Except alphanumeric, everything else is a dash.
str = new Regex(@"[^A-Za-z0-9-]").Replace(str, "-");
// Remove dashes at the beginning or end.
str = str.Trim("-".ToCharArray());
// Compact duplicated dashes.
str = new Regex("-+").Replace(str, "-");
// Let's url-encode just in case.
return str.UrlEncode();
}
How would you convert names to proper case in C#?
I have a list of names that I'd like to proof.
For example: mcdonalds to McDonalds or o'brien to O'Brien.
You could consider using a search engine to help you. Submit a query and see how the results have capitalized the name.
I wrote the following extension methods. Feel free to use them.
public static class StringExtensions
{
public static string ToProperCase( this string original )
{
if( original.IsNullOrEmpty() )
return original;
string result = _properNameRx.Replace( original.ToLower( CultureInfo.CurrentCulture ), HandleWord );
return result;
}
public static string WordToProperCase( this string word )
{
if( word.IsNullOrEmpty() )
return word;
if( word.Length > 1 )
return Char.ToUpper( word[0], CultureInfo.CurrentCulture ) + word.Substring( 1 );
return word.ToUpper( CultureInfo.CurrentCulture );
}
private static readonly Regex _properNameRx = new Regex( @"\b(\w+)\b" );
private static readonly string[] _prefixes = { "mc" };
private static string HandleWord( Match m )
{
string word = m.Groups[1].Value;
foreach( string prefix in _prefixes )
{
if( word.StartsWith( prefix, StringComparison.CurrentCultureIgnoreCase ) )
return prefix.WordToProperCase() + word.Substring( prefix.Length ).WordToProperCase();
}
return word.WordToProperCase();
}
}
There is absolutely no way for a computer just to magically know that the first "D" in "McDonalds" should be capitalized. So, I think there are two choices.
Someone out there may have a piece of software or a library that will do this for you.
Barring that, your only choice is to take the following approach: First, I'd look up the name in a dictionary of words that have "interesting" capitalization. Obviously you'd have to provide this dictionary yourself, unless one exists already. Second, apply an algorithm that corrects some of the obvious ones, like Celtic names beginning with O' and Mac and Mc, although given a large enough pool of names, such an algorithm will undoubtedly have a lot of false positives. Lastly, capitalize the first letter of every name that doesn't meet the first two criteria.
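The three steps could be sketched roughly as follows. The dictionary entries and prefix list are placeholders I made up to show the shape of the approach, as is the `NameCasing` class name; real data would come from a curated source.

```csharp
using System;
using System.Collections.Generic;

public static class NameCasing
{
    // Step 1: hypothetical dictionary of names with "interesting" capitalization.
    private static readonly Dictionary<string, string> Known =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "macy", "Macy" },              // would otherwise become "MacY"
            { "de la cruz", "de la Cruz" }
        };

    // Step 2: prefixes after which the next letter is usually capitalized.
    private static readonly string[] Prefixes = { "O'", "Mac", "Mc" };

    public static string ToProperName(string name)
    {
        if (string.IsNullOrEmpty(name)) return name;
        if (Known.TryGetValue(name, out string known)) return known;

        string lower = name.ToLowerInvariant();
        foreach (string prefix in Prefixes)
        {
            if (lower.Length > prefix.Length &&
                lower.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            {
                return prefix + Capitalize(lower.Substring(prefix.Length));
            }
        }

        // Step 3: fall back to capitalizing the first letter only.
        return Capitalize(lower);
    }

    private static string Capitalize(string s)
    {
        return char.ToUpperInvariant(s[0]) + s.Substring(1);
    }
}
```

Here `NameCasing.ToProperName("mcdonalds")` yields "McDonalds" and `"o'brien"` yields "O'Brien", while "macy" is caught by the dictionary before the "Mac" rule can produce a false positive.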
The hard part of this is the algorithms to decide on the capitalization. The string manipulation itself is pretty easy. There isn't a perfect way, since there are no "rules" for cases. One strategy might be a set of rules, such as "capitalize the first letter...usually" and "capitalize the 3rd letter if the first two letters are mc...usually"
Starting with a dictionary of real names and comparing them to your own name for matches will help. You could also take a dictionary of real names, generate a Markov chain from it, and throw any new names at the Markov chain to determine the capitalization. That's a crazy, complicated solution.
The ultimate perfect solution is to use humans to correct the data.
Doing this requires that your program be able to interpret the English language to an extent. At the very least it must be able to break a string down into a set of words. There is no API built into the .NET Framework that can achieve this.
However if there was, you could use the following code.
public string ProperCase(string str, Func<string,bool> isWord) {
var word = new StringBuilder();
var cur = new StringBuilder();
for ( var i = 0; i < str.Length; i++ ) {
cur.Append(cur.Length == 0 ? Char.ToUpper(str[i]) : str[i]);
if ( isWord(cur.ToString()) ) {
word.Append(cur.ToString());
cur.Length = 0;
}
}
if ( cur.Length > 0 ) {
word.Append(cur);
}
return word.ToString();
}
It's not a perfect solution but it gives you a general idea of the outline
You could check the lower/mixed case surname against a dictionary (file) that has the correct casings in it, then return the 'real' value from the dictionary.
I had a quick google to see if one exists, but to no avail!
I'm planning on writing such a function, but will probably not go into too many edge cases... Below in pseudo-code with regex for matching...
start with /\b[A-Z]+\b/ as set matching, so each sequence of letters up against a word boundary, match as a set.
if the string is all uppercase...
lower-case the string
upper-case the first letter
do the following beginning of string replacements
Vanb -> VanB
Vanh -> VanH
Mc? -> Mc? (uppercase wildcard character)
Mac[^kh] -> Mac? (uppercase wildcard match)
With the replaced whole-name string do matching against other replacement sets like...
"De La " -> "de la "
That should catch most cases for names in particular... but a nice database of common name casing would be very nice.
Here was my solution. This hard-codes the names into the program but with a little work you could keep a text file outside of the program and read in the name exceptions (i.e. Van, Mc, Mac) and loop through them.
public static String toProperName(String name)
{
if (name != null)
{
if (name.Length >= 2 && name.ToLower().Substring(0, 2) == "mc") // Changes mcdonald to "McDonald"
return "Mc" + Regex.Replace(name.ToLower().Substring(2), @"\b[a-z]", m => m.Value.ToUpper());
if (name.Length >= 3 && name.ToLower().Substring(0, 3) == "van") // Changes vanwinkle to "VanWinkle"
return "Van" + Regex.Replace(name.ToLower().Substring(3), @"\b[a-z]", m => m.Value.ToUpper());
return Regex.Replace(name.ToLower(), @"\b[a-z]", m => m.Value.ToUpper()); // Changes to title case but also fixes
// apostrophes like O'HARE or o'hare to O'Hare
}
return "";
}
CultureInfo cultureInfo = Thread.CurrentThread.CurrentCulture;
TextInfo textInfo = cultureInfo.TextInfo;
string txt = textInfo.ToTitleCase("texthere");