How to implement a simple String search - c#

I want to implement a simple search in my application, based on a search query.
Let's say I have an array containing two paragraphs or articles, and I want to search these articles for subjects or keywords related to the query I enter.
For example:
//this is my search query
string mySearchQuery = "how to play with matches";
//these are my articles
string[] myarticles = new string[] {"article 1: this article will teach newbies how to start fire by playing with the awesome matches..", "article 2: this article doesn't contain anything"};
How can I get the first article based on the search query I provided above? Any idea?

This would return any string in myarticles that contains all of the words in mySearchQuery:
var tokens = mySearchQuery.Split(' ');
var matches = myarticles.Where(m => tokens.All(t => m.Contains(t)));
foreach (var match in matches)
{
    // do whatever you wish with them here
}

I'm sure you can find a nice framework for string search, because it's a wide subject with many search rules.
But for this simple sample, try splitting the search query on " ". For each word, do a simple string search in each paragraph; if you find it, add 1 point to that paragraph's match score. At the end, return the paragraph with the most points.
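The scoring idea above could be sketched roughly like this. The class and method names (ArticleSearch, Score, BestMatch) are made up for illustration, and the case-insensitive substring match is an assumption rather than part of the original suggestion:

```csharp
using System;
using System.Linq;

public static class ArticleSearch
{
    // Counts how many distinct query words occur in the article
    // (case-insensitive substring match, so "play" also hits "playing").
    public static int Score(string article, string query)
    {
        var words = query
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct(StringComparer.OrdinalIgnoreCase);
        return words.Count(w => article.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0);
    }

    // Returns the highest-scoring article, or null when no article
    // contains any of the query words. Ties go to the earlier article.
    public static string BestMatch(string[] articles, string query)
    {
        return articles
            .Select(a => new { Article = a, Points = Score(a, query) })
            .Where(x => x.Points > 0)
            .OrderByDescending(x => x.Points)
            .Select(x => x.Article)
            .FirstOrDefault();
    }
}
```

With the arrays from the question, ArticleSearch.BestMatch(myarticles, mySearchQuery) should pick the first article. Unlike the All(...) approach, an article that misses one or two words can still be returned.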

Related

Regex: Find pagenumber from partial matching urls

As we all know, regex patterns will make your stomach turn the first time you see them (or the 10th time, since you never dived in head first and truly learned them. Guilty.). I'm currently reading up on it, but since I'm on a tight deadline I'll check here to see if I can get a quicker and better answer/explanation in the meantime.
I have some url to a forum thread, and I want to scan through the html and find the last page for the thread.
So say I have one of the following urls identifying the thread in question:
https://www.somesite.com/forum/thread-93912* (absolute url to the thread)
/forum/thread-93912 (relative url to the thread)
and I want to get all values (integers) that appear directly (as the next path segment) after any of the above "partial" matches in the html-document.
So from any of the following hrefs located anywhere in the html-document (the doc is represented as a single string):
https://www.somesite.com/forum/thread-93912/34
https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs
/forum/thread-93912/34
/forum/thread-93912/34/somethingheretoo
I want to extract the number 34 (only 34), so I can parse it to int.
EDIT
Okay, to make it simpler:
Say I have all the html in htmlString, and in this string I want to find all numbers x that appear after my inputString /forum/thread-93912.
These all appear in the htmlString, and I want to extract the numbers:
thread-93912/34
thread-93912/14
thread-93912/84
thread-93912/64
thread-93912/4
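You don't need regex. Just use System.Uri.Segments (this needs an absolute url):
Uri url = new Uri("https://www.somesite.com/forum/thread-93912/34/morestuffhere/whatevs");
// Segments is { "/", "forum/", "thread-93912/", "34/", "morestuffhere/", "whatevs" }
Console.WriteLine(url.Segments[3].TrimEnd('/')); // prints 34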
\b(\d+)\b(?=[^\d]*$)
Try this. See the demo and grab the capture.
http://regex101.com/r/sU3fA2/55
using System;
using System.Text.RegularExpressions;
class Program
{
    static void Main()
    {
        Regex regex = new Regex(@"\b\d+\b(?=[^\d]*$)");
        Match match = regex.Match("/forum/thread-93912/34");
        if (match.Success)
        {
            Console.WriteLine(match.Value);
        }
    }
}
Since my question was a little hard to explain thoroughly (and since I "changed" my problem a little), I thought I'd add my own answer to show the exact code I went with (which I came up with thanks to the other answers here, so I'll give you all an upvote!).
I'm sure this can be made prettier and more compact, but I went for clarity since I'm new to regex!
First, get all strings matching the url + some number (separated with a slash "/"), then extract that number to a group called "page".
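// Regex.Escape keeps characters such as '.' in the url from being read as regex syntax
Regex regex = new Regex(Regex.Escape(urlToThread) + @"/(?<page>\d+)");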
MatchCollection matches = regex.Matches(htmlString);
Then iterate over all matches, extract the "page" value (guaranteed to consist of digits), and parse it to an integer. Add all parsed integers to a list and sort it when done. The last one will be the greatest (the last page).
List<int> pages = new List<int>();
foreach (Match match in matches)
    pages.Add(int.Parse(match.Groups["page"].Value));
pages.Sort();
// And here we get the last page
int nrOfPages = pages[pages.Count-1];
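For what it's worth, the same steps can be written more compactly with LINQ. This is only a sketch: the ThreadPages and LastPage names are made up, and returning 0 when nothing matches is an assumption (the list-based version above would throw on an empty list instead):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ThreadPages
{
    // Returns the largest page number that directly follows urlToThread + "/"
    // anywhere in html, or 0 when there is no such number.
    public static int LastPage(string html, string urlToThread)
    {
        // Escape the url so characters like '.' are matched literally.
        var regex = new Regex(Regex.Escape(urlToThread) + @"/(?<page>\d+)");
        return regex.Matches(html)
            .Cast<Match>()
            .Select(m => int.Parse(m.Groups["page"].Value))
            .DefaultIfEmpty(0)
            .Max();
    }
}
```

Passing the relative form (/forum/thread-93912) as urlToThread also covers the absolute hrefs, since the relative url is a substring of the absolute one.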

RegEx for a Glossary Function

I'm working on a web-based help system that will auto-insert links into the explanatory text, taking users to other topics in help. I have hundreds of terms that should be linked, i.e.
"Manuals and labels" (describes these concepts in general)
"Delete Manuals and Labels" (describes this specific action)
"Learn more about adding manuals and labels" (again, more specific action)
I have a RegEx to find / replace whole words (good ol' \b), which works great, except for linked terms found inside other linked terms. Instead of:
<a href="#">Learn more about manuals and labels</a>
I end up with
<a href="#">Learn more about <a href="#">manuals and labels</a></a>
Which makes everyone cry a little. Changing the order in which the terms are replaced (going shortest to longest) means that I'd get:
Learn more about <a href="#">manuals and labels</a>
Without the outer link I really need.
The further complication is that the capitalization of the search terms can vary, and I need to retain the original capitalization. If I could do something like this, I'd be all set:
Regex _regex = new Regex("\\b" + termToFind + "(|s)" + "\\b", RegexOptions.IgnoreCase);
string resultingText = _regex.Replace(textThatNeedsLinksInserted, "<a>" + "$&".Replace(" ", "_") + "</a>");
And then after all the terms are done, remove the "_", that would be perfect. "Learn_more_about_manuals_and_labels" wouldn't match "manuals and labels," and all is well.
It would be hard to have the help authors delimit the terms that need to be replaced when writing the text -- they're not used to coding. Also, this would limit the flexibility to add new terms later, since we'd have to go back and add delimiters to all the previously written text.
Is there a RegEx that would let me replace whitespace with "_" in the original match? Or is there a different solution that's eluding me?
From your examples with nested links it sounds like you're making individual passes over the terms and performing multiple Regex.Replace calls. Since you're using a regex you should let it do the heavy lifting and put a nice pattern together that makes use of alternation.
In other words, you likely want a pattern like this: \b(term1|term2|termN)\b
var input = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
var terms = new[]
{
    "Learn more about adding manuals and labels",
    "Delete Manuals and Labels",
    "manuals and labels"
};
var pattern = @"\b(" + String.Join("|", terms) + @")\b";
var replacement = @"<a href=""#"">$1</a>";
var result = Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
Console.WriteLine(result);
Now, to address the issue of a corresponding href value for each term, you can use a dictionary and change the regex to use a MatchEvaluator that will return the custom format and look up the value from the dictionary. The dictionary also ignores case by passing in StringComparer.OrdinalIgnoreCase. I tweaked the pattern slightly by adding ?: at the start of the group to make it a non-capturing group since I am no longer referring to the captured item as I did in the first example.
var terms = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    { "Learn more about adding manuals and labels", "2.html" },
    { "Delete Manuals and Labels", "3.html" },
    { "manuals and labels", "1.html" }
};
var pattern = @"\b(?:" + String.Join("|", terms.Select(t => t.Key)) + @")\b";
var result = Regex.Replace(input, pattern,
    m => String.Format(@"<a href=""{0}"">{1}</a>", terms[m.Value], m.Value),
    RegexOptions.IgnoreCase);
Console.WriteLine(result);
I would use an ordered dictionary like this, making sure the smallest term is last:
using System;
using System.Text.RegularExpressions;
using System.Collections.Specialized;
public class Test
{
    public static void Main()
    {
        OrderedDictionary Links = new OrderedDictionary();
        Links.Add("Learn more about adding manuals and labels", "2");
        Links.Add("Delete Manuals and Labels", "3");
        Links.Add("manuals and labels", "1");
        string text = "Having trouble with your manuals and labels? Learn more about adding manuals and labels. Need to get rid of them? Try to delete manuals and labels.";
        foreach (string termToFind in Links.Keys)
        {
            // Skip occurrences that already sit inside an anchor inserted by an earlier pass
            Regex _regex = new Regex(@"\b" + termToFind + @"s?\b(?![^<>]*</)", RegexOptions.IgnoreCase);
            text = _regex.Replace(text, @"<a href=""" + Links[termToFind] + @""">$&</a>");
        }
        Console.WriteLine(text);
    }
}
The negative lookahead ((?![^<>]*</)) I added prevents replacing a part that was already replaced earlier and now sits between anchor tags.
First, you can prevent your regex for manuals and labels from finding Learn more about manuals and labels by using a lookbehind. Modified, your regex looks like this:
(?<!Learn more about )(manuals and labels)
But for your specific request I would suggest a different solution. You should define a rule, a priority list, or both for your regexes. A possible rule could be "always search first for the regex that matches the most characters". This, however, requires that your regexes are always fixed length. And it does not prevent one regex from consuming and replacing characters that would have been matched by a different regex (maybe even one of the same size).
Of course, you will need to add an additional lookbehind and lookahead to each of your regexes to prevent replacing strings that are inside your replacement elements.
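One way to get a "longest term first" priority without per-term lookarounds is to sort the terms by length and combine them into a single alternation, so the text is scanned only once and text that has already been matched is never looked at again. The following is a rough sketch, not a drop-in solution: GlossaryLinker and InsertLinks are made-up names, and the href values are assumed to come from a term-to-page dictionary like the one in the earlier answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GlossaryLinker
{
    // Wraps every glossary term found in text in an anchor pointing at its href.
    public static string InsertLinks(string text, IDictionary<string, string> hrefByTerm)
    {
        // Case-insensitive lookup, so "delete manuals and labels"
        // still finds the entry for "Delete Manuals and Labels".
        var lookup = new Dictionary<string, string>(hrefByTerm, StringComparer.OrdinalIgnoreCase);

        // Longest terms first: alternatives are tried left to right,
        // so a long term wins over a shorter term that is its prefix.
        var alternatives = lookup.Keys
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));
        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // m.Value keeps the capitalization found in the text.
        return Regex.Replace(text, pattern,
            m => string.Format("<a href=\"{0}\">{1}</a>", lookup[m.Value], m.Value),
            RegexOptions.IgnoreCase);
    }
}
```

Because each match is consumed as a whole, "manuals and labels" inside "Learn more about adding manuals and labels" is never linked separately, and the original capitalization survives in the link text.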

Determine POS tagging in English based on database files

I'm a little confused about how to determine part-of-speech (POS) tags in English. In this case, I assume that each English word has exactly one type; for example, the word "book" is recognized as a NOUN, not as a VERB. I want to classify English sentences by tense. For example, "I sent the book" is recognized as past tense.
Description:
I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. If the input words are present in the database, I assume that the type of those words can be determined. But how do I begin the lookup in the databases? For example, for "I sent the book": how do I search the databases for every word, so that "I" is a noun, "sent" a verb, "the" an article, and "book" a noun? Is there a better approach than searching for every word in every database? I doubt that every database contains only unique elements.
I enclose my perspective here.
private List<string> ParseInput(String allInput)
{
    List<string> listSentence = new List<string>();
    char[] delimiter = ".?!;".ToCharArray();
    var sentences = allInput.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim());
    foreach (var s in sentences)
        listSentence.Add(s);
    return listSentence;
}
private void tenseReviewMenu_Click(object sender, EventArgs e)
{
    string allInput = rtbInput.Text;
    List<string> listWord = new List<string>();
    List<string> listSentence = new List<string>();
    HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
    HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
    HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
    HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));
    char[] separator = new char[] { ' ', '\t', '\n', ',' /* etc. */ };
    listSentence = ParseInput(allInput);
    foreach (string sentence in listSentence)
    {
        foreach (string word in sentence.Split(separator))
            if (word.Trim() != "")
                listWord.Add(word);
    }
    string testPOS = "";
    foreach (string word in listWord)
    {
        // The first list that contains the word decides its tag
        if (nounList.Contains(word.ToLowerInvariant()))
            testPOS += "noun ";
        else if (verbList.Contains(word.ToLowerInvariant()))
            testPOS += "verb ";
        else if (adjectiveList.Contains(word.ToLowerInvariant()))
            testPOS += "adj ";
        else if (adverbList.Contains(word.ToLowerInvariant()))
            testPOS += "adv ";
    }
    tbTest.Text = testPOS;
}
POS tagging is only a secondary part of my assignment, so I use a simple, database-based approach to determine the tags. But if there's a simpler approach to POS tagging (easy to use, easy to understand, easy to express in pseudocode, easy to design), please let me know.
I hope the pseudocode I present below proves helpful to you. If I find time, I'll also write some code for you.
This problem can be tackled by following the steps below:
1. Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern, and sentences like I sleep, Dog barked and Ship will arrive all match the S-V pattern. Lists of the most common English sentence patterns are available online. Please note that for some time you may need to keep revising this dictionary to improve the accuracy of your program.
2. Try to fit the input sentence to one of the patterns in the dictionary you created above. For example, if the input sentence is Snakes, unlike elephants, are venomous., then your code must be able to find a match with the pattern: Subject, unlike AnotherSubject, Verb Object or S-,unlike-S`-, -V-O. To perform this step successfully, you may need to write code that's good at spotting Structure Markers like the word unlike in this example sentence.
3. When you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word Snakes would be tagged as a subject, just like the word elephants; the word are would be tagged as a verb; and finally the word venomous would be tagged as an object.
4. Once you have assigned a unique tag to each of the words in your sentence, you can look up each word in the appropriate text file that you already have and determine whether or not your sentence is valid.
5. If your sentence doesn't match any sentence pattern, then you have two options:
a) Add the pattern of this unrecognized sentence to your pattern dictionary if it is a valid English sentence.
b) Or, discard the input sentence as an invalid English sentence.
Things like what you're trying to achieve are best solved using machine learning techniques, so that the system can learn any new patterns. So you may want to include a trainer system that adds a new pattern to your pattern dictionary whenever it finds a valid English sentence not matching any of the existing patterns. I haven't thought much about how this can be done, but for now, you may manually revise your Sentence Pattern dictionary.
I'd be glad to hear your opinion about this pseudocode and would be available to brainstorm it further.
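A very reduced sketch of how a lookup plus a pattern dictionary could fit together is shown below. Everything in it is an assumption made for illustration: the SimpleTagger name, the tiny in-memory lexicon standing in for the *.txt files, the tag names, and the two patterns. It merges the word lists into a single dictionary, so each word needs one lookup instead of one per file, and then matches the resulting tag sequence against the known patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SimpleTagger
{
    // Tiny in-memory stand-in for the word-list files (hypothetical content);
    // a real program would fill this once from nounList.txt, verbList.txt, etc.
    static readonly Dictionary<string, string> Lexicon =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "I", "NOUN" }, { "book", "NOUN" }, { "dog", "NOUN" },
            { "sent", "VERB" }, { "barked", "VERB" }, { "the", "ARTICLE" }
        };

    // Hypothetical pattern dictionary: tag sequence -> pattern name.
    static readonly Dictionary<string, string> Patterns =
        new Dictionary<string, string>
        {
            { "NOUN VERB", "S-V" },
            { "NOUN VERB ARTICLE NOUN", "S-V-O" }
        };

    // One dictionary lookup per word; unknown words get the tag "UNKNOWN".
    static string TagWord(string word)
    {
        string tag;
        return Lexicon.TryGetValue(word, out tag) ? tag : "UNKNOWN";
    }

    public static string[] Tag(string sentence)
    {
        return sentence
            .Split(new[] { ' ', '\t', '\n', ',' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(TagWord)
            .ToArray();
    }

    // Returns the name of the matching sentence pattern, or null if none matches.
    public static string MatchPattern(string sentence)
    {
        string name;
        return Patterns.TryGetValue(string.Join(" ", Tag(sentence)), out name) ? name : null;
    }
}
```

With this toy data, "I sent the book" is tagged NOUN VERB ARTICLE NOUN and matches the S-V-O entry. Real sentences would need structure markers and a much larger pattern set, as described in the steps above.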

Proximity Search example Lucene.Net

I want to make a Proximity Search with Lucene.Net. I saw this question where it looks like that was the answer for him, but no code was supplied. The Java documentation says to use the ~ character with the number of words in between, but I don't see where this character would go in the code. Can anyone give me an example of a Proximity Search using Lucene.Net?
Edit:
What I have so far:
IndexSearcher searcher = new IndexSearcher(this.Directory, true);
string[] fieldList = new string[] { "Name", "Description" };
List<BooleanClause.Occur> occurs = new List<BooleanClause.Occur>();
foreach (string field in fieldList)
{
occurs.Add(BooleanClause.Occur.SHOULD);
}
Query searchQuery = MultiFieldQueryParser.Parse(this.LuceneVersion, query, fieldList, occurs.ToArray(), this.Analyzer);
If I try to add the "~" with any number to the query passed to MultiFieldQueryParser, it errors out saying that for a FuzzySearch the values should be between 0.0 and 1.0, but I want a Proximity Search with 3 words of separation, e.g. "my search"~3
The tilde means either a fuzzy search if you apply it on a single term, or a proximity search if you apply it on a phrase. The error you're receiving sounds like you're applying it on a single term (term~10) instead of using a phrase ("term term"~10).
To do a proximity search use the tilde, "~", symbol at the end of a Phrase.
The only differences between Lucene.NET and classic Java Lucene of the same version should be internal, not external -- the operational goal is to have a very compatible project, especially on the input (queries) and output (index files) side. So it should work however it works for Java Lucene. If it doesn't, it is a bug.
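In C# the fiddly part is usually getting the double quotes into the query string, since the phrase has to be quoted inside the string itself. A small helper along these lines might help; ProximityQuery.Build is a made-up name, not part of Lucene.Net, and the sketch assumes the phrase itself contains no double quotes:

```csharp
public static class ProximityQuery
{
    // Builds the query text for a proximity search: the phrase wrapped in
    // double quotes, followed by ~ and the maximum word distance.
    // Assumes phrase contains no double quotes of its own.
    public static string Build(string phrase, int maxDistance)
    {
        return "\"" + phrase + "\"~" + maxDistance;
    }
}

// Possible use with the parser call from the question (not executed here):
//   string query = ProximityQuery.Build("my search", 3);   // "my search"~3
//   Query searchQuery = MultiFieldQueryParser.Parse(this.LuceneVersion, query,
//       fieldList, occurs.ToArray(), this.Analyzer);
```

The resulting string is then parsed like any other query; the tilde applies to the quoted phrase rather than to a single term, which is what avoids the fuzzy-search error.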

Using .NET RegEx to retrieve part of a string after the second '-'

This is my first Stack message. Hope you can help.
I have several strings I need to break up for use later. Here are a couple of examples of what I mean:
fred-064528-NEEDED
frederic-84728957-NEEDED
sam-028-NEEDED
As you can see above, the string lengths vary greatly, so I believe regex is the only way to achieve what I want. What I need is the rest of the string after the second hyphen ('-').
I am very weak at regex, so any help would be great.
Thanks in advance.
Just to offer an alternative without using regex:
foreach (string s in list)
{
    int x = s.LastIndexOf('-');
    string sub = s.Substring(x + 1);
}
Add validation to taste.
Something like this. It will take anything (except line breaks) after the second '-', including any further '-' signs.
var exp = @"^\w*-\w*-(.*)$";
var match = Regex.Match("frederic-84728957-NEE-DED", exp);
if (match.Success)
{
    var result = match.Groups[1].Value; // result is NEE-DED
    Console.WriteLine(result);
}
EDIT: I answered another question which relates to this. Except, it asked for a LINQ solution and my answer was the following which I find pretty clear.
Pimp my LINQ: a learning exercise based upon another post
var result = String.Join("-", inputData.Split('-').Skip(2));
or
var result = inputData.Split('-').Skip(2).FirstOrDefault(); //If the last part is NEE-DED then only NEE is returned.
As mentioned in the other SO thread it is not the fastest way of doing this.
If they are part of larger text:
(\w+-){2}(\w+)
If they are presented as whole lines, and you know you don't have other hyphens, you may also use:
[^-]*$
Another option, if you have each line as a string, is to use split (again, depending on whether or not you're expecting extra hyphens, you may omit the count parameter, or use LastIndexOf):
string[] tokens = line.Split("-".ToCharArray(), 3);
string s = tokens.Last();
This should work:
.*?-.*?-(.*)
This should do the trick:
([^\-]+)\-([^\-]+)\-(.*?)$
The regex pattern will be
(?<first>.*)?-(?<second>.*)?-(?<third>.*)?(\s|$)
Then you can read the named group "third" to get the text after the 2nd hyphen.
Alternatively, you can do a string.Split('-') and take the item at index 2 (the third item) from the array.
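If the wanted part may itself contain hyphens, neither LastIndexOf nor an unlimited Split returns everything after the second hyphen. A plain IndexOf-based sketch handles that case; the HyphenSplit and AfterSecondHyphen names are made up, and returning null for input with fewer than two hyphens is an assumption:

```csharp
public static class HyphenSplit
{
    // Returns everything after the second '-' in s,
    // or null when s contains fewer than two hyphens.
    public static string AfterSecondHyphen(string s)
    {
        int first = s.IndexOf('-');
        if (first < 0) return null;

        // Continue the search just past the first hyphen.
        int second = s.IndexOf('-', first + 1);
        if (second < 0) return null;

        return s.Substring(second + 1);
    }
}
```

For "fred-064528-NEEDED" this gives NEEDED, and for "frederic-84728957-NEE-DED" it gives NEE-DED.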
