Highlighting keywords in text - c#

I have a SQL Server 2008 database that has one table with a varchar(1000) field that contains a bunch of user input about books. I have another table that contains a bunch of keywords. When I render the user's info about the books, I want to highlight (or eventually create a hyperlink on) these keywords. I'm looking for suggestions on the most efficient method for scanning through the text and matching the keywords up. I wasn't sure if there's a way to do it right in SQL or if it needs to be in the code. Thanks.

I recommend doing it in code. It's business logic, and doing so also lightens the load on the database, so if the database is on a machine other than the servers running your app, you don't hog that machine's resources.
I think regex would do the trick for you. It's a very efficient, general-purpose way to match text, and the built-in implementation in most technologies (not only .NET) is hard to beat. If you try to come up with something else, you'll at best be reinventing the wheel.
So I'd do this: place every keyword in a hashtable or dictionary (which has the bonus of cutting out duplicates), then iterate through that. For each match of a keyword in the main text, you can get the first and last index of the match and wrap that span with the markup for the highlight and the link.
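A rough sketch of that idea, in case it helps. The class name KeywordHighlighter, the <mark> tag, and the choice to HTML-encode the surrounding text are my own assumptions, not something from the question. It de-duplicates the keywords, builds one whole-word regex from them, and wraps each match using its start index and length.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: names and markup are illustrative only.
public static class KeywordHighlighter
{
    public static string Highlight(string text, IEnumerable<string> keywords)
    {
        // A HashSet drops duplicate keywords (ignoring case).
        var unique = new HashSet<string>(
            keywords.Where(k => !string.IsNullOrWhiteSpace(k)).Select(k => k.Trim()),
            StringComparer.OrdinalIgnoreCase);
        if (unique.Count == 0) return WebUtility.HtmlEncode(text);

        // One alternation, longest keyword first so longer phrases win over shorter ones.
        // Note: \b assumes keywords start and end with a letter, digit or underscore.
        var alternatives = unique.OrderByDescending(k => k.Length).Select(k => Regex.Escape(k));
        var pattern = new Regex(@"\b(?:" + string.Join("|", alternatives) + @")\b",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

        var html = new StringBuilder();
        int last = 0;
        foreach (Match m in pattern.Matches(text))
        {
            // Encode the untouched text between matches, then wrap the match itself.
            html.Append(WebUtility.HtmlEncode(text.Substring(last, m.Index - last)));
            html.Append("<mark>").Append(WebUtility.HtmlEncode(m.Value)).Append("</mark>");
            last = m.Index + m.Length;
        }
        html.Append(WebUtility.HtmlEncode(text.Substring(last)));
        return html.ToString();
    }
}
```

Swapping the <mark> wrapper for an anchor tag would give you the hyperlink version. Building a single pattern means the text is scanned once rather than once per keyword.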

Here's a quick LINQPad program that I wrote up that will replace values in a string with dictionary values that match up on the keys. Let me know if this is what you were looking for.
Side Note: I agree with everyone else that you should be doing this in the application layer for a variety of reasons.
void Main()
{
    Dictionary<string, string> links = new Dictionary<string, string>();
    links.Add("awesome", "link-to-awesome");
    links.Add("okay", "link-to-okay");
    string text = "This is some text about an okay book review of an otherwise awesome book.";
    string result = links.Aggregate(text, (current, kvp) => current.Replace(kvp.Key, kvp.Value));
    text.Dump();
    result.Dump();
}
Results:
This is some text about an okay book review of an otherwise awesome book.
This is some text about an link-to-okay book review of an otherwise link-to-awesome book.
Edit: This isn't a perfect example. You'd have to strip out punctuation and what-not for a final version. Hopefully this gets you on the right track.

If the links are fixed then personally I would do this when WRITING to the DB, as it will only be done once and then requires no extra effort to display.

Related

Using Irony with C# to convert a search string into SQL Full text index query

I have a search box where users can enter text; when they hit search, the text they entered is used in a SQL CONTAINSTABLE statement. I need to parse the string so that it is in an appropriate format for the CONTAINSTABLE function, and I have found an example that uses Irony that does almost exactly what I need. I got the Irony sample class here:
http://irony.codeplex.com/SourceControl/latest#Irony.Samples/FullTextSearchQueryConverter/SearchGrammar.cs
which is actually designed for the SQL CONTAINS function, but the differences between that and CONTAINSTABLE aren't a problem for me at the minute. I made a slight modification in that I didn't want the 'Inflectional' behaviour, so I changed any references to that to be 'Exact'.
The problem I am having now is that I want a search phrase to be treated as a phrase, rather than as a list of keywords separated by an AND operator. So for example, if a user enters "General manager", then I want it to come through the parser as "General manager", but it's currently bringing back "General" AND "manager".
I think I need to modify the constructor somehow, where it is building all the expression rules - but I'm not even sure where to start on that!
Any help greatly appreciated, thanks.

Which is the best way to Prevent Bad word entry in Discussion module in my site

In my web site some hackers are entering bad words. Which is the best way to prevent this?
I am using ASP.NET, C# and SQL Server as resources.
1. Check for bad words in the form backend?
2. Check for bad words in JavaScript?
3. Check for bad words in a stored procedure before insert?
I think the first method is best.
Please suggest optimized code for this check.
Now I am using this method:
var filterWords = ["fool", "dumb", "couch potato"];
// join with "|" so the pattern matches any one of the words;
// "i" is to ignore case and "g" for global
var rgx = new RegExp(filterWords.join("|"), "gi");
function wordFilter(str) {
    return str.replace(rgx, "****");
}
// call the function
document.write("Original String - ");
document.writeln("You fool. Why are you so dumb <br/>");
document.write("Replaced String - ");
document.writeln(wordFilter("You fool. Why are you so dumb"));
You should check in the ASP.NET code, on the server side. JavaScript or any other client side check can be easily worked around. The code you posted works fine, except it is not particularly robust (a variety of simple misspellings will get around it).
make sure to check for permutations such as
Secure --> $3(ur3
And I would replace the word with something like
[REMOVED] or [CENSORED]
Words like s***t can still be viewed as offensive by customers and others.
Edit: Seeing HevyLight's thoughts on JavaScript usage here... you might try a string filter in your C# layer (assuming that layer is already doing the heavy lifting and the database calls). Pass all posted strings through the filter before writing them to the database (and before others can see them).
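For what it's worth, a minimal sketch of such a server-side filter could look like the following. The class name PostFilter is made up, and the word list is just the sample list from the question; a real list would live in configuration or the database. It matches whole words only, ignores case, and replaces hits with [CENSORED].

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: the name and the hard-coded list are assumptions.
public static class PostFilter
{
    // Sample words taken from the question's JavaScript example.
    private static readonly string[] BlockedWords = { "fool", "dumb", "couch potato" };

    // Whole-word, case-insensitive alternation built once and reused.
    private static readonly Regex Blocked = new Regex(
        @"\b(?:" + string.Join("|", BlockedWords.Select(w => Regex.Escape(w))) + @")\b",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Call this on every post before it is written to the database.
    public static string Censor(string post)
    {
        return Blocked.Replace(post, "[CENSORED]");
    }
}
```

The \b anchors keep it from mangling innocent words that merely contain a blocked word (for example "foolish"), which the plain substring version would hit. It does nothing about deliberate misspellings.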
Reality is that you can’t prevent 100% of bad words.
I’d go with a two-step verification on the server side (JS can be disabled and SQL is not really suitable for handling this)
1. Create a list of the most commonly used bad words. This will probably catch around 80% of all inputs.
2. Create a list of patterns for suspect words that will signal you to verify them manually.
This could be patterns such as
a) Word contains two or more * characters
b) Word contains letters and one of the following characters: 0, 3, $, and others
In time you’ll just have to keep both lists updated. Again, this will not solve 100% of cases but it will probably catch and fix like 95% if implemented properly.
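The two steps above could be sketched roughly like this. Verdict, PostTriage and the tiny word list are placeholders I made up, and the suspect patterns only cover the two examples given (a) and (b), so treat it as a starting point rather than a finished filter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public enum Verdict { Clean, Blocked, NeedsReview }

// Hypothetical helper: names, list contents and patterns are assumptions.
public static class PostTriage
{
    // Step 1: exact list of common bad words (placeholder entries).
    private static readonly HashSet<string> Blocklist =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "fool", "dumb" };

    // Step 2a: two or more '*' characters in one word.
    private static readonly Regex Masked = new Regex(@"\*.*\*");
    // Step 2b: letters mixed with common substitute characters.
    private static readonly Regex HasLetter = new Regex("[A-Za-z]");
    private static readonly Regex HasSubstitute = new Regex("[03$]");

    public static Verdict Classify(string post)
    {
        // A null separator array splits on any whitespace.
        var words = post.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var verdict = Verdict.Clean;
        foreach (var raw in words)
        {
            var word = raw.Trim('.', ',', '!', '?', ';', ':', '"', '\'');
            if (Blocklist.Contains(word)) return Verdict.Blocked;
            if (Masked.IsMatch(word) || (HasLetter.IsMatch(word) && HasSubstitute.IsMatch(word)))
                verdict = Verdict.NeedsReview; // keep scanning in case a blocked word follows
        }
        return verdict;
    }
}
```

Step 2 will flag harmless words such as "mp3" too, which is why its result is "send to a human" rather than "reject".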

How to parse through data efficiently

I am wondering if anyone can help me out with parsing out data for key words.
Say I am looking for this keyword: My Example Yo (this is one of many keywords).
I have data like this:
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these phrases, they could be in any case, and something may or may not come after them, as in the examples above.
A few ideas came to mind.
1. Store all the combinations that I can possibly think of in my database, then use contains.
The downside is that I'd end up with a huge database table holding every combination of everything I need to find. I'd then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine which category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them, I still picture this being slow.
2. Remove all special characters, collapse whitespace to single spaces, ignore case, and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.
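If a full-text engine is more than you need, the normalization idea from the question could be sketched like this. KeywordNormalizer and the "fraction of keyword words found" score are my own assumptions about what "how much of the keyword matches" might mean. Dashes and other special characters are simply turned into spaces on both sides before comparing, so only the stored keywords (not every variant) need to be kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: names and scoring rule are assumptions.
public static class KeywordNormalizer
{
    private static readonly Regex NonAlphanumeric = new Regex("[^a-z0-9]+");

    // Lower-cases the value and turns every run of other characters into one space.
    public static string Normalize(string value)
    {
        return NonAlphanumeric.Replace(value.ToLowerInvariant(), " ").Trim();
    }

    // Fraction (0.0 to 1.0) of the keyword's words that occur in the data row.
    public static double MatchScore(string keyword, string data)
    {
        var keywordWords = Normalize(keyword)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (keywordWords.Length == 0) return 0.0;

        var dataWords = new HashSet<string>(
            Normalize(data).Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
        int hits = keywordWords.Count(w => dataWords.Contains(w));
        return (double)hits / keywordWords.Length;
    }
}
```

With the sample rows, "MY EXAMPLE YO #108" and "my-example-yo #108" score a full match against "My Example Yo", while "my-example #108" scores two thirds, so you could pick the category with the highest score above some threshold.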

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as SO: space-separated, with dashes (-) joining multi-word tags. So when I am validating a user-input tag field, I check the following:
1. Empty string (cannot be empty)
2. Make sure the string doesn't contain particular characters (suggestions are welcome here)
3. At least one word
4. If there is a space (there is more than one word), split the string
5. For each split part, insert into the DB
Am I missing something here, or is this roughly OK?
Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This would limit allowed input to consecutive strings of alphanumerics (and "_", which is part of "\w" as well as "-" because you asked for it) 2..25 characters long. This essentially removes any code injection threat you might be facing.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.
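In C#, the split-and-check approach with that regex might look roughly like this. TagParser is a made-up name, and lower-casing and de-duplicating the tags are my own additions, so drop them if you want to store tags exactly as typed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: name, lower-casing and de-duplication are assumptions.
public static class TagParser
{
    // Same pattern as above: 2 to 25 word characters or dashes.
    private static readonly Regex ValidTag = new Regex(@"^[-\w]{2,25}$");

    // Returns false (and an empty list) if the input is empty or any part is invalid.
    public static bool TryParse(string input, out List<string> tags)
    {
        tags = new List<string>();
        if (string.IsNullOrWhiteSpace(input)) return false;

        // A null separator array splits on any whitespace; empty parts from
        // leading, trailing or repeated spaces are dropped.
        var parts = input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var part in parts)
        {
            if (!ValidTag.IsMatch(part))
            {
                tags.Clear();
                return false;
            }
            if (seen.Add(part)) tags.Add(part.ToLowerInvariant());
        }
        return true;
    }
}
```

Note that in .NET "\w" also matches non-ASCII letters and digits, so swap in something like [a-zA-Z0-9] if you want ASCII-only tags. Validation like this does not replace parameterized queries when you insert the tags.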
I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
    TechQuestionID (pk)
    SubjectLine
    QuestionBody
TechQuestionTag
    TechQuestionID (pk)
    TagID (pk)
    Active (indexed)
Tag
    TagID (pk)
    TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
    q.TechQuestionID,
    q.SubjectLine,
    q.QuestionBody
FROM
    Tag t INNER JOIN TechQuestionTag qt
        ON t.TagID = qt.TagID AND qt.Active = 1
    INNER JOIN TechQuestion q
        ON qt.TechQuestionID = q.TechQuestionID
WHERE
    t.TagText = @tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there... because I don't believe the alternative (redundant, indexed, text-tag entries) would query as efficiently.
Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.

What's The Best Approach To Do Keyword Searching On A Text Field?

I have a database table which holds meta data about images, the field in concern is the caption field. I want users to be able to enter keywords into a textbox and have the app return a selection of images that match the keywords based on their caption.
I already have the code that returns an array of the individual keywords entered by the user, but what's the best way to do the comparison? So I'm thinking along the lines of...
foreach (Image image in Images)
{
    foreach (string keyword in keywords)
    {
        if (image.Caption.Contains(keyword))
        {
            imageCollection.Add(image);
            break;
        }
    }
}
But this just seems a little too simplistic because it wouldn't support matching whole words only. Not to mention special characters, punctuation etc.
I feel like Regex should be used here, but I'm no Regex expert. Or should I be breaking up the caption into individual words and doing the comparison word by word? Looking for some suggestions really!
I'm writing in C#, but this can be language agnostic, I think.
EDIT: I'm also quite interested in weighting results based on the number of keywords matched. But I'm not trying to recreate Google Images here!
Probably the best way to do this would be to use full-text index on the caption field in the database. Let the database do the work for you!
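If the matching does stay in application code, the whole-word and weighting ideas from the question could be sketched like this. CaptionSearch is a made-up name, and it works on plain caption strings rather than the question's Image type to stay self-contained. Each keyword counts at most once per caption, and captions are ordered by how many distinct keywords they contain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: names and the scoring rule are assumptions.
public static class CaptionSearch
{
    // Number of distinct keywords that occur as whole words in the caption.
    public static int Score(string caption, IEnumerable<string> keywords)
    {
        int score = 0;
        foreach (var keyword in keywords.Distinct(StringComparer.OrdinalIgnoreCase))
        {
            if (string.IsNullOrWhiteSpace(keyword)) continue;
            // \b keeps "cat" from matching inside "concatenate".
            var pattern = @"\b" + Regex.Escape(keyword.Trim()) + @"\b";
            if (Regex.IsMatch(caption, pattern, RegexOptions.IgnoreCase)) score++;
        }
        return score;
    }

    // Captions with at least one hit, best match first.
    public static List<string> Rank(IEnumerable<string> captions, IEnumerable<string> keywords)
    {
        var keywordList = keywords.ToList();
        return captions
            .Select(c => new { Caption = c, Score = Score(c, keywordList) })
            .Where(x => x.Score > 0)
            .OrderByDescending(x => x.Score)
            .Select(x => x.Caption)
            .ToList();
    }
}
```

This scans every caption for every keyword, so it is only reasonable for small collections; for anything larger the full-text index suggested above is likely the better route.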
