I have an ASP.NET app that accepts user comments and stores them in a SQL database. I want to make sure that I weed out any "naughty" words so I can keep my app respectable. Problem is, I'm finding there are LOTS of these words. ;>
My question is, what's the most efficient way to do this processing? Should I have a table in SQL and write a stored proc that does the work? Should I do it with C# and Regex in memory on the web server? Are there other options? Has anyone else successfully done this kind of text scanning at scale? If so, what worked?
It's a futile task. If people want to swear then they will start typing things like f uck and sh*t.
There's no substitute for effective moderation. Anything else is likely to leave you with clbuttic errors on your page
I remember a quote from somewhere about technical solutions to social problems, but I can't source it right now
Scunthorpe Problem
One should be embar***ed to try to solve this in code.
There are some things to consider here:
Do you want to be able to add or remove words from that blacklist later? If so, it might make sense to apply the filtering only before showing the message, but store the original message.
Do you want to have a copy of the message later on (e.g. for legal reasons or customer support)? Then it also makes sense to keep the message unchanged in the database.
So I would keep the message in the database and parse it only before rendering it. To me it looks like the most efficient way to do that would be either to:
Keep the blacklist in an indexed column (lowercase) in the database and return the comments through a stored procedure which filters it
Keep the blacklist lowercase in some data structure that allows for efficient access (e.g. Dictionary) in memory on the middle layer.
In both cases you would simply run through each comment and filter it. The latter method is easier to implement, but means you have to keep the list in memory, which stops making sense once the blacklist gets very large.
(I actually see no point in using regex.)
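For illustration, a minimal sketch of the in-memory approach in C# (the word list, punctuation handling, and masking policy here are placeholder assumptions, not a complete solution):

```csharp
// Minimal sketch: mask blacklisted words before rendering a comment.
using System;
using System.Collections.Generic;

static class CommentFilter
{
    // Case-insensitive HashSet gives O(1) lookups per word.
    static readonly HashSet<string> Blacklist =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "badword", "otherbadword" };

    public static string Filter(string comment)
    {
        var words = comment.Split(' ');
        for (int i = 0; i < words.Length; i++)
        {
            // Trim punctuation so "badword!" still matches "badword".
            string core = words[i].Trim('.', ',', '!', '?', ';', ':', '"');
            if (Blacklist.Contains(core))
                words[i] = words[i].Replace(core, new string('*', core.Length));
        }
        return string.Join(" ", words);
    }
}
```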
There are already some Perl modules out there to do all of that for you.
https://metacpan.org/pod/Regexp::Common::profanity
https://metacpan.org/pod/Regexp::Profanity::US
https://metacpan.org/pod/Plagger::Plugin::Filter::Profanity
Related
Just a quick question that I am sure someone on here can give me a detailed answer about.
Basically, I am using a DataGridView to display records from a database, which in my case is simply a text file that is being parsed. This feels simple, and if I want to select records based on certain parameters, I iterate through the list searching for matches. However, I wonder whether this is inefficient compared to using a full-blown DB such as Mongo or SQL.
Can I get away with this if my software is relatively simple? I really prefer to stay away from complicating things when they don't need to be complicated.
By the way, I am expecting to have a DB (sometimes) larger than 100k entries, so take that into consideration.
@DavidStampher
Even though you may be using just one table or file, I would strongly suggest using a database system for this. Database engines are optimized for speed, so querying is far less frustrating and time-consuming than trying to update a single text file.
I'm only suggesting MySQL as an option because it's the one I am most familiar with. Other users may have different or better suggestions.
You can easily download and install one using the MySQL installer. The setup is relatively simple and should take less than 10 minutes. You could create a new schema, add a table, then query it to do what you need.
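Once the table exists, querying it from C# and binding the results to your DataGridView takes only a few lines. A minimal sketch, assuming the MySql.Data connector and hypothetical connection, table, and control names:

```csharp
using System.Data;
using MySql.Data.MySqlClient;

// Fill a DataTable from the database and bind it to the grid.
var table = new DataTable();
using (var conn = new MySqlConnection("server=localhost;database=mydb;user=appuser;password=secret"))
using (var adapter = new MySqlDataAdapter("SELECT Id, Name FROM Records WHERE Name LIKE @pattern", conn))
{
    adapter.SelectCommand.Parameters.AddWithValue("@pattern", "%smith%");
    adapter.Fill(table); // Fill opens and closes the connection itself.
}
dataGridView1.DataSource = table; // the grid now shows the matching records
```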
I would suggest creating a new user other than root, just in case someone manages to hack into your account.
If you would like the easiest way to manage the database rather than going through the old fashioned phpMyAdmin, download MySQL Workbench. It's pretty cool and relatively easy to use.
Let me know if you have questions. :-)
I'm not sure if this is possible and I can't find any answers anywhere else.
Is it possible to have a blank OleDb database stored in the Folders of a Project in C# and after a buttonClick replace the current database (that's being used) with the blank Database?
I'm not looking for a way to save the data from the original somewhere else, although I can understand how that is useful and good practice.
I'm primarily interested in whether this is possible, and how I would go about learning how to use it in my system.
Thanks for reading, and thanks in advance for any help or suggestions :)
I suggest that, instead of using a fixed database file the user has to "reset", you allow the user to create multiple database files that he can choose from. Most office applications have "create file"/"open file" dialogs, so users are very familiar with them. You may want to implement an "open last file" or "recent files list" feature for convenience.
First of all, this would allow the user to start fresh without having to throw away his old data. A user would also be much less reluctant to create something new than to "reset" something (as he should be), so that would also help in user experience terms.
Secondly, disconnecting a database or truncating its contents can work, but it is not as simple as it first sounds. If you're still connected to the database, the file will be locked and you can't remove it. The truncate approach can also cause problems: not only does your whole application have to be able to deal with a suddenly empty database, but existing foreign key constraints have to be considered too.
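If you do go the reset-by-replacement route anyway, the usual trick is to close everything, clear the connection pool, and copy a blank template file over the live one. A minimal sketch, with hypothetical paths:

```csharp
using System;
using System.Data.OleDb;
using System.IO;

public static void ResetDatabase(string livePath, string blankTemplatePath)
{
    // Release pooled connections so nothing still holds a lock on the file.
    OleDbConnection.ReleaseObjectPool();
    GC.Collect();
    GC.WaitForPendingFinalizers();

    // Overwrite the live database with the pristine blank copy.
    File.Copy(blankTemplatePath, livePath, overwrite: true);
}
```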
My program has 3 text fields, Title, Website, and PictureURL. When I click the 'save' button I want it to add the 3 entries into a log of some sort (LINQ or XML seems like the best choice). Only 1 user will be accessing the program at a time. The log will be local on the machine, and not on an external server. After the 3 fields have been saved as a single entry to the log, I want to be able to load each group of entries from the log back into the textboxes. Would either be a simpler solution or a more appropriate choice for this type of project? I am new to both hence my uncertainty for which would be better.
Given this set of requirements, it would indeed be better to stick with XML storage, since you have neither a large amount of data, nor complex search and grouping conditions, nor remote or distributed access. LINQ to XML would suit such a simple desktop application perfectly. Keep it simple.
Why not LINQ to XML? Assuming local storage is going to be, as you stated, an XML file:
http://msdn.microsoft.com/en-us/library/bb387098.aspx
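A minimal sketch of what saving and reloading the three fields could look like with LINQ to XML (the file name, element names, and textbox names are just placeholders):

```csharp
using System.IO;
using System.Linq;
using System.Xml.Linq;

const string path = "log.xml";

// On save: create the file on first use, then append an <entry>.
var doc = File.Exists(path) ? XDocument.Load(path) : new XDocument(new XElement("entries"));
doc.Root.Add(new XElement("entry",
    new XElement("Title", titleBox.Text),
    new XElement("Website", websiteBox.Text),
    new XElement("PictureURL", pictureUrlBox.Text)));
doc.Save(path);

// On load: project each <entry> back into its three fields.
var entries = XDocument.Load(path).Root.Elements("entry")
    .Select(e => new
    {
        Title = (string)e.Element("Title"),
        Website = (string)e.Element("Website"),
        PictureURL = (string)e.Element("PictureURL")
    });
```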
It's hard to give a good answer without knowing more about your situation.
If you are just running this locally on one machine, and do not anticipate the log growing overly large, I'd say XML would be the better choice, as it requires less setup and overhead than a database.
However, if it needs to scale for size or users, you'll want to use a database. But that will add additional complexity, despite the fact that LINQ to SQL makes it simpler to use.
I am in the process of creating an app in which a customer can add email addresses to an event. Each email address added to the list is then sent two URLs via email: one URL to accept and the other to decline. The URL is made up of a number of query parameters, IDs, etc.
The issue I have is that I want to prevent the scenario in which someone could "guess" another person's URL - that is, guess the combination of parameters, etc. While this is very unlikely, I still want to prevent it.
I have seen several approaches that help prevent this, e.g. adding a hash value, encrypting the URL, etc. However, I am looking for the most secure, best-practice approach, and would appreciate any feedback.
As an aside I am coding in C# but I dont believe the solution to this is language specific.
Thanks in advance.
I agree this is not language specific. I had a situation very similar to this within the last few years. It needed to be extremely secure due to children and parents receiving the communications. The fastest solution was something like the following:
First store the information that you would use in the URL as parameters somewhere in a database. This should be relatively quick and simple.
Create two GUIDs.
Associate the first GUID with the data in the database that you would have used for processing an "acceptance".
Associate the second GUID for a "decline" record in the database.
Create the two URLs with only the GUIDs as parameters.
If the Acceptance URL is clicked, use the database data associated with it to process the acceptance.
If the Decline is clicked, delete the data out of the database, or archive it, or whatever.
After a timeframe, if no URL is clicked, delete or archive the data associated with those GUIDs so that they can no longer be used.
GUIDs are extremely hard to guess, and the likelihood of guessing one that is actually usable is so small as to be nearly impossible.
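For illustration, a minimal sketch of the scheme (the URLs and the persistence step are hypothetical):

```csharp
using System;

// One GUID per action, stored alongside the invitation record.
var acceptToken = Guid.NewGuid();
var declineToken = Guid.NewGuid();

// Persist both tokens with the invitation row, then build the links:
string acceptUrl  = $"https://example.com/rsvp/accept?token={acceptToken:N}";
string declineUrl = $"https://example.com/rsvp/decline?token={declineToken:N}";

// On click: look the token up; if found (and not expired), process it, then
// delete or archive the row so the link cannot be reused.
```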
I'm guessing you are saving these email addresses somewhere, so it's quite easy to make a secure identifier for each entry you have. Whether that is a hash or some encryption technique doesn't really matter, but I'd guess a hash is easier to implement and is actually meant for this job.
So you hash, for example, the email address, the PK value of the record, the timestamp of when it was added, and some really-impossible-to-guess salt. Just concatenate the various fields together and hash them.
In the end, you send nothing but the hashed key to the server. So when you send those two links, they could look as follows:
http://www.url.com/newsletter/acceptsubscription.aspx?id=x1r15ff2svosdf4r2s0f1
http://www.url.com/newsletter/cancelsubscription.aspx?id=x1r15ff2svosdf4r2s0f1
When the user clicks such a link, your server looks in the database for the record that contains the supplied key. Easy to implement, and really safe if done right. No way in hell someone can guess another person's key. Just bear in mind the standard things when doing anything with hashing, such as:
Do not forget to add salt.
Pick a really slow, and really secure, hashing algorithm.
Just make sure that no one can figure out their own hash, from information they can possess.
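Putting that together, a minimal sketch using HMAC-SHA256 with a server-side secret key in place of an ad-hoc salt (the field names and key handling are assumptions):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static string MakeToken(string email, int recordId, DateTime addedUtc, byte[] secretKey)
{
    // Concatenate the record's fields into a single payload string.
    string payload = $"{email}|{recordId}|{addedUtc:O}";
    using (var hmac = new HMACSHA256(secretKey))
    {
        byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(payload));
        // Hex-encode so the token is safe to put in a URL parameter.
        return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
    }
}
```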
If you are really scared of people doing bad things, make sure to stop brute-forcing by adding throttle control to the website. Only allow X requests per minute, for example, or add some form of banning on an IP address.
I'm not an expert at these things, so there might be room for improvement. However I think this should point you in the right direction.
Edit: I have to add that the solution provided by Tim C is also good. GUIDs are indeed very useful for situations like these, and work effectively the same as my hashed solution above.
The issue is there is a database with around 20k customer records and I want to make a best effort to avoid duplicate entries. The database is Microsoft SQL Server 2005, the application that maintains that database is Microsoft Dynamics/SL. I am creating an ASP.NET webservice that interacts with that database. My service can insert customer records into the database, read records from it, or modify those records. Either in my webservice, or through MS Dynamics, or in Sql Server, I would like to give a list of possible matches before a user confirms a new record add.
So the user would submit a record, if it seems to be unique, the record will save and return a new ID. If there are possible duplications, the user can then resubmit with a confirmation saying, "yes, I see the possible duplicates, this is a new record, and I want to submit it".
This is easy if it is just a punctuation or space difference (such as entering "Company, Inc." when there is a "Company Inc" in the database). But what if there are slight changes, such as "Company Corp." instead of "Company Inc", or a fat-fingered misspelling such as "Cmpany, Inc."? Is it even possible to return records like that in the list? If it's absolutely not possible, I'll deal with what I have. It just causes more work later on if records need to be merged due to duplications.
The specifics of which algorithm will work best for you depends greatly on your domain, so I'd suggest experimenting with a few different ones - you may even need to combine a few to get optimal results. Abbreviations, especially domain specific ones, may need to be preprocessed or standardized as well.
For the names, you'd probably be best off with a phonetic algorithm - which takes into account pronunciation. These will score Smith and Schmidt close together, as they are easy to confuse when saying the words. Double Metaphone is a good first choice.
For fat fingering, you'd probably be better off with an edit distance algorithm - which gives a "difference" between 2 words. These would score Smith and Smoth close together - even though the 2 may slip through the phonetic search.
T-SQL has SOUNDEX and DIFFERENCE - but they are pretty poor. A Levenshtein variant is the canonical choice, but there are other good choices - most of which are fairly easy to implement in C#, if you can't find a suitably licensed implementation.
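For reference, a minimal sketch of the classic dynamic-programming Levenshtein distance in C#:

```csharp
using System;

static int Levenshtein(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // i deletions
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // j insertions
    for (int i = 1; i <= a.Length; i++)
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(Math.Min(
                d[i - 1, j] + 1,         // deletion
                d[i, j - 1] + 1),        // insertion
                d[i - 1, j - 1] + cost); // substitution
        }
    return d[a.Length, b.Length];
}

// Levenshtein("Smith", "Smoth") == 1, so the pair would be flagged as close.
```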
All of these are going to be much easier to code/use from C# than T-SQL (though I did find double metaphone in a horrendous abuse of T-SQL that may work in SQL).
Though this example is in Access (and I've never actually looked at the code, or used the implementation) the included presentation gives a fairly good idea of what you'll probably end up needing to do. The code is probably worth a look, and perhaps a port from VBA.
Look into SOUNDEXing within SQL Server. I believe it will give you the fuzziness of probable matches that you're looking for.
SOUNDEX @ MSDN
SOUNDEX @ Wikipedia
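For illustration, a minimal sketch of calling it from C# via DIFFERENCE, which scores two strings' SOUNDEX codes from 0 to 4 (table, column, and parameter names are hypothetical, and `connection` is assumed to be an open SqlConnection):

```csharp
using System.Data.SqlClient;

// Surface probable duplicates before confirming the insert.
var cmd = new SqlCommand(
    @"SELECT CustomerId, CompanyName
      FROM Customers
      WHERE DIFFERENCE(CompanyName, @candidate) >= 3", connection);
cmd.Parameters.AddWithValue("@candidate", "Cmpany, Inc.");
// Show the returned rows to the user as possible matches; a score of 4
// means the two SOUNDEX codes match exactly.
```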
If it's possible to integrate Lucene.NET into your solution, you should definitely try it out.
You could try using Full Text Search with FreeText (or FreeTextTable) functions to try to find possible matches.