simple spell checking tool in C# - c#

What I'm trying to achieve is an input field where you type in how you think a word is spelled. The program then searches my text file, words.txt, for words with a similar spelling and shows the results in a new window.
Thanks in advance.

This is the one I have used, and it sounds like exactly what you want:
Make similar suggestions for input text by remembering old inputs
You can see it in action in the screen capture video here
P.S. I pre-populated a dictionary.dic file to suit in one instance, and in the above example I added some other rules around LogParser's SQL-like syntax to provide IntelliSense. HTH
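If you would rather roll your own, here is a minimal sketch of the "find similarly spelled words" part using Levenshtein edit distance. This is not what the linked project does as far as I know; the class name SpellSuggester, the distance threshold of 2 and the result limit of 10 are all assumptions picked for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class SpellSuggester
{
    // Classic dynamic-programming Levenshtein distance: the number of
    // single-character insertions, deletions and substitutions needed
    // to turn a into b.
    public static int Distance(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // Swap the two rows so "previous" always holds the last finished row.
            var tmp = previous; previous = current; current = tmp;
        }
        return previous[b.Length];
    }

    // Returns the closest words, best match first.
    // maxDistance and maxResults are assumed defaults, tune them to taste.
    public static List<string> Suggest(string input, IEnumerable<string> words,
                                       int maxDistance = 2, int maxResults = 10)
    {
        string needle = input.Trim().ToLowerInvariant();
        return words
            .Select(w => w.Trim())
            .Where(w => w.Length > 0)
            .Select(w => new { Word = w, Score = Distance(needle, w.ToLowerInvariant()) })
            .Where(x => x.Score <= maxDistance)
            .OrderBy(x => x.Score)
            .ThenBy(x => x.Word, StringComparer.OrdinalIgnoreCase)
            .Take(maxResults)
            .Select(x => x.Word)
            .ToList();
    }
}
```

You would call it with something like SpellSuggester.Suggest(inputBox.Text, File.ReadLines("words.txt")), where inputBox is whatever your input field is called, and bind the returned list to a list box in the results window. For a large word list, scanning every word on each lookup may be slow, so you might want to load the file once and keep it in memory.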

Related

Using Irony with C# to convert a search string into SQL Full text index query

I have a search box where users can enter text, when they hit search the text they entered will then be used in a SQL CONTAINSTABLE statement. I need to parse the string so that it is in an appropriate format for the CONTAINSTABLE function, and I have found an example that uses Irony that almost does exactly what I need. I got the Irony sample class here:
http://irony.codeplex.com/SourceControl/latest#Irony.Samples/FullTextSearchQueryConverter/SearchGrammar.cs
which is actually designed for the SQL CONTAINS function, but the differences between that and CONTAINSTABLE aren't a problem for me at the minute. I made a slight modification in that I didn't want the 'Inflectional' behaviour, so I changed any references to that to be 'Exact'.
The problem I am having now is that I want a search phrase to be treated as a phrase, rather than as a list of keywords separated by an AND operator. So for example, if a user enters: "General manager", then I want it to come through the parser as "General manager", but it's currently bringing back "General" AND "manager".
I think I need to modify the constructor somehow, where it is building all the expression rules - but I'm not even sure where to start on that!
Any help greatly appreciated, thanks.

Example using DiffMatchPatch

I hope I am not breaking any rules here. I have a question about another post, but I am not a big user on stackoverflow, so my reputation is too low to add a comment to questions or answers that are not my own.
On this question: How to compare two rich text box contents and highlight the characters that are changed?
TaW provided some sample C# code and we have made use of that in a Visual Studio project. But, we discovered a problem and don't know how to fix it.
If RTB1 contains the text "My name is David" and RTB2 contains the text "My name is", then after the comparison is run there are two diffs in the diffs collection and somehow, when the rich text boxes are rewritten to show the differences, RTB1 is an exact match of RTB2 and nothing is highlighted. Maybe this is the expected behavior and we just are not realizing that, but we were hoping that the text " David" in RTB1 would be highlighted.
If the text in RTB2 is "My name is " (two added spaces at the end of the line), then we get the expected behavior.
I should have mentioned that we wrote a VB.NET equivalent of TaW's C# code and just noticed a difference. I have noted the difference in the comments.
If I was up to 50 reputation, I would have also added in my comment that we are very thankful to TaW for sharing his example, as well as the creator of DiffMatchPatch.
I think we figured out the problem. In our project we are using VB.NET, and we are fairly certain we translated correctly from C# to VB. However, in the collectChunks function in C#, you are comparing RTB and RTB2 as objects, not the Text property within the objects. So for instance, when you compare RTB and RTB2, even though the text in the two text boxes being compared is equal, your code is comparing the objects, and all their other associated properties, including the text box positions. Therefore, the first == is always false.
In VB, we are not allowed to do an object comparison, i.e. we are not allowed to use RTB = RTB2; we must use RTB.Text = RTB2.Text in the If statement. (There is a way to compare the RTB objects in VB, but I am guessing that what really needs to be compared is the Text property of the RTB and RTB2 objects.) If this is the case, is it possible that the results you got were based on an assumption that the text in the text boxes was being compared? And perhaps that assumption led you to code the way you decided to stay in or jump out of the for loop?
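For reference, a small sketch of what == does on two class instances in C#. When the type does not overload the operator (RichTextBox does not), == is a reference comparison, roughly what VB's Is operator does, so it answers "are these the same object?" and does not look at properties at all. BoxStandIn below is a hypothetical stand-in so the snippet runs without WinForms; whether the reference comparison is what TaW's code intended is not something this sketch can tell you.

```csharp
using System;

// Hypothetical stand-in for a control such as RichTextBox:
// a plain class with a Text property and no overloaded == operator.
public class BoxStandIn
{
    public string Text { get; set; } = "";
}

public static class ComparisonDemo
{
    // Reference comparison: true only when a and b are the very same object.
    public static bool SameObject(BoxStandIn a, BoxStandIn b)
    {
        return a == b;
    }

    // Content comparison: true when the two Text strings are equal.
    public static bool SameText(BoxStandIn a, BoxStandIn b)
    {
        return a.Text == b.Text;
    }
}
```

With two separate boxes that both contain "My name is", SameObject returns false and SameText returns true. Passing the same box twice makes SameObject return true, which is how a method that receives one of two boxes can tell which one it was given.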

lucene.net/examine weight html tags

I've got this project where we are implementing Examine / Lucene.net, and I'm looking for some guidance from you guys.
As far as I have been able to find out from Google, if I want to boost the weight, I need to boost the weight on the field, right?
But could I get something like this: is it possible to give a boost to a term if the term is inside an h1 tag, or the title for that matter, when given a complete site's HTML and doing a frequent-term search?
What I would like to do is make a service that takes an HTML document and works out which words the document is optimised for, depending on which terms are used in the text and whether they are in the important places, like a title tag or an h2 tag and so forth.
Is this possible to achieve? The aim is that editors can see, live, which search words would best find what they are writing.
Big thanks in advance.
I don't think it quite works that way. Yes, you can boost a field, but you cannot boost a term depending on its location in some markup, because you don't know that at the time of the search.
I think what you could do is create an Umbraco event handler that fires when a page is published. This event could:
Utilise the GatheringNodeData event of an Index
Take the contents of the rich text editor-based field and, using regex or something like HtmlUtility, extract specific text based upon its markup location, e.g. H1, H2 and H3 text.
For each piece of text in a heading found, add it into a string variable
Add the whole string into the Lucene index as a new field, e.g. "Headings"
You can now boost on the "Headings" field separately from the field containing the HTML.
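The extraction step could be sketched roughly like this, using plain regex. HeadingExtractor is a made-up name, and the code deliberately stops at producing the string: how you add it to the index inside GatheringNodeData depends on your Examine version, so that part is left out. Regex is fragile against malformed HTML, so treat this as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class HeadingExtractor
{
    // Matches <h1>...</h1> through <h3>...</h3>, allowing attributes on the opening tag.
    private static readonly Regex HeadingPattern = new Regex(
        @"<h([1-3])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any tag, used to strip inline markup such as <em> inside a heading.
    private static readonly Regex TagPattern = new Regex(@"<[^>]+>");

    // Returns the text of all H1-H3 headings joined into one space-separated
    // string, ready to be stored in a separate field such as "Headings".
    public static string ExtractHeadings(string html)
    {
        var parts = new List<string>();
        foreach (Match m in HeadingPattern.Matches(html ?? ""))
        {
            string inner = TagPattern.Replace(m.Groups[2].Value, " ");
            string text = Regex.Replace(WebUtility.HtmlDecode(inner), @"\s+", " ").Trim();
            if (text.Length > 0) parts.Add(text);
        }
        return string.Join(" ", parts);
    }
}
```

For input such as <h1 class="t">Spell <em>checking</em></h1><p>body</p><h2>Tools</h2> this returns "Spell checking Tools".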

How to parse through data efficiently

I am wondering if anyone can help me out with parsing out data for key words.
Say I am looking for this keyword: My Example Yo (this is one of many keywords).
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these phrases, they could be in any case, and there may be nothing after them or, as in the examples above, something may follow.
A few ideas came to mind.
store all combinations that I can possible think of in my database then use contains
The downside with this is that I am going to have a huge database table with every combination of everything I need to find. I will then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine what category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them I still picture this will be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.
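Short of a full-text engine, the normalisation idea from the question (strip special characters, collapse spaces, ignore case) could be sketched like this. KeywordMatcher and its method names are made up for illustration, and the character class only keeps ASCII letters and digits, which is an assumption that would need widening for other alphabets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordMatcher
{
    // Lower-cases the text and turns every run of non-alphanumeric
    // characters (dashes, '#', multiple spaces, ...) into a single space.
    public static string Normalize(string text)
    {
        string lowered = (text ?? "").ToLowerInvariant();
        return Regex.Replace(lowered, @"[^a-z0-9]+", " ").Trim();
    }

    // True when the whole normalised keyword appears in the data as a run of whole words.
    public static bool ContainsKeyword(string data, string keyword)
    {
        string haystack = " " + Normalize(data) + " ";
        string needle = " " + Normalize(keyword) + " ";
        return needle.Length > 2 && haystack.Contains(needle);
    }

    // Fraction of the keyword's words that occur anywhere in the data (0.0 to 1.0).
    public static double MatchRatio(string data, string keyword)
    {
        var separators = new[] { ' ' };
        var dataWords = new HashSet<string>(
            Normalize(data).Split(separators, StringSplitOptions.RemoveEmptyEntries));
        string[] keyWords = Normalize(keyword).Split(separators, StringSplitOptions.RemoveEmptyEntries);
        if (keyWords.Length == 0) return 0.0;
        int hits = keyWords.Count(w => dataWords.Contains(w));
        return (double)hits / keyWords.Length;
    }
}
```

With the sample rows, "my-example-yo #108" and "MY EXAMPLE YO #108" both contain the keyword "My Example Yo", while "MY Example #108" does not but scores a match ratio of two thirds. Because the keyword is normalised the same way as the data, keywords that themselves contain dashes are handled too. Normalising each keyword once up front, rather than per row, should help when checking thousands of rows.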

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed markings that indicate something you may wish to strip. You can look for these lines using regular expressions. I doubt you can really "sanitize" your emails well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
An HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do an HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
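The quote and signature rules above could be sketched like this for a plain-text body. It assumes the MIME handling has already been done and you are holding the text/plain part as a string; PlainTextMailCleaner is a made-up name.

```csharp
using System;
using System.Collections.Generic;

public static class PlainTextMailCleaner
{
    // Drops quoted lines and everything from the signature delimiter onwards.
    public static string StripQuotesAndSignature(string body)
    {
        var kept = new List<string>();
        string[] lines = (body ?? "").Replace("\r\n", "\n").Split('\n');

        foreach (string line in lines)
        {
            // "-- " starts the signature; also accept "--" because the
            // trailing space is often lost in transit.
            if (line == "-- " || line == "--") break;

            // A leading ">" marks quoted text, except ">From ", which is an
            // escaped "From " line and belongs to the message itself.
            bool isQuote = line.StartsWith(">", StringComparison.Ordinal)
                        && !line.StartsWith(">From ", StringComparison.Ordinal);
            if (isQuote) continue;

            kept.Add(line);
        }
        return string.Join("\n", kept).Trim();
    }
}
```

This leaves the escaped ">From " line as it is, and it will not catch "On such a date, X wrote:" introductions or top-posted replies quoted without ">" markers, so expect to add rules as you meet real mail.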
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
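The "library of disclaimers" first cut could look something like this in C#. It is only a sketch: DisclaimerStripper is a made-up name, the matching is exact apart from letter case, and real disclaimers vary in whitespace and wording enough that you would probably need to normalise both sides first.

```csharp
using System;
using System.Collections.Generic;

public static class DisclaimerStripper
{
    // Repeatedly removes any known disclaimer found at the very end of the
    // body, so stacked disclaimers come off one after another.
    public static string StripTrailingDisclaimers(string body, IList<string> knownDisclaimers)
    {
        string result = (body ?? "").TrimEnd();
        bool removed = true;
        while (removed)
        {
            removed = false;
            foreach (string disclaimer in knownDisclaimers)
            {
                string d = (disclaimer ?? "").Trim();
                if (d.Length > 0 && result.EndsWith(d, StringComparison.OrdinalIgnoreCase))
                {
                    result = result.Substring(0, result.Length - d.Length).TrimEnd();
                    removed = true;
                }
            }
        }
        return result;
    }
}
```

The list of known disclaimers would be loaded from wherever you keep your library, for example one text file per disclaimer.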
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email, and in that class I'd have a property called Raw and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);
