Hi,
I have the following problem: the following text is in a rich text box.
The world is [[wonderful]] today .
If the user puts two brackets before and after a word, as with wonderful here, the bracketed word should turn into a link (in green).
I am having trouble tracking the sequence of keystrokes, i.e. how do I know that the user has entered [[, so that I can start parsing the text that follows it?
I can do it by handling the KeyDown event and keeping a list, but that does not look elegant at all.
Please let me know what the proper way would be.
Thanks,
Sujay
You have two approaches that I can think of off-hand.
One is, as you suggest, maintain the current state with a list—was this key a bracket? was the last key a bracket?—and update on the fly.
The other approach would be to simply handle the TextChanged event and re-scan the text for the [[text-here]] pattern and update as appropriate.
The first requires more bookkeeping but will be much faster for longer text. The second approach is easier and can probably be done with a decent regex, but it will get slower as your text gets longer. If you know you have some upper limit, like 256 characters, then you're probably fine. But if you're expecting novels, probably not such a great idea.
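A minimal sketch of the second approach (re-scan everything on each change), written in Python rather than C# purely for illustration. The names LINK_PATTERN and find_links are made up for this example, and it assumes a link label may not itself contain brackets:

```python
import re

# Matches [[label]] where the label contains no square brackets (an assumption).
LINK_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")

def find_links(text):
    """Return (start, end, label) for every [[label]] found in text."""
    return [(m.start(), m.end(), m.group(1)) for m in LINK_PATTERN.finditer(text)]

sample = "The world is [[wonderful]] today ."
for start, end, label in find_links(sample):
    # A real handler would recolor text[start:end] here instead of printing.
    print(start, end, label)
```

The scan is linear in the length of the text, which is why it only becomes a concern for very long documents.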
I would recommend Googling "richtextbox syntax highlighter"; many people have done this, and there is a lot going on behind the scenes to make it work.
I dare say that EVERY SINGLE simple solution has major drawbacks. The proper way would be to use a control that already does this kind of "syntax highlighting" and extend it to your syntax. It is also most likely the easiest way.
You can search for free .NET controls on CodePlex.
I would try handling KeyDown and checking for the closing bracket "]" instead. Once you receive one, you could check whether the last character in your text box is also a ], and if it is, just replace the last few characters.
This eliminates the need for maintaining state (i.e. the list). As soon as the second ] is typed, the block changes to a link instantly.
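Roughly, the check described above could look like the following Python sketch. The function name link_span_ending_at and the caret parameter are hypothetical; a real handler would take the caret position from the control. It looks backwards from the caret only when the two characters before it are "]]":

```python
def link_span_ending_at(text, caret):
    """If text[:caret] ends with a complete [[label]], return its (start, end).

    Returns None when there is no closing ']]' right before the caret, no
    opening '[[' before that, or the label is empty or contains brackets.
    """
    if caret < 2 or text[caret - 2:caret] != "]]":
        return None
    start = text.rfind("[[", 0, caret - 2)
    if start == -1:
        return None
    label = text[start + 2:caret - 2]
    if not label or "[" in label or "]" in label:
        return None
    return (start, caret)

text = "The world is [[wonderful]]"
print(link_span_ending_at(text, len(text)))
```

Because it inspects the text rather than remembering keystrokes, it gives the same answer whether the brackets were typed, pasted, or entered out of order.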
Keeping a list will be rather complex, I think. What if the user types a '[' character, clicks somewhere else in the text and then types a '[' character again? The user has then typed two consecutive '[' characters, but in completely different parts of the text. You may also want to handle text inserted from the clipboard.
I think the safest way is to analyze the full text and do what should be done from that context, using RegEx or some other technique.
(Sorry, don't have enough reputation to add comments yet, so have to add a new answer). As suggested by jeffamaphone I'd handle the TextChanged event and rescan the text each time - but to keep the cost constant, just scan a few characters ahead of the current cursor position instead of reading the entire text.
Trying to intercept the keystrokes and maintain an internal state is a bad approach: it is very easy for your idea of what has happened to get out of sync with the control you are monitoring and cause weird problems. (And how do you handle clicks? Alt-tab? Pastes? Arrow keys? Other applications grabbing focus? Too many special cases to worry about...)
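To illustrate keeping the cost constant, here is a small Python sketch that only scans a window around the caret. The radius of 64 characters and the names are arbitrary assumptions, and a link that is longer than the window or that straddles its edge would be missed:

```python
import re

LINK_PATTERN = re.compile(r"\[\[([^\[\]]+)\]\]")

def find_links_near(text, caret, radius=64):
    """Scan only text[caret - radius : caret + radius] for [[label]] blocks.

    Returned offsets are relative to the full text, not to the window.
    """
    lo = max(0, caret - radius)
    hi = min(len(text), caret + radius)
    window = text[lo:hi]
    return [(lo + m.start(), lo + m.end(), m.group(1))
            for m in LINK_PATTERN.finditer(window)]
```

The work per change depends on the radius rather than on the document length, at the price of the edge cases noted above.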
Is there a way to create a regex that will ensure that five out of eight characters are present, in order, in a given character range (say, 20 characters)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. the text originally said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text I have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely find the end of a capture group. As an aside, I am currently having trouble getting the all.css file, so commenting is impossible temporarily.
I think what you need is not a regex but a database of all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes on both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. If you use SQLite, soundex() is built in, although only when the library is compiled with the SQLITE_SOUNDEX option.
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go: this way you can get back the correct word.
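As a rough illustration of the lookup described above, here is a Python sketch with a hand-written American Soundex and an in-memory dictionary standing in for the indexed table. The function names are invented for this example, and database implementations of soundex() can differ in small details:

```python
from collections import defaultdict

# Standard Soundex digit groups; vowels, y, h and w have no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit

def soundex(word):
    """Return the 4-character Soundex code of word ('' for no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (letters[0].upper() + "".join(digits) + "000")[:4]

def build_index(words):
    """Map each Soundex code to the dictionary words that share it."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index

index = build_index(["misspelling", "world"])
print(soundex("mshpeln"), index[soundex("mshpeln")])
```

Several unrelated words can share one code, so the lookup returns candidates rather than a single guaranteed answer.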
But this would not look anything like regex - sorry.
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for one or two people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense, collection of guaranteed to be unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add erroneous spaces, then consider two words at a time instead of single words, and so on, depending on your requirements.
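The three steps above can be sketched in Python as follows. The names and the ASCII-only word split are assumptions made for the example. Note that this checks which letters are shared, not their order, so it is looser than the "in order" requirement in the question:

```python
import re

def letter_set(word):
    """Unique letters of a word, lowercased."""
    return {c for c in word.lower() if c.isalpha()}

def candidates(target, text, min_shared=5):
    """Words in text sharing at least min_shared distinct letters with target."""
    target_letters = letter_set(target)
    words = re.findall(r"[A-Za-z]+", text)  # step 2: split into words
    return [w for w in words
            if len(letter_set(w) & target_letters) >= min_shared]  # step 3

print(sorted(letter_set("misspelling")))
print(candidates("misspelling", "the mshpeln was bad"))
```

Short targets or common letters will produce many false positives, which matches the tolerance for false positives stated in the question.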
I have no solution for this problem; in fact, here is exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can range anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable threshold.
Let nello world be the first guess of "hello world", which is quite similar. Then, with another font that is written in "painful" yellow or something, a second guess is noiio verio for the same expression. How should a computer know that this word would have been similar if it was better recognized?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures
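One common string similarity measure is the Levenshtein edit distance, which was also mentioned earlier in this thread. A minimal Python sketch of the textbook dynamic-programming version, not taken from the reference above:

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]
```

Picking the acceptable distance is exactly the "maximum error" question raised above; the code only measures the distance and does not decide what counts as a match.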
I am wondering if anyone can help me out with parsing out data for key words.
Say I am looking for this keyword: My Example Yo (this is one of many keywords).
I have data like this
MY EXAMPLE YO #108
my-example-yo #108
my-example #108
MY Example #108
These are just a few combinations. There could be words or numbers in front of these sentences, they could be in any case, and there may be nothing after them or, as in the examples above, something after them.
A few ideas came to mind.
Store all combinations that I can possibly think of in my database, then use contains.
The downside is that I would end up with a huge database table holding every combination of everything I need to find. I would then have to load the data into memory (through NHibernate) and check every combination. I am trying to determine what category to use based on the keyword, and users can upload thousands of rows to check.
Even if I load subsets and look through them I still picture this will be slow.
Remove all special characters and make single spaces and ignore case and try to use regex to see how much of the keyword matches up.
Not sure what to do if the keyword has special characters like dashes and such.
I know I will not get every combination out there but I want to try get as many as I can.
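The second idea (normalize, then see how much of the keyword matches) might look like this in Python. This is only a sketch: the function names are invented, it treats everything except ASCII letters and digits as a separator, and it scores by how many leading keyword words appear as a consecutive run:

```python
import re

def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics into one space."""
    return re.sub(r"[^0-9a-z]+", " ", text.lower()).strip()

def prefix_match_score(keyword, text):
    """Fraction of the keyword's leading words found consecutively in text."""
    kw = normalize(keyword).split()
    words = normalize(text).split()
    if not kw:
        return 0.0
    best = 0
    for start in range(len(words)):
        n = 0
        while (n < len(kw) and start + n < len(words)
               and words[start + n] == kw[n]):
            n += 1
        best = max(best, n)
    return best / len(kw)

for row in ["MY EXAMPLE YO #108", "my-example-yo #108", "my-example #108"]:
    print(row, prefix_match_score("My Example Yo", row))
```

Because dashes are turned into spaces on both sides, a keyword that itself contains dashes is normalized the same way as the data it is compared against.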
Have you considered Lucene.Net? I haven't used it myself, but I hear it's a great tool for full text searching. It might do well with keyword searching too. I believe that stackoverflow uses Lucene.
I'm working on a text entry application that uses regular expressions to validate user input. The goal is to allow keypresses that fit a certain RegEx while rejecting invalid characters. One issue I've run into is that when a user starts inputting information they may create a string that doesn't yet match the given regex, but could cause a match in the future. These strings get erroneously rejected. Here's an example - given the following regex for inputting date information:
(0?[1-9]|10|11|12)/(0?[1-9]|[12]\\d|30|31)/\\d{2}\\d{2}
A user may begin entering "1/" which could be a valid date, but RegEx.IsMatch() will return false and my code ends up rejecting the string. Is there a way to "optimistically" test strings against a regular expression so that possible or partial matches are allowed?
Bonus: For this RegEx in particular there are some sequences which cause required characters. For example, if the user types "2/15" the only possible valid character they could enter next is "/". Is it possible to detect those scenarios so that the required characters could be automatically entered for the user to ease input?
What you can do is anchor your regex (i.e. add ^ and $, for start and end of line) and make each component optional for validation, but strictly defined if present.
Something looking like this:
^(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\\d|30|31)(/(\\d{2}(\\d{2})?)?)?)?)?$
I do realize it looks horrible but as far as I know there is no way to tell the regexp engine to validate as long as the string satisfies the beginning of the regexp pattern.
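To see what the nested-optional pattern accepts, here is a small Python check (single backslashes, since Python raw strings need no doubling). As written, that pattern still rejects some legitimate prefixes, such as a one-digit year ("2/15/2") or a leading zero typed on its own ("1/0"). PREFIX_PATTERN is my own hypothetical variant that tries to accept every prefix of a valid date; it is not part of the answer above:

```python
import re

# The pattern from the answer above, without the anchors (fullmatch adds them).
ANSWER_PATTERN = re.compile(
    r"(0?[1-9]|10|11|12)(/((0?[1-9]|[12]\d|30|31)(/(\d{2}(\d{2})?)?)?)?)?")

# Hypothetical variant: a lone "0" is allowed only as the last character typed
# so far, and the year may have anywhere from 0 to 4 digits.
PREFIX_PATTERN = re.compile(
    r"(?:0|(?:0?[1-9]|1[0-2])"
    r"(?:/(?:0|(?:0?[1-9]|[12]\d|3[01])(?:/\d{0,4})?)?)?)?")

def answer_accepts(s):
    return ANSWER_PATTERN.fullmatch(s) is not None

def could_become_date(s):
    return PREFIX_PATTERN.fullmatch(s) is not None

for typed in ["1/", "2/15", "2/15/2", "2/15/2024", "13", "0/"]:
    print(repr(typed), answer_accepts(typed), could_become_date(typed))
```

A final check against the original, strict pattern is still needed when the user finishes typing, because a valid prefix is not necessarily a complete date.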
In my opinion, the best way to achieve what you want is to create separate inputs for day, month, and year and check their values when leaving each text field.
It also gives better visibility and user experience, as I believe no one likes being prevented from typing certain characters into a text field, with or without noticing them disappear as they type, or having slashes inserted automatically and without notice.
Have you ever used an app or form that worked that way, simply refusing to accept any keypress it didn't like? If the answer is yes, did it blow an electronic raspberry each time you pressed a wrong key?
If you really need to validate the input before the form is submitted, use a passive feedback mechanism like a red border around the text field that disappears when the regex matches the input. Also, make sure there's a Help button or a tooltip nearby to provide constructive feedback.
Of course, the best option would be to use a dedicated control like a date-entry widget. But whatever you do, don't do it in such a way that it feels like you're playing guessing games with the user.
In Eclipse, editing Java code, if I type an open-paren, I get a pair of parens. If I then "type through" the second paren, it does not insert an additional paren. How do I get that in emacs?
The Eclipse editor is smart enough to know, when I type the close-paren, that I am just finishing what I started. The cursor advances past the close paren. If I then type a semicolon, same thing: it just overwrites past the semicolon, and I don't get two of them.
In emacs, in java-mode, or csharp-mode if I bind open-paren to skeleton-pair-insert-maybe, I get an open-close paren pair, which is good. But then if I "type through" the close paren, I get two close-parens.
Is there a way to teach emacs to not insert the close paren after an immediately preceding skeleton-pair-insert-maybe? And if that is possible, what about some similar intelligence to avoid doubling the semicolon?
I'm asking about parens, but the same applies to double-quotes, curly braces, square brackets, etc. Anything inserted with skeleton-pair-insert-maybe.
This post shows how to do what you want. As a bonus it also shows how to set it up so that if you immediately backspace after the opening char, it will also delete the closing char after the cursor.
Update:
Since I posted this answer, I've discovered Autopair which is a pretty much perfect system for this use case. I've been using it a lot and loving it.
To summarize what I did, I looked at this post, and took what I wanted out of it. What I ended up with was simpler, because I didn't have the additional requirements he had.
I used these two new definitions:
(defvar cheeso-skeleton-pair-alist
  '((?\) . ?\()
    (?\] . ?\[)
    (?" . ?")))
(defun cheeso-skeleton-pair-end (arg)
  "Skip the char if it is an ending, otherwise insert it."
  (interactive "*p")
  ;; `last-command-event' replaces the obsolete `last-command-char',
  ;; which newer Emacs releases no longer define.
  (let ((char last-command-event))
    (if (and (assq char cheeso-skeleton-pair-alist)
             (eq char (following-char)))
        (forward-char)
      (self-insert-command (prefix-numeric-value arg)))))
And then in my java-mode-hook, I bound the close-paren and close-bracket this way:
(local-set-key (kbd ")") 'cheeso-skeleton-pair-end)
(local-set-key (kbd "]") 'cheeso-skeleton-pair-end)
I use paredit-mode, which does this and a lot more.
ParEdit sounds like it would handle the parenthesis part of your need, with the caveat that it was designed for Common Lisp and Scheme. Steve Yegge mentions JDEE for emacs Java development, but I can't speak for that from personal experience, and I couldn't find any documentation on it talking about structured editing.
Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may also be present, so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markings that flag content you may wish to strip. You can look for these lines using regular expressions. I doubt you can "sanitize" your emails really well, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
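Not C#, but as a rough Python illustration of the first two markers: the sketch below drops quoted lines and cuts everything from the signature delimiter onwards. The function name is made up. Accepting a bare "--" without the trailing space, and keeping lines that start with ">From " (mbox escaping rather than a quote), are assumptions about what real mail looks like:

```python
def strip_quotes_and_signature(body):
    """Remove '>'-quoted lines and everything after the '-- ' delimiter."""
    kept = []
    for line in body.splitlines():
        if line.rstrip(" ") == "--":
            break  # signature delimiter: ignore the rest of the message
        if line.startswith(">") and not line.startswith(">From "):
            continue  # quoted text from an earlier message
        kept.append(line)
    return "\n".join(kept).strip()

mail = "Hi Mark,\n> old text\n>From here on it is mine\nThanks\n-- \nJohn"
print(strip_quotes_and_signature(mail))
```

Top-posted replies that quote without ">" markers, such as an "On ... wrote:" block followed by unmarked text, are not handled by this.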
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
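The MIME handling in the first two points can be sketched with Python's standard email package (again only an illustration, not a C# answer). The helper name and the SAMPLE message are invented. get_body() skips attachments and, with this preference list, returns the text/plain part when there is one and falls back to text/html otherwise; converting HTML to text is left out:

```python
from email import message_from_string, policy

def preferred_text_part(raw_message):
    """Return (content_type, text) of the best body part, or (None, '')."""
    msg = message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    if part is None:
        return (None, "")
    return (part.get_content_type(), part.get_content())

# A tiny hand-written multipart/alternative message, for demonstration only.
SAMPLE = (
    "From: sender@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello\n"
    "--XX\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Hello</p>\n"
    "--XX--\n"
)
print(preferred_text_part(SAMPLE))
```

Image parts and other attachments are never returned by this helper, which corresponds to dropping every part that is not "text/*".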
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
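A first cut at the "library of disclaimers" idea, sketched in Python. The function name is hypothetical, and it only handles disclaimers that match a stored text exactly, apart from surrounding whitespace:

```python
def strip_disclaimers(body, known_disclaimers):
    """Repeatedly remove any known disclaimer found at the end of body."""
    text = body.rstrip()
    changed = True
    while changed:  # loop so that several stacked disclaimers all come off
        changed = False
        for disclaimer in known_disclaimers:
            d = disclaimer.strip()
            if d and text.endswith(d):
                text = text[:-len(d)].rstrip()
                changed = True
    return text

library = ["This email is confidential.", "Please think before you print."]
mail = "Hi\n\nThis email is confidential.\nPlease think before you print.\n"
print(strip_disclaimers(mail, library))
```

Exact matching breaks as soon as a mail client re-wraps or re-quotes the disclaimer, which is where something more sophisticated would come in.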
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into Regex to find text and replace it. To make the application a little nicer, I'd create a class called Email with a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
    FromEmailAddress = "john.smith@example.com",
    FromName = "John Smith",
    // Verbatim string literal (@"...") so the body can span several lines.
    TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);