Is it possible to programmatically 'clean' emails?

Is it possible to programmatically 'clean' emails? - c#

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?

In email, there is couple of agreed markings that mean something you wish to strip. You can look for these lines using regular expressions. I doubt you can't really well "sanitize" your emails, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.

A few obvious things to look at:
if the mail is anything but pure plain text, the message will be multi-part mime. Any part whose type is "image/*" (image/jpeg, etc), can probably be dropped. In all likelyhood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.

Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do
stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(

Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering desctructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bold in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."

If you creating your own application i'd look into Regex, to find text and replace it. To make the application a little nice, i'd create a class Called Email and in that class i have a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!

SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith#example.com",
FromName = "John Smith",
TextBody = #"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);

Related

Plain text search in markdown text

I am trying to write code (in C#) that can search for any plain-text word or phrase in a markdown file. Currently I'm doing this by a long-winded method: convert the markdown to HTML, strip HTML element tags out of the HTML text and then use a simple regular expression to search that for the word/phrase in question. Needless to say, this can be pretty slow.
A concrete example might show the problem. Say the markdown file contains
Something ***significant***
I would like to be able to find that by providing the search phrase something significant (i.e. ignoring the ***'s).
Is there an efficient way of doing this (i.e. that avoids the conversion to HTML) and doesn't involve me writing my own markdown parser?
Edit:
I want a generic way to search for any text or phrase in markdown text that contains any valid markdown formatting. The first answers were ways to match the specific text example I gave.
Edit:
I should have made it clear: this is required for a simple user-facing search and the markdown files could contain any valid markdown formatting. For this reason I need to be able to ignore anything in the markdown that the user wouldn't see as text if they converted the markdown to HTML. E.g. the markdown text that specifies an image (like ![Valid XHTML](http://w3.org/Icons/valid-xhtml10). should be skipped during the search). Converting to HTML produces decent results for the user because it then reasonably accurately reflects what a user sees (but it's just a slow solution, esp when there's a lot of markdown text to look through).

Use a regexp
var str = "Something ***significant***";
var regexp = new Regex("Something.+significant.+");
Console.WriteLine(regexp.Match(str).Success);

I want to do the same thing. I think of one way to achieve that.
Your method has two steps.
Get the plain text out of the markdown source (which has also two steps. Markdown->HTML and HTML->stripped to plain text)
Search within the plain text
Now, if the markdown source is persisted in a data store, then you may be able to also persist the plain text for search purposes only. So the step to extract the plain text from the markdown may be executed only once when persisting the markdown source (or every time the markdown source is updated), but the code that actually searches in the markdown could be executed immediately on the already persisted plain text data as many times as you want.
For example, if you have a relational DB with a column like markdown_text, you could also create a plain_text column and recreate its value every time the markdown_text column is changed.
Users won't bother if saving their markdown takes a few milliseconds (or even seconds) more than before. Users tend to feel safe when something that alters the system's state takes some time (they feel that something is actually happening in the system), rather than happen immediately (they feel that something went wrong and their command did not execute). But they will feel frustrated if searching took more than a few ms to complete. In general users want queries to complete immediately but commands to take some time (not more than a few seconds though).

Try this:
string input = "Something ***significant***";
string v = input.Replace("***", "");
Console.WriteLine(v)
look this example: enter link description here

Is XSS possible through the MailAddress class?

Considering I parse user input, which is supposed to be an email address, into the MailAdress class:
var mailString = Request.QueryString["mail"];
var mail = new MailAddress(mailString);
Is there any possibility left for a cross-site-scripting attack if I output the MailAddress object later in any way? For example through a Literal control in WebForms:
litMessage.Text = "Your mail address is " + mail.Address;
Is it necessary to sanitize the outpout even though I made sure that the address is a valid email address by parsing the string?
From what I could gather the RFC for mail addresses is pretty complicated, so I am unsure if cross site scripts can be hidden in a mail address considered valid by .NET.
EDIT:
MSDN says that > and < brackets are allowed in an email address:
The address parameter can contain a display name and the associated e-mail address if you enclose the address in angle brackets. For example: "Tom Smith <tsmith#contoso.com>"
So the question remains if this is enough for an XSS attack and/or if the MailMessage class does anything to escape dangerous parts.

Generally speaking, you shouldn't need to validate the output later. However, I always recommend that you do so for the following reasons:
There may be a hole somewhere in your app that doesn't validate the input properly. This could be discovered by an attacker and used for XSS. This is especially possible when many different devs are working on the app.
There may be old data in the database that was stored before implementing/updating your filter on the input. This could contain malicious code that could be used for XSS.
Attackers are very clever and can usually figure out a way to beat a filter. Microsoft puts a lot of attention on preventing this, but it's never going to perfect. It makes the attackers job that much harder if they face and outgoing filter as well and as incoming filter.
I know it's a pain to constantly filter, but there is a lot of value in doing so. A Defense-in-Depth strategy is necessary in today's world.
Edit:
Sorry I didn't really answer the second part of your question. Based on the documentation I don't get the impression that the API is focused on sanitizing as much as it is on verifying valid formatting. Therefore I don't know that it is safe to rely on it for security purposes.
However, writing your own sanitizer isn't terribly hard, and you can update it immediately if you find flaws. First run the address through a good RegEx filter (see: Regex Email validation), then recursively remove every nonvalid character in an email address (these shouldn't get through at this point but do this for comprehensiveness and in case you want to reuse the class elsewhere), then escape every character with HTML meaning. I emphasize the recursive application of the filter because attackers can take advantage of a non-recursive filter with stuff like this:
<scr<script>ipt>
Notice that a nonrecursive filter would remove the middle occurence of <script> and leave the outer occurrence in tact.

Is it necessary to sanitize the outpout
You don't 'sanitise' output, you encode it. Every string that you output into an HTML document needs to be HTML-encoded, so if there was a < character in the mail address it wouldn't matter - you'd get < in the HTML source as a result and that would display correctly as a literal < on the page.
Many ASP.NET controls automatically take care of HTML-escaping for you, but Literal does not by default because it can be used to show markup. But if you set the Mode property of the Literal control to Encode then setting the Text like you're doing is perfectly fine.
You should make sure you always use safe HTML-encoded output every time you put content into an HTML page, regardless of whether you think the values you're using will ever be able to include a < character. This is a separation-of-concerns issue: HTML output code knows all about HTML formatting, but it shouldn't know anything about what characters are OK in an e-mail address or other application field.
Leaving out an escape because you think the value is 'safe' introduces an implicit and fragile coupling between the output stage and the input stage, making it difficult to verify that the code is safe and easy to make it unsafe when you make changes.

Complex String Processing - well complex to me

I am calling a web service and all I get back is a giant blob of text. I am left to process it myself. Problem is not all lines are necessarily the same. They each have 2 or 3 sections to them and they are similar. Here are the most common examples
text1 [text2] /text3/
text1/test3
text1[text2]/text3
text1 [text2] /text /3 here/
I am not exactly sure how to approach this problem. I am not too good at doing anything advanced as far as manipulating strings.
I was thinking using a regular expression might work, but not too sure on that either. If I can get each of these 3 sections broken up it is easier from there to do the rest. its just there doesn't seem to be any uniformity to the main 3 sections that I know how to work with.
EDIT: Thanks for mentioning i didn't actually say what I wanted to do.
Basically, I want to split these 3 sections of text into their own strings seperate stings so basically take it from one single string to an array of 3 strings.
string[0] = text1
string[1] = text2
string[2] = text3
Here is some of the text I get back from a call as an example
スルホ基 [スルホき] /(n) sulfo group/
鋭いナイフ [するどいナイフ] /(n) sharp knife/
鋭い批判 [するどいひはん] /(n) sharp criticism/
スルナーイ /(n) (See ズルナ) (obsc) surnay (Anatolian woodwind instrument) (per:)/zurna/
スルピリン /(n) sulpyrine/
スルファミン /(n) sulfamine/
剃る [そる(P);する] /(v5r,vt) to shave/(P)/
As the first line for an example I want to pull it out into an array
string[0] = スルホ基
string[0] = [スルホき]
string[0] = /(n) sulfo group/

Those example seem a bit random, there has to be some kind of order, isn't there a spec for the service? If not i suggest more example so that we can understand the rules.

Read up on some of the info here on finite state machines, and see if you can use some of the concepts on your input parsing problem.
If there is some order to the groups on each line, then maybe you can use a regex to separate the groups out.
Edit: after seeing your samples, you may get by with a regex, breaking on some of those specific delimiters. It will take maybe half an hour to test theory: pick yourself up a free regex tester, make yourself a regex that will isolate out just one of those groups, and pump a few sample lines through. If it performs reliably on the real data that you have, then expand it and see if you can also isolate out the other groups.
I should mention though that your regexes will break or just become a nightmare if there is any sort of vagaries in your data (and frequently there is). So test long and hard before settling on them. If you find you start to have exceptions in your data, then you will need to choose some sort of parsing algorithm (the FSM i mentioned above is a pattern you can follow if you implement a parsing mechanism).

The most stupid answer is "Use regex". But more information needed for better one.

What Encoding is this: <ESC>[00p<ESC>(125901/26/1011.05<CR>

I've got a .txt file given to me to parse through to pull out certain information and i'm not really wanting to write a scanner to do this. It resembles ANSI to me, with maybe a little more added to it. I don't know. It's automatic output from some hardware that's years and years old. Here is some more just to get a good idea of what i'm dealing with and what the output needs to look like.
<ESC>[00p<ESC>(1*259*01/26/10*11.05*<CR>
<ESC>[05pEJ LOG COPIED OK 247C0200 <CR>
<FF><ESC>[05p*3094*1*R*09<CR>
<ESC>[00p<ESC>(1*260*01/26/10*11.07*<CR>
<ESC>[05pSUPERVISOR MODE EXIT <CR>
Expected output:
*259*01/26/10*11.05*
EJ LOG COPIED OK 247C0200
*3094*1*R*09
*260*01/26/10*11.07*
SUPERVISOR MODE EXIT
Like I said, This is just a little bit in pages and pages of it. Could be ANSI I'm not definite. If I've left out some critical info let me know. I'm coding in C# btw. I would include the name/model of the device but I don't know it. Thanks!

That looks like to me a Electronic Journal of some cash register machine - where the log of the sales transactions were downloaded from...not sure which machine though - some of them are capable of being communicated via serial, by using the escape codes to trigger the opening of the log from the Electronic Journal - I am reasoning it, as I have seen EJ being used...could have been a Samsung Cash register....
Hope this helps,
Best regards,
Tom.

This is message for TELOCATOR ALPHANUMERIC PROTOCOL (TAP).
You can read it's description in this document or in the following article.

Try something like this:
string input = #"
<ESC>[00p<ESC>(1*259*01/26/10*11.05*<CR>
<ESC>[05pEJ LOG COPIED OK 247C0200 <CR>
<FF><ESC>[05p*3094*1*R*09<CR>
<ESC>[00p<ESC>(1*260*01/26/10*11.07*<CR>
<ESC>[05pSUPERVISOR MODE EXIT <CR>";
foreach (Match m in Regex.Matches(input,
#"(?:(?:<FF>)?(?:<ESC>[\[\(](?:\d{2}p|\d\*))+)(?<output>.*)",
RegexOptions.Multiline))
{
Console.WriteLine(m.Groups["output"].Value);
}
You'll need to replace:
<ESC> by \x1B
<FF> by \xFF
<CR> by \x0D

It looks as thought most of the 'tags' are the same. If it's a one time shot, you could just do a search/replace in a text editor to remove <ESC>, <CR>, [00p, <FF> and [05p rather than writing code to do it? Of course you only showed a snippet so perhaps there are a ton of different tags to remove...

This looks to me being very similar to ANSI Escape sequences. Searching for it will give you plenty of results. This paper might give you further insight in the ANSI standards.
What you are looking for is a parser which can read those code sequences. Here is a parser written in C which claims to remove the control sequences from an ANSI sequence input. Maybe you want to give it a try.

Capturing Keyboard strokes in C#

HI,
I have the following problem- the following text is in a rich text box .
The world is [[wonderful]] today .
If the user provides two brackets before and afer a word, as in the case of wonderful , the word in brackets, in this case, wonderful shall change to a link, ( with a green colour ) .
I am having problems in getting the sequence of the keystrokes, ie. how do I know that the user has entered [[ , so I can start parsing the rest of the text which follows it .
I can get it by handlng KeyDown, event, and a list , but it does not look to be elegant at all.
Please let me know what should be a proper way.
Thanks,
Sujay

You have two approaches that I can think of off-hand.
One is, as you suggest, maintain the current state with a list—was this key a bracket? was the last key a bracket?—and update on the fly.
The other approach would be to simply handle the TextChanged event and re-scan the text for the [[text-here]] pattern and update as appropriate.
The first requires more bookkeeping but will be much faster for longer text. The second approach is easier and can probably be done with a decent regex, but it will get slower as your text gets longer. If you know you have some upper limit, like 256 characters, then you're probably fine. But if you're expecting novels, probably not such a great idea.

I would recommend Google'ing: "richtextbox syntax highlighter", there are so many people that have done this, and there is a lot behind the scenes to make it work.
I dare myself to say, that EVERY SINGLE simple solution have major drawbacks. Proper way would be to use some control that already does this "syntax highlighting" and extending it to your syntax. It is also most likely the easiest way.
You can search free .net controls in Codeplex. link

I would try handling the KeyDown, and checking for the closing bracket instead "]". Once you receive one, you could check the last character in your text box for the second ], and if it's there, just replace out the last few characters.
This eliminates the need for maintaining state (ie: the list). As soon as the second ] was typed, the block would change to a link instantly.

Keeping a list will be rather complex I think. What if the user types a '[' character, clicks somewhere else in the text and then types a '[' character again. The user has then typed two consecutive '[' characters but in completely different parts of the text. Also, you may want to be able to handle text inserted from the clipboard as well.
I think the safest way is to analyze the full text and do what should be done from that context, using RegEx or some other technique.

(Sorry, don't have enough reputation to add comments yet, so have to add a new answer). As suggested by jeffamaphone I'd handle the TextChanged event and rescan the text each time - but to keep the cost constant, just scan a few characters ahead of the current cursor position instead of reading the entire text.
Trying to intercept the keystrokes and maintain an internal state is a bad approach - it is very easy for your idea of what has happened to get out of sync with the control you are monitoring and cause weird problems. (and how do you handle clicks? Alt-tab? Pastes? arrow keys? Other applicatiosn grabbing focus? Too many special cases to worry about...)

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.