Regex Pattern to Extract Email Data - c#

I'm retrieving raw text (includes header, and message) from a POP server. I need to capture everything after the header which is terminated by a blank line between it and the user message.
At the same time, I want to ignore anything from original messages if it's a reply. In the emails I'm parsing, the quoted original message starts with
------Original Message------
An example email might look like this
Return-Path: ...
...
More Email Metadata: ...
Hello from regex land, I'm glad to hear from you.
------Original Message------
Metadata: ...
...
Hey regex dude, can you help me? Thanks!
Sincerely, Me.
I need to extract "Hello from regex land, I'm glad to hear from you." and any other text/lines prior to the original message.
I'm using this regex right now (C# in multiline mode), and it seems to work, except that it captures ------Original Message------ if the body is blank. I'd rather just get a blank string instead.
^\s*$\n(.*)(\n------Original Message------)?
Edit
I haven't down voted anyone and if you happen to downvote, it's usually helpful to include comments.

The reason for this is that you have an extra \n inside the parenthesis. If the body is blank, there is no extra newline there. Therefore, try this:
^\s*$\r\n(.*)(^------Original Message------$)?
If you don’t want the newline at the end of the body, you can still use string.Trim() on the matched part.
Note: This assumes that the input uses \r\n line terminators (which is required in e-mail headers according to the MIME standard).

Why don't you use DotnetOpenMail? Using a regex for this is the wrong approach; you'd be better off using a dedicated email handler instead.

You need to replace (\n------Original Message------) with a lookahead, (?=(\n------Original Message------)), so that this part is not returned in the match but is still required to be there.
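The lookahead idea can be sketched in C# as follows. This is a minimal sketch, not a full mail parser: the class and method names are hypothetical, and it assumes the header ends at the first blank line and that the reply marker sits at the start of its own line. Unlike the patterns above, it uses Singleline so the body may span several lines, and it falls back to end of input when there is no reply marker.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ReplyBodyExtractor
{
    // First blank line, then a lazy body that stops right before the
    // reply marker (lookahead, so the marker is not captured) or at end of input.
    private static readonly Regex BodyPattern = new Regex(
        @"\r?\n\r?\n(?<body>.*?)(?=^------Original Message------|\z)",
        RegexOptions.Singleline | RegexOptions.Multiline);

    public static string ExtractBody(string rawMessage)
    {
        Match m = BodyPattern.Match(rawMessage);
        // No blank line means no body; a blank body yields an empty string.
        return m.Success ? m.Groups["body"].Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        string raw =
            "Return-Path: <sender>\r\nSubject: Hi\r\n\r\n" +
            "Hello from regex land, I'm glad to hear from you.\r\n" +
            "------Original Message------\r\nMetadata: x\r\n\r\nHey regex dude";
        Console.WriteLine(ExtractBody(raw));
    }
}
```

With a blank body (the marker directly after the blank line), ExtractBody returns an empty string rather than the marker.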

Related

Using regex to split a formatted string to URL like StackOverFlow

I'm trying to write a parser that will create links found in posted text that are formatted like so:
[Site Description](http://www.stackoverflow.com)
to be rendered as a standard HTML link like this:
<a href="http://www.stackoverflow.com">Site Description</a>
So far what I have is the expression listed below. It works on the example above, but it will not work if the URL has anything after the ".com". Obviously there is no single regex that will find every URL, but I would like to be able to match as many as I can.
(\[)([A-Za-z0-9 -_]*)(\])(\()((http|https|ftp)\://[A-Za-z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?)(\))
Any help would be greatly appreciated. Thanks.
Darn. It seems @Jerry and @MikeH beat me to it. My answer is best, however, as the link tags are all uppercase ;)
Find what: \[([^]]+)\]\(([^)]+)\)
Replace with: <A HREF="$2">$1</A>
http://regex101.com/r/cY7lF0
Well, you could try negated classes so you don't have to worry about the parsing of the url itself?
\[([^]]+)\]\(([^)]+)\)
And replace with:
<a href="$2">$1</a>
regex101 demo
Or maybe use only the beginning parts to identify a url?
\[([^]]+)\]\(((?:https?|ftp)://[^)]+)\)
The replace is the same.
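As a rough illustration, the scheme-prefixed pattern could be applied in C# like this. The class and method names are hypothetical. The sketch writes the negated class as [^\]] instead of [^]] for readability, and it HTML-encodes both captures, which the answers above do not do, so that characters such as & in the URL do not break the markup.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LinkRenderer
{
    // Group 1: link text between [ and ]; group 2: URL between ( and ).
    private static readonly Regex LinkPattern =
        new Regex(@"\[([^\]]+)\]\(((?:https?|ftp)://[^)]+)\)");

    public static string Render(string text)
    {
        // Build the anchor tag from the two captures, encoding each one.
        return LinkPattern.Replace(text, m =>
            "<a href=\"" + WebUtility.HtmlEncode(m.Groups[2].Value) + "\">" +
            WebUtility.HtmlEncode(m.Groups[1].Value) + "</a>");
    }

    public static void Main()
    {
        // prints: <a href="http://www.stackoverflow.com">Site Description</a>
        Console.WriteLine(Render("[Site Description](http://www.stackoverflow.com)"));
    }
}
```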

How can I transform this url with REGEX?

I have a dynamic web app built using DotNetNuke that uses the following url format:
/SeoDummy.aspx?template={VAR1}&keywords={VAR2}
My user friendly url format is like this:
http://domain.com/.{VAR1}/{VAR2}
I am really terrible with REGEX and need to somehow detect when the user-friendly URL is requested and rewrite it to the dynamic web app URL. I have tried the following, but it is not matching on the site; the request just returns a 404:
.*/^([^/]+)/([^/]+)/?$
I am sure those of you who know regex will find my attempt silly, but regex is my kryptonite!
Thanks for any help that can be offered.
Since you are using a custom URL format, I guess a regex would be better than using the Uri class.
In your regex you have misplaced the ^. The regex should be
^https?://domain[.]com/[.]([^/]+)/([^/]+)/?$
I have not tested this, but give it a shot and tell me how it works out:
domain[.]com/\.([^/]+)/([^/]+)/?$
It looks like you had it mostly right except for the first caret, marking the beginning of the string... which can never match since you specified .* right in front of it! Also you missed the period in front of {VAR1} (unless that is a typo?).
I also wouldn't put .* at the beginning because then you could be capturing VAR1 = domain.com, VAR2 = something that is actually VAR1
If you want to become immune to your kryptonite, then this website is really good for looking up stuff:
http://www.regular-expressions.info/reference.html
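To show how the two captures map onto the dynamic URL, here is a small C# sketch. It is not how DotNetNuke itself rewrites URLs. The class and method names are hypothetical, and it assumes the input is only the request path (for example "/.shoes/red-boots"), already URL-decoded, rather than the full URL with the domain.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class FriendlyUrlRewriter
{
    // "/.{VAR1}/{VAR2}" with an optional trailing slash, anchored at both ends.
    private static readonly Regex FriendlyPath =
        new Regex(@"^/\.([^/]+)/([^/]+)/?$");

    public static bool TryRewrite(string path, out string target)
    {
        Match m = FriendlyPath.Match(path);
        if (!m.Success)
        {
            target = string.Empty;
            return false;
        }
        // Escape the captured values before placing them in the query string.
        target = "/SeoDummy.aspx?template=" + Uri.EscapeDataString(m.Groups[1].Value) +
                 "&keywords=" + Uri.EscapeDataString(m.Groups[2].Value);
        return true;
    }

    public static void Main()
    {
        if (TryRewrite("/.shoes/red-boots", out string target))
        {
            // prints: /SeoDummy.aspx?template=shoes&keywords=red-boots
            Console.WriteLine(target);
        }
    }
}
```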

line break in alert popup

I am trying to add a new line to a JavaScript alert message. I tried '\n' and 'Environment.NewLine'. I am getting an "Unterminated string constant" error. Could you please let me know what the problem could be? I appreciate any help. I also tried \r\n.
string msg = "Your session will expire in 10 minutes. \n Please save your work to avoid this.";
if (!this.ClientScript.IsStartupScriptRegistered(ID))
this.ClientScript.RegisterStartupScript(GetType(), ID, String.Format("<script language=JavaScript>setTimeout(\'alert(\"{1}\");\',{0}*1000);</script>", sTime, msg));
I would suspect that you need to change your code to;
string msg = "Your session will expire in 10 minutes. \\n Please save your work to avoid this.";
And escape the \n otherwise your code outputted would actually include the line break rather than \n
Your output code would look like:
setTimeout('alert("Your Session....
Please save your work to ....");', 1000);
Rather than:
setTimeout('alert("Your Session....\n Please save your work to ....");', 1000);
I'm not sure, but I think that \n is escaped in the string.Format method, like \". Maybe you should use \\n instead.
Edited : and the first \ of \\n has been escaped when i posted that. xD
At first glance I would say the primary problem is that you're escaping the ' character around your alert. Since your string is defined by the double quotes, you don't need to escape this character.
Add "@" at the beginning of your string, like this:
string msg = @"Your session ....";
The code looks fine, so I'm going to guess that you're using a message that itself has a ' quote in it, causing the JS syntax error. For inserting dynamic text into a Javascript code block, you really should use JSON to make your C# strings 'safe' for use in JS.
Consider JSON the go-to method for preventing the JS equivalent of SQL injection attacks.
Adding a @ at the beginning should help.
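The JSON suggestion can be sketched as below. This is one possible approach, not the asker's code: the class, method and parameter names are hypothetical, and it assumes System.Text.Json is available (built into .NET Core 3.0 and later, a NuGet package on .NET Framework). Serializing the message produces a double-quoted literal in which newlines and quotes are already escaped, so no manual \\n is needed.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, for illustration only.
public static class AlertScriptBuilder
{
    public static string Build(string message, int delaySeconds)
    {
        // A JSON string is also a valid JavaScript string literal:
        // a real newline becomes the two characters \n, quotes become escapes.
        string jsLiteral = JsonSerializer.Serialize(message);
        return "<script>setTimeout(function(){alert(" + jsLiteral + ");}," +
               (delaySeconds * 1000) + ");</script>";
    }

    public static void Main()
    {
        string msg = "Your session will expire in 10 minutes.\nPlease save your work to avoid this.";
        Console.WriteLine(Build(msg, 600));
    }
}
```

The returned string could then be passed to RegisterStartupScript in place of the String.Format call from the question.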

C# - Send HTML email with line breaks

I'm attempting to send emails from my C# app using the System.Net.Mail.MailMessage class. For the body of the email, I want to include line breaks to create separate blocks of text. How can I get an email client, like Outlook, to respect the line breaks when displaying the email as HTML to users? What character(s) do I insert in the body of the email text so that the line breaks are treated as line breaks?
Note: The body of my email is pure text, not HTML.
Line breaks are white space characters, and all white space characters are interpreted as spaces in HTML.
You can use break tags (<br/>) to add line breaks in the HTML code, and you can use paragraph tags (<p>...</p>) around paragraphs to get a distance between them.
You can just use Environment.NewLine property. That will insert a newline string defined for the current environment.
You need to insert HTML line-breaks i.e. <br/>
Using the paragraph tag is the best way in my opinion because it keeps the format clean and easy to maintain.
I hope this helps.
<p>Dear Mike</p>
<p>This is a sample message</p>
<p><b>Thank You,</b></p>
If you want to send the email as HTML, use the break tag as others have described. If you want line breaks to work, you'll need to send the message as plain text, which can be specified by setting IsBodyHtml = false.
If your existing line breaks aren't being used, try forcing with "\n" which is an escape sequence for a new line.
email.Body = new MessageBody(BodyType.Text, stringThatCointainsMessage);
At least when using Microsoft.Exchange.WebServices.Data, this will solve the problem.
You need to use the overload that takes the enum value BodyType.Text.
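For System.Net.Mail, the two options discussed above (plain text with IsBodyHtml = false, or HTML with break tags) can be sketched like this. The helper name is hypothetical. It assumes the body starts as plain text, so it HTML-encodes the text first and then turns each line break into a break tag.

```csharp
using System;
using System.Net;
using System.Net.Mail;

// Hypothetical helper, for illustration only.
public static class MailBodyFormatter
{
    public static string ToHtml(string plainText)
    {
        // Encode <, >, & and quotes so the text cannot break the markup.
        string encoded = WebUtility.HtmlEncode(plainText);
        // Normalize line endings, then emit a break tag per line break.
        return encoded.Replace("\r\n", "\n")
                      .Replace("\r", "\n")
                      .Replace("\n", "<br/>\r\n");
    }

    public static void Main()
    {
        string text = "Dear Mike\r\nThis is a sample message\r\nThank You,";
        using (var mail = new MailMessage())
        {
            // Option 1: send as HTML with explicit break tags.
            mail.Body = ToHtml(text);
            mail.IsBodyHtml = true;
            Console.WriteLine(mail.Body);

            // Option 2: send as plain text and keep the original line breaks.
            mail.Body = text;
            mail.IsBodyHtml = false;
        }
    }
}
```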

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there are a couple of agreed-upon markers for content you may wish to strip. You can look for these lines using regular expressions. I doubt you can reliably "sanitize" your emails, but here are some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
If the mail is anything but pure plain text, the message will be multi-part MIME. Any part whose type is "image/*" (image/jpeg, etc.) can probably be dropped. In all likelihood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
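The quoting and signature conventions above can be sketched for an already-extracted plain-text body as follows. This is a rough illustration with hypothetical names. It assumes MIME parsing has already happened elsewhere, treats both "-- " and "--" as the signature delimiter (since the trailing space is often missing), and does nothing about disclaimers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class PlainTextCleaner
{
    public static string Clean(string body)
    {
        var kept = new List<string>();
        foreach (string line in body.Replace("\r\n", "\n").Split('\n'))
        {
            // Everything from the signature delimiter onward is dropped.
            if (line == "-- " || line == "--") break;

            // Drop quoted lines, but keep ">From " lines, where the ">" was
            // only inserted to escape a leading "From ".
            bool quoted = line.StartsWith(">", StringComparison.Ordinal);
            bool escapedFrom = line.StartsWith(">From ", StringComparison.Ordinal);
            if (quoted && !escapedFrom) continue;

            kept.Add(line);
        }
        return string.Join("\n", kept).Trim();
    }

    public static void Main()
    {
        string body = "Hi Mark,\r\n> earlier text\r\nThis is my message.\r\n-- \r\nJohn Smith";
        // prints "Hi Mark," and "This is my message." on two lines
        Console.WriteLine(Clean(body));
    }
}
```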
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering destructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bolt in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you are creating your own application, I'd look into regex to find text and replace it. To make the application a little nicer, I'd create a class called Email with a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith@example.com",
FromName = "John Smith",
TextBody = @"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);
