Email parsing program - C#

I am writing an email parsing program. I retrieve emails from an Exchange server, and they arrive in different formats. The mail body contains p and span tags. When I open a message in Outlook, it adds extra classes such as "MsoNormal" to the HTML elements. When I copy and paste it into the Gmail composer, the classes are removed but the HTML tags stay intact.
I am using HTML Agility Pack to parse the tags independently of class names. The emails are sent by different automated systems, so I am not sure whether the emails on the Exchange server already contain the p and span tags or whether the Outlook/Gmail editors add those tags as well.
Can anyone shed some light: do these mail editors only add classes or other attributes, or do they change the layout completely, for example by rendering divs as tables?

I'm sorry, but if you are getting emails from different sources, chances are that they will all be formatted differently.
You're on the right track using HTML Agility Pack. I would suggest putting a breakpoint in your code, capturing the full HTML source of each kind of message, and writing the parsing against that.
Since they come from different sources, you can conditionally parse based on sender or subject.
I've had to do this in the past and it was a pain. Sorry, there is no way to normalize them all so they can be parsed in one standard way. The only alternative would be to enforce a standard on your senders, which I'm guessing would be almost impossible.
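The "conditionally parse based on sender" idea could be sketched roughly as below. This is only an illustration: MailParserRegistry, Register, and Parse are hypothetical names, and each parser is assumed to be a function that takes the HTML body and returns whatever extracted text you need.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical dispatcher: picks a parsing routine by sender address.
public static class MailParserRegistry
{
    // Sender addresses are compared case-insensitively.
    private static readonly Dictionary<string, Func<string, string>> Parsers =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase);

    // Associates a sender address with the routine that understands its layout.
    public static void Register(string sender, Func<string, string> parser)
    {
        Parsers[sender] = parser;
    }

    // Runs the parser registered for the sender; unknown senders get the body back unchanged.
    public static string Parse(string sender, string htmlBody)
    {
        Func<string, string> parser;
        if (Parsers.TryGetValue(sender, out parser))
        {
            return parser(htmlBody);
        }
        return htmlBody;
    }
}
```

Each automated system then gets its own small parser, registered once at startup, and a change in one sender's layout only touches that sender's routine.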

Related

Creating PDFs Online

We are using Report Definition Language (RDL) templates to define various reports in one of our SharePoint applications. These reports are then saved as PDFs into various SharePoint document libraries. One report in particular renders, but is considered to be "failing" because it does not meet the report's styling requirements. It appears that RDL only understands very simple HTML.
For Example:
Trademark characters are not rendering as superscript (they render as normal text instead)
The ability to assign Line Height fails
The ability to assign Word Spacing fails (so printers "leading" requirements fail)
These all point to known Microsoft limitations in how RDL interprets HTML, of which we are now aware.
So...
I need a better tool...and we are scratching our heads on this one!
QUESTION:
What tools take in HTML, understand CSS (well!) and can generate PDFs from C# objects?
Please keep in mind that the PDF generator tools you recommend (below) need to understand both CSS and HTML.
NOTE:
I looked at the various other Stack Exchange sites to see if there is a better forum for this particular question, but this one was the only one that seemed to fit the bill. If you are a moderator and feel this question is misplaced, please feel free to move it.
This HTML to PDF converter has the most accurate conversion of a complex HTML/CSS page. There is also a demo where you can try the conversion with your own HTML.
Maybe you can give Amyuni WebkitPDF a try. It is a Free component for converting HTML+CSS into PDF files. From the home page:
Directly convert HTML files into PDF without the use of a web browser or a printer driver
Convert HTML files into XAML/XPS for rendering within Silverlight
Integrate and deploy the HTML conversion feature within your applications
Generate either a single continuous PDF page or split the HTML into multiple PDF pages
Amyuni WebkitPDF is distributed as a library with a sample application, and sample code for C++ and C#.
Disclaimer: I currently work as software developer at Amyuni Technologies.
I only know a workaround for the "leading space" issue. This example "leads" the value with 10 spaces:
=space(10) & Fields!FieldName.Value
This should work for any renderer. I'll update this if I come across other tricks.
Have a look at Aspose.Pdf for .NET: http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx

Parse HTML Page, include all css styles

I want to send a complete HTML page as an email and include all of its CSS styles in the email. Is there a library that produces a single HTML page with all the CSS styles correctly included? (Consider that CSS files can import other CSS files, which also have to be opened and included.)
Any help is appreciated.
The closest I came to this was having to send a control without includes. I built it as a server control that reads the CSS files (and JS files) and writes them out. For an entire page, however, you might have more difficulty.
I do not believe there is anything that does this. If you can read the entire page source, find-and-replace may be your easiest answer: find each CSS tag and replace it with the contents of the file that the tag references.
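A rough sketch of that find-and-replace approach, using only regular expressions from the base class library. CssInliner and EmbedStylesheets are hypothetical names, and the loadCss callback (which turns an href into CSS text) is an assumption so that the caller decides how files are located and read. The sketch does not follow @import rules inside the CSS, and it embeds style blocks rather than moving rules into per-element style attributes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: swaps <link rel="stylesheet"> tags for <style> blocks.
public static class CssInliner
{
    // Naive patterns; they assume quoted attribute values and well-formed tags.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);
    private static readonly Regex RelStylesheet =
        new Regex(@"rel\s*=\s*[""']stylesheet[""']", RegexOptions.IgnoreCase);
    private static readonly Regex HrefAttribute =
        new Regex(@"href\s*=\s*[""']([^""']+)[""']", RegexOptions.IgnoreCase);

    // loadCss receives the href value and returns the stylesheet text.
    public static string EmbedStylesheets(string html, Func<string, string> loadCss)
    {
        return LinkTag.Replace(html, match =>
        {
            // Leave non-stylesheet links (icons, feeds) untouched.
            if (!RelStylesheet.IsMatch(match.Value)) return match.Value;

            Match href = HrefAttribute.Match(match.Value);
            if (!href.Success) return match.Value;

            return "<style type=\"text/css\">" + loadCss(href.Groups[1].Value) + "</style>";
        });
    }
}
```

In a real page the callback would typically map the href to a file on disk and read it, for example with File.ReadAllText.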
You could try using this: http://martinnormark.com/move-css-inline-premailer-net
I'm in the process of testing it myself in order to generate some Word documents with inline styles. I will add the results later.
Update:
It works, although Word 2010 applies some of the inline styles as it likes. I didn't try it with previous versions of Word.

How to get rid of HTML-tags in a message?

Lately I've been using EWS MAPI to connect to Exchange Server. Once connected, I access my mails and first display their bodies (which contain a lot of HTML tags) in a gridview. When you select a record in that gridview, the body is shown in a freetextbox.
My problem is that I want to get rid of the HTML in the body, and configure the freetextbox so that it still displays the text with its original formatting.
Thanks in advance.
You can use regular expressions.
Have a look here.
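As a minimal sketch of the regular-expression approach (HtmlStripper and ToPlainText are hypothetical names, and the patterns are deliberately naive): it drops script and style blocks, turns br tags and closing p/div tags into line breaks, removes the remaining tags, and decodes entities. Regular expressions are not a real HTML parser, so malformed markup may not survive this cleanly.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: reduces an HTML mail body to plain text.
public static class HtmlStripper
{
    public static string ToPlainText(string html)
    {
        // Remove script and style elements together with their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Keep a rough line structure: <br> and closing </p>, </div> become newlines.
        text = Regex.Replace(text, @"<br\s*/?>|</(p|div)\s*>", "\n", RegexOptions.IgnoreCase);

        // Strip every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Convert entities such as &amp; back to characters.
        return WebUtility.HtmlDecode(text).Trim();
    }
}
```

For example, ToPlainText("<p>Hello <b>world</b> &amp; all</p><p>Bye</p>") yields two lines, "Hello world & all" and "Bye".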
You can use Html Agility Pack to parse the HTML and strip out what you want. There is a lot of information on SO about it, for example: How to use HTML Agility pack
Maybe this can help
Convert special chars to HTML entities, without changing tags and parameters

How to create HTML text from C# application?

I have a C# application that must store some information in MS SQL, which will later be sent by email with DB Mail.
Within the C# application I have a class with several properties, and I need to use it to generate the email text. What I would like is to set up a template with placeholders for variables. I need to create the text both as HTML and as plain text.
What tools or libraries would you recommend for HTML?
Is String.Format() the best alternative for working with plain text?
I do this in other applications by having the e-mail body available somewhere (SharePoint list, data table) already in the right format, but with named placeholders, corresponding to the information you have in your application.
Then sending the e-mail means replacing the placeholder with the right information. StringBuilder.Replace works fine.
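A minimal sketch of the named-placeholder approach with StringBuilder.Replace. The {Name} placeholder syntax, the EmailTemplate class, and the Fill method are assumptions for illustration; the original answer does not say which placeholder format it uses.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: fills {Placeholder} tokens in a stored template.
public static class EmailTemplate
{
    public static string Fill(string template, IDictionary<string, string> values)
    {
        var builder = new StringBuilder(template);
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Replaces every occurrence of the token, e.g. "{Name}".
            builder.Replace("{" + pair.Key + "}", pair.Value);
        }
        return builder.ToString();
    }
}
```

With the template "Hello {Name}, order {OrderId} shipped." and the values Name = "Ann" and OrderId = "42", Fill returns "Hello Ann, order 42 shipped.". One caveat of this simple loop: if a value itself contains text that looks like a placeholder, a later iteration may replace it.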
I would say the most important thing you need to decide is when to encode the text. If you are emailing text supplied by users, you will want to HtmlEncode it before including it in an email. It's probably OK to store it "as received" in the database as long as every consumer encodes it before using it. I typically do this in the data layer that "gets" data from the database.
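Encoding at the point of use might look like the sketch below. It uses WebUtility.HtmlEncode from System.Net; the EmailBodyBuilder class and BuildHtmlParagraph method are hypothetical names, and wrapping the text in a p tag is just an example of placing user text inside HTML.

```csharp
using System.Net;

// Hypothetical helper: encodes user-supplied text before it enters an HTML body.
public static class EmailBodyBuilder
{
    public static string BuildHtmlParagraph(string userText)
    {
        // Characters such as <, > and & are converted to entities,
        // so the text is displayed literally instead of being read as markup.
        return "<p>" + WebUtility.HtmlEncode(userText) + "</p>";
    }
}
```

The plain-text version of the same email would use the stored text unencoded, which is one reason to keep the database copy "as received".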

.NET library for processing HTML e-mails & stripping previous responses

Does anyone know of a .NET library that will process HTML e-mails and can be used to trim out the reply chain? It needs to be able to accept HTML or text mails and then trim out everything but the actual response, removing the trail of messages that are not original content. I don't expect it to handle responses that are interleaved into the previous mail ("responses in-line"); that case can fail.
We have a home-built one based on SgmlReader and a series of XSL transforms, but it requires constant maintenance to deal with new e-mail clients. I'd like to find one I can buy... :)
Thanks,
Steve
This does not answer much of your question, but the W3C's Converting HTML to Other Formats has a section on converting HTML to text. I hope it helps someone develop a full answer to your question!
One free and very useful library we've used for dealing with HTML, including malformed HTML, is the HtmlAgilityPack.
There is no StripOutPreviousResponses() function, but it may help you with your home-made one.
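For the plain-text case only, a home-made trimmer could start from a heuristic like the one sketched below. Everything here is an assumption: ReplyTrimmer and KeepNewContent are hypothetical names, and the two marker patterns (an "Original Message" separator line and an "On ... wrote:" line) cover only some English-language clients. It cuts the body at the first marker and does not attempt in-line replies or HTML.

```csharp
using System.Text.RegularExpressions;

// Hypothetical heuristic: keeps only the text above the first quoted-reply marker.
public static class ReplyTrimmer
{
    // Assumed markers; real clients and locales use many other variants.
    private static readonly Regex QuoteStart = new Regex(
        @"^(-----\s*Original Message\s*-----|On .+ wrote:)\s*$",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);

    public static string KeepNewContent(string plainTextBody)
    {
        Match marker = QuoteStart.Match(plainTextBody);

        // No marker found: treat the whole body as new content.
        string kept = marker.Success
            ? plainTextBody.Substring(0, marker.Index)
            : plainTextBody;

        return kept.TrimEnd();
    }
}
```

New e-mail clients would each need another marker pattern, which matches the maintenance burden described in the question.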
