.NET library for processing HTML e-mails & stripping previous responses - c#

Does anyone know of a .NET library that will process HTML e-mails and can be used to trim out the reply-chain? It needs to be able to accept HTML -or- text mails and then trim out everything but the actual response, removing the trail of messages that are not original content. I don't expect it to be able to handle responses when they're interleaved into the previous mail ("responses in-line") - that case can fail.
We have a home-built one based on SgmlReader and a series of XSL transforms, but it requires constant maintenance to deal with new e-mail clients. I'd like to find one I can buy... :)
Thanks,
Steve

This does not answer much of your question, but the W3C's Converting HTML to Other Formats has a section on converting HTML to text. I hope it helps someone develop a full answer to your question!

One free and very useful library we've used for dealing with HTML, including malformed HTML, is the HtmlAgilityPack.
There is no StripOutPreviousResponses() function, but it may help you with your home-made one.
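For the plain-text half of the problem, a minimal sketch of a home-made trimmer might look like the following. It is only an illustration: the class name `ReplyTrimmer` and the list of quote markers are assumptions, and real mail clients use many more variants, which is exactly the maintenance burden described in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: keeps everything above the first line that looks
// like the start of quoted material. Interleaved replies are not handled.
public static class ReplyTrimmer
{
    // Assumed markers; extend this list for the clients you actually see.
    private static readonly Regex[] QuoteHeaders =
    {
        new Regex(@"^On .+ wrote:\s*$", RegexOptions.IgnoreCase),
        new Regex(@"^-+\s*Original Message\s*-+\s*$", RegexOptions.IgnoreCase),
    };

    public static string TrimPlainText(string body)
    {
        var kept = new List<string>();
        foreach (var line in body.Replace("\r\n", "\n").Split('\n'))
        {
            var trimmed = line.Trim();

            // A ">" prefix or a known header line starts the reply chain.
            if (trimmed.StartsWith(">") || QuoteHeaders.Any(r => r.IsMatch(trimmed)))
            {
                break;
            }
            kept.Add(line);
        }
        return string.Join("\n", kept).TrimEnd();
    }
}
```

For HTML mails the same idea would apply to nodes instead of lines (for example, removing quoted blocks after loading the document with HtmlAgilityPack), but which elements a given client uses for quoting has to be checked against real samples.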

Related

Email Parsing program

I am writing an email parsing program. Basically, I am trying to retrieve emails from an Exchange server, and they have different formats. The mail body contains p and span tags, and when I open the message in Outlook, it adds additional classes such as "msonormal" to the HTML elements. When I copy and paste it into the Gmail composer, it just removes the classes, but the HTML tags stay intact.
I am using HTML Agility Pack to parse the tags independent of class names. The emails are sent via different automated systems, so I am not completely sure whether the emails from the Exchange server already contain p and span tags or whether the Outlook/Gmail editors are adding those tags as well.
Can anyone shed some light: do these mail editors just add classes or other attributes, or do they completely change the layout, such as showing divs as tables?
I'm sorry but if you are getting emails from different sources, chances are that they will all be formatted differently.
You're on the right track using html agility pack. I would suggest putting a break point in your code and getting the full html source of each and then parsing.
They are from different sources so you can conditionally parse based on sender or subject.
I've had to do this in the past, and it was a pain. Sorry, there is no way to normalize them all so they can be parsed in a standard way. The only way would be for you to enforce a standard on your senders, which I'm guessing would be almost impossible.
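The "conditionally parse based on sender" suggestion can be sketched as a lookup table from sender address to parser function. Everything named here (`MailDispatcher`, the example addresses, the per-sender parsers) is hypothetical, and the regex tag stripping is only a stand-in for real parsing with HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical dispatcher: one parser per known sender, plus a fallback.
public static class MailDispatcher
{
    private static readonly Dictionary<string, Func<string, string>> Parsers =
        new Dictionary<string, Func<string, string>>(StringComparer.OrdinalIgnoreCase)
        {
            { "orders@example.com", ParseOrderMail },
            { "alerts@example.com", ParseAlertMail },
        };

    public static string Parse(string sender, string html)
    {
        // Unknown senders fall through to a generic parser.
        return Parsers.TryGetValue(sender, out var parser)
            ? parser(html)
            : ParseFallback(html);
    }

    private static string ParseOrderMail(string html) => "order:" + StripTags(html);
    private static string ParseAlertMail(string html) => "alert:" + StripTags(html);
    private static string ParseFallback(string html) => "unknown:" + StripTags(html);

    // Crude placeholder; a real parser would walk the DOM instead.
    private static string StripTags(string html) =>
        Regex.Replace(html, "<[^>]+>", string.Empty).Trim();
}
```

Keeping the table in one place means a new automated sender only needs one new entry and one new parser method.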

Create PDF from HTML form results in C#

I have a project where I need to create an HTML form (no problem) and then create a PDF file from the results using C#.
I have done this before in PHP using FPDF but this one needs to be C#. Ideally I want to put the code into a user control and then stick it in an Umbraco website.
Can anyone recommend a good way to do this? PDF doesn't need to be fancy, it'll just display text, we aim to create a generic purchase order based on what the customer wants from the form, which can then be emailed to them to print off on headed paper.
Thanks
There are a couple of recent problems with iTextSharp. The most annoying is that in the latest version they've deprecated the HTML parser. So now everything has to work through the XMLWorkerHelper singleton and is parsed through ParseXHtml. I find this a real pain, since HTML pages which aren't well formed appear fine in a browser and parse OK with the old method, but now crash out with an exception. So it necessitates an extra step to make sure your HTML is well formed (as XHTML) first. If you are generating your HTML from an ASPX page and using Server.Execute() to get the stream, then this might be useful to you for iTextSharp:
http://jwcooney.com/2012/12/30/generate-a-pdf-from-an-asp-net-web-page-using-the-itextsharp-xmlworker-namespace/
Be mindful that iTextSharp has a distinct lack of decent documentation for the recent changes (and the Java iText documents don't translate perfectly to C#), which makes the learning curve far too long and steep for any practical use in short spaces of time. I've basically given up on that platform, though I may just create a baseline system to get something working lean whilst I learn another framework.
As a result, I'm looking at PDFizer and PDFSharp libraries. If I have some success, I'll report back.
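The "extra step" of checking that markup is well formed before handing it to an XHTML-only parser can be sketched with the framework's own XML parser. This is an assumption about how one might do the check, not part of iTextSharp; `XhtmlCheck` is a hypothetical name. Note that named HTML entities such as `&nbsp;` are rejected by a plain XML parse unless a DTD supplies them.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical pre-flight check: true only if the markup parses as XML.
public static class XhtmlCheck
{
    public static bool IsWellFormed(string markup, out string error)
    {
        try
        {
            XDocument.Parse(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            // The message includes line and position of the first problem.
            error = ex.Message;
            return false;
        }
    }
}
```

If the check fails, the markup can be repaired first (for example with a tolerant HTML parser) rather than letting the PDF step throw.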
Here is a library for converting HTML to PDF:
http://pdfcrowd.com/web-html-to-pdf-net/
I like the PDFsharp library. Not sure how it would work for your needs, though.

Creating PDFs Online

We are using Report Definition Language (RDL) templates to define various reports in one of our SharePoint applications. These reports are then saved as PDFs into various SharePoint Document Libraries. One report in particular renders, but is considered to be "failing" due to the styling needs of the report. It appears RDL only understands "very simple" HTML.
For Example:
Trademark characters are not rendering as superscript (they render as normal text instead)
The ability to assign Line Height fails
The ability to assign Word Spacing fails (so printers "leading" requirements fail)
These all point to known Microsoft limitations in how RDL interprets HTML, of which we are now aware.
So...
I need a better tool...and we are scratching our heads on this one!
QUESTION:
What tools take-in HTML, understand CSS (well!) and can generate PDFs from C-Sharp objects?
Please keep in mind that I need the PDF generator tools you recommend (below) to understand CSS and HTML.
NOTE:
I looked at the various other Stack Exchange sites to see if there is a better forum for this particular question, but this one was the only one that seemed to fit the bill. If you are a moderator and feel this question is misplaced, please feel free to move it.
This HTML to PDF converter has the most accurate conversion of a complex html/css page. There is also a demo to try the conversion with your html
Maybe you can give Amyuni WebkitPDF a try. It is a Free component for converting HTML+CSS into PDF files. From the home page:
Directly convert HTML files into PDF without the use of a web browser or a printer driver
Convert HTML files into XAML/XPS for rendering within Silverlight
Integrate and deploy the HTML conversion feature within your applications
Generate either a single continuous PDF page or split the HTML into multiple PDF pages
Amyuni WebkitPDF is distributed as a library with a sample application, and sample code for C++ and C#.
Disclaimer: I currently work as software developer at Amyuni Technologies.
I only know a workaround for the "leading space" issue. This example "leads" the value with 10 spaces:
=space(10) & Fields!FieldName.Value
This should work for any renderer. I'll update this if I come across other tricks.
Have a look at Aspose.Pdf for .NET: http://www.aspose.com/categories/.net-components/aspose.pdf-for-.net/default.aspx

HTML to PDF - Bad performance

I'm using ExpertPDF (library for .NET C#) for converting HTML to PDF and my problem is that it takes a lot of time to do this.
Are there any customizations that will improve the conversion?
The HTML-page contains table-data with just a few images, so it is not that complex.
Has anyone else ever experienced this problem, or do you recommend another library for doing this?
I'm thankful for all hints I can get, there must be a way to increase the performance of this action...
If you really need to build from HTML, I suggest having a look at websupergoo. It is not a free or open-source library, but it can export PDF from HTML.
There are a lot of questions on Stack Overflow about this subject:
Generate PDF from ASP.NET from raw HTML/CSS content?
Printing a PDF in .NET
How do I programmatically create a PDF in my .NET application?
see search result

Converting between document formats in C#

What is the best way to convert between HTML, XML, and XSL-FO in C#?
I already have the HTML (piped in from FCKEditor) and I'd like to print a PDF (I have an XSL->PDF converter). I just can't seem to find a library that will convert from HTML into anything XSL friendly.
A year or two back, I had to generate pdfs from a C++/C# program. In the end I settled on launching Apache's Java FOP as a separate process to do the conversion. The experience with xsl-fo was not a pleasant one. At the time, there didn't appear to be a single tool that had implemented xsl-fo completely. Tools tended to pick a subset of the specification and hack away at that. Given the sprawling complexity of xsl-fo, I'm starting to wonder if there will ever be a full implementation.
FOP tended to be buggy and considerable time was spent working around issues. XSLT and XPaths were difficult to learn. It took a few weeks before I was seeing past the verbosity and could quickly get things done. I don't think I ever quite got my head around xsl-fo though. It makes the html and css model look like a child's toy. Luckily, the pdfs generate, and don't have too many problems. :-)
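Launching FOP as a separate process, as described above, could be sketched roughly like this. The `-fo` and `-pdf` switches are FOP's command-line options for an XSL-FO input file and a PDF output file; the `FopLauncher` class, the path to the `fop` launcher script, and the file names are assumptions for illustration.

```csharp
using System.Diagnostics;

// Hypothetical wrapper around the FOP command line.
public static class FopLauncher
{
    public static ProcessStartInfo BuildStartInfo(string fopPath, string foFile, string pdfFile)
    {
        return new ProcessStartInfo
        {
            FileName = fopPath,
            Arguments = $"-fo \"{foFile}\" -pdf \"{pdfFile}\"",
            UseShellExecute = false,       // required for stream redirection
            RedirectStandardError = true,  // FOP reports problems on stderr
            CreateNoWindow = true,
        };
    }

    // Runs the process and returns its exit code plus anything written to stderr.
    public static int Run(ProcessStartInfo info, out string stderr)
    {
        using (var process = Process.Start(info))
        {
            stderr = process.StandardError.ReadToEnd();
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Capturing stderr matters in practice, since a failed conversion otherwise shows up only as a missing or empty PDF.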
Anyway, the task at hand: generating pdfs from xhtml output from FCKEditor.
I just can't seem to find a library that will convert from HTML into anything XSL friendly.
Heh. Yeah, that's 'cos there isn't one, and probably won't be an html to xsl-fo converter that's any good. Such a converter has a few things against it: complexity of browsers and complexity of xsl-fo. For such a converter to deal with an average html document, it needs the guts of a web browser: the layout, css support probably even JavaScript. Then it has to take the rendered page, and figure out what xsl-fo is needed to get something which looks similar, and fits within the paged constraints of xsl-fo.
It's like the problem with making a word viewer: without reimplementing a lot of word, it sucks most of the time because it doesn't look the same.
So... what can you do? Well, having a small subset of html to work with is a good start. Hopefully the output from FCKEditor is xhtml, as getting html into xml is a world of pain in itself (which tidy can be useful for). Next, unless some poor soul has already made an FCKEditor xhtml -> xsl-fo xslt for your xsl-fo implementation, you'll have to make one. That involves learning xsl-fo, xslt and xpath. In my experience it'll take a few weeks and will be a cobbled together solution.
To get started with xsl-fo I found the following links useful:
XSL-FO Tutorial
XSL Standard
Apache FOP Compliance Page
XSL-FO: Ready for Prime Time? outlines the problem xsl-fo tries to solve
For three quick intros see a, b and c
So what's all this xsl-fo, xslt stuff and all the other things? The XSL-FO: Ready for Prime Time? lays it out as:
The Extensible Stylesheet Language Family (XSL) XSL is a family of recommendations for defining XML document transformation and presentation. It consists of three parts:
XSL Transformations (XSLT), a language for transforming XML
The XML Path Language (XPath), an expression language used by XSLT to access or refer to parts of an XML document. (XPath is also used by the XML Linking specification)
XSL Formatting Objects (XSL-FO), an XML vocabulary for specifying formatting semantics
My advice? Run. Find another way. Find another solution. Generate LaTeX files, and convert them into pdfs. Generate something else. Make word documents and print them using PDFCreator. Generate images. Control Firefox to print pages as pdfs. Find a way to avoid needing pdfs at all. Anything, as long as it isn't fighting html, xsl-fo, FOP, xslt and xpath.
PS: Let me know if you need any help. :-)
I'd first try XSLT. When you're talking about formatting XML documents (and that's pretty much what you're talking about), that's the tool designed to do it.
From Wiki:
"The general idea behind XSL-FO's use is that the user writes a document, not in FO, but in an XML language. XHTML, DocBook, and TEI are all possibilities, but it could be any XML language. Then, the user obtains an XSLT transform, either by writing one themselves or by finding one for the document type in question. This XSLT transform converts the XML into XSL-FO."
You need an XSLT transform for HTML to XSL-FO. Not sure where to get one, but apparently the concept isn't alien.
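To give a feel for what such a transform involves, here is a minimal sketch using .NET's `XslCompiledTransform`. The stylesheet is a toy written for this example: it handles only `p` elements, and it assumes the input has no XHTML namespace declaration. A real FCKEditor-to-FO stylesheet would need templates for every element the editor can emit.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper: turns un-namespaced XHTML paragraphs into XSL-FO blocks.
public static class XhtmlToFo
{
    private const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0'
    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
    xmlns:fo='http://www.w3.org/1999/XSL/Format'>
  <xsl:output method='xml' omit-xml-declaration='yes'/>
  <xsl:template match='/html'>
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name='page'>
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference='page'>
        <fo:flow flow-name='xsl-region-body'>
          <xsl:apply-templates select='body/p'/>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>
  </xsl:template>
  <xsl:template match='p'>
    <fo:block><xsl:value-of select='.'/></fo:block>
  </xsl:template>
</xsl:stylesheet>";

    public static string Transform(string xhtml)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        using (var input = XmlReader.Create(new StringReader(xhtml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

The resulting XSL-FO string is what an FO processor would then render to PDF. Inline formatting, tables, lists and images each need their own templates, which is where most of the effort goes.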
Very informative exchange here. I have created a web application using ASP.NET and C#.NET for my IT contract business. One of the major goals of the web app is to generate customized resumes in various formats. I store my resume content in a SQL Server database and build the XML mostly raw in a C# method. I used XSLT to convert to HTML and, with a little awkwardness, have finally got a basic presentable resume. My next goal is to get a printable version of the resume. I got a book on XML from the library and touched up the XSLT a little. Then I came to the XSL-FO chapter. That's when the iceberg hit. I wanted to take on the challenge of having a PDF option that would be a menu choice and do a transform from XSLT to XSL-FO to PDF. Thing is, all the book recommendations had references to commercial products. It is just not worth the money, as PDF is not necessary. I looked at Altova XMLSpy on a 30-day trial basis, but as soon as I tried my first transform of an XSL-FO example file I got a message stating that I needed to download more software. That download was taking forever from their site, so I gave up and removed the software. Free versions of the commercial software from other vendors do not have the transform option. After reading the notes here I have decided to avoid XSL-FO myself. I am going to try getting an MS Word version now, and if my clients want to convert it to PDF they can pay for the PDF creation version from Adobe.
This is a dead question, but I would like to add for future readers that the current incarnation of FCKEditor (CKEditor now) is better at producing high quality XHTML (even a user-definable set of tags is possible).
I have gotten around similar issues by not using XSL-FO at all, but using an (X)HTML to PDF converter that renders the PDF from your source without XSL transforms. I validate the produced XHTML and fix the rare issues with HtmlAgilityPack; that will get you a long way from non-semantic HTML complexities. There are many converters to choose from; my choice is wkhtmltopdf. (If money is not an issue, PrinceXML is a superior alternative. I would love to use it but it's simply too expensive.)
