I have data in SQL that is stored as RTF because it contains a lot of superscript characters. I am trying to print the data to a PDF with PDFsharp (not MigraDoc) using DrawString. However, as I expected, it just shows the raw RTF string...
I tried putting it in a RichTextBox and then retrieving the Text property. This gives the correct plain text, but without the superscript formatting, which I need.
Can anyone tell me how to correctly output the RTF data?
First, from the PDFsharp FAQ:
Can I use PDFsharp to convert HTML or RTF to PDF?
No, not "out of the box", and we do not plan to write such a converter in the near future.
Yes, PDFsharp with some extra code can do it. But we do not supply that extra code. On NuGet and other sources you can find a third party library "HTML Renderer for PDF using PdfSharp" that converts HTML to PDF. And there may be other libraries for the same or similar purposes, too. Maybe they work for you, maybe they get you started.
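One workaround, I think, is to use DrawToBitmap with a RichTextBox to render the RTF string into an image, and then use DrawImage to put it in the PDF file. Be aware that the documentation for Control.DrawToBitmap lists the RichTextBox as a limitation (only the border is drawn), so this may not work with the Windows Forms control.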
Using a RichTextBox would also be my approach, but I would select only a single character in the text and query the relevant properties (subscript, superscript, maybe also bold, italic, underline, and anything else you need). And when any of those properties changes, draw the text you collected so far and continue collecting characters for the new set of properties until any relevant property changes.
I would probably use MigraDoc so I would not have to deal with line-breaks in my code, but that is up to you. I would not create bitmaps for the text as this voids the advantages of the PDF format.
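The character-by-character approach described above could look roughly like the sketch below. It is only a sketch under several assumptions: it uses the Windows Forms RichTextBox, handles a single line of text, tracks only superscript, and the names RtfToPdfSketch and DrawRtfLine are made up for illustration. Detecting superscript is the shaky part: SelectionCharOffset reports a raised baseline, while checking SelectedRtf for \super is a heuristic that may need adjusting for your data.

```csharp
using System.Text;
using System.Windows.Forms;
using PdfSharp.Drawing;

public static class RtfToPdfSketch
{
    // Draws one line of RTF at (x, y). Superscript runs are drawn smaller and raised.
    public static void DrawRtfLine(XGraphics gfx, string rtf, string fontFamily,
                                   double size, double x, double y)
    {
        var normal = new XFont(fontFamily, size);
        var raised = new XFont(fontFamily, size * 0.6);   // assumed scale factor

        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;
            string text = box.Text;
            var run = new StringBuilder();
            bool runIsSuper = false;

            for (int i = 0; i <= text.Length; i++)
            {
                bool atEnd = i == text.Length;
                bool isSuper = false;
                if (!atEnd)
                {
                    box.Select(i, 1);
                    // Heuristic: raised baseline, or \super in the selection's RTF.
                    isSuper = box.SelectionCharOffset > 0
                              || box.SelectedRtf.Contains(@"\super");
                }

                // Formatting changed (or text ended): draw what was collected so far.
                if ((atEnd || isSuper != runIsSuper) && run.Length > 0)
                {
                    XFont font = runIsSuper ? raised : normal;
                    double baseline = runIsSuper ? y - size * 0.35 : y; // assumed shift
                    string s = run.ToString();
                    gfx.DrawString(s, font, XBrushes.Black, x, baseline);
                    x += gfx.MeasureString(s, font).Width;
                    run.Clear();
                }

                if (!atEnd)
                {
                    runIsSuper = isSuper;
                    if (text[i] != '\n') run.Append(text[i]); // single-line sketch
                }
            }
        }
    }
}
```

Bold, italic and underline could be tracked the same way through SelectionFont; line breaking is left out entirely, which is the part MigraDoc would take care of.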
Related
I'm trying to parse lines of bolded text from an RTF file. Right now, I'm sort of doing it by using Regex and looking for the "\b...\b0" tags in the file, but that leaves a lot of formatting text, and there are so many formatting tags in RTF that I can't just hard code it all out and call it a day. Is there a more elegant existing solution for parsing only lines with specific formatting?
I'd use an RTF parser... RichTextBox comes to mind. There are several ways of obtaining the formatting using the RTB.
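As a rough illustration of the RichTextBox idea, the sketch below loads the RTF into a Windows Forms RichTextBox and collects the stretches of text whose font is bold. The class and method names (BoldText, Extract) are hypothetical, and it assumes a reference to System.Windows.Forms is acceptable in your project.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;

public static class BoldText
{
    // Returns every contiguous run of bold characters found in the RTF.
    public static List<string> Extract(string rtf)
    {
        var result = new List<string>();
        using (var box = new RichTextBox())
        {
            box.Rtf = rtf;              // let the control parse the RTF
            string text = box.Text;
            var run = new StringBuilder();

            for (int i = 0; i < text.Length; i++)
            {
                box.Select(i, 1);       // query formatting one character at a time
                var font = box.SelectionFont;
                bool bold = font != null && font.Bold;

                if (bold && text[i] != '\n')
                {
                    run.Append(text[i]);
                }
                else if (run.Length > 0)
                {
                    result.Add(run.ToString()); // bold run ended
                    run.Clear();
                }
            }
            if (run.Length > 0) result.Add(run.ToString());
        }
        return result;
    }
}
```

Selecting one character at a time is slow on large documents, but it avoids hand-written regular expressions for the formatting tags.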
No. I recently tackled a project in which we had to take an RTF document, complete with embedded media, and convert it to a MIME multipart message. We constructed several sets of RegEx to break apart the sections of the document and then converted each formatting option to an appropriate HTML/CSS tag. There really isn't an "elegant" way to do what you wish.
What are you trying to do with the RTF? Our end goal was to have an HTML conversion of the RTF supplied. I know that RichTextBox, within the WPF world, has the ability to save out to several formats, such as XAML, which may get rid of the need to handle the parsing yourself.
Also, there are RTF Converters out on the market, so with some more context I could suggest something better.
You should take a look at RtfDomParser.
I found some cases where the parser does not work, but overall it is fine.
Context
What we need is to capture some user input (formatted text) from a WPF application and output a PDF with some stored images AND the user input on the last page.
What we've tried
We created the WPF app, added the iTextSharp library, retrieved the images from the DB and added them to the PDF. That's working. Now, for the user input we added a RichTextBox control from the Extended WPF Toolkit. We added this control mainly because of its binding properties and formatters. Basically we can bind the rich content of the control to a property. That binding is working. We already have the content in RTF format, for example:
"{\rtf1\ansi\ansicpg1252\uc1\htmautsp\deff2{\fonttbl{\f0\fcharset0 Times New Roman;}{\f2\fcharset0 Segoe UI;}}{\colortbl\red0\green0\blue0;\red255\green255\blue255;}\loch\hich\dbch\pard\plain\ltrpar\itap0{\lang1033\fs18\f2\cf0 \cf0\ql{\f2 {\ltrch This is the }{\b\ltrch RichTextBox}\li0\ri0\sa0\sb0\fi0\ql\par}}}"
Problem
The thing is, the actual output in the PDF is precisely the RTF shown above, but the expected output (for the example) must be:
"This is the **RichTextBox**\r\n"
This obviously happens because we insert the bound RTF from the control into the PDF as-is. The question is: how can we add that content and have it interpreted as RTF?
PS. If you have other working idea or solution (without using a richtextbox, or something like that) it's welcome. Thanks in advance.
Unfortunately, iTextSharp no longer supports the RTF format directly. I would suggest converting the RTF fragment to XHTML first and then importing the resulting XHTML into the final document (official HTML support appears to be gone, so XHTML is the only alternative in this case).
In short, I would suggest to:
convert the RTF fragment into XHTML;
place the XHTML stream into a new iTextSharp document (or directly into the final document, if you wish);
add the content of the aforementioned document into the target document you are going to export as PDF.
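For the second and third steps, something along these lines may work. It is a sketch that assumes iTextSharp 5 together with its separate XML Worker package, and that you already have an XHTML string from whatever RTF converter you choose; XhtmlToPdf and Write are illustrative names only.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

public static class XhtmlToPdf
{
    // Parses an XHTML string and writes the result to a new PDF file.
    public static void Write(string xhtml, string outputPath)
    {
        using (var output = new FileStream(outputPath, FileMode.Create))
        {
            var document = new Document();
            PdfWriter writer = PdfWriter.GetInstance(document, output);
            document.Open();

            using (var reader = new StringReader(xhtml))
            {
                // XML Worker adds the parsed elements to the open document.
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, reader);
            }

            document.Close();
        }
    }
}
```

A call such as XhtmlToPdf.Write("<html><body><p>This is the <b>RichTextBox</b></p></body></html>", "out.pdf") would exercise it. In the real application you would parse into the document that already holds the images instead of creating a new one.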
UPDATE
There is no built-in mechanism to convert from RTF to XHTML, but many open source projects exist; I would start by coupling this RTF to HTML converter with the HTML Agility Pack (which will in turn convert your HTML to XHTML).
Frankly, however, the whole flow is a bit complex to follow and I would perhaps opt for a simpler solution, maybe by using an HTML editor (alternative) directly in your project or by reverting to the FlowDocument as others have suggested.
WPF already has a good FlowDocument, and it renders well. So we created a XAML to PDF converter. It's in beta, but most strings, tables and images are converted to PDF successfully. It's an open source project available at http://xamltopdf.codeplex.com/ . RTF can give you a FlowDocument, which you can convert to XAML and pass on to the XamlToPDF converter.
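Getting from RTF to a FlowDocument and then to XAML can be done with WPF's own classes. A minimal sketch, assuming the code runs on an STA thread inside a WPF application (RtfToXaml and Convert are made-up names; how the XAML is then handed to the converter is not shown):

```csharp
using System.IO;
using System.Text;
using System.Windows;
using System.Windows.Documents;
using System.Windows.Markup;

public static class RtfToXaml
{
    // Loads RTF into a FlowDocument and serializes that document as XAML.
    public static string Convert(string rtf)
    {
        var document = new FlowDocument();
        var range = new TextRange(document.ContentStart, document.ContentEnd);

        // RTF is normally 7-bit text, so ASCII bytes are assumed to be enough here.
        using (var stream = new MemoryStream(Encoding.ASCII.GetBytes(rtf)))
        {
            range.Load(stream, DataFormats.Rtf);
        }

        return XamlWriter.Save(document);
    }
}
```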
I have a fairly simple task: I need to read a PDF file and write out its image contents while ignoring its text contents. So essentially I need to do the complement of "save as text".
Ideally, I would prefer to avoid any sort of re-compression of the image contents but if it's not possible, it's ok too.
Are there examples of how to do it?
Thanks!
Extracting text from a PDF file with PDFsharp is not a simple task.
It was discussed recently in this thread:
https://stackoverflow.com/a/9161732/162529
Extracting text from a PDF with PdfSharp can actually be very easy, depending on the document type and what you intend to do with it. If the text is in the document as text, and not an image, and you don't care about the position or format, then it's quite simple. This code gets all of the text of the first page in the PDFs I'm working with:
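using PdfSharp.Pdf.IO; // namespace that contains PdfReader
var doc = PdfReader.Open(docPath);
// Raw content stream of the first content element on the first page
string pageText = doc.Pages[0].Contents.Elements.GetDictionary(0).Stream.ToString();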
doc.Pages.Count gives you the total number of pages, and you access each one through the doc.Pages array with the index. I don't recommend using foreach and Linq here, as the interfaces aren't implemented well. The index passed into GetDictionary is for which PDF document element - this may vary based on how the documents are produced. If you don't get the text you're looking for, try looping through all of the elements.
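Looping through all pages and all content elements, as suggested above, might look like this. It only uses the calls from the snippet plus plain index loops; RawPageContent and ReadAll are illustrative names, and the result is still the raw stream text, not clean prose.

```csharp
using System.Text;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

public static class RawPageContent
{
    // Concatenates the raw content streams of every page in the document.
    public static string ReadAll(string docPath)
    {
        var sb = new StringBuilder();
        PdfDocument doc = PdfReader.Open(docPath);

        for (int p = 0; p < doc.Pages.Count; p++)
        {
            var elements = doc.Pages[p].Contents.Elements;
            for (int e = 0; e < elements.Count; e++)
            {
                PdfDictionary dict = elements.GetDictionary(e);
                if (dict != null && dict.Stream != null)
                {
                    sb.AppendLine(dict.Stream.ToString());
                }
            }
        }
        return sb.ToString();
    }
}
```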
The text that this produces will be full of various PDF formatting codes. If all you need to do is extract strings, though, you can find the ones you want using Regex or any other appropriate string searching code. If you need to do anything with the formatting or positioning, then good luck - from what I can tell, you'll need it.
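To give an idea of the Regex route: in an uncompressed content stream, text usually appears as literal strings in parentheses in front of the Tj or TJ operators, for example (Hello) Tj. The sketch below pulls out those literals. It is an assumption-laden simplification: it ignores hex strings like <48656C6C6F>, font-specific encodings, nested unescaped parentheses and compressed streams, and ContentStreamText is a made-up helper name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentStreamText
{
    // A PDF literal string: "(" ... ")" where the body may contain backslash escapes.
    private static readonly Regex Literal =
        new Regex(@"\((?<s>(?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    // Returns the literal strings found in raw content stream text, in order.
    public static List<string> ExtractLiterals(string content)
    {
        var result = new List<string>();
        foreach (Match m in Literal.Matches(content))
        {
            // Undo the escapes for "(", ")" and "\"; other escapes are left as-is.
            string s = Regex.Replace(m.Groups["s"].Value, @"\\([()\\])", "$1");
            result.Add(s);
        }
        return result;
    }
}
```

For the input BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET this returns the three strings Hello, Wor and ld; stitching fragments back into words is up to you.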
Example of PDFSharp libraries extracting images from .pdf file:
link
library
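In case the links are not reachable, here is a rough sketch of that kind of image export with PDFsharp. It only handles images stored with the /DCTDecode filter, whose stream bytes are already a JPEG file, so nothing is re-compressed. Images with other filters, a filter array, or resources inherited from a parent node are not handled, and JpegExport is just an illustrative name.

```csharp
using System.IO;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public static class JpegExport
{
    // Writes every JPEG-encoded image XObject to outputFolder; returns the count.
    public static int Export(string pdfPath, string outputFolder)
    {
        int count = 0;
        PdfDocument doc = PdfReader.Open(pdfPath);

        for (int p = 0; p < doc.Pages.Count; p++)
        {
            PdfDictionary resources = doc.Pages[p].Elements.GetDictionary("/Resources");
            PdfDictionary xObjects = resources?.Elements.GetDictionary("/XObject");
            if (xObjects == null) continue;

            foreach (PdfItem item in xObjects.Elements.Values)
            {
                var reference = item as PdfReference;
                var xObject = reference?.Value as PdfDictionary;
                if (xObject == null) continue;

                // Skip anything that is not an image, or not stored as JPEG.
                if (xObject.Elements.GetString("/Subtype") != "/Image") continue;
                if (xObject.Elements.GetName("/Filter") != "/DCTDecode") continue;

                string file = Path.Combine(outputFolder, "image" + count + ".jpg");
                File.WriteAllBytes(file, xObject.Stream.Value); // raw JPEG bytes
                count++;
            }
        }
        return count;
    }
}
```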
EDIT:
Then, if you want to extract text from the images, you have to use an OCR library.
There are two good OCR libraries: tessnet and MODI.
Link to thread on stack
I can fully recommend MODI, which I am using now. There is a sample at CodeProject.
EDIT 2 :
If you don't want to read text from the extracted images, you should write a new PDF document and put all of them into it. For writing PDFs I use MigraDoc. It is not difficult to use that library.
I need to verify that the PDF report is text based (and not bitmap based; however, it could contain some images). I do not need to extract the text, just to verify that it is text based.
Is there a way to perform such a verification using the iTextSharp library?
Thanks in advance,
Stefan
You can look for text drawing commands easily enough. The least work on your part would be to try to extract the text and see if anything is there. Ideally you'd know some of the text it should contain and search for it. A single sentence or phrase would be plenty for this sort of testing.
Text extraction with iText is pretty trivial these days. Lots of examples floating around SO, and the web.
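A minimal sketch of that check, assuming iTextSharp 5 (PdfTextCheck and ContainsText are made-up names): it extracts the text page by page and reports whether anything, or a known phrase, turns up.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextCheck
{
    // True if any page yields text; if 'expected' is given, it must occur on a page.
    public static bool ContainsText(string path, string expected = null)
    {
        var reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                string text = PdfTextExtractor.GetTextFromPage(reader, page);
                if (string.IsNullOrWhiteSpace(text)) continue;

                if (expected == null || text.Contains(expected)) return true;
            }
            return false;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Searching for a phrase you know should be there is the stronger test, since a scanned report can still carry a few stray text objects.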
I am able to read a PDF file using PDFBox in my ASP.NET application, but it does not add a space for an empty cell in a table. How can I read empty fields from a PDF file using PDFBox in C#? Is there any other method to read the PDF file?
Thanks.
You might be able to pull off this sort of thing if you know exactly where the text should be in advance and can get the locations of the text as you extract it.
If you don't know in advance where the rows and cells are, you'll have to guess based on the text locations. This will not be easy.
In general, extracting data from PDF is ill-advised. PDFs don't have a concept of "tables" (unless the PDF creator goes well out of their way to use "Marked Content", which is still rare). PDFs have lines, glyphs, and images (a pile of pixels). It is Very Hard to extract formatting from that information... and sometimes it is all but impossible.
I don't know if PDFBox will give you the locations of extracted text, but iTextSharp will.
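With iTextSharp 5, one way to get at those locations is a custom render listener. The sketch below prints every text chunk together with the start of its baseline; PositionListener and TextPositions are illustrative names, and deciding which chunks belong to which cell (and where a cell is empty) remains guesswork based on the coordinates.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Receives each text chunk as the page content is parsed.
public sealed class PositionListener : IRenderListener
{
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        // Start point of the chunk's baseline in page coordinates.
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Console.WriteLine("{0:F1}\t{1:F1}\t{2}",
            start[Vector.I1], start[Vector.I2], renderInfo.GetText());
    }
}

public static class TextPositions
{
    // Prints x, y and text for every chunk on the given (1-based) page.
    public static void Dump(string path, int pageNumber)
    {
        var reader = new PdfReader(path);
        try
        {
            new PdfReaderContentParser(reader)
                .ProcessContent(pageNumber, new PositionListener());
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If you know the column boundaries in advance, comparing each chunk's x value against them shows which cells in a row received no text.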