I am using iTextSharp to read PDF files using C#.
Text extraction with PdfTextExtractor.GetTextFromPage() returns all the text, as expected.
However, some PDFs contain a table of contents, an index, and page numbers, and these should be dropped.
I just want to get the paragraphs of body text.
I looked for available options by exploring ITextExtractionStrategy.
I am really clueless and any pointers will help.
I also explored isolating the fields using AcroFields, but that looks like a long shot.
Thank you.
Regards,
~Mayur
Related
I need to get a text that's being written by a user (in CKEditor HTML), and then add that text to a MigraDoc document, as a paragraph or whatever I need it to be.
My idea was to convert the text to an MDDL document (in memory) and add it to the document. But I don't know if there are any DLLs that permit that behaviour.
So, my question is, can someone give me pointers or advice on how I could make this happen? Should I parse the HTML text? If so, into what should I parse it? How can I add it afterwards?
Neither PDFsharp nor MigraDoc can parse HTML, so either write your own code or try to find a third-party library (which may not exist yet).
I would probably convert the HTML directly to MigraDoc document objects in memory.
MigraDoc / PDFsharp can't do this.
But you could use the HtmlAgilityPack NuGet package, then use its htmlDoc.DocumentNode.Descendants() to pull the pieces of text out of the HTML as a flat list, and node.ParentNode.Name to figure out which tag each piece of text is wrapped in. Then insert the text into your MigraDoc document with something like .AddFormattedText() and apply custom MigraDoc styles to it, e.g. if the parent tag is "strong", apply a MigraDoc style where Font.Bold = true.
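A minimal sketch of that idea, assuming the HtmlAgilityPack and MigraDoc packages are referenced. The class and method names (HtmlToMigraDocSketch, AppendHtml) are made up for illustration, and the sketch looks at all ancestors rather than only the direct parent so that nested tags such as `<strong><u>` both apply. Block-level structure (`<p>`, `<br>`, lists) is ignored here.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using MigraDoc.DocumentObjectModel;

public static class HtmlToMigraDocSketch
{
    // Hypothetical helper: appends the text runs of an HTML fragment to an
    // existing MigraDoc paragraph, mapping a few inline tags to formatting.
    public static void AppendHtml(Paragraph paragraph, string html)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(html);

        foreach (HtmlNode node in htmlDoc.DocumentNode.Descendants())
        {
            // Only text nodes carry content; element nodes are used as context.
            if (node.NodeType != HtmlNodeType.Text) continue;

            string text = HtmlEntity.DeEntitize(node.InnerText);
            if (string.IsNullOrWhiteSpace(text)) continue;

            // Collect the names of every enclosing tag (HtmlAgilityPack
            // reports element names in lower case).
            var tags = new HashSet<string>(node.Ancestors().Select(a => a.Name));

            FormattedText run = paragraph.AddFormattedText(text);
            if (tags.Contains("strong") || tags.Contains("b")) run.Bold = true;
            if (tags.Contains("em") || tags.Contains("i")) run.Italic = true;
            if (tags.Contains("u")) run.Underline = Underline.Single;
        }
    }
}
```

Calling `HtmlToMigraDocSketch.AppendHtml(section.AddParagraph(), editorHtml)` would then add one formatted run per text node; colours, fonts, and paragraph breaks would need further mapping rules of your own.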
I am converting our current PDF export from iTextSharp to MigraDoc. We currently render rich text with iTextSharp from a string stored in the database, for example:
<p><strong style=\"color: rgb(230, 0, 0);\"><u>test</u></strong></p>
iTextSharp is able to pick out the elements of this and render them appropriately using (I think) Cell.AddElement(ElementListItem).
Ideally I am looking for something identical to this for MigraDoc but any help on RTF in MigraDoc would be greatly appreciated.
MigraDoc does not parse HTML.
You can use the AddFormattedText method of the Paragraph class to mix various rich formats within one paragraph, but parsing the HTML is up to you.
See also:
http://pdfsharp.net/wiki/HelloMigraDoc-sample.ashx
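As a rough illustration of mixing formats with AddFormattedText, the sketch below builds by hand what the sample string from the question (bold, underlined, red "test") might look like in MigraDoc. The class and method names are hypothetical, and the surrounding plain text is only there to show the mix.

```csharp
using MigraDoc.DocumentObjectModel;

public static class RichTextSketch
{
    // Hypothetical helper: builds a document whose single paragraph mixes
    // plain text with one bold, underlined, red run.
    public static Document Build()
    {
        var document = new Document();
        Paragraph paragraph = document.AddSection().AddParagraph();

        paragraph.AddText("Plain text, then ");

        // TextFormat is a flags enum, so formats can be combined with |.
        FormattedText run = paragraph.AddFormattedText(
            "test", TextFormat.Bold | TextFormat.Underline);

        // rgb(230, 0, 0) from the HTML style attribute.
        run.Font.Color = new Color(230, 0, 0);

        paragraph.AddText(" and plain text again.");
        return document;
    }
}
```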
I need to create and insert a QR code into existing Word documents using .NET.
I've done the QR generation part. The 2 things I need to accomplish are:
Inserting the QR code in the footer of an existing Word document (preferably using Open XML).
Each page of the Word document has a unique QR code. This means that each footer would have to be different. (I could eliminate the footer and place the QR code as part of the body, but that would make the flow of text complicated.)
Is it possible to accomplish this?
I haven't done this, but I believe that what you will need to do is
1. put each page in a separate Word section (and that means, in effect, that you will need to decide what your page size and layout is)
2. create a footer containing one QR code to find out what XML Word expects, and what type of image data you need to store in the .docx (assuming that you are not attempting to store your image data externally in separate files).
3. create a footer for each section (and ensure that the footers are not "linked to previous"), replicating the format you discovered in point (2)
4. create a part for each QR code image, and a relationship to that part
What I am even less sure about is whether Word will insist that you also store each image in another format (e.g. Windows Metafile or Enhanced Metafile format). My guess is that Word will generate what it needs from your .jpg (or whatever). Or maybe you can use "AltChunks" in some useful way here.
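Steps 3 and 4 might be sketched with the Open XML SDK roughly as follows. This is an untested outline under assumptions: the document has already been split into one section per page, the QR codes are PNG files, and the names SectionFooterSketch, AddFooterPerSection, and qrPngPaths are invented for illustration. The drawing markup that actually displays the image is deliberately left out; as step 2 suggests, it is easiest to copy from a footer that Word itself saved.

```csharp
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class SectionFooterSketch
{
    // Hypothetical helper: qrPngPaths[i] is the QR image for the i-th section.
    public static void AddFooterPerSection(string docxPath, string[] qrPngPaths)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, true))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            var sections = main.Document.Body.Descendants<SectionProperties>().ToList();

            for (int i = 0; i < sections.Count && i < qrPngPaths.Length; i++)
            {
                // Step 3: one footer part per section.
                FooterPart footerPart = main.AddNewPart<FooterPart>();

                // Step 4: one image part per QR code, related to its footer.
                ImagePart imagePart = footerPart.AddImagePart(ImagePartType.Png);
                using (FileStream png = File.OpenRead(qrPngPaths[i]))
                {
                    imagePart.FeedData(png);
                }
                string imageRelId = footerPart.GetIdOfPart(imagePart);

                // Placeholder footer body. The w:drawing element that shows the
                // picture must reference imageRelId; it is omitted here.
                footerPart.Footer = new Footer(new Paragraph());
                footerPart.Footer.Save();

                // Give this section its own default footer instead of
                // inheriting the previous section's footer.
                SectionProperties sectPr = sections[i];
                foreach (FooterReference old in sectPr.Elements<FooterReference>().ToList())
                {
                    if (old.Type != null && old.Type.Value == HeaderFooterValues.Default)
                        old.Remove();
                }
                sectPr.PrependChild(new FooterReference
                {
                    Type = HeaderFooterValues.Default,
                    Id = main.GetIdOfPart(footerPart)
                });
            }

            main.Document.Save();
        }
    }
}
```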
The background to this is that if it were a .doc format document, you could have created a single footer containing a set of nested field codes that used the { PAGE } page number field to link to the correct image for each page - e.g.
{ INCLUDETEXT "c:\\myqrcodes\\qr{ PAGE }.jpg" }
or more likely, the slightly more complicated
{ PAGE \#"'{ INCLUDETEXT "c:\\myqrcodes\\qr{ PAGE }.jpg" }'" }
But if you try to save that as .docx format, even in compatibility mode, when you close and re-open, I think you will just see one image on all pages. Further, even though that approach works with .doc format, it only works if the external image files are actually there and located at absolute addresses in the file system. If they are located at relative addresses (there is a way to do that), you or the end user will probably have to update the footer field codes to get the correct results.
I am doing R&D on converting HTML to PDF.
We have created a page in ASP.NET and placed a CKEditor on it with simple options for selecting font, font size, bold, italic, etc. There are two more text boxes where the user can enter the height and width of the PDF to be generated. In addition, we have a div that shows a preview of the content based on the text entered in the editor. The div's height and width are set at run time; basically, with this div we want to show what the PDF is going to look like.
We are using the wkhtmltopdf executable to generate the PDF.
Now my problem is that the generated PDF is not an exact replica of the content shown in the div. Sometimes it matches the content line by line, but sometimes some words wrap to the next line in the PDF.
We have tried many things to achieve an exact result but have not succeeded. Any help is appreciated.
One thing you might try is using DOMPDF:
https://code.google.com/p/dompdf/
It's updated pretty frequently and has always given me great results.
Another suggestion would be to create styles specific to the PDF to get the desired result and then have those applied when the PDF is created.
In other words, if you have a "generate pdf" button, have it take all the HTML content and then insert PDF-specific style tags that you've tested in your PDF generator and that make the rendering match the browser. You could do this by just replacing:
</head>
With
<style>STYLES SPECIFIC TO PDF</style>
</head>
Upon creation of the PDF.
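Since the question is about an ASP.NET page, the replacement could be sketched in C# like this. The class and method names are hypothetical, and the CSS you pass in is whatever rules you have found to work in your PDF generator.

```csharp
using System;

public static class PdfStyleInjector
{
    // Hypothetical helper: inserts a PDF-only <style> block just before
    // the closing </head> tag. Returns the HTML unchanged if there is no
    // </head> to anchor on.
    public static string InjectPdfStyles(string html, string pdfCss)
    {
        int headEnd = html.IndexOf("</head>", StringComparison.OrdinalIgnoreCase);
        if (headEnd < 0)
        {
            return html;
        }

        string styleBlock = "<style>" + pdfCss + "</style>\n";
        return html.Insert(headEnd, styleBlock);
    }
}
```

The returned string is what would be handed to the PDF generator, while the browser preview keeps using the original HTML.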
PDF rendering, in my experience, usually requires a few extra rules to make it look like web rendering engines.
Hope that helps.
I'm using iTextSharp to render an html page in a PDF file.
Rendering is OK, but I can't find a way to put some blank space before every table.
Now the text before the table is too close.
Is there an attribute I can add to the stylesheet?
Thank you
Are you using iTextSharp's XMLWorker? In that case, the CSS property "margin-top" should help you.
If you're referring to HTMLWorker: it is deprecated in favor of XMLWorker, so you may consider migrating.
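A small sketch of the XMLWorker route, assuming iTextSharp 5.x with the separate XMLWorker package. The class name, the method name, and the 20px value are only examples, and how closely the spacing is honoured may depend on the XMLWorker version.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

public static class TableSpacingSketch
{
    // Hypothetical helper: renders well-formed XHTML to a PDF in memory,
    // applying a stylesheet that adds space above every table.
    public static byte[] Render(string xhtml)
    {
        const string css = "table { margin-top: 20px; }";

        using (var output = new MemoryStream())
        using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(xhtml)))
        using (var cssStream = new MemoryStream(Encoding.UTF8.GetBytes(css)))
        {
            var document = new Document();
            PdfWriter writer = PdfWriter.GetInstance(document, output);
            document.Open();

            // XMLWorker parses the XHTML and resolves the CSS rules for us.
            XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, htmlStream, cssStream);

            document.Close();
            return output.ToArray();
        }
    }
}
```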