Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
tl;dr :
Basically, I'm wondering what the best/easiest way to design a PDF document is. Is it remotely reasonable to design a whole PDF document in code with iTextSharp (i.e. without loading external files)? I want the final result to look similar to a web page, with various colours, borders, images and everything.
Or do you have to rely on other documents like .doc, .html files to achieve a good design?
Originally I thought that I would use HTML markup to generate the PDF. However, seeing how poor the support for styling/CSS seems to be, I am rethinking the whole situation.
Which led me to think: why even use HTML markup or a .doc(x) file to create the PDF design when I could do it right within the PDF, without having to rely on various files that serve no real purpose?
I started looking at this guide (it has several parts):
http://www.c-sharpcorner.com/uploadfile/f2e803/basic-pdf-creation-using-itextsharp-part-i/
A part of the guide :
using (MemoryStream ms = new MemoryStream())
{
    Document document = new Document(PageSize.A4, 25, 25, 30, 30);
    PdfWriter writer = PdfWriter.GetInstance(document, ms);
    document.Open();
    document.Add(new Paragraph("Hello World"));
    document.Close();
    writer.Close();
    // The MIME type for PDF is "application/pdf", not "pdf/application".
    Response.ContentType = "application/pdf";
    Response.AddHeader("content-disposition",
        "attachment;filename=\"First PDF document.pdf\"");
    // ToArray() returns only the bytes written; GetBuffer() also includes unused capacity.
    byte[] pdf = ms.ToArray();
    Response.OutputStream.Write(pdf, 0, pdf.Length);
}
And got the basic hang of generating PDFs and designing them a bit.
But is it possible to generate and design large PDF documents this way? And are there any more thorough guides covering the various commands for generating text, images, borders and so on? I have no real experience generating PDFs with code.
The question is too broad, so I can only give you a very broad answer.
Option 1: you create your layout by using iText's high-level objects. There are countless applications out there that are using PdfPTable to generate complex reports. For instance: the time tables for a German Railway company are created from scratch through code; the invoices for a Belgian Telco company are created this way,... The advantage of this approach is that you can really fine-tune the layout. The disadvantage is that you need to change source code as soon as you want to change the layout.
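To give an idea of what Option 1 looks like in practice, here is a minimal sketch using iTextSharp 5's PdfPTable. The class name, the sample rows and the styling choices are assumptions made for illustration; they do not come from any of the applications mentioned above.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class TableReportSketch
{
    // Builds a one-page PDF with a styled three-column table and returns its bytes.
    public static byte[] Build()
    {
        using (var ms = new MemoryStream())
        {
            var document = new Document(PageSize.A4, 25, 25, 30, 30);
            PdfWriter.GetInstance(document, ms);
            document.Open();

            // Three columns, stretched to the full width between the margins.
            var table = new PdfPTable(3) { WidthPercentage = 100 };

            // Header cell spanning all columns, with a background colour.
            var header = new PdfPCell(new Phrase("Invoice (sample data)"))
            {
                Colspan = 3,
                BackgroundColor = BaseColor.LIGHT_GRAY,
                HorizontalAlignment = Element.ALIGN_CENTER,
                Padding = 6f
            };
            table.AddCell(header);

            // Hypothetical rows; real data would come from your own source.
            string[,] rows = { { "Item", "Qty", "Price" }, { "Widget", "2", "9.99" } };
            for (int r = 0; r < rows.GetLength(0); r++)
            {
                for (int c = 0; c < rows.GetLength(1); c++)
                {
                    table.AddCell(new PdfPCell(new Phrase(rows[r, c]))
                    {
                        BorderColor = BaseColor.GRAY
                    });
                }
            }

            document.Add(table);
            document.Close();
            return ms.ToArray();
        }
    }
}
```

Every layout decision (column count, colours, padding) lives in the source, which is exactly the trade-off described: fine control, but a code change for every layout change.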
Option 2: you create your layout by creating an AcroForm template. Every field in this template has a name and is visualized at exact positions (defined by its coordinates) on specific pages. The code to fill out such a form consists of only a handful of lines. Whenever you need to change the layout, you alter the AcroForm template. You do not need to change your code. The disadvantage is that AcroForms are very static. Compare it to a paper form: you can't insert a row in a paper form either.
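The "handful of lines" needed to fill an AcroForm template could look roughly like the sketch below, written against iTextSharp 5. The class name, the method name and the idea of passing field values as a dictionary are assumptions for illustration; the field names must match whatever names were given to the fields in the template.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

public static class FormFillSketch
{
    // Fills the named fields of an AcroForm template and writes a flattened copy.
    public static void Fill(string templatePath, string outputPath,
                            IDictionary<string, string> values)
    {
        var reader = new PdfReader(templatePath);
        using (var output = new FileStream(outputPath, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            AcroFields form = stamper.AcroFields;

            // Keys are the field names defined in the template (assumed to exist).
            foreach (var pair in values)
            {
                form.SetField(pair.Key, pair.Value);
            }

            // Flattening turns the fields into regular page content.
            stamper.FormFlattening = true;
            stamper.Close();
        }
        reader.Close();
    }
}
```

Changing the layout then means editing the template PDF only; this code stays the same as long as the field names do.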
Option 3: you create your data in XHTML format and your styles in CSS. A Belgian printing company responsible for creating invoices for its customers is streaming data into very simple HTML files involving a sequence of tables that never span more than a handful of pages. These files are then fed to iText's XML Worker along with a CSS that is different for each of its customers. The advantage of this approach is that no extra programming is needed when a new customer joins. It's just a matter of creating a new CSS. The disadvantage is that you are limited by the HTML format. Elementary logic also tells you that you shouldn't expect URL2PDF: have you ever tried printing a website? Well, the bad quality of that print should give you an indication of the problems you'll encounter when trying to convert HTML to PDF. If you anticipate them, you can get good results. If you don't... it's a poor craftsman who blames his tools...
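A minimal sketch of the XHTML + CSS route, assuming iTextSharp 5 together with the separate XML Worker assembly (namespace iTextSharp.tool.xml). The class and parameter names are made up for illustration, and the input is assumed to be well-formed XHTML.

```csharp
using System.IO;
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.tool.xml;

public static class XhtmlToPdfSketch
{
    // Converts an XHTML string plus a customer-specific CSS string into PDF bytes.
    public static byte[] Convert(string xhtml, string css)
    {
        using (var ms = new MemoryStream())
        {
            var document = new Document(PageSize.A4);
            PdfWriter writer = PdfWriter.GetInstance(document, ms);
            document.Open();

            using (var htmlStream = new MemoryStream(Encoding.UTF8.GetBytes(xhtml)))
            using (var cssStream = new MemoryStream(Encoding.UTF8.GetBytes(css)))
            {
                // XML Worker parses the markup and adds the resulting elements to the document.
                XMLWorkerHelper.GetInstance().ParseXHtml(writer, document, htmlStream, cssStream);
            }

            document.Close();
            return ms.ToArray();
        }
    }
}
```

Supporting a new customer would then amount to passing a different CSS string, with no change to the conversion code.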
Option 4: define your template using the XML Forms Architecture. Such templates are usually created using Adobe LiveCycle Designer. An XSD is fed into LC Designer and the result is an empty form where the PDF format acts as a container for an XML stream. You can then use iText to inject your custom XML containing data that conforms with the XSD into the PDF and you can use XFA Worker to flatten such a form. XFA Worker is only available as a closed source product (givers need to set limits because takers rarely do).
Option 5: right now XML Worker is used to convert XHTML+CSS and XFA to PDF (ordinary PDF, PDF/A, PDF/UA). You could use the generic XML Worker engine to support your own XML format. The advantage would be a very powerful engine that you can tune to meet your exact needs. The disadvantage is that this involves a serious up-front development investment.
Option 6: use a third party tool to define the template and a third party server that uses iText under the hood to create PDFs based on the template. An example of such a third party tool is Scriptura developed by Inventive Designers. There are other tools, but Inventive Designers is a customer of iText and we know that they are using iText correctly whereas we don't have this guarantee from other vendors.
I'm sure you can achieve your goal with iText. But IMHO, you should first investigate report generators such as:
Sql Server Reporting Services, or
Crystal Report, or
...
For roughly the same time investment, you get several output formats: HTML, DOCX, PDF, XLSX...
Of course, the relevance of this answer may vary according to the nature of the document to produce: if we are talking about 1000-page documentation, a report generator is not necessarily the best way.
Related
I've just downloaded iTextSharp and before I put a lot of effort into this I'd like to know if this scenario is possible with it. We have a client that is insisting that their SSRS report PDFs contain a table of contents, preferably with page numbers. The various components of these reports have highly variable lengths so we can't hard code actual page numbers. As you all probably know, there is no direct way to create a Table of Contents in SSRS. (We've even had a special session with the Microsoft rep about this.)
What I would like to do is as follows:
Mark the target locations in the SSRS report by setting their DocumentMapLabel property.
Generate the PDF in the usual fashion, either from the report server or a ReportViewer control. (This will be in C#.)
Open the PDF in my hypothetical code.
Insert a blank page at or near the front.
Scan the PDF for DocumentMapLabels (and, ideally, detect which page they're on).
Populate the blank page with links to the various sections.
Is this possible?
I wouldn't use your design. As soon as the TOC needs more than one page, you're in trouble. Maybe you're confident that this won't happen today, but what if that's needed tomorrow?
You have different options:
Create your document in one go. Add the TOC at the end. Reorder the pages before closing the document.
Create a document (e.g. in memory) using named destinations for the targets. Create a document (e.g. in memory) with the TOC referring to the named destinations. Merge the two documents into one document, consolidating the named destinations.
Create a document with bookmarks (this will result in a bookmarks panel to the left in Adobe Reader). Then read the bookmarks to create a TOC in PDF and merge the PDF with the TOC with the document that has the bookmarks.
All of this is documented.
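As a rough illustration of the first option (TOC appended at the end, then moved to the front), the sketch below reorders an already finished PDF with iTextSharp 5's PdfReader.SelectPages and PdfStamper. The class and method names are assumptions, and the sketch only moves pages; it does not build the TOC itself, and bookmarks or named destinations may need separate handling.

```csharp
using System.IO;
using iTextSharp.text.pdf;

public static class TocReorderSketch
{
    // Builds a page-selection string that moves the last `tocPages` pages to the front,
    // e.g. 5 pages with a 1-page TOC gives "5,1-4".
    public static string TocFirstRange(int totalPages, int tocPages)
    {
        int firstToc = totalPages - tocPages + 1;
        if (tocPages <= 0 || firstToc <= 1)
        {
            // Nothing to move: keep the original order.
            return "1-" + totalPages;
        }
        string toc = firstToc == totalPages
            ? totalPages.ToString()
            : firstToc + "-" + totalPages;
        return toc + ",1-" + (firstToc - 1);
    }

    // Returns a copy of the PDF in which the trailing TOC pages come first.
    public static byte[] MoveTocToFront(byte[] pdf, int tocPages)
    {
        var reader = new PdfReader(pdf);
        reader.SelectPages(TocFirstRange(reader.NumberOfPages, tocPages));
        using (var ms = new MemoryStream())
        {
            var stamper = new PdfStamper(reader, ms);
            stamper.Close();
            reader.Close();
            return ms.ToArray();
        }
    }
}
```

Because the TOC is written last, its length is known before reordering, so a TOC that grows to several pages does not break the design.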
In The Best iText Questions on StackOverflow, you'll find the answers to these (and many other) questions:
How can I add titles of chapters in ColumnText? (this sounds exactly like what you need)
Create Index File(TOC) for merged pdf using itext library in java
PDF Page re-ordering using itext
How to reorder the pages of a PDF file?
What you want to do is possible, but not the way you describe it. Read the book, pick an option and post a new question if you have a problem with the option you picked. Just download that book; it's free of charge.
Note: iText(Sharp) is free software, NOT freeware. This means that it is only free of charge if you agree with the open source license (the AGPL). It is not free of charge in all situations as explained in this video. That's also important to know before you start an iText(Sharp) project.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
What I'm looking for is a C# solution to import data from PDF documents into our database, in a commercial application. Our customers will be looking to import any arbitrary document. Ordinarily I'd write this off as a complete impossibility, but the documents they're importing will be in their own set layout.
My plan is to have the PDFs rendered to static images, then allow the users to set up their own templates, which essentially pull out text at predefined pixel-offsets in the PDF, using OCR. For tables, they define a location of the table and a bunch of further values for column and row sizes. We can then apply the template onto that document type.
So, what I'm really looking for is two libraries: one to convert PDFs to images, another to OCR those images.
Requirements:
Is pure-C# or has a supported C# wrapper onto a native DLL.
Doesn't fork out processes - wrappers that essentially just create command line parameters and launch an external executable aren't allowed in this case.
In the case of FOSS, allows us to exempt ourselves from normal FOSS license requirements (i.e. publishing our source code) by paying a license fee.
We certainly don't mind paying for a commercial solution, but we'd rather not get stuck with paying a fee per individual distribution of the software.
I know this is quite a specific requirement set - perhaps enough for some people to deem this question too localised, but I'm hoping that someone can suggest an approach and some libraries that can be helpful to me, as well as others in the future.
Stuff I've looked into for the PDF side:
iTextSharp - Documentation is a book you have to buy, not a good start. Doesn't seem to be much useful documentation regarding turning PDFs into images in the public domain. Licensing is opaque, looks like we have to pay per client we distribute to.
Docotic.Pdf - Text only, no use to us.
pdftohtml - Again, doesn't produce images. Would be a mess to port to C# too.
PdfFileParser - Still not what we need.
GhostScript - Pretty much exactly what we want, but requires forking out to a program.
For the OCR side, I'll probably end up using Tesseract, since the Apache license is permissive and it's got good reviews. If there's an alternative, I'd be interested in that too.
I would like to recommend Amyuni PDF Creator .Net for this task.
1st Scenario:
If your PDF files are well defined (no missing font information etc) you could directly extract the text from the PDF by specifying a rectangular region in the method GetObjectsInRectangle. You should also use the option acGetRectObjectsOptimize:
Optimize text objects before returning them. That is, combine text objects that are close to each other into a single text object.
2nd Scenario:
If there are images involved that also contain text, rendering the whole page into an image and then applying OCR might be a better choice. You can do this with Amyuni PDF Creator .Net by using the methods ExportToTiff, ExportToJPeg, or RasterizePageRange.
From the documentation:
IacDocument.RasterizePageRange Method: The RasterizePageRange method converts page contents into a color or grey scale image. When archiving documents or performing OCR, it is sometimes preferable for all pages to be stored as images rather than complex text and graphic operations.
Then you can use our OCR add-in that integrates with Tesseract OCR and finally we fall again into the 1st Scenario (GetObjectsInRectangle). In order to apply OCR to your files you can use the method OCRPageRange.
void OCRPageRange(int startPage, int EndPage, string Language, acOCROptions Options)
About licensing, Amyuni PDF Creator .Net provides a (per application) royalty free license.
Usual disclaimer applies
I think you might want to give Docotic.Pdf another chance.
The library can extract text chunks, words and even individual characters with their bounding rectangles. Please have a look at the sample for extraction of words from PDFs.
Also, Docotic.Pdf can create images from PDFs and draw pages on a System.Drawing.Graphics. Please have a look at Draw and print Pdf group of samples.
Disclaimer: I am one of the developers of the library.
I need to generate a high quality report based on information in a SQL Server database, and I want very explicit control of the layout and appearance from inside C#.
I have several choices that I know of that are already being used for various other reports at our company:
1) SQL Server's built in Reporting Services
2) Adobe Forms
3) Crystal Reports
This information I need as PDF directly parallels what is already being displayed in the user's web browser as HTML, so creating a print stylesheet and converting the browser body to PDF is an option as well.
So this creates option 4:
4) JavaScript convert HTML to PDF (my preference at this time)
Does anybody have a recommendation as to which approach I should take, or even better an alternative? All the choices seem pretty horrible.
I've used iTextSharp with very good results. It is an open-source .NET port of a Java library. It works really well for creating PDFs from scratch. Remember that editing PDFs will always be hacky with any library, because PDF is an output format, not a read-write format.
Provided your HTML is fairly clean (remove JavaScript postbacks, anchors, ...), the iText HTMLWorker can convert HTML to PDF, if you prefer that route.
HTML to PDF using iTextSharp:
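Document doc = new Document(PageSize.A4);
PdfWriter.GetInstance(doc, Response.OutputStream); // attach the writer before opening
doc.Open();
HTMLWorker parser = new HTMLWorker(doc);
parser.Parse(new StringReader(html)); // html: your markup string (assumed name)
doc.Close();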
Also here.
Use SSRS, it has a built in PDF rendering mode.
I have used two other PDF report libraries with great success; Active Reports and Telerik Reporting. Personally I prefer the latter when it comes to programmatic control of layout and such.
Take a look also at the DevExpress Reporting (non-free 3rd party tool):
Overview
Online Demos
Documentation
Yes, you should use the best tools to get the best solution. The best tool in this case probably is SSRS.
But that's just looking at the capabilities of the tool.
Don't forget to look at your own capabilities!
My story: I know SQL, I know C#. (Both intermediate, I'm not a guru.)
Then I lay my hands on SSRS. And burnt them, once, twice, etc.
At the end, there was a nice result. So burning your fingers is not a wrong thing to do.
But first try to pull your HTML through an HTML-to-PDF converter (demo version) and see if the result serves your needs.
Currently I'm using both:
SSRS for creating invoices, because amounts have to be transported from one page to the next
Winnovative to generate documents that only need page numbers
I would suggest using the .NET ReportViewer control in local mode (no report server required). It works in both WebForms and WinForms. You create a client-side report (.rdlc) file (which contains all the visuals as well as placement of data fields), link it up to the ReportViewer, and supply the data (a DataTable or a collection of objects; as long as the fields match, it doesn't matter). In client mode it supports exporting to PDF and Excel (and Word too? I don't remember). By default this is done through a dropdown in the control itself; however, you can programmatically export to any of the supported formats as well. You'll end up with a byte array you can shove into a file stream.
Basically you get most of the good parts of SSRS without all of that backend complexity. There should be a ReportViewer folder in %programFiles%\Microsoft Visual Studio 10.0\ReportViewer - but versions exist for 2005 and 2008 as well. Check out http://gotreportviewer.com/
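The programmatic export mentioned above can be sketched as follows. This assumes the WebForms flavour of the ReportViewer assemblies (Microsoft.Reporting.WebForms; the WinForms namespace exposes the same types), and the class, method and parameter names are invented for illustration. The data set name must match the one defined inside the .rdlc file.

```csharp
using System.Collections;
using Microsoft.Reporting.WebForms;

public static class LocalReportSketch
{
    // Renders a client-side .rdlc report to PDF without a report server.
    public static byte[] RenderPdf(string rdlcPath, string dataSetName, IEnumerable rows)
    {
        var report = new LocalReport { ReportPath = rdlcPath };

        // The name must match the data set declared in the .rdlc (assumed to exist).
        report.DataSources.Add(new ReportDataSource(dataSetName, rows));

        string mimeType, encoding, extension;
        string[] streamIds;
        Warning[] warnings;

        // "PDF" selects the built-in PDF renderer; null means default device info.
        return report.Render("PDF", null, out mimeType, out encoding,
                             out extension, out streamIds, out warnings);
    }
}
```

The returned byte array can be written to a file stream or to an HTTP response, as described above.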
I think the 4th option is the best. In this case you don't need to update the PDF layout whenever the HTML page layout changes, or vice versa.
It is also more convenient making a nice design via HTML than programmatically via C# :)
Take a look at WebToPDF.NET which is a .NET component written in C# that converts HTML to PDF. The converter supports HTML 4.01, XHTML 1.0, XHTML 1.1 and CSS 2.1 including page breaks, forms and links. It passes all W3C tests (except BIDI).
You can use Fast Report; it's a good tool and it has a free version.
What is the best possible way to merge multiple documents and convert them to PDF? We also need to insert blank pages for every odd page.
A fully supported, server-side automated version of this (mostly baked into the MS camp though) involves using the OpenXML SDK to do any field inserts, then using SharePoint's Word Automation Services (SP 2010) to convert the documents to PDF, and then picking your favorite PDF toolkit (iTextSharp for me) for any post-processing (merging documents, inserting blank pages, or images that must be positioned relative to specific pages).
The reason for doing the document merge in PDF rather than OpenXML is simplicity - you don't have to deal with merging styles, headers etc.
The reason for doing the blank pages and image insertion is that OpenXML has no idea how to render the content, and so it has no idea where page breaks would occur naturally (you can still insert breaks like you would in Word though).
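The PDF post-processing step could be sketched with iTextSharp 5's PdfCopy as below. It assumes one reading of the requirement, namely that a blank page is appended after every source document with an odd page count so that the next document starts on a front side; the class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class MergeSketch
{
    // Concatenates PDFs, padding each odd-length document with one blank page.
    public static byte[] MergeWithPadding(IEnumerable<byte[]> pdfs)
    {
        using (var ms = new MemoryStream())
        {
            var document = new Document();
            var copy = new PdfCopy(document, ms);
            document.Open();

            foreach (byte[] pdf in pdfs)
            {
                var reader = new PdfReader(pdf);
                int n = reader.NumberOfPages;

                for (int page = 1; page <= n; page++)
                {
                    copy.AddPage(copy.GetImportedPage(reader, page));
                }

                if (n % 2 == 1)
                {
                    // Blank page with the same size as the document's last page.
                    copy.AddPage(reader.GetPageSize(n), 0);
                }

                copy.FreeReader(reader);
                reader.Close();
            }

            document.Close();
            return ms.ToArray();
        }
    }
}
```

Since the inputs are already rendered PDFs, the page counts are known at this stage, which is the reason given above for doing this step in PDF rather than in OpenXML.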
If you are using C# and you are OK with a server based solution then have a look at this post. It uses a .net friendly web services interface.
There is an optional SharePoint version available as well, but as you did not include a SharePoint tag I assume that won't be of interest to you.
Full disclosure, I wrote that post.
I'm using iTextSharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change text if there are AcroFields, but my PDF doesn't have any. It just has plain text, and I need to change some of it.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is that you set up a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form.)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
I haven't used itextsharp, but I have been using PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.