I have generated PDFs using RDLC and then combined multiple PDF files into a single document using iTextSharp's PdfSmartCopy class. But the resulting PDF is large and I want to reduce its size. I have tried compressing it using iTextSharp, but that doesn't shrink it. When I upload the PDF file to ilivepdf.com for online compression, it compresses the 21 MB file to 1 MB.
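For reference, my merge step looks roughly like this (file names and the generatedFiles list are placeholders):

    // merge the generated RDLC PDFs into one document with PdfSmartCopy
    Document doc = new Document();
    using (FileStream fs = new FileStream("merged.pdf", FileMode.Create))
    {
        PdfSmartCopy copy = new PdfSmartCopy(doc, fs);
        doc.Open();
        foreach (string file in generatedFiles)   // the RDLC output files
        {
            PdfReader reader = new PdfReader(file);
            for (int i = 1; i <= reader.NumberOfPages; i++)
                copy.AddPage(copy.GetImportedPage(reader, i));
            copy.FreeReader(reader);
            reader.Close();
        }
        doc.Close();
    }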
Often, the problem is related to embedded fonts.
You see, PDF really strives to preserve your document exactly as you made it.
To do that, a PDF library can decide to embed a font. You can imagine this as simply putting the font file into the PDF document.
But, here comes the tricky part.
The PDF specification took into account that this may be overkill.
I mean, if you are only using the 50-something characters typically used in Western languages, it makes little sense to embed the entire font.
So PDF supports a feature called "font subsetting". This means, instead of embedding the entire font, only those characters that are actually used are embedded in the document.
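In iText(Sharp) terms, embedding and subsetting are decided when you create the BaseFont; a minimal sketch (the font path is an assumption):

    // embed only the glyphs actually used; Subset defaults to true for embedded fonts
    BaseFont bf = BaseFont.CreateFont(@"C:\Windows\Fonts\times.ttf",
        BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
    bf.Subset = true;
    Font font = new Font(bf, 12f);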
So what is going wrong exactly when you're merging these documents?
(I will skip a lot of the technical details.)
In order to differentiate between a fully embedded font, system font, or subset embedded font, iText generates a new font name for your fonts whenever it embeds them.
So a document containing a subset of Times New Roman might have "Times-AUHFDI" in its resources.
Similarly, a second document (again containing a subset of Times New Roman) might list "Times-VHUIEF" as one of its resources.
I believe it simply adds a random 6-character suffix. (ex-iText developer here)
PdfSmartCopy has to decide what to do with these resources. And sadly, it doesn't know whether these fonts are actually the same. So it decides to embed both these subsets into the new document.
This is a huge file-size penalty.
If you have 100 documents, all using a subset of the same font, that subset will be embedded 100 times.
The other tool you listed might actually check whether these fonts are the same (and if they are, embed them only once). Or the other tool might simply not care that much and assume based on the partial name match that they are the same.
The ideal solution would of course be to compare the actual characters in the font, to see whether these two subsets can be merged.
But that would be much more difficult (and might potentially be a performance penalty).
What can you do?
There are 14 standard fonts (the so-called "Standard 14") that are never embedded. They are assumed to be present on every system (hence why they are never embedded.)
If you have control over the process that generates the PDF documents, you could simply decide to create them using only these fonts.
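With iText(Sharp), that simply means creating your fonts from the standard set and not embedding them; a sketch:

    // Helvetica is one of the standard fonts, so nothing gets embedded
    BaseFont helvetica = BaseFont.CreateFont(
        BaseFont.HELVETICA, BaseFont.WINANSI, BaseFont.NOT_EMBEDDED);
    Font font = new Font(helvetica, 12f);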
Alternatively you could write a smarter PdfSmartCopy. You would need to look into how fonts are built and stored, and perform the actual comparison I mentioned earlier.
Ask for technical support at iText. If enough people request this particular feature, you may get it.
I have a folder where multiple clients upload multiple PDF files.
Some of them use embedded fonts, some don't.
I've been working on a service that optimizes (in terms of file size) the PDF files in this folder.
Each user may be uploading around 400 files, weighing anywhere between 80 KB and 10 MB, and my task is to optimize all of them to the smallest possible file size with minimal quality loss.
The PDF Library is doing a great job with it. My only problem is that I can't remove all embedded fonts from all files, since some of the files might actually use those fonts, and the result would be a file that I can't use.
So my questions are:
How can I detect which files use embedded fonts and which don't?
When optimizing the files that do use embedded fonts, how can I remove only the unused fonts?
What I want to achieve is to remove all embedded fonts from most of the files, but keep the embedded fonts in the files where I actually need them. I understand that this depends on the fonts I have on my system (these files should stay on a single system, so portability is not that important to me), so I am trying to find a way to identify, before optimizing, which files will look OK without embedded fonts and which files need to keep their embedded fonts.
APDFL has a PDFontIsEmbedded() call. The DotNet interface's Font class has an Embedded property. Saving with the GarbageCollect SaveFlag should remove any unreferenced indirect objects, including fonts.
Note that Resource Dictionaries could potentially be shared by multiple pages so that fonts not used by one page might be used by another page that uses the same resource dictionary.
The Adobe PDF Library, version 15 and up, has a service that will optimize PDF files for you.
The Optimizer has a function to subset all embedded fonts. What that will do is create a subset of each font limited to only the glyphs of that font actually used by the document. The API is below.
    void Datalogics::PDFL::PDFOptimizer::SetOption(OptimizerOption option, bool value)
    void Datalogics::PDFL::PDFOptimizer::Optimize(Document document, string newPath)
This is the option that you need:

    SubsetAllEmbeddedFonts
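Putting it together, a minimal sketch of the optimizer call (the OptimizerOption member name and the Library initialization pattern are assumptions based on typical Datalogics samples):

    using Datalogics.PDFL;

    using (Library lib = new Library())               // initialize APDFL
    {
        Document doc = new Document("input.pdf");
        PDFOptimizer optimizer = new PDFOptimizer();
        optimizer.SetOption(OptimizerOption.SubsetAllEmbeddedFonts, true);
        optimizer.Optimize(doc, "optimized.pdf");     // writes the optimized copy
    }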
I've just downloaded iTextSharp and before I put a lot of effort into this I'd like to know if this scenario is possible with it. We have a client that is insisting that their SSRS report PDFs contain a table of contents, preferably with page numbers. The various components of these reports have highly variable lengths so we can't hard code actual page numbers. As you all probably know, there is no direct way to create a Table of Contents in SSRS. (We've even had a special session with the Microsoft rep about this.)
What I would like to do is as follows:
1. Mark the target locations in the SSRS report by setting their DocumentMapLabel property.
2. Generate the PDF in the usual fashion, either from the report server or a ReportViewer control. (This will be in C#.)
3. Open the PDF in my hypothetical code.
4. Insert a blank page at or near the front.
5. Scan the PDF for DocumentMapLabels (and, ideally, detect which page they're on).
6. Populate the blank page with links to the various sections.
Is this possible?
I wouldn't use your design. As soon as the TOC needs more than one page, you're in trouble. Maybe you're confident that this won't happen today, but what if that's needed tomorrow?
You have different options:
Create your document in one go. Add the TOC at the end. Reorder the pages before closing the document.
Create a document (e.g. in memory) using named destinations for the targets. Create a document (e.g. in memory) with the TOC referring to the named destinations. Merge the two documents into one document, consolidating the named destinations.
Create a document with bookmarks (this will result in a bookmarks panel to the left in Adobe Reader). Then read the bookmarks to create a TOC in PDF and merge the PDF with the TOC with the document that has the bookmarks.
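For example, the reordering step in the first option can also be done in a second pass with PdfReader.SelectPages; a sketch with made-up page numbers:

    // move a 2-page TOC from the back of a 10-page file to the front
    PdfReader reader = new PdfReader("report.pdf");
    reader.SelectPages("9-10,1-8");                   // the new page order
    using (FileStream fs = new FileStream("reordered.pdf", FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(reader, fs);
        stamper.Close();
    }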
All of this is documented.
In The Best iText Questions on StackOverflow, you'll find the answers to these (and many other) questions:
How can I add titles of chapters in ColumnText? (this sounds exactly like what you need)
Create Index File(TOC) for merged pdf using itext library in java
PDF Page re-ordering using itext
How to reorder the pages of a PDF file?
What you want to do is possible, but not the way you describe it. Read the book, pick an option and post a new question if you have a problem with the option you picked. Just download that book; it's free of charge.
Note: iText(Sharp) is free software, NOT freeware. This means that it is only free of charge if you agree with the open source license (the AGPL). It is not free of charge in all situations as explained in this video. That's also important to know before you start an iText(Sharp) project.
Have any of you had experience with the accuracy of iTextSharp when reading text from a multi-page scanned PDF?
The thing is, I have tried to read a PDF with both the basic search function within Adobe Reader and also with iTextSharp.
iTextSharp manages to find roughly 50% of the occurrences of a given word compared to (what I call) 100% by Adobe.
[iTextSharp: 1000 occurrences // Adobe Reader: >2000]
Is this a known "problem"?
Edit: I should add: it's already been OCR'ed by the time I'm searching.
As @ChrisHaas already explained, without code and PDF samples it's hard to be specific.
First of all, saying iTextSharp manages to find roughly 50% of the occurrences of a given word is a bit misleading, as iText(Sharp) does not directly expose methods to find specific text in a PDF and, therefore, actually finds 0%. It merely provides a framework and some simple examples for text extraction.
Using this framework to seriously search for a given word requires more than applying those simple sample usages (provided by the SimpleTextExtractionStrategy and the LocationTextExtractionStrategy, also working under the hood when using PdfTextExtractor.GetTextFromPage(myReader, pageNum)) in combination with some Contains(word) call. You have to:
- create a better text extraction strategy which
  - has a better algorithm to recognize which glyphs belong to which line; e.g. the sample strategies can fail utterly for scanned pages with OCR'ed text where the text lines are not 100% straight but minimally ascending;
  - recognizes poor man's bold (printing the same letter twice with a very small offset to achieve the impression of a bold character style) and similar constructs and transforms them accordingly;
- create a text normalization which
  - resolves ligatures;
  - unifies alternative glyphs of the semantically same or similar characters;
- normalize both the extracted text and your search term, and only then search.
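To illustrate the normalization step: .NET's compatibility normalization already decomposes common ligatures (the single glyph ﬁ becomes the two characters fi), so a minimal sketch of the last point could look like this:

    // extract a page, normalize both sides, then search
    // (needs System.Text for NormalizationForm)
    string pageText = PdfTextExtractor.GetTextFromPage(
        reader, pageNum, new LocationTextExtractionStrategy());
    string haystack = pageText.Normalize(NormalizationForm.FormKC);
    string needle = searchTerm.Normalize(NormalizationForm.FormKC);
    bool found = haystack.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0;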
Furthermore, as @ChrisHaas mentioned, special attention has to be paid to spaces in the text.
If you create an iText-based text search with those criteria in mind, you'll surely get an acceptable hit rate. Getting as good as Adobe Reader is quite a task as they already have invested quite some resources into this feature.
For completeness' sake: you should not only search the page content (and everything referred to from there) but also the annotations, which can carry quite some text content too and may even appear as if they were part of the page, e.g. in the case of free text annotations.
Without knowing the specifics of your situation (PDF in question, code used, etc) we can't help you too much.
However, I can tell you that iTextSharp is more of a literal text extractor. Since text in a PDF can be, and often is, non-contiguous and non-linear, iTextSharp takes any contiguous characters and builds what we think of as words and sentences. It also tries to combine characters that appear to be "pretty much on the same line" (such as text on a slight angle, as OCR'd text often is). Then there are "spaces", which should be simple ASCII 32 characters but often aren't; iTextSharp goes the extra mile and attempts to calculate whether two text runs should be separated by spaces.
Adobe probably has further heuristics that are able to guess even more about text. My guess would be that they have a larger threshold for guessing at combining non-linear text.
So, I have used Pdf995's PDF print driver from a web browser to print web pages and eventually use PdfEdit995 to join these various PDF files into one large PDF.
Now I have a lot of large PDF documents that I wish to add bookmarks to, but am hoping there is a relatively easy way of doing this programmatically (using C#, preferably) - basically, I want to find, within each PDF, text that is large enough to qualify as a header, and use that text as the bookmark.
Any tips/advice/direction? Thanks!
It's definitely possible to do this, but I would recommend finding a PDF library that does most of the leg work. Technically you could do it all yourself with the aid of the PDF specification, but that'd probably take more time than it's worth.
The library will need to be able to let you find text in a document and then return the page and size, font, etc, of the text and create bookmarks (also known as outlines) based on that information programmatically.
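Whatever library you pick, the detection side will look something like the following rough sketch, here in iTextSharp terms (the 14pt threshold is an arbitrary assumption, and size alone won't catch every header):

    // flag text drawn above a size threshold as a potential bookmark target
    class HeaderListener : IRenderListener
    {
        public void BeginTextBlock() { }
        public void EndTextBlock() { }
        public void RenderImage(ImageRenderInfo info) { }
        public void RenderText(TextRenderInfo info)
        {
            // approximate the effective font size from baseline-to-ascent distance
            float size = info.GetAscentLine().GetStartPoint()[Vector.I2]
                       - info.GetBaseline().GetStartPoint()[Vector.I2];
            if (size > 14f)   // assumed "large enough to qualify as a header"
                Console.WriteLine("Candidate header: " + info.GetText());
        }
    }
    // usage: new PdfReaderContentParser(reader).ProcessContent(page, new HeaderListener());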
My company's product, Quick PDF Library, can help you do this, and so can PDFKit.NET. I'm sure there are other libraries out there that support this functionality too. As far as free libraries go, from what I've seen I don't believe that PDFSharp or iText will meet all of your requirements in this case, but I'm sure someone will correct me if I'm wrong.
If you'd prefer to develop a solution for this entirely yourself, then the PDF reference is available online for free.
I'm using itextsharp to generate the PDFs, but I need to change some text dynamically.
I know that it's possible to change text if there is an AcroField, but my PDF doesn't have any. It just has some plain text, and I need to change some of it.
Does anyone know how to do it?
Actually, I have a blog post on how to do it! But like IanGilham said, it depends on whether you have control over the original PDF. The basic idea is you set up a form on the page and replace the form fields with the text you want. (You can style the form so it doesn't look like a form.)
If you don't have control over the PDF, let me know how to do it!
Here is a link to the full post:
Using a template to programmatically create PDFs with C# and iTextSharp
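That approach boils down to a standard fill-and-flatten pass; a minimal sketch (the field name "lblTitle" is hypothetical):

    // replace a styled form field's content, then flatten it into plain page content
    PdfReader reader = new PdfReader("template.pdf");
    using (FileStream fs = new FileStream("output.pdf", FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(reader, fs);
        stamper.AcroFields.SetField("lblTitle", "My dynamic text");
        stamper.FormFlattening = true;   // bakes the value in; no visible form remains
        stamper.Close();
    }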
I haven't used iTextSharp, but I have been using the PDFNet SDK to explore the content of a large pile of PDFs for localisation over the last few weeks.
I would say that what you require is absolutely achievable, but how difficult it is will depend entirely on how much control you have over the quality of the files. In my case, the files can be constructed from any combination of images, text in any random order, tables, forms, paths, single pixel graphics and scanned pages, some of which are composed from hundreds of smaller images. Let's just say we're having fun with it.
In the PDFTron way of doing things, you would have to implement a viewer (sample available), and add some code over a text selection. Given the complexities of the format, it may be necessary to implement a simple editor in a secondary dialog with the ability to expand the selection to the next line (or whatever other fundamental object is used to make up text). The string could then be edited and applied by copying the entire page of the document into a new page, replacing the selected elements with your new string. You would probably have to do some mathematics to get this to work well though, as just about everything in PDF is located on the page by means of an affine transform.
Good luck. I'm sure there are people on here with some experience of itextsharp and PDF in general.
This question comes up from time to time on the mailing list. The same answer is given time and time again - NO. See this thread for the official answer from the person who created iText.
This question should be a FAQ on the itextsharp tag wiki.