Consolidate the fonts between merged PDFs with iTextSharp (C#)

I need to merge multiple PDFs together. I am using iTextSharp to create all the PDFs. I need to reduce the size of the PDFs to the lowest possible size. I know the fonts are being duplicated for each PDF. Is there a way to use only one set of fonts throughout the merged PDF? For example, pdf1 is 2.8 MB and pdf2 is 2.8 MB; I merge them together and the result is about 5.7 MB. I know for a fact that both of those PDFs are using the same font, but the font data is being duplicated even though it ends up in the same PDF.
I tried setting the compression properties to best compression and enabling full compression, and that barely reduced the size.
However, when I ran the PDF through Acrobat X Pro and optimized it, the size was reduced by 90%+, from about 160 MB to 5 MB. The usage audit says that 90% of the PDF is fonts before optimizing.
Now, is there a way to consolidate the fonts between merged PDFs?

My answer consists of two parts:
You're not telling us how you're merging the PDFs. Let's hope you've read the official documentation and that you're using PdfSmartCopy. If not, you're doing it wrong. PdfSmartCopy examines the content of the different PDFs and reuses possibly redundant objects (such as reused images, XObjects, fonts). Note that there were some bugs in earlier versions of PdfSmartCopy so please make sure you're using the latest version.
If the different PDFs use different subsets of the same font, you're out of luck. iText doesn't merge font subsets. Merging different font subsets would involve rewriting content streams, creating new fonts if we're talking about simple font sets that require more than 256 characters if the subsets are merged, etc...

You could rename the subsets.
For example, if you had
Helvetica (subset)
and
Helvetica (subset)
you would create
Helvetica-1 (subset)
and
Helvetica-2 (subset)
provided they are different implementations (compare the binary streams).

According to
https://turreta.com/2013/12/13/remove-duplicate-fonts-in-pdf-files/
iTextSharp.text.Document tdocument = new iTextSharp.text.Document();
iTextSharp.text.pdf.PdfSmartCopy smart =
    new iTextSharp.text.pdf.PdfSmartCopy(tdocument,
        new FileStream(@"newAddressPath", FileMode.Create));
tdocument.Open();
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(@"yourPDFMergeFile");
// Where the magic happens: PdfSmartCopy reuses identical objects (such as fonts) across the imported pages
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    smart.AddPage(smart.GetImportedPage(reader, i));
}
tdocument.Close();
reader.Close();

Related

Convert iTextSharp.text.pdf.BarcodeQRCode to System.Drawing.Image

Looking for a way to convert iTextSharp.text.pdf.BarcodeQRCode to System.Drawing.Image
This is what I have so far...
public System.Drawing.Image GetQRCode(string content)
{
iTextSharp.text.pdf.BarcodeQRCode qrcode = new iTextSharp.text.pdf.BarcodeQRCode(content, 115, 115, null);
iTextSharp.text.Image img = qrcode.GetImage();
MemoryStream ms = new MemoryStream(img.OriginalData);
return System.Drawing.Image.FromStream(ms);
}
On the third line of the method body, img.OriginalData returns null.
Using img.RawData there instead throws an invalid parameter error on the fourth line (the Image.FromStream call).
I've googled some of the code samples on how to do what you want, and your code (the "OriginalData" approach) is basically the same: https://csharp.hotexamples.com/examples/iTextSharp.text.pdf/BarcodeQRCode/-/php-barcodeqrcode-class-examples.html .
However, I don't see how it could work. From my investigations of BarcodeQRCode#getImage it seems that OriginalData is not set while processing such a barcode, so it will always be null.
More than that, the code you mention belongs to iText 5, which is end of life and no longer maintained (with the exception of significant security fixes), so it's recommended to update to iText 7.
As for iText 7, I do see how to achieve the same in Java, since the barcode classes have a createAwtImage method there. .NET, on the other hand, lacks such functionality, so I'd say that unfortunately one can't do it in .NET.
There are some good reasons for that. iText's Images (and a barcode can easily be converted to an iText Image object, as shown here: https://kb.itextpdf.com/home/it7kb/faq/how-to-generate-2d-barcode-as-vector-image) represent a PDF XObject. In PDF syntax, an image file (jpg, png, etc.) is an XObject with the raw image data stored inside. However, an XObject can also contain PDF syntax content (it is not just used for image files). So to render such content, one needs to process the data from PDF syntax to image syntax, which is not that easy. There are some means in Java's AWT to do so, which is why it's implemented in Java. As for .NET, since there are no out-of-the-box means to convert PDF images to System.Drawing.Image, it was decided not to implement it.
To conclude, there is another iText product, pdfRender, which allows one to convert PDF files (and you could create a page just for a barcode) to images. Perhaps you might want to play with it: https://itextpdf.com/en/products/itext-7/convert-pdf-to-image-pdfrender

Try To Understand ITextSharp

I am trying to build an application that can convert a PDF to Excel with C#.
I have searched for a library to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it is free, but I rarely find any good open source documentation for it.
These are some of the links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
So this is the most common iText code in C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Please tell me whether my understanding is correct, and explain the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance:
gather all the PathRenderInfo objects
remove everything that is not a perfect horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
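The four steps above can be sketched in plain C#, independent of iText. This is only a minimal sketch under stated assumptions: the Segment type, the TableFinder class, the tolerance and the threshold value are all hypothetical names chosen for illustration, and in a real parser the segments would have to be built from the PathRenderInfo data first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical line segment; in practice it would be built from path data.
public struct Segment
{
    public readonly double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2)
    { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }

    public bool IsHorizontal(double tol)
    { return Math.Abs(Y1 - Y2) <= tol && Math.Abs(X1 - X2) > tol; }

    public bool IsVertical(double tol)
    { return Math.Abs(X1 - X2) <= tol && Math.Abs(Y1 - Y2) > tol; }
}

public static class TableFinder
{
    // True when horizontal segment h and vertical segment v touch or cross.
    static bool Crosses(Segment h, Segment v, double tol)
    {
        double hx1 = Math.Min(h.X1, h.X2), hx2 = Math.Max(h.X1, h.X2);
        double vy1 = Math.Min(v.Y1, v.Y2), vy2 = Math.Max(v.Y1, v.Y2);
        return v.X1 >= hx1 - tol && v.X1 <= hx2 + tol
            && h.Y1 >= vy1 - tol && h.Y1 <= vy2 + tol;
    }

    public static List<List<Segment>> FindTables(
        IEnumerable<Segment> segments, int threshold = 3, double tol = 0.5)
    {
        // Drop everything that is not a horizontal or vertical line.
        List<Segment> lines = segments
            .Where(s => s.IsHorizontal(tol) || s.IsVertical(tol)).ToList();

        // Cluster lines that meet at right angles (union-find with path halving).
        int[] parent = Enumerable.Range(0, lines.Count).ToArray();
        Func<int, int> find = null;
        find = i =>
        {
            while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
            return i;
        };

        for (int i = 0; i < lines.Count; i++)
            for (int j = i + 1; j < lines.Count; j++)
            {
                Segment a = lines[i], b = lines[j];
                bool meet =
                    (a.IsHorizontal(tol) && b.IsVertical(tol) && Crosses(a, b, tol)) ||
                    (a.IsVertical(tol) && b.IsHorizontal(tol) && Crosses(b, a, tol));
                if (meet) parent[find(i)] = find(j);
            }

        // A cluster with more lines than the threshold is treated as a table.
        return Enumerable.Range(0, lines.Count)
            .GroupBy(i => find(i))
            .Select(g => g.Select(i => lines[i]).ToList())
            .Where(cluster => cluster.Count > threshold)
            .ToList();
    }
}
```

Each returned cluster only tells you where a table's ruling lines are. Assigning text to cells would still require matching the positions of the text chunks against the cell rectangles formed by those lines.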
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/

How can I tell ghostscript not to rasterize gradients in eps files?

I was searching for a solution that would allow me to read, edit and save .eps files. I found out that Ghostscript can do all of this. The algorithm I need is simple: read several .eps files, concatenate them into one big file and save a new .eps file. I can do that already, but there is a problem: the newly generated files don't preserve gradients. Gradients are rasterized, and shapes that use those gradients are converted to clipping masks. Is there a way to tell Ghostscript not to rasterize gradients in EPS?
I'm using the latest 32-bit version of the Ghostscript library even though my Windows is 64-bit (there were problems running the solution on the 64-bit version of Ghostscript). It's probably not that important, but I'm writing in C# with Ghostscript.NET.
This is the sample code:
using (GhostscriptProcessor processor = new GhostscriptProcessor(lastInstalledVersion, true))
{
List<string> switches = new List<string>();
switches.Add("-o");
switches.Add(@"-sOutputFile=" + outputFile);
switches.Add("-sDEVICE=eps2write");
switches.Add("-dUseCIEColor=true");
switches.Add("-c");
switches.Add("<</Install {0.5 0.5 scale}>> setpagedevice");
switches.Add("-f");
switches.Add(inputFile);
processor.Process(switches.ToArray());
}
The answer to the question you have asked is simple; you can't. The eps2write device is called that for a reason, it only produces level 2 PostScript, and the shfill operator, or type 2 pattern (shading dictionary in PDF) is a level 3 PostScript primitive.
However, there seems to be no good reason to run the existing files through Ghostscript anyway. You say you already have a number of EPS files. The whole point of EPS files is that they can be treated as a 'black box'; you do not need to know what's in them in order to concatenate them, rearrange them, etc.
All you do is write some 'wrapper' PostScript that alters the CTM before including the EPS file in its entirety. You can work out what the arguments to scale and translate should be, because the EPS file will have a %%BoundingBox comment that tells you where it sits in user space. All you need to do is alter the scale, and offset the 0,0 origin (bottom left) using translate.
Note that the eps2write device, because it is limited to producing level 2 PostScript, also does not support some other features of PostScript beyond the original level 2 specification, such as CIDFonts.
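The wrapper approach described above can be sketched in C# as plain text processing, without Ghostscript. This is a simplified illustration rather than a complete EPS embedder: the EpsWrapper class and its method names are hypothetical, it assumes the %%BoundingBox comment holds four integers (not "(atend)"), and it neutralises showpage with a bare save/restore pair instead of the fuller prologue that production code usually uses.

```csharp
using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

public static class EpsWrapper
{
    // Reads "%%BoundingBox: llx lly urx ury" from the EPS source text.
    public static int[] ReadBoundingBox(string eps)
    {
        Match m = Regex.Match(eps,
            @"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)",
            RegexOptions.Multiline);
        if (!m.Success)
            throw new FormatException("No numeric %%BoundingBox comment found.");

        int[] box = new int[4];
        for (int i = 0; i < 4; i++)
            box[i] = int.Parse(m.Groups[i + 1].Value, CultureInfo.InvariantCulture);
        return box;
    }

    // Wraps the EPS so that its lower-left corner lands at (x, y), scaled uniformly.
    public static string Place(string eps, double x, double y, double scale)
    {
        int[] bb = ReadBoundingBox(eps);
        CultureInfo inv = CultureInfo.InvariantCulture;
        StringBuilder ps = new StringBuilder();

        ps.AppendLine("save");                 // isolate whatever the EPS changes
        ps.AppendLine("/showpage {} def");     // the EPS must not eject the page
        ps.AppendLine(string.Format(inv, "{0} {1} translate", x, y));
        ps.AppendLine(string.Format(inv, "{0} {0} scale", scale));
        // Move the bounding box origin (bottom left) to 0,0.
        ps.AppendLine(string.Format(inv, "{0} {1} translate", -bb[0], -bb[1]));
        ps.AppendLine("%%BeginDocument: embedded.eps");
        ps.AppendLine(eps.TrimEnd());          // the EPS itself, untouched
        ps.AppendLine("%%EndDocument");
        ps.AppendLine("restore");
        return ps.ToString();
    }
}
```

Calling EpsWrapper.Place once per input file and concatenating the results (followed by a single showpage) gives one PostScript page containing all the files, with their gradients left exactly as they were in the originals because the EPS content is never reinterpreted.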

How to set CurrentCulture for PDF document ITextsharp c#

Document doc = new Document(iTextSharp.text.PageSize.LETTER.Rotate(), 10, 10, 5, 5);
string nazivPDFa = txt_datumFiskalnogIsecka.Text +" "+ txt_nazivKompanije.Text;
PdfWriter pdf = PdfWriter.GetInstance(doc, new FileStream(nazivPDFa + ".pdf", FileMode.CreateNew));
doc.Open();
Paragraph klijent = new Paragraph(ispisiKlijenta.Text);
PdfPTable tabelaNK = new PdfPTable(1);
PdfPCell kl = new PdfPCell(new Phrase(klijent));
kl.BorderColor = BaseColor.BLACK;
tabelaNK.AddCell(kl);
doc.Add(tabelaNK);
I have created a PDF document with iTextSharp, and when I fill the PDF with text in Serbian, it doesn't show characters like š, ć, č, đ, ž.
Example: I wrote "nešto" and I get "neto".
There are a lot of things in that PDF, and it would take forever to set the current culture on every element.
You aren't using a font when you create your Paragraph. In that case, the Standard Type 1 font Helvetica will be used and it won't be embedded. As Helvetica only supports a limited set of characters, your glyphs won't appear. This is very well documented in the official documentation. It's a pity you try to run before you've learned how to walk.
Several things can be at play.
First, you need to make sure that the encoding of ispisiKlijenta.Text is correct. For instance, is that string in CP1250 or in Unicode? When you write KlijenT.Add("Tekući račun: " +txt_brRacunaKompanije.Text);, you are writing bad code (at least if you were writing Java) because you introduce special characters in your code that may disappear when the code is compiled or executed using a different environment using a different encoding.
Then, you need to provide a font program that knows how to draw the glyphs you need. For instance: Helvetica doesn't know about CP1250, but arial.ttf does (and so do many other fonts, but you need to check first).
Then, you need to decide how you'll use that font. Will you embed the font as a simple font, as is done in the EncodingExample where we create this PDF, or will you embed the font as a composite font, as is done in the UnicodeExample where we create this PDF? Both PDFs may look identical to you, but they aren't. The choice you make will have an impact on the design of your application.
Once you've made a decision about the font and once you've created a Font object, e.g. named font, you need to use that object when creating a Paragraph.
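As a rough illustration of those steps with iTextSharp 5, the sketch below embeds a TrueType font as a composite font and passes the resulting Font object to the Phrase that goes into the cell. The class name, the method name and the fontPath parameter are assumptions made for the example (on Windows the path might be something like C:\Windows\Fonts\arial.ttf), and you should check that the font you pick actually contains the glyphs you need.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

public static class SerbianTextExample
{
    // fontPath is an assumption: any TrueType font that contains š, ć, č, đ, ž.
    public static void CreatePdf(string outputPath, string fontPath, string text)
    {
        // Composite (Unicode) font, embedded in the PDF.
        // For a simple font, BaseFont.CP1250 could be passed instead of IDENTITY_H.
        BaseFont baseFont = BaseFont.CreateFont(
            fontPath, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        Font font = new Font(baseFont, 12);

        Document doc = new Document(PageSize.LETTER.Rotate(), 10, 10, 5, 5);
        PdfWriter.GetInstance(doc, new FileStream(outputPath, FileMode.Create));
        doc.Open();

        // The font has to be handed to every text element explicitly;
        // otherwise the default, non-embedded Helvetica is used again.
        PdfPTable table = new PdfPTable(1);
        PdfPCell cell = new PdfPCell(new Phrase(text, font));
        cell.BorderColor = BaseColor.BLACK;
        table.AddCell(cell);
        doc.Add(table);

        doc.Close(); // also closes the writer and the underlying stream
    }
}
```

Because the font is passed per element, one practical design choice is to create the Font object once and reuse it for every Phrase and Paragraph in the document, rather than looking for a document-wide culture setting.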

itext - pdf to html

I have spent about 20 hours coding to produce invoices using iText in C#.
Now I want to use the same code to transform some of the tables to HTML.
Do you know if I can do this?
For instance, I have this:
PdfPTable table = new PdfPTable(3);
table.DefaultCell.Border = 0;
table.DefaultCell.Padding = 3;
table.WidthPercentage = 100;
int[] widths = { 100, 200, 100};
table.SetWidths(widths);
List listOfCompanyData = (List)getCompanyData();
List listOfCumparatorDreaptaData = (List)getCumparatorDreaptaData(proformaInvoice.getCumparatorDreapta());
table.AddCell((Phrase)listOfCompanyData.Items[0]);
table.AddCell("");
table.AddCell((Phrase)listOfCumparatorDreaptaData.Items[0]);
and I want to transform this table into HTML...
Is it possible?
PDF and HTML are fundamentally different display technologies. PDF is much more complex than HTML is, which is why you find so many HTML to PDF converters. The other way around is much more difficult.
iText can only do it from HTML to PDF.
There are online converters that will take a PDF and convert it to HTML. There are also downloadable utilities.
I am not aware of any .NET library that will do this.
PDF is almost a write-only format. Any time your workflow calls for "get the data out of a PDF", you've probably screwed up.
Having said that, there are several ways to stash data within a PDF:
Form fields have no particular length limit and need not be visible. Getting form data with iText is trivial.
You can attach a file to a PDF and suck it out later, both with iText.
DocInfo fields. You can stuff a string into one of the author/title/keywords/etc metadata fields. An ugly hack, but effective.
XML metadata. The "new-fangled" metadata is stored in an XML schema. You can put pretty much whatever you want in there... though iText regenerates some of it every time it makes changes (mod date and such).
Custom keys/values. You can tack any old key/value pairs you like into any old dictionary within a PDF. Adobe would like you to register a company-specific prefix for your custom tags to avoid collisions, but I've never felt the need.
From the book iText in Action it seems that this is doable using the original Java library, but it seems it is no longer ported in the C# lib. I'm pretty sure it was in version 4 :-/
Try looking at some old source here: http://www.koders.com/csharp/fid60B0985D3A89152128B73F54EDD4EB5420A5E4D8.aspx?s=%22Ken+Auer%22
nFOP + XSLT + XML = pdf | doc | HTML
nfop.sourceforge.net/article.html should give you an idea of how to use it. You need the "Microsoft Visual J# .NET Redistributable Package" to run nFOP.
open source no cost :)
K
