I was wondering if I can compress / change the quality of my output PDF file with iTextSharp and C#, like I can with Adobe Acrobat Pro or PDF24 Creator.
Using PDF24 Creator I can open the PDF, save the file again and set the "Quality of the PDF" to "Low Quality", and my file size decreases from 88.6 MB to 12.5 MB while the quality is still good enough.
I am already using the
writer = new PdfCopy(doc, fs);
writer.SetPdfVersion(PdfCopy.PDF_VERSION_1_7);
writer.CompressionLevel = PdfStream.BEST_COMPRESSION;
writer.SetFullCompression();
which decreases the file size from about 92MB to 88MB.
Alternatively: can I run the PDF24 program from my C# code using command-line arguments or start parameters? Something like this:
pdf24Creator.exe -save -Quality:low -inputfile -outputfile
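(For what it's worth, if PDF24 or any other converter did expose switches like these, launching it from C# would look roughly like the snippet below; the flags are the hypothetical ones above, not documented pdf24Creator.exe options, and the install path is a guess.)
using System.Diagnostics;

// Hypothetical call: the switches are the ones imagined above, not
// documented pdf24Creator.exe options, and the install path is a guess.
var psi = new ProcessStartInfo
{
    FileName = @"C:\Program Files\PDF24\pdf24Creator.exe",
    Arguments = "-save -Quality:low \"C:\\temp\\input.pdf\" \"C:\\temp\\output.pdf\"",
    UseShellExecute = false,
    CreateNoWindow = true
};

using (var process = Process.Start(psi))
{
    process.WaitForExit();
}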
Thanks for your help (Bruno)!
Short answer: no.
Long answer: yes but you must do a lot of the work yourself.
If you read the third and fourth paragraphs here you'll hopefully get a better understanding of what "compression" actually means from a PDF perspective.
Programs like Adobe Acrobat and PDF24 Creator allow you to reduce the size of a file by destroying the data within the PDF. When you select a low quality setting one of the most common changes these programs make is to actually extract all of the images, reduce their quality and replace the original files in the PDF. So a JPEG originally saved without any compression might be knocked down to 60% quality. And just to be clear, that 60% is non-reversible, it isn't zipping the file, it is literally destroying the data in order to save space.
Another setting is to reduce the effective DPI of an image. A 500 pixel wide image placed into a 2 inch wide box is effectively 250 DPI. These programs will extract the image, reduce it to maybe 96 or 72 DPI, which means the 500 pixel image will be reduced to 192 or 144 pixels in width, and replace the original file in the PDF. Once again, this is a destructive, non-reversible change.
(And by destructive non-reversible, you still probably have the original file, I just want to be clear that this isn't true "compression" like ZIP.)
However, if you really want to do it you can look at code like this which shows how you can use iText to perform the extraction and re-insertion of images. It is 100% up to you, however, to change the images because iText won't make destructive changes to your data (and that's a good thing I'd say!)
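To give an idea of what that involves, here is a rough iTextSharp sketch that walks the document's image XObjects, re-encodes each one as a lower-quality JPEG and writes the result back. Treat it as an outline of the approach in the linked example rather than production code: it assumes plain RGB/gray images and ignores masks, unusual colour spaces and filters it cannot decode.
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static void ReduceImageQuality(string src, string dest, long jpegQuality)
{
    PdfReader reader = new PdfReader(src);

    for (int i = 1; i < reader.XrefSize; i++)
    {
        PdfObject obj = reader.GetPdfObject(i);
        if (obj == null || !obj.IsStream())
            continue;

        PRStream stream = (PRStream)obj;
        if (!PdfName.IMAGE.Equals(stream.GetAsName(PdfName.SUBTYPE)))
            continue;

        // decode the image, then re-encode it as a lower-quality JPEG
        var pdfImage = new PdfImageObject(stream);
        using (var img = pdfImage.GetDrawingImage())
        using (var ms = new MemoryStream())
        {
            var jpegCodec = ImageCodecInfo.GetImageEncoders()
                .First(c => c.FormatID == ImageFormat.Jpeg.Guid);
            var encoderParams = new EncoderParameters(1);
            encoderParams.Param[0] = new EncoderParameter(
                System.Drawing.Imaging.Encoder.Quality, jpegQuality);
            img.Save(ms, jpegCodec, encoderParams);

            // replace the stream contents and describe the new encoding
            stream.Clear();
            stream.SetData(ms.ToArray(), false, PRStream.NO_COMPRESSION);
            stream.Put(PdfName.TYPE, PdfName.XOBJECT);
            stream.Put(PdfName.SUBTYPE, PdfName.IMAGE);
            stream.Put(PdfName.FILTER, PdfName.DCTDECODE);
            stream.Put(PdfName.WIDTH, new PdfNumber(img.Width));
            stream.Put(PdfName.HEIGHT, new PdfNumber(img.Height));
            stream.Put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
            stream.Put(PdfName.COLORSPACE, PdfName.DEVICERGB);
        }
    }

    using (var stamper = new PdfStamper(reader, new FileStream(dest, FileMode.Create)))
    {
        stamper.SetFullCompression();
    }
    reader.Close();
}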
Related
I have a very long PDF file (58 x 500 inches). The goal is to divide one large vector PDF file by a certain percentage. For example, 25% = 125 inches in height while the width stays the same. So one large PDF will be divided into 4 pages.
ImageMagick was able to do this, but it crashes if I change the DPI to 300. Is it possible to do this with Ghostscript? I am currently using Ghostscript.NET and C#.
Can someone point me in the right direction?
I mentioned netvips in a comment -- it will do progressive PDF rendering (it uses poppler rather than ghostscript), so you can load the whole page at 300 DPI and write it out as four huge raster files.
I don't actually have C# on this laptop, but here's what you'd do in Python. The C# code would be almost the same.
import sys
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1], dpi=300, access="sequential")

n_pages = 4
for n in range(n_pages):
    filename = f"page-{n}.tif"
    print(f"rendering {filename} ...")
    y = int(n * image.height / n_pages)
    page_height = int(min(image.height / n_pages, image.height - y))
    page = image.crop(0, y, image.width, page_height)
    page.write_to_file(filename)
The access="sequential" puts libvips into sequential mode -- pixels will only be computed on demand from the final write operation. You should be able to render your 200,000 pixel high image using only a modest amount of memory.
You don't need to use tif of course; jpg might be more sensible, and if this is for printing few people will notice the difference.
As everyone said, it would be better to keep as a vector format for as long as you can.
See this previous answer of mine. It demonstrates how to render a portion of the original input file to a bitmap. I'd suggest you use the exact same technique, but use the pdfwrite device instead of the png16m device, so that you get a PDF file as the output, thus maintaining the vector nature of the input.
So to paraphrase the answer there, this:
gs -dDEVICEWIDTHPOINTS=72 -dDEVICEHEIGHTPOINTS=144 -dFIXEDMEDIA -r300 -sDEVICE=pdfwrite -o out.pdf -c "<</PageOffset [-180 -108]>> setpagedevice" -f input.pdf
will create a 'window' 1 inch wide by 2 inches high, starting 2.5 inches from the left of the original and 1.5 inches up from the bottom. It then runs the input; every part of the content which lies within that window is preserved, and everything which lies outside it is dropped.
You'd need to do that multiple times, once for each section you want.
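If you want to drive this from C# with the Ghostscript.NET wrapper you mentioned, something along these lines should do it (a sketch: it assumes the 58 x 500 inch page is split into four 125-inch bands, and that the wrapper's GhostscriptProcessor.Process(string[]) overload is available):
using Ghostscript.NET.Processor;

public static void SplitIntoBands(string inputPath)
{
    const int widthPoints = 58 * 72;     // 58 inches wide
    const int bandPoints = 125 * 72;     // each band is 125 inches high
    const int bandCount = 4;

    for (int band = 0; band < bandCount; band++)
    {
        // PDF origin is bottom-left, so band 0 is the bottom of the original
        int offsetY = band * bandPoints;

        string[] args =
        {
            "gs",                                    // placeholder program name; gs ignores argv[0]
            "-dBATCH", "-dNOPAUSE", "-dQUIET",
            "-sDEVICE=pdfwrite",
            $"-dDEVICEWIDTHPOINTS={widthPoints}",
            $"-dDEVICEHEIGHTPOINTS={bandPoints}",
            "-dFIXEDMEDIA",
            $"-sOutputFile=band-{band}.pdf",
            "-c", $"<</PageOffset [0 -{offsetY}]>> setpagedevice",
            "-f", inputPath
        };

        using (var processor = new GhostscriptProcessor())
        {
            processor.Process(args);
        }
    }
}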
I should mention that Ghostscript itself is perfectly capable of rendering the entire PDF file to a bitmap in one go. For very large output files it uses a display-list approach: it creates a (simplified) representation of the original input and runs that description multiple times, each time rendering one horizontal band of the final output before moving down to the next band, and so on.
In my opinion, it's likely that the 300 dpi limit in your original experiment comes from ImageMagick rather than Ghostscript. I know that Ghostscript is able to render input which is several metres in each dimension at 1200 dpi or more, though it does, of course, take a long time to produce the gigabytes of data.
I am trying to develop an application for image processing.
Here is my complete code in DotNetFiddle.
I have tested my application with different images from the Internet:
Cameraman is GIF.
Baboon is PNG.
Butterfly is PNG.
Pheasant is JPG.
Butterfly and Pheasant are re-sized to 300x300.
The following two images show correct Fourier and Inverse Fourier spectrum:
The following two images do not show the expected outcome:
What could be the reason?
Is there any problem with the latter two images?
Do we need to use images of specific quality to test Image-processing applications?
The code you linked to is a radix-2 FFT implementation, which works only for images whose dimensions are exact powers of 2.
Incidentally, the Cameraman image is 256 x 256 (powers of 2) and the Baboon image is 512 x 512 (again powers of 2). The other two images, being resized to 300 x 300 are not powers of 2. After resizing those images to an exact power of 2 (for example 256 or 512), the output of FrequencyPlot for the brightness component of the last two images should look somewhat like the following:
butterfly
pheasant
A common workaround for images of other sizes is to pad the image to sizes that are exact powers of 2. Otherwise, if you must process arbitrary sized images, you should consider other 2D discrete Fourier transform (DFT) algorithms or libraries which will often support sizes that are the product of small primes.
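If you go the padding route, the idea is simply to copy the pixels onto a black (zero) canvas whose sides are the next powers of two; a minimal C# sketch, assuming you are working with System.Drawing bitmaps:
using System.Drawing;
using System.Drawing.Imaging;

static class FftPadding
{
    // smallest power of two >= n (e.g. 300 -> 512)
    static int NextPowerOfTwo(int n)
    {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Copies the source onto a black canvas whose width and height are powers
    // of two, so a radix-2 FFT can handle it. The extra area is zero-padding.
    public static Bitmap PadToPowerOfTwo(Bitmap source)
    {
        int w = NextPowerOfTwo(source.Width);
        int h = NextPowerOfTwo(source.Height);

        var padded = new Bitmap(w, h, PixelFormat.Format24bppRgb);
        using (var g = Graphics.FromImage(padded))
        {
            g.Clear(Color.Black);
            g.DrawImageUnscaled(source, 0, 0);
        }
        return padded;
    }
}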
Note that for the purpose of validating your output, you also have option to use the direct DFT formula (though you should not expect the same performance).
I haven't got time to dig through your code. Like I said in my comments, you should focus on the difference between those images.
There is no reason why you should be able to calculate the FFT of one image and fail for another, unless you have some problem in your code that can't handle some difference between those images. If you can display them, you should be able to process them.
The first thing that catches my eye is that both images you succeed with have even dimensions, while the images your algorithm produces garbage for have at least one odd dimension. I won't look into it any further, as from experience I'm pretty confident that this causes your issue.
So before you do anything else:
Take one of the images that works fine, remove one line or row, and see whether you still get a good result. Then fix your code.
I have been searching a lot on Google about how to compress an existing PDF (reduce its size).
My problem is:
I can't use any application, because it needs to be done by a C# program.
I can't use any paid library as my clients don't want to go over budget. So a PAID library is certainly a NO.
I did my homework for the last 2 days and came upon a solution using iTextSharp and BitMiracle, but to no avail, as the former decreases the file size by just 1% and the latter is paid.
I also came across PDFcompressNET and pdftk, but I wasn't able to find their .dll files.
The PDF is an insurance policy with 2-3 images (black and white) and around 70 pages, amounting to a size of 5 MB.
I need the output in PDF only (it can't be in any other format).
Here's an approach to do this (and this should work without regard to the toolkit you use):
If you have a 24-bit RGB or 32-bit CMYK image, do the following:
determine if the image is really what it claims to be. If it's CMYK, convert to RGB. If it's RGB and really gray, convert to gray. If it's gray or paletted and only has 2 real colors, convert to 1-bit. If it's gray and there is relatively little in the way of gray variations, consider converting to 1-bit with a suitable binarization technique.
measure the image dimensions in relation to how it is being placed on the page - if it's 300 dpi or greater, consider resampling the image to a smaller size depending on the bit depth of the image - for example, you can probably go from 300 dpi gray or RGB to 200 dpi and not lose too much detail (see the resampling sketch after this list).
if you have an RGB image that is really color, consider palettizing it.
examine the contents of the image to see if you can help make it more compressible. For example, if you run through a color/gray image and find a lot of colors that cluster, consider smoothing them. If it's gray or black and white and contains a number of specks, consider despeckling.
choose your final compression wisely. JPEG2000 can do better than JPEG. JBIG2 does much better than G4. Flate is probably the best non-destructive compression for gray. Most implementations of JPEG2000 and JBIG2 are not free.
if you're a rock star, you want to try to segment the image and break it into areas that are really black and white and really color.
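To make the resampling step concrete, here is a small System.Drawing sketch that scales an extracted image down from its current effective DPI to a target DPI; extracting the image from the PDF and putting it back are toolkit-specific and not shown:
using System;
using System.Drawing;
using System.Drawing.Drawing2D;

// Resample an extracted page image from its current effective DPI down to a
// target DPI (e.g. 300 -> 200).
static Bitmap Resample(Bitmap source, double currentDpi, double targetDpi)
{
    double scale = targetDpi / currentDpi;
    int newWidth = Math.Max(1, (int)(source.Width * scale));
    int newHeight = Math.Max(1, (int)(source.Height * scale));

    var result = new Bitmap(newWidth, newHeight);
    using (var g = Graphics.FromImage(result))
    {
        g.InterpolationMode = InterpolationMode.HighQualityBicubic;
        g.DrawImage(source, 0, 0, newWidth, newHeight);
    }
    return result;
}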
That said, if you can do all of this well in an unsupervised manner, you have a commercial product in its own right.
I will say that you can do most of this with Atalasoft dotImage (disclaimers: it's not free; I work there; I've written nearly all the PDF tools; I used to work on Acrobat).
One particular way to do that with dotImage is to pull out all the pages that are image-only, recompress them and save them out to a new PDF, then build a new PDF by taking all the pages from the original document and replacing them with the recompressed pages, then saving again. It's not that hard.
List<int> pagesToReplace = new List<int>();
PdfImageCollection pagesToEncode = new PdfImageCollection();

using (Document doc = new Document(sourceStream, password)) {
    for (int i = 0; i < doc.Pages.Count; i++) {
        Page page = doc.Pages[i];
        if (page.SingleImageOnly) {
            pagesToReplace.Add(i);
            // a PDF image encapsulates an image and its compression parameters
            PdfImage image = ProcessImage(sourceStream, doc, page, i);
            pagesToEncode.Add(image);
        }
    }

    PdfEncoder encoder = new PdfEncoder();
    encoder.Save(tempOutStream, pagesToEncode, null); // save the re-encoded pages
}

tempOutStream.Seek(0, SeekOrigin.Begin);
sourceStream.Seek(0, SeekOrigin.Begin);

PdfDocument finalDoc = new PdfDocument(sourceStream, password);
PdfDocument replacementPages = new PdfDocument(tempOutStream);

for (int i = 0; i < pagesToReplace.Count; i++) {
    finalDoc.Pages[pagesToReplace[i]] = replacementPages.Pages[i];
}

finalDoc.Save(finalOutputStream);
What's missing here is ProcessImage(). ProcessImage will either rasterize the page (in which case you don't need to worry about whether the image was scaled to fit the PDF) or extract the image (and track the transformation matrix applied to it), and then go through the steps listed above. This is non-trivial, but it's doable.
I think you might want to make your clients aware that any of the libraries you mentioned is not completely free:
iTextSharp is AGPL-licensed, so you must release source code of your solution or buy a commercial license.
PDFcompressNET is a commercial library.
pdftk is GPL-licensed, so you must release source code of your solution or buy a commercial license.
Docotic.Pdf is a commercial library.
Given all of the above, I assume I can drop the freeware requirement.
Docotic.Pdf can reduce size of compressed and uncompressed PDFs to different degrees without introducing any destructive changes.
Gains depend on the size and structure of a PDF: for small files or files that are mostly scanned images the reduction might not be that great, so you should try the library with your files and see for yourself.
If you are most concerned about size, there are many images in your files, and you are fine with losing some of the quality of those images, then you can easily recompress existing images using Docotic.Pdf.
Here is the code that makes all images bilevel and compressed with fax compression:
static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();

        doc.Save(outputName);
    }
}
There are also RecompressWithFlate, RecompressWithGroup3Fax and RecompressWithJpeg methods.
The library will convert color images to bilevel ones if needed. You can specify deflate compression level, JPEG quality etc.
Docotic.Pdf can also resize big images (and recompress them at the same time) in a PDF. This might be useful if images in a document are actually bigger than needed or if the quality of the images is not that important.
Below is code that scales all images whose width or height is greater than or equal to 256. Scaled images are then encoded using JPEG compression.
public static void RecompressToJpeg(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            // images that are used as masks or that have an attached mask
            // are not good candidates for recompression
            if (!image.IsMask && image.Mask == null && (image.Width >= 256 || image.Height >= 256))
                image.Scale(0.5, PdfImageCompression.Jpeg, 65);
        }

        doc.Save(outputPath);
    }
}
Images can be resized to a specified width and height using one of the ResizeTo methods. Please note that the ResizeTo method won't try to preserve the aspect ratio of images; you should calculate the proper width and height yourself.
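For example, a small helper like the following (plain C#, independent of Docotic.Pdf) computes a width and height that fit a bounding box while preserving the aspect ratio; the results can then be passed to one of the ResizeTo overloads:
using System;

// Fit an image into a maxWidth x maxHeight box while preserving its aspect ratio;
// pass the resulting values to one of the ResizeTo overloads.
static void FitWithin(int width, int height, int maxWidth, int maxHeight,
                      out int newWidth, out int newHeight)
{
    double scale = Math.Min((double)maxWidth / width, (double)maxHeight / height);
    newWidth = Math.Max(1, (int)Math.Round(width * scale));
    newHeight = Math.Max(1, (int)Math.Round(height * scale));
}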
Disclaimer: I work for Bit Miracle.
Using PdfSharp
public static void CompressPdf(string targetPath)
{
    using (var stream = new MemoryStream(File.ReadAllBytes(targetPath)) { Position = 0 })
    using (var source = PdfReader.Open(stream, PdfDocumentOpenMode.Import))
    using (var document = new PdfDocument())
    {
        var options = document.Options;
        options.FlateEncodeMode = PdfFlateEncodeMode.BestCompression;
        options.UseFlateDecoderForJpegImages = PdfUseFlateDecoderForJpegImages.Automatic;
        options.CompressContentStreams = true;
        options.NoCompression = false;

        foreach (var page in source.Pages)
        {
            document.AddPage(page);
        }

        document.Save(targetPath);
    }
}
Ghostscript is AGPL-licensed software that can compress PDFs. There is also an AGPL-licensed C# wrapper for it on GitHub here.
You could use the GhostscriptProcessor class from that wrapper to pass custom commands to Ghostscript, like the ones found in this AskUbuntu answer describing PDF compression.
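As a sketch, the call might look roughly like this (assuming the wrapper exposes a GhostscriptProcessor.Process(string[]) overload; -dPDFSETTINGS=/screen is Ghostscript's low-quality preset, while /ebook and /printer keep more detail):
using Ghostscript.NET.Processor;

public static void CompressWithGhostscript(string inputPath, string outputPath)
{
    string[] args =
    {
        "gs",                          // placeholder program name; gs ignores argv[0]
        "-dBATCH", "-dNOPAUSE", "-dQUIET",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/screen",       // low-quality preset; /ebook or /printer keep more detail
        $"-sOutputFile={outputPath}",
        inputPath
    };

    using (var processor = new GhostscriptProcessor())
    {
        processor.Process(args);
    }
}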
I need to create a huge image (approx. 24000 x 22000) with PixelFormat.Format24bppRgb encoding. I know it will be nearly impossible to open it...
What I'm trying to do is this:
Bitmap final = new Bitmap(width, height, PixelFormat.Format24bppRgb);
As expected, an exception is thrown, as I can't easily handle an 11 GB file in memory that way.
But I had an idea: could I write the file as I'm generating it? So, instead of working on RAM, I would be working on the HD.
Just to explain better: I have about 13K tiles and I plan to stitch them together into this stupidly humongous file. As I can iterate them in a given order, I think I could write it down directly to memory using unsafe code.
Any suggestions?
ImageMagick's Large Image Support (tera-pixel) can help you put the image together once you have the tiles that compose it. You can either use the command line and issue commands to it through this wrapper, or use ImageMagick.NET as an API.
You could write it in a non-compressed format like BMP. BMP saves raw colour bytes in rows, so you would load the first row of tiles, read their separate pixel rows and write them as a single composite row in the output image. This way, you only need a few tiles open at a time and can immediately write down the output image.
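For illustration, here is a rough sketch of streaming a 24-bit BMP to disk row by row; the FetchCompositeRow helper is hypothetical and stands for whatever code assembles one finished output row from the tiles that intersect it:
using System;
using System.IO;

// Streams a large 24-bit BMP to disk one row at a time, so the full mosaic
// never has to fit in memory. FetchCompositeRow is hypothetical: it must
// return one finished row of BGR pixel data (width * 3 bytes).
static void WriteHugeBmp(string path, int width, int height,
                         Func<int, byte[]> FetchCompositeRow)
{
    int rowSize = ((width * 3 + 3) / 4) * 4;        // BMP rows are padded to 4-byte boundaries
    long pixelDataSize = (long)rowSize * height;
    long fileSize = 54 + pixelDataSize;             // 14-byte file header + 40-byte info header

    using (var fs = new FileStream(path, FileMode.Create, FileAccess.Write))
    using (var bw = new BinaryWriter(fs))
    {
        // BITMAPFILEHEADER
        bw.Write((byte)'B'); bw.Write((byte)'M');
        bw.Write((uint)fileSize);
        bw.Write((uint)0);                          // reserved
        bw.Write((uint)54);                         // offset to pixel data

        // BITMAPINFOHEADER
        bw.Write((uint)40);
        bw.Write(width);
        bw.Write(height);                           // positive height = bottom-up row order
        bw.Write((ushort)1);                        // colour planes
        bw.Write((ushort)24);                       // bits per pixel
        bw.Write((uint)0);                          // BI_RGB, no compression
        bw.Write((uint)pixelDataSize);
        bw.Write(2835); bw.Write(2835);             // ~72 DPI in pixels per metre
        bw.Write((uint)0); bw.Write((uint)0);       // palette entries

        byte[] padding = new byte[rowSize - width * 3];

        // BMP stores rows bottom-up, so start from the last image row
        for (int y = height - 1; y >= 0; y--)
        {
            byte[] row = FetchCompositeRow(y);      // width * 3 bytes of BGR data
            bw.Write(row);
            bw.Write(padding);
        }
    }
}
Note that the standard BMP headers use 32-bit size fields, so the file has to stay under 4 GB; 24000 x 22000 x 3 bytes is roughly 1.5 GB, which fits.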
I don't know how to write it as a compressed image like JPG or PNG, but I'm sure some specialised software exists for that.
Depending on what you intend to do with this image upon completion, I would suggest dividing it into four parts and working with it that way. I have worked with 10,000 x 10,000 pixels without an OOM exception being thrown.
I'm currently attempting to generate thumbnails of PDFs using Ghostscript (or more specifically GhostscriptSharp, the C# wrapper version) and have run into some issues with the image quality that is being output.
Using the following method:
GeneratePageThumbs(string inputPath, string outputPath, int firstPage, int lastPage, int width, int height)
and changing the width and height to smaller numbers, I can generate thumbnails roughly the size I am looking for; for example, a height of 12 and a width of 8 will generate a set of thumbnails of 102 x 88 pixels.
Ideally, I am trying to generate thumbnails with a size of 100 x 80 that look reasonably good when rendered as HTML (in an image tag), so that the reader can get a decent idea of what they are looking at from a thumbnail (as it is currently completely unreadable).
These are the current settings (from the C# wrapper):
private static readonly string[] ARGS = new string[] {
    // Keep gs from writing information to standard output
    "-q",
    "-dQUIET",

    "-dPARANOIDSAFER",         // Run this command in safe mode
    "-dBATCH",                 // Keep gs from going into interactive mode
    "-dNOPAUSE",               // Do not prompt and pause for each page
    "-dNOPROMPT",              // Disable prompts for user interaction
    "-dMaxBitmap=500000000",   // Set high for better performance
    "-dNumRenderingThreads=4", // Multi-core, come-on!

    // Configure the output anti-aliasing, resolution, etc
    "-dAlignToPixels=0",
    "-dGridFitTT=0",
    "-dTextAlphaBits=4",
    "-dGraphicsAlphaBits=4"
};
However, I am not very familiar with GhostscriptSharp and its settings, so I don't know how to strike a balance between size and quality. I wouldn't be opposed to creating larger images and scaling them down for the thumbnails, although I would prefer to get the thumbnails to work directly if possible.
Without seeing the original documents I can't be sure, but it seems unlikely to me that 102x88 pixels is going to be sufficient to create readable text.
The TextAlphaBits value is probably too large for this size; all you will get is a blur. Try not setting TextAlphaBits at all. NumRenderingThreads won't do anything useful with a page this small (though it won't do any harm either).
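If the text still isn't legible at 100 x 80, rendering the page several times larger and then scaling it down (as you suggested) usually looks better than rendering straight to thumbnail size; the downscaling step is straightforward with System.Drawing:
using System.Drawing;
using System.Drawing.Drawing2D;

// Downscale a larger rendered page image (say 400 x 320) to the final thumbnail size.
static Bitmap MakeThumbnail(Image rendered, int width, int height)
{
    var thumb = new Bitmap(width, height);
    using (var g = Graphics.FromImage(thumb))
    {
        g.InterpolationMode = InterpolationMode.HighQualityBicubic;
        g.PixelOffsetMode = PixelOffsetMode.HighQuality;
        g.DrawImage(rendered, 0, 0, width, height);
    }
    return thumb;
}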