I have been searching a lot on Google about how to compress the size of an existing PDF.
My problem is:
I can't use any application, because it needs to be done by a C# program.
I can't use any paid library, as my clients don't want to go over budget, so a paid library is certainly a no.
I did my homework for the last 2 days and came upon solutions using iTextSharp and BitMiracle, but to no avail: the former reduces a file by just 1% and the latter is paid.
I also came across PDFcompressNET and pdftk, but I wasn't able to find their .dll.
The PDF is actually an insurance policy with 2-3 images (black and white) and around 70 pages, amounting to a size of 5 MB.
I need the output in PDF only (it can't be in any other format).
Here's an approach to do this (and it should work regardless of the toolkit you use):
If you have a 24-bit RGB or 32-bit CMYK image, do the following:
determine if the image is really what it claims to be. If it's CMYK, convert it to RGB. If it's RGB and really gray, convert it to gray. If it's gray or paletted and only has 2 real colors, convert it to 1-bit. If it's gray and there is relatively little in the way of gray variation, consider converting it to 1-bit with a suitable binarization technique (see the sketch after this list).
measure the image dimensions in relation to how it is being placed on the page - if it's 300 dpi or greater, consider resampling the image to a smaller size depending on its bit depth - for example, you can probably go from 300 dpi gray or RGB to 200 dpi and not lose too much detail.
if you have an RGB image that is really color, consider palettizing it.
examine the contents of the image to see if you can help make it more compressible. For example, if you run through a color/gray image and find a lot of colors that cluster, consider smoothing them. If it's gray or black and white and contains a number of specks, consider despeckling.
choose your final compression wisely. JPEG2000 can do better than JPEG. JBIG2 does much better than G4. Flate is probably the best non-destructive compression for gray. Most implementations of JPEG2000 and JBIG2 are not free.
if you're a rock star, you want to try to segment the image and break it into areas that are really black and white and areas that are really color.
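As a rough illustration of the first analysis step, here is a minimal sketch (plain System.Drawing, independent of any PDF toolkit) that checks whether a decoded RGB bitmap is effectively grayscale or effectively two-color. The tolerance value and the slow per-pixel GetPixel loop are simplifications you would tune and optimize for real documents.

using System;
using System.Collections.Generic;
using System.Drawing;

static class ImageAnalysis
{
    // True if every pixel is (nearly) neutral, i.e. R == G == B within a tolerance,
    // which makes the image a candidate for conversion to grayscale.
    public static bool IsEffectivelyGray(Bitmap bmp, int tolerance = 8)
    {
        for (int y = 0; y < bmp.Height; y++)
        {
            for (int x = 0; x < bmp.Width; x++)
            {
                Color c = bmp.GetPixel(x, y);
                int max = Math.Max(c.R, Math.Max(c.G, c.B));
                int min = Math.Min(c.R, Math.Min(c.G, c.B));
                if (max - min > tolerance)
                    return false;
            }
        }
        return true;
    }

    // True if the image uses at most two distinct colors,
    // which makes it a candidate for conversion to 1-bit.
    public static bool IsEffectivelyBilevel(Bitmap bmp)
    {
        var seen = new HashSet<int>();
        for (int y = 0; y < bmp.Height; y++)
        {
            for (int x = 0; x < bmp.Width; x++)
            {
                seen.Add(bmp.GetPixel(x, y).ToArgb());
                if (seen.Count > 2)
                    return false;
            }
        }
        return true;
    }
}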
That said, if you can do all of this well in an unsupervised manner, you have a commercial product in its own right.
I will say that you can do most of this with Atalasoft dotImage (disclaimers: it's not free; I work there; I've written nearly all the PDF tools; I used to work on Acrobat).
One particular way to do that with dotImage is to pull out all the pages that are image-only, recompress them, and save them out to a new PDF, then build a new PDF by taking all the pages from the original document and replacing them with the recompressed pages, then saving again. It's not that hard.
List<int> pagesToReplace = new List<int>();
PdfImageCollection pagesToEncode = new PdfImageCollection();

using (Document doc = new Document(sourceStream, password)) {
    for (int i = 0; i < doc.Pages.Count; i++) {
        Page page = doc.Pages[i];
        if (page.SingleImageOnly) {
            pagesToReplace.Add(i);
            // a PdfImage encapsulates an image and its compression parameters
            PdfImage image = ProcessImage(sourceStream, doc, page, i);
            pagesToEncode.Add(image);
        }
    }

    PdfEncoder encoder = new PdfEncoder();
    encoder.Save(tempOutStream, pagesToEncode, null); // re-encoded pages

    tempOutStream.Seek(0, SeekOrigin.Begin);
    sourceStream.Seek(0, SeekOrigin.Begin);

    // stitch the re-encoded pages back into the original document
    PdfDocument finalDoc = new PdfDocument(sourceStream, password);
    PdfDocument replacementPages = new PdfDocument(tempOutStream);

    for (int i = 0; i < pagesToReplace.Count; i++) {
        finalDoc.Pages[pagesToReplace[i]] = replacementPages.Pages[i];
    }

    finalDoc.Save(finalOutputStream);
}
What's missing here is ProcessImage(). ProcessImage() will either rasterize the page (in which case you wouldn't need to know that the image might have been scaled to fit on the PDF page) or extract the image (and track the transformation matrix applied to it), and then go through the steps listed above. This is non-trivial, but it's doable.
I think you might want to make your clients aware that none of the libraries you mentioned is completely free:
iTextSharp is AGPL-licensed, so you must release source code of your solution or buy a commercial license.
PDFcompressNET is a commercial library.
pdftk is GPL-licensed, so you must release source code of your solution or buy a commercial license.
Docotic.Pdf is a commercial library.
Given all of the above, I assume I can drop the freeware requirement.
Docotic.Pdf can reduce the size of compressed and uncompressed PDFs to different degrees without introducing any destructive changes.
Gains depend on the size and structure of a PDF: for small files or files that are mostly scanned images the reduction might not be that great, so you should try the library with your files and see for yourself.
If you are mostly concerned about size, there are many images in your files, and you are fine with losing some of the quality of those images, then you can easily recompress existing images using Docotic.Pdf.
Here is code that makes all images bilevel and compresses them with fax compression:
static void RecompressExistingImages(string fileName, string outputName)
{
    using (PdfDocument doc = new PdfDocument(fileName))
    {
        foreach (PdfImage image in doc.Images)
            image.RecompressWithGroup4Fax();

        doc.Save(outputName);
    }
}
There are also RecompressWithFlate, RecompressWithGroup3Fax and RecompressWithJpeg methods.
The library will convert color images to bilevel ones if needed. You can specify the deflate compression level, JPEG quality, etc.
Docotic.Pdf can also resize big images (and recompress them at the same time) in a PDF. This might be useful if the images in a document are actually bigger than needed or if the quality of the images is not that important.
Below is code that scales all images with a width or height greater than or equal to 256. Scaled images are then encoded using JPEG compression.
public static void RecompressToJpeg(string path, string outputPath)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            // Images that are used as masks or that have an attached mask
            // are not good candidates for recompression.
            if (!image.IsMask && image.Mask == null && (image.Width >= 256 || image.Height >= 256))
                image.Scale(0.5, PdfImageCompression.Jpeg, 65);
        }

        doc.Save(outputPath);
    }
}
Images can be resized to a specified width and height using one of the ResizeTo methods. Please note that ResizeTo won't try to preserve the aspect ratio of images; you should calculate the proper width and height yourself.
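For example, here is a minimal sketch of downscaling while keeping the aspect ratio. It assumes a ResizeTo(int width, int height) overload and uses an arbitrary maximum dimension of 1024; adjust both to the actual Docotic.Pdf API and your own needs.

static void DownscaleLargeImages(string path, string outputPath, int maxDimension = 1024)
{
    using (PdfDocument doc = new PdfDocument(path))
    {
        foreach (PdfImage image in doc.Images)
        {
            if (image.Width <= maxDimension && image.Height <= maxDimension)
                continue;

            // Compute a target size that preserves the aspect ratio,
            // because ResizeTo itself does not.
            double scale = Math.Min(
                (double)maxDimension / image.Width,
                (double)maxDimension / image.Height);
            int targetWidth = (int)(image.Width * scale);
            int targetHeight = (int)(image.Height * scale);

            image.ResizeTo(targetWidth, targetHeight); // assumed overload
        }

        doc.Save(outputPath);
    }
}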
Disclaimer: I work for Bit Miracle.
Using PdfSharp
public static void CompressPdf(string targetPath)
{
    using (var stream = new MemoryStream(File.ReadAllBytes(targetPath)) { Position = 0 })
    using (var source = PdfReader.Open(stream, PdfDocumentOpenMode.Import))
    using (var document = new PdfDocument())
    {
        var options = document.Options;
        options.FlateEncodeMode = PdfFlateEncodeMode.BestCompression;
        options.UseFlateDecoderForJpegImages = PdfUseFlateDecoderForJpegImages.Automatic;
        options.CompressContentStreams = true;
        options.NoCompression = false;

        foreach (var page in source.Pages)
        {
            document.AddPage(page);
        }

        document.Save(targetPath);
    }
}
Ghostscript is AGPL-licensed software that can compress PDFs. There is also an AGPL-licensed C# wrapper for it on GitHub here.
You could use the GhostscriptProcessor class from that wrapper to pass custom commands to Ghostscript, like the ones found in this AskUbuntu answer describing PDF compression.
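For illustration, here is a minimal sketch of that. It assumes the wrapper is Ghostscript.NET and that its GhostscriptProcessor exposes a parameterless constructor and a Process(string[]) method taking raw Ghostscript switches (some versions of the wrapper expect a dummy first element in the argument array), so check the wrapper's own samples before relying on it. The switches themselves are the standard pdfwrite ones.

using Ghostscript.NET.Processor;

static void CompressWithGhostscript(string inputPath, string outputPath)
{
    // Standard Ghostscript switches for re-writing a PDF at "ebook" quality
    // (150 dpi images); /screen gives smaller files, /prepress higher quality.
    string[] switches =
    {
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        "-dPDFSETTINGS=/ebook",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        "-sOutputFile=" + outputPath,
        inputPath
    };

    using (var processor = new GhostscriptProcessor())
    {
        processor.Process(switches); // assumed API of the wrapper
    }
}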
I was wondering if I can compress/change the quality of my resulting PDF file with iTextSharp and C#, like I can do with Adobe Acrobat Pro or PDF24 Creator.
Using PDF24 Creator I can open the PDF, save the file again, and set the "Quality of the PDF" to "Low Quality"; my file size decreases from 88.6 MB to 12.5 MB while the quality is still good enough.
I am already using
writer = new PdfCopy(doc, fs);
writer.SetPdfVersion(PdfCopy.PDF_VERSION_1_7);
writer.CompressionLevel = PdfStream.BEST_COMPRESSION;
writer.SetFullCompression();
which decreases the file size only from about 92 MB to 88 MB.
Alternatively: can I run the PDF24 program from my C# code using command-line arguments or start parameters? Something like this:
pdf24Creator.exe -save -Quality:low -inputfile -outputfile
Thanks for your help (Bruno)!
Short answer: no.
Long answer: yes, but you must do a lot of the work yourself.
If you read the third and fourth paragraphs here you'll hopefully get a better understanding of what "compression" actually means from a PDF perspective.
Programs like Adobe Acrobat and PDF24 Creator allow you to reduce the size of a file by destroying the data within the PDF. When you select a low quality setting, one of the most common changes these programs make is to extract all of the images, reduce their quality, and replace the originals in the PDF. So a JPEG originally saved without any compression might be knocked down to 60% quality. And just to be clear, that 60% is non-reversible; it isn't zipping the file, it is literally destroying data in order to save space.
Another technique is to reduce the effective DPI of an image. A 500-pixel-wide image placed into a 2-inch-wide box is effectively 250 DPI. These programs will extract the image, reduce it to maybe 96 or 72 DPI, which means the 500-pixel image would be reduced to 192 or 144 pixels in width, and replace the original in the PDF. Once again, this is a destructive, non-reversible change.
(And by destructive non-reversible, you still probably have the original file, I just want to be clear that this isn't true "compression" like ZIP.)
However, if you really want to do it, you can look at code like this, which shows how you can use iText to perform the extraction and re-insertion of images. It is 100% up to you, however, to change the images, because iText won't make destructive changes to your data (and that's a good thing, I'd say!)
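To give a rough idea of the work involved, below is a minimal sketch of that extract-and-re-insert approach with iTextSharp 5.x. It downsamples every decodable image to half size at JPEG quality 65; the quality, the scale factor, and the simplifying assumptions (RGB images only, no masks or special color spaces handled) are all mine, so treat it as a starting point rather than a drop-in solution.

using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static void DownsampleImages(string src, string dest)
{
    PdfReader reader = new PdfReader(src);
    for (int i = 1; i < reader.XrefSize; i++)
    {
        PdfObject obj = reader.GetPdfObject(i);
        if (obj == null || !obj.IsStream())
            continue;

        PRStream stream = (PRStream)obj;
        PdfObject subtype = stream.Get(PdfName.SUBTYPE);
        if (!PdfName.IMAGE.Equals(subtype))
            continue;

        // Decode the image; this throws for unsupported filters, so a real
        // implementation should catch and skip those cases.
        PdfImageObject imageObject = new PdfImageObject(stream);
        using (System.Drawing.Image original = imageObject.GetDrawingImage())
        using (Bitmap scaled = new Bitmap(original, original.Width / 2, original.Height / 2))
        using (MemoryStream ms = new MemoryStream())
        {
            // Re-encode as JPEG at quality 65 (arbitrary choice).
            EncoderParameters parameters = new EncoderParameters(1);
            parameters.Param[0] = new EncoderParameter(
                System.Drawing.Imaging.Encoder.Quality, 65L);
            scaled.Save(ms, GetJpegCodec(), parameters);

            // Replace the image stream in place.
            stream.Clear();
            stream.SetData(ms.ToArray(), false, PRStream.NO_COMPRESSION);
            stream.Put(PdfName.TYPE, PdfName.XOBJECT);
            stream.Put(PdfName.SUBTYPE, PdfName.IMAGE);
            stream.Put(PdfName.FILTER, PdfName.DCTDECODE);
            stream.Put(PdfName.WIDTH, new PdfNumber(scaled.Width));
            stream.Put(PdfName.HEIGHT, new PdfNumber(scaled.Height));
            stream.Put(PdfName.BITSPERCOMPONENT, new PdfNumber(8));
            stream.Put(PdfName.COLORSPACE, PdfName.DEVICERGB);
        }
    }

    using (FileStream fs = new FileStream(dest, FileMode.Create))
    {
        PdfStamper stamper = new PdfStamper(reader, fs);
        stamper.Close(); // closing the stamper writes the modified document
    }
    reader.Close();
}

static ImageCodecInfo GetJpegCodec()
{
    foreach (ImageCodecInfo codec in ImageCodecInfo.GetImageEncoders())
        if (codec.FormatID == ImageFormat.Jpeg.Guid)
            return codec;
    return null;
}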
I need to capture an area of my desktop, but I need this area to be very high resolution (at least a few thousand pixels horizontally, and the same vertically). Is it possible to get a screen capture that has such a high density of pixels? How can I do this? I tried capturing the screen with an AutoIt script and got some very good results (images that were 350 MB), and now I would like to do the same using C#.
Edit:
I am doing my read/write of a .tif file like this, and it already loses most of the data:
using (Bitmap bitmap = (Bitmap)Image.FromFile(@"ScreenShot.tif")) // this file is 350 MB
{
    using (Bitmap newBitmap = new Bitmap(bitmap))
    {
        newBitmap.Save("TESTRES.TIF", ImageFormat.Tiff); // now this file is about 60 MB - why?
    }
}
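A likely explanation (an assumption on my part, not from the thread) is that the two files simply use different TIFF compression: GDI+ compresses TIFFs by default when you call Save, while the original capture may have been written uncompressed. If you want to control this explicitly, you can pass a compression parameter to the TIFF encoder; a minimal sketch:

using System.Drawing;
using System.Drawing.Imaging;

static void SaveTiff(Bitmap bitmap, string path, EncoderValue compression)
{
    // Find the TIFF encoder.
    ImageCodecInfo tiffCodec = null;
    foreach (ImageCodecInfo codec in ImageCodecInfo.GetImageEncoders())
        if (codec.FormatID == ImageFormat.Tiff.Guid)
            tiffCodec = codec;

    // Ask for a specific compression scheme, e.g. EncoderValue.CompressionNone
    // or EncoderValue.CompressionLZW.
    EncoderParameters parameters = new EncoderParameters(1);
    parameters.Param[0] = new EncoderParameter(
        System.Drawing.Imaging.Encoder.Compression, (long)compression);

    bitmap.Save(path, tiffCodec, parameters);
}

// Hypothetical usage in the snippet above:
// SaveTiff(newBitmap, "TESTRES.TIF", EncoderValue.CompressionNone);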
I am trying to capture my screen like this, but the best I can get from it is a few megabytes (nowhere near 350 MB):
using (var bmpScreenCapture = new Bitmap(window[2], window[3], PixelFormat.Format32bppArgb))
{
    using (var i = Graphics.FromImage(bmpScreenCapture))
    {
        i.InterpolationMode = InterpolationMode.High;
        i.CopyFromScreen(window[0], window[1], 0, 0, bmpScreenCapture.Size, CopyPixelOperation.SourceCopy);
    }

    bmpScreenCapture.Save("test2.tif", ImageFormat.Tiff);
}
You can't gather more information than the source has.
This is a basic truth and it does apply here, too.
So you can't capture more than your 1920x1080 pixels at their color depth.
OTOH, since you want to feed the captured image into OCR, there are a few more things to consider and, in fact, to do.
OCR is very happy if you help it by optimizing the image. This should involve:
reducing colors and adding contrast
enlarging to the recommended dpi resolution
adding even more contrast
Funnily enough, this will help OCR even though the real information cannot increase above what the original source contains. A good resizing algorithm will add interpolated data, and that is often just what the OCR software needs.
You should also take care to use a good, i.e. non-lossy, format when you store the image to a file - such as PNG or TIF, and never JPG.
The best settings will have to be found by trial and error until the OCR results are good enough.
Hint: Due to font antialiasing, most text on screenshots is surrounded by a halo of colorful pixels. Getting rid of it by reducing or even removing saturation is one way; maybe you want to turn antialiasing off in your display properties? (Check out ClearType!)
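As a rough sketch of the preprocessing steps above (grayscale, contrast, enlarging), assuming System.Drawing is acceptable: the 2x scale factor and the simple linear contrast stretch are arbitrary choices to tune against your own OCR results.

using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;

static Bitmap PrepareForOcr(Bitmap source, int scaleFactor = 2)
{
    var result = new Bitmap(source.Width * scaleFactor, source.Height * scaleFactor,
                            PixelFormat.Format24bppRgb);
    result.SetResolution(300, 300); // many OCR engines recommend roughly 300 dpi

    // Grayscale conversion (luminance weights) combined with a mild linear
    // contrast stretch: output = 1.2 * luminance - 0.1, clamped by GDI+.
    var matrix = new ColorMatrix(new[]
    {
        new float[] { 0.36f,  0.36f,  0.36f,  0, 0 },
        new float[] { 0.708f, 0.708f, 0.708f, 0, 0 },
        new float[] { 0.132f, 0.132f, 0.132f, 0, 0 },
        new float[] { 0,      0,      0,      1, 0 },
        new float[] { -0.1f,  -0.1f,  -0.1f,  0, 1 }
    });

    using (var attributes = new ImageAttributes())
    using (var g = Graphics.FromImage(result))
    {
        attributes.SetColorMatrix(matrix);
        g.InterpolationMode = InterpolationMode.HighQualityBicubic;
        g.DrawImage(source,
                    new Rectangle(0, 0, result.Width, result.Height),
                    0, 0, source.Width, source.Height,
                    GraphicsUnit.Pixel, attributes);
    }
    return result;
}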
I'm currently attempting to generate thumbnails of PDFs using Ghostscript (or more specifically GhostscriptSharp, the C# wrapper version) and have run into some issues with the image quality that is being output.
Using the following method:
GeneratePageThumbs(string inputPath, string outputPath, int firstPage, int lastPage, int width, int height)
and changing the width and height to smaller numbers that will generate a thumbnail roughly the size that I am looking for; for example, a height of 12 and a width of 8 will generate a set of thumbnails with a size of 102 x 88 pixels.
Ideally, I am trying to generate thumbnails with a size of 100 x 80 that look reasonably good when rendered as HTML (in an image tag), so that the reader can get a decent idea of what they are looking at from the thumbnail (as it is currently completely unreadable).
These are the current settings (from the C# wrapper):
private static readonly string[] ARGS = new string[] {
    // Keep gs from writing information to standard output
    "-q",
    "-dQUIET",

    "-dPARANOIDSAFER",         // Run this command in safe mode
    "-dBATCH",                 // Keep gs from going into interactive mode
    "-dNOPAUSE",               // Do not prompt and pause for each page
    "-dNOPROMPT",              // Disable prompts for user interaction
    "-dMaxBitmap=500000000",   // Set high for better performance
    "-dNumRenderingThreads=4", // Multi-core, come on!

    // Configure the output anti-aliasing, resolution, etc.
    "-dAlignToPixels=0",
    "-dGridFitTT=0",
    "-dTextAlphaBits=4",
    "-dGraphicsAlphaBits=4"
};
However, I am not very familiar with Ghostscript and its settings, so I don't know how to strike a balance between size and quality. I wouldn't be opposed to creating larger images and scaling them down for the thumbnails, although I would prefer to get the thumbnails to work directly if possible.
Without seeing the original documents I can't be sure, but it seems unlikely to me that 102x88 pixels is going to be sufficient to create readable text.
The TextAlphaBits setting is probably too large for this size; all you will get is a blur. Try not setting TextAlphaBits at all. NumRenderingThreads won't do anything useful with a page this small (though it won't do any harm either).
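If you do go the render-larger-and-scale-down route mentioned in the question, a minimal sketch could look like the following. It assumes GeneratePageThumbs writes one image file per page and that you know the resulting file names; the file names shown are hypothetical, and 100 x 80 is simply the target from the question.

using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;

// Render at a larger size first (e.g. pass width/height values several times
// bigger to GeneratePageThumbs), then shrink each rendered page down to
// 100 x 80 with a high-quality filter.
static void ShrinkToThumbnail(string renderedPagePath, string thumbnailPath)
{
    using (var source = new Bitmap(renderedPagePath))
    using (var thumb = new Bitmap(100, 80))
    using (var g = Graphics.FromImage(thumb))
    {
        g.InterpolationMode = InterpolationMode.HighQualityBicubic;
        g.SmoothingMode = SmoothingMode.HighQuality;
        g.DrawImage(source, new Rectangle(0, 0, thumb.Width, thumb.Height));
        thumb.Save(thumbnailPath, ImageFormat.Png);
    }
}

// Hypothetical usage - adjust the file names to whatever GeneratePageThumbs produces:
// ShrinkToThumbnail("page1_large.png", "page1_thumb.png");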
I just got a real surprise when I loaded a JPG file, turned around and saved it with a quality of 100, and the size was almost 4x the original. To investigate further, I opened and saved it without explicitly setting the quality, and the file size was exactly the same. I figured this was because nothing changed, so it's just writing the exact same bits back to a file. To test this assumption I drew a big fat line diagonally across the image and saved again without setting quality (this time I expected the file size to jump up because it would be "dirty"), but it decreased by ~10 KB!
At this point I really don't understand what is happening when I simply call Image.Save() without specifying a compression quality. How is the file size so close to the original (after the image is modified) when no quality is set, yet when I set the quality to 100 (basically no compression) the file size is several times larger than the original?
I've read the documentation on Image.Save() and it's lacking any detail about what is happening behind the scenes. I've googled every which way I can think of but I can't find any additional information that would explain what I'm seeing. I have been working for 31 hours straight so maybe I'm missing something obvious ;0)
All of this has come about while I implement some library methods to save images to a database. I've overloaded our "SaveImage" method to allow explicitly setting a quality and during my testing I came across the odd (to me) results explained above. Any light you can shed will be appreciated.
Here is some code that will illustrate what I'm experiencing:
string filename = @"C:\temp\image testing\hh.jpg";
string destPath = @"C:\temp\image testing\";

using (Image image = Image.FromFile(filename))
{
    ImageCodecInfo codecInfo = ImageUtils.GetEncoderInfo(ImageFormat.Jpeg);

    // Set the quality
    EncoderParameters parameters = new EncoderParameters(1);

    // Quality: 10
    parameters.Param[0] = new EncoderParameter(
        System.Drawing.Imaging.Encoder.Quality, 10L);
    image.Save(destPath + "10.jpg", codecInfo, parameters);

    // Quality: 75
    parameters.Param[0] = new EncoderParameter(
        System.Drawing.Imaging.Encoder.Quality, 75L);
    image.Save(destPath + "75.jpg", codecInfo, parameters);

    // Quality: 100
    parameters.Param[0] = new EncoderParameter(
        System.Drawing.Imaging.Encoder.Quality, 100L);
    image.Save(destPath + "100.jpg", codecInfo, parameters);

    // Default (no encoder parameters)
    image.Save(destPath + "default.jpg", ImageFormat.Jpeg);

    // Big line across image
    using (Graphics g = Graphics.FromImage(image))
    {
        using (Pen pen = new Pen(Color.Red, 50F))
        {
            g.DrawLine(pen, 0, 0, image.Width, image.Height);
        }
    }
    image.Save(destPath + "big red line.jpg", ImageFormat.Jpeg);
}

public static ImageCodecInfo GetEncoderInfo(ImageFormat format)
{
    return ImageCodecInfo.GetImageEncoders().ToList().Find(delegate(ImageCodecInfo codec)
    {
        return codec.FormatID == format.Guid;
    });
}
Using Reflector, it turns out Image.Save() boils down to the GDI+ function GdipSaveImageToFile, with encoderParams NULL. So I think the question is what the JPEG encoder does when it gets a null encoderParams. 75% has been suggested here, but I can't find any solid reference.
EDIT: You could probably find out for yourself by running your program above for quality values of 1..100 and comparing the results with the JPG saved with the default quality (using, say, fc.exe /B).
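For example, a minimal sketch of that experiment, reusing the GetEncoderInfo helper from the question (the file paths are placeholders):

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;

static void FindDefaultJpegQuality(string sourceFile, string workDir)
{
    using (Image image = Image.FromFile(sourceFile))
    {
        // Save once with no encoder parameters at all.
        string defaultFile = Path.Combine(workDir, "default.jpg");
        image.Save(defaultFile, ImageFormat.Jpeg);
        byte[] defaultBytes = File.ReadAllBytes(defaultFile);

        ImageCodecInfo codecInfo = ImageUtils.GetEncoderInfo(ImageFormat.Jpeg); // helper from the question
        for (long quality = 1; quality <= 100; quality++)
        {
            string candidate = Path.Combine(workDir, quality + ".jpg");
            using (EncoderParameters parameters = new EncoderParameters(1))
            {
                parameters.Param[0] = new EncoderParameter(
                    System.Drawing.Imaging.Encoder.Quality, quality);
                image.Save(candidate, codecInfo, parameters);
            }

            // Byte-for-byte comparison, the managed equivalent of fc.exe /B.
            if (File.ReadAllBytes(candidate).SequenceEqual(defaultBytes))
                Console.WriteLine("Default quality appears to be " + quality);
        }
    }
}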
IIRC, it is 75%, but I don't recall where I read this.
I don't know much about the Image.Save method, but I can tell you that adding that fat line would logically reduce the size of the JPG image. This is due to the way a JPG is saved (and encoded).
The thick solid line makes for a very simple and smaller encoding (if I remember correctly this is relevant mostly after the discrete cosine transform), so the modified image can be stored using less data (bytes).
jpg encoding steps
Regarding the changes in size (without the added line), I'm not sure which image you reopened and resaved
To further investigate I open and saved without explicitly setting the quality and the file size was exactly the same
If you opened the old (original normal size) image and resaved it, then maybe the default compression and the original image compression are the same.
If you opened the new (4X larger) image and resaved it, then maybe the default compression for the save method is derived from the image (as it was when loaded).
Again, I don't know the save method, so I'm just throwing ideas (maybe they'll give you a lead).
When you save an image as a JPEG file with a quality level of <100%, you are introducing artefacts into the saved-off image, which are a side-effect of the compression process. This is why re-saving the image at 100% is actually increasing the size of your file beyond the original - ironically there's more information present in the bitmap.
This is also why you should always attempt to save in a non-lossy format (such as PNG) if you intend to do any edits to your file afterwards, otherwise you'll be affecting the quality of the output through multiple lossy transformations.
I am loading a JPG image from the hard disk into a byte[]. Is there a way to resize the image (reduce its resolution) without the need to put it in a Bitmap object?
Thanks
There are always ways, but whether they are better... A JPG is a compressed image format, which means that to do any image manipulation on it you need something to interpret that data. The Bitmap object will do this for you, but if you want to go another route you'll need to look into understanding the JPEG spec, creating some kind of parser, etc. It might be that there are shortcuts that can be used without needing to do a full interpretation of the original JPG, but I think it would be a bad idea.
Oh, and not to forget, there are different file formats for JPG apparently (JFIF and EXIF) that you will need to understand...
I'd think very hard before avoiding objects that are specifically designed for the sort of thing you are trying to do.
A .jpeg file is just a bag o' bytes without a JPEG decoder. There's one built into the Bitmap class, it does a fine job decoding .jpeg files. The result is a Bitmap object, you can't get around that.
And it supports resizing through the Graphics class, as well as the Bitmap(Image, Size) constructor. But yes, making a .jpeg image smaller often produces a file that's larger. That's an unavoidable side-effect of the Graphics.InterpolationMode. It tries to improve the appearance of the reduced image by running the pixels through a filter. The bicubic filter does an excellent job of it.
Looks great to the human eye, doesn't look so great to the JPEG encoder. The filter produces interpolated pixel colors, designed to avoid making image details disappear completely when the size is reduced. These blended pixel values however make it harder on the encoder to compress the image, thus producing a larger file.
You can tinker with Graphics.InterpolationMode and select a lower-quality filter. That produces a poorer image, but one that's easier to compress. I doubt you'll appreciate the result though.
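To make the trade-off concrete, here is a minimal sketch that goes byte[] -> Bitmap -> smaller Bitmap -> JPEG byte[], with the interpolation filter and JPEG quality exposed as the two knobs being discussed; the specific values you pass are just examples to experiment with.

using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;

static byte[] ResizeJpeg(byte[] jpegBytes, int width, int height,
                         InterpolationMode filter, long quality)
{
    using (var input = new MemoryStream(jpegBytes))
    using (var source = (Bitmap)Image.FromStream(input))
    using (var resized = new Bitmap(width, height))
    using (var g = Graphics.FromImage(resized))
    using (var output = new MemoryStream())
    {
        g.InterpolationMode = filter; // e.g. HighQualityBicubic vs NearestNeighbor
        g.DrawImage(source, new Rectangle(0, 0, width, height));

        ImageCodecInfo jpegCodec = ImageCodecInfo.GetImageEncoders()
            .First(c => c.FormatID == ImageFormat.Jpeg.Guid);
        using (var parameters = new EncoderParameters(1))
        {
            parameters.Param[0] = new EncoderParameter(
                System.Drawing.Imaging.Encoder.Quality, quality);
            resized.Save(output, jpegCodec, parameters);
        }

        return output.ToArray();
    }
}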
Here's what I'm doing.
And no, I don't think you can resize an image without first processing it in-memory (i.e. in a Bitmap of some kind).
Decent-quality resizing involves using an interpolation/extrapolation algorithm; it can't just be "pick out every nth pixel", unless you can settle for nearest neighbor.
Here's some explanation: http://www.cambridgeincolour.com/tutorials/image-interpolation.htm
protected virtual byte[] Resize(byte[] data, int width, int height)
{
    var inStream = new MemoryStream(data);
    var outStream = new MemoryStream();

    var bmp = System.Drawing.Bitmap.FromStream(inStream);
    var th = bmp.GetThumbnailImage(width, height, null, IntPtr.Zero);
    th.Save(outStream, System.Drawing.Imaging.ImageFormat.Jpeg);

    return outStream.ToArray();
}