I'm currently using the Docotic.Pdf library to write a compression program for a PDF file server hosting large scanned documents. (The intention is to get the smallest black-and-white files that still keep the documents readable; they are mostly legal briefs.)
In testing I notice that certain files respond better to JPEG compression, while others respond better to Group3Fax or Flate. Is it possible to analyze a file and make an intelligent decision about which algorithm will produce the smallest PDF, or would I actually have to compress each file with all three algorithms and choose the smallest, which incurs a ton of additional CPU overhead?
Any guidance is greatly appreciated. Thanks
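For reference, the brute-force fallback I'd like to avoid looks something like the sketch below. The compressor delegates are placeholders for whatever recompression calls your PDF library exposes; only the pick-the-smallest loop is the point here.

using System;
using System.Collections.Generic;
using System.IO;

static class SmallestPick
{
    // Try every candidate scheme on the same source and keep the smallest output.
    // Each delegate takes (sourcePath, outputPath) and writes a recompressed PDF.
    public static string CompressSmallest(
        string sourcePath,
        IReadOnlyDictionary<string, Action<string, string>> candidates)
    {
        string bestPath = null;
        long bestSize = long.MaxValue;

        foreach (var candidate in candidates)
        {
            string outPath = sourcePath + "." + candidate.Key + ".pdf";
            candidate.Value(sourcePath, outPath);

            long size = new FileInfo(outPath).Length;
            if (size < bestSize)
            {
                bestSize = size;
                bestPath = outPath;
            }
        }

        return bestPath; // caller deletes the losing candidates
    }
}

A cheap heuristic that often avoids the full loop: pages that are essentially bilevel text tend to compress best with the fax schemes, while pages with photographs or halftone regions favor JPEG, so sampling a page's pixel histogram first can rule candidates out early.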
I have been exploring the internet for two days and still can't find a good head start for this. I want to write C# code that takes a binary .jpeg file, decodes it, and displays the image. Everywhere I look there are lots of explanations of the JPEG algorithm, but I still can't find a good explanation of how to parse and decode the file itself. For example, how can I know at which byte the Huffman DC table starts and at which byte it ends?
I would appreciate it if someone could link me somewhere that explains parsing a binary JPEG file.
Thank you, and sorry for my English.
Trust me, this isn't something you want to do yourself. I wouldn't touch it with a pole several meters long...
http://ijg.org/ is the site of the IJG:
IJG is an informal group that writes and distributes a widely used free library for JPEG image compression. The first version was released on 7-Oct-1991.
There you will find the source code for libjpeg.
If you just want to take a look, http://elm-chan.org/fsw/tjpgd/00index.html has the source of TJpgDec:
TJpgDec is a generic JPEG image decompressor module that is highly optimized for small embedded systems.
It is even:
Platform independent. Written in ANSI-C.
Being tiny, it will probably be easy to reimplement in C# :-)
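If you do want to parse the container yourself, the segment layout is simple enough to walk: after the two-byte SOI marker (0xFF 0xD8), each segment starts with 0xFF plus a marker byte, and most segments carry a two-byte big-endian length that includes the length field itself. Huffman tables live in DHT segments (marker 0xC4): a class/id byte (high nibble 0 = DC, 1 = AC), 16 code-length counts, then the symbols. A minimal C# sketch that just locates these segments:

using System;
using System.IO;

static class JpegWalker
{
    public static void DumpSegments(string path)
    {
        byte[] data = File.ReadAllBytes(path);
        if (data.Length < 4 || data[0] != 0xFF || data[1] != 0xD8)
            throw new InvalidDataException("Not a JPEG (missing SOI marker).");

        int pos = 2;
        while (pos + 3 < data.Length)
        {
            if (data[pos] != 0xFF)
                throw new InvalidDataException("Lost marker sync.");

            byte marker = data[pos + 1];
            int length = (data[pos + 2] << 8) | data[pos + 3]; // includes the length bytes

            Console.WriteLine($"Marker 0xFF{marker:X2}, {length} bytes at offset {pos}");

            if (marker == 0xC4) // DHT: Huffman table(s); report the first table's header
            {
                int classAndId = data[pos + 4];
                Console.WriteLine($"  Huffman table class {classAndId >> 4} (0=DC, 1=AC), id {classAndId & 0x0F}");
            }

            if (marker == 0xDA) // SOS: entropy-coded scan data follows
                break;

            pos += 2 + length;
        }
    }
}

Decoding the entropy-coded scan that follows SOS is where the real work is; that's the part worth cribbing from libjpeg or TJpgDec rather than reinventing.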
My customer has about 100,000 scanned documents (JPG) which they work with every day. I want to know how I can reduce the file size of those images for faster file transfer and browsing.
The documents are scanned in black and white and saved in JPG format. They have a resolution of 150 dpi and dimensions of 1275x1753 pixels (width x height). The main problem is their size, which is between ~150 KB and ~500 KB; that seems too high for a black-and-white picture.
Is there a chance I can reduce their size by changing the resolution, changing the color mode, or something similar? I tried playing around with Photoshop, but no luck.
The scanned documents are for the sole purpose of reviewing, so I don't think they need much detail or the original picture size.
I'm going to write the program in C#, so tell me if there is a good image library for this purpose.
If your images are JPEG-compressed, then they are either grayscale (8 bits per pixel) or full color (24 or 32 bits per pixel). I am not aware of any other JPEG types out there.
Given that, you probably won't get much benefit if you try to convert these images to other formats without changes to their size (number of pixels in both directions) and/or color space.
There is a possibility that JPEG 2000 might compress your images better than JPEG, but another round of lossy compression will introduce more artifacts. You might try it for yourself and see if the result is acceptable. I can't recommend any tools for this approach, though.
I would recommend that you convert your images to bilevel ones (i.e. with only two colors) and compress them with one of the fax compression schemes (Group 3 or Group 4). You might reduce the image dimensions at the same time, too. This can easily be achieved using the Docotic.Pdf library. (Disclaimer: I work for the vendor of the library.)
Please take a look at my answer to a question similar to yours. The answer shows how to use the RecompressWithGroup4Fax and/or Scale methods to recompress existing images in a PDF.
There is also valuable advice from @plinth about JBIG2 compression and other matters. Well worth reading.
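For a rough idea of the shape of that code, here is a minimal sketch built around the two methods named above. The namespace and the images collection are assumptions based on typical Docotic.Pdf samples, so check the current documentation for the exact signatures.

using BitMiracle.Docotic.Pdf; // assumed namespace for Docotic.Pdf

static class BilevelRecompress
{
    public static void ToGroup4Fax(string inputPath, string outputPath)
    {
        using (var pdf = new PdfDocument(inputPath))
        {
            // Assumed collection of image resources in the document.
            foreach (PdfImage image in pdf.Images)
            {
                // Converts the image to bilevel and recompresses it with CCITT Group 4.
                image.RecompressWithGroup4Fax();
            }

            pdf.Save(outputPath);
        }
    }
}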
I am writing a program in C# that involves image compression. I need to be able to take any picture and compress it so that it is less than 10 KB. Quality is not a huge concern, but I would like to keep as much of it as possible. I have searched for and tried numerous solutions; none of the ones I found can target a specific file size, and even with iteration they hit maximum compression while still well above 10 KB. It would also be great if the solution were open source or free!
Thanks in advance!
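One workable approach, sketched below with the stock System.Drawing JPEG encoder: step the quality down until the output fits the byte budget, and when even the lowest quality is too big, shrink the pixel dimensions and try again. The quality step and floor are parameters to tune.

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;

static class TinyJpeg
{
    public static byte[] CompressUnder(Image image, long maxBytes)
    {
        ImageCodecInfo jpegCodec = ImageCodecInfo.GetImageEncoders()
            .First(c => c.FormatID == ImageFormat.Jpeg.Guid);

        // Walk the quality setting down until the encoded size fits.
        for (long quality = 90; quality >= 10; quality -= 10)
        {
            using (var ms = new MemoryStream())
            using (var parameters = new EncoderParameters(1))
            {
                parameters.Param[0] = new EncoderParameter(Encoder.Quality, quality);
                image.Save(ms, jpegCodec, parameters);
                if (ms.Length <= maxBytes)
                    return ms.ToArray();
            }
        }

        // Lowest quality still too big: halve the dimensions and retry.
        // (Real code should stop recursing once the image gets absurdly small.)
        using (var smaller = new Bitmap(image, image.Width / 2, image.Height / 2))
            return CompressUnder(smaller, maxBytes);
    }
}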
I'm looking for a way to convert a high-resolution PDF file to a low-resolution PDF file from an ASP.NET application (C#).
Users will import high-resolution PDFs, and the solution should then be able to provide both a high-resolution and a low-resolution PDF.
I'm looking for an API to do that. I have found a lot of PDF APIs, but none of them seems to do what I'm looking for.
ABCpdf .NET will do this for you. There are a variety of functions for resizing, resampling or recompressing the images within a PDF document. However, given your requirements, you probably just want to use the document reduce-size operation.
To do this you just need code of the following form:
Doc doc = new Doc();
doc.Read(Server.MapPath("../mypics/sample.pdf"));
using (ReduceSizeOperation op = new ReduceSizeOperation(doc)) {
    op.UnembedSimpleFonts = false; // though of course making these true...
    op.UnembedComplexFonts = false; // ... would further reduce file size.
    op.MonochromeImageDpi = 72;
    op.GrayImageDpi = 72;
    op.ColorImageDpi = 144;
    op.Compact(true);
}
doc.Save(Server.MapPath("ReduceSizeOperation.pdf"));
I work on the ABCpdf .NET software component so my replies may feature concepts based around ABCpdf. It's just what I know. :-)
There are a number of possible approaches to this problem. One would simply be to export from InDesign twice, which would give you the two required versions of the PDF. If that is not feasible (and it might not be, as exporting from InDesign can take a while), there are definitely libraries on the market that can do what you want. You have a number of approaches:
1) While this will get me shot by most Adobe employees, you could re-distill your PDF file into a smaller file. I would not advocate doing this; I mention it only to be complete. The risks involved are quality loss, artifacts and so on (mostly because PostScript doesn't support a number of features that PDF does, so by redistilling you'd lose those features).
2) Use an application or library that was made for this task. callas pdfToolbox, for example (warning: I'm affiliated with this company!), is capable of taking a PDF file and running corrections on it. Among those corrections are image downsampling, conversion to RGB, image recompression (for example with JPEG 2000), throwing away unnecessary data in the PDF, and much more. Given that the application has a command-line interface, it can easily be integrated into a C# process (a minimal launch sketch follows this answer). Other applications with similar functionality come from companies such as Enfocus, Apago and others.
3) Use a low-level PDF library (such as the Adobe PDF library that can be licensed from Adobe through DataLogics) and implement the necessary changes yourself. More work, but potentially more flexible.
Whatever approach you choose, know that you are starting from a high-quality PDF file and that your process should retain as much of that quality as possible (depending on what the low-resolution PDF is used for, of course). Make sure you don't get into trouble by losing proper overprint, transparency, etc.
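To illustrate the command-line route from option 2: driving any such tool from C# is just a Process call. The executable name and flags below are placeholders, not real pdfToolbox syntax; substitute the vendor's documented command line.

using System.Diagnostics;

static class PdfCliRunner
{
    public static int RunLowResConversion(string inputPdf, string outputPdf)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdf-lowres-tool", // placeholder executable name
            Arguments = $"--profile lowres \"{inputPdf}\" \"{outputPdf}\"", // placeholder flags
            UseShellExecute = false,
            RedirectStandardError = true,
        };

        using (Process process = Process.Start(startInfo))
        {
            string errors = process.StandardError.ReadToEnd(); // read before waiting to avoid deadlock
            process.WaitForExit();
            return process.ExitCode; // non-zero usually means the conversion failed
        }
    }
}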
Probably you can just resize the images in the high-resolution PDF; this will give you much smaller files.
You can resize and/or recompress images using the Docotic.Pdf library. For more details, please take a look at my answer to a similar question.
Disclaimer: I am one of the developers of Docotic.Pdf library.
I'm working on a steganography application. I need to hide a message inside an image file and secure it with a password, with little difference in the file size. I am using the Least Significant Bit algorithm and could do it successfully with BMP files, but it does not work with JPEG, PNG or TIFF files. Does this algorithm work with these files at all? Is there a better way to achieve this? Thanks.
This heavily depends on the way the particular image format works. You'll need to dive into the internals of the format you want to use.
For JPEG, you could fiddle with the last bits of the DCT coefficients for each block.
For palette-based files (GIFs, and some PNGs), you could add extra colours to the palette that look identical to the existing ones, and encode information based on which one you use.
You'll have to distinguish between pixel-based formats (bitmaps) and palette-based formats (GIF), for which the steganographic techniques are quite different. Also be aware that there are image formats, like JPEG, that lose information in the compression process.
I'd also advise reading a general introduction to steganography that covers the different formats.
The Least Significant Bit approach does not work with JPEG and GIF images because you are using the pixel data (the raw image) to store the hidden information before compression. A pixel p with value 0x123456 will probably not keep that value after compression, because the result depends on the compression rate and the neighbouring pixels. Here we are dealing with algorithms that do not merely compact the image (like ZIP, which preserves the content) but change its color distribution, texture and quality in order to reduce the number of bits needed to represent it.
PNG, however, only compacts the image in the same sense as a ZIP file, keeping the content intact. Therefore you can use the Least Significant Bit approach with PNG images; the Wikipedia steganography page shows an example in this format.
As long as the image format is lossless, you can use LSB steganography in the pixels (BMP, PNG, TIFF, PPM). If it is lossy, you have to try something else, as compression and subsequent decompression cause small changes in the pixels and the message is gone. In GIF, you can embed your message in the palette. In JPEG, you change the DCT coefficients, a low-level frequency representation of the image, which can be read from and saved to a JPEG file losslessly.
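To make the lossless case concrete, here is a minimal LSB sketch with System.Drawing: one message bit goes into the least significant bit of each pixel's blue channel, and the result is saved as PNG so the bits survive. A real implementation would also encrypt the payload (covering the password requirement) and embed the message length so it can be extracted again.

using System.Drawing;
using System.Drawing.Imaging;

static class LsbEmbedder
{
    public static void Hide(string inputPath, string outputPath, byte[] message)
    {
        using (var bmp = new Bitmap(inputPath))
        {
            int totalBits = message.Length * 8;
            int bitIndex = 0;

            for (int y = 0; y < bmp.Height && bitIndex < totalBits; y++)
            {
                for (int x = 0; x < bmp.Width && bitIndex < totalBits; x++)
                {
                    int bit = (message[bitIndex / 8] >> (7 - bitIndex % 8)) & 1;

                    Color original = bmp.GetPixel(x, y);
                    Color stego = Color.FromArgb(
                        original.A, original.R, original.G,
                        (original.B & ~1) | bit); // overwrite the blue channel's LSB

                    bmp.SetPixel(x, y, stego);
                    bitIndex++;
                }
            }

            bmp.Save(outputPath, ImageFormat.Png); // lossless format, so the bits survive
        }
    }
}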
There is an extensive research on steganography in JPEG. For introduction, I personally recommend Steganography in Digital Media: Principles, Algorithms, and Applications by Jessica Fridrich - must-read material for serious attempts in steganography. The approaches for various image formats are discussed in-depth there.
Also, LSB is inefficient and very easily detectable; you should not use it. There are better algorithms, though they are usually heavy on math and complex. Look for "steganography embedding distortion" and "steganography codes".