Reduce & Optimize Scanned Documents File Size - c#

My customer has about 100,000 scanned documents (jpg) which they work with every day. I want to know how I can reduce the file size of those images for faster file transfer and browsing.
The documents are scanned in black/white and saved in jpg format. They have a resolution of 150dpi and a size of 1275x1753 (width x height). The main problem is their file size, which is between ~150kb and ~500kb. I think that is too high for a black/white picture.
Is there a chance that I can reduce their size by changing the resolution, the color mode or something else? I tried playing around with Photoshop but had no luck.
The scanned documents are only used for reviewing, so I don't think they need much detail or the original picture size.
I'm going to write the program in c#, so tell me if there is a good image library for this purpose.

If your images are JPEG-compressed then they are either grayscale (8 bits per pixel) or full color (24 or 32 bits per pixel). I am not aware of any other JPEG types out there.
Given that, you probably won't get much benefit if you try to convert these images to other formats without changes to their size (number of pixels in both directions) and/or color space.
There is a possibility that JPEG 2000 might compress your images better than JPEG, but another lossy compression will introduce some more artifacts. You might try for yourself and see if this approach is acceptable for you. I can't recommend you any tools for this approach, though.
I would recommend that you try converting your images to bilevel ones (i.e. with only two colors) and compressing them with one of the FAX compression schemes (Group 3 or Group 4). You might try to reduce the image dimensions at the same time, too. This can be easily achieved using Docotic.Pdf library (Disclaimer: I work for the vendor of the library).
Please take a look at my answer to a question similar to yours. The answer shows how to use RecompressWithGroup4Fax and/or Scale methods to recompress existing images in PDF.
There is also valuable advice from @plinth about JBIG2 compression and other stuff. Well worth reading.
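The bilevel conversion suggested in this answer can be sketched in a few lines. This is only an illustration in Python, not the Docotic.Pdf API: the function names and the fixed threshold of 128 are assumptions, and a real scan pipeline would usually pick the threshold adaptively. It also shows the raw (uncompressed) size of a 1275x1753 page at 8 bits and at 1 bit per pixel, before any Group 4 compression is applied.

```python
def to_bilevel(gray_rows, threshold=128):
    """Threshold 8-bit grayscale rows to 1-bit rows, packed 8 pixels per byte.

    A set bit means "black" (pixel value below the threshold).
    The threshold of 128 is an assumed default, not a recommendation.
    """
    packed_rows = []
    for row in gray_rows:
        packed = bytearray((len(row) + 7) // 8)
        for x, value in enumerate(row):
            if value < threshold:
                packed[x // 8] |= 0x80 >> (x % 8)
        packed_rows.append(bytes(packed))
    return packed_rows


def raw_size_bytes(width, height, bits_per_pixel):
    """Uncompressed image size, with each row padded to a whole byte."""
    return ((width * bits_per_pixel + 7) // 8) * height


# Page dimensions taken from the question: 1275 x 1753 pixels.
gray_size = raw_size_bytes(1275, 1753, 8)      # 8-bit grayscale
bilevel_size = raw_size_bytes(1275, 1753, 1)   # 1-bit bilevel

# Tiny example row: dark pixels become set bits.
sample = to_bilevel([[0, 255, 0, 255, 0, 0, 0, 0, 255, 10]])
```

The 1-bit raster is one eighth of the grayscale one even before compression, and fax-style compression then works on that much smaller input.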

Related

File size converting pdf to tiff

I'm using ghostscriptSharp to convert PDF files to TIFF files for faxing. The PDF files sometimes contain photocopies of receipts.
I'm using the tiffg3 driver with a height x width of 400 x 400. I've noticed that the PDFs that contain photocopies tend to expand in size when converting to TIFFs, while the ones without those shrink in size. A typical increase that I'm seeing is going from 1 MB to 25 MB.
I've tried adding compression to the TIFF, but then the fax process can't read it. Is there a way to reduce the output size in ghostscriptSharp without reducing the resolution?
Creating a bitmap, even a low resolution monochrome bitmap, is likely to be larger than a vector-based description language.
Consider:
(Hello World) Tj
That's 16 bytes in a PDF file, and it doesn't change if you change the font size. If you turn it into a bitmap, even at low resolution and compressed, it probably exceeds that size.
That's why rendering a page description language to a bitmap produces larger files and is one of the reasons for using a page description language for printing, instead of sending large bitmaps around.
The tiffg3 and tiffg4 devices in Ghostscript only produce monochrome output, because that's all you can encode with G3 and G4 encoding. TIFF G3 is already compressed using the Fax CCITT group 3 compression scheme (Group 3 = g3). If you try to compress that using some other scheme, then your fax software won't be able to read it.
You could try using CCITT Group 4 fax compression instead (the tiffg4 device) but if that doesn't help then basically that's what you get. Your only other option is to create the TIFF at a lower resolution. You don't say what resolution you are currently using. Fax normally supports 3 resolutions: 408x391, 204x196 and 204x98. If you are using superfine (408x391) then you could switch to a lower resolution.
I'm at a loss to see why this is a problem. Since you are sending the files by fax anyway, why do you care how large the intermediate TIFF file is?
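To get a feel for how much the resolution matters, here is a rough Python calculation of the uncompressed 1-bit raster size of one page at each of the three fax resolutions listed above. The 8.5 x 11 inch page size and the names "fine" and "standard" for the two lower resolutions are assumptions added for the example; the real TIFF will be smaller because of the G3/G4 compression.

```python
# Fax resolutions from the answer, in dots per inch (horizontal, vertical).
FAX_RESOLUTIONS = {
    "superfine": (408, 391),
    "fine": (204, 196),       # label assumed for this example
    "standard": (204, 98),    # label assumed for this example
}


def raw_fax_page_bytes(dpi_x, dpi_y, width_in=8.5, height_in=11.0):
    """Uncompressed size of a 1-bit page; page size in inches is an assumption."""
    width_px = round(width_in * dpi_x)
    height_px = round(height_in * dpi_y)
    bytes_per_row = (width_px + 7) // 8   # 8 pixels per byte, row padded
    return bytes_per_row * height_px


sizes = {name: raw_fax_page_bytes(x, y) for name, (x, y) in FAX_RESOLUTIONS.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes before compression")
```

Under these assumptions a superfine page holds about four times as many bytes as a fine one, so dropping one resolution step is the biggest single lever.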
If compression won't work and you can't reduce resolution, then the only remaining option would be color depth. It's plausible that the conversion could be using more colors when a photocopy is attached (because of gradients in shadows, or the particular color of the paper, or whatever); yet the receipt might be totally readable without all the colors (as long as the "ink" shows up as distinct from the "paper").
If your conversion tool has a setting for selecting a color depth, tinkering with that is likely your best bet.
If your toolkit allows encoding options, for faxing, your best bet will be to produce a bitonal (black and white) tiff with Group 4 encoding. The downside of that compression scheme is that the more "gray" you have (typical with color pictures converted to grayscale), the bigger your file will be. Otherwise, for most things, the compression ratio will be just fine.
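The reason "gray" areas hurt is that the fax schemes encode runs of black and white pixels, and gray that has been dithered to two colors consists of very many short runs. The Python sketch below only counts runs per row as a rough proxy for compressed size; it is not a G3 or G4 encoder, and the two sample rows are made up for illustration.

```python
def count_runs(row):
    """Number of runs of identical pixels in one row of 0/1 values."""
    if not row:
        return 0
    runs = 1
    for previous, current in zip(row, row[1:]):
        if current != previous:
            runs += 1
    return runs


# 64 pixels of mostly white paper with one black stroke.
clean_text_row = [0] * 40 + [1] * 8 + [0] * 16
# 64 pixels of 50% gray rendered as an alternating dither pattern.
dithered_row = [0, 1] * 32

clean_runs = count_runs(clean_text_row)
dithered_runs = count_runs(dithered_row)
```

The clean row has a handful of runs while the dithered row has one run per pixel, which is why photocopies with shadows and paper texture compress so much worse than clean text.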

Improving the quality of TIFF images

We have around 600,000 images that were converted from JPEG to TIFF files and uploaded to our FileNet repository. These TIFF images are multi-page, made by stitching multiple JPEGs.
This was done a couple of years ago. Now we have started getting complaints from users that the quality of the TIFF images is not the same as it was when they were JPEGs.
Is there any way we can improve the quality of the TIFF files? If I have to re-migrate this data, can JPEGs have multiple pages? Please advise.
You can't just add quality to an image, so you can either try improving the appearance of the current information or you'll need to re-create the images to get better information.
To me, it sounds like the initial creation process is the most likely cause of the quality issue. How you create the image is important.
For example, I had a large number of photos I needed to re-size, so I used irfanview's batch convert and the results were horrible. Perhaps I had the settings wrong, I don't know.
I then tried using ImageMagick, and the results were great.
The point being, the conversion process isn't trivial.
If I were you, I'd look at how the images were created, experiment with different settings to determine what gives the best appearance, then re-create your photo gallery.
For photographic material, there's no real reason to use anything other than a jpeg if the target market is the general consumer.
Both TIFF and JPEG support lossless and lossy storage of your images. You mentioned that there was a previous conversion. That conversion was probably lossy, so you probably won't be able to recover the data to the way it was previously.
That said, if you have the original source images you might be able to get back to where you were. Regarding multi-image jpegs, there is such a format, *.mpo, but I haven't seen it used before so your mileage may vary.
You probably converted grayscale or color JPEG to TIFF. The most common is TIFF G4, which is only 1 bit per pixel. So 24 or 8 bits were converted to 1 bit, and you will see a lot of image loss. There are multiple methods to improve image quality but I would have to see the images first to suggest a method.

Large Bitmap Serialization

Is there an easy way, or free library, that will allow you to append small bitmaps into one large bitmap on file? I'm doing a capture of a web page that sometimes is quite large vertically. To avoid OOM exceptions I load small vertical by full horizontal slices of the capture into memory and would like to save those to disk. An append to an open filestream would be great. I'm not an expert on the Bitmap format but I am aware there is probably header / footer information that may prevent this.
There is header information, but it's a fixed size. You could write the header information and then append rows of pixels, keeping track of the height and other information. When you're done, position to the front of the file and update the header.
Bitmap File Format is a pretty good description of the format.
I would suggest using the version 3 format unless there's something you really need from the V4 structure. 24 bits per pixel is the easiest to deal with, since you don't have to mess with a color palette. 16 and 32 bit color are easier than 4 and 8 bit color (which require a palette).
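The write-header, append-rows, patch-header approach can be sketched as follows. This is a minimal Python illustration, assuming a 24-bit version 3 BMP (the 40-byte BITMAPINFOHEADER) and a negative height so that rows are stored top to bottom, which matches appending slices in capture order. The function names and the sample data are hypothetical, and not every reader accepts top-down bitmaps; for a bottom-up file the slices would have to be written in reverse order.

```python
import os
import struct
import tempfile

FILE_HEADER_SIZE = 14
INFO_HEADER_SIZE = 40
PIXEL_OFFSET = FILE_HEADER_SIZE + INFO_HEADER_SIZE


def row_stride(width):
    """Bytes per 24-bit row, padded to a multiple of 4 as BMP requires."""
    return (width * 3 + 3) & ~3


def write_header(f, width, height):
    """Write (or overwrite) both headers for a top-down 24-bit BMP."""
    image_size = row_stride(width) * height
    f.seek(0)
    f.write(struct.pack("<2sIHHI", b"BM", PIXEL_OFFSET + image_size,
                        0, 0, PIXEL_OFFSET))
    # Negative height marks the rows as stored top to bottom.
    f.write(struct.pack("<IiiHHIIiiII", INFO_HEADER_SIZE, width, -height,
                        1, 24, 0, image_size, 2835, 2835, 0, 0))


def append_slice(f, width, rows):
    """Append rows (each width*3 bytes, BGR order); return the row count."""
    padding = b"\x00" * (row_stride(width) - width * 3)
    f.seek(0, os.SEEK_END)
    count = 0
    for row in rows:
        f.write(row + padding)
        count += 1
    return count


# Usage: three slices of two red rows each, written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "capture.bmp")
width = 5
total_rows = 0
with open(path, "w+b") as f:
    write_header(f, width, 0)              # placeholder height
    for _ in range(3):
        slice_rows = [b"\x00\x00\xff" * width] * 2
        total_rows += append_slice(f, width, slice_rows)
    write_header(f, width, total_rows)     # patch in the final height and size
```

Only one slice is ever held in memory; the header is the only part of the file that gets rewritten.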

Large bitmap images memory allocation in blob detection, C# .Net

I have bitmap images of around 14000x18000 pixels (~30MB). I am trying to process them with different image processing libraries (OpenCV (using the wrapper OpenCvSharp), Aforge.NET..) in order to do blob detection. However, labeling the bitmap image causes memory allocation problems. The libraries try to map the labeled image to a 32-bit image.
Is there a way to do the labeling operation with less memory? (Cropping the image is not a solution)
For example labeling the bitmap image to a 8bit image instead of 32?
In case there isn't an answer for the 8-bit thing... and even if there is...
For speed and memory purposes, I would highly recommend resizing the image down (not cropping). Use high-quality interpolation like this sample does, but resize to 50% rather than to thumbnail size (about 7.5MB in memory).
You didn't mention that you don't want to do this, and I am assuming you probably don't want to try it, thinking the library will do better blob detection at full resolution. Before you pooh-pooh the idea you need to test it with a full-resolution subsection of a sample image, of a size that the library will handle, compared to the same subsection at 50%.
Unless you've actually done this, you can't know. You can also figure a maximum amount of memory that the picture can use, compute a resize factor to target that number (reduce it for safety - you'll figure this out when things blow up in testing). If you care where the stuff is in the original image, scale it back up by the factor.
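The resize-factor calculation described above could look like this. It is a Python sketch of the arithmetic only; the 256 MB budget is a made-up figure, and the function names are hypothetical. It assumes the label image needs 4 bytes per pixel, as the question states.

```python
import math


def resize_factor(width, height, bytes_per_pixel, budget_bytes):
    """Largest scale factor (at most 1.0) that keeps the image under budget."""
    full_size = width * height * bytes_per_pixel
    if full_size <= budget_bytes:
        return 1.0
    # Memory grows with the square of the linear scale factor.
    return math.sqrt(budget_bytes / full_size)


def to_original(x, y, factor):
    """Map a coordinate found in the resized image back to the original."""
    return x / factor, y / factor


# 14000 x 18000 pixels with 32-bit labels is about 1 GB at full size.
full_size = 14000 * 18000 * 4
budget = 256 * 1024 * 1024                     # hypothetical memory budget
factor = resize_factor(14000, 18000, 4, budget)
scaled_width = int(14000 * factor)
scaled_height = int(18000 * factor)
```

With this budget the factor comes out just above 0.5, so the 50% resize suggested earlier is in the right range; lower the budget for a safety margin.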
This may not solve your particular problem (or it might), but have you considered splitting / segmenting the frame into a 2x2 (or 3x3) matrix and working on each part separately? Then, based on where you find the blobs in the 4 (or 9) frames, correlate and coalesce the adjoining blobs into single blobs. Of course, this high level blob coalescing would have to be your own logic.
PS> Admittedly, working off highly superficial knowledge of Aforge. No hands-on experience what-so-ever.

Image Steganography

I'm working on Steganography application. I need to hide a message inside an image file and secure it with a password, with not much difference in the file size. I am using Least Significant Bit algorithm and could do it successfully with BMP files but it does not work with JPEG, PNG or TIFF files. Does this algorithm work with these files at all? Is there a better way to achieve this? Thanks.
This heavily depends on the way the particular image format works. You'll need to dive into the internals of the format you want to use.
For JPEG, you could fiddle with the last bits of the DCT coefficients for each block.
For palette-based files (GIFs, and some PNGs), you could add extra colours to the palette that look identical to the existing ones, and encode information based on which one you use.
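The palette idea can be sketched like this in Python. It uses exact duplicates of every palette entry, which is the simplest case of "colours that look identical", and it works on plain lists rather than a real GIF or PNG file. The function names are hypothetical, and the sketch assumes the original palette has at most 128 entries so that the doubled palette still fits in 256.

```python
def embed_in_palette(palette, indices, bits):
    """Hide bits by choosing between two identical palette entries per pixel.

    The palette is doubled, so entry i and entry i + n show the same colour.
    Using i encodes a 0 bit, using i + n encodes a 1 bit.
    """
    n = len(palette)
    if n * 2 > 256:
        raise ValueError("palette needs at most 128 colours to be doubled")
    if len(bits) > len(indices):
        raise ValueError("message is longer than the number of pixels")
    new_indices = list(indices)
    for pos, bit in enumerate(bits):
        new_indices[pos] = indices[pos] + n * bit
    return palette + palette, new_indices


def extract_from_palette(palette, indices, bit_count):
    """Read the bits back: an index in the upper half of the palette means 1."""
    n = len(palette) // 2
    return [1 if indices[pos] >= n else 0 for pos in range(bit_count)]


palette = [(0, 0, 0), (255, 255, 255)]
indices = [0, 1, 1, 0, 1]
message_bits = [1, 0, 1, 1]

stego_palette, stego_indices = embed_in_palette(palette, indices, message_bits)
recovered = extract_from_palette(stego_palette, stego_indices, len(message_bits))
```

The rendered image is unchanged because every pixel still resolves to the same colour; only the choice of index carries the message.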
You'll have to distinguish between pixel-based (Bitmap) and palette-based formats (GIF) for which the steganographic technique is quite different. Also be aware that there are image formats like JPG that lose information in the compression process.
I'd also advise reading a general introduction to steganography that covers the different formats.
The Least Significant Bit approach does not work with JPEG and GIF images because you are using the pixel data (raw image) to store hidden information before compression. A pixel p with data 0x123456 will probably not have this value after compression, because its value depends on the compression rate and the neighbouring pixels. In this case we are talking about algorithms that do not only compact the image (like a ZIP, which keeps the content), but change the color distribution, texture, and quality in order to decrease the number of bits needed to represent it.
However, PNG can be used just to compact the image in the same sense as a ZIP file, keeping the content. Therefore, you can use the Least Significant Bit approach for PNG images, which is why the Wikipedia Steganography page shows an example in this format.
As long as the image format is lossless, you can use the LSB steganography in pixels (BMP, PNG, TIFF, PPM). If it is lossy, you have to try something else, as compression and subsequent decompression cause small changes in the pixels and the message is gone. In GIF, you can embed your message into the palette. In JPEG you change the DCT coefficients, a low-level frequency representation of the image, which can be read from and saved as JPEG file losslessly.
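A minimal Python sketch of LSB embedding on raw pixel bytes shows both halves of this point: the message survives as long as the samples are stored exactly, and it is gone after any lossy step. The function names are hypothetical, the cover data is synthetic, and the quantization at the end is only a crude stand-in for lossy compression, not real JPEG. Password protection is not covered here.

```python
def embed_lsb(pixels, message):
    """Return a copy of the pixel bytes with the message in the lowest bits."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message does not fit in the cover data")
    out = bytearray(pixels)
    for pos, bit in enumerate(bits):
        out[pos] = (out[pos] & 0xFE) | bit
    return bytes(out)


def extract_lsb(pixels, length):
    """Read `length` message bytes back from the lowest bit of each sample."""
    out = bytearray()
    for i in range(length):
        value = 0
        for sample in pixels[i * 8:(i + 1) * 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


cover = bytes(range(64))                 # synthetic "pixel" samples
stego = embed_lsb(cover, b"hi")
lossless_result = extract_lsb(stego, 2)  # samples stored exactly

# Crude stand-in for a lossy step: drop the two lowest bits of every sample.
degraded = bytes((sample // 4) * 4 for sample in stego)
lossy_result = extract_lsb(degraded, 2)
```

Each sample changes by at most 1, which is invisible to the eye, but the same low bits are the first thing a lossy codec throws away.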
There is extensive research on steganography in JPEG. For an introduction, I personally recommend Steganography in Digital Media: Principles, Algorithms, and Applications by Jessica Fridrich, which is must-read material for serious attempts at steganography. The approaches for various image formats are discussed in depth there.
Also, LSB is inefficient and very easily detectable, so you should not use it. There are better algorithms, although they are usually complex and heavy on math. Look for "steganography embedding distortion" and "steganography codes".
