PDF to bmp Images (12 pages = 12 images) - c#

I have to deconstruct/extract a PDF page by page into bitmap images. This will be done on a server via a web service which I've set up. How do I get this right? It has to be page by page (1 page per image).
I am really stuck and I know one of you geniuses have the answer that I've been looking for.
I have tried http://www.pdfsharp.net/wiki/ExportImages-sample.ashx, which didn't work correctly.
I am using C#;
The PDF is not password protected;
If this solution could take a Uri as a parameter for the location of the PDF it would be excellent!
The solution should not be reliant on Acrobat PDF Reader at all
I have been struggling for a very long time trying to use MigraDoc and PDFSharp and their alternatives to achieve the aforementioned problem.
ANY help/advice/code would be greatly appreciated!!
Thanks in advance!

LibPdf
This library converts a PDF file to an image. Supported image formats are PNG and BMP, but you can easily add more.
Usage example:
using (FileStream file = File.OpenRead(@"..\path\to\pdf\file.pdf")) // input file
{
    var bytes = new byte[file.Length];
    file.Read(bytes, 0, bytes.Length);
    using (var pdf = new LibPdf(bytes))
    {
        byte[] bmpBytes = pdf.GetImage(0, ImageType.BMP); // image type
        using (var outFile = File.Create(@"..\path\to\pdf\file.bmp")) // output file
        {
            outFile.Write(bmpBytes, 0, bmpBytes.Length);
        }
    }
}
Or use the Bytescout PDF Renderer SDK:
using System;
using Bytescout.PDFRenderer;
namespace PDF2BMP
{
class Program
{
static void Main(string[] args)
{
// Create an instance of Bytescout.PDFRenderer.RasterRenderer object and register it.
RasterRenderer renderer = new RasterRenderer();
renderer.RegistrationName = "demo";
renderer.RegistrationKey = "demo";
// Load PDF document.
renderer.LoadDocumentFromFile("multipage.pdf");
for (int i = 0; i < renderer.GetPageCount(); i++)
{
// Render each page of the document to a BMP image file.
renderer.RenderPageToFile(i, RasterOutputFormat.BMP, "image" + i + ".bmp");
}
// Open the first output file in default image viewer.
System.Diagnostics.Process.Start("image0.bmp");
}
}
}
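The question also asks for a solution that takes a Uri. Neither snippet above does that directly, but since the LibPdf example is constructed from a byte array, one option is to download the PDF first. A minimal sketch, assuming the Uri is reachable over HTTP(S) without authentication; `PdfDownloader` and `DownloadPdfAsync` are hypothetical names, not part of either library:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PdfDownloader
{
    // Reuse a single HttpClient for the lifetime of the service.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the PDF at the given Uri and returns its raw bytes.
    // GetByteArrayAsync throws HttpRequestException on a non-success status code.
    public static Task<byte[]> DownloadPdfAsync(Uri pdfUri)
    {
        if (pdfUri == null) throw new ArgumentNullException(nameof(pdfUri));
        return Client.GetByteArrayAsync(pdfUri);
    }
}
```

The returned bytes could then be passed to `new LibPdf(bytes)` in place of the bytes read from the file stream.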

Reduce the DPI of pdf stream in c#

I have a PDF document containing an image, and I tried to preview it using PdfDocumentProcessor from DevExpress. Because of memory constraints, I want to reduce the DPI of this PDF stream. Is there any way to do that?
public static void Main(string[] args)
{
var processor = new PdfDocumentProcessor();
var newDocument = new MemoryStream();
for (int i = 10; i <= 19; i++)
{
using (var stream = File.OpenRead($"D:\\Office\\photos\\BigPDF\\{i}.pdf"))
{
processor.AppendDocument(stream);
processor.SaveDocument(newDocument);
}
}
}
You can only control the DPI when printing, not when saving a local file.
You can however tweak the images within the PDF to convert them to JPEG and change the file quality of those.
See:
PdfGraphics.ConvertImagesToJpeg property
JpegImageQuality property

How do I read and write XMP metadata in C#?

I have this method for resizing images, and I have managed to copy all of the metadata into the new image except for the XMP data. I can only find topics on how to manage the XMP part in C++, but I need it in C#. The closest I've gotten is the xmp-sharp project, which is based on an old port of Adobe's SDK, but I can't get that working. The MetadataExtractor project gives me the same result: file format/encoding not supported. I've tried this with .jpg, .png and .tif files.
Is there no good way of reading and writing XMP in C#?
Here is my code if it's of any help (omitting all irrelevant parts):
public Task<Stream> Resize(Size size, Stream image)
{
using (var bitmap = Image.FromStream(image))
{
var newSize = new Size(size.Width, size.Height);
var ms = new MemoryStream();
using (var bmPhoto = new Bitmap(newSize.Width, newSize.Height, PixelFormat.Format24bppRgb))
{
// This saves all metadata except XMP
foreach (var id in bitmap.PropertyIdList)
bmPhoto.SetPropertyItem(bitmap.GetPropertyItem(id));
// Trying to use xmp-sharp for the XMP part
try
{
IXmpMeta xmp = XmpMetaFactory.Parse(image);
}
catch (XmpException e)
{
// Here, I always get "Unsupported Encoding, XML parsing failure"
}
// Trying to use MetadataExtractor for the XMP part
try
{
var xmpDirs = ImageMetadataReader.ReadMetadata(image).Where(d => d.Name == "XMP");
}
catch (Exception e)
{
// Here, I always get "File format is not supported"
}
// more code to modify image and save to stream
}
ms.Position = 0;
return Task.FromResult<Stream>(ms);
}
}
The reason you get "File format is not supported" is that you already consumed the image from the stream when you called Image.FromStream(image) in the first few lines.
If you don't do that, you should find that you can read out the XMP just fine.
var xmp = ImageMetadataReader.ReadMetadata(stream).OfType<XmpDirectory>().FirstOrDefault();
If your stream is seekable, you might be able to seek back to the origin (using the Seek method, or by setting Position to zero).
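If the incoming stream is not seekable, one workaround is to copy it into a MemoryStream once and rewind that copy between reads. A minimal sketch using only the base class library; `StreamBuffering` and `ToSeekable` are hypothetical names:

```csharp
using System.IO;

public static class StreamBuffering
{
    // Copies the remaining bytes of any stream into a seekable MemoryStream
    // and rewinds the copy so the next reader starts at the first byte.
    public static MemoryStream ToSeekable(Stream source)
    {
        var buffer = new MemoryStream();
        source.CopyTo(buffer);
        buffer.Position = 0;
        return buffer;
    }
}
```

With this, you would call `ToSeekable(image)` once, pass the result to `Image.FromStream`, and then set `Position = 0` on the same buffer before handing it to `ImageMetadataReader.ReadMetadata`.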

How to convert PDF files to images

I need to convert PDF files to images. If the PDF file is multi-page, I just need one image that contains all of the PDF pages.
Is there a free, open-source solution, unlike the paid Acrobat product?
The thread "converting PDF file to a JPEG image" is suitable for your request.
One solution is to use a third-party library. ImageMagick is very popular and is freely available too. You can get a .NET wrapper for it here. The original ImageMagick download page is here.
Convert PDF pages to image files using the Solid Framework (dead link; the deleted document is available on the Internet Archive).
Convert PDF to JPG Universal Document Converter
6 Ways to Convert a PDF to a JPG Image
And you also can take a look at the thread
"How to open a page from a pdf file in pictureBox in C#".
If you use this process to convert a PDF to tiff, you can use this class to retrieve the bitmap from TIFF.
public class TiffImage
{
private string myPath;
private Guid myGuid;
private FrameDimension myDimension;
public ArrayList myImages = new ArrayList();
private int myPageCount;
private Bitmap myBMP;
public TiffImage(string path)
{
MemoryStream ms;
Image myImage;
myPath = path;
FileStream fs = new FileStream(myPath, FileMode.Open);
myImage = Image.FromStream(fs);
myGuid = myImage.FrameDimensionsList[0];
myDimension = new FrameDimension(myGuid);
myPageCount = myImage.GetFrameCount(myDimension);
for (int i = 0; i < myPageCount; i++)
{
ms = new MemoryStream();
myImage.SelectActiveFrame(myDimension, i);
myImage.Save(ms, ImageFormat.Bmp);
myBMP = new Bitmap(ms);
myImages.Add(myBMP);
ms.Close();
}
fs.Close();
}
}
Use it like so:
private void button1_Click(object sender, EventArgs e)
{
TiffImage myTiff = new TiffImage("D:\\Some.tif");
//imageBox is a PictureBox control, and the [] operators pass back
//the Bitmap stored at that position in the myImages ArrayList in the TiffImage
this.pictureBox1.Image = (Bitmap)myTiff.myImages[0];
this.pictureBox2.Image = (Bitmap)myTiff.myImages[1];
this.pictureBox3.Image = (Bitmap)myTiff.myImages[2];
}
You can use Ghostscript to convert PDF to images.
To use Ghostscript from .NET you can take a look at Ghostscript.NET library (managed wrapper around the Ghostscript library).
To produce image from the PDF by using Ghostscript.NET, take a look at RasterizerSample.
To combine multiple images into the single image, check out this sample: http://www.niteshluharuka.com/2012/08/combine-several-images-to-form-a-single-image-using-c/#
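Since the question asks for one image containing all pages, the combining step can be sketched with System.Drawing alone. This is a minimal sketch, assuming the pages have already been rendered to Image objects by whichever renderer you chose; `PageStitcher` and `StackVertically` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

public static class PageStitcher
{
    // Stacks the given page images top to bottom on a white canvas.
    // The caller owns the returned bitmap and must dispose it.
    public static Bitmap StackVertically(IReadOnlyList<Image> pages)
    {
        if (pages == null || pages.Count == 0)
            throw new ArgumentException("At least one page image is required.", nameof(pages));

        // The canvas is as wide as the widest page and as tall as all pages together.
        int width = 0, height = 0;
        foreach (var page in pages)
        {
            width = Math.Max(width, page.Width);
            height += page.Height;
        }

        var combined = new Bitmap(width, height);
        using (var g = Graphics.FromImage(combined))
        {
            g.Clear(Color.White);
            int y = 0;
            foreach (var page in pages)
            {
                // Passing an explicit size keeps the page at its pixel dimensions.
                g.DrawImage(page, 0, y, page.Width, page.Height);
                y += page.Height;
            }
        }
        return combined;
    }
}
```

For long documents at high resolution the combined bitmap can become very large, so rendering at a modest DPI is probably wise.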
As of 2018 there is still no simple answer to the question of how to convert a PDF document to an image in C#; many libraries use Ghostscript, which is licensed under the AGPL, and in most cases an expensive commercial license is required for production use.
A good alternative might be the popular 'pdftoppm' utility, which has a GPL license; it can be used from C# as a command-line tool executed with System.Diagnostics.Process. The Poppler tools are well known in the Linux world, but a Windows build is also available.
If you don't want to integrate pdftoppm yourself, you can use my PdfRenderer Poppler wrapper (supports both classic .NET Framework and .NET Core). It is not free, but pricing is very affordable.
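For those who do want to call pdftoppm directly, a minimal sketch of the System.Diagnostics.Process approach follows. It assumes the pdftoppm executable is on the PATH; `PdfToPpmRunner` and `RenderPages` are hypothetical names. The `-png` and `-r` options select PNG output and the resolution in DPI:

```csharp
using System;
using System.Diagnostics;

public static class PdfToPpmRunner
{
    // Runs: pdftoppm -png -r <dpi> <pdfPath> <outputPrefix>
    // pdftoppm writes one image file per page, named from outputPrefix plus the page number.
    public static void RenderPages(string pdfPath, string outputPrefix, int dpi = 150)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "pdftoppm", // assumption: the executable is on the PATH
            Arguments = $"-png -r {dpi} \"{pdfPath}\" \"{outputPrefix}\"",
            UseShellExecute = false,
            RedirectStandardError = true,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            // Read stderr before waiting so a full pipe cannot block the child process.
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
                throw new InvalidOperationException(
                    $"pdftoppm exited with code {process.ExitCode}: {errors}");
        }
    }
}
```

Keep in mind that this distributes or depends on a GPL-licensed tool, so the licensing point above still applies.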
I used PDFiumSharp and ImageSharp in a .NET Standard 2.1 class library.
/// <summary>
/// Saves a thumbnail (jpg) to the same folder as the PDF file, using dimensions 300x423,
/// which corresponds to the aspect ratio of 'A' paper sizes like A4 (ratio h/w=sqrt(2))
/// </summary>
/// <param name="pdfPath">Source path of the pdf file.</param>
/// <param name="thumbnailPath">Target path of the thumbnail file.</param>
/// <param name="width"></param>
/// <param name="height"></param>
public static void SaveThumbnail(string pdfPath, string thumbnailPath = "", int width = 300, int height = 423)
{
using var pdfDocument = new PdfDocument(pdfPath);
var firstPage = pdfDocument.Pages[0];
using var pageBitmap = new PDFiumBitmap(width, height, true);
firstPage.Render(pageBitmap);
var imageJpgPath = string.IsNullOrWhiteSpace(thumbnailPath)
? Path.ChangeExtension(pdfPath, "jpg")
: thumbnailPath;
var image = Image.Load(pageBitmap.AsBmpStream());
// Set the background to white, otherwise it's black. https://github.com/SixLabors/ImageSharp/issues/355#issuecomment-333133991
image.Mutate(x => x.BackgroundColor(Rgba32.White));
image.Save(imageJpgPath, new JpegEncoder());
}
Searching for a powerful and free solution in dotnet core that works on Windows and Linux got me to https://github.com/Dtronix/PDFiumCore and https://github.com/GowenGit/docnet. As PDFiumCore uses a much newer version of Pdfium (which seems to be a critical point when choosing a PDF library), I ended up using it.
Note: If you want to use it on Linux you should install 'libgdiplus' as https://stackoverflow.com/a/59252639/6339469 suggests.
Here's a simple single-threaded example:
var pageIndex = 0;
var scale = 2;
fpdfview.FPDF_InitLibrary();
var document = fpdfview.FPDF_LoadDocument("test.pdf", null);
var page = fpdfview.FPDF_LoadPage(document, pageIndex);
var size = new FS_SIZEF_();
fpdfview.FPDF_GetPageSizeByIndexF(document, pageIndex, size);
var width = (int)Math.Round(size.Width * scale);
var height = (int)Math.Round(size.Height * scale);
var bitmap = fpdfview.FPDFBitmapCreateEx(
width,
height,
4, // BGRA
IntPtr.Zero,
0);
fpdfview.FPDFBitmapFillRect(bitmap, 0, 0, width, height, (uint)Color.White.ToArgb());
// | | a b 0 |
// | matrix = | c d 0 |
// | | e f 1 |
using var matrix = new FS_MATRIX_();
using var clipping = new FS_RECTF_();
matrix.A = scale;
matrix.B = 0;
matrix.C = 0;
matrix.D = scale;
matrix.E = 0;
matrix.F = 0;
clipping.Left = 0;
clipping.Right = width;
clipping.Bottom = 0;
clipping.Top = height;
fpdfview.FPDF_RenderPageBitmapWithMatrix(bitmap, page, matrix, clipping, (int)RenderFlags.RenderAnnotations);
var bitmapImage = new Bitmap(
width,
height,
fpdfview.FPDFBitmapGetStride(bitmap),
PixelFormat.Format32bppArgb,
fpdfview.FPDFBitmapGetBuffer(bitmap));
bitmapImage.Save("test.jpg", ImageFormat.Jpeg);
For a thread safe implementation see this:
https://github.com/hmdhasani/DtronixPdf/blob/master/src/DtronixPdfBenchmark/Program.cs
The PDF engine used in Google Chrome, called PDFium, is open source under the "BSD 3-clause" license. I believe this allows redistribution when used in a commercial product.
There is a .NET wrapper for it called PdfiumViewer (NuGet) which works well to the extent I have tried it. It is under the Apache license which also allows redistribution.
(Note that this is NOT the same 'wrapper' as https://pdfium.patagames.com/ which requires a commercial license)
(There is one other PDFium .NET wrapper, PDFiumSharp, but I have not evaluated it.)
In my opinion, so far, this may be the best choice among the open-source (free as in beer) PDF libraries that do the job without putting restrictions on the closed-source / commercial nature of the software using them. To the best of my knowledge, nothing else in the answers here satisfies that criterion.
Regarding PDFiumSharp: after some work, I was able to create PNG files from a PDF.
This is my code:
using PDFiumSharp;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
public class Program
{
    public static void Main(string[] args)
    {
        var renderfoo = new Renderfoo();
        renderfoo.RenderPDFAsImages(@"C:\Temp\example.pdf", @"C:\temp");
    }
}
public class Renderfoo
{
    public void RenderPDFAsImages(string Inputfile, string OutputFolder)
    {
        string fileName = Path.GetFileNameWithoutExtension(Inputfile);
        using (PDFiumSharp.PdfDocument doc = new PDFiumSharp.PdfDocument(Inputfile))
        {
            for (int i = 0; i < doc.Pages.Count; i++)
            {
                var page = doc.Pages[i];
                using (var bitmap = new System.Drawing.Bitmap((int)page.Width, (int)page.Height))
                {
                    // Fill the background with white before rendering the page onto it.
                    using (var graphics = Graphics.FromImage(bitmap))
                    {
                        graphics.Clear(Color.White);
                    }
                    page.Render(bitmap);
                    var targetFile = Path.Combine(OutputFolder, fileName + "_" + i + ".png");
                    bitmap.Save(targetFile);
                }
            }
        }
    }
}
For starters, you need to take the following steps to get the PDFium wrapper up and running:
Run the Custom Code tool for both tt files via right click in Visual Studio
Compile the GDIPlus Project
Copy the compiled assemblies (from the GDIPlus project) to your project
Reference both PDFiumSharp and PDFiumsharp.GdiPlus assemblies in your project
Make sure that pdfium_x64.dll and/or pdfium_x86.dll are both found in your project output directory.
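The code above sizes the bitmap directly from page.Width and page.Height. PDF page dimensions are normally expressed in points (72 per inch), so that gives roughly 72 DPI output. A minimal sketch of the conversion for a higher resolution follows; it assumes PDFiumSharp reports page sizes in points and that Render scales the page to the bitmap size, neither of which is confirmed in the answer. `PageSize` and `ToPixels` are hypothetical names:

```csharp
using System;

public static class PageSize
{
    // PDF user space defaults to 72 units (points) per inch.
    private const double PointsPerInch = 72.0;

    // Converts a page dimension in points to pixels at the requested DPI.
    public static int ToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / PointsPerInch);
    }
}
```

For example, an A4 page is about 595 points wide, so `PageSize.ToPixels(595, 150)` gives 1240 pixels at 150 DPI.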
Take a look at Freeware.Pdf2Png (MIT license).
Just search for that name on NuGet.
var dd = System.IO.File.ReadAllBytes("pdffile.pdf");
byte[] pngByte = Freeware.Pdf2Png.Convert(dd, 1);
System.IO.File.WriteAllBytes(Path.Combine(@"C:\temp", "dd.png"), pngByte);
The NuGet package Pdf2Png is available for free and is only protected by the MIT License, which is very open.
I've tested it a bit, and this is the code to get it to convert a PDF file to an image (it saves the image in the debug folder).
using cs_pdf_to_image;
using PdfToImage;
private void BtnConvert_Click(object sender, EventArgs e)
{
if(openFileDialog1.ShowDialog() == DialogResult.OK)
{
try
{
string PdfFile = openFileDialog1.FileName;
string PngFile = "Convert.png";
List<string> Conversion = cs_pdf_to_image.Pdf2Image.Convert(PdfFile, PngFile);
Bitmap Output = new Bitmap(PngFile);
PbConversion.Image = Output;
}
catch(Exception E)
{
MessageBox.Show(E.Message);
}
}
}
Apache PDFBox also works great for me.
Usage with the command line tool:
java -jar pdfbox-app-2.0.19.jar PDFToImage -quality 1.0 -dpi 150 -prefix out_dir/page -format png
There is a free NuGet package (Pdf2Image) which allows the extraction of PDF pages to JPG files or to a collection of images (List<Image>) in just one line:
string file = "c:\\tmp\\test.pdf";
List<System.Drawing.Image> images = PdfSplitter.GetImages(file, PdfSplitter.Scale.High);
PdfSplitter.WriteImages(file, "c:\\tmp", PdfSplitter.Scale.High, PdfSplitter.CompressionLevel.Medium);
All source is also available on github Pdf2Image
Using Android's built-in PdfRenderer class, you can convert all the PDF pages into images quickly. The code below renders each page of the PDF to a separate image.
ParcelFileDescriptor fileDescriptor = ParcelFileDescriptor.open(new File("pdfFilePath.pdf"), MODE_READ_ONLY);
PdfRenderer renderer = new PdfRenderer(fileDescriptor);
final int pageCount = renderer.getPageCount();
for (int i = 0; i < pageCount; i++) {
PdfRenderer.Page page = renderer.openPage(i);
Bitmap bitmap = Bitmap.createBitmap(page.getWidth(), page.getHeight(),Bitmap.Config.ARGB_8888);
Canvas canvas = new Canvas(bitmap);
canvas.drawColor(Color.WHITE);
canvas.drawBitmap(bitmap, 0, 0, null);
page.render(bitmap, null, null, PdfRenderer.Page.RENDER_MODE_FOR_DISPLAY);
page.close();
if (bitmap == null)
return null;
if (bitmapIsBlankOrWhite(bitmap))
return null;
String root = Environment.getExternalStorageDirectory().toString();
File file = new File(root + filename + ".png");
if (file.exists()) file.delete();
try {
FileOutputStream out = new FileOutputStream(file);
bitmap.compress(Bitmap.CompressFormat.PNG, 100, out);
Log.v("Saved Image - ", file.getAbsolutePath());
out.flush();
out.close();
} catch (Exception e) {
e.printStackTrace();
}
}
=======================================================
private static boolean bitmapIsBlankOrWhite(Bitmap bitmap) {
if (bitmap == null)
return true;
int w = bitmap.getWidth();
int h = bitmap.getHeight();
for (int i = 0; i < w; i++) {
for (int j = 0; j < h; j++) {
int pixel = bitmap.getPixel(i, j);
if (pixel != Color.WHITE) {
return false;
}
}
}
return true;
}
I kind of bumped into this project at SourceForge. It seems to me it's still active.
PDF convert to JPEG at SourceForge
Developer's site
My two cents.
https://www.codeproject.com/articles/317700/convert-a-pdf-into-a-series-of-images-using-csharp
I found this GhostScript wrapper to be working like a charm for converting the PDFs to PNGs, page by page.
Usage:
string pdf_filename = @"C:\TEMP\test.pdf";
var pdf2Image = new Cyotek.GhostScript.PdfConversion.Pdf2Image(pdf_filename);
// The loop starts at page 1, so it must run up to and including PageCount.
for (var page = 1; page <= pdf2Image.PageCount; page++)
{
    string png_filename = @"C:\TEMP\test" + page + ".png";
    pdf2Image.ConvertPdfPageToImage(png_filename, page);
}
Being built on GhostScript, obviously for commercial application the licensing question remains.
(Disclaimer: I worked on this component at Software Siglo XXI.)
You could use Super Pdf2Image Converter to generate a TIFF multi-page file with all the rendered pages from the PDF in high resolution. It's available for both 32 and 64 bit and is very cheap and effective. I'd recommend you to try it.
Just one line of code...
GetImage(outputFileName, firstPage, lastPage, resolution, imageFormat)
Converts the specified pages to images and saves them to outputFileName (TIFF allows multi-page output; other formats create several files).
You can take a look here: http://softwaresigloxxi.com/SuperPdf2ImageConverter.html

Aspose pdf417 recognition

I want to read the content of a pdf417 barcode contained in a pdf file using C#. I wrote the following code:
[...]
// bind the pdf document
Aspose.Pdf.Facades.PdfExtractor pdfExtractor = new Aspose.Pdf.Facades.PdfExtractor();
pdfExtractor.BindPdf(ImageFullPath);
pdfExtractor.StartPage = 1;
pdfExtractor.EndPage = 1;
// extract the images
pdfExtractor.ExtractImage();
//save images to stream in a loop
while (pdfExtractor.HasNextImage())
{
// save image to stream
MemoryStream imageStream = new MemoryStream();
pdfExtractor.GetNextImage(imageStream);
imageStream.Position = 0;
// recognize the barcode from the image stream above
System.Drawing.Image img = Image.FromStream(imageStream);
Aspose.BarCodeRecognition.BarCodeReader barcodeReader = new Aspose.BarCodeRecognition.BarCodeReader(imageStream, BarCodeReadType.Pdf417);
while (barcodeReader.Read())
{
Console.WriteLine("Codetext found: " + barcodeReader.GetCodeBytes());
}
// close the reader
barcodeReader.Close();
}
Console.WriteLine("Done");
[...]
I know that the content of the barcode is "OB|090547db800b6c47": the problem is that the output I obtain is "Codetext found: OBAQAQOB|0*6AJAFEHdbhDrh".
Does anyone know what I'm doing wrong?
I copied your code, made just the one change below, and got the output "Codetext found: OB|090547db800b6c47".
Console.WriteLine("Codetext found: " + barcodeReader.GetCodeText());
I used Aspose.BarCode for .NET version 5.5 in a .NET 4.5 project. Which version are you using?
PS. I am a Developer Evangelist at Aspose.

Generating a multipage TIFF is not working

I'm trying to generate a multipage TIFF file from an existing picture using code by Bob Powell:
picture.SelectActiveFrame(FrameDimension.Page, 0);
var image = new Bitmap(picture);
using (var stream = new MemoryStream())
{
ImageCodecInfo codecInfo = null;
foreach (var imageEncoder in ImageCodecInfo.GetImageEncoders())
{
if (imageEncoder.MimeType != "image/tiff") continue;
codecInfo = imageEncoder;
break;
}
var parameters = new EncoderParameters
{
Param = new []
{
new EncoderParameter(Encoder.SaveFlag, (long) EncoderValue.MultiFrame)
}
};
image.Save(stream, codecInfo, parameters);
parameters = new EncoderParameters
{
Param = new[]
{
new EncoderParameter(Encoder.SaveFlag, (long) EncoderValue.FrameDimensionPage)
}
};
for (var i = 1; i < picture.GetFrameCount(FrameDimension.Page); i++)
{
picture.SelectActiveFrame(FrameDimension.Page, i);
var img = new Bitmap(picture);
image.SaveAdd(img, parameters);
}
parameters = new EncoderParameters
{
Param = new[]
{
new EncoderParameter(Encoder.SaveFlag, (long)EncoderValue.Flush)
}
};
image.SaveAdd(parameters);
stream.Flush();
}
But it's not working (only the first frame is included in the image) and I don't know why.
What I want to do is to change a particular frame of a TIFF file (add annotations to it).
I don't know if there's a simpler way to do it but what I have in mind is to create a multipage TIFF from the original picture and add my own picture instead of that frame.
[deleted first part after comment]
I'm working with multi-page TIFFs using LibTIFF.NET; I found many quirks in the handling of TIFFs by the standard libraries (memory related, and also consistent crashes on 16-bit grayscale images).
What is your test image? Have you tried a many-frame TIFF (preferably with a large '1' on the first frame, a '2' on the next, etc.)? This could help you be certain which frames are included in the file.
Another useful diagnosis may be tiffdump utility, as included in LibTiff binaries (also for windows). This will tell you exactly what frames you have.
See Using LibTiff from c# to access tiled tiff images
[Edit] If you want to understand the .NET stuff: I've found a new resource on multi-page TIFFs using the standard .NET functionality (although I'll stick with LibTIFF.NET): The Code Project: Save images into a multi-page TIFF file... If you download it, the code snippet in the Form1.cs function saveMultipage(..) is similar to (but still slightly different from) your code. In particular, the flushing at the end is done in a different way, and the file is deleted before the first frame...
[/Edit]
It seems that this process doesn't change image object but it changes the stream so I should get the memory stream buffer and build another image object:
var buffer=stream.GetBuffer();
using(var newStream=new MemoryStream(buffer))
{
var result=Image.FromStream(newStream);
}
Now result will include all frames.
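One caveat with the snippet above: MemoryStream.GetBuffer returns the whole internal buffer, which can be larger than the data actually written, while ToArray returns a copy of exactly the written bytes. A minimal sketch of the difference, using only the base class library (`MemoryStreamDemo` is a hypothetical name):

```csharp
using System;
using System.IO;

public static class MemoryStreamDemo
{
    public static void Main()
    {
        using (var stream = new MemoryStream())
        {
            stream.Write(new byte[] { 1, 2, 3 }, 0, 3);

            byte[] raw = stream.GetBuffer();  // whole internal buffer, may include unused capacity
            byte[] exact = stream.ToArray();  // copy of exactly the bytes written

            Console.WriteLine(exact.Length == stream.Length);  // True
            Console.WriteLine(raw.Length >= exact.Length);     // True
        }
    }
}
```

Using `stream.ToArray()` when building the second stream avoids feeding trailing unused bytes to Image.FromStream. Note also that the stream passed to Image.FromStream has to stay open for as long as the image is in use.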
