Image extraction encoding issue with itextsharp - c#

I extract the pictures found in a PDF document with itextsharp using this snippet (thanks @Scott Stanford from this topic):
private static IList<System.Drawing.Image> GetImagesFromPdfDict(PdfDictionary dict, PdfReader doc)
{
List<Image> images = new List<Image>();
if (dict == null)
return images;
PdfDictionary res = (PdfDictionary)(PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES)));
if (res == null)
return images;
PdfDictionary xobj = (PdfDictionary)(PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)));
if (xobj == null)
return images;
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)(PdfReader.GetPdfObject(obj));
PdfName subtype = (PdfName)(PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)));
if (PdfName.IMAGE.Equals(subtype))
{
int xrefIdx = ((PRIndirectReference)obj).Number;
PdfObject pdfObj = doc.GetPdfObject(xrefIdx);
PdfStream str = (PdfStream)(pdfObj);
iTextSharp.text.pdf.parser.PdfImageObject pdfImage =
new iTextSharp.text.pdf.parser.PdfImageObject((PRStream)str);
System.Drawing.Image img = pdfImage.GetDrawingImage();
images.Add(img);
}
else if (PdfName.FORM.Equals(subtype) || PdfName.GROUP.Equals(subtype))
{
images.AddRange(GetImagesFromPdfDict(tg, doc));
}
}
}
return images;
}
Then I save the extracted System.Drawing.Image into jpeg files like this :
image.Save(path, ImageFormat.Jpeg);
This works well for most pictures, but in some rare cases, the saved pictures look like this :
(I have added the black stroke after the generation of the image because these pictures concern real people).
The white color turns into pink, and the black colors turn into green shades.
I tried to save the System.Drawing.Image with several encodings (System.Drawing.Imaging.EncoderParameter, also with PNG...) but I did not manage to change its output. So I think this problem comes from the extraction of the image from the PDF and the creation of the System.Drawing.Image.
To test if the pictures are not corrupted, I tried with the online PDF extractor http://www.extractpdf.com/. This tool managed to extract these pictures without any problem.
Does anybody have an idea to solve this issue ?

Related

Working with images of different file types [duplicate]

I am working on reading text from an image through OCR. It only supports TIFF format images.
So, I need to convert other formats to TIFF format. Can it be done? Please help by providing some references.
If you create an Image object in .NET, you can save it as a TIFF. It is one of the many ImageFormat choices at your disposal.
Example:
var png = Image.FromFile("some.png");
png.Save("a.tiff", ImageFormat.Tiff);
You'll need to include the System.Drawing assembly in your project. That assembly will give you a lot of image manipulation capabilities. Hope that helps.
Intro Note:
This answer covers the bounty question, which is: How do we convert multiple files into 1 tiff? For example, let's say I have pdfs, jpegs, pngs, and I'd like to create 1 tiff out of them?
In this answer I use the .NET implementation of ImageMagick (https://imagemagick.org/index.php) for image manipulation and Ghostscript for reading AI/EPS/PDF/PS files so they can be converted to image files. Both are credible, official sources.
After I answered this question I got some extra questions by email asking about other merging options, so I have extended my answer.
IMO there are 2 steps to your goal:
Install the required tools for PDF conversion.
Take all images, including PDF files, from the source folder and merge them into one tiff file.
1. Install tools that help with PDF to image conversion:
Step 1 is only required if you intend to convert AI/EPS/PDF/PS file formats. Otherwise just jump to step 2.
To convert a PDF to any image format, we need a library that can read PDF files and a tool that converts them to an image type. For this purpose, we need to install Ghostscript (GNU Affero General Public License).
After that, we need to install Magick.NET (the .NET library for ImageMagick) in Visual Studio via NuGet.
So far so good.
2. Code part
The second and last step is to read the files (png, jpg, bmp, pdf etc.) from the folder and add each file to a MagickImageCollection. We then have several options for merging: AppendHorizontally, AppendVertically, Montage or a multi-page TIFF. ImageMagick has tons of features, like resizing, resolution etc.; this is just an example to demonstrate the merging features:
public static void MergeImage(string src, string dest, MergeType type = MergeType.MultiplePage)
{
var files = new DirectoryInfo(src).GetFiles();
using (var images = new MagickImageCollection())
{
foreach (var file in files)
{
var image = new MagickImage(file)
{
Format = MagickFormat.Tif,
Depth = 8,
};
images.Add(image);
}
switch (type)
{
case MergeType.Vertical:
using (var result = images.AppendVertically())
{
result.AdaptiveResize(new MagickGeometry(){Height = 600, Width = 800});
result.Write(dest);
}
break;
case MergeType.Horizontal:
using (var result = images.AppendHorizontally())
{
result.AdaptiveResize(new MagickGeometry(){Height = 600, Width = 800});
result.Write(dest);
}
break;
case MergeType.Montage:
var settings = new MontageSettings
{
BackgroundColor = new MagickColor("#FFF"),
Geometry = new MagickGeometry("1x1<")
};
using (var result = images.Montage(settings))
{
result.Write(dest);
}
break;
case MergeType.MultiplePage:
images.Write(dest);
break;
default:
throw new ArgumentOutOfRangeException(nameof(type), type, "Unsupported choice");
}
// no explicit images.Dispose() needed: the enclosing using block disposes the collection
}
}
public enum MergeType
{
MultiplePage,
Vertical,
Horizontal,
Montage
}
To run the code
public static void Main(string[] args)
{
var src = @"C:\temp\Images";
var dest1 = @"C:\temp\Output\MultiplePage.tiff";
var dest2 = @"C:\temp\Output\Vertical.tiff";
var dest3 = @"C:\temp\Output\Horizontal.tiff";
var dest4 = @"C:\temp\Output\Montage.tiff";
MergeImage(src, dest1);
MergeImage(src, dest2, MergeType.Vertical);
MergeImage(src, dest3, MergeType.Horizontal);
MergeImage(src, dest4, MergeType.Montage);
}
Here are the 4 input files in C:\temp\Images:
After running the code, we get 4 new files under C:\temp\Output that look like this:
4 page Multiple Page Tiff
4 image Vertical Merge
4 image Horizontal Merge
4 image Montage Merge
Final note:
It is possible to merge multiple images into a tiff using System.Drawing and System.Drawing.Imaging without ImageMagick, but PDF does require a third-party conversion library or tool, which is why I use Ghostscript and ImageMagick for C#.
ImageMagick has many features, so you can change the resolution, the size of the output file etc. It is a well-recognized library.
Disclaimer: A part of this answer is taken from my personal web site https://itbackyard.com/how-to-convert-ai-eps-pdf-ps-to-image-file/ with source code on GitHub.
To convert the image to TIFF format: the example below saves the image as a .tif file and writes the full path of the new file to a text box. This source code works.
private void btn_Convert(object sender, EventArgs e)
{
string newName = System.IO.Path.GetFileNameWithoutExtension(CurrentFile);
newName = newName + ".tif";
try
{
img.Save(newName, ImageFormat.Tiff);
}
catch (Exception ex)
{
// show the actual error message; the original referenced an undefined variable "ee"
MessageBox.Show(ex.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
}
textBox2.Text = System.IO.Path.GetFullPath(newName.ToString());
}
I tested this with jpg, bmp, png, and gif. Works for single and multipage creation of tiffs. Pass it a full pathname to the file. Hope it helps someone. (extracted from MSDN)
public static string[] ConvertJpegToTiff(string[] fileNames, bool isMultipage)
{
EncoderParameters encoderParams = new EncoderParameters(1);
ImageCodecInfo tiffCodecInfo = ImageCodecInfo.GetImageEncoders()
.First(ie => ie.MimeType == "image/tiff");
string[] tiffPaths = null;
if (isMultipage)
{
tiffPaths = new string[1];
System.Drawing.Image tiffImg = null;
try
{
for (int i = 0; i < fileNames.Length; i++)
{
if (i == 0)
{
tiffPaths[i] = String.Format("{0}\\{1}.tif",
Path.GetDirectoryName(fileNames[i]),
Path.GetFileNameWithoutExtension(fileNames[i]));
// Initialize the first frame of multipage tiff.
tiffImg = System.Drawing.Image.FromFile(fileNames[i]);
encoderParams.Param[0] = new EncoderParameter(
System.Drawing.Imaging.Encoder.SaveFlag, (long)EncoderValue.MultiFrame);
tiffImg.Save(tiffPaths[i], tiffCodecInfo, encoderParams);
}
else
{
// Add additional frames.
encoderParams.Param[0] = new EncoderParameter(
System.Drawing.Imaging.Encoder.SaveFlag, (long)EncoderValue.FrameDimensionPage);
using (System.Drawing.Image frame = System.Drawing.Image.FromFile(fileNames[i]))
{
tiffImg.SaveAdd(frame, encoderParams);
}
}
if (i == fileNames.Length - 1)
{
// When it is the last frame, flush the resources and closing.
encoderParams.Param[0] = new EncoderParameter(
System.Drawing.Imaging.Encoder.SaveFlag, (long)EncoderValue.Flush);
tiffImg.SaveAdd(encoderParams);
}
}
}
finally
{
if (tiffImg != null)
{
tiffImg.Dispose();
tiffImg = null;
}
}
}
else
{
tiffPaths = new string[fileNames.Length];
for (int i = 0; i < fileNames.Length; i++)
{
tiffPaths[i] = String.Format("{0}\\{1}.tif",
Path.GetDirectoryName(fileNames[i]),
Path.GetFileNameWithoutExtension(fileNames[i]));
// Save as individual tiff files.
using (System.Drawing.Image tiffImg = System.Drawing.Image.FromFile(fileNames[i]))
{
tiffImg.Save(tiffPaths[i], ImageFormat.Tiff);
}
}
}
return tiffPaths;
}
ImageMagick command line can do that easily. It is supplied on most Linux systems and is available for Mac or Windows also. See https://imagemagick.org/script/download.php
convert image.suffix -compress XXX image.tiff
or you can process a whole folder of files using
mogrify -format tiff -path path/to/output_directory *
ImageMagick supports combining multiple images into a multi-page TIFF. And the images can be of mixed types even including PDF.
convert image1.suffix1 image2.suffix2 ... -compress XXX imageN.suffixN output.tiff
You can choose from a number of compression formats or no compression.
See
https://imagemagick.org/script/command-line-processing.php
https://imagemagick.org/Usage/basics/
https://imagemagick.org/Usage/basics/#mogrify
https://imagemagick.org/script/command-line-options.php#compress
Or you can use Magick.Net for a C# interface. See https://github.com/dlemstra/Magick.NET
Main ImageMagick page is at https://imagemagick.org.
Supported formats are listed at https://imagemagick.org/script/formats.php
You can easily process your images to resize them, convert to grayscale, filter (sharpen), threshold, etc, all in the same command line.
See
https://imagemagick.org/Usage/
https://imagemagick.org/Usage/reference.html
This is how I convert images that are uploaded to a website. I changed it so that it outputs TIFF files. The method takes and returns a byte array so it can easily be used in a variety of ways, but you can easily modify it.
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;
using System.IO;
public byte[] ConvertImageToTiff(byte[] SourceImage)
{
//create a new byte array
byte[] bin = new byte[0];
//check if there is data
if (SourceImage == null || SourceImage.Length == 0)
{
return bin;
}
//convert the byte array to a bitmap and redraw it onto a fresh bitmap;
//GDI+ needs the source stream to stay open while the source bitmap is in use
Bitmap NewImage;
using (MemoryStream ms = new MemoryStream(SourceImage))
using (Bitmap source = new Bitmap(ms))
{
NewImage = new Bitmap(source.Width, source.Height);
//set some properties
using (Graphics g = Graphics.FromImage(NewImage))
{
g.CompositingMode = CompositingMode.SourceCopy;
g.CompositingQuality = CompositingQuality.HighQuality;
g.SmoothingMode = SmoothingMode.HighQuality;
g.InterpolationMode = InterpolationMode.HighQualityBicubic;
g.PixelOffsetMode = PixelOffsetMode.HighQuality;
g.DrawImage(source, 0, 0, source.Width, source.Height);
}
}
//save the image to a stream
using (MemoryStream ms = new MemoryStream())
{
EncoderParameters encoderParameters = new EncoderParameters(1);
encoderParameters.Param[0] = new EncoderParameter(Encoder.Quality, 80L);
NewImage.Save(ms, GetEncoderInfo("image/tiff"), encoderParameters);
bin = ms.ToArray();
}
//cleanup (the source bitmap was already disposed by its using block)
NewImage.Dispose();
//return data
return bin;
}
//get the correct encoder info
public ImageCodecInfo GetEncoderInfo(string MimeType)
{
ImageCodecInfo[] encoders = ImageCodecInfo.GetImageEncoders();
for (int j = 0; j < encoders.Length; ++j)
{
if (encoders[j].MimeType.ToLower() == MimeType.ToLower())
return encoders[j];
}
return null;
}
To test
var oldImage = File.ReadAllBytes(Server.MapPath("OldImage.jpg"));
var newImage = ConvertImageToTiff(oldImage);
File.WriteAllBytes(Server.MapPath("NewImage.tiff"), newImage);

How to embed all fonts from other PDF iText 7

I am trying to overlay two PDF files using iText7/C#.
The first one is kind of background and the second one is containing form fields.
Everything works fine; the only problem is that I lose the fonts from the second file.
I try as follows:
static public bool Overlay(string back_path, string front_path, string merge_path)
{
PdfReader reader;
PdfDocument pdf = null, front;
try
{
reader = new PdfReader(back_path);
pdf = new PdfDocument(reader, new PdfWriter(merge_path));
front = new PdfDocument(new PdfReader(front_path));
var form = PdfAcroForm.GetAcroForm(front, false);
PdfAcroForm dform = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<String, PdfFormField> fields = form.GetFormFields();
// copy styles
dform.SetDefaultResources(form.GetDefaultResources());
dform.SetDefaultAppearance(form.GetDefaultAppearance().GetValue());
// do overlay
foreach (KeyValuePair<string, PdfFormField> pair in fields)
{
try
{
var field = pair.Value;
PdfPage page = field.GetWidgets().First().GetPage();
int pg_no = front.GetPageNumber(page);
if (pg_no < front_start_page || pg_no > front_end_page)
continue;
PdfObject copied = field.GetPdfObject().CopyTo(pdf, true);
PdfFormField copiedField = PdfFormField.MakeFormField(copied, pdf);
// The following returns null. If it returns something, I think I could use copiedField.setFont(font).
// var font = field.GetFont();
dform.AddField(copiedField, pdf.GetPage(pg_no));
}
catch (Exception ex)
{
System.Diagnostics.Debug.WriteLine($"Overlaying field {pair.Key} failed. ({ex.Message})");
}
}
pdf.Close();
return true;
}
catch (Exception ex)
{
throw new OverlayException(ex.Message);
}
}
public static PdfDictionary get_font_dict(PdfDocument pdfDoc)
{
PdfDictionary acroForm = pdfDoc.GetCatalog().GetPdfObject().GetAsDictionary(PdfName.AcroForm);
if (acroForm == null)
{
return null;
}
PdfDictionary dr = acroForm.GetAsDictionary(PdfName.DR);
if (dr == null)
{
return null;
}
PdfDictionary font = dr.GetAsDictionary(PdfName.Font);
return font;
}
So basically I get all fonts from the second PDF and copy them to the final PDF.
But it does not work.
Logically, I think setting font of the original field to the copied one is the right way.
I mean PdfFormField.GetFont() and SetFont().
But it always returns null.
In a comment you clarified:
the background PDF can be assumed not to have form fields or annotations. I mean we can assume background PDF only contains static content (scanned form) and the front PDF only contains formfields.
In that case the easiest way to implement your method is to add the background as xobject to the form PDF instead of adding the form to the background PDF.
You can simply do that like this:
PdfReader formReader = new PdfReader(front_path);
PdfReader backReader = new PdfReader(back_path);
PdfWriter writer = new PdfWriter(merge_path);
using (PdfDocument source = new PdfDocument(backReader))
using (PdfDocument target = new PdfDocument(formReader, writer))
{
PdfFormXObject xobject = source.GetPage(1).CopyAsFormXObject(target);
PdfPage targetFirstPage = target.GetFirstPage();
PdfStream stream = targetFirstPage.NewContentStreamBefore();
PdfCanvas pdfCanvas = new PdfCanvas(stream, targetFirstPage.GetResources(), target);
Rectangle cropBox = targetFirstPage.GetCropBox();
pdfCanvas.AddXObject(xobject, cropBox.GetX(), cropBox.GetY());
}
Depending on the exact static contents of the background and the form PDF, you might want to use NewContentStreamAfter instead of NewContentStreamBefore or even to use some nifty blend mode to get the exact static content look you want.

Convert image to pdf using itextsharp [duplicate]

I have the following code, but it adds only the last image to the PDF.
try {
filePath = (filePath != null && filePath.endsWith(".pdf")) ? filePath
: filePath + ".pdf";
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document,
new FileOutputStream(filePath));
document.open();
// document.add(new Paragraph("Image Example"));
for (String imageIpath : imagePathsList) {
// Add Image
Image image1 = Image.getInstance(imageIpath);
// Fixed Positioning
image1.setAbsolutePosition(10f, 10f);
// Scale to new height and new width of image
image1.scaleAbsolute(600, 800);
// image1.scalePercent(0.5f);
// Add to document
document.add(image1);
//document.bottom();
}
writer.close();
} catch (Exception e) {
LOGGER.error(e.getMessage());
}
Would you give me a hint about how to update the code in order to add all the images to the exported PDF? imagePathsList contains the paths of all the images that I want to add to a single PDF.
Best Regards,
Aurelian
Take a look at the MultipleImages example and you'll discover that there are two errors in your code:
You create a page with size 595 x 842 user units, and you add every image to that page regardless of the dimensions of the image.
You claim that only one image is added, but that's not true. You are adding all the images on top of each other on the same page. The last image covers all the preceding images.
Take a look at my code:
public void createPdf(String dest) throws IOException, DocumentException {
Image img = Image.getInstance(IMAGES[0]);
Document document = new Document(img);
PdfWriter.getInstance(document, new FileOutputStream(dest));
document.open();
for (String image : IMAGES) {
img = Image.getInstance(image);
document.setPageSize(img);
document.newPage();
img.setAbsolutePosition(0, 0);
document.add(img);
}
document.close();
}
I create a Document instance using the size of the first image. I then loop over an array of images, setting the page size of the next page to the size of each image before I trigger a newPage() [*]. Then I add the image at coordinate 0, 0 because now the size of the image will match the size of each page.
[*] The newPage() method only has an effect if something was added to the current page. The first time you go through the loop, nothing has been added yet, so nothing happens. This is why you need to set the page size to the size of the first image when you create the Document instance.
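If the pages have to keep a fixed size instead (for example the default 595 x 842 user units mentioned above), each image can be scaled uniformly so that it fits the page. The following is only a minimal sketch of that arithmetic in plain Java, with no iText calls; the class and method names are made up for illustration.

```java
final class FitToPage {
    // Returns {width, height} of the image after uniform scaling so that it
    // fits inside the page while keeping its aspect ratio.
    static float[] scaleToFit(float imgW, float imgH, float pageW, float pageH) {
        // the smaller ratio is the one that keeps both dimensions inside the page
        float factor = Math.min(pageW / imgW, pageH / imgH);
        return new float[] { imgW * factor, imgH * factor };
    }

    public static void main(String[] args) {
        // hypothetical landscape image placed on a 595 x 842 page
        float[] size = scaleToFit(1190f, 842f, 595f, 842f);
        System.out.println(size[0] + " x " + size[1]);
    }
}
```

The resulting width and height could then be passed to a call such as scaleAbsolute, which the question's code already uses.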
Android has the feature "PdfDocument" to achieve this,
class Main2Activity : AppCompatActivity() {
private var imgFiles: Array<File?>? = null
override fun onCreate(savedInstanceState: Bundle?) {
super.onCreate(savedInstanceState)
setContentView(R.layout.activity_main2)
imgFiles= arrayOfNulls(2)
imgFiles!![0] = File(Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_PICTURES).toString() + "/doc1.png")
imgFiles!![1] = File(Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_PICTURES).toString() + "/doc3.png")
val file = getOutputFile(File(Environment.getExternalStorageDirectory().absolutePath)
, "/output.pdf")
val fOut = FileOutputStream(file)
val document = PdfDocument()
var i = 0
imgFiles?.forEach {
i++
val bitmap = BitmapFactory.decodeFile(it?.path)
val pageInfo = PdfDocument.PageInfo.Builder(bitmap.width, bitmap.height, i).create()
val page = document.startPage(pageInfo)
val canvas = page?.canvas
val paint = Paint()
canvas?.drawPaint(paint)
paint.color = Color.BLUE;
canvas?.drawBitmap(bitmap, 0f, 0f, null)
document.finishPage(page)
bitmap.recycle()
}
document.writeTo(fOut)
document.close()
}
private fun getOutputFile(path: File, fileName: String): File? {
if (!path.exists()) {
path.mkdirs()
}
val file = File(path, fileName)
try {
if (file.exists()) {
file.delete()
}
file.createNewFile()
} catch (e: Exception) {
e.printStackTrace()
}
return file
}
}
Finally, enable the storage permission in the manifest; this should work.

Get Layer2 Text (Signature Description) from signature image using itextsharp

I need to retrieve the layer2 text from a signature. How can I get the description (under the signature image) using itextsharp? Below is the code I'm using to get the sign date and username:
PdfReader reader = new PdfReader(pdfPath, System.Text.Encoding.UTF8.GetBytes(MASTER_PDF_PASSWORD));
using (MemoryStream memoryStream = new MemoryStream())
{
PdfStamper stamper = new PdfStamper(reader, memoryStream);
AcroFields acroFields = stamper.AcroFields;
List<String> names = acroFields.GetSignatureNames();
foreach (String name in names)
{
PdfPKCS7 pk = acroFields.VerifySignature(name);
String userName = PdfPKCS7.GetSubjectFields(pk.SigningCertificate).GetField("CN");
Console.WriteLine("Sign Date: " + pk.SignDate.ToString() + " Name: " + userName);
// Here i need to retrieve the description underneath the signature image
}
reader.RemoveUnusedObjects();
reader.Close();
stamper.Writer.CloseStream = false;
if (stamper != null)
{
stamper.Close();
}
}
and below is the code I used to set the description
PdfStamper st = PdfStamper.CreateSignature(reader, memoryStream, '\0', null, true);
PdfSignatureAppearance sap = st.SignatureAppearance;
sap.Render = PdfSignatureAppearance.SignatureRender.GraphicAndDescription;
sap.Layer2Font = font;
sap.Layer2Text = "Some text that i want to retrieve";
Thank you.
While Bruno addressed the issue starting with a PDF containing a "layer 2", allow me to first state that using these "signature layers" in PDF signature appearances is not required by the PDF specification; the specification actually does not even know these layers at all! Thus, if you try to parse a specific layer, you may not find such a "layer" or, even worse, find something that looks like that layer (an XObject named n2) which contains the wrong data.
That being said, whether you look for text from a layer 2 or from the signature appearance as a whole, you can use iTextSharp's text extraction capabilities. I used Bruno's code as the base for retrieving the n2 layer.
public static void ExtractSignatureTextFromFile(FileInfo file)
{
try
{
Console.Out.Write("File: {0}\n", file);
using (var pdfReader = new PdfReader(file.FullName))
{
AcroFields fields = pdfReader.AcroFields;
foreach (string name in fields.GetSignatureNames())
{
Console.Out.Write(" Signature: {0}\n", name);
iTextSharp.text.pdf.AcroFields.Item item = fields.GetFieldItem(name);
PdfDictionary widget = item.GetWidget(0);
PdfDictionary ap = widget.GetAsDict(PdfName.AP);
if (ap == null)
continue;
PdfStream normal = ap.GetAsStream(PdfName.N);
if (normal == null)
continue;
Console.Out.Write(" Content of normal appearance: {0}\n", extractText(normal));
PdfDictionary resources = normal.GetAsDict(PdfName.RESOURCES);
if (resources == null)
continue;
PdfDictionary xobject = resources.GetAsDict(PdfName.XOBJECT);
if (xobject == null)
continue;
PdfStream frm = xobject.GetAsStream(PdfName.FRM);
if (frm == null)
continue;
PdfDictionary res = frm.GetAsDict(PdfName.RESOURCES);
if (res == null)
continue;
PdfDictionary xobj = res.GetAsDict(PdfName.XOBJECT);
if (xobj == null)
continue;
PRStream n2 = (PRStream) xobj.GetAsStream(PdfName.N2);
if (n2 == null)
continue;
Console.Out.Write(" Content of normal appearance, layer 2: {0}\n", extractText(n2));
}
}
}
catch (Exception ex)
{
Console.Error.Write("Error... " + ex.StackTrace);
}
}
public static String extractText(PdfStream xObject)
{
PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
return strategy.GetResultantText();
}
For the sample file signature_n2.pdf Bruno used you get this:
File: ...\signature_n2.pdf
Signature: Signature1
Content of normal appearance: This document was signed by Bruno
Specimen.
Content of normal appearance, layer 2: This document was signed by Bruno
Specimen.
As this sample uses the layer 2 as the OP expects, it already contains the text in question.
Please take a look at the following PDF: signature_n2.pdf. It contains a signature with the following text in the n2 layer:
This document was signed by Bruno
Specimen.
Before we can write code to extract this text, we should use iText RUPS to look at the internal structure of the PDF, so that we can find out where this /n2 layer is stored:
Based on this information, we can start writing our code. See the GetN2fromSig example:
public static void main(String[] args) throws IOException {
PdfReader reader = new PdfReader(SRC);
AcroFields fields = reader.getAcroFields();
Item item = fields.getFieldItem("Signature1");
PdfDictionary widget = item.getWidget(0);
PdfDictionary ap = widget.getAsDict(PdfName.AP);
PdfStream normal = ap.getAsStream(PdfName.N);
PdfDictionary resources = normal.getAsDict(PdfName.RESOURCES);
PdfDictionary xobject = resources.getAsDict(PdfName.XOBJECT);
PdfStream frm = xobject.getAsStream(PdfName.FRM);
PdfDictionary res = frm.getAsDict(PdfName.RESOURCES);
PdfDictionary xobj = res.getAsDict(PdfName.XOBJECT);
PRStream n2 = (PRStream) xobj.getAsStream(PdfName.N2);
byte[] stream = PdfReader.getStreamBytes(n2);
System.out.println(new String(stream));
}
We get the widget annotation for the signature field with name "Signature1". Based on the info from RUPS, we know that we have to get the resources (/Resources) of the normal (/N) appearance (/AP). In the /XObject dictionary, we'll find a form XObject named /FRM. This XObject in turn has some /Resources of its own, more specifically two XObjects, one named /n0, the other one named /n2.
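The Java sample above assumes that every dictionary on this path exists, whereas the C# version in the other answer checks for null at each step. As a rough illustration of that defensive lookup, here is a sketch that uses plain java.util.Map objects as a stand-in for PdfDictionary; the DictPath class and its resolve method are hypothetical and not part of iText.

```java
import java.util.HashMap;
import java.util.Map;

final class DictPath {
    // Follows a chain of keys through nested maps and returns null as soon as
    // one level is missing, instead of throwing a NullPointerException.
    static Object resolve(Map<String, Object> root, String... keys) {
        Object current = root;
        for (String key : keys) {
            if (!(current instanceof Map)) {
                return null; // previous key was absent or not a dictionary
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // stand-in for the /FRM resources: /Resources -> /XObject -> /n2
        Map<String, Object> xobject = new HashMap<>();
        xobject.put("n2", "layer 2 stream");
        Map<String, Object> resources = new HashMap<>();
        resources.put("XObject", xobject);
        Map<String, Object> frm = new HashMap<>();
        frm.put("Resources", resources);

        System.out.println(resolve(frm, "Resources", "XObject", "n2"));
        System.out.println(resolve(frm, "Resources", "Font", "F1"));
    }
}
```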
We get the stream of the /n2 object and we convert it to an uncompressed byte[]. When we print this array as a String, we get the following result:
BT
1 0 0 1 0 49.55 Tm
/F1 12 Tf
(This document was signed by Bruno)Tj
1 0 0 1 0 31.55 Tm
(Specimen.)Tj
ET
This is PDF syntax. BT and ET stand for "Begin Text" and "End Text". The Tm operator sets the text matrix. The Tf operator sets the font. Tj shows the strings that are delimited by ( and ). If you want the plain text, it's sufficient to extract only the text that is between parentheses.
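Pulling out the text between parentheses can be sketched in a few lines of plain Java. This is an illustration only, under simplifying assumptions: it handles nested parentheses and the escapes \( \) and \\, but not other escapes such as \n or octal codes, and it does not decode fonts with custom encodings. The class name is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ContentStreamText {
    // Collects the literal string operands "( ... )" from a content stream.
    static List<String> literals(String content) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = null; // non-null while inside a literal
        int depth = 0;            // nesting level of unescaped parentheses
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (cur == null) {
                if (c == '(') { cur = new StringBuilder(); depth = 1; }
                continue;
            }
            if (c == '\\' && i + 1 < content.length()) {
                cur.append(content.charAt(++i)); // keep the escaped character as-is
            } else if (c == '(') {
                depth++;
                cur.append(c);
            } else if (c == ')') {
                if (--depth == 0) { out.add(cur.toString()); cur = null; }
                else cur.append(c);
            } else {
                cur.append(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String stream = "BT\n1 0 0 1 0 49.55 Tm\n/F1 12 Tf\n"
                + "(This document was signed by Bruno)Tj\n"
                + "1 0 0 1 0 31.55 Tm\n(Specimen.)Tj\nET";
        for (String s : literals(stream)) {
            System.out.println(s);
        }
    }
}
```

For anything beyond simple appearances like this one, the text extraction strategy shown in the other answer is likely the more robust choice.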

Extract image from PDF using itextsharp

I am trying to extract all the images from a pdf using itextsharp but can't seem to overcome this one hurdle.
The error occurs on the line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS); giving an error of "Parameter is not valid".
I think it works when the image is a bitmap but not of any other format.
I have this following code - sorry for the length;
private void Form1_Load(object sender, EventArgs e)
{
FileStream fs = File.OpenRead(@"reader.pdf");
byte[] data = new byte[fs.Length];
fs.Read(data, 0, (int)fs.Length);
List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();
iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
iTextSharp.text.pdf.PdfObject PDFObj = null;
iTextSharp.text.pdf.PdfStream PDFStremObj = null;
try
{
RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data);
PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
{
PDFObj = PDFReaderObj.GetPdfObject(i);
if ((PDFObj != null) && PDFObj.IsStream())
{
PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
{
byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);
if ((bytes != null))
{
try
{
System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);
MS.Position = 0;
System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
ImgList.Add(ImgPDF);
}
catch (Exception)
{
}
}
}
}
}
PDFReaderObj.Close();
}
catch (Exception ex)
{
throw new Exception(ex.Message);
}
} //Form1_Load
Resolved...
I got the same "Parameter is not valid" exception too, and after a lot of work, with the help of the link provided by der_chirurg (http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx), I resolved it. The code follows:
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using iTextSharp.text.pdf.parser;
using Dotnet = System.Drawing.Image;
using iTextSharp.text.pdf;
namespace PDF_Parsing
{
partial class PDF_ImgExtraction
{
string imgPath;
private void ExtractImage(string pdfFile)
{
// open the file passed to this method once and reuse the reader for every page
PdfReader pdfReader = new PdfReader(pdfFile);
for (int pageNumber = 1; pageNumber <= pdfReader.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdfReader.GetPageN(pageNumber);
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
string width = tg.Get(PdfName.WIDTH).ToString();
string height = tg.Get(PdfName.HEIGHT).ToString();
ImageRenderInfo imgRI = ImageRenderInfo.CreateForXObject(new Matrix(float.Parse(width), float.Parse(height)), (PRIndirectReference)obj, tg);
RenderImage(imgRI);
}
}
}
}
private void RenderImage(ImageRenderInfo renderInfo)
{
PdfImageObject image = renderInfo.GetImage();
using (Dotnet dotnetImg = image.GetDrawingImage())
{
if (dotnetImg != null)
{
using (MemoryStream ms = new MemoryStream())
{
dotnetImg.Save(ms, ImageFormat.Tiff);
Bitmap d = new Bitmap(dotnetImg);
d.Save(imgPath);
}
}
}
}
}
}
You need to check the stream's /Filter to see what image format a given image uses. It may be a standard image format:
DCTDecode (jpeg)
JPXDecode (jpeg 2000)
JBIG2Decode (jbig is a B&W only format)
CCITTFaxDecode (fax format, PDF supports group 3 and 4)
Other than that, you'll need to get the raw bytes (as you are), and build an image using the image stream's width, height, bits per component, number of color components (could be CMYK, indexed, RGB, or Something Weird), and a few others, as defined in section 8.9 of the ISO PDF SPECIFICATION (available for free).
So in some cases your code will work, but in others, it'll fail with the exception you mentioned.
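The filter check described above can be sketched as a small lookup. This is an illustration in plain Java with no iText types; the class, enum and method names are made up, and it assumes you already have the /Filter name as a string. A stream can name several filters, which this sketch does not handle.

```java
final class PdfImageFilters {
    enum Kind { JPEG, JPEG_2000, JBIG2, CCITT_FAX, RAW_SAMPLES }

    // Maps a single /Filter name to the kind of image data in the stream.
    // Anything else (for example /FlateDecode, or no filter at all) means the
    // bytes are raw samples that must be combined with the width, height,
    // bits per component and color space to build an image.
    static Kind classify(String filterName) {
        if (filterName == null) {
            return Kind.RAW_SAMPLES;
        }
        switch (filterName) {
            case "/DCTDecode":      return Kind.JPEG;
            case "/JPXDecode":      return Kind.JPEG_2000;
            case "/JBIG2Decode":    return Kind.JBIG2;
            case "/CCITTFaxDecode": return Kind.CCITT_FAX;
            default:                return Kind.RAW_SAMPLES;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("/DCTDecode"));
        System.out.println(classify("/FlateDecode"));
    }
}
```

Only for the standard formats is there a chance that the raw bytes can be handed to an image decoder directly; for RAW_SAMPLES, passing the bytes to something like Image.FromStream fails in the way the question describes.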
PS: When you have an exception, PLEASE include the stack trace every single time. Pretty please with sugar on top?
Works for me like this, using these two methods:
public static List<System.Drawing.Image> ExtractImagesFromPDF(byte[] bytes)
{
var imgs = new List<System.Drawing.Image>();
var pdf = new PdfReader(bytes);
try
{
for (int pageNumber = 1; pageNumber <= pdf.NumberOfPages; pageNumber++)
{
PdfDictionary pg = pdf.GetPageN(pageNumber);
List<PdfObject> objs = FindImageInPDFDictionary(pg);
foreach (var obj in objs)
{
if (obj != null)
{
int XrefIndex = Convert.ToInt32(((PRIndirectReference)obj).Number.ToString(System.Globalization.CultureInfo.InvariantCulture));
PdfObject pdfObj = pdf.GetPdfObject(XrefIndex);
PdfStream pdfStrem = (PdfStream)pdfObj;
var pdfImage = new PdfImageObject((PRStream)pdfStrem);
var img = pdfImage.GetDrawingImage();
imgs.Add(img);
}
}
}
}
finally
{
pdf.Close();
}
return imgs;
}
private static List<PdfObject> FindImageInPDFDictionary(PdfDictionary pg)
{
var res = (PdfDictionary)PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES));
var xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
var pdfObgs = new List<PdfObject>();
if (xobj != null)
{
foreach (PdfName name in xobj.Keys)
{
PdfObject obj = xobj.Get(name);
if (obj.IsIndirect())
{
var tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
var type = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
if (PdfName.IMAGE.Equals(type)) // image at the root of the pdf
{
pdfObgs.Add(obj);
}
else if (PdfName.FORM.Equals(type) || PdfName.GROUP.Equals(type)) // image inside a form or a group
{
pdfObgs.AddRange(FindImageInPDFDictionary(tg));
}
}
}
}
return pdfObgs;
}
In newer versions of iTextSharp, the first parameter of ImageRenderInfo.CreateForXObject is no longer a Matrix but a GraphicsState. @der_chirurg's approach should work. I tested it myself with the information from the following link and it worked beautifully:
http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/
To extract all images on all pages, you do not need to implement the different filters yourself. iTextSharp has an image renderer, which saves all images in their original image type.
Just follow the example found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx (you don't need to implement the HttpHandler part).
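A minimal sketch of that render-listener approach, assuming iTextSharp 5.x and its iTextSharp.text.pdf.parser namespace. The names ImageCollector and ImageExtraction are hypothetical. The listener keeps each image's bytes together with the file type iTextSharp reports, instead of going through System.Drawing.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Collects every image the content parser encounters on a page.
class ImageCollector : IRenderListener
{
    // Each entry pairs the reported file type (e.g. "jpg") with the image bytes.
    public readonly List<KeyValuePair<string, byte[]>> Images =
        new List<KeyValuePair<string, byte[]>>();

    // Text callbacks are required by the interface but unused here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        string type = image.GetFileType() ?? "bin"; // fallback is an assumption
        Images.Add(new KeyValuePair<string, byte[]>(type, image.GetImageAsBytes()));
    }
}

static class ImageExtraction
{
    // Runs the collector over every page of the PDF.
    public static List<KeyValuePair<string, byte[]>> Extract(byte[] pdfBytes)
    {
        var reader = new PdfReader(pdfBytes);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            var collector = new ImageCollector();
            for (int page = 1; page <= reader.NumberOfPages; page++)
                parser.ProcessContent(page, collector);
            return collector.Images;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Each entry can then be written out with something like File.WriteAllBytes(name + "." + entry.Key, entry.Value). Images in encodings or color spaces that iTextSharp cannot handle may make GetImage throw, so real code would probably wrap that call in a try/catch.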
I added a library on GitHub which extracts the images in a PDF and compresses them.
It could be useful when you start playing with the very powerful iTextSharp library.
Here is the link: https://github.com/rock-walker/PdfCompression
This works for me and I think it's a simple solution:
Write a custom IRenderListener and implement its RenderImage method, something like this:
public void RenderImage(ImageRenderInfo info)
{
PdfImageObject image = info.GetImage();
Parser.Matrix matrix = info.GetImageCTM();
var fileType = image.GetFileType();
ImageFormat format;
switch (fileType)
{//you may add more types here
case "jpg":
case "jpeg":
format = ImageFormat.Jpeg;
break;
case "png":
format = ImageFormat.Png;
break;
case "bmp":
format = ImageFormat.Bmp;
break;
case "tif":
case "tiff":
format = ImageFormat.Tiff;
break;
case "gif":
format = ImageFormat.Gif;
break;
default:
format = ImageFormat.Jpeg;
break;
}
var pic = image.GetDrawingImage();
var x = matrix[Parser.Matrix.I31];
var y = matrix[Parser.Matrix.I32];
var width = matrix[Parser.Matrix.I11];
var height = matrix[Parser.Matrix.I22];
if (x < <some value> && y < <some value>)
{
return;//ignore these images
}
pic.Save(<path and name>, format);
}
I have used this library in the past without any problems.
http://www.winnovative-software.com/PdfImgExtractor.aspx
private void btnExtractImages_Click(object sender, EventArgs e)
{
if (pdfFileTextBox.Text.Trim().Equals(String.Empty))
{
MessageBox.Show("Please choose a source PDF file", "Choose PDF file", MessageBoxButtons.OK);
return;
}
// the source pdf file
string pdfFileName = pdfFileTextBox.Text.Trim();
// start page number
int startPageNumber = int.Parse(textBoxStartPage.Text.Trim());
// end page number
// when it is 0 the extraction will continue up to the end of document
int endPageNumber = 0;
if (textBoxEndPage.Text.Trim() != String.Empty)
endPageNumber = int.Parse(textBoxEndPage.Text.Trim());
// create the PDF images extractor object
PdfImagesExtractor pdfImagesExtractor = new PdfImagesExtractor();
pdfImagesExtractor.LicenseKey = "31FAUEJHUEBQRl5AUENBXkFCXklJSUlQQA==";
// the demo output directory
string outputDirectory = Path.Combine(Application.StartupPath, @"DemoFiles\Output");
Cursor = Cursors.WaitCursor;
// set the handler to be called when an image was extracted
pdfImagesExtractor.ImageExtractedEvent += pdfImagesExtractor_ImageExtractedEvent;
try
{
// start images counting
imageIndex = 0;
// call the images extractor to raise the ImageExtractedEvent event when an image is extracted from a PDF page
// the pdfImagesExtractor_ImageExtractedEvent handler below will be executed for each extracted image
pdfImagesExtractor.ExtractImagesInEvent(pdfFileName, startPageNumber, endPageNumber);
// Alternatively you can use the ExtractImages() and ExtractImagesToFile() methods
// to extract the images from a PDF document in memory or to image files in a directory
// uncomment the line below to extract the images to an array of ExtractedImage objects
//ExtractedImage[] pdfPageImages = pdfImagesExtractor.ExtractImages(pdfFileName, startPageNumber, endPageNumber);
// uncomment the lines below to extract the images to image files in a directory
//string outputDirectory = System.IO.Path.Combine(Application.StartupPath, @"DemoFiles\Output");
//pdfImagesExtractor.ExtractImagesToFile(pdfFileName, startPageNumber, endPageNumber, outputDirectory, "pdfimage");
}
catch (Exception ex)
{
// The extraction failed
MessageBox.Show(String.Format("An error occurred. {0}", ex.Message), "Error");
return;
}
finally
{
// uninstall the event handler
pdfImagesExtractor.ImageExtractedEvent -= pdfImagesExtractor_ImageExtractedEvent;
Cursor = Cursors.Arrow;
}
try
{
System.Diagnostics.Process.Start(outputDirectory);
}
catch (Exception ex)
{
MessageBox.Show(string.Format("Cannot open output folder. {0}", ex.Message));
return;
}
}
/// <summary>
/// The ImageExtractedEvent event handler called after an image was extracted from a PDF page.
/// The event is raised when the ExtractImagesInEvent() method is used
/// </summary>
/// <param name="args">The handler argument containing the extracted image and the PDF page number</param>
void pdfImagesExtractor_ImageExtractedEvent(ImageExtractedEventArgs args)
{
// get the image object and page number from the event handler argument
Image pdfPageImageObj = args.ExtractedImage.ImageObject;
int pageNumber = args.ExtractedImage.PageNumber;
// save the extracted image to a PNG file
string outputPageImage = Path.Combine(Application.StartupPath, @"DemoFiles\Output",
"pdfimage_" + pageNumber.ToString() + "_" + imageIndex++ + ".png");
pdfPageImageObj.Save(outputPageImage, ImageFormat.Png);
args.ExtractedImage.Dispose();
}
