Get document properties from PDF in iTextSharp

Get document properties from PDF in iTextSharp - c#

I'm trying to get some information out of a PDF file. I've tried using PdfSharp, and it has properties for the information I need, but it cannot open iref streams, so i've had to abandon it.
Instead i'm trying iTextSharp. so far i've managed to get some basic information out, like the title, aurhor and subject, from the Info array.
However, i'm now after a bit more information, but cannot find where it is exposed (if it is exposed) in iTextSharp.... The information I am after is highlighted in the image below:
I cannot figure out where this information is stored. Any and all help will be much appreciated.

For documents encrypted using standard password encryption you can retrieve the permissions after opening the file in a PdfReader pdfReader using
getPermissions() in case of iText/Java
int permissions = pdfReader.getPermissions()
Permissions in case of iTextSharp/.Net
int permissions = pdfReader.Permissions
The int value returned is the P value of the encryption dictionary which contains
A set of flags specifying which operations shall be permitted when the document is opened with user access (see Table 22).
[...]
The value of the P entry shall be interpreted as an unsigned 32-bit quantity containing a set of flags specifying which access permissions shall be granted when the document is opened with user access. Table 22 shows the meanings of these flags. Bit positions within the flag word shall be numbered from 1 (low-order) to 32 (high order). A 1 bit in any position shall enable the corresponding access permission.
[...]
Bit position Meaning
3 (Security handlers of revision 2) Print the document. (Security handlers of revision 3 or greater) Print the document (possibly not at the highest quality level, depending on whether bit 12 is also set).
4 Modify the contents of the document by operations other than those controlled by bits 6, 9, and 11.
5 (Security handlers of revision 2) Copy or otherwise extract text and graphics from the document, including extracting text and graphics (in support of accessibility to users with disabilities or for other purposes). (Security handlers of revision 3 or greater) Copy or otherwise extract text and graphics from the document by operations other than that controlled by bit 10.
6 Add or modify text annotations, fill in interactive form fields, and, if bit 4 is also set, create or modify interactive form fields (including signature fields).
9 (Security handlers of revision 3 or greater) Fill in existing interactive form fields (including signature fields), even if bit 6 is clear.
10 (Security handlers of revision 3 or greater) Extract text and graphics (in support of accessibility to users with disabilities or for other purposes).
11 (Security handlers of revision 3 or greater) Assemble the document (insert, rotate, or delete pages and create bookmarks or thumbnail images), even if bit 4 is clear.
12 (Security handlers of revision 3 or greater) Print the document to a representation from which a faithful digital copy of the PDF content could be generated. When this bit is clear (and bit 3 is set), printing is limited to a low-level representation of the appearance, possibly of degraded quality.
(Section 7.6.3.2 "Standard Encryption Dictionary" in the PDF specification ISO 32000-1)
You can use the PdfWriter.ALLOW_* constants in this context.
Concerning the dialog screenshot you made, though, be aware that the operations effectively allowed do not only depend on the PDF document but also on the PDF viewer! Otherwise you might be caught in the same trap as the OP of this question.

Thanks to mkl for your answer, it was part of the story, but here is the answer which you helped me find:
using (var pdf = new PdfReader(File))
{
Console.WriteLine(PdfEncryptor.IsModifyAnnotationsAllowed(pdf.Permissions));
}
The PdfEncryptor is what was missing, it converts the P value into a simple bool for yes or no. Other methods on there are:
IsAssemblyAllowed
IsCopyAllowed
IsDegradedPrintingAllowed
IsFillInAllowed
IsModifyAnnotationsAllowed
IsModifyContentsAllowed
IsPrintingAllowed
IsScreenReadersAllowed
As for the security method part, this is what i went with:
using (var pdf = new PdfReader(File))
{
Console.WriteLine(!pdf.IsOpenedWithFullPermissions == Expected);
}

Related

Revision No. in ItextSharp

I want to sign all pages in ItextSharp and according to this thread I'm using
for (int p = 1; p <= writer.reader.getNumberOfPages(); p++) {
writer.addAnnotation(sigField, p);
}
It's worked well but when I open File with Adobe the rev no. is 1 for all signature I tried to increase rev. at the end of method FindSignatureNames() in AcroFields.cs but it doesnt reflect.
I dont know what i missed to make it worked?

I want to sign all pages in ItextSharp and according to this thread
First of all, a digital PDF signature always signs the whole PDF including in particular all pages. What the question you refer to is about, is adding visualizations (widget annotations) of a single digital signature to all document pages.
This is exactly what your code is doing, it's adding the signature widget annotation (which is merged into the signature field) to the annotation array of all the pages. (Strictly speaking this is done in an invalid way. But current validators ignore this error.)
But even though you now have signature visualizations on each page, there still is merely a single signature field holding a single digital signature. Thus, there also is only a single signed revision.
Your observation, therefore,
It's worked well but when I open File with Adobe the rev no. is 1 for all signature
is to be expected and correct.
If you want your signature visualizations to refer to different revisions, simply apply as many digital signatures to your document as it has pages, with a single visualization each, for the nth signature on page n.

How can I set halftone (sethalftone) to each separation color with device tiffsep1 and other separation ones?

The code works but the commented code will create an error. The error are not solved by changing -sDEVICE to tiffgray, for example.
String[] ARGS = new String[] {
"",
"-sDEVICE=tiffsep1",
"-r1200",
"-o out.tiff",
"SOSample.pdf",
//"-c",
//"<< /HalftoneType 1 /Frequency 300 /Angle 45 /SpotFunction {180 mul cos exch 180 mul cos add 2 div} >> sethalftone",
//"-f"
};
How can I define sethalftone with ghostscript and how can I set it for each color of tiffsep1? What am I doing wrong with one color and how to make it for separations?
I'm using:
[DllImport("gsdll64.dll", EntryPoint = "gsapi_init_with_args")]
public static extern int INSTANCEStart(IntPtr instance, int argc, string[] argv);
and so on.
I'm working with Ghostscript 9.52.
Something that could help (\"):
"-c",
"\"<</Orientation 1>> setpagedevice\"",

You need to use the sethalftone PostScript operator in order to change the halftone. Obviously this will involve writing some PostScript.
Not only that, but you really need to set the default halftone, or set the halftone at the start of the page, because the current PDF interpreter in Ghostscript does an initgraphics at the start of every page of a PDF file.
For all of this you are going to need a copy of the PostScript Language Reference Manual, which you can get from somewhere on the Adobe web site. They keep moving stuff around so I'm not going to try and post a link, just google for the name of the manual. You want the third edition.
So you need to write a BeginPage procedure, which you will find covered in Chapter 6 under device control, pages 427 onwards.
The BeginPage procedure will need to set a halftone, and you will find halftones covered in Section 7.4, page 480 onwards. You will presumably want to use either a type 2 or type 4 halftone dictionary.
When you've assembled that, you then need to pass it to Ghostscript before you process the PDF file. The simplest method is to put the PostScript program in a file (called eg setup.ps) and then put that filename on the command line immediately before the PDF filename.
Eg:
gs -r1200 -sDEVICE=tiffsep1 -o out%d.tif setup.ps sample.pdf
Note that PDF files can contain a halftone specification themselves (this is deprecated in PDF 2.0) and Ghostscript will honour any halftone in a PDF file.
Finally; this is an unusual request and, given that you are writing code to link to the Ghostscript DLL, makes me think you may be using Ghostscript commercially. You should review the AGPL to ensure you are complying with the terms of the license. If you plan on distributing your application you will almost certainly need a commercial license.

How can I tell ghostscript not to rasterize gradients in eps files?

I was searching for solution that would allow me to read, edit and save .eps files. I found out that ghostscript can give all of this opportunities. The algoritm I need is simple: read several .eps files, concatenate them in one big file and save new .eps file. I can do that already but there is a problem: new generated and saved files don't preserve gradients. Gradients are rasterized and shapes which use that gradients are converted to clipping masks. Is there a way to tell ghostscript not to rasterize gradients in eps?
I'm using latest 32 bit version of ghostscript library though my Windows is 64 bit (there were problems running solution on 64 bit version of ghostscript). Actually it's not so important but I'm writting using C# and Ghostscript.Net.
This is the sample code:
using (GhostscriptProcessor processor = new GhostscriptProcessor(lastInstalledVersion, true))
{
List<string> switches = new List<string>();
switches.Add("-o");
switches.Add(#"-sOutputFile=" + outputFile);
switches.Add("-sDEVICE=eps2write");
switches.Add("-dUseCIEColor=true");
switches.Add("-c");
switches.Add("<</Install {0.5 0.5 scale}>> setpagedevice");
switches.Add("-f");
switches.Add(inputFile);
processor.Process(switches.ToArray());
}

The answer to the question you have asked is simple; you can't. The eps2write device is called that for a reason, it only produces level 2 PostScript, and the shfill operator, or type 2 pattern (shading dictionary in PDF) is a level 3 PostScript primitive.
However, there seems to be no good reason to run the exiting files through Ghostscript anyway. You say you already have a number of EPS files. The whole point of EPS files is that they can be treated as a 'black box', you do not need to know what's in them in order to concatenate them, rearrange them etc.
All you do is write some 'wrapper' PostScript that alters the CTM before including the EPS file in its entirety. You can work out what the arguments to scale and translate should be, because the EPS file will have a %%BoundingBox comment that tells you where it sits in user space. All you need to do is alter the scale, and offset the 0,0 origin (bottom left) using translate.
Note that the eps2write device, because it is limited to producing level 2 PostScript, also does not support some other features of PostScript beyond the original level 2 specification, such as CIDFonts.

How to extract text from PDF using iTextSharp version 4.1.6? [duplicate]

We are developing a Pdf parser to be used along with our system.
The requirement is such that, we store all the information on any pdf documents and should be able to reproduce the document as such (with minimal changes from original document).
We did some googling and found iTextSharp be the best mate for our purpose.
We are developing our project using .net.
You might have guessed as i mentioned in my title requiring comparisons for specific versions of iTextSharp (4.1.6 vs 5.x). We know that 4.1.6 is the last version of iTextSharp with the LGPL/MPL license . The 5.x versions are AGPL.
We would like to have a good comparison between the versions before choosing the LGPL version or we buy the license for AGPL (we dont like to publish our code).
I did some browsing through the revision changes in the iTextSharp but i would like to know if any content exist, making a good comparison between the versions.
Thanks in advance!

I'm the CTO of iText Software, so just like Michaël who already answered in the comment section, I'm at the same time the most authoritative source as well as a biased source.
There's a very simple comparison chart on the iText web site.
This chart doesn't cover text extraction, so allow me to list the relevant improvements since iText 5.
You've probably also found this page.
In case you wonder about the bug fixes and the performance improvements regarding text parsing, this is a more exhaustive list:
5.0.0: Text extraction: major overhaul to perform calculations in user space. This allows the parser to correctly determine line breaks, even if the text or page is rotated.
5.0.1: Refactored callback so method signature won't need to change as render callback API evolves.
5.0.1: Refactoring to make it easier for outside users to interact with the content stream processor. Also refactored render listener so text and image event listening occurs in the same interface (reduces a lot of non-value-add complexity)
5.0.1: New filtering functionality for text renderers.
5.0.1: Additional utility method for previewing PDF content.
5.0.1: Added a much more advanced text renderer listener that can reconstruct page content based on physical location of text on the page
5.0.1: Added support for XObject Form processing (text added via PdfTemplate can now be parsed)
5.0.1: Added rudimentary support for XObject Image callbacks
5.0.1: Bug fix - text extraction wasn't correct for certain page orientations
5.0.1: Bug fix - matrices were being concatenated in the wrong order.
5.0.1: PdfTextExtractor: changed the default render listener (new location aware strategy)
5.0.1: Getters for GraphicsState
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.2: CMapAwareDocumentFont: Tweaks to make processing quasi-invalid PDF files more robust
5.0.2: PdfContentReaderTool: null pointer handling, plus a few well placed flush calls
5.0.2: PdfContentReaderTool: Show details on resource entries
5.0.2: PdfContentStreamProcessor: Adjustment so embedded images don't cause parsing problems and improvements to EI detection
5.0.2: LocationTextExtractionStrategy: Fixed anti-parallel algorithm, plus accounting for negative inter-character offsets. Change to text extraction strategy that builds out the text model first, then computes concatenation requirements.
5.0.2: Adjustments to linesegment implementation; optimalization of changes made by Bruno to text extraction; for example: introduction of the class MarkedContentInfo.
5.0.2: Major refactoring of interface to text extraction functionality: for instance introduction of class PdfReaderContentParser
5.0.3: added method to get area of image in user units
5.0.3: better parsing of inline images
5.0.3: Adding an extra check for begin/end sequences when parsing a ToUnicode stream.
5.0.4: Content streams in arrays should be parsed as if they were separated by whitespace
5.0.4: Expose CTM
5.0.4: Refactor to pull inline image processing into it's own class. Added parsing of image data if there is no filter applied (there are some PDFs where there is no white space between the end of the image data and the EI operator). Ultimately, it will be best to actually parse the image data, but this will require a pretty big refactoring of the iText decoders (to work from streams instead of byte[] of known lengths).
5.0.4: Handle multi-stage filters; Correct bug that pulled whitespace as first byte of inline image stream.
5.0.4: Applying stream filters to inline images.
5.0.4: PdfReader: Expose filter decoder for arbitrary byte arrays (instead of only streams)
5.0.6: CMapParser: Fix to read broken ToUnicode cmaps.
5.0.6: handle slightly malformed embedded images
5.0.6: CMapAwareDocumentFont: Some PDFs have a diff map bigger than 256 characters.
5.0.6: performance: Cache the fonts used in text extraction
5.1.2: PRTokeniser: Made the algorithm to find startxref more memory efficient.
5.1.2: RandomAccessFileOrArray: Improved handling for huge files that can't be mapped
5.1.2: CMapAwareDocumentFont: fix NPE if mapping doesn't get initialized (I'd rather wind up with junk characters than throw an unexpected exception down the road)
5.1.3: refactoring of how filters are applied to streams, adjust parser so it can handle multi-stage filters
5.1.3: images: allow correct decoding of 1bpc bitmask images
5.1.3: images: add jbig2 streams to pass through
5.1.3: images: handle null and indirect references in decode parameters, throw exception if unable to decode an image
5.2.0: Better error messages and better handling zero sized files and attempts to read past the end of the file.
5.2.0: Removed restriction that using memory mapping requires the file be smaller than ~2GB.
5.2.0: Avoid NullPointerException in RandomAccessFileOrArray
5.2.0: Made a utility method in pdfContentStreamProcessor private and clarified the stateful nature of the class
5.2.0: LocationTextExtractionStrategy: bounds checking on string lengths and refactoring to make code easier to read.
5.2.0: Better handling of color space dictionaries in images.
5.2.0: improve handling of quasi improper inline image content.
5.2.0: don't decode inline image streams until we absolutely need them.
5.2.0: avoid NullPointerException of resource dictionary isn't provided.
5.3.0: LocationTextExtractionStrategy: old comparison approach caused runtime exceptions in Java 7
5.3.3: incorporate the text-rise parameter
5.3.3: expose glyph-by-glyph information
5.3.3: Bugfix: text to user space transformation was being applied multiple times for sub-textrenderinfo objects
5.3.3: Bugfix: Correct baseline calculation so it doesn't include final character spacing
5.3.4: Added low-level filtering hook to LocationTextExtractionStrategy.
5.3.5: Fixed bug in PRTokeniser: handle case where number is at end of stream.
5.3.5: Replaced StringBuffer with StringBuilder in PRTokeniser for performance reasons.
5.4.2: Added an isChunkAtWordBoundary() method to LocationTextExtractionStrategy to check if a space character should be inserted between a previous chunk and the current one.
5.4.2: Added a getCharSpaceWidth() method to LocationTextExtractionStrategy to get the width of a space character.
5.4.2: Added a getText() method to LocationTextExtractionStrategy to get the text of the current Chunk.
5.4.2: Added an appendTextChunk(() method to SimpleTextExtractionStrategy to expose the append process so that subclasses can add text from outside the text parse operation.
5.4.5: Added MultiFilteredRenderListener class for PDF parser.
5.4.5: Added GlyphRenderListener and GlyphTextRenderListener classes for processing each glyph rather than processing chunks of text.
5.4.5: Added method getMcid() in TextRenderInfo.
5.4.5: fixed resource leak when many inline images were in content stream
5.5.0: CMapAwareDocumentFont: if font space width isn't defined, use the default width for the font.
5.5.0: PdfContentReader: avoid exception when displaying an empty dictionary.
There are some things that you won't be able to do if you don't upgrade. For instance, you won't be able to do the things described in these slides.
If you look at the roadmap for iText, you'll see that we'll invest even more time on text extraction in the future.
In all honesty: using the 5 year old version wouldn't only be like reinventing the wheel, it would also be like falling in every pitfall we've fallen in in the last 5 years. I can assure you that buying a license will be less expensive.

PDF certification signature invalidates multiple approval signatures with ITextSharp

I'm using ITextSharp 5.1.1 to digitally sign a PDF with multiple signature fields. I have 4 signature fields: 3 approval signatures (users sign the document) and the 4th signature is the certification signature, which our system signs to indicate that the document is verified as correct and that no further modification can be made to the PDF.
If I add multiple approval signatures, a new revision is created for each signature and all the signatures are valid.
As soon as I then add the certification signature at the end of the signing process, it invalidates all the previous approval signatures.
Am I missing something here? Is there a different way to achieve the same effect as the certification signature without invalidating the approval signatures?
TIA

Your approach is to first add a number of approval signatures to a document and afterwards a certification signature.
This approach cannot work because it violates the PDF specification. As the question is from 2011, I quote from part 1, i.e. ISO 32000-1 published 2008, and not part 2, i.e. ISO 32000-2 published 2017; in essence both specifications agree on this issue, though.
A PDF document may contain the following standard types of signatures:
One or more approval signatures. [...]
At most one certification signature (PDF 1.5). [...] The signature dictionary shall contain a signature reference dictionary (see Table 253) that has a DocMDP transform method. See 12.8.2.2, “DocMDP” for information on how these signatures shall be created and validated.
At most two usage rights signatures. [...]
(ISO 32000-1 section 12.8 "Digital Signatures", subsection 12.8.1 "General")
The DocMDP transform method shall be used to detect modifications relative to a signature field that is signed by the author of a document (the person applying the first signature). A document can contain only one signature field that contains a DocMDP transform method; it shall be the first signed field in the document.
(ISO 32000-1 section 12.8.2.2 "DocMDP", subsection 12.8.2.2.1 "General")
Thus, it is no wonder that as soon as you then add the certification signature at the end of the signing process, it invalidates all the previous approval signatures, because adding a certification signature to an already signed document by definition makes the signature structure of the document invalid. (Ok, Adobe's error message in this case could more clearly point out the issue...)
You wonder is there a different way to achieve the same effect as the certification signature without invalidating the approval signatures?
According to raw ISO 32000-1 your options are limited. By means of a signature field lock dictionary (see section 12.7.4.5 "Signature Fields") and a FieldMDP transformation in your final signature you can lock existing form fields:
On behalf of a document author creating a document containing both form fields and signatures the following shall be supported by conforming writers:
The author specifies that form fields shall be filled in without invalidating the approval or certification signature. The P entry of the DocMDP transform parameters dictionary shall be set to either 2 or 3 (see Table 254).
The author can also specify that after a specific recipient has signed the document, any modifications to specific form fields shall invalidate that recipient’s signature. There shall be a separate signature field for each designated recipient, each having an associated signature field lock dictionary (see Table 233) specifying the form fields that shall be locked for that user.
When the recipient signs the field, the signature, signature reference, and transform parameters dictionaries shall be created. The Action and Fields entries in the transform parameters dictionary shall be copied from the corresponding fields in the signature field lock dictionary.
(ISO 32000-1 section 12.8.2.4 "FieldMDP")
But if your original certification signature allowed annotation changes or there is no certification signature to start with, you cannot dis-allow such annotation changes.
If, on the other hand, your PDF processors and viewers support ISO 32000-1 plus the Adobe Supplement with ExtensionLevel 3, an additional entry in the signature field lock dictionary is specified:
P number (Optional; ExtensionLevel 3) The access permissions granted for this document. Valid values follow:
1, no changes to the document are permitted; any change to the document invalidates the signature.
2, permitted changes are filling in forms, instantiating page templates, and signing; other changes invalidate the signature.
3, permitted changes are the same as for 2, as well as annotation creation, deletion, and modification; other changes invalidate the signature.
Default value: none; absence of this key results in no effect on signature validation rules.
If MDP permission is already in effect from an earlier incremental save section or the original part of the document, the number shall specify permissions less than or equal to the permissions already in effect based on signatures earlier in the document. That is, permissions can be denied but not added. If the number specifies greater permissions than an MDP value already in effect, the new number is ignored.
If the document does not have an author signature, the initial permissions in effect are those based on the number 3.
The new permission applies to any incremental changes to the document following the signature of which this key is part.
(Adobe Supplement to ISO 32000-1 ExtensionLevel 3 part I "Extensions to the PDF specification", section 8.6.3 "Field Types", subsection "Signature Fields")
This addition has been adopted by ISO 32000-2.
If you want to make use of these mechanisms using iText since version 5.3.0, have a look at the Digital Signatures for PDF Documents whitepaper by Bruno Lowagie, iText. Section 2.5.5 "Locking fields and documents after signing" illustrates how to use those mechanisms with iText.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.