How to get the UserUnit property from a PdfFile using iTextSharp PdfReader

I have a bunch of PDF files. I read these as requested into a byte array and then pass it to an iTextSharp PdfReader instance. I then want to grab the dimensions of each page in pixels. From what I've read so far, PDF files work in points, a point being a configurable unit stored in some kind of dictionary in an element called UserUnit.
After loading my PDF file into a PdfReader, what do I need to do to get the UserUnit for each page (apparently it can vary from page to page) so I can then get the page dimensions in pixels?
At present I have this code, which grabs the dimensions of each page in "points". I guess I just need the UserUnit, and can then multiply these dimensions by it to get pixels or something similar.
// Create an object to read the PDF
PdfReader reader = new iTextSharp.text.pdf.PdfReader(file_content);
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    Rectangle dim = reader.GetPageSize(i);
    int[] xy = new int[] { (int)dim.Width, (int)dim.Height }; // returns page size in "points"
    page_data[objectid + '-' + i] = xy;
}
Cheers!

Allow me to quote from my book:
iText in Action - Second Edition, page 9:
FAQ What is the measurement unit in PDF documents? Most of the measurements
in PDFs are expressed in user space units. ISO-32000-1 (section 8.3.2.3) tells us
“the default for the size of the unit in default user space (1/72 inch) is
approximately the same as a point (pt), a unit widely used in the printing
industry. It is not exactly the same; there is no universal definition of a point.”
In short, 1 in. = 25.4 mm = 72 user units (which roughly corresponds to 72 pt).
On the next page, I explain that it’s possible to change the default value of the user unit, and I add an example on how to create a document with pages that have a different user unit.
Now for your question: suppose you have an existing PDF, how do you find which user unit was used? Before we answer this, we need to take a look at ISO-32000-1.
In section 7.7.3.3 Page Objects, you'll find the description of UserUnit in Table 30, "Entries in a page object":
(Optional; PDF 1.6) A positive number that shall give the size of
default user space units, in multiples of 1⁄72 inch. The range of
supported values shall be implementation-dependent. Default value: 1.0
(user space unit is 1⁄72 inch).
This key was introduced in PDF 1.6; you won't find it in older files. It's optional, so you won't always find it in every page dictionary. In my book, I also explain that the maximum value of the UserUnit key is 75,000.
Now how to retrieve this value with iTextSharp?
You already have Rectangle dim = reader.GetPageSize(i); which returns the MediaBox. This may not be the size of the visual part of the page. If there's a CropBox defined for the page, viewers will show a much smaller size than what you have in xy (but you probably knew that already).
What you need now is the page dictionary, so that you can retrieve the value of the UserUnit key:
PdfDictionary pageDict = reader.GetPageN(i);
PdfNumber userUnit = pageDict.GetAsNumber(PdfName.USERUNIT);
Most of the time userUnit will be null (in which case the default value of 1.0 applies), but if it isn't, you can use userUnit.FloatValue.
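To get from there to pixels, a target resolution is still needed, because a PDF page has no pixel size of its own. The following is a minimal sketch of that arithmetic only, not part of iTextSharp: the class name PageMetrics, the method ToPixels, and the dpi parameter are assumptions introduced for illustration.

```csharp
using System;

public static class PageMetrics
{
    // The default user space unit is 1/72 inch; UserUnit scales that unit.
    public const float DefaultUnitsPerInch = 72f;

    // sizeInUserUnits: a width or height as returned by GetPageSize(i).
    // userUnit: the page's UserUnit value, or 1 if the key is absent.
    // dpi: the resolution you render at (an assumption of this sketch).
    public static int ToPixels(float sizeInUserUnits, float userUnit, float dpi)
    {
        float inches = sizeInUserUnits * userUnit / DefaultUnitsPerInch;
        return (int)Math.Round(inches * dpi);
    }
}
```

With the snippet above, you could pass userUnit == null ? 1f : userUnit.FloatValue as the second argument, so pages without a UserUnit key fall back to the default.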

Related

Is there any way to optimize pdfreader itextsharp?

I have a method that repeatedly reads text from different pages of a PDF document using a rectangle. The larger the file, the slower everything is processed. I tried to use Parallel.ForEach, but I didn't get a substantial increase in processing speed; everything seems to be held up by PdfReader.
The method is something like this:
var lst = new ConcurrentBag<Test3>();
using (var reader = new PdfReader(byteArr))
{
    Parallel.ForEach(areas, t =>
    {
        var pageSize = reader.GetPageSize(t.PageNumber);
        var rectangle = GetRectangle(t.AreaData, pageSize);
        var text = GetTextFromRectangle(reader, rectangle, t.PageNumber);
        lst.Add(text);
    });
}
public string GetTextFromRectangle(PdfReader reader, Rectangle rect, int pageNum)
{
    RenderFilter[] filter = {
        new RegionTextRenderText()
    };
    ITextExtractionStrategy strategy =
        new FilteredTextRenderListener(new
            LocationTextExtractionStrategy(), filter);
    return PdfTextExtractor.GetTextFromPage(reader, pageNum, strategy);
}

After you mentioned in a comment that there are
Approximately 900 rectangle areas per page
and added your GetTextFromRectangle code, the cause of the problem became clear: For each of your pre-defined rectangles you make iText parse the whole content of the page the rectangle is on into a filtered text extraction strategy you expect to be focused on the respective rectangle area.
By the way, even worse, I don't see you using the Rectangle rect parameter in your GetTextFromRectangle method, thus after all you actually do not even focus on the respective rectangle!
So you parse each page approximately 900 times, each time throwing away most of the parsed information, instead of only once and retrieving the text from the pre-parsed data from each of those 900 rectangles per page.
This is a waste of resources in its purest form!
What you should do instead is:
1. Sort and separate your areas by their respective page.
2. For each of the pages:
   a. Once (and only once), parse the content of that page into an unfiltered LocationTextExtractionStrategy.
   b. For each rectangle on that page, use the GetResultantText(TextChunkFilter) method of the strategy instance with a TextChunkFilter that filters by position (whether the chunk in question is inside the rectangle at hand) to retrieve the area text.
As an aside, in case of iText 7 instead of iText 5 (for .Net, formerly called iTextSharp) that GetResultantText overload with a TextChunkFilter is missing but you can emulate it, cf. this answer.
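The control flow recommended above can be sketched independently of the library. This is only an illustration of "group by page, parse once, query many times": the Area class and the two delegates are hypothetical stand-ins, where parsePage would wrap the single parse into an unfiltered LocationTextExtractionStrategy and textInArea would wrap the filtered GetResultantText call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical area descriptor; the names are illustrative, not from iTextSharp.
public sealed class Area
{
    public int PageNumber { get; set; }
    public string Name { get; set; }
}

public static class PerPageExtraction
{
    // parsePage: parses one page and returns the pre-parsed data.
    // textInArea: reads the text of one area from that pre-parsed data.
    public static Dictionary<string, string> ExtractAll<TParsed>(
        IEnumerable<Area> areas,
        Func<int, TParsed> parsePage,
        Func<TParsed, Area, string> textInArea)
    {
        var result = new Dictionary<string, string>();
        foreach (var pageGroup in areas.GroupBy(a => a.PageNumber).OrderBy(g => g.Key))
        {
            // Exactly one parse per page, no matter how many areas it has.
            TParsed parsed = parsePage(pageGroup.Key);
            foreach (Area area in pageGroup)
            {
                result[area.Name] = textInArea(parsed, area);
            }
        }
        return result;
    }
}
```

With roughly 900 areas per page, this shape turns about 900 page parses into one, which is likely a far bigger win than parallelising the original loop.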

How to find the location of the last item added to a PDF document

Similarly to this question, I want to find the location of the last item in a PDF document and add content at that place. More specifically, I would like to add an electronic signature at the position where one would normally put a regular handwritten signature on letters.
In the question above the user is creating the PDF file, but I am importing an existing PDF file that can have any structure. Therefore, as far as I can see, I cannot use the same method, as I do not know whether the last object created was a paragraph or some other object.
I found the following function and property on the PdfWriter class that bode well but can not find any documentation that explains the output I get when I run the programme:
PdfWriter w = _document.GetWriter();
long currentPos = w.GetCurrentPos();
long pos = w.Position;
When I run this the output I get is something like the number 84178. What unit is that? Can I use this number to calculate how much vertical space there is left on the page and if it is too small then add a page and have the signature on the next page?

How to read the bit rate information from a .mov video file header (QuickTime File Format)?

I've been trying to read some values out of the metadata of a .mov file (QuickTime File Format) with limited success. I've been using the following link as a reference:
Introduction to QuickTime File Format Specification
I've managed to correctly locate and read out/calculate the media duration, but I can't seem to find which Atom the Bit Rate information is stored in. (Atoms are the internal blocks of metadata inside the file).
If anyone can point me to the correct Atom to read, I'll be alright reading it... I just can't seem to find it in the documentation even. "Bit Rate" is only mentioned a couple of times in the whole document.
UPDATE >>>
Going by the very limited information provided below by @szatmary, I have parsed the Sample Size Atom and the Time to Sample Atom from the relevant Track Atom, but am getting some bizarre values. For example, I keep getting a Sample Size value of 1 (when reading from multiple different single video .mov files with constant Bit Rates). The related documentation (from the above link) says:
Sample size
A 32-bit integer specifying the sample size. If all the samples are the same size, this field contains that size value. If this field is set to 0, then the samples have different sizes, and those sizes are stored in the sample size table.
So the field has the value of 1, which means that all samples have the same size, and the Number of entries [in the Sample Size Table] field matches that of the Sample Count field in the single entry of the Time to Sample Table (some very large number). The documentation states this:
... if a video media has a constant frame rate, this table would have one entry and the count would be equal to the number of samples.
So the video has a constant Bit Rate. However, when reading the size entries from the Sample Size Table, they are all different and nonsensical: some are 0, while others are very large numbers up to around 40000. Why are they different if the video has a constant Bit Rate, or should I not be reading them in this case?
Another issue that I have found is that the single entry in the Time to Sample Table of the Time to Sample Atom has the following values:
Sample Count: some very large number (expected)
Sample Duration: 1
Unfortunately the documentation (from the above link) is very light here:
Time-to-sample table
A table that defines the duration of each sample in the media. Each table entry contains a count field and a duration field.
So what units do these 1 values use (Sample Duration & Sample Size)?
Any further help with calculating the correct Bit Rate would be greatly appreciated. Please note that I have been taking the Big-Endian-ness of the file into consideration and reversing the bytes of each field value before reading them.
UPDATE 2 >>>
I have managed to work out that the Sampling Rate is calculated like this:
Media Duration = Duration / Timescale (from the Movie Header Atom or Track Header Atom)
Sampling Rate = Sample Count (from the Time-to-Sample Atom) / Media Duration
I just need to crack the Bit Rate now and further help is needed.

This will get you what you want, "The Bit Rate that is shown in Windows Explorer", but not from the QT metadata. If it is not appropriate for some reason, maybe it will work as a fallback solution until you can work out the Atom based answer or as something to compare the QT Atom results to.
In short, if you want what Explorer shows, get it from Explorer:
// add reference to Microsoft Shell controls and Automation
// from the COM tab
using System.Collections.Generic;
using System.IO;
using Shell32;
class ShellInfo
{
    // "columns" we want:
    // FileName = 0;
    const int PerceivedType = 9;
    // FileKind = 11;
    // MediaBitrate = 28;
    // MediaLength = 27;
    static int[] info = {0, 9, 11, 27, 28};
    // note: author and title also available
    public static Dictionary<string, string> GetMediaProperties(string file)
    {
        Dictionary<string, string> xtd = new Dictionary<string, string>();
        Shell32.Shell shell = new Shell32.Shell();
        Shell32.Folder folder;
        folder = shell.NameSpace(Path.GetDirectoryName(file));
        foreach (var s in folder.Items())
        {
            if (folder.GetDetailsOf(s, 0).ToLowerInvariant() ==
                Path.GetFileName(file).ToLowerInvariant())
            {
                // see if it is video
                // possibly check FileKind ???
                if (folder.GetDetailsOf(s, PerceivedType).ToLowerInvariant() ==
                    "video")
                {
                    // add just the ones we want using the array of col indices
                    foreach (int n in info)
                    {
                        xtd.Add(folder.GetDetailsOf(folder.Items(), n),
                            folder.GetDetailsOf(s, n));
                    }
                }
                break;
            }
            // ToDo: freak out when it is not a video or audio type
            // depending what you are trying to do
        }
        return xtd;
    }
}
Usage:
Dictionary<string, string> myinfo;
myinfo = ShellInfo.GetMediaProperties(filepath);
The test file is a sample QT mov from Apple's site, so there is nothing special about it. The view in Explorer:
The results from GetMediaProperties:
The BitRate returned also matched the Audio BitRate returned by MediaProps and MediaTab (both use MediaInfo.DLL to gather all media property values).
The first 35 Shell extended properties are pretty well documented. I think as of Windows 7, this goes to 291(!). Many are file type specific for photos, emails etc. A few which may be of interest:
282: Data rate
283: Frame height
284: Frame rate
285: Frame width
286: Total bitrate
Data rate (282) is the Video BitRate (matches MediaInfo) ; Total Bitrate (286) is the combined a/v bitrate.
Windows 8 (UPDATE)
While the above code appears to run OK on Windows 7, for computers running Windows 8, to avoid a System.InvalidCastException on the following line...:
Shell shell = new Shell();
... the following code will need to be run to instantiate the Shell and Folder COM objects:
Type shellType = Type.GetTypeFromProgID("Shell.Application");
Object shell = Activator.CreateInstance(shellType);
Folder folder = (Folder)shellType.InvokeMember("NameSpace",
BindingFlags.InvokeMethod, null, shell,
new object[] { Path.GetDirectoryName(file) });
Solution found in the Instantiate Shell32.Shell object in Windows 8 question on the Visual Studio Forum.
Also, on Windows 8, it appears that more attributes have been added so that the maximum index is now 309 (with a few empty entries) and the above mentioned attributes have different indices:
298: Data rate
299: Frame height
300: Frame rate
301: Frame width
303: Total bitrate
It seems the values returned from Shell32 contain some characters which prevent a simple and direct conversion to an int value. For the Bit Rate:
string bRate = myinfo["Bit rate"]; // get return val
bRate = new string(bRate.Where(char.IsDigit).ToArray()); // tidy up
int bitRate = Convert.ToInt32(bRate);

It's not recorded anywhere. As a general rule, it is bad practice to store a value that can be calculated from other values. Plus, the bitrate can change over time within the same video. What you can do is add up the sizes of the frames you are interested in from the stsz box (atoms are called boxes in the ISO standard) and the sample durations from the stts box, and do the math.
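That math can be sketched as follows, assuming the stsz sample sizes and the stts entries have already been parsed out of the file. The type and method names are hypothetical, and the timeScale parameter is assumed to be the time scale in which the stts durations are expressed (to my understanding, the media's own time scale).

```csharp
using System;
using System.Collections.Generic;

// One entry of the time-to-sample (stts) table.
public struct TimeToSampleEntry
{
    public uint SampleCount;
    public uint SampleDuration; // in time scale units, not seconds
}

public static class BitRateCalculator
{
    // Average bit rate = total bits / total duration in seconds.
    public static double AverageBitsPerSecond(
        IEnumerable<uint> sampleSizesInBytes,
        IEnumerable<TimeToSampleEntry> timeToSample,
        uint timeScale)
    {
        if (timeScale == 0) throw new ArgumentException("timeScale must be positive");

        ulong totalBytes = 0;
        foreach (uint size in sampleSizesInBytes) totalBytes += size;

        ulong totalTicks = 0;
        foreach (TimeToSampleEntry e in timeToSample)
            totalTicks += (ulong)e.SampleCount * e.SampleDuration;
        if (totalTicks == 0) throw new ArgumentException("total duration is zero");

        double seconds = (double)totalTicks / timeScale;
        return totalBytes * 8.0 / seconds;
    }
}
```

Restricting both inputs to a subset of samples gives the bit rate over just that stretch of the video, which matters because the rate can vary over time.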

If you are OK with reading an informational value (you already have szatmary's answer for more accurate information), the shell reports this by parsing the file and reading metadata through the Media Foundation MPEG-4 Property Handler class.
Native API entry point for this is PSLookupPropertyHandlerCLSID and then regular COM instantiation for IPropertyStore interface and then reading the properties. Even if you don't have C# interface into this, you could easily get this through P/Invoke and interoperability layer.
The properties you can read this way are easily discovered by this helper app, wrapping the API: FilePropertyStore (Win32, x64). That is, what you see through the app is also available to you through the API mentioned.
Here is an excerpt from what it gets for a .MOV file (note PKEY_Audio_EncodingBitrate and PKEY_Video_EncodingBitrate):
## Property
* `PKEY_Media_Duration`, Length: `855000000` (`VT_UI8`) // `855,000,000`
* `PKEY_Audio_EncodingBitrate`, Bit rate: `43744` (`VT_UI4`) // `43,744`
* `PKEY_Audio_ChannelCount`, Channels: `1` (`VT_UI4`) // `1`
* `PKEY_Audio_Format`, Audio format: `{00001610-0000-0010-8000-00AA00389B71}` (`VT_LPWSTR`) // FourCC 0x00001610
* `PKEY_Audio_SampleRate`, Audio sample rate: `32000` (`VT_UI4`) // `32,000`
* `PKEY_Audio_SampleSize`, Audio sample size: `16` (`VT_UI4`) // `16`
* `PKEY_Audio_StreamNumber`: `1` (`VT_UI4`) // `1`
* `PKEY_Video_EncodingBitrate`, Data rate: `263352` (`VT_UI4`) // `263,352`
* `PKEY_Video_FrameWidth`, Frame width: `640` (`VT_UI4`) // `640`
* `PKEY_Video_FrameHeight`, Frame height: `480` (`VT_UI4`) // `480`
The method also works for other media file formats, getting data using the same keys through respective property handlers for other container formats.

Get document properties from PDF in iTextSharp

I'm trying to get some information out of a PDF file. I've tried using PdfSharp, and it has properties for the information I need, but it cannot open iref streams, so I've had to abandon it.
Instead I'm trying iTextSharp. So far I've managed to get some basic information out, like the title, author and subject, from the Info array.
However, i'm now after a bit more information, but cannot find where it is exposed (if it is exposed) in iTextSharp.... The information I am after is highlighted in the image below:
I cannot figure out where this information is stored. Any and all help will be much appreciated.

For documents encrypted using standard password encryption, you can retrieve the permissions after opening the file in a PdfReader pdfReader using
getPermissions() in case of iText/Java:
int permissions = pdfReader.getPermissions();
Permissions in case of iTextSharp/.Net:
int permissions = pdfReader.Permissions;
The int value returned is the P value of the encryption dictionary which contains
A set of flags specifying which operations shall be permitted when the document is opened with user access (see Table 22).
[...]
The value of the P entry shall be interpreted as an unsigned 32-bit quantity containing a set of flags specifying which access permissions shall be granted when the document is opened with user access. Table 22 shows the meanings of these flags. Bit positions within the flag word shall be numbered from 1 (low-order) to 32 (high order). A 1 bit in any position shall enable the corresponding access permission.
[...]
Bit position: Meaning
3: (Security handlers of revision 2) Print the document. (Security handlers of revision 3 or greater) Print the document (possibly not at the highest quality level, depending on whether bit 12 is also set).
4: Modify the contents of the document by operations other than those controlled by bits 6, 9, and 11.
5: (Security handlers of revision 2) Copy or otherwise extract text and graphics from the document, including extracting text and graphics (in support of accessibility to users with disabilities or for other purposes). (Security handlers of revision 3 or greater) Copy or otherwise extract text and graphics from the document by operations other than that controlled by bit 10.
6: Add or modify text annotations, fill in interactive form fields, and, if bit 4 is also set, create or modify interactive form fields (including signature fields).
9: (Security handlers of revision 3 or greater) Fill in existing interactive form fields (including signature fields), even if bit 6 is clear.
10: (Security handlers of revision 3 or greater) Extract text and graphics (in support of accessibility to users with disabilities or for other purposes).
11: (Security handlers of revision 3 or greater) Assemble the document (insert, rotate, or delete pages and create bookmarks or thumbnail images), even if bit 4 is clear.
12: (Security handlers of revision 3 or greater) Print the document to a representation from which a faithful digital copy of the PDF content could be generated. When this bit is clear (and bit 3 is set), printing is limited to a low-level representation of the appearance, possibly of degraded quality.
(Section 7.6.3.2 "Standard Encryption Dictionary" in the PDF specification ISO 32000-1)
You can use the PdfWriter.ALLOW_* constants in this context.
Concerning the dialog screenshot you made, though, be aware that the operations effectively allowed do not only depend on the PDF document but also on the PDF viewer! Otherwise you might be caught in the same trap as the OP of this question.
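To make the bit numbering concrete, here is a small sketch that tests individual bits of a P value by hand. It does not use iTextSharp; the class PdfPermissionBits and its short labels are assumptions made for illustration, with the bit positions taken from the table quoted above.

```csharp
using System;
using System.Collections.Generic;

public static class PdfPermissionBits
{
    // Bit positions from Table 22, paired with short illustrative labels.
    private static readonly int[] Positions = { 3, 4, 5, 6, 9, 10, 11, 12 };
    private static readonly string[] Labels =
    {
        "Print", "Modify contents", "Copy or extract", "Modify annotations",
        "Fill in forms", "Extract for accessibility", "Assemble", "High-quality print"
    };

    // Bit positions are numbered from 1 (low-order), so bit n has mask 1 << (n - 1).
    public static bool IsBitSet(int p, int bitPosition)
    {
        return (p & (1 << (bitPosition - 1))) != 0;
    }

    // Lists the labels of all permission bits that are set in p.
    public static List<string> Describe(int p)
    {
        var granted = new List<string>();
        for (int i = 0; i < Positions.Length; i++)
        {
            if (IsBitSet(p, Positions[i])) granted.Add(Labels[i]);
        }
        return granted;
    }
}
```

Calling PdfPermissionBits.Describe(pdfReader.Permissions) would then list what the document itself grants, keeping in mind that a viewer may still behave differently.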

Thanks to mkl for your answer, it was part of the story, but here is the answer which you helped me find:
using (var pdf = new PdfReader(File))
{
Console.WriteLine(PdfEncryptor.IsModifyAnnotationsAllowed(pdf.Permissions));
}
The PdfEncryptor is what was missing, it converts the P value into a simple bool for yes or no. Other methods on there are:
IsAssemblyAllowed
IsCopyAllowed
IsDegradedPrintingAllowed
IsFillInAllowed
IsModifyAnnotationsAllowed
IsModifyContentsAllowed
IsPrintingAllowed
IsScreenReadersAllowed
As for the security method part, this is what I went with:
using (var pdf = new PdfReader(File))
{
Console.WriteLine(!pdf.IsOpenedWithFullPermissions == Expected);
}

C# Zxing Encode 1d EAN8

I want to generate a 1D EAN8 barcode using C# ZXing. I have only been able to find code examples and documentation for generating a 2D QR code:
var writer = new BarcodeWriter
{
Format = BarcodeFormat.QR_CODE,
Options = new QrCodeEncodingOptions
{
Height = height,
Width = width
}
};
return writer.Write(textForEncoding);
which I can run and works fine, but there is no "1DCodeEncodingOptions" or similarly named function. I tried
var writer = new BarcodeWriter
{
Format = BarcodeFormat.EAN_8
};
return writer.Write("1234567");
but it throws an index error.
Edit: I have the syntax correct now, but it is not producing a proper barcode because I do not know the size it expects, and there seems to be no default.
using ZXing;
using ZXing.OneD;
var writer = new BarcodeWriter
{
Format = BarcodeFormat.EAN_8,
Options = new ZXing.Common.EncodingOptions
{
Height = 100,
Width = 300
}
};
return writer.Write("12345678");

12345678 is not a valid EAN8 barcode. The check digit for 1234567 is 0. See EAN 8 : How to calculate checksum digit? for how to calculate checksums.
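As a quick check of that claim, the standard EAN-8 check digit calculation can be sketched like this. The class and method names are made up for the example and are not part of ZXing: the seven data digits are weighted 3, 1, 3, 1, 3, 1, 3 from left to right, and the check digit brings the weighted sum up to a multiple of 10.

```csharp
using System;

public static class Ean8
{
    // Computes the check digit for the first seven digits of an EAN-8 code.
    public static int CheckDigit(string firstSeven)
    {
        if (firstSeven == null || firstSeven.Length != 7)
            throw new ArgumentException("Expected exactly 7 digits.");

        int sum = 0;
        for (int i = 0; i < 7; i++)
        {
            char c = firstSeven[i];
            if (c < '0' || c > '9')
                throw new ArgumentException("Only digits are allowed.");
            int digit = c - '0';
            // Positions 1, 3, 5, 7 (index 0, 2, 4, 6) carry weight 3.
            sum += (i % 2 == 0) ? digit * 3 : digit;
        }
        return (10 - sum % 10) % 10;
    }

    // Appends the check digit to produce the full 8-digit code.
    public static string WithCheckDigit(string firstSeven)
    {
        return firstSeven + CheckDigit(firstSeven);
    }
}
```

For 1234567 the weighted sum is 60, so the check digit is 0 and the full code to pass to the writer would be 12345670.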
ZXing won't stop you creating invalid barcodes (at least in the version 0.11 I'm using, although the current source on Codeplex looks like it does), but they won't scan. The scanner uses the checksum to ensure that it has read the data correctly.
If you intend to use an EAN8 in commerce, you will need to get a GTIN-8 from your national GS1 Member Organization. If you only intend to sell them in your store, you should use one of the restricted distribution prefixes.
If you don't need to put your products in the retail supply chain, I'd recommend a different barcode format.
Interleaved 2 of 5 (BarcodeFormat.ITF) is very compact but can only contain digits; it doesn't have any self-checking and there's a design flaw that allows the barcode to be misread from some angles. It's recommended that you put black bars (called Bearer Bars) across the top and bottom of the barcode. I can't see an option for this in ZXing.
The next most compact format is Code 128 (BarcodeFormat.CODE_128), using Code Set C. This encodes two digits in one module (one block of six bars and spaces). The format is self-checking (there is always a check character, which is stripped off by the scanner). Some scanners don't handle Code Set C properly. To force Code Set B, use Code128EncodingOptions in place of Common.EncodingOptions and set ForceCodesetB to true.
