How to use LocationExtractionStrategy in iText to extract from standardized pdf?

How to use LocationExtractionStrategy in iText to extract from standardized pdf? - c#

I have a standardized pdf invoice. I would like to use iText7 to extract certain data elements from the pdf. After an extensive search on the net, I was not able to find any resources or walkthroughs to show how iText7 can be used to extract text from pdf. I can extract fine using SimpleTextExtractionStrategy, But i am looking for examples on how to use LocationTextExtractionStrategy to extract text from given boxes in a pdf.
I read that you have to locate the "boxes" in your document first, but really coudlnt find any code or information on that. Thanks
PdfReader pdfReader = new PdfReader(path);
PdfDocument pdfDoc = new PdfDocument(pdfReader);
for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string pageContent = PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy);
}

Related

How to read manually added text within a pdf file with c#?

i'm using iTextSharp with this C# code:
string parsedText = string.Empty;
PdfReader reader = new PdfReader(pdfPath);
ITextExtractionStrategy its = new LocationTextExtractionStrategy();
parsedText = PdfTextExtractor.GetTextFromPage(reader, 1, its);
parsedText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(parsedText)));
It parses the pdf as expected, but it does not parse text, that is manually added with tools like FoxItReader oder NuancePDF.
Our accounting is manually adding an internal invoicenumber on each pdf and I need to parse that number. For some reason i can't find it.
It looks like it is on another "layer" of something that is not parsed.
Any ideas how to read those numbers?
Thanks

It is possible that the internal invoice number is being added as an annotation, rather than as actual text on the page.
Have you tried iText's facilities for extracting annotations to see if there are any on the page?

How to open an existing PDF file with Migradoc PDF library

I am trying to use the Migradoc library from PDFSharp (http://www.pdfsharp.net/) to print pdf files. So far I have found that Migradoc does support printing through its MigraDoc.Rendering.Printing.MigraDocPrintDocument class. However, I have not found a way to actually open an existing PDF file with MigraDoc.
I did find a way to open an existing PDF file using PDFSharp, but I cannot successfully convert a PDFSharp.Pdf.PdfDocument into a MigraDoc.DocumentObjectModel.Document object. So far I have not found the MigraDoc and PDFSharp documentation to be very helpful.
Does anyone have any experience using these libraries to work with existing PDF files?
I wrote the following code with help from this sample, but the result when my input PDF is 2 pages is an output PDF with 2 blank pages.
using MigraDoc.DocumentObjectModel;
using MigraDoc.Rendering;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
...
public void PrintPDF(string filePath, string outFilePath)
{
var document = new Document();
var docRenderer = new DocumentRenderer(document);
docRenderer.PrepareDocument();
var inPdfDoc = PdfReader.Open(filePath, PdfDocumentOpenMode.Modify);
for (var i = 0; i < inPdfDoc.PageCount; i++)
{
document.AddSection();
docRenderer.PrepareDocument();
var page = inPdfDoc.Pages[i];
var gfx = XGraphics.FromPdfPage(page);
docRenderer.RenderPage(gfx, i+1);
}
var renderer = new PdfDocumentRenderer();
renderer.Document = document;
renderer.RenderDocument();
renderer.PdfDocument.Save(outFilePath);
}

Your code modifies the inPdfDoc in memory without saving the changes. Complicated code without any visual effect.
MigraDoc cannot open PDF files, MigraDoc cannot print PDF files, PDFsharp cannot print PDF files.
http://www.pdfsharp.net/wiki/PDFsharpFAQ.ashx

Merging two PDF pages on top of each other

I am looking for a way to merge the content of two pdf pages.
It could be a watermark, an image or whatever.
The scenario is as follows:
I have a Word-addin that allows the user to create different templates for different customers based on several template forms. For each new customer, the user can provide a new letter paper containing header image / logos and footer. This shall be applied anyhow at runtime. Could be an image that is loaded directly into the header of the template (then I would need to render pdf to image, for the letter paper will mostly be provided as pdf-file) or when exporting the document (merging letter paper as background).
But the template shall not be accessible by the user, so this must be done programmatically.
So far, I tried Pdfsharp library, which does not support neither the version of my provided backpapers, nor the version of my documents created in Word 2007.
iTextSharp seemed very promising, but I could not manage to merge the contents so far.
I also tried pdftk.exe, but even when i ran it manually from command line, I got the error: "Done. Input errors, so no output created."
It does not matter how it is handled, but the output matters.
I forgot to mention, there is a whiteline created in the Word-template for archiving purposes, so this part may not be added as image or it has to be added afterwords into the output document.
Thanks in advance!

StampStationery.cs, a sample from the Webified iTextSharp Examples which essentially are the C#/iTextSharp versions of the Java/iText samples from the book iText in Action — 2nd Edition, does show how to add the contents of a page from one PDF document as stationery behind the content of each page of another PDF.
The central method is this:
public byte[] ManipulatePdf(byte[] src, byte[] stationery)
{
// Create readers
PdfReader reader = new PdfReader(src);
PdfReader s_reader = new PdfReader(stationery);
using (MemoryStream ms = new MemoryStream())
{
// Create the stamper
using (PdfStamper stamper = new PdfStamper(reader, ms))
{
// Add the stationery to each page
PdfImportedPage page = stamper.GetImportedPage(s_reader, 1);
int n = reader.NumberOfPages;
PdfContentByte background;
for (int i = 1; i <= n; i++)
{
background = stamper.GetUnderContent(i);
background.AddTemplate(page, 0, 0);
}
}
return ms.ToArray();
}
}
This method returns the manipulated PDF as a byte[].

Strip Adobe Reader and Version requirements from PDF before outputting it to browser

I am planning on using pdf.js to have PDF context via the browser with Javascript. The problem is that some PDFs, the ones I am using, require Adobe's Reader with a specific Version. pdf.js does does not yet(ever?) support spoofing of these. What I need to know is if there's a way in C# to open the PDF and remove these Reader and Version requirements and how to do it. I was planning on using itextsharp to do other PDF manipulation server-side so an example using this would be most helpful. I plan on serving these as an ActionResult from an ajax request via MVC 4, so a MemoryStream would be most helpful at the end of this manipulation.

So in the end pdf.js was unable to do what I needed it too, however, what I was able to do was convert the Xfa/Pdf to a C# object then send the pages as needed via Json to my Javascript for rendering in the HTML5 Canvas. The code below takes an xfa-in-a-pdf file and turns it into a C# object with the help of itextsharp:
PdfReader.unethicalreading = true;
PdfReader reader = new PdfReader(new FileStream(Statics.PdfUploadLocation + PdfFileName, FileMode.Open, FileAccess.Read));
XfaForm xfaForm = new XfaForm(reader);
XDocument xDoc = XDocument.Parse(xfaForm.DomDocument.InnerXml);
string xfaNamespace = #"{http://www.xfa.org/schema/xfa-template/2.6/}";
List<XElement> formPages = xDoc.Descendants(xfaNamespace + "subform").Descendants(xfaNamespace + "subform").ToList();
TotalPages = formPages.Count();
var fieldIndex = 0;
RawPdfFields = new List<XfaField>();
for (int page = 0; page < formPages.Count(); page++)
{
RawPdfFields.AddRange(formPages[page].Descendants(xfaNamespace + "field")
.Select(x => new XfaField
{
Page = page,
Index = fieldIndex++,
Name = (string)x.Attribute("name"),
Height = GetUnitFromPossibleString((string)x.Attribute("h")),
Width = GetUnitFromPossibleString((string)x.Attribute("w")),
XPosition = GetUnitFromPossibleString((string)x.Attribute("x")),
YPosition = GetUnitFromPossibleString((string)x.Attribute("y")),
Reference = GetReference(x.Descendants(xfaNamespace + "traverse")),
AssistSpeak = GetAssistSpeak(x.Descendants(xfaNamespace + "speak"))
}).ToList());
}

Your PDF file n-400.pdf uses the Adobe XML Forms Architecture (XFA). This means you require a viewer that also supports XFA which pdf.js seemingly does not.
Such a PDF normally contains some standard PDF content which indicates that the PDF requires some viewer that supports XFA. In your case the content contains
If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document.
This actually indicates what a XFA enabled viewer does, it renders some pages based upon information in the XFA XML data and displays it instead of the PDF style page descriptions.
While being defined proprietarily by Adobe, the PDF specification ISO 32000-1 describes how XFA data is to be embedded in a PDF document, cf. section 12.7.8 XFA Forms.
If you only need those forms in a flattened state, you might want to have a look at iText Demo: Dynamic XFA forms in PDF.

How to read a PDF Portfolio using iTextSharp

I'm using iTextSharp, in a C# app that reads PDF files and breaks out the pages as separate PDF documents. It works well, except in the case of portfolios. Now I'm trying to figure out how to read a PDF portfolio (or Collection, as they seem to be called in iText) that contains two embedded PDF documents. I want to simply open the portfolio, enumerate the embedded files and then save them as separate, simple PDF files.
There's a good example of how to programmatically create a PDF portfolio, here:
Kubrick Collection Example
But I haven't seen any examples that read portfolios. Any help would be much appreciated!

The example you referenced adds the embedded files as document-level attachments. So you can extract the files like this:
PdfReader reader = new PdfReader(readerPath);
PdfDictionary root = reader.Catalog;
PdfDictionary documentnames = root.GetAsDict(PdfName.NAMES);
PdfDictionary embeddedfiles =
documentnames.GetAsDict(PdfName.EMBEDDEDFILES);
PdfArray filespecs = embeddedfiles.GetAsArray(PdfName.NAMES);
for (int i = 0; i < filespecs.Size; ) {
filespecs.GetAsString(i++);
PdfDictionary filespec = filespecs.GetAsDict(i++);
PdfDictionary refs = filespec.GetAsDict(PdfName.EF);
foreach (PdfName key in refs.Keys) {
PRStream stream = (PRStream) PdfReader.GetPdfObject(
refs.GetAsIndirectObject(key)
);
using (FileStream fs = new FileStream(
filespec.GetAsString(key).ToString(), FileMode.OpenOrCreate
)){
byte[] attachment = PdfReader.GetStreamBytes(stream);
fs.Write(attachment, 0, attachment.Length);
}
}
}
Pass the output file from the Kubrick Collection Example you referenced to the PdfReader constructor (readerPath) if you want to test this.
Java version: part4.chapter16.KubrickDocumentary
C# version.
Hopefully I'll have time to update the C# examples this month from version 5.2.0.0 (the iTextSharp version is about three weeks behind the Java version right now).

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to use LocationExtractionStrategy in iText to extract from standardized pdf? - c#

Related

How to read manually added text within a pdf file with c#?

How to open an existing PDF file with Migradoc PDF library

Merging two PDF pages on top of each other

Strip Adobe Reader and Version requirements from PDF before outputting it to browser

How to read a PDF Portfolio using iTextSharp

Categories

Resources