iTextSharp produce PDF from existing PDF template


I am looking at the feasibility of creating something using C# and iTextSharp that can take a PDF template and replace various placeholder values with actual values retrieved from a database. Essentially a PDF mail merge. I have the iText in Action book, but it covers rather a lot of stuff I don't need and I am struggling to find anything related to what I want to do. I am happy to use PDF fields as the placeholders so long as the merged/flattened form does not look like it has fields in it; the output document should look like a mail-merged letter and not a form that has been filled in. In an ideal world I just want to search the text content of the PDF and then replace text placeholders with their correct field values, à la Word mail merge.
Can anyone advise me of the best approach to this and point me in the direction of the most helpful iTextSharp classes to use, or, if you know the iText in Action book, give me a pointer to the most helpful section to read?

Build your template sans fields in your page-layout/text-editor of choice.
Save to PDF.
Open that PDF and add fields to it. This is easy to do in Acrobat Pro (you could download a trial if need be). It's also possible in iText, just much harder.
In either case, you want to set your form fields to have no border, and no background... that way only their contents will be visible, no boxes to make your fields look like fields.
Merging field data into a form is Quite Trivial with iText (forgive my Java, I don't know much about C#):
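// iText 5 package names; older releases used com.lowagie.text instead
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.pdf.AcroFields;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;

void fillPDF( String filePath, OutputStream outputFileStream, Map<String, String> fieldVals ) throws IOException, DocumentException {
    PdfReader reader = new PdfReader(filePath);
    PdfStamper stamper = new PdfStamper( reader, outputFileStream );
    stamper.setFormFlattening(true); // flatten so no interactive fields remain in the output
    AcroFields fields = stamper.getAcroFields();
    for (String fldName : fieldVals.keySet()) {
        fields.setField( fldName, fieldVals.get(fldName) );
    }
    stamper.close();
}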
This ignores list boxes with multiple selections (and exceptions), but other than that should be ready to go. Given that you're doing a mail-merge type thing, I don't think multiple selections will be much of an issue.

Related

How do I detect a signature line in a PDF document and then insert a signature?

Last week I was asked to build an application for a blind man to programmatically fill out a PDF document. The problem he is having is that if the fields in the document aren't labeled correctly then he is not able to put his signature and other information into the document in the correct place.
My first approach was to attempt to read the document using iTextSharp and then insert his signature into the field which was most likely to be the signature box:
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

public string[] MassFieldEdit(IDictionary<string, string> userData, string originalDocument, string edittedDocument, bool flatten)
{
    PdfReader reader = new PdfReader(originalDocument);
    reader.SelectPages("1-" + reader.NumberOfPages.ToString());
    using (PdfStamper stamper = new PdfStamper(reader, new FileStream(edittedDocument, FileMode.Create)))
    {
        AcroFields form = stamper.AcroFields;
        ICollection<string> fieldKeys = form.Fields.Keys;
        // Start with every field name; remove each one that gets filled
        List<string> leftover = new List<string>(fieldKeys);
        foreach (string fieldKey in fieldKeys)
        {
            foreach (KeyValuePair<string, string> s in userData)
            {
                // Replace the form field with my custom data
                if (fieldKey.ToLower().Contains(s.Key.ToLower()))
                {
                    form.SetField(fieldKey, s.Value);
                    leftover.Remove(fieldKey);
                }
            }
        }
        // The below will make sure the fields are not editable in
        // the output PDF.
        stamper.FormFlattening = flatten;
        return leftover.ToArray();
    }
}
This works by taking a dictionary set, the key being a word or phrase, checking that against the PDF fields and then inserting the value into the fields if the field matches the word or phrase in the key.
The signature box before my program edits it.
The signature box after.
But the problem I have now is that if no field exists then although it may have "sign here" right next to the dotted line, there is no way to insert text onto the dotted line without knowing exactly where the dotted line is, nor can my user select the dotted line because that defeats the point of the program.
I have looked at a number of previous questions and answers, including:
How do I get a TextField from AcroFields using iText/Sharp?
How to convert PDF to WORD in c#
Insert text in existing pdf with itextsharp
ITextSharp insert text to an existing pdf
I need a way to detect the signature line and then insert his name onto the signature line with more certainty than taking pot shots at field names. Both in situations where a correctly labeled field exists and also in situations where the signature line may be no more than a line of text which says "sign here".

The robust solution (aka "hard work solution")
1. Implement IEventListener (an iText 7 interface).
2. Use IEventListener to get notified of text rendering instructions and line drawing operations.
3. Rendering instructions do not always appear in logical (reading) order. Fix that by implementing a comparator for these objects.
4. Sort according to the comparator.
5. Use language detection to determine the language (an n-gram approach is simple, but should suffice).
6. Dictionary attack. Look for all occurrences of words that signify "sign here" in whatever language the document is written in (hence step 5).
7. In case of multiple candidates, or no candidates, use the line rendering instructions to look for a likely candidate for the infamous "dotted line".
This approach is not easy, but there is a lot of research into recognition of structural elements in PDF files. In particular, if you run a Google Scholar search, you'll find loads of helpful articles where people have tried detecting tables, lists, paragraphs, etc.
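To make the sorting and dictionary steps a bit more concrete, here is a rough sketch in plain Java with no iText dependency. TextChunk, SignatureLineFinder and LINE_TOLERANCE are hypothetical names invented for this illustration. A real listener would fill the chunk list from the render events, and the trigger phrases would come from whatever language detection decided. It does not cover the line-drawing fallback.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for what a render listener would collect: a piece of
// text plus the start of its baseline (PDF coordinates, y grows upwards).
final class TextChunk {
    final String text;
    final float x;
    final float y;

    TextChunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class SignatureLineFinder {
    // Assumed bucket height: baselines are grouped into lines by rounding
    // y / LINE_TOLERANCE. Bucketing keeps the comparator consistent, but two
    // close baselines can still land in neighbouring buckets.
    static final float LINE_TOLERANCE = 2f;

    // Top of the page first, then left to right within a line.
    static final Comparator<TextChunk> READING_ORDER =
        Comparator.comparingInt((TextChunk c) -> -Math.round(c.y / LINE_TOLERANCE))
                  .thenComparingDouble(c -> c.x);

    // Sorts the chunks and joins the chunks of each line with single spaces.
    static List<String> linesInReadingOrder(List<TextChunk> chunks) {
        List<TextChunk> sorted = new ArrayList<>(chunks);
        sorted.sort(READING_ORDER);
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        int currentBucket = Integer.MIN_VALUE;
        for (TextChunk c : sorted) {
            int bucket = Math.round(c.y / LINE_TOLERANCE);
            if (bucket != currentBucket && line.length() > 0) {
                lines.add(line.toString());
                line.setLength(0);
            }
            if (line.length() > 0) {
                line.append(' ');
            }
            line.append(c.text);
            currentBucket = bucket;
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    // Returns every line that contains one of the trigger phrases (case-insensitive).
    static List<String> findCandidateLines(List<TextChunk> chunks, List<String> phrases) {
        List<String> hits = new ArrayList<>();
        for (String line : linesInReadingOrder(chunks)) {
            String lower = line.toLowerCase(Locale.ROOT);
            for (String phrase : phrases) {
                if (lower.contains(phrase.toLowerCase(Locale.ROOT))) {
                    hits.add(line);
                    break;
                }
            }
        }
        return hits;
    }
}
```

Joining the chunks per line before searching matters because "Sign" and "here" may well arrive as two separate rendering instructions.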

Get PDF content

I want to read some content from PDF files. I have just started, and before getting further into it I want to know what the right approach is.
The iTextSharp reader may be helpful in that case, so I converted the PDF into text using:
public static string pdfText(string path)
{
    PdfReader reader = new PdfReader(path);
    string text = string.Empty;
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    return text;
}
I'm still wondering if this approach seems OK, or if I should convert this pdf into excel and then read the content which I want instead.
Professionals' thoughts would be appreciated.

With iText, you can also choose a specific strategy for extracting the text. But keep in mind that this is always a heuristic process.
PDF documents essentially contain only the instructions needed to render the document for a viewer. So there is no concept of "text", only something like "draw character A at position 420, 890".
In order for any text-extraction to work, it needs to make some guesses on when two characters are close enough together that they should be concatenated, and when they should be apart.
Incidentally, iText does this based on the width of a single space character in the font that is being used.
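A toy version of that guess, in plain Java and not iText's actual code, might look like the sketch below. PlacedGlyph and GapBasedJoiner are hypothetical names, and the "half a space width" threshold is an assumption chosen for illustration.

```java
import java.util.List;

// Hypothetical record of one drawn character: what it is, where it starts,
// and how wide it is (all in the same units as the space width).
final class PlacedGlyph {
    final char ch;
    final float x;
    final float width;

    PlacedGlyph(char ch, float x, float width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

final class GapBasedJoiner {
    // Joins glyphs of one line (already sorted left to right). A space is
    // inserted when the gap between the end of one glyph and the start of the
    // next exceeds half the font's space width (assumed threshold).
    static String join(List<PlacedGlyph> glyphs, float spaceWidth) {
        StringBuilder sb = new StringBuilder();
        PlacedGlyph prev = null;
        for (PlacedGlyph g : glyphs) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > spaceWidth / 2f) {
                    sb.append(' ');
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }
}
```

With tightly kerned or justified text the gaps shift, which is exactly why extraction stays a heuristic.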
Keep in mind there could also be ActualText. This is a sort of text that gets hidden in the document and is only used in extraction. It makes it possible to have the document render a character like "œ" (ligature version) which gets extracted as "oe" (non-ligature version).
Depending on your input documents, you might want to look into the different implementations of ITextExtractionStrategy.

Extracting text from PDF with iTextSharp is not working for some PDF

I am using the following code to extract text from the first page of PDF files with iTextSharp :
public static string ExtractTextFromPDFFirstPage(string fileName)
{
    string text = null;
    using (var pdfReader = new PdfReader(fileName))
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        text = PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
        text = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
    }
    return text;
}
It works quite well for many PDFs, but not for some others.
Working PDF : http://data.hexagosoft.com/LFBO.pdf
Not working PDF : http://data.hexagosoft.com/LFBP.pdf
These two PDFs seem to be quite similar, but one is working and the other is not.
I guess the fact that their producer tag is not the same is a clue here.
Another clue is that this function works for any other page of the PDF without a chart.
I also tried with Ghostscript, without success.
The Encoding line seems to be useless as well.
How can I extract the text of the first page of the non-working PDF using iTextSharp?
Thanks

Both documents use fonts with unofficial glyph names in their Encoding/Differences array, and neither uses a ToUnicode map. The glyph naming seems to be fairly straightforward: the number following the MT prefix is the ASCII code of the used glyph.
The first document works, because the mapping is not changed at all and iText will use the default encoding (I guess):
/Differences[65/MT65/MT66/MT67 71/MT71/MT72/MT73 76/MT76 78/MT78 83/MT83]
The other document really changes the mapping:
/Differences [2 /MT76 /MT105 /MT103 /MT104 /MT116 /MT110 /MT32 /MT97 /MT100 /MT115 /MT58 ]
This means, for example, that character code 2 should map to the glyph named MT76, which is an unofficial/private glyph name that iText doesn't know. iText therefore has no information beyond the character code 2 and will use this code for the final result (I guess).
Without implementing logic for the MT-prefixed glyph names, it's impossible to get the correct text out of this document. In any case, nowhere is it defined that a glyph name beginning with MT followed by an integer can be mapped to the ASCII value. That's simply by accident, or implemented by the font designer or creation tool, wherever it came from.
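If you decide to rely on that accident anyway, the mapping logic itself is small. The following is a minimal sketch in plain Java, assuming the "MT + decimal ASCII code" convention really holds for your files. MtGlyphDecoder is a hypothetical helper, not part of iText, and it works on the Differences array as a string and on raw character codes you would have to pull out of the content stream yourself.

```java
import java.util.HashMap;
import java.util.Map;

final class MtGlyphDecoder {
    // Parses the body of a /Differences array, e.g. "2 /MT76 /MT105", into a
    // map from character code to character. A number sets the running code;
    // each glyph name that follows takes the next consecutive code.
    static Map<Integer, Character> parseDifferences(String differences) {
        Map<Integer, Character> map = new HashMap<>();
        // Drop brackets and put a space before every '/', so that
        // "65/MT65/MT66" tokenizes the same way as "65 /MT65 /MT66".
        String normalized = differences.replace("[", " ").replace("]", " ")
                                       .replace("/", " /").trim();
        int code = 0;
        for (String token : normalized.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.startsWith("/")) {
                String name = token.substring(1);
                // Assumption: "MT" followed by digits means that ASCII code.
                if (name.matches("MT\\d+")) {
                    map.put(code, (char) Integer.parseInt(name.substring(2)));
                }
                code++;
            } else {
                code = Integer.parseInt(token);
            }
        }
        return map;
    }

    // Translates raw character codes; codes without a mapping are kept as-is.
    static String decode(int[] codes, Map<Integer, Character> map) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            Character ch = map.get(c);
            sb.append(ch != null ? ch.charValue() : (char) c);
        }
        return sb.toString();
    }
}
```

Fed the second document's array, codes 2 to 12 come out as the characters with ASCII values 76, 105, 103 and so on, which is the text iText could not recover by itself.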

The second PDF (LFBP.pdf) contains an incorrect mapping from glyphs to text, i.e. you see the correct glyphs but the text representation was not correctly encoded for some reason during the generation of this PDF. If you have a lot of files like this, then a working approach could be:
detect broken pages while extracting text by searching for some phrase that should appear on every page, maybe like "service"
process these pages separately using OCR with tools like Tesseract with a .NET wrapper
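The detection half of that can be as simple as the sketch below (plain Java; BrokenPageDetector is a hypothetical name, and the per-page strings are assumed to come from whatever extractor you already use). The OCR step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BrokenPageDetector {
    // pageTexts.get(0) holds the extracted text of page 1, and so on.
    // Returns the 1-based numbers of the pages that lack the expected phrase
    // and are therefore candidates for OCR.
    static List<Integer> pagesMissingPhrase(List<String> pageTexts, String expectedPhrase) {
        String needle = expectedPhrase.toLowerCase(Locale.ROOT);
        List<Integer> suspects = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || !text.toLowerCase(Locale.ROOT).contains(needle)) {
                suspects.add(i + 1);
            }
        }
        return suspects;
    }
}
```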

Is there any way to assign Id's to paragraphs in Open XML SDK 2.5?

I'm working on an application which has to create Word documents using the Office Open XML SDK 2.5. The idea I have now is to start from a template with an empty body (so I have all the namespaces etc. defined already) and add Paragraphs to it. If I need images I will add the ImageParts and try to give each ImagePart the Id present in the predefined paragraph part which will contain the image. I will store the paragraphs as XML in a database, fetch the ones I need, fill in/modify some values if needed and insert them into my Word document. But this is the tricky part: how can I easily insert them in a way so that I don't have to query on their content later on to find one of the paragraphs? In other words, I need Ids. I have some options in mind:
For each possible paragraph I have, manually create a SdtBlock. This SdtBlock will have an Id which matches the Id of each paragraph in the database. This seems like a lot of manual work though, and I'd rather be able to create future word documents easier...
I chose this approach but I insert Building Blocks which can be stored in templates with a specific tagname.
Create the paragraphs, copy the xml from the developer tool, and manually add a ParagraphId. This seems even more of a nightmare though, because for every future new paragraphs I will have to create new Id's etc. Also it would be impossible to insert tables as there is no way (afaik) to give those an Id.
Work with bookmarks to know where to insert the data. I don't really like this either as bookmarks are visible for everyone. I know I can replace them, but then I don't have any way to identify individual paragraphs later on.
**** my database and just add everything in the template :D Remove the paragraphs I don't need by deleting the bookmarks with their content. This idea seems the worst of all though as I don't want to depend on having a templatefile with all possible content per word-file I need to generate.
Anyone with experience in OpenXml who knows which approach would be the best? Maybe there is another approach which is better and I have completely overlooked? The ideal solution would be that I can add Ids in Office Word but that's a no-go as I haven't found anything to do that yet.
Thanks in advance!

Content Controls (sdt) were designed for this, although I'm not sure the designers ever contemplated "targeting" each and every paragraph in the document...
Back in the 2003/2007 days it was possible to add custom XML mark-up to a Word document, which would have been exactly what you're looking for. But Microsoft lost a patent court case around 2009 and had to pull the functionality. So content controls are really your only good choice.
Your approach could, possibly, be combined with the BuildingBlocks concept. BuildingBlocks are Word content stored in a Word template file as valid Word Open XML. They can be assigned to "galleries" and categorized. There is a Content Control of type BuildingBlock that can be associated with a specific Gallery and Category which might help you to a certain extent and would be an alternative to storing content in a database.

OK, I did a little research: you can do it in strict OpenXML, but only before you open your file in Word. Word will remove everything it cannot read.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true)) {
    document.MainDocumentPart.Document.Body.Ancestors().First()
        .SetAttribute(new OpenXmlAttribute() {
            LocalName = "someIdName",
            Value = "111" });
}
Here, for example, I set the attribute "someIdName", which doesn't exist in OpenXML, on some random element. You can set it anywhere and use it as an id.

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working at paragraph level (concatenating the content and then replacing the content of the paragraph) is that I fear I'd lose the formatting that should be kept, as in
Do not use the name admin

Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.

OpenXML is indeed fragmenting your text:
I created a library that does exactly this: render a Word template with the values from a JSON.
From the documentation of docxtemplater:
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space="preserve"> to ensure that the space is not stripped out if there is any in your variables.
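The merging idea can be sketched independently of any DOCX library. The following plain-Java version is an illustration only, not docxtemplater's actual implementation: it treats each <w:t> as a plain string, RunMerger is a hypothetical name, and only the first occurrence of the tag is handled.

```java
import java.util.ArrayList;
import java.util.List;

final class RunMerger {
    // Replaces the first occurrence of tag, even when it is split across
    // several runs. The runs touched by the tag are merged into one (in a real
    // document the merged run would keep the first run's formatting); all
    // other runs are left untouched.
    static List<String> replaceTag(List<String> runs, String tag, String value) {
        StringBuilder all = new StringBuilder();
        for (String r : runs) {
            all.append(r);
        }
        int start = tag.isEmpty() ? -1 : all.indexOf(tag);
        if (start < 0) {
            return new ArrayList<>(runs); // nothing to replace
        }
        int end = start + tag.length(); // exclusive

        // Find the first and last run that overlap [start, end).
        int first = -1;
        int last = -1;
        int offset = 0;
        for (int i = 0; i < runs.size(); i++) {
            int runStart = offset;
            offset += runs.get(i).length();
            if (offset > start && runStart < end) {
                if (first < 0) {
                    first = i;
                }
                last = i;
            }
        }

        StringBuilder merged = new StringBuilder();
        for (int i = first; i <= last; i++) {
            merged.append(runs.get(i));
        }
        List<String> result = new ArrayList<>(runs.subList(0, first));
        result.add(merged.toString().replace(tag, value));
        result.addAll(runs.subList(last + 1, runs.size()));
        return result;
    }
}
```

Calling replaceTag on the four runs from the example with the tag {name} and the value John gives back three runs: Hello, John ! and How are you ?.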
