I am trying to write code that opens an existing PDF form (previously created with Open Office) with empty controls and sets the values using iTextSharp. I am still testing iTextSharp to see whether it does what I need, and so far the answer is no.
Please see what I've done below according to what I found on the net:
string fileNameExisting = @"PdfTemplate.pdf";
string fileNameNew = @"new.pdf";
using (var existingFileStream = new FileStream(fileNameExisting, FileMode.Open))
using (var newFileStream = new FileStream(fileNameNew, FileMode.Create))
{
// Open existing PDF
var pdfReader = new PdfReader(existingFileStream);
// PdfStamper, which will create
var stamper = new PdfStamper(pdfReader, newFileStream);
var form = stamper.AcroFields;
var fieldKeys = form.Fields.Keys;
foreach (string fieldKey in fieldKeys)
{
bool result = form.SetField(fieldKey, "A lot of text here.");
}
stamper.Close();
pdfReader.Close();
}
Issue 1
iTextSharp only recognizes 'controls' elements from Open Office (Textboxes for example). I tried to add a table to the PDF Template but it doesn't appear in the fields. Which means I am really limited in what to use.
Issue 2
When I set the text in the fields, the text does not wrap, and the size of the controls is not dynamic, which means that if the text is too long, part of it does not appear. I can't rely on a scroll bar because the PDF is meant to be printed.
I tried
For the first issue, I created a PDF form with Word instead of Open Office Writer. However, iTextSharp does not recognize any of the controls from Word; my fields collection is empty.
For the second issue, I tried modifying every property of the controls in Open Office and searched the internet to see if someone had a solution. But from what I understood, the size is fixed because these are AcroFields, so I can't make the control dynamic and can't change the size afterwards with iTextSharp.
I was hoping someone went through the same situation and would be able to guide me either with iTextSharp, or another library, free or not. I can't afford a £2000 license though as I am running my own business, but I am open to suggestions as I need to deliver.
The last option is to create the PDF from scratch with iTextSharp, but it's not as fast and easy to produce as the modification, and it means that for every update of the PDF, the company would need me to change the code... I'm not very pleased with that solution.
Issue 1:
A table is not a form field. Please read the PDF specification, more specifically ISO 32000-1.
There is no such thing as a dynamic table in PDF. That is only possible in XFA (which is XML wrapped in a PDF file), but XFA is being deprecated. At iText, we'll release a (closed source) product in February 2017 for dynamic documents.
Issue 2:
The text only wraps if the field is defined as a multi-line text field. See for instance https://developers.itextpdf.com/question/how-get-row-count-multiline-field
The font size only adapts to the size of the field if you set the font size to 0. See the question "Set AcroField Text Size to Auto".
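As a rough sketch of both points, the snippet below marks every text field as multi-line and sets its font size to 0 (auto) before filling it. It assumes iTextSharp 5.x; `FormFiller` and `FillTextFields` are hypothetical names chosen for illustration, and the property keys `"setfflags"` and `"textsize"` are the ones `AcroFields.SetFieldProperty` is understood to accept in that version. How well the text fits still depends on the fixed field rectangle defined in the template.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

public static class FormFiller
{
    // Hypothetical helper: fills every text field of the template with the same text.
    public static void FillTextFields(string templatePath, string outputPath, string text)
    {
        var reader = new PdfReader(templatePath);
        using (var output = new FileStream(outputPath, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            AcroFields form = stamper.AcroFields;

            // Copy the keys so the loop does not depend on the live dictionary.
            var names = new List<string>(form.Fields.Keys);
            foreach (string name in names)
            {
                if (form.GetFieldType(name) != AcroFields.FIELD_TYPE_TEXT)
                    continue;

                // Turn on the multi-line flag so the text can wrap.
                form.SetFieldProperty(name, "setfflags", PdfFormField.FF_MULTILINE, null);
                // Font size 0 means "auto": the size adapts to the field rectangle.
                form.SetFieldProperty(name, "textsize", 0f, null);
                // Set the value last so the appearance uses the properties above.
                form.SetField(name, text);
            }

            stamper.Close();
        }
        reader.Close();
    }
}
```

A call such as `FormFiller.FillTextFields("PdfTemplate.pdf", "new.pdf", "A lot of text here.")` would then mirror the loop from the question.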
Summarized:
Dynamic forms in PDF either require XFA, in which case you need to buy Adobe LiveCycle ES (which is way above your budget), or you need to wait until iText Group releases its dynamic forms project (but that will also cost more than £2000).
Related
I have a secured PDF Template with editable fields. When I set a field's value it doesn't show up until I click on it and modify it.
Code for inserting a value into a field:
static void Main(string[] args)
{
using (PdfReader reader = new PdfReader(desktopPath + "PdfTemplate.pdf"))
{
reader.SetUnethicalReading(true);
using (PdfDocument pdfDocument = new PdfDocument(reader, new PdfWriter(desktopPath + "ModifiedPdfTemplate.pdf")))
{
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdfDocument, true);
IDictionary<string, PdfFormField> fields = form.GetFormFields();
fields["Date"].SetValue("DATE");
}
}
}
This is what an unedited field looks like:
After I run the code, the field still looks like before, but there is a value if I click on it:
After I modified the field (added a space at the end), now it shows the text:
I think it's because there is some styling on the field. How can I achieve what is shown in the last picture?
Software Versions:
Itext -> 7.1.16
Adobe Acrobat -> 2020.009.20063
After the attempt to reproduce the issue here failed, the next step was comparing versions of software involved.
Synchronizing the iText version to the current 7.1.16 still resulted in different observations.
But then updating the PDF viewer, Adobe Acrobat (Reader), finally resolved the issue.
Apparently Acrobat 2020.009.20063 fails to display the field properly while Acrobat 2021.001.20155 and 2021.007.20091 succeed.
(One might think that basic functionality like field value showing should have been stable in Acrobat for a number of years. But apparently changes do still occur here. This may be related to hardening of signed forms against forgery in the recent months and years.)
I faced the same issue. It was caused by the font of the form field: set the font value to null while creating the field, or update the font of the existing field.
If the font is null, iText will use the document's default font.
I'm trying to check a checkbox in my PDF with iText7.
But instead of checking only one field, it's checking all fields.
What I need:
What I get:
PDF when editing:
I think the exported value has something to do with it.
But I don't know what to do.
My code:
private static void CreatePdf(string output)
{
using var _pdfDoc = new PdfDocument(new PdfReader("CheckTest.pdf"), new PdfWriter(output));
var form = PdfAcroForm.GetAcroForm(_pdfDoc, true);
var check = form.GetField("Check");
check.SetValue("01");
}
PDF: Link
Does anyone know how to check it properly?
Thanks!
First of all, the PDF essentially mis-uses PDF AcroForm check box fields as radio buttons instead of using genuine PDF AcroForm radio button fields.
The PDF specification does not clearly specify what a PDF viewer should do in such a case (it's mis-use after all) but the developers of the PDF form generator in question probably have experimented and determined that in the most widely used PDF viewer, Adobe Acrobat Reader, this mis-use works just as they want.
As this use is beyond specification, though, other PDF processors processing such PDFs may produce completely different results without doing anything wrong.
That being said, there is a way to fill the form using iText and achieve results similar to those generated by Adobe Reader.
The problem at hand is that, by default, iText generates new appearances appropriate for the field type whenever a field value is set; this applies to all form field types except actual AcroForm radio button fields. In your document there are three check box field objects with the same name. Thus, they are considered a single check box with three widgets representing the same value, and so the appearances are generated as you observe.
But you can tell iText not to generate new appearances by using another SetValue overload that accepts an additional boolean value. Simply replace
check.SetValue("01");
by
check.SetValue("01", false);
Now iText makes do with the existing appearances, so only the field becomes checked that has an appearance for that "01" value.
Beware, only prevent iText from generating appearances in cases like this. In case of text fields, for example, not updating the appearances would cause the old appearances with the former field content to continue to be displayed even though the internal field value changed.
I did it like this:
Dim MyPDFFormCheckBoxField As Fields.PdfFormField = myform.GetField("myCheckBox")
MyPDFFormCheckBoxField.SetCheckType(PdfFormField.TYPE_CHECK)
MyPDFFormCheckBoxField.SetValue("", If(myCheckBox.IsChecked = True, True, False))
Notice that it is the second parameter of SetValue that is setting the checkbox True or False.
I am trying to build an application that can convert a PDF to an Excel file with C#.
I have searched for a library to help me with this, but most of them are commercially licensed, so I ended up with iTextSharp.dll.
It's good that it is free, but I rarely find any good open source documentation for it.
These are some of the links that I have read:
https://yoda.entelect.co.za/view/9902/extracting-data-from-pdf-files
https://www.mikesdotnetting.com/article/80/create-pdfs-in-asp-net-getting-started-with-itextsharp
http://www.thedevelopertips.com/DotNet/ASPDotNet/Read-PDF-and-Convert-to-Stream.aspx?id=34
There are more, but most of them do not really explain what the code does.
So this is the most common code for iText with C#:
StringBuilder text = new StringBuilder(); // my new file that will have pdf content?
PdfReader pdfReader = new PdfReader(myPath); // This maybe how IText read the pdf?
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // looping for read all content in pdf?
{
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy(); // ?
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy); // ?
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText))); // maybe how IText convert the data to text?
text.Append(currentText); // maybe the full content?
}
pdfReader.Close(); // to close the PdfReader?
As you can see, I still do not have a clear understanding of the iText code that I have. Please tell me whether my understanding is correct, and explain the parts of the code that I still do not understand.
Thank You.
Let me start by explaining a bit about PDF.
PDF is not a 'what you see is what you get'-format.
Internally, PDF is more like a file containing instructions for rendering software. Unless you are working with a tagged PDF file, a PDF document does not naturally have a concept of 'paragraph' or 'table'.
If you open a PDF in notepad for instance, you might see something like
7 0 obj
<</BaseFont/Helvetica-Oblique/Encoding/WinAnsiEncoding/Subtype/Type1/Type/Font>>
endobj
Instructions in the document get gathered into 'objects' and objects are numbered, and can be cross-referenced.
As Bruno already indicated in the comments, this means that finding out what a table is, or what the content of a table is, can be really hard.
The PDF document itself can only tell you things like:
object 8 is a line from [50, 100] to [150, 100]
object 125 is a piece of text, in font Helvetica, at position [50, 110]
With the iText core library you can
get all of these objects (which iText calls PathRenderInfo, TextRenderInfo and ImageRenderInfo objects)
get the graphics state when the object was rendered (which font, font-size, color, etc)
This can allow you to write your own parsing logic.
For instance:
gather all the PathRenderInfo objects
remove everything that is not a perfect horizontal or vertical line
make clusters of everything that intersects at 90 degree angles
if a cluster contains more than a given threshold of lines, consider it a table
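The heuristic above could be sketched roughly as follows. This is a library-independent illustration, not how iText or pdf2Data implement it: `Segment` and `TableFinder` are hypothetical names, the code assumes the line segments have already been extracted from the path render information, and the tolerance and threshold values are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a straight line taken from the page's path data.
public struct Segment
{
    public readonly double X1, Y1, X2, Y2;
    public Segment(double x1, double y1, double x2, double y2) { X1 = x1; Y1 = y1; X2 = x2; Y2 = y2; }
    public bool IsHorizontal(double eps) { return Math.Abs(Y1 - Y2) <= eps && Math.Abs(X1 - X2) > eps; }
    public bool IsVertical(double eps) { return Math.Abs(X1 - X2) <= eps && Math.Abs(Y1 - Y2) > eps; }
}

public static class TableFinder
{
    // True when horizontal segment h and vertical segment v touch or cross.
    private static bool Crosses(Segment h, Segment v, double eps)
    {
        double hMinX = Math.Min(h.X1, h.X2), hMaxX = Math.Max(h.X1, h.X2);
        double vMinY = Math.Min(v.Y1, v.Y2), vMaxY = Math.Max(v.Y1, v.Y2);
        return v.X1 >= hMinX - eps && v.X1 <= hMaxX + eps
            && h.Y1 >= vMinY - eps && h.Y1 <= vMaxY + eps;
    }

    // Returns the clusters of lines that contain more than 'threshold' lines.
    public static List<List<Segment>> FindTables(IEnumerable<Segment> segments, int threshold, double eps = 0.5)
    {
        // Drop everything that is not a horizontal or vertical line.
        List<Segment> lines = segments.Where(s => s.IsHorizontal(eps) || s.IsVertical(eps)).ToList();

        // Union-find: lines that meet at a right angle end up in the same cluster.
        int[] parent = Enumerable.Range(0, lines.Count).ToArray();
        Func<int, int> find = null;
        find = i => parent[i] == i ? i : (parent[i] = find(parent[i]));

        for (int i = 0; i < lines.Count; i++)
        {
            for (int j = i + 1; j < lines.Count; j++)
            {
                Segment a = lines[i], b = lines[j];
                bool meet =
                    (a.IsHorizontal(eps) && b.IsVertical(eps) && Crosses(a, b, eps)) ||
                    (a.IsVertical(eps) && b.IsHorizontal(eps) && Crosses(b, a, eps));
                if (meet) parent[find(i)] = find(j);
            }
        }

        // Keep only the clusters that are large enough to be considered a table.
        return Enumerable.Range(0, lines.Count)
            .GroupBy(i => find(i))
            .Where(g => g.Count() > threshold)
            .Select(g => g.Select(i => lines[i]).ToList())
            .ToList();
    }
}
```

A real document needs more care than this (borderless tables, lines drawn as thin filled rectangles, nested tables), which is part of why the task is hard.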
Luckily, the pdf2Data solution (an iText add-on) already does that kind of thing for you.
For more information go to http://pdf2data.online/
I am having trouble retrieving images and text in a PDF file at the same time. I was able to get images and text in a PDF file, but not at the same time (which raises the question of whether to render the image first or the text first, for example in my panel control). Maybe you can help me define what each of the constants in pdfname means? I tried using pdfname.all but it returns null, whereas pdfname.resources returns procset, font and xobject. I used xobject for the image, but what are procset and font (could this be the style of the text)? Is there a pdfname.text for retrieving text?
thanks in advance.
First of all,
i am having a trouble in retrieving images and text in a pdf file at the same
for this task you should use the iText(Sharp) parser API. In iTextSharp you essentially implement IRenderListener (an interface with methods for being informed about (bitmap) images and text fragments in a content stream) and process the page contents with it:
PdfReader reader = new PdfReader(...);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
int pageNumber = [... the number of the page you are interested in; may be a loop variable ...];
IRenderListener listener = new [... your IRenderListener implementation ...]
parser.ProcessContent(pageNumber, listener);
You ask
whether to render the image first or the text first for example in my panel control
The IRenderListener methods also retrieve information on the location of the bitmap or text fragment in question.
For ideas how the text fragments may be combined in your listener, you may want to be inspired by the implementations SimpleTextExtractionStrategy or LocationTextExtractionStrategy present in iTextSharp.
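A minimal sketch of such a listener might look like the following. It assumes iTextSharp 5.x; `PageItem` and `PositionedContentListener` are hypothetical names, and sorting by start point is only a crude approximation of reading order (the start point is the beginning of the baseline for text, and the origin corner of the image rectangle for bitmaps).

```csharp
using System.Collections.Generic;
using System.Linq;
using iTextSharp.text.pdf.parser;

// Hypothetical record of one piece of page content and where it starts.
public class PageItem
{
    public string Kind;   // "text" or "image"
    public string Text;   // null for images
    public float X, Y;    // start point in page coordinates (y grows upwards)
}

public class PositionedContentListener : IRenderListener
{
    public readonly List<PageItem> Items = new List<PageItem>();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }

    // Called for each text fragment in the content stream.
    public void RenderText(TextRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetBaseline().GetStartPoint();
        Items.Add(new PageItem
        {
            Kind = "text",
            Text = renderInfo.GetText(),
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    // Called for each bitmap image in the content stream.
    public void RenderImage(ImageRenderInfo renderInfo)
    {
        Vector start = renderInfo.GetStartPoint();
        Items.Add(new PageItem
        {
            Kind = "image",
            X = start[Vector.I1],
            Y = start[Vector.I2]
        });
    }

    // Top of the page first, then left to right.
    public IEnumerable<PageItem> InReadingOrder()
    {
        return Items.OrderByDescending(item => item.Y).ThenBy(item => item.X);
    }
}
```

After `parser.ProcessContent(pageNumber, listener)` has run, iterating over `listener.InReadingOrder()` gives text and images in one sequence, so the order in which to place them in a panel follows from their positions rather than from separate passes.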
If you insist on doing it manually, though...
maybe if you guys can help me define what does each constants in pdfname means?
You can find the definitions of what the names map to in the PDF specification ISO 32000-1:2008, a copy of which Adobe has made available.
when using pdfname.resources it returns procset, font and xobject. i used xobject for image, but what are procset and font (could this be the style of the text?
The contents of the page Resource Dictionaries are explained in section 7.8.3 of the specification.
does it have pdfname.text for retrieving text)?
You'll find how text is presented in page content streams and xobjects in section 9.
I have a pdf with a form in it. I am trying to write a class that will take data from my database and automatically populate the fields in the form.
I have already tried ITextSharp and their pricing is out of my budget, even though it works perfectly fine with my pdf. I need a free pdf parser that will let me import the pdf, set the data, and save the PDF out, preferably to a stream so that I can return a Stream object from my class rather than saving the pdf to the server.
I found this pdf reader and it doesn't work. Null reference errors are abundant and when I tried to "fix" them, it still couldn't find my fields.
So, I have moved on to PdfBox, as the documentation says it can manipulate a PDF, however, I cannot find any examples. Here is the code I have so far.
var document = PDDocument.load(inputPdf);
var catalog = document.getDocumentCatalog();
var form = catalog.getAcroForm();
form.getField("MY_FIELD").setValue("Test Value");
document.save("some location on my hard drive");
document.close();
The problem is that catalog.getAcroForm() is returning a null, so I can't access the fields. Does anyone know how I can use PdfBox to alter the field values and save the thing back out?
EDIT:
I did find this example, which is pretty much what I am doing. It's just that my acroform is null in pdfbox. I know there is one there because itextsharp can pull it out just fine.
Have you tried with the 1.2.1 version?
http://pdfbox.apache.org/apidocs/overview-summary.html