I have a pdf with a form in it. I am trying to write a class that will take data from my database and automatically populate the fields in the form.
I have already tried ITextSharp and their pricing is out of my budget, even though it works perfectly fine with my pdf. I need a free pdf parser that will let me import the pdf, set the data, and save the PDF out, preferably to a stream so that I can return a Stream object from my class rather than saving the pdf to the server.
I found this pdf reader and it doesn't work. Null reference errors are abundant and when I tried to "fix" them, it still couldn't find my fields.
So I have moved on to PdfBox, since the documentation says it can manipulate a PDF. However, I cannot find any examples. Here is the code I have so far.
var document = PDDocument.load(inputPdf);
var catalog = document.getDocumentCatalog();
var form = catalog.getAcroForm();
form.getField("MY_FIELD").setValue("Test Value");
document.save("some location on my hard drive");
document.close();
The problem is that catalog.getAcroForm() is returning a null, so I can't access the fields. Does anyone know how I can use PdfBox to alter the field values and save the thing back out?
EDIT:
I did find this example, which is pretty much what I am doing. It's just that my acroform is null in pdfbox. I know there is one there because itextsharp can pull it out just fine.
Have you tried with the 1.2.1 version?
http://pdfbox.apache.org/apidocs/overview-summary.html
(This question was formerly titled "C# / WPF : Going from Excel Interop "Range" to WPF "FlowDocument"" however I've made progress on that front that allows me to restrict my question. I'm leaving the original question below so existing answers will still make sense.)
I'm using Office Interop to read the contents of cells in an Excel worksheet. Some of those cells contain Rich Text (for example, some words are italicized but not the whole cell) and I would like to capture them as RTF so I can then display them in WPF controls.
I have been able to obtain the RTF contents of cells using the clipboard API, where I use Excel Interop to copy a Range of one cell to the clipboard, and then read the clipboard, like so:
// Step 1 : retrieve the RTF from the clipboard as a string
string txt = Clipboard.GetText(TextDataFormat.Rtf);
// Step 2 : create a FlowDocument object and a TextRange object:
FlowDocument doc = new FlowDocument();
TextRange tr = new TextRange(doc.ContentStart, doc.ContentEnd);
// Step 3 : convert the clipboard string to a stream
byte[] byteArray = Encoding.ASCII.GetBytes(txt);
MemoryStream stream = new MemoryStream(byteArray);
// Step 4 : load that stream into TextRange
tr.Load(stream, DataFormats.Rtf);
If I then assign "doc" to the Document property of, say, a RichTextBox control, it'll display the content of the Excel cell with the exact same formatting as Excel does, down to colored words and font sizes.
However, this is extremely slow. It may take minutes to load a thousand cells that way, even if most are empty.
So here's my updated question: clearly Excel has a mechanism for returning the RTF content of an Excel cell, otherwise my Clipboard code couldn't work. But is there a more efficient way than the Clipboard to exploit that mechanism? Ideally through Interop?
Original question :
This may be an unusual question but as I'm quite new to C#, WPF and Interop, I might be going about things the wrong way so don't hesitate to offer a better approach. Here's what I'm trying to do :
I'm coding a WPF application that uses Office Interop to grab the contents of cells from an Excel worksheet. That content is text which may contain some formatting (for example some words are in bold, others are in italics). The application then displays that content in a "FlowDocumentScrollViewer" control on its GUI.
I want this "FlowDocumentScrollViewer" control to render the content from the Excel cell exactly as it appears in Excel, with formatting and everything.
The best I've managed so far is to display the cell's content without any formatting. Here's how this works : I use Office Interop to read a Range of cells from the worksheet and take their Value2 property. Value2 is of type "object". Then I create a FlowDocument object out of it, like so:
FlowDocument doc = new FlowDocument();
Paragraph p = new Paragraph(new Run(Variable_containing_a_Value2.ToString()));
doc.Blocks.Add(p);
And then I store this FlowDocument into the "FlowDocumentScrollViewer" Document property.
Now since I'm using "ToString()" on the Value2 I'm not surprised that any formatting information this object might contain disappears past this point.
My problem is, I haven't been able to find a way to create that FlowDocument, from that Value2 object, that preserves formatting.
Now, I know there has to be a way to get that information through, because when I copy my Excel cell and paste it in Word, for example, then the formatting is carried through. I just don't know how.
Help me Obiwans, you're my only hope, as even Google has failed me.
It seems to me that you have at least a couple of options that will work better than just copying the cell contents as text. The Range object has Copy() and CopyPicture() methods, which you can use to have Excel copy the contents of the range to the clipboard.
The basic Copy() method should (I haven't tested it) put the contents of the cell into the clipboard in a variety of formats, including RTF. And you should be able to get the RTF and put that into the FlowDocument element.
Using RTF, you may still not get exactly the representation as seen in Excel. The only way to do that is to have Excel do the rendering. In that case, you'll want the CopyPicture() method, which will put a picture of the range on the clipboard. This will be either a bitmap or a metafile, depending on the options you use for the method call. You can then retrieve these from the clipboard and put them into your FlowDocument.
Depending on what applications you're looking at, e.g. Word, there's yet another, more complicated approach, one that I doubt would work with FlowDocument, but which they are using. That is, they are presenting the Excel range as an OLE object. This is harder to implement, but has the advantage that it's a live representation of the original Excel document, and the user can edit the range in-place in the host application.
The above should be enough to get you pointed in the right direction, so at least you know what you're looking for when you do your web searches. As stated, your question is very broad, and so the above is necessarily vague as well. Once you've decided on a particular method, have done some research and made an attempt into implementing that method, if you still have problems you can post a new question, with a good Minimal, Complete, and Verifiable code example that shows clearly what you've tried, with a detailed explanation of what specifically you're still having trouble with.
I am trying to create a piece of code that opens an existing PDF form (previously created with Open Office) with empty controls and sets the values using iTextSharp. I am still testing iTextSharp to see whether it does what I need, and so far the answer is no.
Please see what I've done below according to what I found on the net:
string fileNameExisting = @"PdfTemplate.pdf";
string fileNameNew = @"new.pdf";
using (var existingFileStream = new FileStream(fileNameExisting, FileMode.Open))
using (var newFileStream = new FileStream(fileNameNew, FileMode.Create))
{
    // Open existing PDF
    var pdfReader = new PdfReader(existingFileStream);
    // PdfStamper, which will create
    var stamper = new PdfStamper(pdfReader, newFileStream);
    var form = stamper.AcroFields;
    var fieldKeys = form.Fields.Keys;
    foreach (string fieldKey in fieldKeys)
    {
        bool result = form.SetField(fieldKey, "A lot of text here.");
    }
    stamper.Close();
    pdfReader.Close();
}
Issue 1
iTextSharp only recognizes 'controls' elements from Open Office (Textboxes for example). I tried to add a table to the PDF Template but it doesn't appear in the fields. Which means I am really limited in what to use.
Issue 2
When I set the text in the fields, the text does not wrap, and the size of the controls is not dynamic, which means that if the text is too long, not all of it appears. I can't rely on the scroll bar because the PDF is meant to be printed.
I tried
For the first issue, I created a PDF form with Word instead of Open Office Writer. However, iTextSharp does not recognize any of the controls from Word; my fields collection is empty.
For the second issue, I tried modifying every property of the controls in Open Office and searched the internet to see if someone had a solution. But from what I understood, the size is fixed because these are AcroFields, so I can't make the control dynamic and can't change the size afterwards with iTextSharp.
I was hoping someone went through the same situation and would be able to guide me either with iTextSharp, or another library, free or not. I can't afford a £2000 license though as I am running my own business, but I am open to suggestions as I need to deliver.
The last option is to create the PDF from scratch with iTextSharp, but it's not as fast and easy to produce as the modification, and it means that for every update of the PDF, the company would need me to change the code... I'm not very pleased with that solution.
Issue 1:
A table is not a form field. Please read the PDF specification, more specifically ISO-32000-1:
There is no such thing as a dynamic table in PDF. That is only possible in XFA (which is XML wrapped in a PDF file), but XFA is being deprecated. At iText, we'll release a (closed source) product in February 2017 for dynamic documents.
Issue 2:
The text only wraps if the field is defined as a multi-line text field. See for instance https://developers.itextpdf.com/question/how-get-row-count-multiline-field
The font size only adapts to the size of the field if you set the font size to 0: Set AcroField Text Size to Auto
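A minimal sketch of the two points above, assuming iTextSharp 5.x and the same PdfReader/PdfStamper setup as in the question. The class name, method name, and parameters are hypothetical. The field properties are set before the value so that the generated appearance picks them up.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

static class FormFiller
{
    // Hypothetical helper: fills every text field of the template with the same text.
    public static void Fill(string templatePath, string outputPath, string text)
    {
        var reader = new PdfReader(templatePath);
        using (var output = new FileStream(outputPath, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            AcroFields form = stamper.AcroFields;

            // Defensive copy of the keys so the loop does not depend on the live collection.
            foreach (string key in new List<string>(form.Fields.Keys))
            {
                if (form.GetFieldType(key) != AcroFields.FIELD_TYPE_TEXT)
                    continue; // only text fields can wrap

                // Mark the field as multi-line so the text wraps.
                form.SetFieldProperty(key, "setfflags", PdfFormField.FF_MULTILINE, null);
                // Font size 0 means "auto": the size adapts to the field.
                form.SetFieldProperty(key, "textsize", 0f, null);
                form.SetField(key, text);
            }

            stamper.Close();
        }
        reader.Close();
    }
}
```

The field rectangle itself still has a fixed size; this only changes how the text is laid out inside it.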
Summarized:
Dynamic forms in PDF either require XFA, in which case you need to buy Adobe LiveCycle ES (which is way above your budget), or you need to wait until iText Group releases its dynamic forms project (but that will also be more expensive than £2000).
I need to convert image files to PDF without using third party libraries in C#. The images can be in any format like (.jpg, .png, .jpeg, .tiff).
I am successfully able to do this with the help of itextsharp; here is the code.
string value = string.Empty; // value contains the data from a json file
List<string> sampleData;

public void convertdata()
{
    //sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(value);
    var jsonD = System.IO.File.ReadAllLines(@"json.txt");
    // The first line of the file holds a JSON array of base64-encoded images
    sampleData = Newtonsoft.Json.JsonConvert.DeserializeObject<List<string>>(jsonD[0]);
    Document document = new Document();
    using (var stream = new FileStream("test111.pdf", FileMode.Create, FileAccess.Write, FileShare.None))
    {
        PdfWriter.GetInstance(document, stream);
        document.Open();
        foreach (var item in sampleData)
        {
            // Decode each image and append it to the document
            byte[] newdata = Convert.FromBase64String(item);
            var image = iTextSharp.text.Image.GetInstance(newdata);
            document.Add(image);
            Console.WriteLine("Conversion done check folder");
        }
        document.Close();
    }
}
But now I need to perform the same without using third party library.
I have searched the internet but I am unable to get something that can suggest a proper answer. All I am getting is to use it with either "itextsharp" or "PdfSharp" or with the "GhostScriptApi".
Would someone suggest a possible solution?
This is doable but not practical in the sense that it would very likely take way too much time for you to implement. The general procedure is:
Open the image file format
Either copy the encoded bytes verbatim to a stream in a PDF document you have created or decode the image data and re-encode it in a PDF stream (whether it's the former or latter depends on the image format)
Save the PDF
This looks easy (it's only three points after all :-)) but when you start to investigate you'll see that it's very complicated.
First of all you need to understand enough of the PDF specification to write a new PDF file from scratch, doing all of the right things. The PDF specification is way over 1000 pages by now; you don't need all of it but you need to support a good portion of it to write a proper PDF document.
Secondly, you will need to understand every image file format you want to support. That by itself is not trivial (the TIFF file format, for example, is so broad that it's a nightmare to support a reasonable fraction of the TIFF files out there). In some cases you'll be able to simply copy the bulk of an image file into your PDF document (JPEG files fall in that category, for example). That's a complication you want to support, because uncompressing the JPEG file and then recompressing it in a PDF stream will cause quality loss.
So... possible? Yes. Plausible? No. Unless you have gotten lots and lots of time to complete this project.
The structure of the simplest PDF document, with one single page and one single image, is the following:
- pdf header
- pdf document catalog
- pages info
- image
  - image header
  - image data
- page
  - reference to image
- list of references to objects inside pdf document
Check this Python code that is doing the following steps to convert image to PDF:
Writes PDF header;
Checks image data to find which filter to use. It is simpler to pick just one format, such as the FlateDecode codec (used by PDF to compress images without loss);
Writes the "catalog" object, which points to the page tree holding the array of references to page objects;
Writes image object header;
Writes image data (pixels by pixels, converted to the given codec format) as the "stream" object in pdf;
Writes "page" object which contains "image" object;
Writes "trailer" section with the set of references to objects inside PDF and their starting offsets. PDF format stores references of objects at the end of PDF document.
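The steps above can be sketched in C# with nothing but the base class library. This is a minimal illustration, not a general converter: it assumes the input is an 8-bit RGB JPEG, so the bytes can be copied verbatim with the DCTDecode filter instead of being re-encoded with FlateDecode, and it assumes the caller already knows the pixel dimensions. The class name, method name, and object numbering are my own choices.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class MinimalPdf
{
    // Wraps an RGB JPEG in a one-page PDF. width/height are the image's pixel
    // dimensions, supplied by the caller (this sketch does not parse the JPEG header).
    public static void WriteJpegAsPdf(byte[] jpeg, int width, int height, Stream output)
    {
        var offsets = new List<long>(); // byte offset of each object, needed for the xref table
        long pos = 0;

        void Write(byte[] bytes) { output.Write(bytes, 0, bytes.Length); pos += bytes.Length; }
        void WriteText(string s) => Write(Encoding.ASCII.GetBytes(s));
        void BeginObject() { offsets.Add(pos); WriteText(offsets.Count + " 0 obj\n"); }

        WriteText("%PDF-1.4\n"); // header

        BeginObject(); // 1: catalog, points to the page tree
        WriteText("<< /Type /Catalog /Pages 2 0 R >>\nendobj\n");

        BeginObject(); // 2: page tree with a single page
        WriteText("<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n");

        BeginObject(); // 3: page, sized so that one pixel equals one point
        WriteText("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 " + width + " " + height + "] " +
                  "/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>\nendobj\n");

        BeginObject(); // 4: image header followed by the untouched JPEG bytes
        WriteText("<< /Type /XObject /Subtype /Image /Width " + width + " /Height " + height +
                  " /ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode /Length " +
                  jpeg.Length + " >>\nstream\n");
        Write(jpeg);
        WriteText("\nendstream\nendobj\n");

        // 5: content stream that scales the unit image square to the page and paints it
        string content = "q " + width + " 0 0 " + height + " 0 0 cm /Im0 Do Q";
        BeginObject();
        WriteText("<< /Length " + content.Length + " >>\nstream\n" + content + "\nendstream\nendobj\n");

        // Cross-reference table: one 20-byte entry per object, then the trailer
        long xrefPos = pos;
        WriteText("xref\n0 " + (offsets.Count + 1) + "\n");
        WriteText("0000000000 65535 f \n");
        foreach (long offset in offsets)
            WriteText(offset.ToString("D10") + " 00000 n \n");
        WriteText("trailer\n<< /Size " + (offsets.Count + 1) + " /Root 1 0 R >>\n" +
                  "startxref\n" + xrefPos + "\n%%EOF\n");
    }
}
```

It could be called as MinimalPdf.WriteJpegAsPdf(File.ReadAllBytes("photo.jpg"), 640, 480, stream), where the file name and dimensions are placeholders. PNG or TIFF input would need the decode and re-encode path described above, which is where most of the work lies.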
I would write my own ASP.NET Web Service or Web API service and call it within the app :)
I have spent about 20 hours of coding to produce invoices using iText in C#.
Now I want to use the same code to transform some of the tables to HTML.
Do you know if i can do this?
For instance i have this:
PdfPTable table = new PdfPTable(3);
table.DefaultCell.Border = 0;
table.DefaultCell.Padding = 3;
table.WidthPercentage = 100;
int[] widths = { 100, 200, 100};
table.SetWidths(widths);
List listOfCompanyData = (List)getCompanyData();
List listOfCumparatorDreaptaData = (List)getCumparatorDreaptaData(proformaInvoice.getCumparatorDreapta());
table.AddCell((Phrase)listOfCompanyData.Items[0]);
table.AddCell("");
table.AddCell((Phrase)listOfCumparatorDreaptaData.Items[0]);
and I want to transform this table into HTML...
Is it possible?
PDFs and HTML are fundamentally different display technologies. PDF is much more complex than HTML is, which is why you find so many HTML to PDF converters. The other way around is much more difficult.
iText can only do it from HTML to PDF.
There are online converters that will take a PDF and convert it to HTML. There are also downloadable utilities.
I am not aware of any .NET library that will do this.
PDF is almost a write-only format. Any time your workflow calls for "get the data out of a PDF", you've probably screwed up.
Having said that, there are several ways to stash data within a PDF:
Form fields have no particular length limit and need not be visible. Getting form data with iText is trivial.
You can attach a file to a PDF and suck it out later, both with iText.
DocInfo fields. You can stuff a string into one of the author/title/keywords/etc metadata fields. An ugly hack, but effective.
XML metadata. The "new-fangled" metadata is stored in an XML schema. You can put pretty much whatever you want in there... though iText regenerates some of it every time it makes changes (mod date and such).
Custom keys/values. You can tack any old key/value pairs you like into any old dictionary within a PDF. Adobe would like you to register a company-specific prefix for your custom tags to avoid collisions, but I've never felt the need.
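As an illustration of the DocInfo option from the list above, here is a minimal sketch assuming iTextSharp 5.x, where PdfReader.Info and PdfStamper.MoreInfo work with string dictionaries. The class name, method names, and the key passed in are hypothetical.

```csharp
using System.Collections.Generic;
using System.IO;
using iTextSharp.text.pdf;

static class PdfStash
{
    // Copies the PDF at sourcePath to targetPath, adding one DocInfo entry.
    public static void Write(string sourcePath, string targetPath, string key, string value)
    {
        var reader = new PdfReader(sourcePath);
        using (var output = new FileStream(targetPath, FileMode.Create))
        {
            var stamper = new PdfStamper(reader, output);
            Dictionary<string, string> info = reader.Info; // existing DocInfo entries
            info[key] = value;                             // add or overwrite our entry
            stamper.MoreInfo = info;
            stamper.Close();
        }
        reader.Close();
    }

    // Returns the stored value, or null if the key is absent.
    public static string Read(string path, string key)
    {
        var reader = new PdfReader(path);
        try
        {
            return reader.Info.TryGetValue(key, out string value) ? value : null;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

This only carries data you put there yourself; it does not help with extracting table content from an arbitrary PDF.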
From the book iText in Action it seems that it is doable using the original Java library, but it seems it is no longer ported in the C# lib. I'm pretty sure it was in version 4 :-/
Try looking at some old source here: http://www.koders.com/csharp/fid60B0985D3A89152128B73F54EDD4EB5420A5E4D8.aspx?s=%22Ken+Auer%22
nFOP + XSLT + XML = pdf | doc | HTML
nfop.sourceforge.net/article.html should give you an idea of how to use it. You need the "Microsoft Visual J# .NET Redistributable Package" to run nFOP.
open source no cost :)
K
I have a PDF form with a number of text fields. The values entered in these fields are used to calculate values in other fields (the calculated fields are read-only).
When I open the form in Adobe Reader and fill in a field, the calculated fields automatically re-calculate.
However, I am using iTextSharp to fill in the fields, flatten the resulting form, then stream the flattened form back to the user over the web.
That part works just fine except the calculated fields never calculate. I'm assuming that since no user triggered events (like keydowns or focus or blur) are firing, the calculations don't occur.
Obviously, I could remove the calculations from the fillable form and do them all on the server as I am filling the fields, but I'd like the fillable form to be usable by humans as well as the server.
Does anyone know how to force the calculations?
EDIT:
I ain't feeling too much iText/iTextSharp love here...
Here are a few more details. Setting stamper.AcroFields.GenerateAppearances to true doesn't help.
I think the answer lies somewhere in the page actions, but I don't know how to trigger it...
Paulo Soares (one of the main devs of iText and current maintainer of iTextSharp) says:
iText doesn't do any effort to fix calculated fields because most of the times that's impossible. PdfCopyFields has some support for it that sometimes works and sometimes don't.
I have updated all the calculated fields of my pdfs by calling the javascript method calculateNow on the Doc object.
According to the adobe javascript documentation this.calculateNow();
Forces computation of all calculation fields in the current document.
When a form contains many calculations, there can be a significant delay after the user inputs data into a field, even if it is not a calculation field. One strategy is to turn off calculations at some point and turn them back on later (see example).
To include the javascript call with iTextSharp :
using (PdfReader pdfReader = new PdfReader(pdfTemplate))
using (PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileStream(newFile, FileMode.Create)))
{
// fill f1 field and more...
AcroFields pdfFormFields = pdfStamper.AcroFields;
pdfFormFields.SetField("f1", "100");
//...
// add javascript on load event of the pdf
pdfStamper.JavaScript = "this.calculateNow();";
pdfStamper.Close();
}
I have figured out how to do this. Please see my answer for the stackoverflow question:
How to refresh Formatting on Non-Calculated field and refresh Calculated fields in Fillable PDF form
On the server side, see if there is an answer in the calculated fields. If not, calculate them.
As Greg Hurlman says, you should do the calculations yourself on the server. This is for more than just convenience, there's actually a good reason for it.
Any file that the customer has, they have the potential to screw with. I don't know what the PDF forms are for, but chances are it's connected to money somehow, so the potential exists for people to cheat by making the calculations show the wrong result. If you trust the calculations done on the client side, you have no way of detecting it.
When you receive the PDF form from the client, you should redo all calculations so that you know they're correct. Then if you also have the client's versions to compare to, you should check whether they've been screwed with.
Don't think your clients are that untrustworthy? Good for you, but the evidence disagrees. One of my earliest introductions to programming was opening up the savegames for SimCity to give myself more money. If the opportunity exists to cheat in some way, then at some point people will try it.
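A minimal sketch of that server-side check, assuming the submitted field values have already been read into a dictionary (for example with AcroFields.GetField). The field names f1, f2 and total, and the rule total = f1 + f2, are hypothetical stand-ins for whatever the real form calculates.

```csharp
using System.Collections.Generic;
using System.Globalization;

static class InvoiceCheck
{
    // Recomputes total = f1 + f2 on the server and reports whether the
    // value submitted by the client agrees with it.
    public static bool TotalIsTrustworthy(IDictionary<string, string> fields, out decimal serverTotal)
    {
        serverTotal = Parse(fields, "f1") + Parse(fields, "f2");

        return fields.TryGetValue("total", out string raw)
            && decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal clientTotal)
            && clientTotal == serverTotal;
    }

    // Missing or unparsable input counts as zero in this sketch.
    static decimal Parse(IDictionary<string, string> fields, string name)
    {
        fields.TryGetValue(name, out string raw);
        return decimal.TryParse(raw, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal value)
            ? value
            : 0m;
    }
}
```

The server then writes serverTotal into the flattened PDF regardless of what the client sent, and a mismatch can be logged or rejected as the application sees fit.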