Following is the function I am writing to create PDF using iTextSharp.
Let me explain the function ...
I am here creating a PDF file from the another Template PDF file. The template PDF file is sent to this function in bytes[], then I create pdfReader from this...
From pdfReader I create pdfStamper (i.e. new PDF file) and write the response values to its fields. It is working fine... only issue is fint size of values is much large...
public void GeneratePrintPDFTest(ResponseGroup actual, Pages page, byte[] filebyte, out string pdfname, string localstorage)
{
string rootPath = #"D:/FOP-PDF/";
var pdfReader = new PdfReader(filebyte);
var pdfStamper = new PdfStamper(pdfReader,new FileStream(rootPath.ToString(CultureInfo.InvariantCulture) + page.PageId.ToString(CultureInfo.InvariantCulture)
+ ".pdf",FileMode.Create));
pdfname = rootPath.ToString(CultureInfo.InvariantCulture) + page.PageId.ToString(CultureInfo.InvariantCulture) + ".pdf";
AcroFields pdfFormFields = pdfStamper.AcroFields;
foreach (DictionaryEntry de in pdfReader.AcroFields.Fields)
{
var response = actual.Responses.Where(obj => obj.ITPPageFieldKeyId == Convert.ToInt32(de.Key.ToString())).Select(obj => obj).FirstOrDefault();
if (response != null)
{
if (response.ResponseValues != null && !string.IsNullOrEmpty(response.ResponseValues.ToString())
&& response.ResponseValues.ToString() != "0" && !string.IsNullOrEmpty(response.DataItemID)
&& response.DataItemID != "0")
{
if (response.PrintFormulaResult || response.PageFieldFormulaId == 0)
{
pdfFormFields.SetField(de.Key.ToString(), response.ResponseValues.ToString());
}
}
}
}
pdfStamper.FormFlattening = false;
pdfStamper.Close();
}
I tried following solutions but of no use....
float fSize = 10;
pdfFormFields.SetFieldProperty(de.Key.ToString(), de.Key.ToString(), fSize, null);
I also doubt it may be coming from the template PDF file, but if so how could I change it programatically.
Please help me with this... Thanks in advance...
Related
I am using iTextSharp c# to extract images and its name from catalog pdf. I Am able to extract images from pdf, but struggling with extracting its corresponding image name as per the attached screenshot and save the file with that name. Please find the code below and let me know your suggestions.
Sample PDF: https://docdro.id/PwBsNR9
Code:
private static List<System.Drawing.Image> ExtractImages(String PDFSourcePath)
{
List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();
iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
iTextSharp.text.pdf.PdfObject PDFObj = null;
iTextSharp.text.pdf.PdfStream PDFStremObj = null;
try
{
RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(PDFSourcePath);
PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
{
PDFObj = PDFReaderObj.GetPdfObject(i);
if ((PDFObj != null) && PDFObj.IsStream())
{
PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
{
}
if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
{
try
{
iTextSharp.text.pdf.parser.PdfImageObject PdfImageObj =
new iTextSharp.text.pdf.parser.PdfImageObject((iTextSharp.text.pdf.PRStream)PDFStremObj);
System.Drawing.Image ImgPDF = PdfImageObj.GetDrawingImage();
ImgList.Add(ImgPDF);
}
catch (Exception)
{
}
}
}
}
PDFReaderObj.Close();
}
catch (Exception ex)
{
throw new Exception(ex.Message);
}
return ImgList;
}
Unfortunately the example PDF is not tagged. Thus, one has to otherwise try and associate title text and image, either by analyzing the location in respect to each other or by exploiting a pattern in the content streams.
In the case at hand analyzing the location in respect to each other is feasible as the title always is (at least partially) drawn on the matching image or is the text right beneath it. Thus, one could in a first pass extract the text with position from a page and in a second one the images, at the same time looking for a title in the previously extracted text in the image area or right beneath. Alternatively one could first extract images with position and size and then extract the text in these areas.
But there also is a certain pattern in the content streams: The titel is always drawn in a single text drawing instruction right after the corresponding image is drawn. Thus, one can also go ahead and in one pass extract images and the next drawn text as associated title.
Either approach can be implemented using the iText parser API. For example in case of the latter approach as follows: first, one implements a render listener that behaves as described, i.e. saves images and the following text:
internal class ImageWithTitleRenderListener : IRenderListener
{
int imageNumber = 0;
String format;
bool expectingTitle = false;
public ImageWithTitleRenderListener(String format)
{
this.format = format;
}
public void BeginTextBlock()
{ }
public void EndTextBlock()
{ }
public void RenderText(TextRenderInfo renderInfo)
{
if (expectingTitle)
{
expectingTitle = false;
File.WriteAllText(string.Format(format, imageNumber, "txt"), renderInfo.GetText());
}
}
public void RenderImage(ImageRenderInfo renderInfo)
{
imageNumber++;
expectingTitle = true;
PdfImageObject imageObject = renderInfo.GetImage();
if (imageObject == null)
{
Console.WriteLine("Image {0} could not be read.", imageNumber);
}
else
{
File.WriteAllBytes(string.Format(format, imageNumber, imageObject.GetFileType()), imageObject.GetImageAsBytes());
}
}
}
Then one parses the document pages using that render listener:
using (PdfReader reader = new PdfReader(#"EVERMOTION ARCHMODELS VOL.78.pdf"))
{
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
ImageWithTitleRenderListener listener = new ImageWithTitleRenderListener(#"EVERMOTION ARCHMODELS VOL.78-{0:D3}.{1}");
for (var i = 1; i <= reader.NumberOfPages; i++)
{
parser.ProcessContent(i, listener);
}
}
I hope this would help.
I am doing this type of thing but if this would help.
// existing pdf path
PdfReader reader = new PdfReader(path);
PRStream pst;
PdfImageObject pio;
PdfObject po;
// number of objects in pdf document
int n = reader.XrefSize;
//FileStream fs = null;
// set image file location
//String path = "E:/";
for (int i = 0; i < n; i++)
{
// get the object at the index i in the objects collection
po = reader.GetPdfObject(i);
// object not found so continue
if (po == null || !po.IsStream())
continue;
//cast object to stream
pst = (PRStream)po;
//get the object type
PdfObject type = pst.Get(PdfName.SUBTYPE);
//check if the object is the image type object
if (type != null && type.ToString().Equals(PdfName.IMAGE.ToString()))
{
//get the image
pio = new PdfImageObject(pst);
// fs = new FileStream(path + "image" + i + ".jpg", FileMode.Create);
//read bytes of image in to an array
byte[] imgdata = pio.GetImageAsBytes();
try
{
Stream stream = new MemoryStream(imgdata);
FileStream fs = stream as FileStream;
if (fs != null) Console.WriteLine(fs.Name);
}
catch
{
}
}
}
Now you can save your stream.
public void SaveStreamToFile(string fileFullPath, Stream stream)
{
if (stream.Length == 0) return;
// Create a FileStream object to write a stream to a file
using (FileStream fileStream = System.IO.File.Create(fileFullPath, (int)stream.Length))
{
// Fill the bytes[] array with the stream data
byte[] bytesInStream = new byte[stream.Length];
stream.Read(bytesInStream, 0, (int)bytesInStream.Length);
// Use FileStream object to write to the specified file
fileStream.Write(bytesInStream, 0, bytesInStream.Length);
}
}
I'm trying to write verification code for our PDF generating routines, and I'm having difficulty getting PDFsharp to extract text from files created with MigraDoc. The ExtractText code works with other PDFs, but not with the PDFs that I generate with MigraDoc (see code below.)
Any tips on what I'm doing wrong?
//Create the Doc
var doc = new MigraDoc.DocumentObjectModel.Document();
doc.Info.Title = "VerifyReadWrite";
var section = doc.AddSection();
section.AddParagraph("ABCDEF abcdef");
//Render the PDF
var renderer = new PdfDocumentRenderer(true);
var pdf = new PdfDocument();
renderer.PdfDocument = pdf;
renderer.Document = doc;
renderer.RenderDocument();
var msOut = new MemoryStream();
pdf.Save(msOut, true);
var pdfBytes = msOut.ToArray();
//Read the PDF into PdfSharp
var ms = new MemoryStream(pdfBytes);
var pdfRead = PdfSharp.Pdf.IO.PdfReader.Open(ms, PdfDocumentOpenMode.ReadOnly);
var segments = pdfRead.Pages[0].ExtractText().ToList();
Results in the following:
segments[0] = "\0$\0%\0&\0'\0(\0)"
segments[1] = "\0D\0E\0F\0G\0H\0I"
I'd expect to see:
segments[0] = "ABCDEF"
segments[1] = "abcdef"
I'm using the ExtractText code from here:
C# Extract text from PDF using PdfSharp
and it works very well for all but PDFs generated with MigraDoc.
public static IEnumerable<string> ExtractText(this PdfPage page)
{
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
return text.Select(x => x.Trim());
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = (COperator) cObject;
if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else
{
var sequence = cObject as CSequence;
if (sequence != null)
{
var cSequence = sequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = (CString) cObject;
yield return cString.Value;
}
}
}
It seems the code used to extract text does not support all cases.
Try new PdfDocumentRenderer(false) (instead of 'true'). AFAIK this will lead to a different encoding and the text extraction might work.
I want to read a .docx file and send its content in Email as email body not as an attachment.
So for this I use openXML and OpenXmlPowerTools to convert docx file to html. This is almost working fine until i got a document which has Header and Footer with images.
Here is my code to convert .docx to Html
using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
{
HtmlConverterSettings convSettings = new HtmlConverterSettings()
{
FabricateCssClasses = true,
CssClassPrefix = "cls-",
RestrictToSupportedLanguages = false,
RestrictToSupportedNumberingFormats = false,
ImageHandler = imageInfo =>
{
DirectoryInfo localDirInfo = new DirectoryInfo(imageDirectoryName);
if (!localDirInfo.Exists)
{
localDirInfo.Create();
}
++imageCounter;
string extension = imageInfo.ContentType.Split('/')[1].ToLower();
ImageFormat imageFormat = null;
if (extension == "png")
{
extension = "jpeg";
imageFormat = ImageFormat.Jpeg;
}
else if (extension == "bmp")
{
imageFormat = ImageFormat.Bmp;
}
else if (extension == "jpeg")
{
imageFormat = ImageFormat.Jpeg;
}
else if (extension == "tiff")
{
imageFormat = ImageFormat.Tiff;
}
// If the image format is not one that you expect, ignore it,
// and do not return markup for the link.
if (imageFormat == null)
{
return null;
}
string imageFileName = imageDirectoryName + "/image" + imageCounter.ToString() + "." + extension;
try
{
imageInfo.Bitmap.Save(imageFileName, imageFormat);
}
catch (System.Runtime.InteropServices.ExternalException)
{
return null;
}
XElement img = new XElement(Xhtml.img, new XAttribute(NoNamespace.src, imageFileName), imageInfo.ImgStyleAttribute, imageInfo.AltText != null ? new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
return img;
}
};
XElement html = OpenXmlPowerTools.HtmlConverter.ConvertToHtml(doc1, convSettings);
Above code works fine, convert images as well, but if the document has header and footer those are not converted.
So is their any workaround to include header and footer in html file.
Please suggest me. Thanks!
OpenXmlPowerTools ignores headers and footers when converting a docx-document to HTML, so they won't show up in the resulting HTML (you can browse the source code on github).
Perhaps it's because the concept of a 'page' doesn't apply to HTML, so there's no obvious equivalent to a document header.
I need to retrieve the layer2 text from a signature. How can I get the description (under the signature image) using itextsharp? below is the code I'm using to get the sign date and username:
PdfReader reader = new PdfReader(pdfPath, System.Text.Encoding.UTF8.GetBytes(MASTER_PDF_PASSWORD));
using (MemoryStream memoryStream = new MemoryStream())
{
PdfStamper stamper = new PdfStamper(reader, memoryStream);
AcroFields acroFields = stamper.AcroFields;
List<String> names = acroFields.GetSignatureNames();
foreach (String name in names)
{
PdfPKCS7 pk = acroFields.VerifySignature(name);
String userName = PdfPKCS7.GetSubjectFields(pk.SigningCertificate).GetField("CN");
Console.WriteLine("Sign Date: " + pk.SignDate.ToString() + " Name: " + userName);
// Here i need to retrieve the description underneath the signature image
}
reader.RemoveUnusedObjects();
reader.Close();
stamper.Writer.CloseStream = false;
if (stamper != null)
{
stamper.Close();
}
}
and below is the code I used to set the description
PdfStamper st = PdfStamper.CreateSignature(reader, memoryStream, '\0', null, true);
PdfSignatureAppearance sap = st.SignatureAppearance;
sap.Render = PdfSignatureAppearance.SignatureRender.GraphicAndDescription;
sap.Layer2Font = font;
sap.Layer2Text = "Some text that i want to retrieve";
Thank you.
While Bruno addressed the issue starting with a PDF containing a "layer 2", allow me to first state that using these "signature layers" in PDF signature appearances is not required by the PDF specification, the specification actually does not even know these layers at all! Thus, if you try to parse a specific layer, you may not find such a "layer" or, even worse, find something that looks like that layer (a XObject named n2) which contains the wrong data.
That been said, though, Whether you look for text from a layer 2 or from the signature appearance as a whole, you can use iTextSharp text extraction capabilities. I used Bruno's code as base for retrieving the n2 layer.
public static void ExtractSignatureTextFromFile(FileInfo file)
{
try
{
Console.Out.Write("File: {0}\n", file);
using (var pdfReader = new PdfReader(file.FullName))
{
AcroFields fields = pdfReader.AcroFields;
foreach (string name in fields.GetSignatureNames())
{
Console.Out.Write(" Signature: {0}\n", name);
iTextSharp.text.pdf.AcroFields.Item item = fields.GetFieldItem(name);
PdfDictionary widget = item.GetWidget(0);
PdfDictionary ap = widget.GetAsDict(PdfName.AP);
if (ap == null)
continue;
PdfStream normal = ap.GetAsStream(PdfName.N);
if (normal == null)
continue;
Console.Out.Write(" Content of normal appearance: {0}\n", extractText(normal));
PdfDictionary resources = normal.GetAsDict(PdfName.RESOURCES);
if (resources == null)
continue;
PdfDictionary xobject = resources.GetAsDict(PdfName.XOBJECT);
if (xobject == null)
continue;
PdfStream frm = xobject.GetAsStream(PdfName.FRM);
if (frm == null)
continue;
PdfDictionary res = frm.GetAsDict(PdfName.RESOURCES);
if (res == null)
continue;
PdfDictionary xobj = res.GetAsDict(PdfName.XOBJECT);
if (xobj == null)
continue;
PRStream n2 = (PRStream) xobj.GetAsStream(PdfName.N2);
if (n2 == null)
continue;
Console.Out.Write(" Content of normal appearance, layer 2: {0}\n", extractText(n2));
}
}
}
catch (Exception ex)
{
Console.Error.Write("Error... " + ex.StackTrace);
}
}
public static String extractText(PdfStream xObject)
{
PdfDictionary resources = xObject.GetAsDict(PdfName.RESOURCES);
ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
PdfContentStreamProcessor processor = new PdfContentStreamProcessor(strategy);
processor.ProcessContent(ContentByteUtils.GetContentBytesFromContentObject(xObject), resources);
return strategy.GetResultantText();
}
For the sample file signature_n2.pdf Bruno used you get this:
File: ...\signature_n2.pdf
Signature: Signature1
Content of normal appearance: This document was signed by Bruno
Specimen.
Content of normal appearance, layer 2: This document was signed by Bruno
Specimen.
As this sample uses the layer 2 as the OP expects, it already contains the text in question.
Please take a look at the following PDF: signature_n2.pdf. It contains a signature with the following text in the n2 layer:
This document was signed by Bruno
Specimen.
Before we can write code to extract this text, we should use iText RUPS to look at the internal structure of the PDF, so that we can find out where this /n2 layer is stored:
Based on this information, we can start writing our code. See the GetN2fromSig example:
public static void main(String[] args) throws IOException {
PdfReader reader = new PdfReader(SRC);
AcroFields fields = reader.getAcroFields();
Item item = fields.getFieldItem("Signature1");
PdfDictionary widget = item.getWidget(0);
PdfDictionary ap = widget.getAsDict(PdfName.AP);
PdfStream normal = ap.getAsStream(PdfName.N);
PdfDictionary resources = normal.getAsDict(PdfName.RESOURCES);
PdfDictionary xobject = resources.getAsDict(PdfName.XOBJECT);
PdfStream frm = xobject.getAsStream(PdfName.FRM);
PdfDictionary res = frm.getAsDict(PdfName.RESOURCES);
PdfDictionary xobj = res.getAsDict(PdfName.XOBJECT);
PRStream n2 = (PRStream) xobj.getAsStream(PdfName.N2);
byte[] stream = PdfReader.getStreamBytes(n2);
System.out.println(new String(stream));
}
We get the widget annotation for the signature field with name "signature1". Based on the info from RUPS, we know that we have to get the resources (/Resources) of the normal (/N) appearance (/AP). In the /XObjects dictionary, we'll find a form XObject named /FRM. This XObject has in turn also some /Resources, more specifically two /XObjects, one named /n0, the other one named /n2.
We get the stream of the /n2 object and we convert it to an uncompressed byte[]. When we print this array as a String, we get the following result:
BT
1 0 0 1 0 49.55 Tm
/F1 12 Tf
(This document was signed by Bruno)Tj
1 0 0 1 0 31.55 Tm
(Specimen.)Tj
ET
This is PDF syntax. BT and ET stand for "Begin Text" and "End Text". The Tm operator set the text matrix. The Tf operator set the font. Tj shows the strings that are delimited by ( and ). If you want the plain text, it's sufficient to extract only the text that is between parentheses.
I have an application where I need to get the objects that are embedded in excel should be stored in some location through code.
connection.Close();
//connection.ResetState();
string embeddingPartString;
//ArrayList chkdLstEmbeddedFiles = new ArrayList(); ;
List<String> chkdLstEmbeddedFiles = new List<string>();
//string fileName = txtSourceFile.Text;
if (filepath == string.Empty || !System.IO.File.Exists(filepath))
{
return;
}
// Open the package file
pkg = Package.Open(filepath, FileMode.Open, FileAccess.ReadWrite);
System.IO.FileInfo fi = new System.IO.FileInfo(filepath);
string extension = fi.Extension.ToLower();
if ((extension == ".docx") || (extension == ".dotx") || (extension == ".docm") || (extension == ".dotm"))
{
embeddingPartString = "/word/embeddings/";
}
else if ((extension == ".xlsx") || (extension == ".xlsm") || (extension == ".xltx") || (extension == ".xltm"))
{
embeddingPartString = "/xl/embeddings/";
}
else
{
embeddingPartString = "/ppt/embeddings/";
}
// Get the embedded files names.
foreach (PackagePart pkgPart in pkg.GetParts())
{
if (pkgPart.Uri.ToString().StartsWith(embeddingPartString))
{
string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
chkdLstEmbeddedFiles.Add(fileName1);
}
}
//pkg.Close();
if (chkdLstEmbeddedFiles.Count == 0)
//MessageBox.Show("The file does not contain any embedded files.");
// Open the package and loop through parts
// Check if the part uri to find if it contains the selected items in checked list box
pkg = Package.Open(filepath);
foreach (PackagePart pkgPart in pkg.GetParts())
{
for (int i = 0; i < chkdLstEmbeddedFiles.Count; i++)
{
object chkditem = chkdLstEmbeddedFiles[i];
if (pkgPart.Uri.ToString().Contains(embeddingPartString + chkdLstEmbeddedFiles[i].ToString()))
{
// Get the file name
string fileName1 = pkgPart.Uri.ToString().Remove(0, embeddingPartString.Length);
// Get the stream from the part
System.IO.Stream partStream = pkgPart.GetStream();
//string filePath = txtDestinationFolder.Text + "\\" + fileName1;
// Write the steam to the file.
Environment.GetEnvironmentVariable("userprofile");
System.IO.FileStream writeStream = new System.IO.FileStream(#"C:\Users\jyshet\Documents\MicrosoftDoc_File.docx", FileMode.Create, FileAccess.Write);
ReadWriteStream(pkgPart.GetStream(), writeStream);
// If the file is a structured storage file stored as a oleObjectXX.bin file
// Use Ole10Native class to extract the contents inside it.
if (fileName1.Contains("oleObject"))
{
// The Ole10Native class is defined in Ole10Native.cs file
// Ole10Native.ExtractFile(filePath, txtDestinationFolder.Text);
}
}
}
}
pkg.Close();
I am able to download only doc file and rest are not seen like txt and mht files. Please suggest what to do in this case. My problem is any objects embedded in excel should be downloaded thru code and saved in destination.