Get Meta Data from PDF using PDFsharp

How do I get metadata from a PDF using PDFsharp? Refer to the image.
I want to extract the 'Document Restrictions Summary'. My code so far:
private static void Method1(string strPDFAddress)
{
PdfDocument pdfDoc = new PdfDocument(strPDFAddress);
Console.WriteLine("--------------------------------------------------------------");
Console.WriteLine("File: {0}", strPDFAddress);
Console.WriteLine("Author: {0}", pdfDoc.Info.Author);
Console.WriteLine("CreationDate: {0}", pdfDoc.Info.CreationDate);
Console.WriteLine("Creator: {0}", pdfDoc.Info.Creator);
Console.WriteLine("Keywords: {0}", pdfDoc.Info.Keywords);
PdfDocumentSettings pdfDocSettings = pdfDoc.Settings;
Console.WriteLine(pdfDocSettings.ToString());
PdfSecuritySettings pdfSecuritySettings = pdfDoc.SecuritySettings;
Console.WriteLine(pdfSecuritySettings.PermitExtractContent);
//PdfSharp.Pdf.Advanced.PdfFormXObject xObj =
PdfDictionary.DictionaryElements pdfDictionaryElements = pdfDoc.Info.Elements;
Console.WriteLine(pdfDictionaryElements.ToString());
}

Try this (note that a PdfReader with an Info dictionary is the iTextSharp API, not PDFsharp).
Hope it works.
PdfReader reader = new PdfReader("HelloWorldNoMetadata.pdf");
string s = reader.Info["Author"];

You can set these document restrictions with the PdfSecuritySettings class.
See this sample:
http://www.pdfsharp.net/wiki/ProtectDocument-sample.ashx
I'm not sure but I would expect that this structure will also be filled when opening a PDF document.
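
A minimal sketch of that idea, assuming PDFsharp 1.x: it opens an existing file with PdfReader.Open and prints the Permit* flags of PdfSecuritySettings. The class and method names (RestrictionDump, Print) are made up for illustration. Whether the flags mirror the restrictions stored in the file is exactly what the answer above is unsure about, so treat the output as something to verify against a known document. Password-protected files may need one of the Open overloads that accept a password.

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using PdfSharp.Pdf.Security;

public static class RestrictionDump
{
    // Hypothetical helper: prints the permission flags PDFsharp reports for a file.
    public static void Print(string path)
    {
        // PdfReader.Open loads an existing file. As far as I know, new PdfDocument(path)
        // starts a new, empty document instead, so its Info and security data would be blank.
        PdfDocument doc = PdfReader.Open(path, PdfDocumentOpenMode.ReadOnly);
        PdfSecuritySettings security = doc.SecuritySettings;

        Console.WriteLine("Printing:            {0}", security.PermitPrint);
        Console.WriteLine("Full quality print:  {0}", security.PermitFullQualityPrint);
        Console.WriteLine("Modify document:     {0}", security.PermitModifyDocument);
        Console.WriteLine("Assemble document:   {0}", security.PermitAssembleDocument);
        Console.WriteLine("Extract content:     {0}", security.PermitExtractContent);
        Console.WriteLine("Accessibility copy:  {0}", security.PermitAccessibilityExtractContent);
        Console.WriteLine("Annotations:         {0}", security.PermitAnnotations);
        Console.WriteLine("Fill forms:          {0}", security.PermitFormsFill);
    }
}
```

Usage would be RestrictionDump.Print(strPDFAddress); from the question's Method1.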

Related

C# Pdf to Text with values in multiple line

Hi, I have a PDF with the following content:
Property Address: 123 Door Form Type: Miscellaneous
ABC City
Pin - XXX
When I use iTextSharp to get the content, it comes out as follows:
Property Address: 123 Door Form Type: Miscellaneous ABC City Pin - XXX
The address is mixed with the other field because it continues on the next line. Please suggest a way to get the content in the required form:
Property Address: 123 Door ABC City Pin - XXX Form Type: Miscellaneous
Thanks.

The following code, using iTextSharp, helped with extracting the text:
PdfReader reader = new PdfReader(path);
int pagenumber = reader.NumberOfPages;
for (int page = 1; page <= pagenumber; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string tt = PdfTextExtractor.GetTextFromPage(reader, page , strategy);
tt = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(tt)));
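// tt is a single string, so AppendAllText is used; AppendAllLines expects a collection of lines.
File.AppendAllText(outfile, tt + Environment.NewLine, Encoding.UTF8);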
}

I'm using the helper class below to convert a PDF to text. It works well for me.
If anyone needs a full working desktop application, please refer to this GitHub repo:
https://github.com/Kithuldeniya/PDFReader
using iText.Kernel.Geom;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
namespace PDFReader.Helpers
{
public static class PdfHelper
{
public static string ManipulatePdf(string filePath)
{
PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
//CustomFontFilter fontFilter = new CustomFontFilter(rect);
FilteredEventListener listener = new FilteredEventListener();
// Create a text extraction renderer
LocationTextExtractionStrategy extractionStrategy = listener
.AttachEventListener(new LocationTextExtractionStrategy());
// Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
new PdfCanvasProcessor(listener).ProcessPageContent(pdfDoc.GetFirstPage());
// Get the resultant text after applying the custom filter
String actualText = extractionStrategy.GetResultantText();
pdfDoc.Close();
return actualText;
}
}
}
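
The helper above only processes the first page. A minimal sketch of a multi-page variant, assuming the same iText 7 packages as the helper: it loops over all pages and runs a fresh LocationTextExtractionStrategy on each one through PdfTextExtractor.GetTextFromPage. The names PdfAllPagesHelper and ExtractAllPages are illustrative, not part of the linked repo.

```csharp
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public static class PdfAllPagesHelper
{
    // Hypothetical helper: returns the text of every page, one page after another.
    public static string ExtractAllPages(string filePath)
    {
        var text = new StringBuilder();
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(filePath));
        try
        {
            // iText page numbers start at 1.
            for (int page = 1; page <= pdfDoc.GetNumberOfPages(); page++)
            {
                // A new strategy per page keeps text from earlier pages out of the result.
                var strategy = new LocationTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(page), strategy));
            }
        }
        finally
        {
            pdfDoc.Close();
        }
        return text.ToString();
    }
}
```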

Inserting a doc file inplace of place holder

I have a Word document that contains many pages. One of those pages contains only a placeholder. I want to replace that placeholder with another doc file without losing formatting. The doc file to be inserted may itself have many pages. How can I replace the placeholder with this doc file programmatically? I searched a lot but could not find any option for inserting a doc file in place of a placeholder. Thank you in advance.
Alternatively, how can we copy the contents of the doc to be inserted and then replace the placeholder with the copied content?
I found a post here. The code below is from that post.
With the library, you can do the following to replace text in a Word document, where documentByteArray is your document's byte content taken from the database:
using (MemoryStream mem = new MemoryStream())
{
mem.Write(documentByteArray, 0, (int)documentByteArray.Length);
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(mem, true)) // open the stream filled above
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("Hello world!");
docText = regexText.Replace(docText, "Hi Everyone!");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}
What if, instead of "Hi Everyone!", the replacement is binary data, that is, an array of bytes:
byte[] binarydata = File.ReadAllBytes(filepaths);
How can the program be modified?

First of all, you should get a NuGet package called Novacode.Docx. It is the best document creator and editor I have found in the last few years.
using Novacode;
void Main()
{
var doc = DocX.Load(@"c:\temp\existingDoc.docx");
var docToAdd = DocX.Load(@"c:\temp\docToAdd.docx");
// Keep only the call that matches your DocX version.
doc.InsertDocument(docToAdd, true); //version 1.0.0.22
doc.InsertDocument(docToAdd); //version 1.0.0.19
doc.Save(); // write the merged document back to existingDoc.docx
}
This is the simplest and most basic implementation of what you are after, but it works.
For anything else, take a look at the documentation at
https://docx.codeplex.com/
or
http://cathalscorner.blogspot.co.uk/
These are the best places to start. If you do use this library, I would also recommend version 1.0.0.19, as there are some formatting issues in 1.0.0.22.

ASP.NET + C#. Creating word document from template

I have a word document which contains only one page filled with text and graphic. The page also contains some placeholders like [Field1],[Field2],..., etc.
I get data from database and I want to open this document and fill placeholders with some data. For each data row I want to open this document, fill placeholders with row's data and then concatenate all created documents into one document.
What is the best and simplest way to do this?

Instead of a third-party library, I suggest Open XML.
Add the following namespaces: System.IO, System.Text.RegularExpressions,
DocumentFormat.OpenXml.Packaging and DocumentFormat.OpenXml.Wordprocessing.
public static void SearchAndReplace(string document)
{
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(document, true))
{
string docText = null;
using (StreamReader sr = new StreamReader(wordDoc.MainDocumentPart.GetStream()))
{
docText = sr.ReadToEnd();
}
Regex regexText = new Regex("Hello world!");
docText = regexText.Replace(docText, "Hi Everyone!");
using (StreamWriter sw = new StreamWriter(wordDoc.MainDocumentPart.GetStream(FileMode.Create)))
{
sw.Write(docText);
}
}
}
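
One caveat when adapting the code above to placeholders such as [Field1]: square brackets are regex metacharacters, and database values may contain characters like & or < that would break the document XML. A minimal sketch of a safer replacement step, using only the .NET base library; the names PlaceholderFiller and Fill are hypothetical. It also assumes each placeholder is stored as one contiguous piece of text in the XML, which Word does not guarantee because it may split text across several runs.

```csharp
using System.Collections.Generic;
using System.Security;
using System.Text.RegularExpressions;

public static class PlaceholderFiller
{
    // Hypothetical helper: replaces literal tokens such as "[Field1]" in raw document XML.
    public static string Fill(string docText, IDictionary<string, string> values)
    {
        foreach (KeyValuePair<string, string> pair in values)
        {
            // Regex.Escape makes "[Field1]" a literal pattern; unescaped,
            // the brackets would be read as a character class.
            string pattern = Regex.Escape(pair.Key);

            // Escape XML special characters so a value like "A & B" keeps the part well-formed.
            string safeValue = SecurityElement.Escape(pair.Value);

            // A MatchEvaluator avoids "$1"-style substitutions inside the replacement text.
            docText = Regex.Replace(docText, pattern, match => safeValue);
        }
        return docText;
    }
}
```

In SearchAndReplace, the two Regex lines could then become a single call such as docText = PlaceholderFiller.Fill(docText, rowValues); where rowValues holds one database row.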

You'll probably need to use a third party library.
You might want to check out http://www.codeproject.com/Articles/660478/Csharp-Create-and-Manipulate-Word-Documents-Progra
The below section specifically discusses replacing values in a Word document.
http://www.typecastexception.com/post/2013/09/28/C-Create-and-Manipulate-Word-Documents-Programmatically-Using-DocX.aspx#Find-and-Replace-Text-Using-DocX---Merge-Templating--Anyone-

how to convert window form data into PDF

I am working on a salary project. What I want to do is this:
when a user views their salary slip and clicks Download, the complete form data should be converted into a PDF file and stored in a predefined location.
Please suggest code that meets these requirements.

I have faced this problem before, and the best solution I found was to use Microsoft Word Interop. You can put whatever you want in a Word document and then save it as PDF, since Microsoft Word can export documents to PDF.
The simplest way to do this is to save your data as plain text (take care to format it well) and then run this method to convert the plain text to PDF.
public PDFWriter(String Path, String FileName) {
Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
try
{
word.Visible = false;
word.Documents.Open(Path);
word.ActiveDocument.SaveAs2(FileName + ".pdf", Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatPDF);
this.Path = FileName + ".pdf";
}
finally
{
// Runs on success and on failure, so Word is closed exactly once
// and any exception keeps its original stack trace.
word.Quit();
}
}

Download PdfSharp.dll
and add it as a reference.
Capture your form as an image and then draw it onto a PDF page:
private void ImageToPdf(System.Drawing.Image formImage, string destination)
{
// formImage is the captured form; destination is the output .pdf path.
PdfSharp.Pdf.PdfDocument doc = new PdfSharp.Pdf.PdfDocument();
PdfSharp.Pdf.PdfPage oPage = new PdfSharp.Pdf.PdfPage();
doc.Pages.Add(oPage);
PdfSharp.Drawing.XImage img = PdfSharp.Drawing.XImage.FromGdiPlusImage(formImage);
PdfSharp.Drawing.XGraphics xgr = PdfSharp.Drawing.XGraphics.FromPdfPage(oPage);
xgr.DrawImage(img, 0, 0);
doc.Save(destination);
doc.Close();
}
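
The "capture your form as an image" step is not shown above. One possible way, sketched here under the assumption of a Windows Forms application, is Control.DrawToBitmap. The class and method names (FormCapture, Capture) are illustrative only.

```csharp
using System.Drawing;
using System.Windows.Forms;

public static class FormCapture
{
    // Hypothetical helper: renders the form, including its border, into a bitmap.
    public static Bitmap Capture(Form form)
    {
        var bitmap = new Bitmap(form.Width, form.Height);
        form.DrawToBitmap(bitmap, new Rectangle(0, 0, form.Width, form.Height));
        return bitmap;
    }
}
```

The returned Bitmap can be passed to XImage.FromGdiPlusImage in the method above. DrawToBitmap is known not to render some controls faithfully, so the result is worth checking against the actual salary slip form.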

Manipulating Word 2007 Document XML in C#

I am trying to manipulate the XML of a Word 2007 document in C#. I have managed to find and manipulate the node that I want but now I can't seem to figure out how to save it back. Here is what I am trying:
// Open the document from memoryStream
Package pkgFile = Package.Open(memoryStream, FileMode.Open, FileAccess.ReadWrite);
PackageRelationshipCollection pkgrcOfficeDocument = pkgFile.GetRelationshipsByType(strRelRoot);
foreach (PackageRelationship pkgr in pkgrcOfficeDocument)
{
if (pkgr.SourceUri.OriginalString == "/")
{
Uri uriData = new Uri("/word/document.xml", UriKind.Relative);
PackagePart pkgprtData = pkgFile.GetPart(uriData);
XmlDocument doc = new XmlDocument();
doc.Load(pkgprtData.GetStream());
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", nsUri);
XmlNodeList nodes = doc.SelectNodes("//w:body/w:p/w:r/w:t", nsManager);
foreach (XmlNode node in nodes)
{
if (node.InnerText == "{{TextToChange}}")
{
node.InnerText = "success";
}
}
if (pkgFile.PartExists(uriData))
{
// Delete template "/customXML/item1.xml" part
pkgFile.DeletePart(uriData);
}
PackagePart newPkgprtData = pkgFile.CreatePart(uriData, "application/xml");
StreamWriter partWrtr = new StreamWriter(newPkgprtData.GetStream(FileMode.Create, FileAccess.Write));
doc.Save(partWrtr);
partWrtr.Close();
}
}
pkgFile.Close();
I get the error 'Memory stream is not expandable'. Any ideas?

I would recommend that you use Open XML SDK instead of hacking the format by yourself.

Using OpenXML SDK 2.0, I do this:
public void SearchAndReplace(Dictionary<string, string> tokens)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(_filename, true))
ProcessDocument(doc, tokens);
}
private string GetPartAsString(OpenXmlPart part)
{
string text = String.Empty;
using (StreamReader sr = new StreamReader(part.GetStream()))
{
text = sr.ReadToEnd();
}
return text;
}
private void SavePart(OpenXmlPart part, string text)
{
using (StreamWriter sw = new StreamWriter(part.GetStream(FileMode.Create)))
{
sw.Write(text);
}
}
private void ProcessDocument(WordprocessingDocument doc, Dictionary<string, string> tokenDict)
{
ProcessPart(doc.MainDocumentPart, tokenDict);
foreach (var part in doc.MainDocumentPart.HeaderParts)
{
ProcessPart(part, tokenDict);
}
foreach (var part in doc.MainDocumentPart.FooterParts)
{
ProcessPart(part, tokenDict);
}
}
private void ProcessPart(OpenXmlPart part, Dictionary<string, string> tokenDict)
{
string docText = GetPartAsString(part);
foreach (var keyval in tokenDict)
{
Regex expr = new Regex(_starttag + keyval.Key + _endtag);
docText = expr.Replace(docText, keyval.Value);
}
SavePart(part, docText);
}
From this you could write a GetPartAsXmlDocument, do what you want with it, and then stream it back with SavePart(part, xmlString).
Hope this helps!
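
A minimal sketch of the GetPartAsXmlDocument idea mentioned above, written as extension methods so it stands on its own. The names PartXmlExtensions and SaveXmlDocument are hypothetical; only OpenXmlPart.GetStream and XmlDocument come from the libraries.

```csharp
using System.IO;
using System.Xml;
using DocumentFormat.OpenXml.Packaging;

public static class PartXmlExtensions
{
    // Loads the XML of a part (main document, header, footer, ...) into an XmlDocument.
    public static XmlDocument GetPartAsXmlDocument(this OpenXmlPart part)
    {
        var xml = new XmlDocument();
        using (Stream stream = part.GetStream(FileMode.Open, FileAccess.Read))
        {
            xml.Load(stream);
        }
        return xml;
    }

    // Writes the edited XML back, replacing the previous content of the part.
    public static void SaveXmlDocument(this OpenXmlPart part, XmlDocument xml)
    {
        using (Stream stream = part.GetStream(FileMode.Create, FileAccess.Write))
        {
            xml.Save(stream);
        }
    }
}
```

With these, the node edit from the question could run on doc.MainDocumentPart.GetPartAsXmlDocument() and be stored again with SaveXmlDocument, without deleting and recreating the package part by hand.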

You should use the OpenXML SDK to work on docx files and not write your own wrapper.
Getting Started with the Open XML SDK 2.0 for Microsoft Office
Introducing the Office (2007) Open XML File Formats
How to: Manipulate Office Open XML Formats Documents
Manipulate Docx with C# without Microsoft Word installed with OpenXML SDK

The problem appears to be doc.Save(partWrtr), which is built using newPkgprtData, which is built using pkgFile, which loads from a memory stream... Because you loaded from a memory stream it's trying to save the document back to that same memory stream. This leads to the error you are seeing.
Instead of saving it to the memory stream try saving it to a new file or to a new memory stream.
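
For the "new memory stream" option, it helps to know that a MemoryStream created with new MemoryStream(bytes) wraps the given array and cannot grow, while one created with the parameterless constructor can. A minimal sketch using only System.IO; the names ExpandableStreams, FromBytes and TryAppend are made up for this example.

```csharp
using System;
using System.IO;

public static class ExpandableStreams
{
    // Copies the bytes into a stream that can grow, unlike new MemoryStream(bytes).
    public static MemoryStream FromBytes(byte[] bytes)
    {
        var stream = new MemoryStream();      // parameterless constructor: resizable buffer
        stream.Write(bytes, 0, bytes.Length);
        stream.Position = 0;                  // rewind so readers start at the beginning
        return stream;
    }

    // Tries to append one byte at the end; returns false if the stream cannot expand.
    public static bool TryAppend(MemoryStream stream, byte value)
    {
        try
        {
            stream.Position = stream.Length;
            stream.WriteByte(value);
            return true;
        }
        catch (NotSupportedException)         // "Memory stream is not expandable."
        {
            return false;
        }
    }
}
```

TryAppend(new MemoryStream(new byte[] { 1, 2, 3 }), 4) returns false, while TryAppend(ExpandableStreams.FromBytes(new byte[] { 1, 2, 3 }), 4) returns true. A stream built with FromBytes can then be handed to Package.Open(stream, FileMode.Open, FileAccess.ReadWrite) or WordprocessingDocument.Open(stream, true) in place of the fixed-size one.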

The short and simple answer to the 'Memory stream is not expandable' issue is:
Do not open the document from memoryStream.
In that respect the earlier answer is correct: simply open a file instead.
In my experience, opening a document from a MemoryStream and editing it easily leads to 'Memory stream is not expandable'.
I suppose the message appears when an edit requires the memory stream to grow.
I have found that I can make some edits, but nothing that adds to the size.
So, for example, deleting a custom XML part is fine, but adding one with some data is not.
If you really need to work on a memory stream, you must open an expandable MemoryStream if you want to add to it.
I have a need for this and hope to find a solution.
Stein-Tore Erdal
PS: just noticed the answer from "Jan 26 '11 at 15:18".
Don't think that is the answer in all situations.
I get the error when trying this:
var ms = new MemoryStream(bytes);
using (WordprocessingDocument wd = WordprocessingDocument.Open(ms, true))
{
...
using (MemoryStream msData = new MemoryStream())
{
xdoc.Save(msData);
msData.Position = 0;
ourCxp.FeedData(msData); // Memory stream is not expandable.
