How to read PDF bookmarks programmatically

How to read PDF bookmarks programmatically - c#

I'm using a PDF converter to access the graphical data within a PDF. Everything works fine, except that I don't get a list of the bookmarks. Is there a command-line app or a C# component that can read a PDF's bookmarks? I found the iText and SharpPDF libraries and I'm currently looking through them. Have you ever done such a thing?

Try the following code
PdfReader pdfReader = new PdfReader(filename);
IList<Dictionary<string, object>> bookmarks = SimpleBookmark.GetBookmark(pdfReader);
for(int i=0;i<bookmarks.Count;i++)
{
MessageBox.Show(bookmarks[i].Values.ToArray().GetValue(0).ToString());
if (bookmarks[i].Count > 3)
{
MessageBox.Show(bookmarks[i].ToList().Count.ToString());
}
}
Note: Don't forget to add iTextSharp DLL to your project.

As the bookmarks are in a tree structure (https://en.wikipedia.org/wiki/Tree_(data_structure)),
I've used some recursion here to collect all bookmarks and it's children.
iTextSharp solved it for me.
dotnet add package iTextSharp
Collected all bookmarks with the following code:
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
namespace PdfManipulation
{
class Program
{
static void Main(string[] args)
{
StringBuilder bookmarks = ExtractAllBookmarks("myPdfFile.pdf");
}
private static StringBuilder ExtractAllBookmarks(string pdf)
{
StringBuilder sb = new StringBuilder();
PdfReader reader = new PdfReader(pdf);
IList<Dictionary<string, object>> bookmarksTree = SimpleBookmark.GetBookmark(reader);
foreach (var node in bookmarksTree)
{
sb.AppendLine(PercorreBookmarks(node).ToString());
}
return RemoveAllBlankLines(sb);
}
private static StringBuilder RemoveAllBlankLines(StringBuilder sb)
{
return new StringBuilder().Append(Regex.Replace(sb.ToString(), #"^\s+$[\r\n]*", string.Empty, RegexOptions.Multiline));
}
private static StringBuilder PercorreBookmarks(Dictionary<string, object> bookmark)
{
StringBuilder sb = new StringBuilder();
sb.AppendLine(bookmark["Title"].ToString());
if (bookmark != null && bookmark.ContainsKey("Kids"))
{
IList<Dictionary<string, object>> children = (IList<Dictionary<string, object>>) bookmark["Kids"];
foreach (var bm in children)
{
sb.AppendLine(PercorreBookmarks(bm).ToString());
}
}
return sb;
}
}
}

You can use the PDFsharp library. It is published under the MIT License so it can be used even in corporate development. Here is an untested example.
using PdfSharp.Pdf;
using (PdfDocument document = PdfReader.IO.Open("bookmarked.pdf", IO.PdfDocumentOpenMode.Import))
{
PdfDictionary outline = document.Internals.Catalog.Elements.GetDictionary("/Outlines");
PrintBookmark(outline);
}
void PrintBookmark(PdfDictionary bookmark)
{
Console.WriteLine(bookmark.Elements.GetString("/Title"));
for (PdfDictionary child = bookmark.Elements.GetDictionary("/First"); child != null; child = child.Elements.GetDictionary("/Next"))
{
PrintBookmark(child);
}
}
Gotchas:
PdfSharp doesn't support open pdf's over version 1.6 very well. (throws: cannot handle iref streams. the current implementation of pdfsharp cannot handle this pdf feature introduced with acrobat 6)
There are many types of strings in PDFs which PDFsharp returns as is including UTF-16BE strings. (7.9.2.1 ISO32000 2008)

You might try Docotic.Pdf library for the task if you are fine with a commercial solution.
Here is a sample code to list all top-level items from bookmarks with some of their properties.
using (PdfDocument doc = new PdfDocument("file.pdf"))
{
PdfOutlineItem root = doc.OutlineRoot;
foreach (PdfOutlineItem item in root.Children)
{
Console.WriteLine("{0} ({1} child nodes, points to page {2})",
item.Title, item.ChildCount, item.PageIndex);
}
}
PdfOutlineItem class also provides properties related to outline item styles and more.
Disclaimer: I work for the vendor of the library.

If a commercial library is an option for you you could give Amyuni PDF Creator .Net a try.
Use the class Amyuni.PDFCreator.IacDocument.RootBookmark to retrieve the root of the bookmarks' tree, then the properties in IacBookmark to access each tree element, to navigate through the tree, and to add, edit or remove elements if needed.
Usual disclaimer applies

Related

ITextSharp 4.1.6 extract PDF content as text

The company would like to use the Itextsharp 4.1.6 version specifically and don't want to buy the license (version 5/7).
So, we had already implemented the TextExtract from pdf using the itextsharp 5 version. As we downgraded, this method doesn't support in the 4.16 LGPL version.
So, I looked into many StackOverflow and other sites for the answer. Looks like no custom implementation found other than the below code which exists in AGPL version.
PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy())
And byte[] pageContent = reader.GetPageContent(i); gives the byte content, when converted to string it won't give us the exact file text.
As, we do not wish to buy the AGPL version and need to implement the textextractor of pdf, any idea if any other tool supports this/ anybody has the implementation of textextractor.
Any suggestions would be greatly appreciated.
Edit: Refernce for the #jgoday's answer:

With iText 4.1 you can use PdfContentParser (https://github.com/schourode/iTextSharp-LGPL/blob/f75cdad88236d502af42458a420d48be2a47008f/src/core/iTextSharp/text/pdf/PdfContentParser.cs), to parse contents of every page.
using System;
using System.Text;
using iTextSharp.text.pdf;
namespace PdfExtractor
{
class Program
{
static void Main(string[] args)
{
var reader = new PdfReader(#"D:\Tmp\sample.pdf");
try
{
var parser = new PdfContentParser(new PRTokeniser(reader.GetPageContent(2)));
var sb = new StringBuilder();
while (parser.Tokeniser.NextToken())
{
if (parser.Tokeniser.TokenType == PRTokeniser.TK_STRING)
{
string str = parser.Tokeniser.StringValue;
sb.Append(str);
}
}
Console.WriteLine(sb.ToString());
}
finally {
reader.Close();
}
}
}
}

how to get Styles from existing word document by using Novacode.Docx?

This is the Example code using OpenXML SDK 2.5
void AddStylesPart()
{
StyleDefinitionsPart styleDefinitionsPart = mainPart.StyleDefinitionsPart;
styleDefinitionsPart = mainPart.AddNewPart<StyleDefinitionsPart>();
Styles styles1 = new Styles();
styles1.Save(styleDefinitionsPart);
if (styleDefinitionsPart != null)
{
using (WordprocessingDocument wordTemplate = WordprocessingDocument.Open(#"..\AT\Docs\FPMaster-4DEV.docx", false))
{
foreach (var templateStyle in wordTemplate.MainDocumentPart.StyleDefinitionsPart.Styles)
{
styleDefinitionsPart.Styles.Append(templateStyle.CloneNode(true));
}
}
}
}
Here an existing document is taken using WordprocessingDocument class finally Cloned all the styles present in existing document,
similarly I want to do it using Novacode.Docx DLL. How to get styles used in existing document using Novacode.Docx DLL? kindly please help.

Found an alternative solution, I hope this will help
Using Novacode.Docx DLL we can easily clone the styles used in original document.
It can be done by creating template of the original document.
once If it is done. apply the template in your project.
document.ApplyTemplate(#"..\TemplateFileName.dotx", false);
Now we can able to use all styles present in original document.

How to insert text into a content control with the Open XML SDK

I'm trying to develop a solution which takes the input from a ASP.Net Web Page and Embed the input values into Corresponding Content Controls within a MS Word Document. The MS Word Document has also got Static Data with some Dynamic data to be Embed into the Header and Footer fields.
The Idea here is that the solution should be Web based. Can I use OpenXML for this purpose or any other approach that you can suggest.
Thank you very much in advance for all your valuable inputs. I really appreciate them.

I have a little code sample from my project, to insert a few words in a content control you've created in a Word document:
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, string text)
{
SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>()
.FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>()?.Val == contentControlTag);
if (element == null)
throw new ArgumentException($"ContentControlTag \"{contentControlTag}\" doesn't exist.");
element.Descendants<Text>().First().Text = text;
element.Descendants<Text>().Skip(1).ToList().ForEach(t => t.Remove());
return doc;
}
It simply looks for the first contentcontrol in the document with a specific Tag (you can set that by enabling designer mode in word and right-clicking on the content control), and replaces the current text with the text passed into the method. After this the document will still contain the content controls of course which may not be desired. So when I'm done editing the document I run the following method to get rid of the content controls:
internal static WordprocessingDocument RemoveSdtBlocks(this WordprocessingDocument doc, IEnumerable<string> contentBlocks)
{
List<SdtElement> SdtBlocks = doc.MainDocumentPart.Document.Descendants<SdtElement>().ToList();
if (contentBlocks == null)
return doc;
foreach(var s in contentBlocks)
{
SdtElement currentElement = SdtBlocks.FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>()?.Val == s);
if (currentElement == null)
continue;
IEnumerable<OpenXmlElement> elements = null;
if (currentElement is SdtBlock)
elements = (currentElement as SdtBlock).SdtContentBlock.Elements();
else if (currentElement is SdtCell)
elements = (currentElement as SdtCell).SdtContentCell.Elements();
else if (currentElement is SdtRun)
elements = (currentElement as SdtRun).SdtContentRun.Elements();
foreach (var el in elements)
currentElement.InsertBeforeSelf(el.CloneNode(true));
currentElement.Remove();
}
return doc;
}
To open the WordProcessingDocument from a template and edit it, there is plenty of information available online.
Edit:
Little sample code to open/save documents while working with them in a memorystream, of course you should take care of this with an extra repository class that takes care of managing the document in the real code:
byte[] byteArray = File.ReadAllBytes(#"C:\...\Template.dotx");
using (var stream = new MemoryStream())
{
stream.Write(byteArray, 0, byteArray.Length);
using (WordprocessingDocument doc = WordprocessingDocument.Open(stream, true))
{
//Needed because I'm working with template dotx file,
//remove this if the template is a normal docx.
doc.ChangeDocumentType(DocumentFormat.OpenXml.WordprocessingDocumentType.Document);
doc.InsertText("contentControlName","testtesttesttest");
}
using (FileStream fs = new FileStream(#"C:\...\newFile.docx", FileMode.Create))
{
stream.WriteTo(fs);
}
}

Read Word bookmarks

Can any tell me how to real all bookmarks in a word 2010 document using openXml 2.0. I was using Microsoft.Office.Interop.Word to read bookmarks but i am not able to deploy my website as it was having issues so i switched to openxml can anyone tell me how to read all bookmarks

you can iterate through all
file.MainDocumentPart.RootElement.Descendants<BookmarkStart>()
like:
IDictionary<String, BookmarkStart> bookmarkMap =
new Dictionary<String, BookmarkStart>();
// get all
foreach (BookmarkStart bookmarkStart in file.MainDocumentPart.RootElement.Descendants<BookmarkStart>())
{
bookmarkMap[bookmarkStart.Name] = bookmarkStart;
}
// get their text
foreach (BookmarkStart bookmarkStart in bookmarkMap.Values)
{
Run bookmarkText = bookmarkStart.NextSibling<Run>();
if (bookmarkText != null)
{
string bookmarkText = bookmarkText.GetFirstChild<Text>().Text;
}
}
code extracted from https://stackoverflow.com/a/3318381/28004

Try this.I have used same in my project
http://www.legalcube.de/post/Word-openxml-sdk-bookmark-handling.aspx

Manipulating Word 2007 Document XML in C#

I am trying to manipulate the XML of a Word 2007 document in C#. I have managed to find and manipulate the node that I want but now I can't seem to figure out how to save it back. Here is what I am trying:
// Open the document from memoryStream
Package pkgFile = Package.Open(memoryStream, FileMode.Open, FileAccess.ReadWrite);
PackageRelationshipCollection pkgrcOfficeDocument = pkgFile.GetRelationshipsByType(strRelRoot);
foreach (PackageRelationship pkgr in pkgrcOfficeDocument)
{
if (pkgr.SourceUri.OriginalString == "/")
{
Uri uriData = new Uri("/word/document.xml", UriKind.Relative);
PackagePart pkgprtData = pkgFile.GetPart(uriData);
XmlDocument doc = new XmlDocument();
doc.Load(pkgprtData.GetStream());
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", nsUri);
XmlNodeList nodes = doc.SelectNodes("//w:body/w:p/w:r/w:t", nsManager);
foreach (XmlNode node in nodes)
{
if (node.InnerText == "{{TextToChange}}")
{
node.InnerText = "success";
}
}
if (pkgFile.PartExists(uriData))
{
// Delete template "/customXML/item1.xml" part
pkgFile.DeletePart(uriData);
}
PackagePart newPkgprtData = pkgFile.CreatePart(uriData, "application/xml");
StreamWriter partWrtr = new StreamWriter(newPkgprtData.GetStream(FileMode.Create, FileAccess.Write));
doc.Save(partWrtr);
partWrtr.Close();
}
}
pkgFile.Close();
I get the error 'Memory stream is not expandable'. Any ideas?

I would recommend that you use Open XML SDK instead of hacking the format by yourself.

Using OpenXML SDK 2.0, I do this:
public void SearchAndReplace(Dictionary<string, string> tokens)
{
using (WordprocessingDocument doc = WordprocessingDocument.Open(_filename, true))
ProcessDocument(doc, tokens);
}
private string GetPartAsString(OpenXmlPart part)
{
string text = String.Empty;
using (StreamReader sr = new StreamReader(part.GetStream()))
{
text = sr.ReadToEnd();
}
return text;
}
private void SavePart(OpenXmlPart part, string text)
{
using (StreamWriter sw = new StreamWriter(part.GetStream(FileMode.Create)))
{
sw.Write(text);
}
}
private void ProcessDocument(WordprocessingDocument doc, Dictionary<string, string> tokenDict)
{
ProcessPart(doc.MainDocumentPart, tokenDict);
foreach (var part in doc.MainDocumentPart.HeaderParts)
{
ProcessPart(part, tokenDict);
}
foreach (var part in doc.MainDocumentPart.FooterParts)
{
ProcessPart(part, tokenDict);
}
}
private void ProcessPart(OpenXmlPart part, Dictionary<string, string> tokenDict)
{
string docText = GetPartAsString(part);
foreach (var keyval in tokenDict)
{
Regex expr = new Regex(_starttag + keyval.Key + _endtag);
docText = expr.Replace(docText, keyval.Value);
}
SavePart(part, docText);
}
From this you could write a GetPartAsXmlDocument, do what you want with it, and then stream it back with SavePart(part, xmlString).
Hope this helps!

You should use the OpenXML SDK to work on docx files and not write your own wrapper.
Getting Started with the Open XML SDK 2.0 for Microsoft Office
Introducing the Office (2007) Open XML File Formats
How to: Manipulate Office Open XML Formats Documents
Manipulate Docx with C# without Microsoft Word installed with OpenXML SDK

The problem appears to be doc.Save(partWrtr), which is built using newPkgprtData, which is built using pkgFile, which loads from a memory stream... Because you loaded from a memory stream it's trying to save the document back to that same memory stream. This leads to the error you are seeing.
Instead of saving it to the memory stream try saving it to a new file or to a new memory stream.

The short and simple answer to the issue with getting 'Memory stream is not expandable' is:
Do not open the document from memoryStream.
So in that respect the earlier answer is correct, simply open a file instead.
Opening from MemoryStream editing the document (in my experience) easy lead to 'Memory stream is not expandable'.
I suppose the message appears when one do edits that requires the memory stream to expand.
I have found that I can do some edits but not anything that add to the size.
So, f.ex deleting a custom xml part is ok but adding one and some data is not.
So if you actually need to open a memory stream you must figure out how to open an expandable MemoryStream if you want to add to it.
I have a need for this and hope to find a solution.
Stein-Tore Erdal
PS: just noticed the answer from "Jan 26 '11 at 15:18".
Don't think that is the answer in all situations.
I get the error when trying this:
var ms = new MemoryStream(bytes);
using (WordprocessingDocument wd = WordprocessingDocument.Open(ms, true))
{
...
using (MemoryStream msData = new MemoryStream())
{
xdoc.Save(msData);
msData.Position = 0;
ourCxp.FeedData(msData); // Memory stream is not expandable.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

How to read PDF bookmarks programmatically - c#

Related

ITextSharp 4.1.6 extract PDF content as text

how to get Styles from existing word document by using Novacode.Docx?

How to insert text into a content control with the Open XML SDK

Read Word bookmarks

Manipulating Word 2007 Document XML in C#

Categories

Resources