Reading PDF content with itextsharp dll in VB.NET or C#

Reading PDF content with itextsharp dll in VB.NET or C# - c#

How can I read PDF content with the itextsharp with the Pdfreader class. My PDF may include Plain text or Images of the text.

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}

LGPL / FOSS iTextSharp 4.x
var pdfReader = new PdfReader(path); //other filestream etc
byte[] pageContent = _pdfReader .GetPageContent(pageNum); //not zero based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);
None of the other answers were useful to me, they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.
Something else that might be very useful in conjunction with this:
const string PdfTableFormat = #"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);
List<string> ExtractPdfContent(string rawPdfContent)
{
var matches = PdfTableRegex.Matches(rawPdfContent);
var list = matches.Cast<Match>()
.Select(m => m.Value
.Substring(1) //remove leading (
.Remove(m.Value.Length - 4) //remove trailing )Tj
.Replace(#"\)", ")") //unencode parens
.Replace(#"\(", "(")
.Trim()
)
.ToList();
return list;
}
This will extract the text-only data from the PDF if the text displayed is Foo(bar) it will be encoded in the PDF as (Foo\(bar\))Tj, this method would return Foo(bar) as expected. This method will strip out lots of additional information such as location coordinates from the raw pdf content.

Here is a VB.NET solution based on ShravankumarKumar's solution.
This will ONLY give you the text. The images are a different story.
Public Shared Function GetTextFromPDF(PdfFileName As String) As String
Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
Dim sOut = ""
For i = 1 To oReader.NumberOfPages
Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
Next
Return sOut
End Function

In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.
Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire page - PDF coordinate system 0,0 is bottom left corner. 72 points / inch
RenderFilter _renderfilter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);
As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.

Here an improved answer of ShravankumarKumar. I created special classes for the pages so you can access words in the pdf based on the text rows and the word in that row.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
//create a list of pdf pages
var pages = new List<PdfPage>();
//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
//loop all the pages and extract the text
for (int i = 1; i <= reader.NumberOfPages; i++)
{
pages.Add(new PdfPage()
{
content = PdfTextExtractor.GetTextFromPage(reader, i)
});
}
}
//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y =>
new PdfRow() {
content = y,
words = y.Split(' ').ToList()
}
).ToList());
The custom classes
class PdfPage
{
public string content { get; set; }
public List<PdfRow> rows { get; set; }
}
class PdfRow
{
public string content { get; set; }
public List<string> words { get; set; }
}
Now you can get a word by row and word index.
string myWord = pages[0].rows[12].words[4];
Or use Linq to find the rows containing a specific word.
//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();
//find the rows in all pages containing a word
var myRows = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();

Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
Dim sr As StreamReader = New StreamReader(sTxtfile)
Dim doc As New Document()
PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
doc.Open()
doc.Add(New Paragraph(sr.ReadToEnd()))
doc.Close()
End Sub

Related

Using iText and C#, create named destinations at exact same page coordinate as each PDF bookmark

I can iterate through all the bookmarks in my PDF file and create named destinations but positioned at the top of the same page as each bookmark and not at the same coordinates.
The code below works but how do I set the named destination to also be at the exact same page coordinates as the bookmark (there can be a few bookmarks per page):
string file = #"C:\MyDocs\MyDoc1.pdf";
string newFile = #"C:\MyDocs\MyDoc2.pdf";
PdfDocument pdf = new PdfDocument(new PdfReader(file), new PdfWriter(newFile));
// Get the bookmarks
PdfOutline outlines = pdf.GetOutlines(false);
List<PdfOutline> bookmarks = outlines.GetAllChildren().ToList<PdfOutline>();
PdfNameTree destsTree = pdf.GetCatalog().GetNameTree(PdfName.Dests);
IDictionary<String, PdfObject> Tnames = destsTree.GetNames();
foreach (var item in bookmarks)
{
string title = item.GetTitle();
int pgn = pdf.GetPageNumber((PdfDictionary)item.GetDestination().GetDestinationPage(Tnames));
PdfPage pdfPage = pdf.GetPage(pgn);
iText.Kernel.Geom.Rectangle pageRect = pdfPage.GetPageSize();
float getLeft = pageRect.GetLeft();
float getTop = pageRect.GetTop();
PdfExplicitDestination destObj = PdfExplicitDestination.CreateXYZ(pdfPage, getLeft, getTop, 1);
pdf.AddNamedDestination(title, destObj.GetPdfObject());
}
pdf.Close();

How to write Arabic text using MigraDoc?

I am using ASP for this and I had to generate reports in PDF format and send the file back to clients so they can download it.
I made the reports using MigraDoc library and they were great but after I tried it with Arabic text I found the texts were in LTR and the characters were disjointed so I made this code to test things out
...............
MigraDoc.DocumentObjectModel.Document reportDoc = new MigraDoc.DocumentObjectModel.Document();
reportDoc.Info.Title = "test";
sec = reportDoc.AddSection();
string fileName = "test.pdf";
addformattedText(sec, "العبارة", true);
PdfDocumentRenderer renderer = new PdfDocumentRenderer(true);
renderer.Document = reportDoc;
renderer.RenderDocument();
MemoryStream pdfStream = new MemoryStream();
renderer.PdfDocument.Save(pdfStream);
byte[] bytes = pdfStream.ToArray();
...............
private void addformattedText(Section sec,string text, bool shouldBeBold = false)
{
var tf = sec.AddTextFrame();
var p = tf.AddParagraph(text);
p.Format.Font.Name = "Tahoma";
if (shouldBeBold) p.Format.Font.Bold = true;
}
I get the output like this
I have tried to encode the text and make it a unicode string using this code
private string getEscapedString(string text)
{
if (true || HasArabicCharacters(text))
{
string uString = "";
byte[] utfBytes = Encoding.Unicode.GetBytes(text);
foreach (var u in utfBytes)
{
if (u != 0)
{
uString += String.Format(#"\u{0:x4}", u);
}
}
return uString;
}
else
return text;
}
and get the returned string into a paragraph and save the PDF documents with unicode parameter set to true
But it is all the same.
I can not figure out how to get it done.
The reports were done using MigraDoc 1.50.5147 library.

The problem is Arabic language font have 4 different shap in begging,last,connected and alone, where Pdfsharp and MigraDoc can not recognize which shap to print farther more you need to reverse the character order to solve this you can use AraibcPdfUnicodeGlyphsResharper to help do such work as following:
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using AraibcPdfUnicodeGlyphsResharper;
namespace MigraDocArabic
{
internal class PrintArabicUsingPdfSharp
{
public PrintArabicUsingPdfSharp(string path)
{
PdfDocument document = new PdfDocument();
document.Info.Title = "Created with PDFsharp";
System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
// Create an empty page
PdfPage page = document.AddPage();
// Get an XGraphics object for drawing
XGraphics gfx = XGraphics.FromPdfPage(page);
// Create a font
XFont font = new XFont("Arial", 20, XFontStyle.BoldItalic);
var xArabicString = "كتابة اللغة العربية شيئ جميل".ArabicWithFontGlyphsToPfd();
// Draw the text
gfx.DrawString("Hello, World!", font, XBrushes.Black, new XRect(0, 0, page.Width, page.Height), XStringFormats.Center);
gfx.DrawString(xArabicString, font, XBrushes.Black, new XRect(50, 50, page.Width, page.Height), XStringFormats.Center);
// Save the document...
document.Save(path);
}
}
}
Do not Forget the Extension method
By the way this is work with iText7 too
see the image for result
Result

PDFsharp does not support RTL languages yet:
http://www.pdfsharp.net/wiki/PDFsharpFAQ.ashx#Does_PDFsharp_support_for_Arabic_Hebrew_CJK_Chinese_Japanese_Korean_6
You can work around this limitation by reversing the string.
PDFsharp does not support font ligatures yet. You are probably able to work around this limitation by replacing letters with the correct glyph (start, middle, end) depending on the position.

C# OpenXML How to Replace \r\n with Break()?

I have a text field in my database and it has a text with many lines.
When generating a MS Word document using OpenXML and bookmarks, the text become one single line.
I've noticed that in each new line the bookmark value show the characters "\r\n".
Looking for a solution, I've found some answers which helped me, but I'm still having a problem.
I've used the run.Append(new Break()); solution, but the text replaced is showing the name of the bookmark as well.
For example:
bookmark test = "Big text here in first paragraph\r\nSecond paragraph".
It is shown in MS Word document like:
testBig text here in first paragraph
Second paragraph
Can anyone, please, help me to eliminate the bookmark name?
Here is my code:
public void UpdateBookmarksVistoria(string originalPath, string copyPath, string fileType)
{
string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
// Make a copy of the template file.
File.Copy(originalPath, copyPath, true);
//Open the document as an Open XML package and extract the main document part.
using (WordprocessingDocument wordPackage = WordprocessingDocument.Open(copyPath, true))
{
MainDocumentPart part = wordPackage.MainDocumentPart;
//Setup the namespace manager so you can perform XPath queries
//to search for bookmarks in the part.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", wordmlNamespace);
//Load the part's XML into an XmlDocument instance.
XmlDocument xmlDoc = new XmlDocument(nt);
xmlDoc.Load(part.GetStream());
//pega a url para exibir as fotos
string url = HttpContext.Current.Request.Url.ToString();
string enderecoURL;
if (url.Contains("localhost"))
enderecoURL = url.Substring(0, 26);
else if (url.Contains("www."))
enderecoURL = url.Substring(0, 24);
else
enderecoURL = url.Substring(0, 20);
//Iterate through the bookmarks.
int cont = 56;
foreach (KeyValuePair<string, string> bookmark in bookmarks)
{
var res = from bm in part.Document.Body.Descendants<BookmarkStart>()
where bm.Name == bookmark.Key
select bm;
var bk = res.SingleOrDefault();
if (bk != null)
{
Run bookmarkText = bk.NextSibling<Run>();
if (bookmarkText != null) // if the bookmark has text replace it
{
var texts = bookmark.Value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
for (int i = 0; i < texts.Length; i++)
{
if (i > 0)
bookmarkText.Append(new Break());
Text text = new Text();
text.Text = texts[i];
bookmarkText.Append(text); //HERE IS MY PROBLEM
}
}
else // otherwise append new text immediately after it
{
var parent = bk.Parent; // bookmark's parent element
Text text = new Text(bookmark.Value);
Run run = new Run(new RunProperties());
run.Append(text);
// insert after bookmark parent
parent.Append(run);
}
bk.Remove(); // we don't want the bookmark anymore
}
}
//Write the changes back to the document part.
xmlDoc.Save(wordPackage.MainDocumentPart.GetStream(FileMode.Create));
wordPackage.Close();
}}

Extract text by line from PDF using iTextSharp c#

I need to run some analysis my extracting data from a PDF document.
Using iTextSharp, I used the PdfTextExtractor.GetTextFromPage method to extract contents from a PDF document and it returned me in a single long line.
Is there a way to get the text by line so that i can store them in an array? So that i can analyze the data by line which will be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();

public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string page = "";
page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}

I know this is posting on an older post, but I spent a lot of time trying to figure this out so I'm going to share this for the future people trying to google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = #"Your said path\the file name.pdf";
string outPath = #"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.

All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Bascially the code that detects newline is the method
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted if there is only small difference between DistPerpendicular and other.DistPerpendicular, so you need to change it to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10
Or you can put that piece of code in the RenderText method of your custom TextExtractionStrategy/RenderListener class

Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. LocationTextExtractionStrategy extracted text contains the new line character at the end of line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;

Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

using ITextSharp to extract and update links in an existing PDF

I need to post several (read: a lot) PDF files to the web but many of them have hard coded file:// links and links to non-public locations. I need to read through these PDFs and update the links to the proper locations. I've started writing an app using itextsharp to read through the directories and files, find the PDFs and iterate through each page. What I need to do next is find the links and then update the incorrect ones.
string path = "c:\\html";
DirectoryInfo rootFolder = new DirectoryInfo(path);
foreach (DirectoryInfo di in rootFolder.GetDirectories())
{
// get pdf
foreach (FileInfo pdf in di.GetFiles("*.pdf"))
{
string contents = string.Empty;
Document doc = new Document();
PdfReader reader = new PdfReader(pdf.FullName);
using (MemoryStream ms = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
doc.Open();
for (int p = 1; p <= reader.NumberOfPages; p++)
{
byte[] bt = reader.GetPageContent(p);
}
}
}
}
Quite frankly, once I get the page content I'm rather lost on this when it comes to iTextSharp. I've read through the itextsharp examples on sourceforge, but really didn't find what I was looking for.
Any help would be greatly appreciated.
Thanks.

This one is a little complicated if you don't know the internals of the PDF format and iText/iTextSharp's abstraction/implementation of it. You need to understand how to use PdfDictionary objects and look things up by their PdfName key. Once you get that you can read through the official PDF spec and poke around a document pretty easily. If you do care I've included the relevant parts of the PDF spec in parenthesis where applicable.
Anyways, a link within a PDF is stored as an annotation (PDF Ref 12.5). Annotations are page-based so you need to first get each page's annotation array individually. There's a bunch of different possible types of annotations so you need to check each one's SUBTYPE and see if its set to LINK (12.5.6.5). Every link should have an ACTION dictionary associated with it (12.6.2) and you want to check the action's S key to see what type of action it is. There's a bunch of possible ones for this, link's specifically could be internal links or open file links or play sound links or something else (12.6.4.1). You are looking only for links that are of type URI (note the letter I and not the letter L). URI Actions (12.6.4.7) have a URI key that holds the actual address to navigate to. (There's also an IsMap property for image maps that I can't actually imagine anyone using.)
Whew. Still reading? Below is a full working VS 2010 C# WinForms app based on my post here targeting iTextSharp 5.1.1.0. This code does two main things: 1) Create a sample PDF with a link in it pointing to Google.com and 2) replaces that link with a link to bing.com. The code should be pretty well commented but feel free to ask any questions that you might have.
using System;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
//Folder that we are working in
private static readonly string WorkingFolder = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Hyperlinked PDFs");
//Sample PDF
private static readonly string BaseFile = Path.Combine(WorkingFolder, "OldFile.pdf");
//Final file
private static readonly string OutputFile = Path.Combine(WorkingFolder, "NewFile.pdf");
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
CreateSamplePdf();
UpdatePdfLinks();
this.Close();
}
private static void CreateSamplePdf()
{
//Create our output directory if it does not exist
Directory.CreateDirectory(WorkingFolder);
//Create our sample PDF
using (iTextSharp.text.Document Doc = new iTextSharp.text.Document(PageSize.LETTER))
{
using (FileStream FS = new FileStream(BaseFile, FileMode.Create, FileAccess.Write, FileShare.Read))
{
using (PdfWriter writer = PdfWriter.GetInstance(Doc, FS))
{
Doc.Open();
//Turn our hyperlink blue
iTextSharp.text.Font BlueFont = FontFactory.GetFont("Arial", 12, iTextSharp.text.Font.NORMAL, iTextSharp.text.BaseColor.BLUE);
Doc.Add(new Paragraph(new Chunk("Go to URL", BlueFont).SetAction(new PdfAction("http://www.google.com/", false))));
Doc.Close();
}
}
}
}
private static void UpdatePdfLinks()
{
//Setup some variables to be used later
PdfReader R = default(PdfReader);
int PageCount = 0;
PdfDictionary PageDictionary = default(PdfDictionary);
PdfArray Annots = default(PdfArray);
//Open our reader
R = new PdfReader(BaseFile);
//Get the page cont
PageCount = R.NumberOfPages;
//Loop through each page
for (int i = 1; i <= PageCount; i++)
{
//Get the current page
PageDictionary = R.GetPageN(i);
//Get all of the annotations for the current page
Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
//Make sure we have something
if ((Annots == null) || (Annots.Length == 0))
continue;
//Loop through each annotation
foreach (PdfObject A in Annots.ArrayList)
{
//Convert the itext-specific object as a generic PDF object
PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);
//Make sure this annotation has a link
if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
continue;
//Make sure this annotation has an ACTION
if (AnnotationDictionary.Get(PdfName.A) == null)
continue;
//Get the ACTION for the current annotation
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
//Test if it is a URI action
if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
{
//Change the URI to something else
AnnotationAction.Put(PdfName.URI, new PdfString("http://www.bing.com/"));
}
}
}
//Next we create a new document add import each page from the reader above
using (FileStream FS = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document Doc = new Document())
{
using (PdfCopy writer = new PdfCopy(Doc, FS))
{
Doc.Open();
for (int i = 1; i <= R.NumberOfPages; i++)
{
writer.AddPage(writer.GetImportedPage(R, i));
}
Doc.Close();
}
}
}
}
}
}
EDIT
I should note, this only changes the actual link. Any text within the document won't get updated. Annotations are drawn on top of text but aren't really tied to the text underneath in anyway. That's another topic completely.

Noted if the Action is indirect it will not return a dictionary and you will have an error:
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
In cases of possible indirect dictionaries:
PdfDictionary Action = null;
//Get action directly or by indirect reference
PdfObject obj = Annotation.Get(PdfName.A);
if (obj.IsIndirect) {
Action = PdfReader.GetPdfObject(obj);
} else {
Action = (PdfDictionary)obj;
}
In that case you have to investigate the returned dictionary to figure out where the URI is found. As with an indirect /Launch dictionary the URI is located in the /F item being of type PRIndirectReference with the /Type being a /FileSpec and the URI located in the value of /F

Added code for dealing with indirect and launch actions and null annotation-dictionary:
PdfReader r = new PdfReader(#"d:\kb2\" + f);
for (int i = 1; i <= r.NumberOfPages; i++) {
//Get the current page
var PageDictionary = r.GetPageN(i);
//Get all of the annotations for the current page
var Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
//Make sure we have something
if ((Annots == null) || (Annots.Length == 0))
continue;
foreach (var A in Annots.ArrayList) {
var AnnotationDictionary = PdfReader.GetPdfObject(A) as PdfDictionary;
if (AnnotationDictionary == null)
continue;
//Make sure this annotation has a link
if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
continue;
//Make sure this annotation has an ACTION
if (AnnotationDictionary.Get(PdfName.A) == null)
continue;
var annotActionObject = AnnotationDictionary.Get(PdfName.A);
var AnnotationAction = (PdfDictionary)(annotActionObject.IsIndirect() ? PdfReader.GetPdfObject(annotActionObject) : annotActionObject);
var type = AnnotationAction.Get(PdfName.S);
//Test if it is a URI action
if (type.Equals(PdfName.URI)) {
//Change the URI to something else
string relativeRef = AnnotationAction.GetAsString(PdfName.URI).ToString();
AnnotationAction.Put(PdfName.URI, new PdfString(url));
} else if (type.Equals(PdfName.LAUNCH)) {
//Change the URI to something else
var filespec = AnnotationAction.GetAsDict(PdfName.F);
string url = filespec.GetAsString(PdfName.F).ToString();
AnnotationAction.Put(PdfName.F, new PdfString(url));
}
}
}
//Next we create a new document add import each page from the reader above
using (var output = File.OpenWrite(outputFile.FullName)) {
using (Document Doc = new Document()) {
using (PdfCopy writer = new PdfCopy(Doc, output)) {
Doc.Open();
for (int i = 1; i <= r.NumberOfPages; i++) {
writer.AddPage(writer.GetImportedPage(r, i));
}
Doc.Close();
}
}
}
r.Close();

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Reading PDF content with itextsharp dll in VB.NET or C# - c#

How can I read PDF content with the itextsharp with the Pdfreader class. My PDF may include Plain text or Images of the text.

Related

Using iText and C#, create named destinations at exact same page coordinate as each PDF bookmark

How to write Arabic text using MigraDoc?

C# OpenXML How to Replace \r\n with Break()?

Extract text by line from PDF using iTextSharp c#

using ITextSharp to extract and update links in an existing PDF

Categories

Resources