How do I remove link annotations from a PDF using iText? - c#

Here i want to remove Annotation(Link, Text, ..) from PDF permanently using iTextSharp.
Already i have tried
AnnotationDictionary.Remove(PdfName.LINK);
But that Link annotations exist in that PDF.
Note:
I want remove particular selected Annotations(Link, Text, ..),
For Example i want remove Link Annotation with the URI as www.google.com, remaining Link Annotations i want to be retain as per exist.

I got the answer for my question.
Sample Code:
//Get the current page
PageDictionary = R.GetPageN(i);
//Get all of the annotations for the current page
Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
foreach (PdfObject A in Annots.ArrayList)
{
//code to check the annotation
//remove the annotation
Annots.Remove(int idx);
}

Dim pdfReader As New PdfReader(fileloc)
For i = 1 To pdfReader.NumberOfPages
Dim pageDict As PdfDictionary = pdfReader.GetPageN(i)
Dim annots As PdfArray = pageDict.GetAsArray(PdfName.ANNOTS)
Dim newAnnots As PdfArray = New PdfArray()
If annots IsNot Nothing Then
For j As Integer = 0 To annots.Size() - 1
Dim annotDict As PdfDictionary = annots.GetAsDict(i)
If Not PdfName.LINK.Equals(annotDict.GetAsName(PdfName.SUBTYPE)) Then
newAnnots.Add(annots.GetAsDict(j))
End If
Next
pageDict.Put(PdfName.ANNOTS, newAnnots)
End If
Next
Dim pdfStamper As PdfStamper = Nothing
Dim extension = Path.GetExtension(fileloc)
Dim filename As String = Path.GetFileNameWithoutExtension(fileloc)
Dim filePath As String = Path.GetDirectoryName(fileloc)
fileloc = filePath + "\" + filename + "new" + extension
pdfStamper = New PdfStamper(pdfReader, New FileStream(fileloc, FileMode.Create))
pdfStamper.FormFlattening = False
pdfStamper.Close()
pdfReader.Close()
End Sub```

Related

Generate PDF in using windows form

I want to generate a PDF using windows form in the desktop application. I have readymade pdf design and I just want to feed data from database in that blank section of pdf for each user. (One type of receipt). Please guide me. I have searched but most of the time there is the solution in asp.net for the web application. I want to do in the desktop app. Here is my code I am able to fatch data from database and print in pdf. But main problem is trhat I have already designed pdf and I want to place data exactly at same field (ie name, Amount, date etc.)
using System;
using System.Windows.Forms;
using System.Diagnostics;
using PdfSharp;
using PdfSharp.Drawing;
using PdfSharp.Pdf;
using System.Data.SqlClient;
using System.Data;
using System.Configuration;
namespace printPDF
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click_1(object sender, EventArgs e)
{
try
{
string connetionString = null;
SqlConnection connection ;
SqlCommand command ;
SqlDataAdapter adapter = new SqlDataAdapter();
DataSet ds = new DataSet();
int i = 0;
string sql = null;
int yPoint = 0;
string pubname = null;
string city = null;
string state = null;
connetionString = "Data Source=EEVO-SALMAN\\MY_PC;Initial Catalog=;User ID=s***;Password=******";
// var connectionString = ConfigurationManager.ConnectionStrings["CharityManagement"].ConnectionString;
sql = "select NAME,NAME,uid from tblumaster";
connection = new SqlConnection(connetionString);
connection.Open();
command = new SqlCommand(sql, connection);
adapter.SelectCommand = command;
adapter.Fill(ds);
connection.Close();
PdfDocument pdf = new PdfDocument();
pdf.Info.Title = "Database to PDF";
PdfPage pdfPage = pdf.AddPage();
XGraphics graph = XGraphics.FromPdfPage(pdfPage);
XFont font = new XFont("Verdana", 20, XFontStyle.Regular );
yPoint = yPoint + 100;
for (i = 0; i <=ds.Tables[0].Rows.Count-1; i++)
{
pubname = ds.Tables[0].Rows[i].ItemArray[0].ToString ();
city = ds.Tables[0].Rows[i].ItemArray[1].ToString();
state = ds.Tables[0].Rows[i].ItemArray[2].ToString();
graph.DrawString(pubname, font, XBrushes.Black, new XRect(10, yPoint, pdfPage.Width.Point, pdfPage.Height.Point), XStringFormats.TopLeft);
graph.DrawString(city, font, XBrushes.Black, new XRect(200, yPoint, pdfPage.Width.Point, pdfPage.Height.Point), XStringFormats.TopLeft);
graph.DrawString(state, font, XBrushes.Black, new XRect(400, yPoint, pdfPage.Width.Point, pdfPage.Height.Point), XStringFormats.TopLeft);
yPoint = yPoint + 40;
}
string pdfFilename = "dbtopdf.pdf";
pdf.Save(pdfFilename);
Process.Start(pdfFilename);
}
catch (Exception ex)
{
MessageBox.Show(ex.ToString());
}
}
}
}
Instead of modifying the document, please create a new document and copy the pages from the old document to new document
sample code can be found here,
http://forum.pdfsharp.net/viewtopic.php?p=2637#p2637
Because modifying pdf is not recommended using 'PdfSharp' library. if you still want to edit you can use 'ISharp' library which needs a license.
Here is some VB.Net code I use to fill PDF forms. You need a PDF fillable form with form control names matching the SQL record field names.
It calls a routine Gen.GetDataTable() that just builds a typical DataTable. You could re-code to accept a pre-built Datatable as a parameter. Only the top row is processed. The code can be modified to work with a DataRow (.Table.Columns for column reference) or a DataReader.
Public Function FillPDFFormSQL(pdfMasterPath As String, pdfFinalPath As String, SQL As String, Optional FlattenForm As Boolean = True, Optional PrintPDF As Boolean = False, Optional PrinterName As String = "", Optional AllowMissingFields As Boolean = False) As Boolean
' case matters SQL <-> PDF Form Field Names
Dim pdfFormFields As AcroFields
Dim pdfReader As PdfReader
Dim pdfStamper As PdfStamper
Dim s As String = ""
Try
If pdfFinalPath = "" Then pdfFinalPath = pdfMasterPath.Replace(".pdf", "_Out.pdf")
Dim newFile As String = pdfFinalPath
pdfReader = New PdfReader(pdfMasterPath)
pdfStamper = New PdfStamper(pdfReader, New FileStream(newFile, FileMode.Create))
pdfReader.Close()
pdfFormFields = pdfStamper.AcroFields
Dim dt As DataTable = Gen.GetDataTable(SQL)
For i As Integer = 0 To dt.Columns.Count - 1
s = dt.Columns(i).ColumnName
If AllowMissingFields Then
If pdfFormFields.Fields.ContainsKey(s) Then
pdfFormFields.SetField(s, dt.Rows(0)(i).ToString.Trim)
Else
Debug.WriteLine($"Missing PDF Field: {s}")
End If
Else
pdfFormFields.SetField(s, dt.Rows(0)(i).ToString.Trim)
End If
Next
' flatten the form to remove editing options
' set it to false to leave the form open for subsequent manual edits
If My.Computer.Keyboard.CtrlKeyDown Then
pdfStamper.FormFlattening = False
Else
pdfStamper.FormFlattening = FlattenForm
End If
pdfStamper.Close()
If Not newFile.Contains("""") Then newFile = """" & newFile & """"
If Not PrintPDF Then
Process.Start(newFile)
Else
Dim sPDFProgramPath As String = INI.GetValue("OISForms", "PDFProgramPath", "C:\Program Files (x86)\Foxit Software\Foxit PhantomPDF\FoxitPhantomPDF.exe")
If Not IO.File.Exists(sPDFProgramPath) Then MsgBox("PDF EXE not found:" & vbNewLine & sPDFProgramPath) : Exit Function
If PrinterName.Length > 0 Then
Process.Start(sPDFProgramPath, "/t " & newFile & " " & PrinterName)
Else
Process.Start(sPDFProgramPath, "/p " & newFile)
End If
End If
Return True
Catch ex As Exception
MsgBox(ex.Message)
Return False
Finally
pdfStamper = Nothing
pdfReader = Nothing
End Try
End Function

Insert HTML Text Into OpenOffice Document (.odt) Files

I am Trying to Insert HTML Text Inside Apache Open Office .odt File
I try Statement with Bold as show Below but it is not Working.
Is There I am missing Something ?
XComponentContext oStrap = uno.util.Bootstrap.bootstrap();
XMultiServiceFactory oServMan = (XMultiServiceFactory)oStrap.getServiceManager();
XComponentLoader oDesk = (XComponentLoader)oServMan.createInstance("com.sun.star.frame.Desktop");
string url = #"private:factory/swriter";
PropertyValue[] propVals = new PropertyValue[0];
XComponent oDoc = oDesk.loadComponentFromURL(url, "_blank", 0, propVals);
string docText = "<b>This will</b> be my first paragraph.\n\r";
docText += "This will be my second paragraph.\n\r";
((XTextDocument)oDoc).getText().setString(docText);
string fileName = #"C:\test.odt";
fileName = "file:///" + fileName.Replace(#"\", "/");
((XStorable)oDoc).storeAsURL(fileName, propVals);
((XComponent)oDoc).dispose();
oDoc = null;
Output:
As answered already in the other question - you have to use character properties to get bold (or otherwise attributed) text

Send Multiple Documents To Printer

I have one Requirement in .net wpf.
There are some images in format of .gif or .jpg in a local folder.
I have prepared one List of string , to store file names
i want to send all images in a list to a printer in button click.
I have searched Google but for Print document we can give only one file PrintFileName.
But i want to give each file name in for each loop . any one can explain how is it possible?
Thanks..
question subject is look like wrong...
answer;
var filenames = Directory.EnumerateFiles(#"c:\targetImagePath", "*.*", SearchOption.AllDirectories)
.Where(s => s.EndsWith(".gif") || s.EndsWith(".jpg") || s.EndsWith(".bmp"));
foreach (var filename in filenames)
{
//use filename
}
Private Sub btnPrint_Click(sender As Object, e As RoutedEventArgs) Handles btnPrint.Click
Dim printDialog = New System.Windows.Controls.PrintDialog()
If printDialog.ShowDialog = False Then
Return
End If
Dim fixedDocument = New FixedDocument()
fixedDocument.DocumentPaginator.PageSize = New System.Windows.Size(printDialog.PrintableAreaWidth, printDialog.PrintableAreaHeight)
For Each p In _lablefilenames
Dim page As New FixedPage()
Dim info As System.IO.FileInfo = New FileInfo(p)
'If info.Extension.ToLower = ".gif" Then
' page.Width = fixedDocument.DocumentPaginator.PageSize.Height
' page.Height = fixedDocument.DocumentPaginator.PageSize.Width
'Else
page.Width = fixedDocument.DocumentPaginator.PageSize.Width
page.Height = fixedDocument.DocumentPaginator.PageSize.Height
'End If
Dim img As New System.Windows.Controls.Image()
' PrintIt my project's name, Img folder
'Dim uriImageSource = New Uri(p, UriKind.RelativeOrAbsolute)
'img.Source = New BitmapImage(uriImageSource)
Dim Bimage As New BitmapImage()
Bimage.BeginInit()
Bimage.CacheOption = BitmapCacheOption.OnLoad
Bimage.UriSource = New Uri(p)
If info.Extension.ToLower = ".gif" Then Bimage.Rotation += Rotation.Rotate90
Bimage.EndInit()
'img.Width = 100
'img.Height = 100
img.Source = Bimage
page.Children.Add(img)
Dim pageContent As New PageContent()
DirectCast(pageContent, IAddChild).AddChild(page)
fixedDocument.Pages.Add(pageContent)
Next
' Print me an image please!
printDialog.PrintDocument(fixedDocument.DocumentPaginator, "Print")
End Sub

How can I remove watermark in PDF using itext/itextsharp? [duplicate]

I added a watermark on pdf using Pdfstamper. Here is the code:
for (int pageIndex = 1; pageIndex <= pageCount; pageIndex++)
{
iTextSharp.text.Rectangle pageRectangle = reader.GetPageSizeWithRotation(pageIndex);
PdfContentByte pdfData = stamper.GetUnderContent(pageIndex);
pdfData.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252,
BaseFont.NOT_EMBEDDED), watermarkFontSize);
PdfGState graphicsState = new PdfGState();
graphicsState.FillOpacity = watermarkFontOpacity;
pdfData.SetGState(graphicsState);
pdfData.SetColorFill(iTextSharp.text.BaseColor.BLACK);
pdfData.BeginText();
pdfData.ShowTextAligned(PdfContentByte.ALIGN_CENTER, "LipikaChatterjee",
pageRectangle.Width / 2, pageRectangle.Height / 2, watermarkRotation);
pdfData.EndText();
}
This works fine. Now I want to remove this watermark from my pdf. I looked into iTextSharp but was not able to get any help. I even tried to add watermark as layer and then delete the layer but was not able to delete the content of layer from the pdf. I looked into iText for layer removal and found a class OCGRemover but I was not able to get an equivalent class in iTextsharp.
I'm going to give you the benefit of the doubt based on the statement "I even tried to add watermark as layer" and assume that you are working on content that you are creating and not trying to unwatermark someone else's content.
PDFs use Optional Content Groups (OCG) to store objects as layers. If you add your watermark text to a layer you can fairly easily remove it later.
The code below is a full working C# 2010 WinForms app targeting iTextSharp 5.1.1.0. It uses code based on Bruno's original Java code found here. The code is in three sections. Section 1 creates a sample PDF for us to work with. Section 2 creates a new PDF from the first and applies a watermark to each page on a separate layer. Section 3 creates a final PDF from the second but removes the layer with our watermark text. See the code comments for additional details.
When you create a PdfLayer object you can assign it a name to appear within a PDF reader. Unfortunately I can't find a way to access this name so the code below looks for the actual watermark text within the layer. If you aren't using additional PDF layers I would recommend only looking for /OC within the content stream and not wasting time looking for your actual watermark text. If you find a way to look for /OC groups by name please let me kwow!
using System;
using System.Windows.Forms;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
namespace WindowsFormsApplication1 {
public partial class Form1 : Form {
public Form1() {
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e) {
string workingFolder = Environment.GetFolderPath(Environment.SpecialFolder.Desktop);
string startFile = Path.Combine(workingFolder, "StartFile.pdf");
string watermarkedFile = Path.Combine(workingFolder, "Watermarked.pdf");
string unwatermarkedFile = Path.Combine(workingFolder, "Un-watermarked.pdf");
string watermarkText = "This is a test";
//SECTION 1
//Create a 5 page PDF, nothing special here
using (FileStream fs = new FileStream(startFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (Document doc = new Document(PageSize.LETTER)) {
using (PdfWriter witier = PdfWriter.GetInstance(doc, fs)) {
doc.Open();
for (int i = 1; i <= 5; i++) {
doc.NewPage();
doc.Add(new Paragraph(String.Format("This is page {0}", i)));
}
doc.Close();
}
}
}
//SECTION 2
//Create our watermark on a separate layer. The only different here is that we are adding the watermark to a PdfLayer which is an OCG or Optional Content Group
PdfReader reader1 = new PdfReader(startFile);
using (FileStream fs = new FileStream(watermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (PdfStamper stamper = new PdfStamper(reader1, fs)) {
int pageCount1 = reader1.NumberOfPages;
//Create a new layer
PdfLayer layer = new PdfLayer("WatermarkLayer", stamper.Writer);
for (int i = 1; i <= pageCount1; i++) {
iTextSharp.text.Rectangle rect = reader1.GetPageSize(i);
//Get the ContentByte object
PdfContentByte cb = stamper.GetUnderContent(i);
//Tell the CB that the next commands should be "bound" to this new layer
cb.BeginLayer(layer);
cb.SetFontAndSize(BaseFont.CreateFont(BaseFont.HELVETICA, BaseFont.CP1252, BaseFont.NOT_EMBEDDED), 50);
PdfGState gState = new PdfGState();
gState.FillOpacity = 0.25f;
cb.SetGState(gState);
cb.SetColorFill(BaseColor.BLACK);
cb.BeginText();
cb.ShowTextAligned(PdfContentByte.ALIGN_CENTER, watermarkText, rect.Width / 2, rect.Height / 2, 45f);
cb.EndText();
//"Close" the layer
cb.EndLayer();
}
}
}
//SECTION 3
//Remove the layer created above
//First we bind a reader to the watermarked file, then strip out a bunch of things, and finally use a simple stamper to write out the edited reader
PdfReader reader2 = new PdfReader(watermarkedFile);
//NOTE, This will destroy all layers in the document, only use if you don't have additional layers
//Remove the OCG group completely from the document.
//reader2.Catalog.Remove(PdfName.OCPROPERTIES);
//Clean up the reader, optional
reader2.RemoveUnusedObjects();
//Placeholder variables
PRStream stream;
String content;
PdfDictionary page;
PdfArray contentarray;
//Get the page count
int pageCount2 = reader2.NumberOfPages;
//Loop through each page
for (int i = 1; i <= pageCount2; i++) {
//Get the page
page = reader2.GetPageN(i);
//Get the raw content
contentarray = page.GetAsArray(PdfName.CONTENTS);
if (contentarray != null) {
//Loop through content
for (int j = 0; j < contentarray.Size; j++) {
//Get the raw byte stream
stream = (PRStream)contentarray.GetAsStream(j);
//Convert to a string. NOTE, you might need a different encoding here
content = System.Text.Encoding.ASCII.GetString(PdfReader.GetStreamBytes(stream));
//Look for the OCG token in the stream as well as our watermarked text
if (content.IndexOf("/OC") >= 0 && content.IndexOf(watermarkText) >= 0) {
//Remove it by giving it zero length and zero data
stream.Put(PdfName.LENGTH, new PdfNumber(0));
stream.SetData(new byte[0]);
}
}
}
}
//Write the content out
using (FileStream fs = new FileStream(unwatermarkedFile, FileMode.Create, FileAccess.Write, FileShare.None)) {
using (PdfStamper stamper = new PdfStamper(reader2, fs)) {
}
}
this.Close();
}
}
}
As an extension to Chris's answer, a VB.Net class for removing a layer is included at the bottom of this post which should be a bit more precise.
It goes through the PDF's list of layers (stored in the OCGs array in the OCProperties dictionary in the file's catalog). This array contains indirect references to objects in the PDF file which contain the name
It goes through the properties of the page (also stored in a dictionary) to find the properties which point to the layer objects (via indirect references)
It does an actual parse of the content stream to find instances of the pattern /OC /{PagePropertyReference} BDC {Actual Content} EMC so it can remove just these segments as appropriate
The code then cleans up all the references as much as it can. Calling the code might work as shown:
Public Shared Sub RemoveWatermark(path As String, savePath As String)
Using reader = New PdfReader(path)
Using fs As New FileStream(savePath, FileMode.Create, FileAccess.Write, FileShare.None)
Using stamper As New PdfStamper(reader, fs)
Using remover As New PdfLayerRemover(reader)
remover.RemoveByName("WatermarkLayer")
End Using
End Using
End Using
End Using
End Sub
Full class:
Imports iTextSharp.text
Imports iTextSharp.text.io
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser
Public Class PdfLayerRemover
Implements IDisposable
Private _reader As PdfReader
Private _layerNames As New List(Of String)
Public Sub New(reader As PdfReader)
_reader = reader
End Sub
Public Sub RemoveByName(name As String)
_layerNames.Add(name)
End Sub
Private Sub RemoveLayers()
Dim ocProps = _reader.Catalog.GetAsDict(PdfName.OCPROPERTIES)
If ocProps Is Nothing Then Return
Dim ocgs = ocProps.GetAsArray(PdfName.OCGS)
If ocgs Is Nothing Then Return
'Get a list of indirect references to the layer information
Dim layerRefs = (From l In (From i In ocgs
Select Obj = DirectCast(PdfReader.GetPdfObject(i), PdfDictionary),
Ref = DirectCast(i, PdfIndirectReference))
Where _layerNames.Contains(l.Obj.GetAsString(PdfName.NAME).ToString)
Select l.Ref).ToList
'Get a list of numbers for these layer references
Dim layerRefNumbers = (From l In layerRefs Select l.Number).ToList
'Loop through the pages
Dim page As PdfDictionary
Dim propsToRemove As IEnumerable(Of PdfName)
For i As Integer = 1 To _reader.NumberOfPages
'Get the page
page = _reader.GetPageN(i)
'Get the page properties which reference the layers to remove
Dim props = _reader.GetPageResources(i).GetAsDict(PdfName.PROPERTIES)
propsToRemove = (From k In props.Keys Where layerRefNumbers.Contains(props.GetAsIndirectObject(k).Number) Select k).ToList
'Get the raw content
Dim contentarray = page.GetAsArray(PdfName.CONTENTS)
If contentarray IsNot Nothing Then
For j As Integer = 0 To contentarray.Size - 1
'Parse the stream data looking for references to a property pointing to the layer.
Dim stream = DirectCast(contentarray.GetAsStream(j), PRStream)
Dim streamData = PdfReader.GetStreamBytes(stream)
Dim newData = GetNewStream(streamData, (From p In propsToRemove Select p.ToString.Substring(1)))
'Store data without the stream references in the stream
If newData.Length <> streamData.Length Then
stream.SetData(newData)
stream.Put(PdfName.LENGTH, New PdfNumber(newData.Length))
End If
Next
End If
'Remove the properties from the page data
For Each prop In propsToRemove
props.Remove(prop)
Next
Next
'Remove references to the layer in the master catalog
RemoveIndirectReferences(ocProps, layerRefNumbers)
'Clean up unused objects
_reader.RemoveUnusedObjects()
End Sub
Private Shared Function GetNewStream(data As Byte(), propsToRemove As IEnumerable(Of String)) As Byte()
Dim item As PdfLayer = Nothing
Dim positions As New List(Of Integer)
positions.Add(0)
Dim pos As Integer
Dim inGroup As Boolean = False
Dim tokenizer As New PRTokeniser(New RandomAccessFileOrArray(New RandomAccessSourceFactory().CreateSource(data)))
While tokenizer.NextToken
If tokenizer.TokenType = PRTokeniser.TokType.NAME AndAlso tokenizer.StringValue = "OC" Then
pos = CInt(tokenizer.FilePointer - 3)
If tokenizer.NextToken() AndAlso tokenizer.TokenType = PRTokeniser.TokType.NAME Then
If Not inGroup AndAlso propsToRemove.Contains(tokenizer.StringValue) Then
inGroup = True
positions.Add(pos)
End If
End If
ElseIf tokenizer.TokenType = PRTokeniser.TokType.OTHER AndAlso tokenizer.StringValue = "EMC" AndAlso inGroup Then
positions.Add(CInt(tokenizer.FilePointer))
inGroup = False
End If
End While
positions.Add(data.Length)
If positions.Count > 2 Then
Dim length As Integer = 0
For i As Integer = 0 To positions.Count - 1 Step 2
length += positions(i + 1) - positions(i)
Next
Dim newData(length) As Byte
length = 0
For i As Integer = 0 To positions.Count - 1 Step 2
Array.Copy(data, positions(i), newData, length, positions(i + 1) - positions(i))
length += positions(i + 1) - positions(i)
Next
Dim origStr = System.Text.Encoding.UTF8.GetString(data)
Dim newStr = System.Text.Encoding.UTF8.GetString(newData)
Return newData
Else
Return data
End If
End Function
Private Shared Sub RemoveIndirectReferences(dict As PdfDictionary, refNumbers As IEnumerable(Of Integer))
Dim newDict As PdfDictionary
Dim arrayData As PdfArray
Dim indirect As PdfIndirectReference
Dim i As Integer
For Each key In dict.Keys
newDict = dict.GetAsDict(key)
arrayData = dict.GetAsArray(key)
If newDict IsNot Nothing Then
RemoveIndirectReferences(newDict, refNumbers)
ElseIf arrayData IsNot Nothing Then
i = 0
While i < arrayData.Size
indirect = arrayData.GetAsIndirectObject(i)
If refNumbers.Contains(indirect.Number) Then
arrayData.Remove(i)
Else
i += 1
End If
End While
End If
Next
End Sub
#Region "IDisposable Support"
Private disposedValue As Boolean ' To detect redundant calls
' IDisposable
Protected Overridable Sub Dispose(disposing As Boolean)
If Not Me.disposedValue Then
If disposing Then
RemoveLayers()
End If
' TODO: free unmanaged resources (unmanaged objects) and override Finalize() below.
' TODO: set large fields to null.
End If
Me.disposedValue = True
End Sub
' TODO: override Finalize() only if Dispose(ByVal disposing As Boolean) above has code to free unmanaged resources.
'Protected Overrides Sub Finalize()
' ' Do not change this code. Put cleanup code in Dispose(ByVal disposing As Boolean) above.
' Dispose(False)
' MyBase.Finalize()
'End Sub
' This code added by Visual Basic to correctly implement the disposable pattern.
Public Sub Dispose() Implements IDisposable.Dispose
' Do not change this code. Put cleanup code in Dispose(ByVal disposing As Boolean) above.
Dispose(True)
GC.SuppressFinalize(Me)
End Sub
#End Region
End Class

Reading PDF content with itextsharp dll in VB.NET or C#

How can I read PDF content with the itextsharp with the Pdfreader class. My PDF may include Plain text or Images of the text.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.IO;
public string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
LGPL / FOSS iTextSharp 4.x
var pdfReader = new PdfReader(path); //other filestream etc
byte[] pageContent = _pdfReader .GetPageContent(pageNum); //not zero based
byte[] utf8 = Encoding.Convert(Encoding.Default, Encoding.UTF8, pageContent);
string textFromPage = Encoding.UTF8.GetString(utf8);
None of the other answers were useful to me, they all seem to target the AGPL v5 of iTextSharp. I could never find any reference to SimpleTextExtractionStrategy or LocationTextExtractionStrategy in the FOSS version.
Something else that might be very useful in conjunction with this:
const string PdfTableFormat = #"\(.*\)Tj";
Regex PdfTableRegex = new Regex(PdfTableFormat, RegexOptions.Compiled);
List<string> ExtractPdfContent(string rawPdfContent)
{
var matches = PdfTableRegex.Matches(rawPdfContent);
var list = matches.Cast<Match>()
.Select(m => m.Value
.Substring(1) //remove leading (
.Remove(m.Value.Length - 4) //remove trailing )Tj
.Replace(#"\)", ")") //unencode parens
.Replace(#"\(", "(")
.Trim()
)
.ToList();
return list;
}
This will extract the text-only data from the PDF if the text displayed is Foo(bar) it will be encoded in the PDF as (Foo\(bar\))Tj, this method would return Foo(bar) as expected. This method will strip out lots of additional information such as location coordinates from the raw pdf content.
Here is a VB.NET solution based on ShravankumarKumar's solution.
This will ONLY give you the text. The images are a different story.
Public Shared Function GetTextFromPDF(PdfFileName As String) As String
Dim oReader As New iTextSharp.text.pdf.PdfReader(PdfFileName)
Dim sOut = ""
For i = 1 To oReader.NumberOfPages
Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
sOut &= iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)
Next
Return sOut
End Function
In my case, I just wanted the text from a specific area of the PDF document so I used a rectangle around the area and extracted the text from it. In the sample below the coordinates are for the entire page. I don't have PDF authoring tools so when it came time to narrow down the rectangle to the specific location I took a few guesses at the coordinates until the area was found.
Rectangle _pdfRect = new Rectangle(0f, 0f, 612f, 792f); // Entire page - PDF coordinate system 0,0 is bottom left corner. 72 points / inch
RenderFilter _renderfilter = new RegionTextRenderFilter(_pdfRect);
ITextExtractionStrategy _strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), _filter);
string _text = PdfTextExtractor.GetTextFromPage(_pdfReader, 1, _strategy);
As noted by the above comments the resulting text doesn't maintain any of the formatting found in the PDF document, however, I was happy that it did preserve the carriage returns. In my case, there were enough constants in the text that I was able to extract the values that I required.
Here an improved answer of ShravankumarKumar. I created special classes for the pages so you can access words in the pdf based on the text rows and the word in that row.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
//create a list of pdf pages
var pages = new List<PdfPage>();
//load the pdf into the reader. NOTE: path can also be replaced with a byte array
using (PdfReader reader = new PdfReader(path))
{
//loop all the pages and extract the text
for (int i = 1; i <= reader.NumberOfPages; i++)
{
pages.Add(new PdfPage()
{
content = PdfTextExtractor.GetTextFromPage(reader, i)
});
}
}
//use linq to create the rows and words by splitting on newline and space
pages.ForEach(x => x.rows = x.content.Split('\n').Select(y =>
new PdfRow() {
content = y,
words = y.Split(' ').ToList()
}
).ToList());
The custom classes
class PdfPage
{
public string content { get; set; }
public List<PdfRow> rows { get; set; }
}
class PdfRow
{
public string content { get; set; }
public List<string> words { get; set; }
}
Now you can get a word by row and word index.
string myWord = pages[0].rows[12].words[4];
Or use Linq to find the rows containing a specific word.
//find the rows in a specific page containing a word
var myRows = pages[0].rows.Where(x => x.words.Any(y => y == "myWord1")).ToList();
//find the rows in all pages containing a word
var myRows = pages.SelectMany(r => r.rows).Where(x => x.words.Any(y => y == "myWord2")).ToList();
Public Sub PDFTxtToPdf(ByVal sTxtfile As String, ByVal sPDFSourcefile As String)
Dim sr As StreamReader = New StreamReader(sTxtfile)
Dim doc As New Document()
PdfWriter.GetInstance(doc, New FileStream(sPDFSourcefile, FileMode.Create))
doc.Open()
doc.Add(New Paragraph(sr.ReadToEnd()))
doc.Close()
End Sub

Categories