ITextSharp 4.1.6 extract PDF content as text

Our company wants to use iTextSharp 4.1.6 specifically and does not want to buy a license for version 5 or 7.
We had already implemented text extraction from PDFs using iTextSharp 5. After we downgraded, that method is not available in the 4.1.6 LGPL version.
I searched Stack Overflow and other sites for an answer, but found no custom implementation other than the call below, which exists only in the AGPL version.
PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy())
byte[] pageContent = reader.GetPageContent(i); returns the raw page content as bytes, but converting those bytes to a string does not give us the exact text of the file.
Since we do not wish to buy a license for the AGPL version and still need to implement text extraction from PDFs, does any other tool support this, or does anybody have an implementation of a text extractor?
Any suggestions would be greatly appreciated.
Edit: Reference for @jgoday's answer:

With iText 4.1 you can use PdfContentParser (https://github.com/schourode/iTextSharp-LGPL/blob/f75cdad88236d502af42458a420d48be2a47008f/src/core/iTextSharp/text/pdf/PdfContentParser.cs) to parse the contents of each page.
using System;
using System.Text;
using iTextSharp.text.pdf;
namespace PdfExtractor
{
    class Program
    {
        static void Main(string[] args)
        {
            var reader = new PdfReader(@"D:\Tmp\sample.pdf");
            try
            {
                // Tokenise the raw content stream of page 2 (page numbers are 1-based).
                var parser = new PdfContentParser(new PRTokeniser(reader.GetPageContent(2)));
                var sb = new StringBuilder();
                while (parser.Tokeniser.NextToken())
                {
                    // Keep only string tokens; operators and numeric operands are skipped.
                    if (parser.Tokeniser.TokenType == PRTokeniser.TK_STRING)
                    {
                        string str = parser.Tokeniser.StringValue;
                        sb.Append(str);
                    }
                }
                Console.WriteLine(sb.ToString());
            }
            finally
            {
                reader.Close();
            }
        }
    }
}
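
To illustrate why the raw bytes from GetPageContent are not the page text, here is a minimal, library-free sketch. A page content stream interleaves drawing operators with their operands, and the visible text sits in the string operands of text-showing operators such as Tj. The sample stream, the class name ContentStreamDemo and the method ExtractShownStrings are made up for this illustration; the regular expression is a simplification that ignores TJ arrays, hex strings and nested parentheses. The extracted strings are bytes in the font's encoding, so with subset or CID fonts they may not be readable text at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class ContentStreamDemo
{
    // A literal string "( ... )" with backslash escapes, followed by the Tj operator.
    // Simplification: TJ arrays, hex strings and nested parentheses are not handled.
    private static readonly Regex ShowText =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj");

    public static List<string> ExtractShownStrings(string contentStream)
    {
        var result = new List<string>();
        foreach (Match m in ShowText.Matches(contentStream))
        {
            result.Add(Unescape(m.Groups[1].Value));
        }
        return result;
    }

    // Drops the backslash in front of an escaped character such as \( \) or \\.
    // Other escapes (\n, octal codes) are not decoded in this sketch.
    private static string Unescape(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == '\\' && i + 1 < s.Length)
            {
                i++;
            }
            sb.Append(s[i]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Hypothetical content stream: font selection, positioning, two text-showing calls.
        string sample = "BT /F1 12 Tf 72 720 Td (Invoice \\(draft\\)) Tj 0 -14 Td (Total: 42) Tj ET";
        foreach (string text in ExtractShownStrings(sample))
        {
            Console.WriteLine(text);
        }
        // Prints:
        // Invoice (draft)
        // Total: 42
    }
}
```

The tokeniser loop in the answer above does the same job more robustly, but it shares the main limitation: it concatenates string operands without looking at positioning operators, so spaces and line breaks between text fragments are lost.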

Related

Rewriting simple iTextSharp read page function using iText7 library

I've been using the iTextSharp library in my C# .NET program to read PDF files. A simplified version of my program uses code like the following procedure to get the text for a specified page number in a PDF File.
using iTextSharp.text.pdf; // needed for PdfReader
using iTextSharp.text.pdf.parser;
using System.Text;
namespace PDFReaderITextSharp
{
    public static class PdfHelper
    {
        public static string GetText(string fileName, int pageNumber)
        {
            PdfReader pdfReader = new PdfReader(fileName);
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            return currentText;
        }
    }
}
Now I'd like to write a similar function that does the same thing using the iText7 library in a .NET Core 3 program, namely return the text for a specified PDF page. However, when I compare the returned text with the text returned by the function above, and with what Adobe Reader displays, I see document text repeated multiple times.
For example, when I look at the PDF file in Adobe, there are several fields displayed in the form of "Caption: Value". For example, Invoice #: 1234 Invoice Date: 12/32/2019. But the text returned using the iText7 library returns "Invoice#Invoice #: 1234 Invoice DateInvoice Date: 12/32/2019" (the labels are duplicated 2 or more times.)
I wish I could upload the PDF document to help.
Is there something wrong with the iText7 function? What might the iText7 library be doing?
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
namespace PDFReader.Helpers
{
    public static class PdfHelperIText7
    {
        public static string GetPdfPageText(string pdfFilePath, int pageNumber)
        {
            using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfFilePath)))
            {
                FilteredEventListener listener = new FilteredEventListener();
                LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy());
                PdfCanvasProcessor pdfCanvasProcessor = new PdfCanvasProcessor(listener);
                listener.AttachEventListener(extractionStrategy);
                pdfCanvasProcessor.ProcessPageContent(pdfDocument.GetPage(pageNumber));
                string actualText = extractionStrategy.GetResultantText();
                pdfDocument.Close();
                return actualText;
            }
        }
    }
}

Trying To Extract Embedded File Attachments From Existing PDF Using C# .NET And PDFBox 1.7.0

I am trying to extract embedded file attachments from an existing PDF using C# .NET and PDFBox.
The following is my code:
using System.Collections.Generic;
using System.IO;
using java.util; // IKVM Java for Microsoft .NET http://www.ikvm.net
using java.io; // IKVM Java for Microsoft .NET http://www.ikvm.net
using org.apache.pdfbox.pdmodel; // PDFBox 1.7.0 http://pdfbox.apache.org
using org.apache.pdfbox.pdmodel.common; // PDFBox 1.7.0 http://pdfbox.apache.org
using org.apache.pdfbox.pdmodel.common.filespecification; // PDFBox 1.7.0 http://pdfbox.apache.org
using org.apache.pdfbox.cos; // PDFBox 1.7.0 http://pdfbox.apache.org
namespace PDFClass
{
    public class Class1
    {
        public Class1 ()
        {
        }
        public void ReadPDFAttachments (string existingFileNameFullPath)
        {
            PDEmbeddedFilesNameTreeNode efTree;
            PDComplexFileSpecification fs;
            FileStream stream;
            ByteArrayInputStream fakeFile;
            PDDocument pdfDocument = new PDDocument();
            PDEmbeddedFile ef;
            PDDocumentNameDictionary names;
            Map efMap = new HashMap();
            pdfDocument = PDDocument.load(existingFileNameFullPath);
            PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDocument.getDocumentCatalog());
            PDEmbeddedFilesNameTreeNode embeddedFiles = namesDictionary.getEmbeddedFiles(); // some bug is currently preventing this call from working! >:[
            if (embeddedFiles != null)
            {
                var aKids = embeddedFiles.getKids().toArray();
                List<PDNameTreeNode> kids = new List<PDNameTreeNode>();
                foreach (object oKid in aKids)
                {
                    kids.Add(oKid as PDNameTreeNode);
                }
                if (kids != null)
                {
                    foreach (PDNameTreeNode kid in kids)
                    {
                        PDComplexFileSpecification spec = (PDComplexFileSpecification)kid.getValue("ZUGFERD_XML_FILENAME");
                        PDEmbeddedFile file = spec.getEmbeddedFile();
                        fs = new PDComplexFileSpecification();
                        // Loop through each file for re-embedding
                        byte[] data = file.getByteArray();
                        int read = data.Length;
                        fakeFile = new ByteArrayInputStream(data);
                        ef = new PDEmbeddedFile(pdfDocument, fakeFile);
                        fs.setEmbeddedFile(ef);
                        efMap.put(kid.toString(), fs);
                        embeddedFiles.setNames(efMap);
                        names = new PDDocumentNameDictionary(pdfDocument.getDocumentCatalog());
                        ((COSDictionary)efTree.getCOSObject()).removeItem(COSName.LIMITS); // Bug in PDFBox code requires we do this, or attachment will not embed. >:[
                        names.setEmbeddedFiles(embeddedFiles);
                        pdfDocument.getDocumentCatalog().setNames(names);
                        fs.getCOSDictionary().setString("Desc", kid.toString()); // adds a description to attachment in PDF attachment list
                    }
                }
            }
        }
    }
}
The variable embeddedFiles is always null, even though I set a breakpoint in the code and can see that the PDF file clearly has the attachment in it.
Any assistance would be greatly appreciated!

How to read PDF bookmarks programmatically

I'm using a PDF converter to access the graphical data within a PDF. Everything works fine, except that I don't get a list of the bookmarks. Is there a command-line app or a C# component that can read a PDF's bookmarks? I found the iText and SharpPDF libraries and I'm currently looking through them. Have you ever done such a thing?

Try the following code
PdfReader pdfReader = new PdfReader(filename);
IList<Dictionary<string, object>> bookmarks = SimpleBookmark.GetBookmark(pdfReader);
for (int i = 0; i < bookmarks.Count; i++)
{
    MessageBox.Show(bookmarks[i].Values.ToArray().GetValue(0).ToString());
    if (bookmarks[i].Count > 3)
    {
        MessageBox.Show(bookmarks[i].ToList().Count.ToString());
    }
}
Note: Don't forget to add iTextSharp DLL to your project.

As the bookmarks are in a tree structure (https://en.wikipedia.org/wiki/Tree_(data_structure)),
I've used some recursion here to collect all bookmarks and their children.
iTextSharp solved it for me.
dotnet add package iTextSharp
Collected all bookmarks with the following code:
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
namespace PdfManipulation
{
    class Program
    {
        static void Main(string[] args)
        {
            StringBuilder bookmarks = ExtractAllBookmarks("myPdfFile.pdf");
        }
        private static StringBuilder ExtractAllBookmarks(string pdf)
        {
            StringBuilder sb = new StringBuilder();
            PdfReader reader = new PdfReader(pdf);
            IList<Dictionary<string, object>> bookmarksTree = SimpleBookmark.GetBookmark(reader);
            foreach (var node in bookmarksTree)
            {
                sb.AppendLine(PercorreBookmarks(node).ToString());
            }
            return RemoveAllBlankLines(sb);
        }
        private static StringBuilder RemoveAllBlankLines(StringBuilder sb)
        {
            return new StringBuilder().Append(Regex.Replace(sb.ToString(), @"^\s+$[\r\n]*", string.Empty, RegexOptions.Multiline));
        }
        // Recursively collects the title of a bookmark and of all its descendants.
        private static StringBuilder PercorreBookmarks(Dictionary<string, object> bookmark)
        {
            StringBuilder sb = new StringBuilder();
            // Check for null before reading the title, not after.
            if (bookmark == null)
            {
                return sb;
            }
            sb.AppendLine(bookmark["Title"].ToString());
            if (bookmark.ContainsKey("Kids"))
            {
                IList<Dictionary<string, object>> children = (IList<Dictionary<string, object>>) bookmark["Kids"];
                foreach (var bm in children)
                {
                    sb.AppendLine(PercorreBookmarks(bm).ToString());
                }
            }
            return sb;
        }
    }
}
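
The recursion above flattens the tree into a plain list of titles. If the nesting level matters, one possible variation is to pass the depth down and indent each title by it. The sketch below is library-free and runs on a hand-built tree that only mimics the "Title" and "Kids" keys used in the answer; the class BookmarkOutline, its methods and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BookmarkOutline
{
    // Renders the bookmark tree as text, indenting each level by two spaces.
    public static string Render(IList<Dictionary<string, object>> roots)
    {
        var sb = new StringBuilder();
        Append(roots, 0, sb);
        return sb.ToString();
    }

    private static void Append(IList<Dictionary<string, object>> nodes, int depth, StringBuilder sb)
    {
        foreach (var node in nodes)
        {
            object title;
            if (node.TryGetValue("Title", out title) && title != null)
            {
                sb.Append(' ', depth * 2).Append(title).Append('\n');
            }

            // Descend only when "Kids" is present and has the expected shape.
            object kids;
            if (node.TryGetValue("Kids", out kids) && kids is IList<Dictionary<string, object>>)
            {
                Append((IList<Dictionary<string, object>>)kids, depth + 1, sb);
            }
        }
    }

    public static void Main()
    {
        // Hypothetical tree in the same shape the answer above reads from.
        var tree = new List<Dictionary<string, object>>
        {
            new Dictionary<string, object>
            {
                { "Title", "Chapter 1" },
                { "Kids", new List<Dictionary<string, object>>
                    {
                        new Dictionary<string, object> { { "Title", "Section 1.1" } }
                    }
                }
            },
            new Dictionary<string, object> { { "Title", "Chapter 2" } }
        };
        Console.Write(Render(tree));
        // Prints:
        // Chapter 1
        //   Section 1.1
        // Chapter 2
    }
}
```

Writing into a single StringBuilder also avoids the extra blank lines that the nested AppendLine calls produce, so no regular expression clean-up pass is needed.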

You can use the PDFsharp library. It is published under the MIT License, so it can be used even in corporate development. Here is an untested example.
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;
using (PdfDocument document = PdfReader.Open("bookmarked.pdf", PdfDocumentOpenMode.Import))
{
    PdfDictionary outline = document.Internals.Catalog.Elements.GetDictionary("/Outlines");
    PrintBookmark(outline);
}
void PrintBookmark(PdfDictionary bookmark)
{
    // The /Outlines root itself normally carries no /Title; its children do.
    Console.WriteLine(bookmark.Elements.GetString("/Title"));
    // Children form a linked list: /First points to the first child, /Next to each sibling.
    for (PdfDictionary child = bookmark.Elements.GetDictionary("/First"); child != null; child = child.Elements.GetDictionary("/Next"))
    {
        PrintBookmark(child);
    }
}
Gotchas:
PDFsharp does not handle opening PDFs above version 1.6 very well. (It throws: cannot handle iref streams. the current implementation of pdfsharp cannot handle this pdf feature introduced with acrobat 6)
There are many types of strings in PDFs, and PDFsharp returns them as is, including UTF-16BE strings. (7.9.2.1 ISO32000 2008)
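
For the second gotcha, a minimal sketch of decoding such a string by hand may help. It assumes you already have the raw bytes of the title (how to get them out of PDFsharp is not shown here) and relies on the PDF convention that a UTF-16BE text string starts with the byte order mark FE FF. The class name PdfTextString and the method Decode are hypothetical, and the Latin-1 fallback is only an approximation of PDFDocEncoding, which differs from it for a number of code points.

```csharp
using System;
using System.Text;

public static class PdfTextString
{
    // Decodes the raw bytes of a PDF text string.
    public static string Decode(byte[] raw)
    {
        // A leading FE FF byte order mark signals UTF-16BE.
        if (raw.Length >= 2 && raw[0] == 0xFE && raw[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode.GetString(raw, 2, raw.Length - 2);
        }

        // Assumption: Latin-1 as a stand-in for PDFDocEncoding.
        return Encoding.GetEncoding("iso-8859-1").GetString(raw);
    }

    public static void Main()
    {
        byte[] utf16Title = { 0xFE, 0xFF, 0x00, 0x43, 0x00, 0x68 };
        byte[] plainTitle = { 0x43, 0x68 };
        Console.WriteLine(Decode(utf16Title)); // Ch
        Console.WriteLine(Decode(plainTitle)); // Ch
    }
}
```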

You might try Docotic.Pdf library for the task if you are fine with a commercial solution.
Here is sample code that lists all top-level bookmark items with some of their properties.
using (PdfDocument doc = new PdfDocument("file.pdf"))
{
    PdfOutlineItem root = doc.OutlineRoot;
    foreach (PdfOutlineItem item in root.Children)
    {
        Console.WriteLine("{0} ({1} child nodes, points to page {2})",
            item.Title, item.ChildCount, item.PageIndex);
    }
}
PdfOutlineItem class also provides properties related to outline item styles and more.
Disclaimer: I work for the vendor of the library.

If a commercial library is an option for you, you could give Amyuni PDF Creator .Net a try.
Use the class Amyuni.PDFCreator.IacDocument.RootBookmark to retrieve the root of the bookmarks' tree, then the properties in IacBookmark to access each tree element, to navigate through the tree, and to add, edit or remove elements if needed.
Usual disclaimer applies

Convert HTML or PDF to RTF/DOC or HTML/PDF to image using DevExpress or Infragistics

Is there a way to convert HTML or PDF to RTF/DOC, or HTML/PDF to an image, using DevExpress or Infragistics?
I tried this using DevExpress:
string html = new StreamReader(Server.MapPath(@".\teste.htm")).ReadToEnd();
RichEditControl richEditControl = new RichEditControl();
string rtf;
try
{
    richEditControl.HtmlText = html;
    rtf = richEditControl.RtfText;
}
finally
{
    richEditControl.Dispose();
}
StreamWriter sw = new StreamWriter(@"D:\teste.rtf");
sw.Write(rtf);
sw.Close();
But I have a complex html content (tables, backgrounds, css etc) and the final result is not good...

To convert Html content into image or Pdf you may use the following code:
using (RichEditControl richEditControl = new RichEditControl()) {
    richEditControl.LoadDocument(Server.MapPath(@".\teste.htm"), DocumentFormat.Html);
    using (PrintingSystem ps = new PrintingSystem()) {
        PrintableComponentLink pcl = new PrintableComponentLink(ps);
        pcl.Component = richEditControl;
        pcl.CreateDocument();
        //pcl.PrintingSystem.ExportToPdf("teste.pdf");
        pcl.PrintingSystem.ExportToImage("teste.jpg", System.Drawing.Imaging.ImageFormat.Jpeg);
    }
}

I suggest using the latest DevExpress version (10.1.5 at the time of writing). It handles tables much better than previous ones.
Please use the following code to avoid encoding issues (StreamReader and StreamWriter as used in your sample default to UTF-8, which will corrupt any content stored with another encoding):
using (RichEditControl richEditControl = new RichEditControl()) {
    richEditControl.LoadDocument(Server.MapPath(@".\teste.htm"), DocumentFormat.Html);
    richEditControl.SaveDocument(@"D:\teste.rtf", DocumentFormat.Rtf);
}
Also take a look at the richEditControl.Options.Import.Html and richEditControl.Options.Export.Rtf properties, you may find them useful for some cases.
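
The encoding problem mentioned above can be reproduced without DevExpress. The following sketch writes a small HTML fragment as ISO-8859-1 bytes and reads it back twice: once with the default reader, which assumes UTF-8 when there is no byte order mark, and once with the encoding the file was actually written in. The class EncodingDemo, the helper ReadBack and the sample text are made up for this illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingDemo
{
    // Writes the bytes to a temporary file and reads them back as text.
    // Passing null for encoding uses the default behaviour of File.ReadAllText.
    public static string ReadBack(byte[] fileBytes, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, fileBytes);
            return encoding == null
                ? File.ReadAllText(path)
                : File.ReadAllText(path, encoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "<p>caf\u00E9</p>";
        byte[] bytes = latin1.GetBytes(original);

        // The single byte 0xE9 is not valid UTF-8, so the default read replaces it.
        Console.WriteLine(ReadBack(bytes, null) == original);   // False
        Console.WriteLine(ReadBack(bytes, latin1) == original); // True
    }
}
```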

Combine PDFs c#

How can I combine multiple PDFs into one PDF without a 3rd party component?

I don't think you can.
Opensource component PDFSharp has that functionality, and a nice source code sample on file combining

The .NET Framework does not contain the ability to modify/create PDFs. You need a 3rd party component to accomplish what you are looking for.

As others have said, there is nothing built in to do that task. Use iTextSharp with this example code.

AFAIK C# has no built-in support for handling PDF so what you are asking can not be done without using a 3rd party component or a COTS library.
Regarding libraries there is a myriad of possibilities. Just to point a few:
http://csharp-source.net/open-source/pdf-libraries
http://www.codeproject.com/KB/graphics/giospdfnetlibrary.aspx
http://www.pdftron.com/net/index.html

I don't think the .NET Framework contains such libraries. I used iTextSharp with C# to combine PDF files, and I think it is the easiest way to do this. Here is the code I used.
string[] lstFiles = new string[3];
lstFiles[0] = @"C:/pdf/1.pdf";
lstFiles[1] = @"C:/pdf/2.pdf";
lstFiles[2] = @"C:/pdf/3.pdf";
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage;
string outputPdfPath = @"C:/pdf/new.pdf";
sourceDocument = new Document();
pdfCopyProvider = new PdfCopy(sourceDocument, new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
//Open the output file
sourceDocument.Open();
try
{
    //Loop through the files list (all entries, including the last one)
    for (int f = 0; f < lstFiles.Length; f++)
    {
        int pages = get_pageCcount(lstFiles[f]);
        reader = new PdfReader(lstFiles[f]);
        //Add pages of current file
        for (int i = 1; i <= pages; i++)
        {
            importedPage = pdfCopyProvider.GetImportedPage(reader, i);
            pdfCopyProvider.AddPage(importedPage);
        }
        reader.Close();
    }
    //At the end save the output file
    sourceDocument.Close();
}
catch (Exception ex)
{
    //Rethrow without resetting the stack trace
    throw;
}
//Counts pages by scanning the file text for page objects.
//Note: reader.NumberOfPages reports the page count without this scan.
private int get_pageCcount(string file)
{
    using (StreamReader sr = new StreamReader(File.OpenRead(file)))
    {
        Regex regex = new Regex(@"/Type\s*/Page[^s]");
        MatchCollection matches = regex.Matches(sr.ReadToEnd());
        return matches.Count;
    }
}

ITextSharp is the way to go

Although it has already been said, you can't manipulate PDFs with the built-in libraries of the .NET Framework. I can however recommend iTextSharp, which is a .NET port of the Java iText. I have played around with it, and found it to be a very easy tool to use.
