PDF to Text: iTextSharp: Duplicate Pages in Extraction Results


Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections in PDF documents. To do this I first need to convert the PDF into a string to work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using the SimpleTextExtractionStrategy (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy but found it ineffective for the page layout).
Description of content being converted to text:
The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems that when the extractor finishes the columns on a page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and starts again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp; I just need a reliable way to convert documents with this format to text. A workaround or alternate method would be appreciated.
static public string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            //Convert PDF to Text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); //iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            strText = strText + s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}

As conjectured in a comment, the duplicate text is already present in the PDF content!
Details
In your document, the page contents of facing pages are often identical: each of the two pages contains the content of the whole spread, and the individual page merely displays the left or the right half.
E.g. consider pages 6 and 7. Their contents are identical and fill the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, only the expected content is shown for page 6 and for page 7.
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restricts extraction to these boxes; they extract all text drawn anywhere in the content. Hence the duplicate text.
Preventing duplicate text in the extraction result
Knowing the cause for the text duplication, there are multiple ways to prevent it:
1. You can try to extract the content of only every other PDF page. Unfortunately the above is not true for all pages: at least the initial pages (title page, contents, ...) are not created using the scheme explained above, and further into the book there are some artwork pages not following the scheme either. Thus, this option would require quite some management of exceptional pages.
2. You can extract the contents of each page but keep the contents of the previously processed page in some variable. Then add the newly extracted content to the result only if it does not equal the content of the prior page.
3. You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
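The filter-based option could look roughly like the following. This is only a minimal sketch, assuming iTextSharp 5.x; the class and method names (CropBoxExtraction, ExtractVisibleText) are made up for illustration and are not taken from the linked example. It wraps a SimpleTextExtractionStrategy in a FilteredTextRenderListener and uses each page's crop box as the region. Note that the region filter decides per text chunk, so a chunk that straddles the crop box edge may still be kept whole.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class CropBoxExtraction
{
    // Hypothetical helper: extracts only text drawn inside each page's crop box.
    public static string ExtractVisibleText(PdfReader reader)
    {
        StringBuilder result = new StringBuilder();
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            // GetCropBox falls back to the page size if the page has no CropBox entry.
            iTextSharp.text.Rectangle cropBox = reader.GetCropBox(page);

            // Let only text chunks that intersect the crop box reach the strategy.
            RenderFilter filter = new RegionTextRenderFilter(cropBox);

            // A fresh strategy per page, so text does not accumulate across pages.
            ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                new SimpleTextExtractionStrategy(), filter);

            result.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
        }
        return result.ToString();
    }
}
```

Compared with comparing consecutive pages, this approach does not rely on both halves of a spread producing exactly the same extracted string.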

Using mkl's second method (checking each page for repeat) I came up with the following and it works brilliantly; an easy fix:
static public string ToTxt(string filePath)
{
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filePath);
        string prevPage = "";
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            Widgets.ProgressBar(page);
            //Convert PDF to Text
            ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
            String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            // Skip a page whose text equals that of the previous page
            if (prevPage != s)
                strText += s;
            prevPage = s;
        }
        reader.Close();
        Console.WriteLine("File Extracted");
    }
    catch (Exception e)
    {
        Console.WriteLine("Exception: " + e.Message);
    }
    finally
    {
        Console.Clear();
    }
    return strText;
}

Related

iTextSharp How to read Table in PDF file

I am working on converting PDF to text. I can get the text from the PDF correctly, but it gets complicated with table structures. I know PDF doesn't support table structure, but I think there is a way to get the cells correctly. For example:
I want to convert to text like this:
> This is first example.
> This is second example.
But when I convert the PDF to text, the data looks like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how I convert the PDF to text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
    filepath = ofd.FileName.ToString();
    string strText = string.Empty;
    try
    {
        PdfReader reader = new PdfReader(filepath);
        // Pages are 1-based, so use <= to include the last page
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
            s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
            strText += s;
        }
        reader.Close();
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that he replaced LocationTextExtractionStrategy with SimpleTextExtractionStrategy, and then it worked.
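To compare the two built-in strategies on a given document, something like the following could be used. It is a minimal sketch, assuming iTextSharp 5.x; the names StrategyComparison and ExtractAllPages are hypothetical. The strategy is passed in as a factory because a strategy instance collects text as it goes, so each page should get a fresh one.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class StrategyComparison
{
    // Hypothetical helper: extracts all pages with a strategy created per page.
    public static string ExtractAllPages(PdfReader reader,
        Func<ITextExtractionStrategy> createStrategy)
    {
        StringBuilder text = new StringBuilder();
        // Pages are numbered from 1 up to and including NumberOfPages.
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            text.AppendLine(
                PdfTextExtractor.GetTextFromPage(reader, page, createStrategy()));
        }
        return text.ToString();
    }
}
```

Calling ExtractAllPages(reader, () => new SimpleTextExtractionStrategy()) and ExtractAllPages(reader, () => new LocationTextExtractionStrategy()) on the same reader makes it easy to see which ordering suits the tables in a particular document.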

Reading pdf content with itextsharp and C#

I'm looking for a way to select text n characters on either side of a keyword search using itextsharp (v5.5.8). I've gotten to the point where I can use the SimpleTextExtractionStrategy() and return a list of pages where the searched text is found (supposedly). When I do a manual search using PDF Viewer Search, sometimes it's there and sometimes it can't find it on the page itextsharp says it's on. Sometimes, not at all.
The idea is to be able to return 40 characters on either side of the found keyword to allow the user to be able to find the reference easier when they look at the actual document. In another question, I saw a reference to additional text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum) in combination with some Contains(word)).
Where can I find examples of how to use these functions? And how to create a better strategy?
My current code:
public List<int> ReadPdfFile(string fileName, String searthText)
{
    string rootPath = HttpContext.Current.Server.MapPath("~");
    string dirPath = rootPath + @"content\publications\";
    List<int> pages = new List<int>();
    string fullFile = dirPath + fileName;
    if (File.Exists(fullFile))
    {
        PdfReader pdfReader = new PdfReader(fullFile);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentPageText.Contains(searthText))
            {
                pages.Add(page);
            }
        }
        pdfReader.Close();
    }
    return pages;
}
And an example of the output using a simple response.write command...
Document 1.pdf 1 3
Document 2.pdf 1 2 3 4
The numbers after the file name are page numbers where the searched for keyword is found. However, in Document 1, the keyword is also found on the very top of page 4 in the "References" section that began on page 3. It should be found twice in the References.
Thanks,
Bob
P.S. apparently 5.5.8 doesn't have the iTextSharp.text.pdf.parser.TextExtractionStrategy method...
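The "40 characters on either side" idea described above can be sketched independently of iTextSharp, working only on the page text that PdfTextExtractor.GetTextFromPage returns. This is a minimal sketch, not a complete answer: the names KeywordContext and FindSnippets are hypothetical, and the case-insensitive comparison is an assumption (string.Contains is case-sensitive, which is one possible reason for missed matches). A keyword that is split by a line break or a page break in the extracted text would still not be found.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordContext
{
    // Returns one snippet per occurrence of keyword in pageText, with up to
    // radius characters of context on each side of the match.
    public static List<string> FindSnippets(string pageText, string keyword, int radius = 40)
    {
        List<string> snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(keyword))
            return snippets;

        int index = pageText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            // Clamp the window to the bounds of the page text.
            int start = Math.Max(0, index - radius);
            int end = Math.Min(pageText.Length, index + keyword.Length + radius);
            snippets.Add(pageText.Substring(start, end - start));

            // Continue searching after the current match.
            index = pageText.IndexOf(keyword, index + keyword.Length,
                StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }
}
```

Inside the page loop, the snippets for each page could be stored together with the page number instead of only recording that the page matched.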

Extract text from pdf by format

I am trying to extract the headlines from pdfs.
Until now I tried to read the plain text and take the first line (which didn't work because in plain text the headlines were not at the beginning) and just read the text from a region (which didn't work, because the regions are not always the same).
The easiest way to do this is in my opinion to read just text with a special format (font, fontsize etc.).
Is there a way to do this?

You can enumerate all text objects on a PDF page using the Docotic.Pdf library. For each text object, information about its font and size is available. Below is a sample:
public static void listTextObjects(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        string format = "{0}\n{1}, {2}px at {3}";
        foreach (PdfPage page in pdf.Pages)
        {
            foreach (PdfPageObject obj in page.GetObjects())
            {
                if (obj.Type != PdfPageObjectType.Text)
                    continue;
                PdfTextData text = (PdfTextData)obj;
                string message = string.Format(format, text.Text, text.Font.Name,
                    text.Size.Height, text.Position);
                Console.WriteLine(message);
            }
        }
    }
}
The code will output lines like the following for each text object on each page of the input PDF file.
FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }
You can use the retrieved information to find the largest text, bold text, or text with whatever other properties are used to format the headline.
If your PDF is guaranteed to have the headline as the topmost text on a page, then you can use an even simpler approach:
public static void printText(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            string text = page.GetTextWithFormatting();
            Console.WriteLine(text);
        }
    }
}
The GetTextWithFormatting method returns text in reading order (i.e. from the top left to the bottom right).
Disclaimer: I am one of the developers of the library.

Using iTextSharp, trying to extract text from a PDF gives non-readable data

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
    strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
        Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
    pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?

Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, the pasted text is unreadable.
In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
Regarding your additional question in the comments: if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE

Try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    PdfTextExtractor.GetTextFromPage(reader, page, its);
    // Read the text straight from the strategy, without any re-encoding
    String strPage = its.GetResultantText();
    pdfText.Add(strPage);
}

Try this code; it worked for me:
using (PdfReader reader = new PdfReader(path))
{
    StringBuilder text = new StringBuilder();
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
    }
    return text.ToString();
}

itextsharp and pdf

I want to read a PDF file line by line, but I want to maintain its original format.
Can I do this with iTextSharp?
I use the following code:
private void button1_Click(object sender, EventArgs e)
{
    string text = string.Empty;
    string path = "C:\\Documents and Settings\\Rafael\\Desktop\\Imprimiendo\\Print1.pdf";
    PdfReader reader = new PdfReader(path);
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        // Append each page; assigning inside the loop would keep only the last page
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    richTextBox1.Text = text;
}
Thanks, I really need your help.

If you want to read a PDF file with a small amount of data in it, iTextSharp would be the best choice. You may find an answer here:
Reading PDF content with itextsharp dll in VB.NET or C#
However, if your PDF file contains a huge amount of data, iTextSharp may struggle with this task. In that case, you may need a third-party library. This article may help:
Read PDF file in C#
