Reading pdf content with itextsharp and C#


I'm looking for a way to select the text n characters on either side of a keyword match using iTextSharp (v5.5.8). I've gotten to the point where I can use the SimpleTextExtractionStrategy() and return a list of pages where the searched text is (supposedly) found. When I search manually using the PDF viewer's search, sometimes the keyword is there, sometimes it is not on the page iTextSharp says it's on, and sometimes it is not found at all.
The idea is to be able to return 40 characters on either side of the found keyword to allow the user to be able to find the reference easier when they look at the actual document. In another question, I saw a reference to additional text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum) in combination with some Contains(word)).
Where can I find examples of how to use these functions? And how to create a better strategy?
My current code:
public List<int> ReadPdfFile(string fileName, string searchText)
{
    string rootPath = HttpContext.Current.Server.MapPath("~");
    string dirPath = rootPath + @"content\publications\";
    List<int> pages = new List<int>();
    string fullFile = dirPath + fileName;
    if (File.Exists(fullFile))
    {
        PdfReader pdfReader = new PdfReader(fullFile);
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            // Extract the text of one page and record the page number on a match
            ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
            string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
            if (currentPageText.Contains(searchText))
            {
                pages.Add(page);
            }
        }
        pdfReader.Close();
    }
    return pages;
}
And an example of the output using a simple response.write command...
Document 1.pdf 1 3
Document 2.pdf 1 2 3 4
The numbers after the file name are page numbers where the searched for keyword is found. However, in Document 1, the keyword is also found on the very top of page 4 in the "References" section that began on page 3. It should be found twice in the References.
Thanks,
Bob
P.S. apparently 5.5.8 doesn't have the iTextSharp.text.pdf.parser.TextExtractionStrategy method...
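A minimal sketch of the "40 characters on either side" idea, assuming the page text has already been extracted with PdfTextExtractor.GetTextFromPage as in the code above. The class and method names (KeywordContext, FindWithContext) are hypothetical, and the case-insensitive comparison is an assumption about what the search should do, not something the question specifies. It uses only plain string handling, so it does not address keywords that the PDF splits across lines or pages.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordContext
{
    // Returns every occurrence of keyword in pageText together with
    // up to `radius` characters of surrounding text on each side.
    public static List<string> FindWithContext(string pageText, string keyword, int radius = 40)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(keyword))
            return snippets;

        // Assumption: matching ignores case.
        int index = pageText.IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
        while (index >= 0)
        {
            // Clamp the window to the bounds of the page text.
            int start = Math.Max(0, index - radius);
            int end = Math.Min(pageText.Length, index + keyword.Length + radius);
            snippets.Add(pageText.Substring(start, end - start));

            // Continue searching after the current match.
            index = pageText.IndexOf(keyword, index + keyword.Length, StringComparison.OrdinalIgnoreCase);
        }
        return snippets;
    }
}
```

Inside the page loop, one could call KeywordContext.FindWithContext(currentPageText, searchText) and store the snippets next to the page number instead of only recording the page.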

Related

Using ITextSharp TextRenderInfo.GetTextRenderMode ignoring hidden text in pdf

I have a series of PDF files I need to search for keywords, but many of them contain a huge amount of hidden text. What I mean is that when you use CTRL+F to count occurrences of the keyword "CJP", there are about 35 results, but in reality only about 9 are actually visible; the rest seem to be randomly hidden all over the page. I have tried several APIs, and all of them report 35 and not 9, so I wanted to try the class named TextRenderInfo in iTextSharp, because its method GetTextRenderMode is supposed to return 3 if the text is hidden, meaning I can use that to ignore strings that are invisible.
Here is my current code:
static void Main(string[] args)
{
Gerdau.ITextSharpCount(@"Source.pdf", "CJP");
}
public static int ITextSharpCount(string filePath, string searchString)
{
StringBuilder sb = new StringBuilder();
string file = filePath;
using (PdfReader reader = new PdfReader(file))
{
for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
{
textRenderInfo.GetTextRenderMode();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
sb.Append(text);
}
}
int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
return numberOfMatches;
}
The issue is I don't know how to set up the TextRenderInfo class to check for the hidden text. If anyone knows how to do it, it would be a huge help and more code the merrier :).
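One possible way to wire this up, sketched under assumptions: iTextSharp 5.x lets you wrap an extraction strategy in a FilteredTextRenderListener together with a RenderFilter, and the filter receives each TextRenderInfo before the strategy does. The class names VisibleTextFilter and VisibleTextCounter are hypothetical. Render mode 3 only covers text drawn with neither fill nor stroke; text hidden in other ways (white on white, covered by images, clipped, or placed off the page) would still be counted. Unlike the code above, this sketch counts matches page by page, so a match spanning a page break would be missed.

```csharp
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical filter: rejects text chunks drawn with render mode 3
// (neither filled nor stroked, i.e. invisible).
public class VisibleTextFilter : RenderFilter
{
    public override bool AllowText(TextRenderInfo renderInfo)
    {
        return renderInfo.GetTextRenderMode() != 3;
    }
}

public static class VisibleTextCounter
{
    public static int Count(string filePath, string searchString)
    {
        int matches = 0;
        using (PdfReader reader = new PdfReader(filePath))
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // The wrapper forwards only the chunks the filter allows.
                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new SimpleTextExtractionStrategy(), new VisibleTextFilter());
                string text = PdfTextExtractor.GetTextFromPage(reader, page, strategy);

                // Escape the keyword so it is matched literally, not as a pattern.
                matches += Regex.Matches(text, Regex.Escape(searchString)).Count;
            }
        }
        return matches;
    }
}
```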

iTextSharp How to read Table in PDF file

I am working on converting PDF to text. I can extract plain text from a PDF correctly, but it gets complicated with table structures. I know PDF doesn't support table structure as such, but I think there must be a way to get the cell contents correctly. For example:
I want to convert to text like this:
> This is first example.
> This is second example.
But when I convert the PDF to text, the data looks like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how I converted the PDF to text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
filepath = ofd.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page <= reader.NumberOfPages; page++) // <= so the last page is included
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText += s;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}

To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that he replaced LocationTextExtractionStrategy with SimpleTextExtractionStrategy, and then it worked.

PDF to Text: iTextSharp: Duplicate Pages in Extraction Results

Thanks in advance.
The Background:
I'm working on a console application that extracts data from specific sections in PDF documents. To do this I first need to convert the PDF into a string to work with, so I turned to iTextSharp. The PDFs are laid out with two columns per page, so I'm using the SimpleTextExtractionStrategy() (I tried iTextSharp.text.pdf.parser.LocationTextExtractionStrategy(); but found it ineffective for the page layout).
Description of content being converted to text:
The pages I seem to be having trouble with have a "header" posted up on the side of the page. Pages with headers are intermittently dispersed through the document.
Image of page layout: http://postimg.org/image/b7i25v0g1/
The Problem:
It seems that once extraction finishes the columns on a page, it moves on to that side header. It then jumps to the next page with a side header, converts that to text, and starts again from the top of the page where the first header was encountered.
I'd end up with text that looks like:
Page 1 Content
First Header
Second Header
Page 1 Content
Page 2 Content
etc.
Here is the pdf: http://www.filedropper.com/dd35-completeadventurer
I'm not married to iTextSharp I just need a reliable way to convert documents with this format to text. A work around or alternate method would be appreciated.
static public string ToTxt(string filePath)
{
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
Widgets.ProgressBar(page);
//Convert PDF to Text
ITextExtractionStrategy its = new SimpleTextExtractionStrategy(); //iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = strText + s;
}
reader.Close();
Console.WriteLine("File Extracted");
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
finally
{
Console.Clear();
}
return strText;
}

As already conjectured in a comment, the duplicate text already is present in the PDF content!
Details
In your document, the two pages of a facing pair often have identical page contents: each contains the whole spread, and each individual page merely displays the left or the right half respectively.
E.g. consider the two pages 6 and 7. Their contents are identical and fill the area of their identical MediaBox. Merely by setting the CropBox (and the ArtBox, BleedBox, and TrimBox) to the left or right half respectively, each of the two pages shows only its expected half.
Neither the iText(Sharp) parser framework nor the SimpleTextExtractionStrategy automatically restrict to these boxes, they extract all text drawn anywhere in the content. Thus, the duplicate text.
Preventing duplicate text in the extraction result
Knowing the cause for the text duplication, there are multiple ways to prevent it:
You can try to extract the content of only every other PDF page. Unfortunately, the scheme described above does not hold for all pages: at least the initial pages (title page, contents, ...) are not created this way, and further into the book there are some artwork pages that do not follow the scheme either. Thus, this option would require quite some management of exceptional pages.
You can extract the contents of each page but keep the contents of the previously processed page in some variable. Now only add the newly extracted content to the result if it does not equal the content of the prior page.
You can use the iText(Sharp) parser filters. If you restrict the text chunks processed by your strategy to only those drawn inside the crop box of the current page, you prevent duplicate text caused by off-page content. You can find an example filtering by region here: ExtractPageContentArea.java / ExtractPageContentArea.cs.
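The third option could look roughly like the following. This is a sketch, not the linked ExtractPageContentArea example: it assumes iTextSharp 5.x, where RegionTextRenderFilter accepts an iTextSharp.text.Rectangle and PdfReader.GetCropBox returns the crop box of a page without taking rotation into account. The class and method names (CropBoxTextExtractor, ExtractVisibleText) are hypothetical. Text chunks that straddle the crop box edge may still pass the filter, so results on this particular document would need checking.

```csharp
using System.Text;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class CropBoxTextExtractor
{
    // Extracts text page by page, keeping only chunks located
    // inside the crop box of the page being processed.
    public static string ExtractVisibleText(string filePath)
    {
        StringBuilder result = new StringBuilder();
        PdfReader reader = new PdfReader(filePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: pages are not rotated, so the raw crop box
                // matches the coordinates used in the content stream.
                Rectangle cropBox = reader.GetCropBox(page);
                RenderFilter filter = new RegionTextRenderFilter(cropBox);

                ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                    new SimpleTextExtractionStrategy(), filter);
                result.Append(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return result.ToString();
    }
}
```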

Using mkl's second method (checking each page for repeat) I came up with the following and it works brilliantly; an easy fix:
static public string ToTxt(string filePath)
{
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
string prevPage = "";
for (int page = 1; page <= reader.NumberOfPages; page++)
{
Widgets.ProgressBar(page);
//Convert PDF to Text
ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
if (prevPage != s)
strText += s;
prevPage = s;
}
reader.Close();
Console.WriteLine("File Extracted");
}
catch (Exception e)
{
Console.WriteLine("Exception: " + e.Message);
}
finally
{
Console.Clear();
}
return strText;
}

Using iTextSharp, trying to extract text from a PDF gives non-readable data

Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?

Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, you'll get:
In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
Regarding your additional question in the comments: if the PDF employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE

try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, page, its);
string strPage = its.GetResultantText(); // text collected by the strategy for this page
pdfText.Add(strPage);
}

Try this code, Worked for me
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
}
return text.ToString();
}

Itextsharp text extraction

I'm using iTextSharp on VB.NET to get the text content from a PDF file. The solution works fine for some files but not for others, even quite simple ones. The problem is that the token's StringValue contains no readable text (it shows as a set of empty square boxes):
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can measure the length of the content but I cannot get the actual string content.
I realized that this happens depending on the font of the pdf. If I create a pdf using either Acrobat or PdfCreator with Courier (that by the way is the default font in my visual studio editor) I can get all the text content. If the same pdf is built using a different font I got the empty square boxes.
Now the question is, How can I extract text regardless of the font setting?
Thanks

This complements Mark's answer, which helped me a lot. The iTextSharp namespaces and classes differ a bit from the Java version:
public static string GetTextFromAllPages(String pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    StringWriter output = new StringWriter();
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close(); // release the file once all pages are read
    return output.ToString();
}

Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.

Here is a variant with iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENTS, in case someone needs it.
string strFile = @"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
int pageFrom = 1;
int pageTo = pdfRida.NumberOfPages;
iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
string tknValue;
for (int i = pageFrom; i <= pageTo; i++)
{
iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if(cannots!=null)
foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
{
iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
iTextSharp.text.pdf.PdfObject annotContents = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
if (annotContents != null && annotContents.GetType() == typeof(iTextSharp.text.pdf.PdfString))
{
string cStringVal = ((iTextSharp.text.pdf.PdfString)annotContents).ToString();
if (cStringVal.ToUpper().Contains("LOS 8"))
{ // DO SOMETHING FUN
}
}
}
}
pdfRida.Close();
