I have a series of PDF files I need to search for keywords, but many of them contain a huge amount of hidden text. What I mean is when you try to CTRL+F to see how many key words are named "CJP" there are about 35 results, but in reality there are only about 9 that are actually visible, the rest just seem to be randomly hidden all over the page. I have tried out several APIs with them all reading 35 and not 9, so I wanted to try out this class named TextRenderInfo in ITextSharp because the method GetTextRenderMode is suppose to return 3 if the text is hidden, meaning I can use that to ignore strings that are invisable.
Here is my current code:
static void Main(string[] args)
{
Gerdau.ITextSharpCount(#"Source.pdf", "CJP");
}
public static int ITextSharpCount(string filePath, string searchString)
{
StringBuilder sb = new StringBuilder();
string file = filePath;
using (PdfReader reader = new PdfReader(file))
{
for (int pageNumber = 1; pageNumber <= reader.NumberOfPages; pageNumber++)
{
textRenderInfo.GetTextRenderMode();
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy);
text = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(text)));
sb.Append(text);
}
}
int numberOfMatches = Regex.Matches(sb.ToString(), searchString).Count;
return numberOfMatches;
}
The issue is I don't know how to set up the TextRenderInfo class to check for the hidden text. If anyone knows how to do it, it would be a huge help and more code the merrier :).
Related
I've been using the iTextSharp library in my C# .NET program to read PDF files. A simplified version of my program uses code like the following procedure to get the text for a specified page number in a PDF File.
using iTextSharp.text.pdf.parser;
using System.Text;
namespace PDFReaderITextSharp
{
public static class PdfHelper
{
public static string GetText(string fileName, int pageNumber)
{
PdfReader pdfReader = new PdfReader(fileName);
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, pageNumber, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
return currentText;
}
}
}
Now, I'd like to construct a similar type of function that does that same thing but using the iText7 library in a .NET Core 3 program, namely, return the text for a specified PDF page. However, when I look at the text that is returned and compare it to the text returned by the above function as well as visually looking at the PDF file using an Adobe Reader, I see document text being represented multiple times.
For example, when I look at the PDF file in Adobe, there are several fields displayed in the form of "Caption: Value". For example, Invoice #: 1234 Invoice Date: 12/32/2019. But the text returned using the iText7 library returns "Invoice#Invoice #: 1234 Invoice DateInvoice Date: 12/32/2019" (the labels are duplicated 2 or more times.)
I wish I could upload the PDF document to help.
Is there something wrong with the iText7 function? What might the iText7 library be doing?
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
namespace PDFReader.Helpers
{
public static class PdfHelperIText7
{
public static string GetPdfPageText(string pdfFilePath, int pageNumber)
{
using (PdfDocument pdfDocument = new PdfDocument(new PdfReader(pdfFilePath)))
{
FilteredEventListener listener = new FilteredEventListener();
LocationTextExtractionStrategy extractionStrategy = listener.AttachEventListener(new LocationTextExtractionStrategy());
PdfCanvasProcessor pdfCanvasProcessor = new PdfCanvasProcessor(listener);
listener.AttachEventListener(extractionStrategy);
pdfCanvasProcessor.ProcessPageContent(pdfDocument.GetPage(pageNumber));
string actualText = extractionStrategy.GetResultantText();
pdfDocument.Close();
return actualText;
}
}
}
}
I am working on convert PDF to text. I can get text from PDF correctly but it is being complicated in table structure. I know PDF doesn't support table structure but I think there is a way get cells correctly. Well, for example:
I want to convert to text like this:
> This is first example.
> This is second example.
But, when I convert PDF to text, theese datas looking like this:
> This is This is
> first example. second example.
How can I get values correctly?
--EDIT:
Here is how did I convert PDF to Text:
OpenFileDialog ofd = new OpenFileDialog();
string filepath;
ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (ofd.ShowDialog() == DialogResult.OK)
{
filepath = ofd.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page < reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText += s;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
}
To make my comment an actual answer...
You use the LocationTextExtractionStrategy for text extraction:
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.
Depending on the document in question there are different approaches one can take:
Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
In this case, the OP commented that he changed LocationTextExtractionStrategy with SimpleTextExtractionStrategy, then it worked.
I'm looking for a way to select text n characters on either side of a keyword search using itextsharp (v5.5.8). I've gotten to the point where I can use the SimpleTextExtractionStrategy() and return a list of pages where the searched text is found (supposedly). When I do a manual search using PDF Viewer Search, sometimes it's there and sometimes it can't find it on the page itextsharp says it's on. Sometimes, not at all.
The idea is to be able to return 40 characters on either side of the found keyword to allow the user to be able to find the reference easier when they look at the actual document. In another question, I saw a reference to additional text retrieval functions (LocationTextExtractionStrategy, PdfTextExtractor.GetTextFromPage(myReader, pageNum) in combination with some Contains(word)).
Where can I find examples of how to use these functions? And how to create a better strategy?
My current code:
public List<int> ReadPdfFile(string fileName, String searthText)
{
string rootPath = HttpContext.Current.Server.MapPath("~");
string dirPath = rootPath + #"content\publications\";
List<int> pages = new List<int>();
string fullFile = dirPath + fileName;
if (File.Exists(fullFile))
{
PdfReader pdfReader = new PdfReader(fullFile);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
if (currentPageText.Contains(searthText))
{
pages.Add(page);
}
}
pdfReader.Close();
}
return pages;
}
And an example of the output using a simple response.write command...
Document 1.pdf 1 3
Document 2.pdf 1 2 3 4
The numbers after the file name are page numbers where the searched for keyword is found. However, in Document 1, the keyword is also found on the very top of page 4 in the "References" section that began on page 3. It should be found twice in the References.
Thanks,
Bob
P.S. apparently 5.5.8 doesn't have the iTextSharp.text.pdf.parser.TextExtractionStrategy method...
Okay, I'm trying to extract text from a PDF file using iTextSharp... that's all I want. However, when I extract the text, it's giving me garbage instead of text.
Here's the code I'm using...
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
String strPage = PdfTextExtractor.GetTextFromPage(reader, page, its);
strPage = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default,
Encoding.UTF8, Encoding.Default.GetBytes(strPage)));
pdfText.Add(strPage);
}
I then save that data to a text file, but instead of readable text, I get text that looks like binary data... non-printable characters all over the place. I'd post an image of what I see, but it won't let me. Sorry about that.
I have tried without the encoding attempt, and it didn't work any better... still binary-looking data (viewed in Notepad), though I'm not certain it's identical to that produced with the encoding attempt.
Any idea what is happening and how to fix it?
Please open the document in Adobe Reader, then try to copy/paste part of the text.
If you do this with the first page, you'll get:
The following policy (L30304) has been archived by Alpha II. Many policies are part of a larger
jurisdiction, than is indicated by the policy. This policy covers the following states:
• INDIANA
• MICHIGAN
However, if you do this with the second page, you'll get:
In other words: copy/pasting from Adobe Reader gives you garbage.
And if copy/pasting from Adobe Reader gives you garbage, any text extraction tool will give you garbage. You'll need to OCR the document to solve this problem.
Regarding your additional question in the comments: if the PDf employs a custom encoding method, how can Adobe display it properly unless the PDF file contains the information needed to handle it?
This question is answered in a 14-minute movie: https://www.youtube.com/watch?v=wxGEEv7ibHE
try this code:
List<String> pdfText = new List<string>();
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, page, its);
strPage = its.GetResultantText();
pdfText.Add(strPage);
}
Try this code, Worked for me
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
}
return text.ToString();
}
I'm using itextsharp on vb.net to get the text content from a pdf file. The solution works fine for some files but not for other even quite simple ones. The problem is that the token stringvalue is set to null (a set of empty square boxes)
token = New iTextSharp.text.pdf.PRTokeniser(pageBytes)
While token.NextToken()
tknType = token.TokenType()
tknValue = token.StringValue
I can meassure the length of the content but I cannot get the actual string content.
I realized that this happens depending on the font of the pdf. If I create a pdf using either Acrobat or PdfCreator with Courier (that by the way is the default font in my visual studio editor) I can get all the text content. If the same pdf is built using a different font I got the empty square boxes.
Now the question is, How can I extract text regardless of the font setting?
Thanks
complementary for Mark's answer that helps me a lot .iTextSharp implementation namespaces and classes are a bit different from java version
public static string GetTextFromAllPages(String pdfPath)
{
PdfReader reader = new PdfReader(pdfPath);
StringWriter output = new StringWriter();
for (int i = 1; i <= reader.NumberOfPages; i++)
output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
return output.ToString();
}
Check out PdfTextExtractor.
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum);
or
String pageText =
PdfTextExtractor.getTextFromPage(myReader, pageNum, new LocationTextExtractionStrategy());
Both require fairly recent versions of iText[Sharp]. Actually parsing the content stream yourself is just reinventing the wheel at this point. Spare yourself some pain and let iText do it for you.
PdfTextExtractor will handle all the different font/encoding issues for you... all the ones that can be handled anyway. If you can't copy/paste from Reader accurately, then there's not enough information present in the PDF to get character information from the content stream.
Here is a variant with iTextSharp.text.pdf.PdfName.ANNOTS and iTextSharp.text.pdf.PdfName.CONTENT if some one need it.
string strFile = #"C:\my\path\tothefile.pdf";
iTextSharp.text.pdf.PdfReader pdfRida = new iTextSharp.text.pdf.PdfReader(strFile);
iTextSharp.text.pdf.PRTokeniser prtTokeneiser;
int pageFrom = 1;
int pageTo = pdfRida.NumberOfPages;
iTextSharp.text.pdf.PRTokeniser.TokType tkntype ;
string tknValue;
for (int i = pageFrom; i <= pageTo; i++)
{
iTextSharp.text.pdf.PdfDictionary cpage = pdfRida.GetPageN(i);
iTextSharp.text.pdf.PdfArray cannots = cpage.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
if(cannots!=null)
foreach (iTextSharp.text.pdf.PdfObject oAnnot in cannots.ArrayList)
{
iTextSharp.text.pdf.PdfDictionary cAnnotationDictironary = (iTextSharp.text.pdf.PdfDictionary)pdfRida.GetPdfObject(((iTextSharp.text.pdf.PRIndirectReference)oAnnot).Number);
iTextSharp.text.pdf.PdfObject moreshit = cAnnotationDictironary.Get(iTextSharp.text.pdf.PdfName.CONTENTS);
if (moreshit != null && moreshit.GetType() == typeof(iTextSharp.text.pdf.PdfString))
{
string cStringVal = ((iTextSharp.text.pdf.PdfString)moreshit).ToString();
if (cStringVal.ToUpper().Contains("LOS 8"))
{ // DO SOMETHING FUN
}
}
}
}
pdfRida.Close();