iTextSharp PdfTextExtractor GetTextFromPage Throwing NullReferenceException


I am using iTextSharp to read PDF documents, but lately I'm getting a
{"Object reference not set to an instance of an object."}
NullReferenceException when getting the text from a page of a PdfReader. It worked before, but as of today it no longer does. I didn't change my code.
Below is my code:
for (int i = 1; i <= reader.NumberOfPages; i++)
{
    ITextExtractionStrategy its = new SimpleTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(reader, i, its);
    if (currentText.Contains("ADVANCES"))
    {
        return i;
    }
}
return 0;
The above code throws a NullReferenceException. reader is not null, and i obviously cannot be null, being an int.
I am instantiating the PdfReader from the input stream:
PdfReader reader = new PdfReader(_stream);
Below is the stack trace:
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.DisplayXObject(PdfName xobjectName)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
To keep things simple, I created a small console application that just reads all the text from the PDF file and displays it. The code is below. The result is the same as above: it throws a NullReferenceException.
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine(ExtractTextFromPdf(@"stockQuotes_03232015.pdf"));
    }

    public static string ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            StringBuilder text = new StringBuilder();
            // Page numbers are 1-based.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            }
            return text.ToString();
        }
    }
}
Does anyone know what might be going on here or how I might work around it?

To summarize what was found out in the comments on the question:
In short
The PDF the OP first used is invalid: it is missing required objects that the parser tries to read.
Once he got hold of a valid version, he was able to parse it successfully.
In detail
Depending on the time and mode of the request, the web site the PDFs were downloaded from returned different versions of the same document: sometimes complete, sometimes incomplete and therefore invalid.
The test file was stockQuotes_03232015.pdf, i.e. the PDF containing the data generated on the test day:
A complete copy
An incomplete, invalid copy
The complete file can be recognized by its size alone: in my downloads it is 250933 bytes long, while my incomplete file is 81062 bytes long.
Inspecting the files, it looks like the incomplete file was derived from the complete one by some tool that removed duplicate image streams but forgot to replace the references to the removed streams with references to the retained stream object.

Please use the code below to read text from a PDF. It shows the text in a RichTextBox named richTextBox1.
Reference (YouTube): https://www.youtube.com/watch?v=22C9N4WP4-s
using (OpenFileDialog ofd = new OpenFileDialog() { Filter = "PDF files|*.pdf", ValidateNames = true, Multiselect = false })
{
    if (ofd.ShowDialog() == DialogResult.OK)
    {
        try
        {
            iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(ofd.FileName);
            StringBuilder sb = new StringBuilder();
            // Page numbers are 1-based and the last page is included, hence <=.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                sb.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, i));
            }
            richTextBox1.Text = sb.ToString();
            reader.Close();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message, "Message", MessageBoxButtons.OK, MessageBoxIcon.Error);
        }
    }
}

Related

PdfTextExtractor.GetTextFromPage() returns empty string

I'm trying to extract the text from the following PDF with the following code (using iText7 7.2.2):
var source = (string)GetHttpResult("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf", new CookieContainer());
var bytes = Encoding.UTF8.GetBytes(source);
var stream = new MemoryStream(bytes);
var reader = new PdfReader(stream);
var doc = new PdfDocument(reader);
var pages = doc.GetNumberOfPages();
var text = PdfTextExtractor.GetTextFromPage(doc.GetPage(1));
Loading the PDF in my browser (Edge 100.0) works fine.
GetHttpResult() is a simple HttpClient defining a custom CookieContainer, a custom UserAgent, and calling ReadAsStringAsync(). Nothing fancy.
source has the correct PDF content, starting with "%PDF-1.7".
pages has the correct number of pages, which is 2.
But, whatever I try, text is always empty.
Defining an explicit TextExtractionStrategy, trying different encodings, extracting from all pages in a loop: nothing matters. text is always empty, and no exception is thrown anywhere.
I think I'm not reading this PDF the way it's "meant" to be read, but what is the correct way then (correct content in source, correct number of pages, no exception anywhere)?
Thanks.

That's it! Thanks to mkl and KJ!
I first downloaded the PDF as a byte array, so I'm sure it's not modified in any way.
Then, as pdftotext is able to extract the text from this PDF, I searched for a NuGet package able to do the same. I tested almost ten of them, and FreeSpire.PDF finally did it!
Update: Actually, FreeSpire.PDF missed some words, so I finally found PdfPig, which is able to extract every single word.
Code using PdfPig:
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
List<string> words = new();
using (PdfDocument document = PdfDocument.Open(bytes))
{
    foreach (Page page in document.GetPages())
    {
        foreach (Word word in page.GetWords())
        {
            words.Add(word.Text);
        }
    }
}
string text = string.Join(" ", words);
Code using FreeSpire.PDF:
using Spire.Pdf;
using Spire.Pdf.Exporting.Text;

byte[] bytes;
using (HttpClient client = new())
{
    bytes = client.GetByteArrayAsync("https://www.bcr.ro/content/dam/ro/bcr/www_bcr_ro/Aur/Cotatii_Aur.pdf").GetAwaiter().GetResult();
}
string text = string.Empty;
SimpleTextExtractionStrategy strategy = new();
using (PdfDocument doc = new())
{
    doc.LoadFromBytes(bytes);
    foreach (PdfPageBase page in doc.Pages)
    {
        text += page.ExtractText(strategy);
    }
}
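The first step in the answer above, downloading the PDF as a byte array rather than as a string, matters because a PDF is binary data. The sketch below is an illustration only and uses just the .NET base class library; the class name, method name, and sample bytes are made up for the example, and the answer does not say that the string conversion was what made the extraction come back empty. It shows that decoding arbitrary bytes as UTF-8 and encoding them again does not necessarily give back the original bytes:

```csharp
using System;
using System.Linq;
using System.Text;

public static class RoundTripDemo
{
    // Decodes the bytes as UTF-8 text and re-encodes the text as UTF-8,
    // which is what happens when binary content is read as a string
    // and then converted back with Encoding.UTF8.GetBytes.
    public static byte[] RoundTripThroughUtf8(byte[] original)
    {
        string asText = Encoding.UTF8.GetString(original);
        return Encoding.UTF8.GetBytes(asText);
    }

    public static void Main()
    {
        // "%PDF" followed by four bytes that are not valid UTF-8 (hypothetical sample data).
        byte[] binary = { 0x25, 0x50, 0x44, 0x46, 0xE2, 0xE3, 0xCF, 0xD3 };
        byte[] roundTripped = RoundTripThroughUtf8(binary);

        // The invalid bytes are replaced during decoding, so the arrays differ.
        Console.WriteLine(binary.SequenceEqual(roundTripped));
    }
}
```

Bytes in the ASCII range survive the round trip, which is consistent with the question's observation that the string still started with "%PDF-1.7" even though binary parts of the content may have changed.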

GhostscriptRasterizer.PageCount always returns zero

This problem has already been discussed here: GhostscriptRasterizer Objects Returns 0 as PageCount value
But the answer to that question did not help me solve the problem.
In my case, rolling back to an older version of Ghostscript (26 and 25) doesn't help. I always get PageCount = 0, and if the version is lower than 27, I get the error "Native Ghostscript library not found."
private static void PdfToPng(string inputFile, string outputFileName)
{
    var xDpi = 100; //set the x DPI
    var yDpi = 100; //set the y DPI
    var pageNumber = 1; // the pages in a PDF document
    using (var rasterizer = new GhostscriptRasterizer()) //create an instance for GhostscriptRasterizer
    {
        rasterizer.Open(inputFile); //opens the PDF file for rasterizing
        //set the output image(png's) complete path
        var outputPNGPath = Path.Combine(outputFolder, string.Format("{0}_Page{1}.png", outputFileName, pageNumber));
        //converts the PDF pages to png's
        var pdf2PNG = rasterizer.GetPage(xDpi, yDpi, pageNumber);
        //save the png's
        pdf2PNG.Save(outputPNGPath, ImageFormat.Png);
        Console.WriteLine("Saved " + outputPNGPath);
    }
}

I was struggling with the same problem and ended up using iTextSharp just to get the page count. Below is a snippet from the production code:
using (var reader = new PdfReader(pdfFile))
{
// as a matter of fact we need iTextSharp PdfReader (and all of iTextSharp) only to get the page count of PDF document;
// unfortunately GhostScript itself doesn't know how to do it
pageCount = reader.NumberOfPages;
}
Not a perfect solution but this is exactly what solved my problem. I left that comment there to remind myself that I have to find a better way somehow but I’ve never bothered to come back because it just works fine as it is...
PdfReader class is defined in iTextSharp.text.pdf namespace.
And I'm using Ghostscript.NET.GhostscriptPngDevice instead of GhostscriptRasterizer to rasterize the specific page of PDF document.
Here is my method that rasterizes the page and saves it to PNG file
private static void PdfToPngWithGhostscriptPngDevice(string srcFile, int pageNo, int dpiX, int dpiY, string tgtFile)
{
    GhostscriptPngDevice dev = new GhostscriptPngDevice(GhostscriptPngDeviceType.PngGray);
    dev.GraphicsAlphaBits = GhostscriptImageDeviceAlphaBits.V_4;
    dev.TextAlphaBits = GhostscriptImageDeviceAlphaBits.V_4;
    dev.ResolutionXY = new GhostscriptImageDeviceResolution(dpiX, dpiY);
    dev.InputFiles.Add(srcFile);
    dev.Pdf.FirstPage = pageNo;
    dev.Pdf.LastPage = pageNo;
    dev.CustomSwitches.Add("-dDOINTERPOLATE");
    dev.OutputPath = tgtFile;
    dev.Process();
}
Hope that would help...

PdfDocument remains locked after closing

I have a windows service which merges PDFs together on the fly and then moves them to another location. I don't have control over what someone wants merged, for the most part. It has happened that every so often a corrupted PDF gets processed and therefore creating the new PdfDocument throws a PdfException "Trailer not found". I am catching the exception and closing the document but it appears after closing the PDF itself is still locked somehow. I need to delete the directory but in trying to do that it throws an IOException and crashes the service.
I have verified that calling the PdfDocument constructor is what locks the pdf and that immediately after closing the file remains locked.
Any ideas? Is there something iText can do to help with this, or do I need to come up with some sort of workaround where I check for corrupted PDFs up front?
ProcessDirectory
private void ProcessDirectory(string directoryPath)
{
    EventLogManager.WriteInformation("ProcessDirectory");
    // DON'T TOUCH THE BACKUPS, ERRORS AND WORK DIRECTORIES. Just in case they were made or renamed after the fact for some reason
    if (directoryPath != this._errorsPath && directoryPath != this._backupsPath && directoryPath != this._workPath)
    {
        string pdfJsonPath = System.IO.Path.Combine(directoryPath, "pdf.json");
        if (File.Exists(pdfJsonPath))
        {
            string workPath = System.IO.Path.Combine(this._workPath, System.IO.Path.GetFileName(directoryPath));
            try
            {
                CopyToDirectory(directoryPath, workPath);
                PdfMerge pdfMerge = null;
                string jsonPath = System.IO.Path.Combine(workPath, "pdf.json");
                using (StreamReader r = Helpers.GetStreamReader(jsonPath))
                {
                    string json = r.ReadToEnd();
                    pdfMerge = JsonConvert.DeserializeObject<PdfMerge>(json);
                }
                FillFormFields(workPath, pdfMerge);
                if (pdfMerge.Pdfs.Any(p => !String.IsNullOrWhiteSpace(p.OverlayFilename)))
                {
                    ApplyOverlays(workPath, pdfMerge);
                }
                MergePdfs(workPath, pdfMerge);
                //NumberPages(workPath, pdfMerge);
                FinishPdf(workPath, pdfMerge);
                // Move original to backups directory
                if (DoSaveBackups)
                {
                    string backupsPath = System.IO.Path.Combine(this._backupsPath, String.Format("{0}_{1}", System.IO.Path.GetFileName(directoryPath), DateTime.Now.ToString("yyyyMMddHHmmss")));
                    Directory.Move(directoryPath, backupsPath);
                }
                else
                {
                    Directory.Delete(directoryPath, true);
                }
            }
            catch (Exception ex)
            {
                EventLogManager.WriteError(ex);
                if (DoSaveErrors)
                {
                    // Move original to errors directory
                    string errorsPath = System.IO.Path.Combine(this._errorsPath, String.Format("{0}_{1}", System.IO.Path.GetFileName(directoryPath), DateTime.Now.ToString("yyyyMMddHHmmss")));
                    Directory.Move(directoryPath, errorsPath);
                }
                else
                {
                    Directory.Delete(directoryPath, true);
                }
            }
            // Delete work directory
            // THIS IS WHERE THE IOEXCEPTION OCCURS AND THE SERVICE CRASHES
            Directory.Delete(workPath, true);
        }
        else
        {
            EventLogManager.WriteInformation(String.Format("No pdf.json file. {0} skipped.", directoryPath));
        }
    }
}
FillFormFields
private void FillFormFields(string directoryPath, PdfMerge pdfMerge)
{
    if (pdfMerge != null && pdfMerge.Pdfs != null)
    {
        string formPath = String.Empty;
        string newFilePath;
        PdfDocument document = null;
        PdfAcroForm form;
        PdfFormField pdfFormField;
        foreach (var pdf in pdfMerge.Pdfs)
        {
            try
            {
                formPath = System.IO.Path.Combine(directoryPath, pdf.Filename);
                newFilePath = System.IO.Path.Combine(
                    directoryPath,
                    String.Format("{0}{1}", String.Format("{0}{1}", System.IO.Path.GetFileNameWithoutExtension(pdf.Filename), "_Revised"), System.IO.Path.GetExtension(pdf.Filename)));
                // THIS IS WHERE THE PDFEXCEPTION OCCURS
                document = new PdfDocument(Helpers.GetPdfReader(formPath), new PdfWriter(newFilePath));
                form = PdfAcroForm.GetAcroForm(document, true);
                if (pdf.Fields != null && pdf.Fields.Count > 0)
                {
                    foreach (var field in pdf.Fields)
                    {
                        if (field.Value != null)
                        {
                            pdfFormField = form.GetField(field.Name);
                            if (pdfFormField != null)
                            {
                                form.GetField(field.Name).SetValue(field.Value);
                            }
                            else
                            {
                                EventLogManager.WriteWarning(String.Format("Field '{0}' does not exist in '{1}'", field.Name, pdf.Filename));
                            }
                        }
                    }
                }
                form.FlattenFields();
            }
            catch (Exception ex)
            {
                throw new Exception(String.Format("An exception occurred filling form fields for {0}", pdf.Filename), ex);
            }
            finally
            {
                if (document != null)
                {
                    document.Close();
                }
            }
            // Now rename the new one back to the old name
            File.Delete(formPath);
            File.Move(newFilePath, formPath);
        }
    }
}
UPDATE
It seems that for everything to be disposed properly, you have to declare separate PdfReader and PdfWriter objects in using statements and pass those into the PdfDocument, like this:
using (reader = Helpers.GetPdfReader(formPath))
{
    using (writer = new PdfWriter(newFilePath))
    {
        using (document = new PdfDocument(reader, writer))
        {
            // The rest of the code here
        }
    }
}
I'm not sure why this is necessary, other than that iText doesn't seem to dispose of the individual PdfReader and PdfWriter when disposing of the PdfDocument, which I assumed it would.

Find out which of the iText 7 classes implement IDisposable (from the documentation, the Visual Studio Object Browser, etc.), and make sure you use them within using blocks, the same way you already have using blocks for StreamReader.
Edit: @sourkrause's solution can be shortened to:
using (reader = Helpers.GetPdfReader(formPath))
using (writer = new PdfWriter(newFilePath))
using (document = new PdfDocument(reader, writer))
{
    // The rest of the code here
}

I know this is an old question, but this is my approach to solving it in iText7, and it is quite different from the accepted answer. Since I could not use using statements, I took a different approach to closing out the document. This may seem like overkill, but it works very well.
First I close the Document:
Document.Close();
Nothing out of the ordinary here. After doing this, however, I close and dispose the Reader and Writer instances. After closing them, I set the writer, reader, and document to null, in that order. The GC should take care of cleaning these up, but in my case the object holding these instances is still in use, so I take this additional step to free up some memory.
Step 2
Writer.Close();
Writer.Dispose();
Writer = null;
Step 3
Reader.SetCloseStream(true);
Reader.Close();
Reader = null;
Step 4
Document = null;
I would suggest wrapping each step in a try/catch; depending on how your code is running, you could see issues doing this all at once.
I believe the most important part here is the actions taken on the reader. For some reason, the reader does not seem to close the stream by default when calling .Close().
Note: while running in production, I noticed that one file (so far, anyway) was still locked when I tried to delete it right after closing. I added a catch that waits a few seconds before trying again. That seems to do the trick for those more "stubborn" files.
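The wait-and-retry catch described above is not shown in the answer. A minimal sketch of what such a helper could look like follows, using only the .NET base class library; the class name, method name, attempt count, and delay are assumptions chosen for illustration, not the author's actual code:

```csharp
using System;
using System.IO;
using System.Threading;

public static class FileCleanup
{
    // Tries to delete a file, pausing between attempts if the delete fails
    // with an IOException (for example because a handle is still open).
    // Returns true once a delete call succeeds, false if every attempt failed.
    public static bool TryDeleteWithRetry(string path, int maxAttempts = 3, int delayMilliseconds = 2000)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                // File.Delete does not throw if the file itself no longer exists.
                File.Delete(path);
                return true;
            }
            catch (IOException)
            {
                // Wait before the next attempt; skip the wait after the last one.
                if (attempt < maxAttempts)
                {
                    Thread.Sleep(delayMilliseconds);
                }
            }
        }
        return false;
    }
}
```

Only IOException is caught here, so other failures such as missing permissions still surface to the caller. Returning a bool instead of rethrowing lets a long-running service log the failure and carry on rather than crash.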

File.WriteAllBytes generates one pdf fine, but multiple pdfs with errors?

I have a loop that goes through some data and generates PDF files. If I generate one PDF by itself, it works just fine (the PDF opens), but if I create two PDF files, the first one opens fine and the second one displays an error saying the file is corrupt or something similar. Is there something I am doing wrong in the loop with the stream, etc.?
foreach (report r in reports)
{
byte[] pdf;
ReportName = r.ReportName;
switch (r.ReportId.ToLower())
{
case "pdf":
pdfBuilder = new pdfHelper(candidate,
pdfTemplates[(Guid)case_report.TemplateId], r.XMLFieldData, DCFormats,
r.ProjectReportName, dependants, DepCount, SpoCount);
pdf = pdfBuilder.GenerateCasePDF();
break;
}
//Add Bookmarks for each report in candidate
ChapterCount++;
ChapterReport = new Chapter(new Paragraph(case_report.ReportName), ChapterCount);
tDoc.Add(ChapterReport);
reader = new PdfReader(pdf);
n = reader.NumberOfPages;
for (int page = 1; page <= n; page++)
copy.AddPage(copy.GetImportedPage(reader, page));
copy.FreeReader(reader);
reader.Close();
}
//Save pdf to folder
ReportName = null;
tDoc.Close();
PubResult = outputStream.ToArray();
File.WriteAllBytes(string.Format(@"{0}\{1}.pdf", JobRootPath, CaseFileName), PubResult);
//Reset for next case
outputStream = new MemoryStream();
tDoc = new iTextSharp.text.Document();
copy = new PdfSmartCopy(tDoc, outputStream);
copy.ViewerPreferences = PdfWriter.PageModeUseOutlines;
copy.SetFullCompression();
tDoc.Open();
}

I'm guessing ChapterCount should get reset to its initial value like all the other variables you're resetting at the end of your for loop.
That aside, I'd recommend moving the body of your for loop, and all related variables, into a new method. Reusing variables tends to lead to errors like this.

Reading data from array and write to append to existing file: My code is not appending data as expected

My application reads data from an array and then writes to an existing file. It should write to the end of the line, but when I run the application it does not append anything.
After researching, I came across this similar post. I modified my code as suggested in that post, and I now receive an error:
'FileStream' is a namespace but is used like a type.
I added the System.IO namespace, but the problem still persists.
This is my code:
private void button1_Click(object sender, EventArgs e)
{
    string file_path = @"C:\Users\myfolder\Desktop\FileStream\Processed\Output.txt";
    string data = " ";
    try
    {
        using (FileStream aFile = new FileStream(file_path, FileMode.Append, FileAccess.Write))
        using (StreamWriter author = new StreamWriter(aFile, true))
        {
            string[] output_receiptNos = ReadFile().ToArray();
            for (int index = 0; index < output_receiptNos.Length; index++)
            {
                data = output_receiptNos[index];
                author.WriteLine(data);
            }
            MessageBox.Show("Data Sucessfully Processed");
        }
    }
    catch (Exception err)
    {
        MessageBox.Show("Could not process Data");
    }
}

Your FileStream adds nothing to the work of the StreamWriter, which already has a constructor that takes a string (the file name) and a boolean (append to or overwrite the existing data). As written, the code also does not compile, because StreamWriter has no constructor that takes a Stream and a Boolean.
As someone has already mentioned in the comments, you probably have a conflict in your code with some namespace oddly named "FileStream" (a bad idea, by the way).
However, I think you can get rid of the error by using the StreamWriter class directly.
Then take your time to find out why the compiler thinks that you have a namespace named "FileStream".
using (StreamWriter author = new StreamWriter(file_path, true))
{
    string[] output_receiptNos = ReadFile().ToArray();
    for (int index = 0; index < output_receiptNos.Length; index++)
    {
        data = output_receiptNos[index];
        author.WriteLine(data);
    }
    MessageBox.Show("Data Sucessfully Processed");
}
