Skip corrupted page in pdf using C#

I have a corrupted .pdf file. When I try to open it, an exception is thrown on the line
PdfReader pdfReader = new PdfReader(fileName);
if there is an error on any page:
Object reference not set to an instance of an object
Full code:
public string ReadFile(string Filename)
{
string fileName = Server.MapPath(@"PDFFiles//" + Filename);
string pdfText = string.Empty;
if (File.Exists(fileName))
{
try
{
// Exception on this line
PdfReader pdfReader = new PdfReader(fileName);
for (int i = 1; i <= pdfReader.NumberOfPages; i++)
{
ITextExtractionStrategy itextextStrat = new pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader(Filename);
String extractText = PdfTextExtractor.GetTextFromPage(reader, i, itextextStrat);
extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
pdfText = pdfText + extractText;
reader.Close();
}
}
catch (Exception e)
{
}
}
return pdfText;
}
But I need to loop through the file without an exception being thrown. If there is an error on a particular page, I want to skip it and move on to the next page. How can I achieve this?

I believe that a try-catch block should be enough for you. Wrap the problematic code inside the loop body so that an exception on one page is caught there and the loop continues with the next page.
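A minimal sketch of that idea, assuming the failure occurs while a single page is being extracted rather than when the file is opened. The helper name ExtractSkippingBadPages and its extractPage delegate are hypothetical and not part of iTextSharp; they only illustrate where the try-catch goes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PageSkipper
{
    // Hypothetical helper: calls extractPage for pages 1..pageCount and
    // skips any page whose extraction throws, recording its page number.
    public static string ExtractSkippingBadPages(
        int pageCount, Func<int, string> extractPage, List<int> skippedPages)
    {
        var text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                text.Append(extractPage(page));
            }
            catch (Exception)
            {
                // Only this page is lost; the loop moves on to the next one.
                skippedPages.Add(page);
            }
        }
        return text.ToString();
    }
}
```

With iTextSharp 5, extractPage could be p => PdfTextExtractor.GetTextFromPage(pdfReader, p, new SimpleTextExtractionStrategy()) and pageCount would be pdfReader.NumberOfPages. If new PdfReader(fileName) itself throws, as described in the question, the file cannot be opened at all and there are no pages to skip to.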

Related

Performance issue when using old iText for .NET version when splitting a huge PDF

My goal here is to split a huge PDF (over 1000 pages).
I tried the example below:
public void ExtractPages(string sourcePdfPath, string outputPdfPath,
int startPage, int endPage)
{
PdfReader reader = null;
Document sourceDocument = null;
PdfCopy pdfCopyProvider = null;
PdfImportedPage importedPage = null;
try
{
// Initialize a new PdfReader instance with the contents of the source Pdf file:
reader = new PdfReader(sourcePdfPath);
// For simplicity, I am assuming all the pages share the same size
// and rotation as the first page:
sourceDocument = new Document(reader.GetPageSizeWithRotation(startPage));
// Initialize an instance of the PdfCopyClass with the source
// document and an output file stream:
pdfCopyProvider = new PdfCopy(sourceDocument,
new System.IO.FileStream(outputPdfPath, System.IO.FileMode.Create));
sourceDocument.Open();
// Walk the specified range and add the page copies to the output file:
for (int i = startPage; i <= endPage; i++)
{
importedPage = pdfCopyProvider.GetImportedPage(reader, i);
pdfCopyProvider.AddPage(importedPage);
}
sourceDocument.Close();
reader.Close();
}
catch (Exception)
{
// Rethrow without resetting the stack trace
throw;
}
}
It works, but it takes more than 10 minutes.
Is there any way to avoid the for loop and get all the pages more quickly?

ItextSharp-Exception in the reader initialization "The document has no page root"

Hi, I'm trying to read various PDF files with iTextSharp.dll, and some of them throw an exception when I try to read them: "The document has no page root (meaning: it's an invalid PDF).". I ran some tests with the Merge example from the iText web page (Merge-Example), and those were successful. Can someone help me see what I am doing wrong?
This is my code:
public void MergeFiles(String[] strFiles, String strFileresult)
{
Document document = new Document();
PdfCopy copy;
copy = new PdfCopy(document, new FileStream(strFileresult, FileMode.Create));
document.Open();
PdfReader[] reader = new PdfReader[3];
for (int i = 0; i < strFiles.Count(); i++)
{
reader[i] = new PdfReader(strFiles[i]);
copy.AddDocument(reader[i]);
}
document.Close();
for (int i = 0; i < reader.Count(); i++)
{
reader[i].Close();
}
}

I'm not sure what's causing your exact problem, but once we get rid of the unnecessary internal arrays and switch to the using pattern for automatic cleanup, everything works just fine.
public void MergeFiles(string[] strFiles, String strFileresult) {
using( var document = new Document()) {
using (var copy = new PdfCopy(document, new FileStream(strFileresult, FileMode.Create))) {
document.Open();
foreach( var file in strFiles) {
using (var reader = new PdfReader(file)) {
copy.AddDocument(reader);
}
}
document.Close();
}
}
}

Error in reading a PDF file C#

I want to read a PDF and then export all of its data to a .doc file. I am doing this with a well-known library, iTextSharp.
However, the .pdf file has an unusual layout, so the result is not good. An example of the .pdf file is:
As you can see, in the PDF file the choices (A, B, C, D and E) appear to be on one line. The result therefore looks like this:
How can I do this correctly? How can I write each answer next to its choice letter without a newline between them? (I tried SimpleTextExtractionStrategy and LocationTextExtractionStrategy. Neither produces proper output. This is the output of the SimpleTextExtractionStrategy version, which is better than the Location one. The only problem is that the answer and its choice are not on the same line.)
public string ReadPdfFile(string Filename)
{
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(Filename);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy ();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s + "\r\n";
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
return strText;
}
Thanks

Using iTextSharp for C# and pdfreader returns only footer info

I have two libraries of PDF documents, and I read every document and parse specific information from it. One library processes without issues. The other library returns only the footer of each page, as follows: Page 1 of 6Page 2 of 6Page 3 of 6Page4 of 6.....
The library which is working has one document with multiple pages.
The following is the PDF reading code I am using. Has anyone experienced this behavior before? What is different between the documents, and how should I handle the case where only the footer is returned?
static string ReadPdfFile(string fileName)
{
string curFile = @fileName;
// Console.WriteLine(curFile);
// Console.WriteLine(File.Exists(curFile) ? "File exists." : "File does not exist.");
StringBuilder text = new StringBuilder();
if (File.Exists(curFile))
{
Console.Error.WriteLine("in: " + fileName);
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText =
Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();

public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string page = "";
page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}

I know this is an older post, but I spent a lot of time trying to figure this out, so I'm sharing it for anyone who finds this through a search:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = @"Your said path\the file name.pdf";
string outPath = @"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.

All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is the method
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10
Or you can put that logic in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
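The tolerance check described above can be sketched on its own. The ChunkLocation class below is a hypothetical stand-in with the same two property names as the snippet above, not the library's own type, and the tolerance value is an assumption that would need tuning for a given document.

```csharp
using System;

// Hypothetical stand-in for a text chunk location; it is not the
// iTextSharp type, it only mirrors the two properties used above.
public sealed class ChunkLocation
{
    public int OrientationMagnitude { get; }
    public int DistPerpendicular { get; }

    public ChunkLocation(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }

    // Treats two chunks as being on the same line when they share an
    // orientation and their perpendicular distances differ by less
    // than the given tolerance, instead of requiring exact equality.
    public bool SameLine(ChunkLocation other, int tolerance)
    {
        return OrientationMagnitude == other.OrientationMagnitude &&
               Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}
```

A larger tolerance merges more chunks into one line, so a value that suits one document may join separate lines in another.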

Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. Text extracted with LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;

Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");
