When I extract Urdu (a right-to-left language) text from a PDF using iTextSharp, the text comes out mirrored (reversed). Is there an example I can follow to extract Urdu text correctly from a PDF?
static string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++) // PDF pages are numbered from 1
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
For Persian, which is a right-to-left language just like Urdu, I apply a custom method to the text after the usual extraction with iTextSharp:
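// Requires: using System.Linq;
public string ReverseTheString(string source)
{
    // Return null for null input instead of swallowing every exception.
    if (source == null)
    {
        return null;
    }
    // Reverse the characters of the extracted text.
    return new string(source.Reverse().ToArray());
}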
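A plain character reversal also flips the order of the lines and can detach combining marks from their base letters. Below is a minimal sketch of one possible refinement, assuming the extracted text separates lines with '\n': it reverses each line on its own and reverses by text element rather than by char. The names RtlText, ReverseLine and ReverseEachLine are made up for this illustration, and runs of digits or Latin text inside a line would still come out reversed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, not part of iTextSharp.
public static class RtlText
{
    // Reverses a single line by text element, so a base letter and its
    // combining marks stay together.
    public static string ReverseLine(string line)
    {
        if (string.IsNullOrEmpty(line))
        {
            return string.Empty;
        }
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(line);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }

    // Reverses every line separately, so the order of the lines is kept.
    public static string ReverseEachLine(string text)
    {
        if (text == null)
        {
            return string.Empty;
        }
        return string.Join("\n", text.Split('\n').Select(l => ReverseLine(l)));
    }
}
```

It could be called on the result of the usual extraction, for example RtlText.ReverseEachLine(currentText).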
I am trying to extract data entered into a form by users from a PDF file. The file consists of a few basic fields such as First Name, Surname, Date of Birth etc. The user will fill these fields in and send back the document. I am only interested in extracting the text fields where they have entered their data. Here is the code I have so far, which returns all of the data within the PDF:
public static string extractedText()
{
OpenFileDialog dlg = new OpenFileDialog();
string filepath, text;
dlg.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (dlg.ShowDialog() == DialogResult.OK)
{
filepath = dlg.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
text = strText;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
return strText;
}
else
return "abc";
}
The above code successfully returns all the data, however, I only need a select few fields (text fields). How can I be more specific about the data I am extracting?
I have two PDF libraries; I read all the documents in them and parse specific information. One library processes without issues. The other library returns only the footer of every page, as follows: Page 1 of 6Page 2 of 6Page 3 of 6Page4 of 6.....
The library which is working has one document with multiple pages.
The following is the PDF reader code I am using. Has anyone experienced this behavior before? What is different between the documents, and how should I handle the case where only the footer is returned?
static string ReadPdfFile(string fileName)
{
string curFile = @fileName;
// Console.WriteLine(curFile);
// Console.WriteLine(File.Exists(curFile) ? "File exists." : "File does not exist.");
StringBuilder text = new StringBuilder();
if (File.Exists(curFile))
{
Console.Error.WriteLine("in: " + fileName);
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText =
Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
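for (int i = 1; i <= reader.NumberOfPages; i++)
{
// Create a new strategy for each page: a reused LocationTextExtractionStrategy
// keeps the text of earlier pages and returns it again.
ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);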
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}
I know this is an older post, but I spent a lot of time trying to figure this out, so I'm sharing it for anyone who finds this through Google in the future:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = @"Your said path\the file name.pdf";
string outPath = @"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF from a set path and just write the output to a text file, but you can adapt that to anything. This builds on Snziv Gupta's response.
None of the other code samples here worked for me, probably due to changes in the iText 7 API.
This minimal example works:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Alternatively, you can put that check in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
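The tolerance idea can be sketched in isolation. The ChunkLocation class below is a hypothetical stand-in written for this example, not the iTextSharp type; only the two property names are borrowed from the SameLine method quoted above, and the tolerance of 10 is the value suggested in this answer, which may need tuning for a given document.

```csharp
using System;

// Hypothetical stand-in for a text chunk location; not an iTextSharp class.
public sealed class ChunkLocation
{
    public int OrientationMagnitude { get; private set; }
    public int DistPerpendicular { get; private set; }

    public ChunkLocation(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }

    // Treats two chunks as being on the same line when they share an
    // orientation and their perpendicular distances differ by less than
    // the given tolerance, instead of requiring exact equality.
    public bool SameLine(ChunkLocation other, int tolerance)
    {
        return OrientationMagnitude == other.OrientationMagnitude
            && Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}
```

With a tolerance of 10, chunks at perpendicular distances 100 and 104 count as one line, while 100 and 120 do not.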
Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
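If the extracted text may contain Windows line endings or blank lines, the split can be made a little more forgiving. This is only a sketch: LineSplitter and SplitLines are names invented here, and dropping empty lines is an assumption that may not suit every analysis.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class LineSplitter
{
    // Splits on "\r\n" or "\n" and drops empty entries, so blank lines
    // do not end up in the resulting array.
    public static string[] SplitLines(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return new string[0];
        }
        return text.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

It would be called on the page text, for example LineSplitter.SplitLines(pdftext).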
Try this (Java version of iText):
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");
I have a corrupted .pdf file. When I try to open it, an exception is thrown on the
PdfReader pdfReader = new PdfReader(fileName);
line if there is an error on any page:
Object reference not set to an instance of an object
Full code:
public string ReadFile(string Filename)
{
string fileName = Server.MapPath(@"PDFFiles//" + Filename);
string pdfText = string.Empty;
if (File.Exists(fileName))
{
try
{
// Exception on this line
PdfReader pdfReader = new PdfReader(fileName);
for (int i = 1; i <= pdfReader.NumberOfPages; i++)
{
ITextExtractionStrategy itextextStrat = new pdf.parser.SimpleTextExtractionStrategy();
PdfReader reader = new PdfReader(Filename);
String extractText = PdfTextExtractor.GetTextFromPage(reader, i, itextextStrat);
extractText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(extractText)));
pdfText = pdfText + extractText;
reader.Close();
}
}
catch (Exception e)
{
}
}
return pdfText;
}
But I need to loop through the file without an exception being thrown. If there is an error on a particular page, I have to skip it and move on to the next page. How can I achieve this?
I believe a try-catch block should be enough. Place it inside the loop, around the problematic code, so that any exception is caught there and the loop continues with the next page.
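A minimal sketch of that per-page try-catch, kept independent of iTextSharp so it can be run on its own: extractPage is a hypothetical delegate that in practice would wrap a call such as PdfTextExtractor.GetTextFromPage(reader, page, strategy), and PageLoop and ExtractAll are names made up here. Note that if the PdfReader constructor itself throws, as described in the question, no page can be read at all, so this only helps with errors raised while extracting individual pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class PageLoop
{
    // Calls extractPage for pages 1..pageCount. A page whose extraction
    // throws is skipped and its number is recorded in failedPages.
    public static string ExtractAll(int pageCount, Func<int, string> extractPage, IList<int> failedPages)
    {
        var text = new StringBuilder();
        for (int page = 1; page <= pageCount; page++)
        {
            try
            {
                text.Append(extractPage(page));
            }
            catch (Exception)
            {
                // Remember the failing page and continue with the next one.
                failedPages.Add(page);
            }
        }
        return text.ToString();
    }
}
```

Recording the failed page numbers, rather than leaving the catch block empty, makes it possible to report afterwards which pages were skipped.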