How to only extract text fields from a PDF using iTextSharp C#? - c#

I am trying to extract data entered into a form by users from a PDF file. The file consists of a few basic fields such as First Name, Surname, Date of Birth etc. The user will fill these fields in and send back the document. I am only interested in extracting the text fields where they have entered their data. Here is the code I have so far, which returns all of the data within the PDF:
public static string extractedText()
{
OpenFileDialog dlg = new OpenFileDialog();
string filepath, text;
dlg.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";
if (dlg.ShowDialog() == DialogResult.OK)
{
filepath = dlg.FileName.ToString();
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filepath);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s;
text = strText;
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
return strText;
}
else
return "abc";
}
The above code successfully returns all the data, however, I only need a select few fields (text fields). How can I be more specific about the data I am extracting?
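Values typed into a fillable PDF are normally stored in the document's AcroForm rather than in the page content, so one option is to read the fields directly instead of running text extraction. The following is a minimal sketch, assuming iTextSharp 5.x and a PDF that uses AcroForm (not XFA) fields. The names `FormFieldReader` and `ReadTextFields` are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;

public static class FormFieldReader
{
    // Returns field name -> entered value for every text field in the form.
    // Checkboxes, buttons, lists and signatures are skipped.
    public static Dictionary<string, string> ReadTextFields(string path)
    {
        var result = new Dictionary<string, string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            AcroFields form = reader.AcroFields;
            foreach (string name in form.Fields.Keys)
            {
                // Keep only text fields (First Name, Surname, ...).
                if (form.GetFieldType(name) == AcroFields.FIELD_TYPE_TEXT)
                {
                    result[name] = form.GetField(name);
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return result;
    }
}
```

Calling `FormFieldReader.ReadTextFields(filepath)` with the path chosen in the dialog should give one entry per text field. The keys are the internal field names set by whoever designed the form, which may differ from the labels printed on the page.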

Related

Extracting Urdu Text from PDF using iTextsharp

When extracting Urdu (rtl language) text from pdf using iTextsharp, it's showing me mirror (reversed) text, is there any example I can follow to extract Urdu text correctly from pdf?
static string ReadPdfFile(string fileName)
{
StringBuilder text = new StringBuilder();
if (File.Exists(fileName))
{
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 2; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.UTF8.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}
In Persian which is an rtl language just like Urdu, I use a custom method after usual extraction with iTextSharp:
public string ReverseTheString(string source)
{
try
{
return new string(source.ToCharArray().Reverse().ToArray());
}
catch (Exception ex)
{
return null;
}
}
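Reversing the whole string in one go also reverses the order of the lines, and reversing by `char` can separate combining marks or surrogate pairs from their base characters. A variation that reverses each line on its own, by text element, might look like the sketch below. `RtlText` and `ReverseEachLine` are hypothetical names, and the code assumes lines are separated by '\n' and contain only right-to-left text (embedded numbers or Latin words would be reversed too).

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RtlText
{
    // Reverses the characters of every line while keeping the line order.
    public static string ReverseEachLine(string source)
    {
        if (source == null) return null;
        string[] lines = source.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = ReverseTextElements(lines[i]);
        }
        return string.Join("\n", lines);
    }

    // Reverses by text element so a base character stays together with
    // its combining marks, and surrogate pairs are not split.
    private static string ReverseTextElements(string line)
    {
        var sb = new StringBuilder(line.Length);
        int[] starts = StringInfo.ParseCombiningCharacters(line);
        for (int i = starts.Length - 1; i >= 0; i--)
        {
            int end = (i + 1 < starts.Length) ? starts[i + 1] : line.Length;
            sb.Append(line, starts[i], end - starts[i]);
        }
        return sb.ToString();
    }
}
```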

Error in reading a PDF file C#

I want to read a PDF and then export all of its data to a .doc file. I am doing this with a well-known library, iTextSharp.
However, the PDF has an unusual layout, so the result is not good. The .pdf file example is:
As you can see, in the PDF the choices (A, B, C, D and E) appear to be on one line. Therefore, the result is like this:
How can I do this correctly? How can I write each answer next to its choice, without a newline between them? (I tried SimpleTextExtractionStrategy and LocationTextExtractionStrategy, and neither produces proper output. This is the output of the SimpleTextExtractionStrategy version, which is better than the Location one. The only problem is that the answer and its choice are not on the same line.)
public string ReadPdfFile(string Filename)
{
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(Filename);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy ();
String s = PdfTextExtractor.GetTextFromPage(reader, page, its);
s = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
strText = strText + s + "\r\n";
}
reader.Close();
}
catch (Exception ex)
{
MessageBox.Show(ex.Message);
}
return strText;
}
Thanks

Using iTextSharp for C# and pdfreader returns only footer info

I have two libraries of PDF documents. I read every document in each and parse specific information from it. One library processes without issues. The other library returns only the footer of each page, as follows: Page 1 of 6Page 2 of 6Page 3 of 6Page4 of 6.....
The library which is working has one document with multiple pages.
The following is the PDF reader code I am using. Has anyone experienced this behavior before? What is different between the documents, and how should I handle the case where only the footer is returned?
static string ReadPdfFile(string fileName)
{
string curFile = fileName;
// Console.WriteLine(curFile);
// Console.WriteLine(File.Exists(curFile) ? "File exists." : "File does not exist.");
StringBuilder text = new StringBuilder();
if (File.Exists(curFile))
{
Console.Error.WriteLine("in: " + fileName);
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText =
Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8,
Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
}
pdfReader.Close();
}
return text.ToString();
}

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();
public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
string page = "";
page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}
I know this is posting on an older post, but I spent a lot of time trying to figure this out so I'm going to share this for the future people trying to google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = @"Your said path\the file name.pdf";
string outPath = @"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.
All the other code samples here didn't work for me, probably due to changes in the iText 7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());
LocationTextExtractionStrategy will automatically insert '\n' in the output text. However, sometimes it will insert '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Or you can put that piece of code in the RenderText method of your custom TextExtractionStrategy/RenderListener class
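To illustrate the tolerance idea without depending on iTextSharp internals, here is a standalone sketch. `Chunk` is a hypothetical stand-in for the library's text chunk, `LineGrouper` is a made-up helper, and the tolerance value is whatever suits your document (10 is just the figure suggested above). Chunks whose perpendicular distance differs by less than the tolerance from the first chunk of the current line are treated as the same line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a positioned piece of extracted text.
public sealed class Chunk
{
    public string Text;
    public int DistPerpendicular;   // which line the chunk sits on
    public float DistParallelStart; // position along that line
}

public static class LineGrouper
{
    // Groups chunks into lines, merging chunks whose perpendicular
    // distance is within 'tolerance' of the first chunk on the line.
    public static List<string> GroupIntoLines(IEnumerable<Chunk> chunks, int tolerance)
    {
        var lines = new List<string>();
        var current = new List<Chunk>();
        foreach (Chunk c in chunks.OrderBy(x => x.DistPerpendicular))
        {
            bool startsNewLine = current.Count > 0 &&
                Math.Abs(c.DistPerpendicular - current[0].DistPerpendicular) >= tolerance;
            if (startsNewLine)
            {
                lines.Add(JoinLine(current));
                current.Clear();
            }
            current.Add(c);
        }
        if (current.Count > 0) lines.Add(JoinLine(current));
        return lines;
    }

    // Orders the chunks of one line left to right and joins them with spaces.
    private static string JoinLine(List<Chunk> line)
    {
        return string.Join(" ", line.OrderBy(x => x.DistParallelStart).Select(x => x.Text));
    }
}
```

In a real strategy the same comparison would go inside your own SameLine or RenderText override. The grouping here only shows the effect of replacing the exact equality test with a threshold.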
Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. LocationTextExtractionStrategy extracted text contains the new line character at the end of line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

Uploading a text file in ASP.NET

I have a web page where users can upload text files. A text file (i.e. a file with the extension .txt) could be in any of several encodings, e.g. ASCII, UTF-8, Unicode, etc. I'm trying to validate the contents in memory before I save the file to disk; if the content is not valid, I don't save the file. I'm reading the content from the file upload control (fileUpload1.FileContent, which returns a stream of bytes). Is there an easy way in .NET to convert that byte stream to a string, or will I have to check the first bytes to detect the encoding first?
Thanks
I think you can do this:
StreamReader reader = new StreamReader(fileUpload1.FileContent);
string text = reader.ReadToEnd();
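A slightly more explicit variant is sketched below. StreamReader can detect UTF-8, UTF-16 and UTF-32 from a byte order mark, and falls back to the encoding you pass in when there is none, so you do not have to inspect the first bytes yourself. `UploadText` and `ReadText` are hypothetical names. Note that a file without a BOM in some other encoding (for example a legacy code page) cannot be detected this way and will simply be decoded with the fallback.

```csharp
using System;
using System.IO;
using System.Text;

public static class UploadText
{
    // Reads the whole stream as text. A byte order mark, if present,
    // overrides 'fallback'. The stream is left open and rewound so the
    // upload can still be saved afterwards.
    public static string ReadText(Stream content, Encoding fallback)
    {
        using (var reader = new StreamReader(content, fallback, true, 4096, true))
        {
            string text = reader.ReadToEnd();
            if (content.CanSeek)
            {
                content.Position = 0; // rewind for later use
            }
            return text;
        }
    }
}
```

With the upload control from the question this would be called as `UploadText.ReadText(fileUpload1.FileContent, Encoding.UTF8)`.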
Example of text file format
Code#Name#Fathername#DOB#Location#MobileNo
1#XYZ#YYY#09-06-89#LKO#9999999999
protected void btnUpload_click(object sender, EventArgs e)
{
if (Page.IsValid)
{
bool logval = true;
if (logval == true)
{
if (fuUploadExcelName.HasFile)
{
String img_1 = fuUploadExcelName.PostedFile.FileName;
String img_2 = System.IO.Path.GetFileName(img_1);
string extn = System.IO.Path.GetExtension(img_1);
string frstfilenamepart = "Text" + DateTime.Now.ToString("ddMMyyyyhhmmss");/*Filename for storing in Desired path*/
UploadExcelName.Value = frstfilenamepart + extn;
fuUploadExcelName.SaveAs(Server.MapPath("~/Text/") + "/" + UploadExcelName.Value);/*Uploaded text file will be store at this path*/
string filename = UploadExcelName.Value;
string filePath = Server.MapPath("~/Text/" + filename);
StreamReader file = new StreamReader(filePath);
string[] ColumnNames = file.ReadLine().Split('#');/*read data from textfile*/
DataTable dt = new DataTable();
foreach (string Column in ColumnNames)
{
dt.Columns.Add(Column);/*adding the columns*/
}
string NewLine;
while ((NewLine = file.ReadLine()) != null)
{
DataRow dr = dt.NewRow();
string[] values = NewLine.Split('#');
for (int i = 0; i < values.Length; i++)
{
dr[i] = values[i].TrimEnd();
}
dt.Rows.Add(dr);
}
file.Close();
grdview.DataSource = dt;/*make datasource from text file*/
grdview.DataBind();/*binding the grid*/
}
}
}
}
