Test PDF files with C# Selenium WebDriver using the iTextSharp library

I need to test a PDF file downloaded from a website. I searched and found the code below, but I don't understand how to open the PDF using only its name.
How can I open it from the Downloads folder?
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System.Text;
namespace PDFExtractor
{
public class PDFExtractor
{
public static string ExtractTextFromPDF(string pdfFileName)
{
StringBuilder result = new StringBuilder();
// Create a reader for the given PDF file
using (PdfReader reader = new PdfReader(pdfFileName))
{
// Read pages
for (int page = 1; page <= reader.NumberOfPages; page++)
{
SimpleTextExtractionStrategy strategy =
new SimpleTextExtractionStrategy();
string pageText =
PdfTextExtractor.GetTextFromPage(reader, page, strategy);
result.Append(pageText);
}
}
return result.ToString();
}
}
}

You just have to pass the full path to the PDF,
like @"C:\Users\Tom\Desktop\PDF.pdf"
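Since the question asks about opening the file by name from the Downloads folder, the path can also be built at run time. The following is only a sketch: the class and method names are made up for illustration, and it assumes the browser saves to the default "Downloads" folder under the current user's profile, which a Selenium test may override in the browser options.

```csharp
using System;
using System.IO;

public static class DownloadedPdf
{
    // Hypothetical helper. Assumption: downloads land in "<user profile>/Downloads",
    // the usual default when no custom download directory is configured.
    public static string PathFor(string pdfFileName)
    {
        string profile = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        return Path.Combine(profile, "Downloads", pdfFileName);
    }
}
```

The result can then be passed to the extractor from the question, for example ExtractTextFromPDF(DownloadedPdf.PathFor("PDF.pdf")).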

Related

'Image' is an ambiguous reference between 'System.Drawing.Image' and 'iText.Layout.Element.Image'

With this code I can split a multi-page TIFF and save the images to files.
public void SplitImage(string file)
{
Bitmap bitmap = (Bitmap)Image.FromFile(file);
int count = bitmap.GetFrameCount(FrameDimension.Page);
var new_files = file.Split("_");
String new_file = new_files[new_files.Length - 1];
for (int idx = 0; idx < count; idx++)
{
bitmap.SelectActiveFrame(FrameDimension.Page, idx);
bitmap.Save($"C:\\temp\\{idx}-{new_file}", ImageFormat.Tiff);
}
}
Here is the code for the PDF creation:
public void CreatePDFFromImages(string path_multi_tiff)
{
Image img = new Image(ImageDataFactory.Create(path_multi_tiff));
var p = new Paragraph("Image").Add(img);
var writer = new PdfWriter("C:\\temp\\test.pdf");
var pdf = new PdfDocument(writer);
var document = new Document(pdf);
document.Add(new Paragraph("Images"));
document.Add(p);
document.Close();
Console.WriteLine("Done !");
}
Now I would like to save the images to PDF pages, and I tried it with iText7. But this fails because
using System.Drawing.Imaging;
using Image = iText.Layout.Element.Image;
are too close to have both in the same class. How can I save the split images to PDF pages? I would like to avoid saving them to files first and then reloading all the images.

The line
using Image = iText.Layout.Element.Image;
is a so-called using alias directive. It creates the alias Image for the namespace or type iText.Layout.Element.Image. If this poses a problem, you can simply create a different alias. For example
using A = iText.Layout.Element.Image;
will create the alias A for the namespace or type iText.Layout.Element.Image.
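To see the mechanism in isolation, here is a small sketch that uses two framework types sharing a short name (System.Threading.Timer and System.Timers.Timer) instead of the two Image types from the question. The alias names and the AliasDemo class are made up for illustration; the same pattern should apply to System.Drawing.Image and iText.Layout.Element.Image.

```csharp
using System;
// Give each conflicting type its own unambiguous alias.
using ThreadingTimer = System.Threading.Timer;
using TimersTimer = System.Timers.Timer;

public static class AliasDemo
{
    public static string Describe()
    {
        // Each alias names exactly one type, so neither reference is ambiguous.
        return typeof(ThreadingTimer).FullName + " / " + typeof(TimersTimer).FullName;
    }
}
```

With one alias per type, both can be used side by side in the same class without fully qualifying every reference.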

Extract value from PDF file to variable

I am trying to get the "Invoice Number" (in this case INV-3337) from a PDF file and would like to store it in a variable for later use in the code.
I am currently working from an example and using this PDF for test purposes:
https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
With my current code I am able to dump the whole content to a .txt file. Can somebody show me how to get only the value I need and store it in a variable? Can it be done directly with iTextSharp? Or do I need to write everything to a .txt file first, parse that file, store the value in a variable, delete the .txt file and then continue?
Note! There will be a lot of PDF files to parse in the real setup.
Here is my current code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
class Program
{
static void Main(string[] args)
{
string filePath = @"C:\temp\parser\Invoice_Template.pdf";
string outPath = @"C:\temp\parser\Invoice_Template.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (StreamWriter file = new StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
EDIT:
Did I understand it right?
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
class Program
{
static void Main(string[] args)
{
string filePath = @"C:\temp\parser\Invoice_Template.pdf";
string outPath = @"C:\temp\parser\Invoice_Template.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (StreamWriter file = new StreamWriter(outPath, true))
{
// file.WriteLine(line);
int indexOccurrance = line.LastIndexOf("Invoice Number");
if(indexOccurrance > 0)
{
var invoiceNumber = line.Substring(indexOccurrance, (line.Length - indexOccurrance) );
}
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}

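One option is to search each line for "Invoice Number" using LastIndexOf.
If it is found, use Substring to take the rest of the line after the label (which will be the invoice number).
Something like:
const string label = "Invoice Number";
int indexOccurrence = line.LastIndexOf(label);
// LastIndexOf returns -1 when there is no match; 0 is a valid match at the start of the line
if (indexOccurrence >= 0)
{
// Skip past the label itself so that only the value remains
var invoiceNumber = line.Substring(indexOccurrence + label.Length).Trim();
}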
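As an alternative sketch, the value can be pulled straight out of the extracted page text with a regular expression, so no intermediate .txt file is needed. The InvoiceParser class and the pattern are assumptions for illustration: the pattern expects the label "Invoice Number", optionally followed by a colon, and then the value on the same line, which may not match how every PDF lays out its text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceParser
{
    // Hypothetical helper: returns the token that follows "Invoice Number",
    // or null when the label does not occur in the text.
    public static string FindInvoiceNumber(string pageText)
    {
        Match m = Regex.Match(pageText, @"Invoice Number\s*:?\s*(\S+)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The string returned by PdfTextExtractor.GetTextFromPage for each page can be passed to FindInvoiceNumber, and the loop can stop at the first non-null result.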

Itextsharp can't extract pdf unicode content in c#

I am trying to get the content of a PDF file using iTextSharp, as you can see:
static void Main(string[] args)
{
StringBuilder text = new StringBuilder();
using (PdfReader reader = new PdfReader(@"D:\a.pdf"))
{
for (int i = 1; i <= reader.NumberOfPages; i++)
{
text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
}
}
System.IO.File.WriteAllText(@"c:/a.txt",text.ToString());
Console.ReadLine();
}
My PDF content is written in Persian, and after running the above code the result is not correct. Should I set any option in iTextSharp?

It is hard to say without the original file, but if characters or words are placed incorrectly, you should try LocationTextExtractionStrategy, like this:
text.Append(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();

public void ExtractTextFromPdf(string path)
{
using (PdfReader reader = new PdfReader(path))
{
StringBuilder text = new StringBuilder();
for (int i = 1; i <= reader.NumberOfPages; i++)
{
// Use a fresh strategy per page; a reused instance keeps the text of earlier pages
ITextExtractionStrategy Strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
string page = PdfTextExtractor.GetTextFromPage(reader, i,Strategy);
string[] lines = page.Split('\n');
foreach (string line in lines)
{
MessageBox.Show(line);
}
}
}
}
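To store the lines in an array rather than showing them one by one, the split step can be isolated into a small helper. This is only a sketch with a made-up class name; it assumes that blank lines and stray '\r' characters in the extracted text are not wanted.

```csharp
using System;
using System.Linq;

public static class PageLines
{
    // Hypothetical helper: splits extracted page text into trimmed, non-empty lines.
    public static string[] ToLines(string pageText)
    {
        return pageText
            .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(l => l.Trim())          // also removes a trailing '\r'
            .Where(l => l.Length > 0)       // drops lines that were only whitespace
            .ToArray();
    }
}
```

The array returned for each page can then be analyzed line by line or appended to a larger list.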

I know this is an old post, but I spent a lot of time trying to figure this out, so I'm sharing it for future readers who google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
class Program
{
static void Main(string[] args)
{
string filePath = @"Your said path\the file name.pdf";
string outPath = @"the output said path\the text file name.txt";
int pagesToScan = 2;
string strText = string.Empty;
try
{
PdfReader reader = new PdfReader(filePath);
for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
{
ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
//creating the string array and storing the PDF line by line
string[] lines = strText.Split('\n');
foreach (string line in lines)
{
//Creating and appending to a text file
using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
{
file.WriteLine(line);
}
}
}
reader.Close();
}
catch (Exception ex)
{
Console.Write(ex);
}
}
}
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.
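The program above opens a new StreamWriter for every single line. As a lighter variant, each page's lines can be appended in one call. This is a sketch only; the TextDump class and AppendPage method are made-up names, and it relies on File.AppendAllLines from System.IO, which creates the file if it does not exist yet.

```csharp
using System;
using System.IO;

public static class TextDump
{
    // Hypothetical helper: appends every line of one page's text to outPath
    // with a single file open instead of one open per line.
    public static void AppendPage(string outPath, string pageText)
    {
        File.AppendAllLines(outPath, pageText.Split('\n'));
    }
}
```

Inside the page loop, TextDump.AppendPage(outPath, strText) would take the place of the inner foreach.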

All the other code samples here didn't work for me, probably due to changes to the itext7 API.
This minimal example here works ok:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

LocationTextExtractionStrategy automatically inserts '\n' in the output text. However, it sometimes inserts '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
return OrientationMagnitude == other.OrientationMagnitude &&
DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the second comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Alternatively, you can put that check in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
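The effect of such a tolerance can be sketched without iTextSharp at all. In the example below, each chunk is a hypothetical stand-in (a perpendicular distance paired with its text), and chunks whose distances differ by less than the tolerance are joined into one line. The LineGrouping class is an illustration of the idea only, not the library's implementation.

```csharp
using System;
using System.Collections.Generic;

public static class LineGrouping
{
    // Hypothetical stand-in for text chunks: Key = perpendicular distance, Value = text.
    // Chunks are assumed to arrive already sorted in reading order.
    public static List<string> GroupIntoLines(IList<KeyValuePair<double, string>> chunks, double tolerance)
    {
        var lines = new List<string>();
        double lastDist = 0;
        foreach (var chunk in chunks)
        {
            // Same line when the distance to the previous chunk is within the tolerance.
            bool sameLine = lines.Count > 0 && Math.Abs(chunk.Key - lastDist) < tolerance;
            if (sameLine)
                lines[lines.Count - 1] += " " + chunk.Value;
            else
                lines.Add(chunk.Value);
            lastDist = chunk.Key;
        }
        return lines;
    }
}
```

Because each chunk is compared with the previous one, a long run of slightly drifting chunks still counts as one line; comparing against the first chunk of the line instead would be a stricter choice.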

Use LocationTextExtractionStrategy in lieu of SimpleTextExtractionStrategy. Text extracted with LocationTextExtractionStrategy contains a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;

Try this (the snippet is Java, using the Java version of iText):
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

using ITextSharp to extract and update links in an existing PDF

I need to post several (read: a lot) PDF files to the web but many of them have hard coded file:// links and links to non-public locations. I need to read through these PDFs and update the links to the proper locations. I've started writing an app using itextsharp to read through the directories and files, find the PDFs and iterate through each page. What I need to do next is find the links and then update the incorrect ones.
string path = "c:\\html";
DirectoryInfo rootFolder = new DirectoryInfo(path);
foreach (DirectoryInfo di in rootFolder.GetDirectories())
{
// get pdf
foreach (FileInfo pdf in di.GetFiles("*.pdf"))
{
string contents = string.Empty;
Document doc = new Document();
PdfReader reader = new PdfReader(pdf.FullName);
using (MemoryStream ms = new MemoryStream())
{
PdfWriter writer = PdfWriter.GetInstance(doc, ms);
doc.Open();
for (int p = 1; p <= reader.NumberOfPages; p++)
{
byte[] bt = reader.GetPageContent(p);
}
}
}
}
Quite frankly, once I get the page content I'm rather lost on this when it comes to iTextSharp. I've read through the itextsharp examples on sourceforge, but really didn't find what I was looking for.
Any help would be greatly appreciated.
Thanks.

This one is a little complicated if you don't know the internals of the PDF format and iText/iTextSharp's abstraction/implementation of it. You need to understand how to use PdfDictionary objects and look things up by their PdfName key. Once you get that you can read through the official PDF spec and poke around a document pretty easily. If you do care, I've included the relevant sections of the PDF spec in parentheses where applicable.
Anyway, a link within a PDF is stored as an annotation (PDF Ref 12.5). Annotations are page-based, so you first need to get each page's annotation array individually. There are many possible types of annotations, so you need to check each one's SUBTYPE and see if it's set to LINK (12.5.6.5). Every link should have an ACTION dictionary associated with it (12.6.2), and you want to check the action's S key to see what type of action it is. There are many possible action types; links specifically could be internal links, open-file links, play-sound links or something else (12.6.4.1). You are looking only for links that are of type URI (note the letter I and not the letter L). URI Actions (12.6.4.7) have a URI key that holds the actual address to navigate to. (There's also an IsMap property for image maps that I can't actually imagine anyone using.)
Whew. Still reading? Below is a full working VS 2010 C# WinForms app based on my post here targeting iTextSharp 5.1.1.0. This code does two main things: 1) Create a sample PDF with a link in it pointing to Google.com and 2) replaces that link with a link to bing.com. The code should be pretty well commented but feel free to ask any questions that you might have.
using System;
using System.Text;
using System.Windows.Forms;
using iTextSharp.text;
using iTextSharp.text.pdf;
using System.IO;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
//Folder that we are working in
private static readonly string WorkingFolder = Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), "Hyperlinked PDFs");
//Sample PDF
private static readonly string BaseFile = Path.Combine(WorkingFolder, "OldFile.pdf");
//Final file
private static readonly string OutputFile = Path.Combine(WorkingFolder, "NewFile.pdf");
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
CreateSamplePdf();
UpdatePdfLinks();
this.Close();
}
private static void CreateSamplePdf()
{
//Create our output directory if it does not exist
Directory.CreateDirectory(WorkingFolder);
//Create our sample PDF
using (iTextSharp.text.Document Doc = new iTextSharp.text.Document(PageSize.LETTER))
{
using (FileStream FS = new FileStream(BaseFile, FileMode.Create, FileAccess.Write, FileShare.Read))
{
using (PdfWriter writer = PdfWriter.GetInstance(Doc, FS))
{
Doc.Open();
//Turn our hyperlink blue
iTextSharp.text.Font BlueFont = FontFactory.GetFont("Arial", 12, iTextSharp.text.Font.NORMAL, iTextSharp.text.BaseColor.BLUE);
Doc.Add(new Paragraph(new Chunk("Go to URL", BlueFont).SetAction(new PdfAction("http://www.google.com/", false))));
Doc.Close();
}
}
}
}
private static void UpdatePdfLinks()
{
//Setup some variables to be used later
PdfReader R = default(PdfReader);
int PageCount = 0;
PdfDictionary PageDictionary = default(PdfDictionary);
PdfArray Annots = default(PdfArray);
//Open our reader
R = new PdfReader(BaseFile);
//Get the page count
PageCount = R.NumberOfPages;
//Loop through each page
for (int i = 1; i <= PageCount; i++)
{
//Get the current page
PageDictionary = R.GetPageN(i);
//Get all of the annotations for the current page
Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
//Make sure we have something
if ((Annots == null) || (Annots.Length == 0))
continue;
//Loop through each annotation
foreach (PdfObject A in Annots.ArrayList)
{
//Resolve the iText-specific object to a generic PDF dictionary
PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);
//Make sure this annotation has a link
if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
continue;
//Make sure this annotation has an ACTION
if (AnnotationDictionary.Get(PdfName.A) == null)
continue;
//Get the ACTION for the current annotation
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
//Test if it is a URI action
if (AnnotationAction.Get(PdfName.S).Equals(PdfName.URI))
{
//Change the URI to something else
AnnotationAction.Put(PdfName.URI, new PdfString("http://www.bing.com/"));
}
}
}
//Next we create a new document and import each page from the reader above
using (FileStream FS = new FileStream(OutputFile, FileMode.Create, FileAccess.Write, FileShare.None))
{
using (Document Doc = new Document())
{
using (PdfCopy writer = new PdfCopy(Doc, FS))
{
Doc.Open();
for (int i = 1; i <= R.NumberOfPages; i++)
{
writer.AddPage(writer.GetImportedPage(R, i));
}
Doc.Close();
}
}
}
}
}
}
EDIT
I should note that this only changes the actual link. Any text within the document won't get updated. Annotations are drawn on top of text but aren't really tied to the text underneath in any way. That's another topic completely.
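The hard-coded bing.com address above stands in for whatever the corrected target should be. For the file:// links mentioned in the question, the replacement address could be computed by a small helper like the one below. This is a sketch under stated assumptions: the LinkRewriter class is made up, and it assumes every file:// link should point to a file of the same name under a single public base URL, which may not match the real folder layout.

```csharp
using System;
using System.Linq;

public static class LinkRewriter
{
    // Hypothetical helper: maps a file:// link to publicBase + file name
    // and leaves every other link unchanged.
    public static string Rewrite(string link, string publicBase)
    {
        Uri uri;
        if (!Uri.TryCreate(link, UriKind.Absolute, out uri) || uri.Scheme != Uri.UriSchemeFile)
            return link;
        // The last path segment is the file name, e.g. "manual.pdf".
        string fileName = uri.Segments.Last();
        return publicBase.TrimEnd('/') + "/" + fileName;
    }
}
```

In the loop above, the value read from the URI key would be passed through Rewrite before being written back with Put.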

Note that if the action is stored as an indirect reference, the following line will not return a dictionary and you will get an error:
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
To handle possibly indirect dictionaries:
PdfDictionary Action = null;
//Get action directly or by indirect reference
PdfObject obj = AnnotationDictionary.Get(PdfName.A);
if (obj.IsIndirect()) {
//Resolve the reference to the dictionary it points to
Action = (PdfDictionary)PdfReader.GetPdfObject(obj);
} else {
Action = (PdfDictionary)obj;
}
In that case you have to inspect the returned dictionary to find where the URI is stored. For example, with an indirect /Launch action the /F item is a PRIndirectReference to a dictionary whose /Type is /FileSpec, and the URI is the value of that dictionary's /F entry.

Added code for dealing with indirect and launch actions and null annotation-dictionary:
PdfReader r = new PdfReader(@"d:\kb2\" + f);
for (int i = 1; i <= r.NumberOfPages; i++) {
//Get the current page
var PageDictionary = r.GetPageN(i);
//Get all of the annotations for the current page
var Annots = PageDictionary.GetAsArray(PdfName.ANNOTS);
//Make sure we have something
if ((Annots == null) || (Annots.Length == 0))
continue;
foreach (var A in Annots.ArrayList) {
var AnnotationDictionary = PdfReader.GetPdfObject(A) as PdfDictionary;
if (AnnotationDictionary == null)
continue;
//Make sure this annotation has a link
if (!AnnotationDictionary.Get(PdfName.SUBTYPE).Equals(PdfName.LINK))
continue;
//Make sure this annotation has an ACTION
if (AnnotationDictionary.Get(PdfName.A) == null)
continue;
var annotActionObject = AnnotationDictionary.Get(PdfName.A);
var AnnotationAction = (PdfDictionary)(annotActionObject.IsIndirect() ? PdfReader.GetPdfObject(annotActionObject) : annotActionObject);
var type = AnnotationAction.Get(PdfName.S);
//Test if it is a URI action
if (type.Equals(PdfName.URI)) {
//Read the current URI, change it as needed, then write it back
string url = AnnotationAction.GetAsString(PdfName.URI).ToString();
AnnotationAction.Put(PdfName.URI, new PdfString(url));
} else if (type.Equals(PdfName.LAUNCH)) {
//Change the URI to something else
var filespec = AnnotationAction.GetAsDict(PdfName.F);
string url = filespec.GetAsString(PdfName.F).ToString();
AnnotationAction.Put(PdfName.F, new PdfString(url));
}
}
}
//Next we create a new document and import each page from the reader above
using (var output = File.OpenWrite(outputFile.FullName)) {
using (Document Doc = new Document()) {
using (PdfCopy writer = new PdfCopy(Doc, output)) {
Doc.Open();
for (int i = 1; i <= r.NumberOfPages; i++) {
writer.AddPage(writer.GetImportedPage(r, i));
}
Doc.Close();
}
}
}
r.Close();
