Extract value from PDF file to variable

I am trying to get the invoice number (INV-3337 in this case) from a PDF file and store it in a variable for later use in the code.
I am currently working with an example and using this PDF for testing:
https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
With my current code I can dump the whole content to a .txt file. Can somebody show me how to get only the value I need and store it in a variable? Can it be done directly with iTextSharp? Or do I need to write everything to a .txt file first, parse that file, store the value in a variable, delete the .txt file and then continue?
Note: there will be a lot of PDF files to parse in the real setup.
Here is my current code:
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
EDIT:
Did I understand it right?
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.IO;
using System.Text;
namespace PDF_parser
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\temp\parser\Invoice_Template.pdf";
            string outPath = @"C:\temp\parser\Invoice_Template.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (StreamWriter file = new StreamWriter(outPath, true))
                        {
                            // file.WriteLine(line);
                            int indexOccurrance = line.LastIndexOf("Invoice Number");
                            if(indexOccurrance > 0)
                            {
                                var invoiceNumber = line.Substring(indexOccurrance, (line.Length - indexOccurrance) );
                            }
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}

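One option is to search each line for "Invoice Number" using LastIndexOf.
If it is found, use Substring to get the rest of that line after the label (which will be the invoice number).
Something like:
const string label = "Invoice Number";
int indexOccurrance = line.LastIndexOf(label);
// LastIndexOf returns -1 when the label is missing and 0 when the line starts with it
if (indexOccurrance >= 0)
{
    // Skip the label itself and trim the whitespace around the value
    var invoiceNumber = line.Substring(indexOccurrance + label.Length).Trim();
}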
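To keep the value in a variable without going through a .txt file, the same search can be run directly on the string returned by PdfTextExtractor.GetTextFromPage. The following is only a sketch: InvoiceParser and FindInvoiceNumber are hypothetical names, and it assumes the label and the value sit on the same extracted line, as in "Invoice Number INV-3337".

```csharp
using System;
using System.Text.RegularExpressions;

static class InvoiceParser
{
    // Hypothetical helper: returns the first token after "Invoice Number"
    // on the same line, or null when the label is not present.
    public static string FindInvoiceNumber(string pageText)
    {
        // [ \t]* keeps the match on one line; an optional colon after the label is tolerated.
        Match m = Regex.Match(pageText, @"Invoice Number[ \t]*:?[ \t]*(\S+)");
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Inside the page loop this could be called as string invoiceNumber = InvoiceParser.FindInvoiceNumber(strText); and the loop can stop as soon as the result is not null.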

Related

Need to write all results to a txt file

I have the query below and it is working fine; I can see the results in the console. But I need to write all the results to a txt file so I can load it into a table. As written, the code only writes one line to the txt file. How can I make it write every line of the output to the txt file? I'd really appreciate any help with this.
using System;
using Microsoft.AnalysisServices.AdomdClient;
using System.IO;
using System.Diagnostics;
using System.Text;
namespace PowerQry
{
    class Program
    {
#pragma warning disable IDE0060 // Remove unused parameter
        static void Main(string[] args)
#pragma warning restore IDE0060 // Remove unused parameter
        {
            AdomdConnection adomdConnection = new AdomdConnection("Data Source=localhost:49971");
            String query = @"
            EVALUATE
            SUMMARIZECOLUMNS(
            Customer[City],
            Customer[Country-Region],
            Customer[Customer],
            Customer[Customer ID],
            Customer[CustomerKey]
            )
            ";
            AdomdCommand adomdCommand = new AdomdCommand(query, adomdConnection);
            /*******************************************************
            connection
            *******************************************************/
            adomdConnection.Open();
            AdomdDataReader reader = adomdCommand.ExecuteReader();
            // Create a loop for every row in the resultset
            while (reader.Read())
            {
                String rowResults = "";
                // Create a loop for every column in the current row --need to add header
                for (
                    int columnNumber = 0;
                    columnNumber < reader.FieldCount;
                    columnNumber++
                )
                {
                    rowResults += $"\t{reader.GetValue(columnNumber)}";
                }
                //Console.WriteLine(rowResults);
                //--write all lines to txt
                {
                    string UserName = System.Environment.UserName;
                    string fileName = @"C:\Temp\nm.txt";
                    FileStream ostrm;
                    StreamWriter writer;
                    TextWriter oldOut = Console.Out;
                    try
                    {
                        ostrm = new FileStream(fileName, FileMode.OpenOrCreate, FileAccess.Write);
                        writer = new StreamWriter(ostrm);
                        for (int i = 0; i < 10; i++) ;
                    }
                    catch (Exception e)
                    {
                        Console.WriteLine("Cannot open Redirect.txt for writing");
                        Console.WriteLine(e.Message);
                        return;
                    }
                    Console.SetOut(writer);
                    Console.WriteLine(rowResults);
                    Console.SetOut(oldOut);
                    writer.Close();
                    ostrm.Close();
                    Console.WriteLine("Done");
                }
                //==
            }
            adomdConnection.Close();
        }
    }
}

You can use File.AppendAllLines.
Instead of appending to a single string instance, you can use a List<string> (this needs using System.Collections.Generic;):
//String rowResults = "";
List<string> rows = new List<string>();
// Create a loop for every column in the current row --need to add header
for (int columnNumber = 0; columnNumber < reader.FieldCount; columnNumber++)
{
    // The leading tab is dropped because every value is now its own list entry
    rows.Add($"{reader.GetValue(columnNumber)}");
    // rowResults += $"\t{reader.GetValue(columnNumber)}";
}
// Replace the path with your own output file
File.AppendAllLines(@"C:\output.txt", rows);
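If one line per result row is wanted rather than one line per value, the column values can be joined with tabs and all rows written in a single call after the loop. This is a sketch under that assumption; RowWriter, FormatRow and WriteRows are hypothetical names, and File.WriteAllLines replaces any existing file content.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class RowWriter
{
    // Hypothetical helper: turns the values of one result row into a tab-separated line.
    public static string FormatRow(IEnumerable<object> values)
    {
        return string.Join("\t", values.Select(v => Convert.ToString(v)));
    }

    // Hypothetical helper: writes one line per row, overwriting the target file.
    public static void WriteRows(string path, IEnumerable<IEnumerable<object>> rows)
    {
        File.WriteAllLines(path, rows.Select(r => FormatRow(r)));
    }
}
```

Inside the while (reader.Read()) loop the values of each column would be copied into an object[] with reader.GetValue and added to a list; RowWriter.WriteRows is then called once after the loop, so the file is opened only one time.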

C#: how can I read text from tags in an Acrobat PDF?

How can I extract text from the tags using C#?

Although I haven't tested it, something like this came to my mind. If it is not accepted, I can delete it.
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
using System;
using System.Text;
namespace PDFExtractor
{
    public class PDFExtractor
    {
        public static string ExtractTextFromPDF(string pdfFileName)
        {
            StringBuilder result = new StringBuilder();
            // Create a reader for the given PDF file
            using (PdfReader reader = new PdfReader(pdfFileName))
            {
                // Read pages
                for (int page = 1; page <= reader.NumberOfPages; page++)
                {
                    SimpleTextExtractionStrategy strategy =
                        new SimpleTextExtractionStrategy();
                    string pageText =
                        PdfTextExtractor.GetTextFromPage(reader, page, strategy);
                    result.Append(pageText);
                }
            }
            return result.ToString();
        }
        // Returns the text between startTag and endTag, or null if either tag is missing
        public static string GetStrBetweenTags(string value, string startTag, string endTag)
        {
            int start = value.IndexOf(startTag, StringComparison.Ordinal);
            if (start < 0)
                return null;
            start += startTag.Length;
            // Search for the end tag only after the start tag
            int end = value.IndexOf(endTag, start, StringComparison.Ordinal);
            if (end < 0)
                return null;
            return value.Substring(start, end - start);
        }
        // var str = GetStrBetweenTags(ExtractTextFromPDF(@"\path of PDF file\"), "<figure>", "</figure>");
    }
}

You can use PdfPig to extract a page's marked contents and what they contain (text, images, paths and children):
using System;
using UglyToad.PdfPig;
[...]
using (PdfDocument document = PdfDocument.Open("file.pdf"))
{
    for (int i = 0; i < document.NumberOfPages; i++)
    {
        var page = document.GetPage(i + 1);
        var mcs = page.GetMarkedContents();
        foreach (var mc in mcs)
        {
            var letters = mc.Letters;
            var paths = mc.Paths;
            var images = mc.Images;
            foreach (var letter in letters)
            {
                Console.Write(letter.Value);
            }
            Console.WriteLine();
        }
    }
}

How do I create the text file without spaces or empty lines? Just one block of text

This is how I use the text file in the Form1 constructor.
Create an empty text file:
ww = new StreamWriter(@"c:\temp\test.txt");
Set the encoding (the content is Hebrew) and download the page:
client.Encoding = System.Text.Encoding.GetEncoding(1255);
page = client.DownloadString("http://rotter.net/scoopscache.html");
Here client is a WebClient and page is a string.
Then I extract the date and time and the text from the page:
TextExtractor.ExtractDateTime(page, newText, dateTime);
StreamWriter w = new StreamWriter(@"d:\rotterhtml\rotterscoops.html");
w.Write(page);
w.Close();
TextExtractor.ExtractText(@"d:\rotterhtml\rotterscoops.html", newText, dateTime);
Then I write the new content, with the spaces, to the text file test.txt:
combindedString = string.Join(Environment.NewLine, newText);
ww.Write(combindedString);
ww.Close();
Here combindedString is a string.
And this is the TextExtractor class, where I extract the date and time and the text from the page:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Text.RegularExpressions;
namespace ScrollLabelTest
{
    class TextExtractor
    {
        public static void ExtractText(string filePath, List<string> newText, List<string> dateTime)
        {
            //newText = new List<string>();
            List<string> text = new List<string>();
            var htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.OptionFixNestedTags = true;
            htmlDoc.Load(filePath, System.Text.Encoding.GetEncoding(65001));
            if (htmlDoc.DocumentNode != null)
            {
                var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
                foreach (var node in nodes)
                {
                    text.Add(node.InnerText);
                }
            }
            List<string> t = filterNumbers(text);
            for (int i = 0; i < t.Count; i++)
            {
                newText.Add(t[i]);
                newText.Add(dateTime[i]);
                newText.Add("");
            }
        }
        public static void ExtractDateTime(string text, List<string> newText, List<string> dateTime)
        {
            //dateTime = new List<string>();
            string pattern1 = "<span style=color:#000099;>(?'hebrew'[^<]*)</span>";
            Regex expr1 = new Regex(pattern1, RegexOptions.Singleline);
            MatchCollection matches = expr1.Matches(text);
            foreach (Match match in matches)
            {
                string hebrew = match.Groups["hebrew"].Value;
                string pattern2 = @"[^\s$]*:[^:]*:\s+\d\d:\d\d";
                Regex expr2 = new Regex(pattern2);
                Match match2 = expr2.Match(hebrew);
                string results = match2.Value;
                int i = results.IndexOf("שעה");
                results = results.Insert(i + "שעה".Length, " ");
                dateTime.Add("דווח במקור " + results);
            }
        }
        private static List<string> filterNumbers(List<string> mix)
        {
            List<string> onlyStrings = new List<string>();
            foreach (var itemToCheck in mix)
            {
                int number = 0;
                if (!int.TryParse(itemToCheck, out number))
                {
                    onlyStrings.Add(itemToCheck);
                }
            }
            return onlyStrings;
        }
    }
}
And this is the text file test.txt at the end, after all the extraction:
[Image: Text File]
You can see that the first line is empty, and that the first text line does not begin at the left edge but has a space before it. Then there is an empty line between every two lines.
What I want is a text file without any spaces at the beginning of a line, without empty lines between lines, and without the empty first line at the top.
Just one block of text.

This will fix it for you:
using (StreamWriter sw = new StreamWriter(@"C:\temp\test1.txt", false))
{
    using (StreamReader sr = new StreamReader(@"C:\temp\test.txt"))
    {
        while (sr.Peek() >= 0)
        {
            var strReadLine = sr.ReadLine().Trim().Replace("\t", "").Replace("\r\n", "");
            if (!String.IsNullOrWhiteSpace(strReadLine))
            {
                sw.WriteLine(strReadLine);
            }
        }
    }
}
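The same cleanup can also be applied to the lines in memory, before anything is written, which avoids the second file. A minimal sketch, assuming the lines are available as strings (for example the newText list from the question); TextCleaner and CleanLines are hypothetical names. Unlike the loop above, it only trims whitespace at both ends of a line and leaves tabs inside a line untouched.

```csharp
using System.Collections.Generic;
using System.Linq;

static class TextCleaner
{
    // Hypothetical helper: trims every line and drops the ones that end up empty.
    public static IEnumerable<string> CleanLines(IEnumerable<string> lines)
    {
        return lines
            .Select(line => line.Trim())       // remove leading and trailing whitespace
            .Where(line => line.Length > 0);   // skip empty lines
    }
}
```

With the question's variables this would be used as combindedString = string.Join(Environment.NewLine, TextCleaner.CleanLines(newText));. The empty lines themselves appear to come from the newText.Add(""); call in ExtractText, so removing that call is another option.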

Extract text by line from PDF using iTextSharp c#

I need to run some analysis by extracting data from a PDF document.
Using iTextSharp, I called the PdfTextExtractor.GetTextFromPage method to extract the contents of a PDF document, and it returned everything as a single long line.
Is there a way to get the text line by line so that I can store the lines in an array? Analyzing the data by line would be more flexible.
Below is the code I used:
string urlFileName1 = "pdf_link";
PdfReader reader = new PdfReader(urlFileName1);
string text = string.Empty;
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    text += PdfTextExtractor.GetTextFromPage(reader, page);
}
reader.Close();
candidate3.Text = text.ToString();

public void ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Create a new strategy per page; a reused instance keeps the text of earlier pages
            ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
            // One array entry per extracted line
            string[] lines = page.Split('\n');
            foreach (string line in lines)
            {
                MessageBox.Show(line);
            }
        }
    }
}

I know this is posting on an older post, but I spent a lot of time trying to figure this out so I'm going to share this for the future people trying to google this:
using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"Your said path\the file name.pdf";
            string outPath = @"the output said path\the text file name.txt";
            int pagesToScan = 2;
            string strText = string.Empty;
            try
            {
                PdfReader reader = new PdfReader(filePath);
                for (int page = 1; page <= pagesToScan; page ++) //(int page = 1; page <= reader.NumberOfPages; page++) <- for scanning all the pages in A PDF
                {
                    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                    strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                    strText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strText)));
                    //creating the string array and storing the PDF line by line
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        //Creating and appending to a text file
                        using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
                        {
                            file.WriteLine(line);
                        }
                    }
                }
                reader.Close();
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
I had the program read in a PDF, from a set path, and just output to a text file, but you can manipulate that to anything. This was building off of Snziv Gupta's response.

None of the other code samples here worked for me, probably due to changes in the iText 7 API.
This minimal example works OK:
var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

LocationTextExtractionStrategy automatically inserts '\n' into the output text. However, it sometimes inserts '\n' where it shouldn't.
In that case you need to build a custom TextExtractionStrategy or RenderListener. Basically, the code that detects a new line is this method:
public virtual bool SameLine(ITextChunkLocation other) {
    return OrientationMagnitude == other.OrientationMagnitude &&
           DistPerpendicular == other.DistPerpendicular;
}
In some cases '\n' shouldn't be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the comparison to something like Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10.
Alternatively, you can put that check in the RenderText method of your custom TextExtractionStrategy/RenderListener class.
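The effect of such a tolerance can be illustrated without iTextSharp. The sketch below is not the library's implementation: LineGrouper and Chunk are hypothetical stand-ins for a text chunk and its DistPerpendicular value, and the chunks are assumed to arrive in reading order.

```csharp
using System;
using System.Collections.Generic;

static class LineGrouper
{
    // Hypothetical chunk type: a piece of text plus its perpendicular distance.
    public sealed class Chunk
    {
        public readonly string Text;
        public readonly int DistPerpendicular;
        public Chunk(string text, int distPerpendicular)
        {
            Text = text;
            DistPerpendicular = distPerpendicular;
        }
    }

    // Joins chunks into lines. A chunk stays on the current line when its distance
    // differs from the previous chunk's distance by less than the tolerance.
    public static List<string> GroupIntoLines(IEnumerable<Chunk> chunks, int tolerance)
    {
        var lines = new List<string>();
        Chunk previous = null;
        foreach (Chunk chunk in chunks)
        {
            bool sameLine = previous != null &&
                Math.Abs(chunk.DistPerpendicular - previous.DistPerpendicular) < tolerance;
            if (sameLine)
                lines[lines.Count - 1] += " " + chunk.Text;
            else
                lines.Add(chunk.Text);
            previous = chunk;
        }
        return lines;
    }
}
```

With a tolerance of 10, chunks at distances 100, 102 and 101 end up on one line, while a chunk at 140 starts a new one. With an exact comparison, each of the first three would start its own line.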

Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. Text extracted with LocationTextExtractionStrategy has a newline character at the end of each line.
ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;
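Splitting on '\n' alone can leave a trailing '\r' or empty entries in the array, depending on the extracted text. A small sketch of a more forgiving split follows; PageLines and Split are hypothetical names, and dropping empty lines is an assumption that may not suit every analysis.

```csharp
using System;

static class PageLines
{
    // Hypothetical helper: splits extracted page text into lines,
    // accepting both "\r\n" and "\n" and skipping empty lines.
    public static string[] Split(string pageText)
    {
        return pageText.Split(new[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

It would be called on the result of GetTextFromPage, for example string[] lines = PageLines.Split(pdftext);.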

Try
String page = PdfTextExtractor.getTextFromPage(reader, 2);
String s1[]=page.split("\n");

Unable to open CSV file from internet using C#

I'm new to C#. I've written code to open a CSV file from the documents folder on my local machine. It works well and the data parsing works. The trouble is that when I change the code to open the file from an internet site, I cannot get it to work. I am able to open this file using VBA, but I now want to use C# and ADO.NET. I cannot find the answer by searching with Google. Can anyone help with the code and/or point me to a website with a good tutorial? All help is much appreciated. The code is attached; I'm sure the problem is with lines 24 - 26:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Net;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            //
            // Read in a file line-by-line, and store it all in a List.
            //
            int i = 0;
            DateTime dte;
            List<string> list = new List<string>();
            float[] Prices = new float[4];
            WebClient wc = new WebClient();
            byte[] data = wc.DownloadData("http://www.datasource.com/apps/qt/csv/pricehistory.ac?section=yearly_price_download&code=XXX");
            using (StreamReader reader = new StreamReader(wc))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    //list.Add(line); // Add to list.
                    Console.WriteLine(line); // Write to console.
                    string[] parts = line.Split(',');
                    int DateSetter = 1;
                    int DateDone = 0;
                    int CountFloat = 0;
                    int PricesDone = 0;
                    Double Volume = 0;
                    foreach (string part in parts)
                    {
                        Console.WriteLine("{0} : {1}", i, part);
                        if (DateSetter == 1)
                        {
                            dte = DateTime.Parse(part);
                            DateSetter = 2;
                            Console.WriteLine(dte);
                        }
                        if (DateDone == 1)
                        {
                            if (DateSetter < 6)
                            {
                                Prices[CountFloat] = float.Parse(part);
                                CountFloat++;
                                DateSetter++;
                                Console.WriteLine(Prices[3]);
                            }
                        }
                        DateDone = 1;
                        if (PricesDone == 1)
                        {
                            Volume = double.Parse(part);
                            Console.WriteLine(Volume);
                        }
                        if (DateSetter == 6)
                        {
                            PricesDone = 1;
                        }
                    }
                }
            }
            Console.ReadLine();
        }
    }
}

Your code as pasted would not compile. You can, however, use the WebClient to download the content as a string and then split the string into lines:
string content;
using (WebClient wc = new WebClient())
    content = wc.DownloadString("http://www.datasource.com/apps/qt/csv/pricehistory.ac?section=yearly_price_download&code=XXX");
// Split on both Windows and Unix line endings, since the server may not use Environment.NewLine
foreach (var line in content.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None))
{
    //...
}

Another option is to download data as you're doing, and then wrap it with a MemoryStream:
WebClient wc = new WebClient();
byte[] data = wc.DownloadData(
    "http://www.datasource.com/apps/qt/csv/pricehistory.ac?section=yearly_price_download&code=XXX");
using (var ms = new MemoryStream(data))
{
    using (var reader = new StreamReader(ms))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // do whatever
        }
    }
}
The advantage of this over splitting the string is that it uses considerably less memory.
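For the per-line work inside either loop, the flag-based parsing from the question can be reduced to indexing into the split fields. This is a sketch only: PriceRowParser and PriceRow are hypothetical names, and the column layout (a date, four prices, then a volume) is an assumption read off the question's loop, not something the data source is known to guarantee.

```csharp
using System;
using System.Globalization;

static class PriceRowParser
{
    // Hypothetical container for one parsed CSV row.
    public sealed class PriceRow
    {
        public DateTime Date;
        public float[] Prices = new float[4];
        public double Volume;
    }

    // Assumed layout: date, four prices, volume (six comma-separated fields).
    public static PriceRow Parse(string line)
    {
        string[] parts = line.Split(',');
        if (parts.Length < 6)
            throw new FormatException("Expected at least 6 comma-separated fields.");

        var row = new PriceRow();
        // The invariant culture avoids surprises from the machine's regional settings.
        row.Date = DateTime.Parse(parts[0], CultureInfo.InvariantCulture);
        for (int i = 0; i < 4; i++)
            row.Prices[i] = float.Parse(parts[i + 1], CultureInfo.InvariantCulture);
        row.Volume = double.Parse(parts[5], CultureInfo.InvariantCulture);
        return row;
    }
}
```

A header line, if the file has one, would need to be skipped before calling PriceRowParser.Parse(line), since it would fail the date parse.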
