C# Extract text from PDF using PdfSharp

C# Extract text from PDF using PdfSharp - c#

Is there a possibility to extract plain text from a PDF-File with PdfSharp?
I don't want to use iTextSharp because of its license.

Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.
public static class PdfSharpExtensions
{
public static IEnumerable<string> ExtractText(this PdfPage page)
{
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
return text;
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name== OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = cObject as CString;
yield return cString.Value;
}
}
}

I have implemented it somehow similar to how David did it.
Here is my code:
...
{
// ....
var page = document.Pages[1];
CObject content = ContentReader.ReadContent(page);
var extractedText = ExtractText(content);
// ...
}
private IEnumerable<string> ExtractText(CObject cObject)
{
var textList = new List<string>();
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
{
textList.AddRange(ExtractText(cOperand));
}
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
{
textList.AddRange(ExtractText(element));
}
}
else if (cObject is CString)
{
var cString = cObject as CString;
textList.Add(cString.Value);
}
return textList;
}

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.
I've uploaded a simple implementation to github.

Related

Iterating through a list populated by a text file to look for a reference C#

I'm new to C# and i'm trying to check whether a guest has checked in on a hotel app. I'm trying to get all bookings in a text file pass them in to a list then read through the list looking for the booking in reference.
The problem i'm having is that it only seems to put the first line of the text file into the list. Could anyone help me solve this?
One way i've tried:
public void CheckBookingReference()
{
List<string> BookingList = File.ReadAllLines(BookingFilePath).ToList();
foreach (var BookingLine in BookingList.ToList())
{
string[] bookings = BookingLine.Split(',');
int _BookingReferenceNumber = int.Parse(bookings[5]);
if (_BookingReferenceNumber == BookingReferenceNumber)
{
Console.WriteLine("Booking found, check in complete.");
break;
}
else
{
throw new Exception("BookingNotFoundException");
}
}
}
Another way i've also tried:
public void CheckBookingReference()
{
List<string> BookingList = new List<string>();
using (var sr = new StreamReader(BookingFilePath))
{
while (sr.Peek() >= 0)
BookingList.Add(sr.ReadLine());
foreach (var BookingLine in BookingList.ToList())
{
string[] bookings = BookingLine.Split(',');
int _BookingReferenceNumber = int.Parse(bookings[5]);
if (_BookingReferenceNumber == BookingReferenceNumber)
{
//throw new Exception("GuestAlreadyCheckedInException");
Console.WriteLine("booking found");
break;
}
else if (_BookingReferenceNumber != BookingReferenceNumber)
{
Console.WriteLine("not found");
break;
}
}
}
}

With the code you have, if the first line doesn't match, you throw the exception.
Try this:
public void CheckBookingReference()
{
List<string> BookingList = File.ReadAllLines(BookingFilePath).ToList();
foreach (var BookingLine in BookingList.ToList())
{
string[] bookings = BookingLine.Split(',');
int _BookingReferenceNumber = int.Parse(bookings[5]);
if (_BookingReferenceNumber == BookingReferenceNumber)
{
Console.WriteLine("Booking found, check in complete.");
return;
}
}
throw new Exception("BookingNotFoundException");
}

elastic search field update from csv file in c#

I am new to programming in C#
I want to read the csv file and update the fields in elastic search. I already have this function made by someone else and I want to call this function and update fields.
public static bool UpdateDocumentFields(dynamic updatedFields, IHit<ClientDocuments> metaData)
{
var settings = (SystemWideSettings)SystemSettings.Instance;
if (settings == null)
throw new SettingsNotFoundException("System settings have not been initialized.");
if (ElasticSearch.Connect(settings.ElasticAddressString(), null, out ElasticClient elasticClient, out string status))
{
return ElasticSearch.UpdateDocumentFields(updatedFields, metaData, elasticClient);
}
return false;
}
what I have done myself so far is :
private void btnToEs_Click(object sender, EventArgs e)
{
var manager = new StateOne.FileProcessing.DocumentManager.DocumentManager();
var path = "C:/Temp/MissingElasticSearchRecords.csv";
var lines = File.ReadLines(path);
foreach (string line in lines)
{
var item = ClientDocuments.GetClientDocumentFromFilename(line);
if (item != null)
{
var output = new List<ClientDocuments>();
manager.TryGetClientDocuments(item.AccountNumber.ToString(), 100, out output);
if (output!= null && output.Count <= 0)
{
var docs = new List<ClientDocuments>();
//var doc = docs.Add();
// var doc = docs.Add();
//docs.Add(doc);
foreach (var doc in output)
{
if (!item.FileExists)
{
//insert in elastic search
manager.UpdateDocumentFields(output, );
}
else
{
//ignore
}
}
My problem here is manager.updateDocumentFields() is a function. I am calling it but what I need to pass as a parameter? Thanks in advance.

Get string data Xpath

I need help to get data from the site. I use geckofx in my application. I want it to retrieve text data from the xpath location after loading the page
XPathResult xpathResult = geckoWebBrowser1.Document.EvaluateXPath("/html/body/table[3]/tbody/tr[1]/td[2]/a[1]");
IEnumerable<GeckoNode> foundNodes = xpathResult.GetNodes();
How to download data as text?

It looks like you are struggling to retrieve the text from the GeckoFX objects.
Here are a few calls and operations that should get you started:
//get by XPath
XPathResult xpathResult = _browser.Document.EvaluateXPath("//*[#id]/div/p[2]");
var foundNodes = xpathResult.GetNodes();
foreach (var node in foundNodes)
{
var x = node.TextContent; // get text text contained by this node (including children)
GeckoHtmlElement element = node as GeckoHtmlElement; //cast to access.. inner/outerHtml
string inner = element.InnerHtml;
string outer = element.OuterHtml;
//iterate child nodes
foreach (var child in node.ChildNodes)
{
}
}
//get by id
GeckoHtmlElement htmlElementById = _browser.Document.GetHtmlElementById("mw-content-text");
//get by tag
GeckoElementCollection byTag = _browser.Document.GetElementsByTagName("input");
foreach (var ele in byTag)
{
var y = ele.GetAttribute("value");
}
//get by class
var byClass = _browser.Document.GetElementsByClassName("input");
foreach (var node in byClass)
{
//...
}
//cast to a different object
var username = ((GeckoInputElement)_browser.Document.GetHtmlElementById("yourUsername")).Value;
//create new object from DomObject
var button = new GeckoButtonElement(_browser.Document.GetElementById("myBtn").DomObject);

public string extract(string xpath, string type)
{
string result = string.Empty;
GeckoHtmlElement elm = null;
GeckoWebBrowser wb = geckoWebBrowser1;//(GeckoWebBrowser)GetCurrentWB();
if (wb != null)
{
elm = GetElement(wb, xpath);
if (elm != null)
//UpdateUrlAbsolute(wb.Document, elm);
if (elm != null)
{
switch (type)
{
case "html":
result = elm.OuterHtml;
break;
case "text":
if (elm.GetType().Name == "GeckoTextAreaElement")
{
result = ((GeckoTextAreaElement)elm).Value;
}
else
{
result = elm.TextContent.Trim();
}
break;
case "value":
result = ((GeckoInputElement)elm).Value;
break;
default:
result = extractData(elm, type);
break;
}
}
}
return result;
}
private string extractData(GeckoHtmlElement ele, string attribute)
{
var result = string.Empty;
if (ele != null)
{
var tmp = ele.GetAttribute(attribute);
/*if (tmp == null)
{
tmp = extractData(ele.Parent, attribute);
}*/
if (tmp != null)
result = tmp.Trim();
}
return result;
}
private object GetCurrentWB()
{
if (tabControl1.SelectedTab != null)
{
if(tabControl1.SelectedTab.Controls.Count > 0)
//if (tabControl1.SelectedTab.Controls.Count > 0)
{
Control ctr = tabControl1.SelectedTab.Controls[0];
if (ctr != null)
{
return ctr as object;
}
}
}
return null;
}
private GeckoHtmlElement GetElement(GeckoWebBrowser wb, string xpath)
{
GeckoHtmlElement elm = null;
if (xpath.StartsWith("/"))
{
if (xpath.Contains("#class") || xpath.Contains("#data-type"))
{
var html = GetHtmlFromGeckoDocument(wb.Document);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var node = doc.DocumentNode.SelectSingleNode(xpath);
if (node != null)
{
var currentXpath = "/" + node.XPath;
elm = (GeckoHtmlElement)wb.Document.EvaluateXPath(currentXpath).GetNodes().FirstOrDefault();
}
}
else
{
elm = (GeckoHtmlElement)wb.Document.EvaluateXPath(xpath).GetNodes().FirstOrDefault();
}
}
else
{
elm = (GeckoHtmlElement)wb.Document.GetElementById(xpath);
}
return elm;
}
private string GetHtmlFromGeckoDocument(GeckoDocument doc)
{
var result = string.Empty;
GeckoHtmlElement element = null;
var geckoDomElement = doc.DocumentElement;
if (geckoDomElement is GeckoHtmlElement)
{
element = (GeckoHtmlElement)geckoDomElement;
result = element.InnerHtml;
}
return result;
}
private void button5_Click(object sender, EventArgs e)
{
var text = extract("/html/body/table[3]/tbody/tr[1]/td[2]/a[2]", "text");
MessageBox.Show(text);
}
I also insert the code from which I used a little longer code but it also works. maybe someone will need it. The creator of the code is Đinh Công Thắng, Web Automation App,
Regards

Extract text from a XPS Document [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
i need to extract the text of a specific page from a XPS document.
The extracted text should be written in a string. I need this to read out the extracted text using Microsofts SpeechLib.
Please examples only in C#.
Thanks

Add References to ReachFramework and WindowsBase and the following using statement:
using System.Windows.Xps.Packaging;
Then use this code:
XpsDocument _xpsDocument=new XpsDocument("/path",System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
=_xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
IXpsFixedPageReader _page
= _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null )
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
string _fullPageText = _currentText.ToString();
Text exists in Glyphs -> UnicodeString string attribute. You have to use XMLReader for fixed page.

Method that returns text from all pages (modified Amir:s code, hope that's ok):
/// <summary>
/// Get all text strings from an XPS file.
/// Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
if (fixedDocSeqReader == null)
return null;
const string UnicodeString = "UnicodeString";
const string GlyphsString = "Glyphs";
var textLists = new List<List<string>>();
foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
{
foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
{
var pageContentReader = pageReader.XmlReader;
if (pageContentReader == null)
continue;
var texts = new List<string>();
while (pageContentReader.Read())
{
if (pageContentReader.Name != GlyphsString)
continue;
if (!pageContentReader.HasAttributes)
continue;
if (pageContentReader.GetAttribute(UnicodeString) != null)
texts.Add(pageContentReader.GetAttribute(UnicodeString));
}
textLists.Add(texts);
}
}
xpsDocument.Close();
return textLists;
}
Usage:
var txtLists = ExtractTextFromXps(#"C:\myfile.xps");
int pageIdx = 0;
foreach (List<string> txtList in txtLists)
{
pageIdx++;
Console.WriteLine("== Page {0} ==", pageIdx);
foreach (string txt in txtList)
Console.WriteLine(" "+txt);
Console.WriteLine();
}

private string ReadXpsFile(string fileName)
{
XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
= _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
string _fullPageText="";
for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
IXpsFixedPageReader _page
= _document.FixedPages[pageCount];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null)
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
_fullPageText += _currentText.ToString();
}
return _fullPageText;
}

Full Code of Class:
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;
namespace XPS_Data_Transfer
{
internal static class XpsDataReader
{
public static List<string> ReadXps(string address, int pageNumber)
{
var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
if (fixedDocSeqReader == null) return null;
const string uniStr = "UnicodeString";
const string glyphs = "Glyphs";
var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
var page = document.FixedPages[0];
var currentText = new List<string>();
var pageContentReader = page.XmlReader;
if (pageContentReader == null) return null;
while (pageContentReader.Read())
{
if (pageContentReader.Name != glyphs) continue;
if (!pageContentReader.HasAttributes) continue;
if (pageContentReader.GetAttribute(uniStr) != null)
currentText.Add(Dashboard.CleanReversedPersianText(pageContentReader.GetAttribute(uniStr)));
}
return currentText;
}
}
}
that return a list of string data from custom page of custom file.

Get creation date for the file with SharpSvn

I'm playing with SharpSvn and have to analyze folder in repository
and I need to know when some file has be created there,
not last modification date, but when it was created,
Do you have any idea how to do that?
I started with the following:
Collection<SvnLogEventArgs> logitems;
var c = client.GetLog(new Uri(server_path), out logitems);
foreach (var i in logitems)
{
var properties = i.CustomProperties;
foreach (var p in properties)
{
Console.WriteLine(p.ToString());
Console.WriteLine(p.Key);
Console.WriteLine(p.StringValue);
}
}
But I don't see any creation date there.
Does someone know where to get it?

Looks like I can't do that. Here is how I solved this problem:
I'm getting the time if it was the SvnChangeAction.Add.
Here is the code (SvnFile is my own class, not from SharpSvn):
public static List<SvnFile> GetSvnFiles(this SvnClient client, string uri_path)
{
// get logitems
Collection<SvnLogEventArgs> logitems;
client.GetLog(new Uri(uri_path), out logitems);
var result = new List<SvnFile>();
// get craation date for each
foreach (var logitem in logitems.OrderBy(logitem => logitem.Time))
{
foreach (var changed_path in logitem.ChangedPaths)
{
string filename = Path.GetFileName(changed_path.Path);
if (changed_path.Action == SvnChangeAction.Add)
{
result.Add(new SvnFile() { Name = filename, Created = logitem.Time });
}
}
}
return result;
}

Slightly different code than the previous one:
private static DateTime findCreationDate(SvnClient client, SvnListEventArgs item)
{
Collection<SvnLogEventArgs> logList = new Collection<SvnLogEventArgs>();
if (item.BasePath != "/" + item.Name)
{
client.GetLog(new Uri(item.RepositoryRoot + item.BasePath + "/" + item.Name), out logList);
foreach (var logItem in logList)
{
foreach (var changed_path in logItem.ChangedPaths)
{
string filename = Path.GetFileName(changed_path.Path);
if (filename == item.Name && changed_path.Action == SvnChangeAction.Add)
{
return logItem.Time;
}
}
}
}
return new DateTime();
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

C# Extract text from PDF using PdfSharp - c#

Is there a possibility to extract plain text from a PDF-File with PdfSharp? I don't want to use iTextSharp because of its license.

PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators. I've uploaded a simple implementation to github.

Related

Iterating through a list populated by a text file to look for a reference C#

elastic search field update from csv file in c#

Get string data Xpath

Extract text from a XPS Document [closed]

Get creation date for the file with SharpSvn

Categories

Resources