Extract text from a XPS Document [closed] - c#

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
i need to extract the text of a specific page from a XPS document.
The extracted text should be written in a string. I need this to read out the extracted text using Microsofts SpeechLib.
Please examples only in C#.
Thanks

Add References to ReachFramework and WindowsBase and the following using statement:
using System.Windows.Xps.Packaging;
Then use this code:
XpsDocument _xpsDocument=new XpsDocument("/path",System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
=_xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
IXpsFixedPageReader _page
= _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null )
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
string _fullPageText = _currentText.ToString();
Text exists in Glyphs -> UnicodeString string attribute. You have to use XMLReader for fixed page.

Method that returns text from all pages (modified Amir:s code, hope that's ok):
/// <summary>
/// Get all text strings from an XPS file.
/// Returns a list of lists (one for each page) containing the text strings.
/// </summary>
private static List<List<string>> ExtractTextFromXps(string xpsFilePath)
{
var xpsDocument = new XpsDocument(xpsFilePath, FileAccess.Read);
var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
if (fixedDocSeqReader == null)
return null;
const string UnicodeString = "UnicodeString";
const string GlyphsString = "Glyphs";
var textLists = new List<List<string>>();
foreach (IXpsFixedDocumentReader fixedDocumentReader in fixedDocSeqReader.FixedDocuments)
{
foreach (IXpsFixedPageReader pageReader in fixedDocumentReader.FixedPages)
{
var pageContentReader = pageReader.XmlReader;
if (pageContentReader == null)
continue;
var texts = new List<string>();
while (pageContentReader.Read())
{
if (pageContentReader.Name != GlyphsString)
continue;
if (!pageContentReader.HasAttributes)
continue;
if (pageContentReader.GetAttribute(UnicodeString) != null)
texts.Add(pageContentReader.GetAttribute(UnicodeString));
}
textLists.Add(texts);
}
}
xpsDocument.Close();
return textLists;
}
Usage:
var txtLists = ExtractTextFromXps(#"C:\myfile.xps");
int pageIdx = 0;
foreach (List<string> txtList in txtLists)
{
pageIdx++;
Console.WriteLine("== Page {0} ==", pageIdx);
foreach (string txt in txtList)
Console.WriteLine(" "+txt);
Console.WriteLine();
}

private string ReadXpsFile(string fileName)
{
XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
= _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
string _fullPageText="";
for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
IXpsFixedPageReader _page
= _document.FixedPages[pageCount];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null)
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
_fullPageText += _currentText.ToString();
}
return _fullPageText;
}

Full Code of Class:
using System.Collections.Generic;
using System.Drawing;
using System.Windows.Forms;
using System.Windows.Xps.Packaging;
namespace XPS_Data_Transfer
{
internal static class XpsDataReader
{
public static List<string> ReadXps(string address, int pageNumber)
{
var xpsDocument = new XpsDocument(address, System.IO.FileAccess.Read);
var fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
if (fixedDocSeqReader == null) return null;
const string uniStr = "UnicodeString";
const string glyphs = "Glyphs";
var document = fixedDocSeqReader.FixedDocuments[pageNumber - 1];
var page = document.FixedPages[0];
var currentText = new List<string>();
var pageContentReader = page.XmlReader;
if (pageContentReader == null) return null;
while (pageContentReader.Read())
{
if (pageContentReader.Name != glyphs) continue;
if (!pageContentReader.HasAttributes) continue;
if (pageContentReader.GetAttribute(uniStr) != null)
currentText.Add(Dashboard.CleanReversedPersianText(pageContentReader.GetAttribute(uniStr)));
}
return currentText;
}
}
}
that return a list of string data from custom page of custom file.

Related

OpenXml how to split a word document into pages

I want to split a Word document into separate pages and save them to files like Page_1.docx; Page_2.docx ......
At first I tried to do this:
using (var sourceWordDoc = WordprocessingDocument.Open(
wordFilePath
, false))
{
var sourceElements = sourceWordDoc.MainDocumentPart.Document.Body.Elements();
var pageElements = new List<DocumentFormat.OpenXml.OpenXmlElement>();
var pageIndex = 1;
foreach (var sourceElement in sourceElements)
{
var run = sourceElement.GetFirstChild<Run>();
if (run != null)
{
var lastRenderedPageBreak = run.GetFirstChild<LastRenderedPageBreak>();
var pageBreak = run.GetFirstChild<Break>();
if (lastRenderedPageBreak != null || pageBreak != null)
{
//Create and save page
using (var destinationWordDoc = WordprocessingDocument.Create(
$"Page_{pageIndex}.docx",
DocumentFormat.OpenXml.WordprocessingDocumentType.Document))
{
var destinationPart = destinationWordDoc.AddMainDocumentPart();
destinationPart.Document = new Document();
var destinationBody = destinationPart.Document.AppendChild(new Body());
foreach (var pageElement in pageElements)
{
destinationBody.Append(pageElement.CloneNode(true));
}
destinationPart.Document.Save();
}
pageElements.Clear();
pageIndex++;
}
}
pageElements.Add(sourceElement);
}
}
But in this variand, Header is not copied.
Then I wrote the following code to copy the Header (for test):
using (var sourceWordDoc = WordprocessingDocument.Open(
wordFilePath
, false))
{
var sourceHeaderPart = sourceWordDoc.MainDocumentPart.HeaderParts.FirstOrDefault();
var sourceDocument = sourceWordDoc.MainDocumentPart.Document;
using (var destinationWordDoc = WordprocessingDocument.Create(
somePath
, WordprocessingDocumentType.Document))
{
var destinationMainDocumentPart = destinationWordDoc.AddMainDocumentPart();
destinationMainDocumentPart.Document = new DocumentFormat.OpenXml.Wordprocessing.Document();
var destinationBody = destinationMainDocumentPart.Document.AppendChild(new DocumentFormat.OpenXml.Wordprocessing.Body());
var rId = string.Empty;
if (sourceHeaderPart != null)
{
var destinationHeaderPart = destinationMainDocumentPart.AddNewPart<HeaderPart>();
rId = destinationMainDocumentPart.GetIdOfPart(destinationHeaderPart);
destinationHeaderPart.FeedData(sourceHeaderPart.GetStream());
}
var sourceElements = sourceDocument.Body.Elements();
foreach (var sourceElement in sourceElements)
{
if (sourceElement.GetType() == typeof(DocumentFormat.OpenXml.Wordprocessing.SectionProperties))
{
var sourceSectionsElements = sourceElement.Elements();
foreach (var sourceSectionsElement in sourceSectionsElements)
{
if (sourceSectionsElement.GetType() == typeof(DocumentFormat.OpenXml.Wordprocessing.HeaderReference))
{
var sourceHeaderReference = (DocumentFormat.OpenXml.Wordprocessing.HeaderReference)sourceSectionsElement;
sourceHeaderReference.Id = rId;
break;
}
}
}
destinationBody.Append(sourceElement.CloneNode(true));
}
}
}
But now I have a new problem, images from the header are not transferred.
Are there any options to break the document into separate pages and at the same time preserve the entire content of the page?

C# List embedded files inside a PDF file and get a file stream to one of the embedded files

I can create a PDF with embedded files using LaTeX for example:
\usepackage{embedfile}
\embedfile{abc.data}
\embedfile{def.data}
Using the Acrobat Reader I'm able to extract again the two data files.
But how can I do that from C#?
How can I list the files which are embedded inside a PDF, similiar to a directory list?
How can I get a file stream (readonly) to one of the embedded files inside a PDF?
Using the iTextSharp-LGPL PDF library I was able to solve it.
List embedded file names
public static string[] ListEmbeddedFileNames(string pdfFileName)
{
string[] fileNames = new string[0];
var reader = new iTextSharp.text.pdf.PdfReader(pdfFileName);
if (reader != null)
{
var root = reader.Catalog;
if (root != null)
{
var names = root.GetAsDict(iTextSharp.text.pdf.PdfName.NAMES);
if (names != null)
{
var embeddedFiles = names.GetAsDict(iTextSharp.text.pdf.PdfName.EMBEDDEDFILES);
if (embeddedFiles != null)
{
var namesArray = embeddedFiles.GetAsArray(iTextSharp.text.pdf.PdfName.NAMES);
if (namesArray != null)
{
int n = namesArray.Size / 2; // I don't understand why I have to divide by 2
fileNames = new string[n];
for (int i = 0; i < n; i++) fileNames[i] = namesArray[2 * i].ToString();
}
}
}
}
reader.Close();
}
return fileNames;
}
Get embedded file stream
public static Stream GetEmbeddedFileStream(string pdfFileName, string embeddedFileName)
{
byte[] data = GetEmbeddedFileData(pdfFileName, embeddedFileName);
if (data == null)
return null;
else
return new MemoryStream(data);
}
and
public static byte[] GetEmbeddedFileData(string pdfFileName, string embeddedFileName)
{
byte[] attachedFileBytes = null;
var reader = new iTextSharp.text.pdf.PdfReader(pdfFileName);
if (reader != null)
{
var root = reader.Catalog;
if (root != null)
{
var names = root.GetAsDict(iTextSharp.text.pdf.PdfName.NAMES);
if (names != null)
{
var embeddedFiles = names.GetAsDict(iTextSharp.text.pdf.PdfName.EMBEDDEDFILES);
if (embeddedFiles != null)
{
var namesArray = embeddedFiles.GetAsArray(iTextSharp.text.pdf.PdfName.NAMES);
if (namesArray != null)
{
int n = namesArray.Size;
for (int i = 0; i < n; i++)
{
i++;
var fileArray = namesArray.GetAsDict(i);
var file = fileArray.GetAsDict(iTextSharp.text.pdf.PdfName.EF);
foreach (iTextSharp.text.pdf.PdfName key in file.Keys)
{
string attachedFileName = fileArray.GetAsString(key).ToString();
if (attachedFileName == embeddedFileName)
{
var stream = (iTextSharp.text.pdf.PRStream)iTextSharp.text.pdf.PdfReader.GetPdfObject(file.GetAsIndirectObject(key));
attachedFileBytes = iTextSharp.text.pdf.PdfReader.GetStreamBytes(stream);
break;
}
}
if (attachedFileBytes != null) break;
}
}
}
}
}
reader.Close();
}
return attachedFileBytes;
}
I've checked my solution by computing MD5 checksums over the embedded files.

C# Extract Text from .XPS Document

I have been using Another StackOverflow answer to this question as a reference to solving this problem, however I have run into a problem. I am getting an error at FixedDocumentSequence saying that it could not be found. I have added references to PresentationCore, PresentationFramework, WindowsBase and ReachFramework already, and I'm not quite sure if I need to add another reference for the FixedDocumentSequence.
Here is my code:
public string convertXPS(string fileName)
{
XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
string _fullPageText = "";
for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
IXpsFixedPageReader _page = _document.FixedPages[pageCount];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null)
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
_fullPageText += _currentText.ToString();
}
return _fullPageText;
}
[STAThread]
static void Main(string[] args)
{
try
{
XpsDocument _xpsDocument = new XpsDocument(#"C:\Users\admin-\Desktop\testing.xps", System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
string _fullPageText = "";
for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
IXpsFixedPageReader _page = _document.FixedPages[pageCount];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null)
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
_fullPageText += _currentText.ToString();
}
}
catch(Exception e)
{
}
}
I don't think there is much change in the code, try to add the [STAThread] which helped me to read the xps, also i only used the above mentioned references to read the file,also i got the same error that you got, but somehow solved it, you are 90% closer to get the result
Also see which reference is needed to add System.Windows.Documents;

Extract texts from xps document to textbox

I keep running in to this code when researching, however copying this to my form gives me an error in the documentViewerElement part saying The name 'documentViewerElement' does not exist in the current context
XpsDocument _xpsDocument=new XpsDocument("/path",System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader
=_xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
IXpsFixedPageReader _page
= _document.FixedPages[documentViewerElement.MasterPageNumber];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null )
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
string _fullPageText = _currentText.ToString();
I'm hoping to get all the texts from an xps document and put it on a rich text box.
documentViewerElement is not defined hence your error.
In the following line:
IXpsFixedPageReader _page
= _document.FixedPages[documentViewerElement.MasterPageNumber];
documentViewerElement.MasterPageNumber is just the page number, so change it to the xps page you want to read, e.g.
IXpsFixedPageReader _page
= _document.FixedPages[0];
To read the text from the entire xps file you could try the following (it's pretty much the same as your code it's just looping (Taken from here).
private string ReadXpsFile(string fileName)
{
XpsDocument _xpsDocument = new XpsDocument(fileName, System.IO.FileAccess.Read);
IXpsFixedDocumentSequenceReader fixedDocSeqReader = _xpsDocument.FixedDocumentSequenceReader;
IXpsFixedDocumentReader _document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = _xpsDocument.GetFixedDocumentSequence();
string _fullPageText="";
for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
IXpsFixedPageReader _page = _document.FixedPages[pageCount];
StringBuilder _currentText = new StringBuilder();
System.Xml.XmlReader _pageContentReader = _page.XmlReader;
if (_pageContentReader != null)
{
while (_pageContentReader.Read())
{
if (_pageContentReader.Name == "Glyphs")
{
if (_pageContentReader.HasAttributes)
{
if (_pageContentReader.GetAttribute("UnicodeString") != null)
{
_currentText.
Append(_pageContentReader.
GetAttribute("UnicodeString"));
}
}
}
}
}
_fullPageText += _currentText.ToString();
}
return _fullPageText;
}

C# Extract text from PDF using PdfSharp

Is there a possibility to extract plain text from a PDF-File with PdfSharp?
I don't want to use iTextSharp because of its license.
Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator.
public static class PdfSharpExtensions
{
public static IEnumerable<string> ExtractText(this PdfPage page)
{
var content = ContentReader.ReadContent(page);
var text = content.ExtractText();
return text;
}
public static IEnumerable<string> ExtractText(this CObject cObject)
{
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name== OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
foreach (var txt in ExtractText(cOperand))
yield return txt;
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
foreach (var txt in ExtractText(element))
yield return txt;
}
else if (cObject is CString)
{
var cString = cObject as CString;
yield return cString.Value;
}
}
}
I have implemented it somehow similar to how David did it.
Here is my code:
...
{
// ....
var page = document.Pages[1];
CObject content = ContentReader.ReadContent(page);
var extractedText = ExtractText(content);
// ...
}
private IEnumerable<string> ExtractText(CObject cObject)
{
var textList = new List<string>();
if (cObject is COperator)
{
var cOperator = cObject as COperator;
if (cOperator.OpCode.Name == OpCodeName.Tj.ToString() ||
cOperator.OpCode.Name == OpCodeName.TJ.ToString())
{
foreach (var cOperand in cOperator.Operands)
{
textList.AddRange(ExtractText(cOperand));
}
}
}
else if (cObject is CSequence)
{
var cSequence = cObject as CSequence;
foreach (var element in cSequence)
{
textList.AddRange(ExtractText(element));
}
}
else if (cObject is CString)
{
var cString = cObject as CString;
textList.Add(cString.Value);
}
return textList;
}
PDFSharp provides all the tools to extract the text from a PDF. Use the ContentReader class to access the commands within each page and extract the strings from TJ/Tj operators.
I've uploaded a simple implementation to github.

Categories