Read a docx file in C# using OpenXml

I am new to C# and OpenXml. I need help with reading a .docx file and storing each paragraph in an array.
I am using OpenXml to read a Word (.docx) file. I was able to read the file and print it, but only as one concatenated string. I couldn't find a way to store each paragraph as an array of strings (in Python, the docx library gives you the paragraphs as a list of strings automatically; I am looking for something similar).
using System;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
namespace ConsoleApp1
{
class Program
{
static void Main(string[] args)
{
OpenWordprocessingDocumentReadonly(@"E:\WordDocTest\Test.docx");
}
public static void OpenWordprocessingDocumentReadonly(string filepath)
{
// Open a WordprocessingDocument based on a filepath.
using (WordprocessingDocument wordDocument =
WordprocessingDocument.Open(filepath, false))
{
// Assign a reference to the existing document body.
Body body = wordDocument.MainDocumentPart.Document.Body;
Console.WriteLine(body.InnerText);
wordDocument.Close();
}
}
}
}
Test.docx looks like this:
1. Test
This is Test 1.
Test1 part a.
2. noTest
This is Test2.
The output that I got was: TestThis is Test 1.Test1 part a.noTestThis is Test 2.
What I want to learn is how to store each paragraph or line in an array of strings and iterate through that array.

You can avoid using arrays and instead use the power of OpenXml combined with LINQ and lists. If you want to work with paragraphs, you could create a list like this:
var paras = body.OfType<Paragraph>();
You can then expand on this to return specific elements using Where, for example:
var paras = body.OfType<Paragraph>()
.Where(p => p.ParagraphProperties != null &&
p.ParagraphProperties.ParagraphStyleId != null &&
p.ParagraphProperties.ParagraphStyleId.Val.Value.Contains("Heading1")).ToList();
To return the paragraph that follows each heading, you could loop through the paras list and take the next sibling, e.g.
foreach (var para in paras) {
// NextSibling() returns null when the heading is the last element
richTextBox1.Text += para.NextSibling()?.InnerText + "\n";
}
This of course assumes you are printing the output to a RichTextBox. It shows whatever element happens to come after the heading. Again, your code could include .Where to filter the results.
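For what the question literally asks (one string per paragraph), the same kind of query can be projected to text and materialized as an array. This is a minimal sketch, assuming the DocumentFormat.OpenXml NuGet package is referenced. ParagraphReader and ReadParagraphs are hypothetical names, and the demo builds a small Body in memory rather than opening the asker's Test.docx.

```csharp
using System;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class ParagraphReader   // hypothetical helper, not part of the SDK
{
    // One array entry per top-level paragraph in the body.
    public static string[] ReadParagraphs(Body body) =>
        body.Elements<Paragraph>()
            .Select(p => p.InnerText)
            .ToArray();

    // Same thing for a file on disk, opened read-only.
    public static string[] ReadParagraphs(string filepath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filepath, false))
        {
            return ReadParagraphs(doc.MainDocumentPart.Document.Body);
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        // In-memory stand-in for the first lines of Test.docx.
        var body = new Body(
            new Paragraph(new Run(new Text("Test"))),
            new Paragraph(new Run(new Text("This is Test 1."))));

        // The result is an ordinary string[] that can be iterated or indexed.
        foreach (string line in ParagraphReader.ReadParagraphs(body))
            Console.WriteLine(line);
    }
}
```

Automatic list numbers such as "1." are generated by Word's numbering and are not part of InnerText, which matches the output shown in the question.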

Related

How to find the closest sub-paragraph where a table is, in a docx document?

From a docx file, I would like to extract only the tables and their related heading. In other words, I am interested in the tables and the heading each table belongs to ("lies under").
I am using DocumentFormat.OpenXml library.
Here is my draft:
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
...
using (var doc = WordprocessingDocument.Open(filewithpath, false))
{
Body body = doc.MainDocumentPart.Document.Body;
List<Table> tables = GetTables(body);
List<string> Paragraphs = new List<string>();
foreach (Table table in tables)
{
Paragraphs.Add(table.???); //I have no idea what to write here
}
}
Thanks in advance!

You can loop through the siblings that precede the table and look for a "Heading" style in each paragraph's style info.
You can check this answer to get the "heading" style.
A sketch would look like this (it assumes the default style ids such as "Heading1" and needs using System.Linq):
// ElementsBefore() yields the siblings before the table in document order
var heading = table.ElementsBefore()
.OfType<Paragraph>()
.Reverse() // start from the closest sibling
.FirstOrDefault(p => p.ParagraphProperties?.ParagraphStyleId?.Val?.Value?.Contains("Heading") == true);
string title = heading?.InnerText; // null if no heading precedes the table

Duplication with OpenXML (word document) and ID issues

Is it possible to duplicate a Word document element with OpenXML without running into "duplicate id" issues?
Currently, to duplicate, I clone the elements inside the body and append the clones to the body. But if any of the elements has an ID, I get errors when I open the document in Word.
Here is an example of an error from the OpenXML validator:
[60] Description="Attribute 'id' should have unique value. Its
current value 'Rectangle 11' duplicates with
others."
And here is my code :
Document document = wordDocument.MainDocumentPart.Document;
Body body = document.Body;
IEnumerable<OpenXmlElement> elements = ((Body)body.CloneNode(true)).Elements();
foreach (var element in elements)
{
OpenXmlElement e = (OpenXmlElement)element.CloneNode(true);
body.AppendChild(e);
}

You can't just copy elements that have an id; you have to duplicate the Parts too (search for OpenXmlPart for more information).
You can do this by combining the AddPart() and GetIdOfPart() functions (accessible from MainDocumentPart).
First option:
When you have an element with an id, use AddPart(OpenXmlPart part) to add the element's part, and retrieve the newly generated id of the part with GetIdOfPart(OpenXmlPart part).
After that, replace the id in your cloned OpenXmlElement with the new one.
Second option:
Check the highest id of the existing parts (and save it).
Clone all parts from the start and choose the ids yourself (by adding the saved highest id).
When you copy each element and find an id, add the saved highest id so that it matches the new part.
I hope one of these approaches helps, but in any case you will need to clone the parts.
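The "find the highest id, then number upward from it" idea in the second option can also be applied to element-level ids. The sketch below is an assumption-laden illustration rather than the answerer's code: it renumbers only the numeric drawing ids (wp:docPr, DocProperties.Id in the SDK) inside the clones. BodyDuplicator and DuplicateContent are hypothetical names. String-valued ids such as the 'Rectangle 11' in the quoted validator message are likely VML shape ids, which this sketch does not touch and which would need analogous handling.

```csharp
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;
using DW = DocumentFormat.OpenXml.Drawing.Wordprocessing;

public static class BodyDuplicator   // hypothetical helper, not part of the SDK
{
    // Clones every block in the body and gives drawing objects in the clones
    // fresh wp:docPr ids, starting above the highest id already in use.
    public static void DuplicateContent(Body body)
    {
        uint nextId = body.Descendants<DW.DocProperties>()
            .Select(p => p.Id?.Value ?? 0u)
            .DefaultIfEmpty(0u)
            .Max() + 1;

        // Snapshot first: the body is modified while we loop.
        var originals = body.Elements()
            .Where(e => !(e is SectionProperties))
            .ToList();
        SectionProperties sectPr = body.Elements<SectionProperties>().LastOrDefault();

        foreach (OpenXmlElement original in originals)
        {
            OpenXmlElement clone = original.CloneNode(true);
            foreach (DW.DocProperties props in clone.Descendants<DW.DocProperties>())
                props.Id = nextId++;

            // Keep the section properties as the last child of the body.
            if (sectPr != null) body.InsertBefore(clone, sectPr);
            else body.AppendChild(clone);
        }
    }
}
```

Running the OpenXML validator again after the call is a reasonable way to see which id collisions, if any, remain.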

DocIO is a .NET class library that can read, write and render Microsoft Word documents. Using DocIO, you can clone elements such as a paragraph, table, or text run, or the entire document, and append them where you need.
The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify. The community license is the full product with no limitations or watermarks.
Here is a simple example code snippet that clones all the paragraphs and tables in the document body and appends them at the end of the same document.
using Syncfusion.DocIO.DLS;
namespace DocIO_Clone
{
class Program
{
static void Main(string[] args)
{
using (WordDocument document = new WordDocument(@"InputWordFile.docx"))
{
int sectionCount = document.Sections.Count;
for (int i = 0; i < sectionCount; i++)
{
IWSection section = document.Sections[i];
int entityCount = section.Body.ChildEntities.Count;
for (int j = 0; j < entityCount; j++)
{
IEntity entity = section.Body.ChildEntities[j];
switch(entity.EntityType)
{
case EntityType.Paragraph:
IWParagraph paragraph = entity.Clone() as IWParagraph;
document.LastSection.Body.ChildEntities.Add(paragraph);
break;
case EntityType.Table:
IWTable table = entity.Clone() as IWTable;
document.LastSection.Body.ChildEntities.Add(table);
break;
}
}
}
document.Save("ResultDocument.docx");
}
}
}
}
For further information, please refer to our help documentation.
Note: I work for Syncfusion

Conditional new Break for multi-column docx file, C#

This is a follow-up question for Creating Word file from ObservableCollection with C#.
I have a .docx file with a Body that has 2 columns for its SectionProperties. I have a dictionary of foreign words with their translation. On each line I need [Word] = [Translation] and whenever a new letter starts it should be in its own line, with 2 or 3 line breaks before and after that letter, like this:
A
A-word = translation
A-word = translation
B
B-word = translation
B-word = translation
...
I structured this in a for loop, so that in every iteration I'm creating a new paragraph with a possible Run for the letter (if a new one starts), a Run for the word and a Run for the translation. So the Run with the first letter is in the same Paragraph as the word and translation Run and it appends 2 or 3 Break objects before and after the Text.
In doing so the second column can sometimes start with 1 or 2 empty lines. Or the first column on the next page can start with empty lines.
This is what I want to avoid.
So my question is, can I somehow check if the end of the page is reached, or the text is at the top of the column, so I don't have to add a Break? Or, can I format the Column itself so that it doesn't start with an empty line?
I have tried putting the letter Run in a separate, optional, Paragraph, but again, I find myself having to input line breaks and the problem remains.

In the spirit of my other answer you can extend the template capability.
Use the Productivity tool to generate a single page break object, something like:
private readonly Paragraph PageBreakPara = new Paragraph(new Run(new Break() { Type = BreakValues.Page}));
Make a helper method that finds containers of a text tag:
public IEnumerable<T> FindElements<T>(OpenXmlCompositeElement searchParent, string tagRegex)
where T: OpenXmlElement
{
var regex = new Regex(tagRegex);
return searchParent.Descendants()
.Where(e=>(!(e is OpenXmlCompositeElement)
&& regex.IsMatch(e.InnerText)))
.SelectMany(e =>
e.Ancestors()
.OfType<T>()
.Union(e is T ? new T[] { (T)e } : new T[] {} ))
.ToList(); // can skip, prevents reevaluations
}
And another one that duplicates a range from the document (a DeleteRange counterpart would delete the range instead):
public IEnumerable<T> DuplicateRange<T>(OpenXmlCompositeElement root, string tagRegex)
where T: OpenXmlElement
{
// tagRegex must describe exactly two tags, such as [pageStart] and [pageEnd]
// or [page] [/page] - or whatever pattern you choose
var tagElements = FindElements(root, tagRegex);
var fromEl = tagElements.First();
var toEl = tagElements.Skip(1).First(); // throws exception if less than 2 el
// you may want to find a common parent here
// I'll assume you've prepared the template so the elements are siblings.
var result = new List<OpenXmlElement>();
var step = fromEl.NextSibling();
while (step !=null && toEl!=null && step!=toEl){
// another method called DeleteRange will instead delete elements in that range within this loop
var copy = step.CloneNode(true); // CloneNode needs the deep flag; true copies the children too
toEl.InsertAfterSelf(copy);
result.Add(copy);
step = step.NextSibling();
}
return result;
}
public IEnumerable<OpenXmlElement> ReplaceTag(OpenXmlCompositeElement parent, string tagRegex, string replacement){
var replaceElements = FindElements<OpenXmlElement>(parent, tagRegex);
var regex = new Regex(tagRegex);
// InnerText is read-only, so write through the Text leaf elements instead
foreach(var el in replaceElements.OfType<Text>()){
el.Text = regex.Replace(el.Text, replacement);
}
return replaceElements;
}
Now you can have a document that looks like this:
[page]
[TitleLetter]
[WordTemplate][Word]: [Translation] [/WordTemplate]
[pageBreak]
[/page]
With that document you can duplicate the [page]..[/page] range, process it per letter and once you're out of letters - delete the template range:
var vocabulary = new Dictionary<char, Dictionary<string, string>>(); // letter -> (word -> translation)
foreach (var letter in vocabulary.Keys.OrderByDescending(c=>c)){
// in reverse order because the copy range comes after the template range
var pageTemplate = DuplicateRange(wordDocument,"\\[/?page\\]");
foreach (var p in pageTemplate.OfType<OpenXmlCompositeElement>()){
// brackets must be escaped, otherwise the pattern is a regex character class
ReplaceTag(p, "\\[TitleLetter\\]",""+letter);
var pageBr = ReplaceTag(p, "\\[pageBreak\\]","");
if (pageBr.Any()){
foreach(var pbr in pageBr){
pbr.InsertAfterSelf(PageBreakPara.CloneNode(true));
}
}
var wordTemplateFound = FindElements(p, "\\[/?WordTemplate\\]");
if (wordTemplateFound .Any()){
foreach (var word in vocabulary[letter].Keys){
var wordTemplate = DuplicateRange(p, "\\[/?WordTemplate\\]")
.First(); // since it's a single paragraph template
ReplaceTag(wordTemplate, "\\[/?WordTemplate\\]","");
ReplaceTag(wordTemplate, "\\[Word]",word);
ReplaceTag(wordTemplate, "\\[Translation\\]",vocabulary[letter][word]);
}
}
}
}
...Or something like it.
Look into SdtElements if things start getting too complicated.
Don't use AltChunk despite the popularity of that answer: it requires Word to open and process the file, so you can't use a library to make a PDF out of it.
Word documents are messy. The solution above should work (I haven't tested it), but the template must be carefully crafted; make backups of your template often.
Making a robust document engine isn't easy (since Word is messy). Do the minimum you need and rely on the template being under your control (not user-editable).
The code above is far from optimized or streamlined. I've tried to condense it into the smallest footprint possible at the cost of presentability. There are probably bugs too :)

Determine if input file is usable by program

I have a C# program that looks through directories for .txt files and loads each into a DataTable.
static IEnumerable<string> ReadAsLines(string fileName)
{
using (StreamReader reader = new StreamReader(fileName))
while (!reader.EndOfStream)
yield return reader.ReadLine();
}
public DataTable GetTxtData()
{
IEnumerable<string> reader = ReadAsLines(this.File);
DataTable txtData = new DataTable();
string[] headers = reader.First().Split('\t');
foreach (string columnName in headers)
txtData.Columns.Add(columnName);
IEnumerable<string> records = reader.Skip(1);
foreach (string rec in records)
txtData.Rows.Add(rec.Split('\t'));
return txtData;
}
This works great for regular tab-delimited files. However, the catch is that not every .txt file in the folders I need to use contains tab-delimited data. Some .txt files are actually SQL queries, notes, etc. that have been saved as plain text files, and I have no way of determining that beforehand. Trying to use the above code on such files clearly won't lead to the expected result.
So my question is this: How can I tell whether a .txt file actually contains tab-delimited data before I try to read it into a DataTable using the above code?
Just searching the file for any tab character won't work because, for example, a SQL query saved as plain text might have tabs for code formatting.
Any guidance here at all would be much appreciated!

If each line contains the same number of elements, then simply read each line and verify that you get the correct number of fields in each record. If not, error out.
if (headers.Count() != CORRECTNUMBER)
{
// ERROR
}
foreach (string rec in records)
{
string[] recordData = rec.Split('\t');
if (recordData.Count() != headers.Count())
{
// ERROR
}
txtData.Rows.Add(recordData);
}

To do this you need a set of "signature" logic providers that can check a given sample of the file for "signature" content. This is similar to how virus scanners work.
Consider creating a set of classes that each implement an ISignature interface:
class TSVFile : ISignature
{
enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream);
}
class SQLFile : ISignature
{
enumFileType ISignature.Evaluate(IEnumerable<byte> inputStream);
}
Each one would read an appropriate number of bytes and return the known file type, if it can be evaluated. Each file parser would need its own logic to determine how many bytes to read and on what basis to make its evaluation.
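The signature idea can be sketched end to end with a single tab-delimited check. Several things here are assumptions for illustration: the interface takes sample lines rather than the IEnumerable<byte> shown above, the FileType enum, TsvSignature and FileSniffer names are hypothetical, the sample is capped at 50 lines, and "tab-delimited" is taken to mean that every non-empty sampled line has the same number of fields (at least two).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum FileType { Unknown, TabDelimited }

// Hypothetical interface, modelled on the ISignature idea above.
public interface ISignature
{
    FileType Evaluate(IEnumerable<string> sampleLines);
}

public sealed class TsvSignature : ISignature
{
    // A sample looks tab-delimited when it has a header plus at least one row,
    // and every non-empty line splits into the same number (>= 2) of fields.
    public FileType Evaluate(IEnumerable<string> sampleLines)
    {
        List<int> fieldCounts = sampleLines
            .Where(line => line.Length > 0)
            .Select(line => line.Split('\t').Length)
            .ToList();

        bool consistent = fieldCounts.Count > 1
            && fieldCounts[0] > 1
            && fieldCounts.All(count => count == fieldCounts[0]);

        return consistent ? FileType.TabDelimited : FileType.Unknown;
    }
}

public static class FileSniffer
{
    private static readonly ISignature[] Signatures = { new TsvSignature() };

    // Returns the first type any signature recognises, else Unknown.
    public static FileType Detect(IEnumerable<string> sampleLines)
    {
        List<string> sample = sampleLines.Take(50).ToList(); // assumed sample size
        return Signatures
            .Select(s => s.Evaluate(sample))
            .FirstOrDefault(t => t != FileType.Unknown);
    }
}
```

A caller could pass File.ReadLines(path), which reads lazily, so only the sampled lines are pulled from disk before deciding whether to build the DataTable. A SQL file indented with tabs fails the check because its lines do not split into a consistent number of fields.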

Print an array/list to excel in c#

I am able to save a single value into excel but I need help to save a full list/array into an excel sheet.
Code I have so far:
var MovieNames = session.Query<Movie>()
.ToArray();
List<string> MovieList = new List<string>();
foreach (var movie in MovieNames)
{
MovieList.Add(movie.MovieName);
}
//If I want to print a single value or a string,
//I can use the following to print/save to excel
// How can I do this if I want to print that entire
//list thats generated in "MovieList"
return File(new System.Text.UTF8Encoding().GetBytes(MovieList), "text/csv", "demo.csv");

You could use FileHelpers to serialize some strongly typed object into CSV. Just promise me to never roll your own CSV parser.

If you mean you want to create a .csv file with all movie names in one column so you can open it in Excel then simply loop over it:
byte[] content;
using (var ms = new MemoryStream())
{
using (var writer = new StreamWriter(ms))
{
foreach (var movieName in MovieList)
writer.WriteLine(movieName);
}
content = ms.ToArray();
}
return File(content, "text/csv", "demo.csv");
Edit
You can add more columns and get fancier with your output, but then you run into the problem that you have to check for special characters that need escaping (like , and "). If you want to do more than just simple output, I suggest you follow @Darin's suggestion and use the FileHelpers utilities. If you can't or don't want to use them, this article has an implementation of a CSV writer.
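To make the escaping problem concrete, here is a minimal sketch of the usual convention (RFC 4180 style): a field that contains a comma, a double quote or a line break is wrapped in double quotes, and embedded quotes are doubled. CsvWriter and its methods are hypothetical names, not the writer from the linked article, and a library such as FileHelpers remains the safer choice for anything non-trivial.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class CsvWriter   // hypothetical helper
{
    // Quotes a field when it contains a comma, quote or line break,
    // doubling any embedded quotes.
    public static string Escape(string field)
    {
        if (field == null) return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Joins the escaped fields of one row with commas.
    public static string ToLine(IEnumerable<string> fields) =>
        string.Join(",", fields.Select(Escape));

    // Encodes all rows as UTF-8, one CRLF-terminated line per row.
    public static byte[] ToBytes(IEnumerable<IEnumerable<string>> rows) =>
        Encoding.UTF8.GetBytes(string.Join("\r\n", rows.Select(ToLine)) + "\r\n");
}
```

For the single-column case in the question, the controller could return File(CsvWriter.ToBytes(MovieList.Select(name => new[] { name })), "text/csv", "demo.csv").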
