.Net Aspose.Words reading from word document

.Net Aspose.Words reading from word document - c#

I use Aspose to write to a word file and read and write to an excel file. I am trying to get Aspose.Words to read from a Word document but I keep getting error messages. I have followed examples from Aspose.com and those aren't useful. I posted on the forum and I still haven't heard back from them.
I am wanting to read these field codes (example: {fillin "Date" \d ""}) from each word document, which there are multiples of them. Once I am able to get those extracted, I want to put those in a List<string> and verify that they aren't duplicates. After I have gone through all the documents, I need to print that list into an excel spreadsheet.
Can someone help me resolve this error issues or tell me an easier way to go about this?
Issue #1 - won't recognize the file.
//string path = #"C:\Users\kbangert\Desktop\Karpel\HonoluluHIChargeCode2\Charge Language\10C104X.doc";
//string file = ConfigurationManager.AppSettings["filePath"] + "10C104X.doc";
//Document doc = new Document(path);
//Document doc = new Document(file);
Document doc = new Document(#"10C104.docx");
I have tried different techniques and I get the same errors - "UnsupportedFileFormatException was unhandled" or "FileCorreptionException was unhandled". I know the files are fine so is it the field codes that are causing this issue?
Issue #2 - Cannot resolve symbol 'Fields' or 'FieldCollection'
StringBuilder sb = new StringBuilder();
FieldCollection fields = doc.Range.Fields;
foreach (Field field in fields)
sb.AppendLine(field.GetFieldCode());
This came from the developers at Aspose and this throws the above errors.

For issue #1, there is no telling exactly why the file is failing, so you would need to get Aspose involved in that. However, we use the the following Aspose method to determine whether or not Aspose will be able to open the file:
Aspose.Words.FileFormatUtil.DetectFileFormat(fileName).LoadFormat
You should also double check to see if the file exists where your app expects it to exist before trying to open it using System.IO.File.Exists.
For issue #2 you need to add the namespace for the fields to the top of your class:
using Aspose.Words.Fields;
In addition, it looks like the developers may have provided information different than the version you are working with. In our code, the reference is to doc.Range.FormFields and the collection type is FormFieldCollection. I am not sure what the GetFieldCode equivalent is.

Related

How can I export a piece of a DOCX file and keep the same paragraph numbering?

TL;DR:
How can I capture the paragraph numbering as a 'part' of the text and export it to a DOCX?
Problem
I have a document that's split into sections and sub-sections that reads similarly to a set of state statutes (Statute 208, with subsections Statute 208.1, Statute 208.2, etc.). We created this by modifying the numbering.xml file within the .docx zip.
I want to export a 'sub-section' (208.5) and its text to a separate .docx file. My VSTO add-in exports the text well enough, but the numbering resets to 208.1. This does make some sense as it's now the first paragraph with that <ilvl> in the document.
PDF works okay
Funnily enough, I'm able to call Word.Range's ExportAsFixedFormat function and export this selection to PDF just fine - even retaining the numbering. This led me down a path of trying to 'render' the selection, possibly as it would be printed, in order to throw it into a new .docx file, but I haven't figured that out, either.
What I've tried:
Range.ExportFragment() using both wdFormatStrictOpenXMLDocument and wdFormatDocumentDefaultas the wdSaveType values.
These export but also reset the numbering.
Document.PrintOut() using PrintToFile = true and a valid filename. I realize now that this, quite literally, generates 'printout instructions' and won't inject a new file at path filename with any valid file structure.
Plainly doesn't work. :)
Application.Selection.XML to a variable content and calling Document.Content.InsertXML(content) on a newly added Document object.
Still resets the numbering.
Code Section for Context
using Word = Microsoft.Office.Interop.Word;
Word.Range range = Application.ActiveDocument.Range(startPosition, endPosition);
range.Select();
//export to DOCX?
Application.Selection.Range.ExportFragment(
filename, Word.WdSaveFormat.wdFormatDocumentDefault);

You could use ConvertNumbersToText(wdNumberAllNumbers) before exporting, then _Document.Undo() or close without saving after the export.

There is some good information at this (dated) link that still should work with current Word APIs:
https://forums.windowssecrets.com/showthread.php/27711-Determining-which-ListTemplates-item-is-in-use-(VBA-Word-2000)
Information at that link suggests that you can create a name/handle for your ListTemplate so that you can reference it in code--as long as your statute-style bullets are associated with a named style for the document. The idea is to first name the ListTemplate that's associated with the statute bullet style for the active document and then reference that name when accessing the ListLevels collection.
For instance, you could have code that looks something like this:
ActiveDocument.Styles("StatutesBulletStyle").ListTemplate.Name = "StatuteBulletListTemplate";
After the above assignment, you can refer to the template by name:
ActiveDocument.ListTemplates("StatuteBulletListTemplate").ListLevels(1).StartAt = 5;
Using the above technique no longer requires that you try to figure out what the active template is...
Does that help?

Is there any way to assign Id's to paragraphs in Open XML SDK 2.5?

I'm working on an application which has to create word documents with the use of Office Open XML SDK 2.5. The idea that I'm having now is that I will start from a template with an empty body (so I have all the namespaces etc. defined already), and add Paragraphsto it. If I need images I will add the ImageParts and try to give the ImagePart the Id present in the predefined paragraphpart which will contain the image. I will store the paragraphs as xml in a database, fetch the ones I need, fill in/modify some values if needed and insert them into my word document. But this is the tricky part, how can I easily insert them in a way so I don't have to query on their content to later on find one of the paragraphs? In other words, I need Id's. I have some options in mind:
For each possible paragraph I have, manually create a SdtBlock. This SdtBlock will have an Id which matches the Id of each paragraph in the database. This seems like a lot of manual work though, and I'd rather be able to create future word documents easier...
I chose this approach but I insert Building Blocks which can be stored in templates with a specific tagname.
Create the paragraphs, copy the xml from the developer tool, and manually add a ParagraphId. This seems even more of a nightmare though, because for every future new paragraphs I will have to create new Id's etc. Also it would be impossible to insert tables as there is no way (afaik) to give those an Id.
Work with bookmarks to know where to insert the data. I don't really like this either as bookmarks are visible for everyone. I know I can replace them, but then I don't have any way to identify individual paragraphs later on.
**** my database and just add everything in the template :D Remove the paragraphs I don't need by deleting the bookmarks with their content. This idea seems the worst of all though as I don't want to depend on having a templatefile with all possible content per word-file I need to generate.
Anyone with experience in OpenXml who knows which approach would be the best? Maybe there is another approach which is better and I have completely overlooked? The ideal solution would be that I can add Ids in Office Word but that's a no-go as I haven't found anything to do that yet.
Thanks in advance!

Content Controls (std) were designed for this, although I'm not sure the designers ever contemplated "targeting" each and every paragraph in the document...
Back in the 2003/2007 days it was possible to add custom XML mark-up to a Word document, which would have been exactly what you're looking for. But Microsoft lost a patent court case around 2009 and had to pull the functionality. So content controls are really your only good choice.
Your approach could, possibly, be combined with the BuildingBlocks concept. BuildingBlocks are Word content stored in a Word template file as valid Word Open XML. They can be assigned to "galleries" and categorized. There is a Content Control of type BuildingBlock that can be associated with a specific Gallery and Category which might help you to a certain extent and would be an alternative to storing content in a database.

Ok, I did a small research, you can do it in strict OpenXML, but only before you open your file in Word. Word will remove everything it cannot read.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true)) {
document.MainDocumentPart.Document.Body.Ancestors().First()
.SetAttribute(new OpenXmlAttribute() {
LocalName = "someIdName",
Value = "111" });
}
Here, for example, I set attribute "someIdName", which doesn't exits in OpenXML, to some random element. You can set it anywhere and use it as id

Creating Word file from ObservableCollection with C#

I have an observable collection with a class that has 2 string properties: Word and Translation. I want to create a word file in format:
word = translation word = translation
word = translation word = translation...
The word document needs to be in 2 Columns (PageLayout) and the Word should be in bold.
I have first tried Microsoft.Office.Interop.Word.
PageSetup.TextColumns.SetCount(2) sets the PageLayout. As for the text itself I used a foreach loop and in each iteration I did this:
paragraph.Range.Text = Word + " = " + Translation;
object boldStart = paragraph.Range.Start;
object boldEnd = paragraph.Range.Start + Word.Length;
Word.Range boldPart = document.Range(boldStart, boldEnd);
boldPart.Bold = 1;
paragraph.Range.InsertParagraphAfter();
This does exactly what I want, but if there are 1000 items in the collection it takes about 10sec, much much more if the number is 10k+. I then used a StringBuilder and just set document.Content.Text = sb.ToString(); and that takes less than a sec, but I can't set the word to be bold that way.
Then I switched to using Open XML SDK 2.5, but even after reading the msdn documentation I still have no idea how to make just a part of the text bold, and I don't know if it's even possible to set PageLayout Columns count. The only thing I could do was to make it look the same as with Interop.Word, but with just 1 column and <1sec creation time.
Should I be using Interop.Word or Open XML (or maybe combined) for this? And can someone pls show me how to write this properly, so it doesn't take forever if the collection is relatively large? Any help is appreciated. :)

OOXML can be intimidating at first. http://officeopenxml.com/anatomyofOOXML.php has some good examples. Whenever you get confused unzip the docx and browse the contents to see how it's done.
The basic idea is you'd open Word, create a template with the styling you want and a code word to find the paragraph, then multiply the paragraph, replacing the text in that template with each word.
Your Word template would look like this:
Here's some pseudo code to get you started, assuming you have the SDK installed
var templateRegex = new Regex("\\[templateForWords\\]");
var wordPlacementRegex = new Regex("\\[word\\]");
var translationPlacementRegex = new Regex("\\[translation]\\]");
using (var document = WordprocessingDocument.Open(stream, true))
{
MainDocumentPart mainPart = document.MainDocumentPart;
// do your work here...
var paragraphTemplate = mainPart.Document.Body
.Descendants<Paragraph>()
.Where(p=>templateRegex.IsMatch(p.InnerText)); //pseudo
//... or whatever gives you the text of the Para, I don't have the SDK right now
foreach (string word in YourDictionary){
var paraClone = paragraphTemplate.Clone(); // pseudo
// you may need to do something like
// paraClone.Descendents<Text>().Where(t=>regex.IsMatch(t.Value))
// to find the exact element containing template text
paraClone.Text = templateRegex.Replace(paraClone.Text,"");// pseudo
paraClone.Text = wordPlacementRegex.Replace(paraClone.Text,word);
paraClone.Text = translationPlacementRegex.Replace(paraClone.Text,YourDictionary[word]);
paragraphTemplate.Parent.InsertAfter(paraClone,ParagraphTemplate); // pseudo
}
paragraphTemplate.Remove();
// document should auto-save
document.Package.Flush();
}

OpenXML is absolutely better, because it is faster, has less bugs, more reliable and flexible in runtime (especially in server environment). And it's not really difficult to find out how to make one or another element using OpenXML. As docx file is just a zip file with xml files inside, I open it and read the xml to get the idea, how word itself makes it. First of all, I create a document, then format it (in your case, you can create some file with two columns and bold words inside), save it, rename it to .zip file. Then open it, open "word" directory inside and the file "document.xml" inside the directory. This document contains essential part of xml, looking at this it's not difficult to figure out how to recreate it in OpenXML

Open XML is a much better option than Office COM. But the problem is that it is a low-level file format library that unlike Office COM doesn’t work on a high abstraction level. You might want to go that route but I recommend you to first consider looking into a commercial library that will give you the benefits of a high-level DOM without the need to have MS Word installed on the production machine. Our company recently purchased this toolkit which allows you to use template based approach and also DOM/programmatic approach to generate/modify/create documents.

Use OpenXML to replace text in DOCX file - strange content

I'm trying to use the OpenXML SDK and the samples on Microsoft's pages to replace placeholders with real content in Word documents.
It used to work as described here, but after editing the template file in Word adding headers and footers it stopped working. I wondered why and some debugging showed me this:
Which is the content of texts in this piece of code:
using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(DocumentFile, true))
{
var texts = wordDoc.MainDocumentPart.Document.Body.Descendants<Text>().ToList();
}
So what I see here is that the body of the document is "fragmented", even though in Word the content looks like this:
Can somebody tell me how I can get around this?
I have been asked what I'm trying to achieve. Basically I want to replace user defined "placeholders" with real content. I want to treat the Word document like a template. The placeholders can be anything. In my above example they look like {var:Template1}, but that's just something I'm playing with. It could basically be any word.
So for example if the document contains the following paragraph:
Do not use the name USER_NAME
The user should be able to replace the USER_NAME placeholder with the word admin for example, keeping the formatting intact. The result should be
Do not use the name admin
The problem I see with working on paragraph level, concatenating the content and then replacing the content of the paragraph, I fear I'm losing the formatting that should be kept as in
Do not use the name admin

Various things can fragment text runs. Most frequently proofing markup (as apparently is the case here, where there are "squigglies") or rsid (used to compare documents and track who edited what, when), as well as the "Go back" bookmark Word sets in the background. These become readily apparent if you view the underlying WordOpenXML (using the Open XML SDK Productivity Tool, for example) in the document.xml "part".
It usually helps to go an element level "higher". In this case, get the list of Paragraph descendants and from there get all the Text descendants and concatenate their InnerText.

OpenXML is indeed fragmenting your text:
I created a library that does exactly this : render a word template with the values from a JSON.
From the documenation of docxtemplater :
Why you should use a library for this
Docx is a zipped format that contains some xml. If you want to build a simple replace {tag} by value system, it can already become complicated, because the {tag} is internally separated into <w:t>{</w:t><w:t>tag</w:t><w:t>}</w:t>. If you want to embed loops to iterate over an array, it becomes a real hassle.
The library basically will do the following to keep formatting :
If the text is :
<w:t>Hello</w:t>
<w:t>{name</w:t>
<w:t>} !</w:t>
<w:t>How are you ?</w:t>
The result would be :
<w:t>Hello</w:t>
<w:t>John !</w:t>
<w:t>How are you ?</w:t>
You also have to replace the tag by <w:t xml:space=\"preserve\"> to ensure that the space is not stripped out if they is any in your variables.

Fillable doc files

I have a samples of some documents in .doc format. So I need to create some "fillable# areas instead of certain values in samples. Then I need to automatically fill this documents using C#. So what do you think about it? Is that possible? Thanks in advance, guys! P.S.: if you need some information from me please feel free to ask me about additions to my question.

Besides simply injecting/replacing text into the document itself you could also utilize docvariables. You can define/create them in your document and then you can codewise set the values.
Using docvariables you seperate the design of the worddoc (where is the text shown) from setting the values which might be usefull for your case.
You can certainly manipulate them using C# but a bit more info using a vba sample can found at What is a DOCVARIABLE in word
One little warning when using c# to edit them. If you set the value of a docvariable to "" (empty string) it results in the docvariable being deleted from the document. If you want to keep the docvariable around set it's value to a " " (space)

Yes this is possible, you can create in your Document a placeholder areas which you search and change when you access the file. Check these results on how to modify the word document using C#

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.