Duplication with OpenXML (word document) and ID issues - c#

Is it possible to duplicate a Word document element with OpenXML without running into "duplicate id" issues?
To duplicate, I currently clone the elements inside the body and append the clones to the body. But if any of the elements has an ID, I get errors when I open the document in Word.
Here is an example of an error from the OpenXML validator:
[60] Description="Attribute 'id' should have unique value. Its
current value 'Rectangle 11' duplicates with
others."
And here is my code:
Document document = wordDocument.MainDocumentPart.Document;
Body body = document.Body;
IEnumerable<OpenXmlElement> elements = ((Body)body.CloneNode(true)).Elements();
foreach (var element in elements)
{
    OpenXmlElement e = (OpenXmlElement)element.CloneNode(true);
    body.AppendChild(e);
}

You can't just copy elements that carry an id; you have to duplicate the parts they refer to as well (search for OpenXmlPart for more information).
You can do this by combining AddPart() and GetIdOfPart() (both accessible from MainDocumentPart).
First option:
When you find an element with an id, use AddPart(OpenXmlPart part) to add the part it refers to, then retrieve the newly generated id of that part with GetIdOfPart(OpenXmlPart part).
After that, replace the id in your cloned OpenXmlElement with the new one.
Second option:
Alternatively, you could manage the ids yourself:
Check the highest id of the existing parts (and save it).
Clone all parts from the start and choose the ids yourself (by adding the saved highest id).
When you copy each element and find an id, add the saved highest id so it matches the new part.
I hope one of these approaches helps, but in any case you will need to clone the parts.
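The id arithmetic of the second option can be sketched without the OpenXML SDK at all. This is only a rough illustration: it assumes the ids look like a prefix followed by a positive number (for example "rId3"), and IdOffset and Remap are hypothetical names, not part of any library. Applying the resulting mapping to the cloned parts and elements is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class IdOffset
{
    // Extracts the trailing number of an id such as "rId12"; ids without one count as 0.
    private static int NumberOf(string id)
    {
        Match m = Regex.Match(id, @"\d+$");
        return m.Success ? int.Parse(m.Value) : 0;
    }

    // Maps every existing id to a new id whose number is shifted by the highest number in use.
    // Assumption: every id ends in a positive number, so shifted ids cannot collide with old ones.
    public static Dictionary<string, string> Remap(IEnumerable<string> existingIds)
    {
        List<string> ids = existingIds.Distinct().ToList();
        int highest = ids.Select(NumberOf).DefaultIfEmpty(0).Max();
        return ids.ToDictionary(
            id => id,
            id => Regex.Replace(id, @"\d+$", "") + (NumberOf(id) + highest));
    }
}
```

With existing ids "rId1" and "rId3", the highest number is 3, so the clones would get "rId4" and "rId6".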

DocIO is a .NET class library that can read, write and render Microsoft Word documents. Using DocIO, you can clone the elements such as paragraph, table, text run or the entire document and append it where you need.
The whole suite of controls is available for free (commercial applications also) through the community license program if you qualify. The community license is the full product with no limitations or watermarks.
Here is a simple example code snippet that clones all the paragraphs and tables in the document body and appends them at the end of the same document.
using Syncfusion.DocIO.DLS;
namespace DocIO_Clone
{
class Program
{
static void Main(string[] args)
{
using (WordDocument document = new WordDocument(@"InputWordFile.docx"))
{
int sectionCount = document.Sections.Count;
for (int i = 0; i < sectionCount; i++)
{
IWSection section = document.Sections[i];
int entityCount = section.Body.ChildEntities.Count;
for (int j = 0; j < entityCount; j++)
{
IEntity entity = section.Body.ChildEntities[j];
switch(entity.EntityType)
{
case EntityType.Paragraph:
IWParagraph paragraph = entity.Clone() as IWParagraph;
document.LastSection.Body.ChildEntities.Add(paragraph);
break;
case EntityType.Table:
IWTable table = entity.Clone() as IWTable;
document.LastSection.Body.ChildEntities.Add(table);
break;
}
}
}
document.Save("ResultDocument.docx");
}
}
}
}
For further information, please refer to our help documentation.
Note: I work for Syncfusion

Related

How to get list of fonts used within one PDF file and copy them to another?

I'm trying to convert a PDF 1.7 document to a PDF/A-3B one, and currently I need to get all fonts in the source document and copy them into the target (if this is actually the way to do it). So currently I have the following:
for (int i = 1; i < source.GetNumberOfPdfObjects(); i++)
{
var obj = source.GetPdfObject(i);
if (!obj?.IsDictionary() ?? true)
continue;
var dict = obj as PdfDictionary;
if (dict == null)
continue;
if (PdfName.Font.Equals(dict.GetAsName(PdfName.Type)))
{
var fontDescriptor = dict.GetAsDictionary(PdfName.FontDescriptor);
if (fontDescriptor == null)
continue;
//What else?
}
}
But I got stuck trying to get the font.
Is this the way to get the fonts from one doc or is there an easier way? And how does one copy them into the new doc?
To get all the fonts in the document and copy them into the target document, you need the following code:
for (int i = 1; i <= pdfDocument.getNumberOfPdfObjects(); i++) {
PdfObject object = pdfDocument.getPdfObject(i);
if (object.isDictionary() && PdfName.Font.equals(((PdfDictionary)object).getAsName(PdfName.Type))) {
object.copyTo(targetDocument);
}
}
However, please don't expect that all the content will be preserved on pages etc. This code just does what you ask for - copy the fonts to the new document. Preserving content and references to fonts is much more complicated than just copying the fonts.
Also, don't expect that copying objects from an arbitrary PDF document into a document you claim to be PDF/A-3B-compliant will make that document compliant. This is simply not true. The PDF/A standard imposes a lot of requirements, and among them are requirements for fonts that are not necessarily fulfilled in your original document.

C# find a specific element with two or more xml files?

I try to explain my problem:
Okay, I need the KSCHL and the Info.
I need the KSCHL from the result file, and then I want to search for that KSCHL in the other file, "Data".
In the first file I have all KSCHL.
var kschlResultList = docResult.SelectNodes(...);
var kschlDataList = docData.SelectNodes(...);
var infoDataList = docData.SelectNodes(...);
for (int i = 0; i < kschlResultList.Count; i++)
{
string kschlResult = kschlResultList[i].InnerText;
for (int x = 0; x < kschlDataList.Count; x++)
{
string kschlData = kschlDataList[x].InnerText;
if (kschlData == kschlResult)
{
for (int y = 0; y < infoDataList.Count; y++)
{
string infoData = infoDataList[y].InnerText;
if (infoData == kschlResult)
{
//I know the If is false
string infoFromKschl = infoData;
}
}
}
}
}
The problem now is to find the KSCHL (from the first file) in the second file and then to look up the "info" that belongs to it.
So if I have the KSCHL "KVZ1" in the first file, I want to search for this KSCHL in the second file and get the associated Info for it.
Hope you understand :)
You don't have to loop quite so much. :-)
Using XPath - the special strings inside SelectNodes() or SelectSingleNode(), you can go pretty directly to what you want.
You can see a great basic example - several really - of how to select an XML node based on another node at the same level here:
How to select a node using XPath if sibling node has a specific value?
In your case, we can get to a list of the INFO values more simply by looping just through the KSCHL values. I use them as text, because I want to make a new XPath string with them.
I'm not clear exactly what format you want the results in, so I'm simply pushing them into a SortedDictionary for now.
At that last step, you could do whatever is most useful to you, such as push them into a database, dump them in a file, or send them to another function.
/***************************************************************
*I'm not sure how you want to use the results still,
* so I'll just stick them in a Dictionary for this example.
* ***********************************************************/
SortedDictionary<string, string> objLookupResults = new SortedDictionary<string, string>();
// --- note how I added /text()... doesn't change much, but being specific <<<<<<
var kschlResultList = docResult.SelectNodes("//root/CalculationLogCompact/CalculationLogRowCompact/KSCHL/text()");
foreach (System.Xml.XmlText objNextTextNode in kschlResultList) {
// get the actual text from the XML text node
string strNextKSCHL = objNextTextNode.InnerText;
// use it to make the XPath to get the INFO --- see the [KSCHL/text()= ...
string strNextXPath = "//SNW5_Pricing_JKV-Q10_full/PricingProcedure[KSCHL/text()=\"" + strNextKSCHL + "\" and PRICE>0]/INFO/text()";
// and get that INFO text! I use SelectSingleNode here, assuming only one INFO for each KSCHL..... if there can be more than one INFO for each KSCHL, then we'd need another loop here
string strNextINFO = docData.SelectSingleNode(strNextXPath)?.InnerText; // <<< note I added the ? because now there may be no result with the rule PRICE>0.
// --- then you need to put this result somewhere useful to you.
// I'm not sure what that is, so I'll stick it in the Dictionary object.
if (strNextINFO != null) {
objLookupResults.Add(strNextKSCHL, strNextINFO);
}
}
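To try the sibling-predicate idea in isolation, here is a small self-contained sketch. The XML shapes (Row, Entry) and the names KschlLookup, Lookup and Demo are made up for illustration and are not the real file layout from the question. It also assumes the KSCHL values contain no apostrophes, since they are pasted straight into the XPath string.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class KschlLookup
{
    // Returns KSCHL -> INFO for every KSCHL in resultDoc that has a matching entry in dataDoc.
    public static SortedDictionary<string, string> Lookup(XmlDocument resultDoc, XmlDocument dataDoc)
    {
        var found = new SortedDictionary<string, string>();
        foreach (XmlNode kschl in resultDoc.SelectNodes("//Row/KSCHL"))
        {
            string key = kschl.InnerText;
            // The predicate [KSCHL='...'] selects the Entry whose KSCHL child has that value.
            XmlNode info = dataDoc.SelectSingleNode("//Entry[KSCHL='" + key + "']/INFO");
            if (info != null)
            {
                found[key] = info.InnerText; // indexer, so a repeated KSCHL does not throw
            }
        }
        return found;
    }

    // Builds two tiny hypothetical documents and runs the lookup on them.
    public static SortedDictionary<string, string> Demo()
    {
        var resultDoc = new XmlDocument();
        resultDoc.LoadXml("<root><Row><KSCHL>KVZ1</KSCHL></Row><Row><KSCHL>KVZ9</KSCHL></Row></root>");
        var dataDoc = new XmlDocument();
        dataDoc.LoadXml("<data><Entry><KSCHL>KVZ1</KSCHL><INFO>first</INFO></Entry>"
                      + "<Entry><KSCHL>KVZ2</KSCHL><INFO>second</INFO></Entry></data>");
        return Lookup(resultDoc, dataDoc);
    }
}
```

In the demo only KVZ1 appears in both documents, so the result holds a single entry mapping KVZ1 to "first".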

Conditional new Break for multi-column docx file, C#

This is a follow-up question for Creating Word file from ObservableCollection with C#.
I have a .docx file with a Body that has 2 columns for its SectionProperties. I have a dictionary of foreign words with their translation. On each line I need [Word] = [Translation] and whenever a new letter starts it should be in its own line, with 2 or 3 line breaks before and after that letter, like this:
A
A-word = translation
A-word = translation
B
B-word = translation
B-word = translation
...
I structured this in a for loop, so that in every iteration I'm creating a new paragraph with a possible Run for the letter (if a new one starts), a Run for the word and a Run for the translation. So the Run with the first letter is in the same Paragraph as the word and translation Run and it appends 2 or 3 Break objects before and after the Text.
In doing so the second column can sometimes start with 1 or 2 empty lines. Or the first column on the next page can start with empty lines.
This is what I want to avoid.
So my question is, can I somehow check if the end of the page is reached, or the text is at the top of the column, so I don't have to add a Break? Or, can I format the Column itself so that it doesn't start with an empty line?
I have tried putting the letter Run in a separate, optional, Paragraph, but again, I find myself having to input line breaks and the problem remains.
In the spirit of my other answer you can extend the template capability.
Use the Productivity tool to generate a single page break object, something like:
private readonly Paragraph PageBreakPara = new Paragraph(new Run(new Break() { Type = BreakValues.Page}));
Make a helper method that finds containers of a text tag:
public IEnumerable<T> FindElements<T>(OpenXmlCompositeElement searchParent, string tagRegex)
    where T : OpenXmlElement
{
    var regex = new Regex(tagRegex);
    return searchParent.Descendants()
        // leaf elements only, so each text match is picked up once
        .Where(e => !(e is OpenXmlCompositeElement) && regex.IsMatch(e.InnerText))
        .SelectMany(e =>
            e.Ancestors()
                .OfType<T>()
                .Union(e is T ? new T[] { (T)e } : new T[] { }))
        .ToList(); // can skip, prevents reevaluations
}
And another one that duplicates a range from the document (a similar method can delete a range):
public IEnumerable<T> DuplicateRange<T>(OpenXmlCompositeElement root, string tagRegex)
    where T : OpenXmlElement
{
    // tagRegex must describe exactly two tags, such as [pageStart] and [pageEnd]
    // or [page] [/page] - or whatever pattern you choose
    var tagElements = FindElements<T>(root, tagRegex);
    var fromEl = tagElements.First();
    var toEl = tagElements.Skip(1).First(); // throws if fewer than two elements match
    // you may want to find a common parent here
    // I'll assume you've prepared the template so the elements are siblings.
    var result = new List<OpenXmlElement>();
    OpenXmlElement insertAfter = toEl; // moves along so the copies keep their original order
    var step = fromEl.NextSibling();
    while (step != null && step != toEl)
    {
        // another method called DeleteRange will instead delete elements in that range within this loop
        var copy = step.CloneNode(true); // deep clone
        insertAfter.InsertAfterSelf(copy);
        insertAfter = copy;
        result.Add(copy);
        step = step.NextSibling();
    }
    return result.OfType<T>().ToList();
}
public IEnumerable<OpenXmlElement> ReplaceTag(OpenXmlCompositeElement parent, string tagRegex, string replacement)
{
    var replaceElements = FindElements<OpenXmlElement>(parent, tagRegex);
    var regex = new Regex(tagRegex);
    foreach (var el in replaceElements)
    {
        // InnerText is read-only; only leaf text elements (such as Text) have a settable Text
        if (el is OpenXmlLeafTextElement leaf)
        {
            leaf.Text = regex.Replace(leaf.Text, replacement);
        }
    }
    return replaceElements;
}
Now you can have a document that looks like this:
[page]
[TitleLetter]
[WordTemplate][Word]: [Translation] [/WordTemplate]
[pageBreak]
[/page]
With that document you can duplicate the [page]..[/page] range, process it per letter and once you're out of letters - delete the template range:
// letter -> (word -> translation); fill this from your own data
var vocabulary = new Dictionary<char, Dictionary<string, string>>();
// assumes wordDocument is a WordprocessingDocument
var body = wordDocument.MainDocumentPart.Document.Body;
foreach (var letter in vocabulary.Keys.OrderByDescending(c => c))
{
    // in reverse order because the copy range comes after the template range
    var pageTemplate = DuplicateRange<Paragraph>(body, "\\[/?page\\]");
    foreach (var p in pageTemplate.OfType<OpenXmlCompositeElement>())
    {
        // the brackets must be escaped, otherwise the regex treats them as a character class
        ReplaceTag(p, "\\[TitleLetter\\]", "" + letter);
        var pageBr = ReplaceTag(p, "\\[pageBreak\\]", "");
        foreach (var pbr in pageBr)
        {
            pbr.InsertAfterSelf(PageBreakPara.CloneNode(true));
        }
        var wordTemplateFound = FindElements<OpenXmlElement>(p, "\\[/?WordTemplate\\]");
        if (wordTemplateFound.Any())
        {
            foreach (var word in vocabulary[letter].Keys)
            {
                // the type argument is a guess; adjust it to how your template is laid out
                var wordTemplate = DuplicateRange<OpenXmlCompositeElement>(p, "\\[/?WordTemplate\\]")
                    .First(); // since it's a single paragraph template
                ReplaceTag(wordTemplate, "\\[/?WordTemplate\\]", "");
                ReplaceTag(wordTemplate, "\\[Word\\]", word);
                ReplaceTag(wordTemplate, "\\[Translation\\]", vocabulary[letter][word]);
            }
        }
    }
}
...Or something like it.
Look into SdtElements if things start getting too complicated
Don't use AltChunk despite the popularity of that answer, it requires Word to open and process the file, so you can't use some library to make a PDF out of it
Word documents are messy, the solution above should work (haven't tested) but the template must be carefully crafted, make backups of your template often
making a robust document engine isn't easy (since Word is messy), do the minimum you need and rely on the template being in your control (not user-editable).
the code above is far from optimized or streamlined, I've tried to condense it in the smallest footprint possible at the cost of presentability. There are probably bugs too :)

How to determine if Aspose.Words bookmark contains nested bookmarks

I'm using the following code to iterate through all of the bookmarks in a Microsoft Word document:
foreach (var bookmark in _document.Range.Bookmarks.Cast<Bookmark>())
{
//code
}
How can I determine if bookmark contains nested bookmarks? I need to execute a separate set of logic based on whether or not a bookmark has other bookmarks within it.
The above foreach loop will get all the instances of Bookmark. To get the nested bookmark is a bit tricky, as there is no direct API in Aspose.Words to do this.
I have written a program to do this, the source code is not tiny, so I am sharing the Visual Studio project on Google Drive here.
Below is the summary
foreach (var bookmark in wordDoc.Range.Bookmarks.Cast<Aspose.Words.Bookmark>())
{
Console.WriteLine(bookmark.Name);
// Get all the nodes between bookmark start and end
ArrayList extractedNodes = ExtractContent(bookmark.BookmarkStart, bookmark.BookmarkEnd, true);
for (int i = 0; i < extractedNodes.Count; i++)
{
// Skip first and last nodes
if (i == 0 || i == extractedNodes.Count - 1)
continue;
// See if there is any bookmarks in this node
Node node = (Node)extractedNodes[i];
if (node.Range.Bookmarks.Count > 0)
Console.WriteLine("Nested bookmark found");
}
}
For details of ExtractContent() method, please visit http://www.aspose.com/docs/display/wordsnet/Extract+Content+Overview+and+Code

OpenXml: Worksheet Child Elements change in ordering results in a corrupt file

I am trying to use OpenXML to produce automated Excel files. One problem I am facing is reconciling my object model with the OpenXML object model for Excel. I have come to realise that the order in which I append the child elements of a worksheet matters.
For Example:
workSheet.Append(sheetViews);
workSheet.Append(columns);
workSheet.Append(sheetData);
workSheet.Append(mergeCells);
workSheet.Append(drawing);
The above ordering does not give any error.
But the following:
workSheet.Append(sheetViews);
workSheet.Append(columns);
workSheet.Append(sheetData);
workSheet.Append(drawing);
workSheet.Append(mergeCells);
gives an error.
So this doesn't let me create a drawing object whenever I want and append it to the worksheet, which forces me to create these elements before using them.
Can anyone tell me if I have understood the problem correctly? I believe we should be able to open any Excel file, create a new child element for a worksheet if necessary, and append it. But that might break the order in which these elements are supposed to be appended.
Thanks.
According to the Standard ECMA-376 Office Open XML File Formats, CT_Worksheet has a required sequence:
The reason the following is crashing:
workSheet.Append(sheetViews);
workSheet.Append(columns);
workSheet.Append(sheetData);
workSheet.Append(drawing);
workSheet.Append(mergeCells);
is because you have drawing before mergeCells. As long as you append your drawing after mergeCells, your code should work fine.
Note: You can find the full XSD in ECMA-376 3rd edition Part 1 (.zip) -> OfficeOpenXML-XMLSchema-Strict -> sml.xsd.
I found that for all "singleton" children where the parent object has a property defined (such as Worksheet.SheetViews), you should assign the new object to that property instead of using Append. This causes the class itself to ensure the order is correct.
workSheet.Append(sheetViews);
workSheet.Append(columns);
workSheet.Append(sheetData); // bad idea(though it does work if the order is good)
workSheet.Append(drawing);
workSheet.Append(mergeCells);
More correct format...
workSheet.SheetViews = sheetViews; // order doesn't matter.
workSheet.columns=columns;
...
As Joe Masilotti already explained, the order is defined in the schema.
Unfortunately, the OpenXML library does not ensure the correct order of child elements in the serialized XML as required by the underlying XML schema. Applications may not be able to parse the XML successfully if the order is not correct.
Here is a generic solution which I am using in my code:
private T GetOrCreateWorksheetChildCollection<T>(Spreadsheet.Worksheet worksheet)
where T : OpenXmlCompositeElement, new()
{
T collection = worksheet.GetFirstChild<T>();
if (collection == null)
{
collection = new T();
if (!worksheet.HasChildren)
{
worksheet.AppendChild(collection);
}
else
{
// compute the positions of all child elements (existing + new collection)
List<int> schemaPositions = worksheet.ChildElements
.Select(e => _childElementNames.IndexOf(e.LocalName)).ToList();
int collectionSchemaPos = _childElementNames.IndexOf(collection.LocalName);
schemaPositions.Add(collectionSchemaPos);
schemaPositions = schemaPositions.OrderBy(i => i).ToList();
// now get the index where the position of the new child is
int index = schemaPositions.IndexOf(collectionSchemaPos);
// this is the index to insert the new element
worksheet.InsertAt(collection, index);
}
}
return collection;
}
// names and order of possible child elements according to the openXML schema
private static readonly List<string> _childElementNames = new List<string>() {
"sheetPr", "dimension", "sheetViews", "sheetFormatPr", "cols", "sheetData",
"sheetCalcPr", "sheetProtection", "protectedRanges", "scenarios", "autoFilter",
"sortState", "dataConsolidate", "customSheetViews", "mergeCells", "phoneticPr",
"conditionalFormatting", "dataValidations", "hyperlinks", "printOptions",
"pageMargins", "pageSetup", "headerFooter", "rowBreaks", "colBreaks",
"customProperties", "cellWatches", "ignoredErrors", "smartTags", "drawing",
"drawingHF", "picture", "oleObjects", "controls", "webPublishItems", "tableParts",
"extLst"
};
The method always inserts the new child element at the correct position, ensuring that the resulting document is valid.
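The index computation at the heart of that method can be tried on plain strings, without the OpenXML SDK. The sketch below is only an illustration: SchemaOrder and InsertIndex are hypothetical names, and the order list is a shortened subset of the full list above. Instead of sorting, it counts how many existing children come before the new one in the schema; existing names that are not in the list count as "before".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SchemaOrder
{
    // Shortened subset of the worksheet child order, for illustration only.
    private static readonly List<string> Order = new List<string>
    {
        "sheetViews", "cols", "sheetData", "mergeCells", "drawing"
    };

    // Returns the index at which newName has to be inserted so the children stay in schema order.
    public static int InsertIndex(IEnumerable<string> existing, string newName)
    {
        int newPos = Order.IndexOf(newName);
        if (newPos < 0)
        {
            throw new ArgumentException("Unknown element name: " + newName);
        }
        // Every existing child that comes earlier in the schema shifts the insert index by one.
        return existing.Count(name => Order.IndexOf(name) < newPos);
    }
}
```

For a worksheet that already has sheetViews, cols, sheetData and drawing, InsertIndex(..., "mergeCells") gives 3, which places mergeCells right before drawing.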
For those who end up here via Google like I did, the function below solves the ordering problem after the child element is inserted:
public static T ReorderChildren<T>(T element) where T : OpenXmlElement
{
Dictionary<Type, int> childOrderHashTable = element.GetType()
.GetCustomAttributes()
.Where(x => x is ChildElementInfoAttribute)
.Select( (x, idx) => new KeyValuePair<Type, int>(((ChildElementInfoAttribute)x).ElementType, idx))
.ToDictionary(x => x.Key, x => x.Value);
List<OpenXmlElement> reorderedChildren = element.ChildElements
.OrderBy(x => childOrderHashTable[x.GetType()])
.ToList();
element.RemoveAllChildren();
element.Append(reorderedChildren);
return element;
}
The generated types in the DocumentFormat.OpenXml library have custom attributes that can be used to reflect metadata from the OOXML schema. This solution relies on System.Reflection and System.Linq (i.e., not very fast) but eliminates the need to hardcode a list of strings to correctly order the child elements for a specific type.
I use this function after validation, on the ValidationErrorInfo.Node property, and it cleans up the newly created element by reference. That way I don't have to apply this method recursively across an entire document.
helb's answer is beautiful - thank you for that, helb.
It has the slight drawback that it does not test if there are already problems with the order of child elements. The following slight modification makes sure there are no pre-existing problems when adding a new element (you still need his _childElementNames, which is priceless) and it's slightly more efficient:
private static int getChildElementOrderIndex(OpenXmlElement collection)
{
int orderIndex = _childElementNames.IndexOf(collection.LocalName);
if( orderIndex < 0)
throw new InvalidOperationException($"Internal: worksheet part {collection.LocalName} not found");
return orderIndex;
}
private static T GetOrCreateWorksheetChildCollection<T>(Worksheet worksheet) where T : OpenXmlCompositeElement, new()
{
T collection = worksheet.GetFirstChild<T>();
if (collection == null)
{
collection = new T();
if (!worksheet.HasChildren)
{
worksheet.AppendChild(collection);
}
else
{
int collectionSchemaPos = getChildElementOrderIndex(collection);
int insertPos = 0;
int lastOrderNum = -1;
for(int i=0; i<worksheet.ChildElements.Count; ++i)
{
int thisOrderNum = getChildElementOrderIndex(worksheet.ChildElements[i]);
if(thisOrderNum<=lastOrderNum)
throw new InvalidOperationException($"Internal: worksheet parts {_childElementNames[lastOrderNum]} and {_childElementNames[thisOrderNum]} out of order");
lastOrderNum = thisOrderNum;
if( thisOrderNum < collectionSchemaPos )
++insertPos;
}
// this is the index to insert the new element
worksheet.InsertAt(collection, insertPos);
}
}
return collection;
}
