I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end and then place that content into another document. The issue that I'm having is that when using the ImportNode method is imports onto my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was on the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because I'm having the issue with the ImportNode stripping my font formatting I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This works mostly but the problem with this is it is stripping out my returns on the word document so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using, please forgive the comments etc... this has been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
// is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains(" ") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the links is not an email include the target attribute.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Inseret the HTML String into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure thigns look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}
Saving each node to HTML and then inserting them into the destination document is not a good idea. Because not all nodes can be properly saved to HTML and some formatting can be lost after Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles, in this case styles and default of the destination document are used and font might be changed. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting this must be a bug and you should report this to Aspose.Words staff in the support forum.
Related
I'm working with an existing library - the goal of the library is to pull text out of PDFs to verify against expected values to quality check recorded data vs data in pdf.
I'm looking for a way to succinctly pull a specific page worth of text given a string that should only fall on that specific page.
var pdfDocument = new Document(file.PdfFilePath);
var textAbsorber = new TextAbsorber{
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
pdfDocument.Pages.Accept(textAbsorber);
foreach (var page in pdfDocument.Pages)
{
}
I'm stuck inside the foreach(var page in pdfDocument.Pages) portion... or is that the right area to be looking?
Answer: Text Absorber recreated each page - inside the foreach loop.
If the absorber isn't recreated, it keeps text from previous loops.
public List<string> ProcessPage(MyInfoClass file, string find)
{
var pdfDocument = new Document(file.PdfFilePath);
foreach (Page page in pdfDocument.Pages)
{
var textAbsorber = new TextAbsorber {
ExtractionOptions = {
FormattingMode = TextExtractionOptions.TextFormattingMode.Pure
}
};
page.Accept(textAbsorber);
var ext = textAbsorber.Text;
var exts = ext.Replace("\n", "").Split('\r').ToList();
if (ext.Contains(find))
return exts;
}
return null;
}
I'm trying to add custom tags to PowerPoint slides using OpenXml component in c#.
But after adding the tags dynamically, the presentation gets corrupted and is causing the powerpoint to stop working.
Following is the code I'm using for adding the tags:
public static void AddSlideTag(string presentationPath, string tagValue)
{
using (PresentationDocument presentation = PresentationDocument.Open(presentationPath, true))
{
var presPart = presentation.PresentationPart;
// Copy each slide in the source presentation, in order, to
// the destination presentation.
var slideIndex = 0;
foreach (var openXmlElement in presPart.Presentation.SlideIdList)
{
var slideId = (SlideId)openXmlElement;
// Create a unique relationship id.
var sp = (SlidePart)presPart.GetPartById(slideId.RelationshipId);
//create userDefinedTag
Tag slideObjectTag = new Tag() { Name = "CustomTag", Val = tagValue};
UserDefinedTagsPart userDefinedTagsPart1 = sp.AddNewPart<UserDefinedTagsPart>();
if (userDefinedTagsPart1.TagList == null)
userDefinedTagsPart1.TagList = new TagList();
userDefinedTagsPart1.TagList.Append(slideObjectTag);
//add tag to CustomerDataList element
var id = sp.GetIdOfPart(userDefinedTagsPart1);
if (sp.Slide.CommonSlideData == null)
{
sp.Slide.CommonSlideData = new CommonSlideData();
}
if (sp.Slide.CommonSlideData.CustomerDataList == null)
sp.Slide.CommonSlideData.CustomerDataList = new CustomerDataList();
CustomerDataTags tags = new CustomerDataTags();
tags.Id = id;
sp.Slide.CommonSlideData.CustomerDataList.AppendChild(tags);
slideIndex++;
sp.Slide.Save();
}
presPart.Presentation.Save();
}
}
It seems like the xml of the slides is broken after executing this code, but I cannot figure out what is the problem.
Does anyone know what is wrong with the code above, or what is causing the powerpoint to gets corrupted after running this code? Am I missing something here?
I'm using DocumentFormat.OpenXml.2.5 package.
I use the code below to strip specific html tags from parsed html using AngleSharp (as it is recommendable over using regular expressions to do such jobs (AngleSharp is currently maintained, HtmlAgilityPack not, hence I have been moving to the latter).
It works great - but now I want to remove html comments as well. Meaning whatever is found between <!-- and --> tags.
How would this be achieved using AngleSharp ? Using QuerySelector does not seem suiting here.
private string ExtractContentFromHtml(string input)
{
List<string> tagsToRemove = new List<string>
{
"script",
"style",
"img"
};
var config = Configuration.Default.WithJavaScript();
HtmlParser hp = new HtmlParser(config);
List<IElement> tags = new List<IElement>();
List<string> nodeTypes = new List<string>();
var hpResult = hp.Parse(input);
try
{
foreach (var tagToRemove in tagsToRemove)
tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));
foreach (var tag in tags)
tag.Remove();
}
catch (Exception ex)
{
_errors.Add(string.Format("Error in cleaning html. {0}", ex.Message));
}
var content = hpResult.QuerySelector("body");
return (content).InnerHtml;
}
After playing with the code above and AngleSharp's API, I came up with the following working solution.
Initially I thought I could replace all my tag-removing stuff and solely rely on treating text nodes only, but this is not recommendable,
since some text nodes will be generated on the fly via javascript code, meaning, you need to remove javascript nodes anyway. So I left the style + img removals as well.
Worth mentioning as well that the DOM classifies nodes according to types, and one is able to find comments by searching for nodes of type 8.
private string ExtractContentFromHtml(string input)
{
List<string> tagsToRemove = new List<string>
{
"script",
"style",
"img"
};
var config = Configuration.Default.WithJavaScript();
HtmlParser hp = new HtmlParser(config);
List<IElement> tags = new List<IElement>();
List<string> nodeTypes = new List<string>();
var hpResult = hp.Parse(input);
List<string> textNodesValues = new List<string>();
try
{
foreach (var tagToRemove in tagsToRemove)
tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));
foreach (var tag in tags)
tag.Remove();
/*
the following will not work, because text nodes that are not immediate children will not be considered
textNodesValues = hpResult.All.Where(n => n.NodeType == NodeType.Text).Select(n => n.TextContent).ToList();
*/
var treeWalker = hpResult.CreateTreeWalker(hpResult, FilterSettings.Text);
var textNode = treeWalker.ToNext();
while (textNode != null)
{
textNodesValues.Add(textNode.TextContent);
textNode = treeWalker.ToNext();
}
}
catch (Exception ex)
{
_errors.Add(string.Format("Error in cleaning html. {0}", ex.Message));
}
return string.Join(" ", textNodesValues);
}
I want to add custom page numbers (like 1/2,2/2) to word document with using Aspose.Words. But I couldn't find any sample for c# language. I tried to overrite footer but i couldn't give a format to page numbers.
Pls help!
Thanks!
edit
After i tried first answer,it worked as what i want but another problem came up. I adding child documents to main document. I can only formatting main document's number. Child documents still have ordinary page number.
Here a sample of code;
public void AddChildDocs (System.IO.Stream parentStream, List<System.IO.Stream> childStreams)
{
doc = new Aspose.Words.Document(parentStream);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
doc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
foreach (var item in childStreams)
{
Aspose.Words.Document childDoc = new Aspose.Words.Document(item);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
childDoc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
doc.AppendDocument(childDoc, ImportFormatMode.KeepSourceFormatting);
}
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
builder.InsertField("PAGE", "");
builder.Write(" / ");
builder.InsertField("NUMPAGES", "");
}
You can get idea from this page in Aspose documentation. Below is the sample code taken from the same page, but only related to custom page numbers.
String src = dataDir + "Page numbers.docx";
String dst = dataDir + "Page numbers_out.docx";
// Create a new document or load from disk
Aspose.Words.Document doc = new Aspose.Words.Document(src);
// Create a document builder
Aspose.Words.DocumentBuilder builder = new DocumentBuilder(doc);
// Go to the primary footer
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
// Add fields for current page number
builder.InsertField("PAGE", "");
// Add any custom text
builder.Write(" / ");
// Add field for total page numbers in document
builder.InsertField("NUMPAGES", "");
// Import new document
Aspose.Words.Document newDoc = new Aspose.Words.Document(dataDir + "new.docx");
// Link the header/footer of first section to previous document
newDoc.FirstSection.HeadersFooters.LinkToPrevious(true);
doc.AppendDocument(newDoc, ImportFormatMode.UseDestinationStyles);
// Save the document
doc.Save(dst);
I work with Aspose as Developer Evangelist.
Here is the code to set custom page number in aspose.word, when you set page margins and starting page number then it automatically get next page when that particular page area is finished. Try this it will work...
section.PageSetup.PaperSize = PaperSize.Letter;
section.PageSetup.LeftMargin = 10;
section.PageSetup.RightMargin = 10;
section.PageSetup.TopMargin = 00;
section.PageSetup.BottomMargin = 0;
section.PageSetup.HeaderDistance = 50;
section.PageSetup.FooterDistance = 50;
section.PageSetup.Borders.Color = Color.Black;
section.PageSetup.PageStartingNumber = 1;
I've been trying to create a library to replace the MergeFields on a Word 2003 document, everything works fine, except that I lose the style applied to the field when I replace it, is there a way to keep it?
This is the code I'm using to replace the fields:
private void FillFields2003(string template, Dictionary<string, string> values)
{
object missing = Missing.Value;
var application = new ApplicationClass();
var document = new Microsoft.Office.Interop.Word.Document();
try
{
// Open the file
foreach (Field mergeField in document.Fields)
{
if (mergeField.Type == WdFieldType.wdFieldMergeField)
{
string fieldText = mergeField.Code.Text;
string fieldName = Extensions.GetFieldName(fieldText);
if (values.ContainsKey(fieldName))
{
mergeField.Select();
application.Selection.TypeText(values[fieldName]);
}
}
}
document.Save();
}
finally
{
// Release resources
}
}
I tried using the CopyFormat and PasteFormat methods in the selection, also using the get_style and set_style but to no exent.
Instead of using TypeText over the top of your selection use the the Result property of the Field:
if (values.ContainsKey(fieldName))
{
mergeField.Result = (values[fieldName]);
}
This will ensure any formatting in the field is retained.