Good day, everyone.
How can I apply predefined styles of word document to inserted HTML?
Like:
builder.InsertHTML(post.Title)
// apply style from document "Media-title"
builder.InsertHTML(post.Content)
// apply style "Media-content"
Please note InsertHtml() overload with useBuilderFormatting will not override styles of HTML text having inline styles. You may implement INodeChangingCallback for applying styles/formatting to HTML text. Please check following code snippet for reference.
public static void HtmlFormatting()
{
// Create a blank document object
Document doc = new Document();
DocumentBuilder builder = new DocumentBuilder(doc);
// Set up and pass the object which implements the handler methods.
doc.NodeChangingCallback = new HandleNodeChanging_FontChanger();
// Insert sample HTML content
builder.InsertHtml("<p>Hello World</p>");
doc.NodeChangingCallback = null;
builder.InsertHtml("<p>Some Test Text</p>");
doc.Save(#"Out.docx");
}
public class HandleNodeChanging_FontChanger : INodeChangingCallback
{
// Implement the NodeInserted handler to set default font settings for every Run node inserted into the Document
void INodeChangingCallback.NodeInserted(NodeChangingArgs args)
{
// Change the font of inserted text contained in the Run nodes.
if (args.Node.NodeType == NodeType.Run)
{
Run run = (Run)args.Node;
Console.WriteLine(run.Text);
run.Font.StyleName = "Intense Emphasis";
// Aspose.Words.Font font = ((Run)args.Node).Font;
// font.Size = 24;
// font.Name = "Arial";
}
}
void INodeChangingCallback.NodeInserting(NodeChangingArgs args)
{
// Do Nothing
}
void INodeChangingCallback.NodeRemoved(NodeChangingArgs args)
{
// Do Nothing
}
void INodeChangingCallback.NodeRemoving(NodeChangingArgs args)
{
// Do Nothing
}
}
I work with Aspose as developer Evangelist.
Well, after a couple of hours of finding a solution I've got this to work:
builder.ParagraphFormat.ClearFormatting();
builder.ParagraphFormat.Style = styles["word_style1"];
builder.Writeln(post.Title);
builder.InsertParagraph();
builder.ParagraphFormat.Style = styles["word_style2"];
builder.InsertHtml(post.Annotation, true);
builder.InsertParagraph();
builder.ParagraphFormat.Style = styles["word_style3"];
builder.InsertHyperlink(post.Url, post.Url, false);
P.S.: I hope there are any workarounds or improvements to do ot better.
Related
I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end and then place that content into another document. The issue that I'm having is that when using the ImportNode method is imports onto my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was on the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because I'm having the issue with the ImportNode stripping my font formatting I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This works mostly but the problem with this is it is stripping out my returns on the word document so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using, please forgive the comments etc... this has been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
// is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains(" ") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the links is not an email include the target attribute.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Inseret the HTML String into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure thigns look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}
Saving each node to HTML and then inserting them into the destination document is not a good idea. Because not all nodes can be properly saved to HTML and some formatting can be lost after Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles, in this case styles and default of the destination document are used and font might be changed. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting this must be a bug and you should report this to Aspose.Words staff in the support forum.
I want to add custom page numbers (like 1/2,2/2) to word document with using Aspose.Words. But I couldn't find any sample for c# language. I tried to overrite footer but i couldn't give a format to page numbers.
Pls help!
Thanks!
edit
After i tried first answer,it worked as what i want but another problem came up. I adding child documents to main document. I can only formatting main document's number. Child documents still have ordinary page number.
Here a sample of code;
public void AddChildDocs (System.IO.Stream parentStream, List<System.IO.Stream> childStreams)
{
doc = new Aspose.Words.Document(parentStream);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
doc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
foreach (var item in childStreams)
{
Aspose.Words.Document childDoc = new Aspose.Words.Document(item);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
childDoc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
doc.AppendDocument(childDoc, ImportFormatMode.KeepSourceFormatting);
}
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
builder.InsertField("PAGE", "");
builder.Write(" / ");
builder.InsertField("NUMPAGES", "");
}
You can get idea from this page in Aspose documentation. Below is the sample code taken from the same page, but only related to custom page numbers.
String src = dataDir + "Page numbers.docx";
String dst = dataDir + "Page numbers_out.docx";
// Create a new document or load from disk
Aspose.Words.Document doc = new Aspose.Words.Document(src);
// Create a document builder
Aspose.Words.DocumentBuilder builder = new DocumentBuilder(doc);
// Go to the primary footer
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
// Add fields for current page number
builder.InsertField("PAGE", "");
// Add any custom text
builder.Write(" / ");
// Add field for total page numbers in document
builder.InsertField("NUMPAGES", "");
// Import new document
Aspose.Words.Document newDoc = new Aspose.Words.Document(dataDir + "new.docx");
// Link the header/footer of first section to previous document
newDoc.FirstSection.HeadersFooters.LinkToPrevious(true);
doc.AppendDocument(newDoc, ImportFormatMode.UseDestinationStyles);
// Save the document
doc.Save(dst);
I work with Aspose as Developer Evangelist.
Here is the code to set custom page number in aspose.word, when you set page margins and starting page number then it automatically get next page when that particular page area is finished. Try this it will work...
section.PageSetup.PaperSize = PaperSize.Letter;
section.PageSetup.LeftMargin = 10;
section.PageSetup.RightMargin = 10;
section.PageSetup.TopMargin = 00;
section.PageSetup.BottomMargin = 0;
section.PageSetup.HeaderDistance = 50;
section.PageSetup.FooterDistance = 50;
section.PageSetup.Borders.Color = Color.Black;
section.PageSetup.PageStartingNumber = 1;
After trying out several solutions, I am in desperate need for help.
I tried several approaches, before finally copying and still being stuck with the solution from Getting full contents of a Datagrid using UIAutomation.
Let's talk code, please consider the comments:
// Get Process ID for desired window handle
uint processID = 0;
GetWindowThreadProcessId(hwnd, out processID);
var desktop = AutomationElement.RootElement;
// Find AutomationElement for the App's window
var bw = AutomationElement.RootElement.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.ProcessIdProperty, (int)processID));
// Find the DataGridView in question
var datagrid = bw.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.AutomationIdProperty, "dgvControlProperties"));
// Find all rows from the DataGridView
var loginLines = datagrid.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ControlTypeProperty, ControlType.DataItem));
// Badumm Tzzz: loginLines has 0 items, foreach is therefore not executed once
foreach (AutomationElement loginLine in loginLines)
{
var loginLinesDetails = loginLine.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ControlTypeProperty, ControlType.Custom));
for (var i = 0; i < loginLinesDetails.Count; i++)
{
var cacheRequest = new CacheRequest
{
AutomationElementMode = AutomationElementMode.None,
TreeFilter = Automation.RawViewCondition
};
cacheRequest.Add(AutomationElement.NameProperty);
cacheRequest.Add(AutomationElement.AutomationIdProperty);
cacheRequest.Push();
var targetText = loginLinesDetails[i].FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.ClassNameProperty, "TextBlock"));
cacheRequest.Pop();
var myString = targetText.Cached.Name;
}
}
I can neither fetch a GridPattern, nor a TablePattern instance from datagrid, both results in an exception:
GridPattern gridPattern = null;
try
{
gridPattern = datagrid.GetCurrentPattern(GridPattern.Pattern) as GridPattern;
}
catch (InvalidOperationException ex)
{
// It fails!
}
TablePattern tablePattern = null;
try
{
tablePattern = datagrid.GetCurrentPattern(TablePattern.Pattern) as TablePattern;
}
catch (InvalidOperationException ex)
{
// It fails!
}
The rows were added to the DataGridView beforehand, like this:
dgvControlProperties.Rows.Add(new object[] { false, "Some Text", "Some other text" });
I am compiling to .Net Framework 4.5. Tried both regular user rights and elevated admin rights for the UI Automation client, both yielded the same results described here.
Why does the DataGridView return 0 rows?
Why can't I get one of the patterns?
Kudos for helping me out!
Update:
James help didn't to the trick for me. The following code tough returns all rows (including the headers):
var rows = dataGrid.FindAll(TreeScope.Children, PropertyCondition.TrueCondition);
Header cells can then be identified by their ControlType of ControlType.Header.
The code that you are copying is flawed. I just tested this scenario with an adaptation of this sample program factoring in your code above, and it works.
The key difference is that the code above is using TreeScope.Children to grab the datagrid element. This option only grabs immediate children of the parent, so if your datagrid is nested it's not going to work. Change it to use TreeScope.Descendants, and it should work as expected.
var datagrid = bw.FindFirst(TreeScope.Descendants, new PropertyCondition(AutomationElement.AutomationIdProperty, "dgvControlProperties"));
Here is a link to how the various Treescope options behave. Also, I don't know how your binding the rows to the grid, but I did it like this in my test scenario and it worked flawlessly.
Hopefully this helps.
public class DataObject
{
public string FieldA { get; set; }
public string FieldB { get; set; }
public string FieldC { get; set; }
}
List<DataObject> items = new List<DataObject>();
items.Add(new DataObject() {FieldA="foobar",FieldB="foobar",FieldC="foobar"});
items.Add(new DataObject() { FieldA = "foobar", FieldB = "foobar", FieldC = "foobar" });
items.Add(new DataObject() { FieldA = "foobar", FieldB = "foobar", FieldC = "foobar" });
dg.ItemsSource = items;
Your code looks fine though this could be a focus issue.
Even though you are getting a reference to these automation element objects you should set focus on them (using the aptly named SetFocus method) before using them.
Try:
var desktop = AutomationElement.RootElement;
desktop.SetFocus();
// Find AutomationElement for the App's window
var bw = desktop.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.ProcessIdProperty, (int)processID));
and if that does not work try explicitly focusing on the dataGrid prior to calling "FindAll" on it i.e.
datagrid.SetFocus()
Why does the DataGridView return 0 rows?
The DataGridViewRows have a ControlType of ControlType.Custom. So I modified the line
// Find all rows from the DataGridView
var loginLines = datagrid.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ControlTypeProperty, ControlType.DataItem));
To
var loginLines = datagrid.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ControlTypeProperty, ControlType.Custom));
That gets you all the rows within the DataGridView. But your code has that within the loop.
What pattern is supported by DataGridView (not DataGrid)?
LegacyIAccessiblePattern. Try this-
LegacyIAccessiblePattern legacyPattern = null;
try
{
legacyPattern = datagrid.GetCurrentPattern(LegacyIAccessiblePattern.Pattern) as LegacyIAccessiblePattern;
}
catch (InvalidOperationException ex)
{
// It passes!
}
As I commented on #James answer, there is no support for UIA for DataGridView (again, not DataGrid).
If you search in google with terms: "UI Automation DataGridView" the first result has an incomplete answer. Mike replies with what provider class (Obviously one that extends DataGridView)to create within the source, unfortunately I have no clue as to how to make UIA load that class a provider. If anyone can shed some clues, Phil and I will be extremely pleased!
EDIT
Your DataGridView should implement the interfaces in that link above. Once you do that the ControlType of Row will be ControlType.DataItem instead of ControlType.Custom. You can then use it how you'd use a DataGrid.
EDIT
This is what I ended up doing -
Creating a Custom DataGridView like below. Your datagridview could subclass this too. That will make it support ValuePattern and SelectionItemPattern.
public class CommonDataGridView : System.Windows.Forms.DataGridView,
IRawElementProviderFragmentRoot,
IGridProvider,
ISelectionProvider
{.. }
The complete code can be found # this msdn link.
Once I did that, I played a bit with visualUIVerify source. Here is how I can access the gridview's cell and change the value. Also, make note of the commented code. I didn't need that. It allows you to iterate through rows and cells.
private void _automationElementTree_SelectedNodeChanged(object sender, EventArgs e)
{
//selected currentTestTypeRootNode has been changed so notify change to AutomationTests Control
AutomationElementTreeNode selectedNode = _automationElementTree.SelectedNode;
AutomationElement selectedElement = null;
if (selectedNode != null)
{
selectedElement = selectedNode.AutomationElement;
if (selectedElement.Current.ClassName.Equals("AutomatedDataGrid.CommonDataGridViewCell"))
{
if(selectedElement.Current.Name.Equals("Tej")) //Current Value
{
var valuePattern = selectedElement.GetCurrentPattern(ValuePattern.Pattern) as ValuePattern;
valuePattern.SetValue("Jet");
}
//Useful ways to get patterns and values
//System.Windows.Automation.SelectionItemPattern pattern = selectedElement.GetCurrentPattern(System.Windows.Automation.SelectionItemPattern.Pattern) as System.Windows.Automation.SelectionItemPattern;
//var row = pattern.Current.SelectionContainer.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.ClassNameProperty, "AutomatedDataGrid.CommonDataGridViewRow", PropertyConditionFlags.IgnoreCase));
//var cells = row.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ClassNameProperty, "AutomatedDataGrid.CommonDataGridViewCell", PropertyConditionFlags.IgnoreCase));
//foreach (AutomationElement cell in cells)
//{
// Console.WriteLine("**** Printing Cell Value **** " + cell.Current.Name);
// if(cell.Current.Name.Equals("Tej")) //current name
// {
// var valuePattern = cell.GetCurrentPattern(ValuePattern.Pattern) as ValuePattern;
// valuePattern.SetValue("Suraj");
// }
//}
//Select Row
//pattern.Select();
//Get All Rows
//var arrayOfRows = pattern.Current.SelectionContainer.FindAll(TreeScope.Children, new PropertyCondition(AutomationElement.ClassNameProperty, "AutomatedDataGrid.CommonDataGridViewRow", PropertyConditionFlags.IgnoreCase));
//get cells
//foreach (AutomationElement row in arrayOfRows)
//{
// var cell = row.FindFirst(TreeScope.Children, new PropertyCondition(AutomationElement.ClassNameProperty, "AutomatedDataGrid.CommonDataGridViewCell", PropertyConditionFlags.IgnoreCase));
// var gridItemPattern = cell.GetCurrentPattern(GridItemPattern.Pattern) as GridItemPattern;
// // Row number.
// Console.WriteLine("**** Printing Row Number **** " + gridItemPattern.Current.Row);
// //Cell Automation ID
// Console.WriteLine("**** Printing Cell AutomationID **** " + cell.Current.AutomationId);
// //Cell Class Name
// Console.WriteLine("**** Printing Cell ClassName **** " + cell.Current.ClassName);
// //Cell Name
// Console.WriteLine("**** Printing Cell AutomationID **** " + cell.Current.Name);
//}
}
}
_automationTests.SelectedElement = selectedElement;
_automationElementPropertyGrid.AutomationElement = selectedElement;
}
Hopefully this helps.
I run into this problem too, after researching, i figured out that you can only access data grid view with LegacyIAccessible. However, .NET does not support this. So, here are the steps:
there are 2 versions of UIA: managed version and unmanaged version (native code). In some cases, the unmanaged version can do what the managed version can't.
=> as said here, using tblimp.exe (go along with Windows SDK) to generate a COM wrapper around the unmanaged UIA API, so we can call from C#.
I have done it here
we can use this to access DataGridView now but with very limited data, you can use Inspect.exe to see.
The code is:
using InteropUIA = interop.UIAutomationCore;
if (senderElement.Current.ControlType.Equals(ControlType.Custom))
{
var automation = new InteropUIA.CUIAutomation();
var element = automation.GetFocusedElement();
var pattern = (InteropUIA.IUIAutomationLegacyIAccessiblePattern)element.GetCurrentPattern(10018);
Logger.Info(string.Format("{0}: {1} - Selected", pattern.CurrentName, pattern.CurrentValue));
}
I'm using sgmlreader to convert HTML to XML. The output goes into a XmlDocument object, which I can then use the InnerText method to extract the plain text from the website. I'm trying to get the text to look as clean as possible, by removing any javascript. Looping through the xml and removing any <script type="text/javascript"> is easy enough, but I've hit a brick wall when any jquery or styling isn't encapsulated in any tags. Can anybody help me out?
Sample Code:
Step one:
Once I use the webclient class to download the HTML, I save it, then open the file with the text reader class.
Step two:
Create sgmlreader class and set the input stream to the text reader:
// setup SGMLReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;
// create document
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
Step three:
Once I have a xmldocument, I use the doc.InnerText to get my plain text.
Step four:
I can easy remove JavaScript tags like so:
XmlNodeList nodes = document.GetElementsByTagName("text/javascript");
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
Some stuff still slips through. Heres an example of an ouput for one particular website I'm scriping:
Criminal and Civil Enforcement | Fraud | Office of Inspector General | U.S. Department of Health and Human Services
#fancybox-right {
right:-20px;
}
#fancybox-left {
left:-20px;
}
#fancybox-right:hover span, #fancybox-right span
#fancybox-right:hover span, #fancybox-right span {
left:auto;
right:0;
}
#fancybox-left:hover span, #fancybox-left span
#fancybox-left:hover span, #fancybox-left span {
right:auto;
left:0;
}
#fancybox-overlay {
/* background: url('/connections/images/wc-overlay.png'); */
/* background: url('/connections/images/banner.png') center center no-repeat; */
}
$(document).ready(function(){
$("a[rel=photo-show]").fancybox({
'titlePosition' : 'over',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
});
$(".title-under").fancybox({
'titlePosition' : 'outside',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
})
});
That jquery and styling needs to be removed.
I just threw this together in LinqPad based on the html of this page and it properly removes the script and style tags.
void Main()
{
string htmlPath = #"C:\Users\Jschubert\Desktop\html\test.html";
var sgmlReader = new Sgml.SgmlReader();
var stringReader = new StringReader(File.ReadAllText(htmlPath));
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = stringReader;
// create document
var doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
List<XmlNode> nodes = doc.GetElementsByTagName("script")
.Cast<XmlNode>().ToList();
var byType = doc.SelectNodes("script[#type = 'text/javascript']")
.Cast<XmlNode>().ToList();
var style = doc.GetElementsByTagName("style").Cast<XmlNode>().ToList();
nodes.AddRange(byType);
nodes.AddRange(style);
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
doc.DumpFormatted();
stringReader.Close();
sgmlReader.Close();
}
Casting to XmlNode to use the generic list is not ideal, but I did it for the sake of space and demonstration.
Also, you shouldn't need both
doc.GetElementsByTagName("script") and
doc.SelectNodes("script[#type = 'text/javascript']").
Again, I did that for the sake of demonstration.
If you have other scripts and you only want to remove JavaScript, use the latter. If you're removing all script tags, use the first one. Or, use both if you want.
I've been trying to create a library to replace the MergeFields on a Word 2003 document, everything works fine, except that I lose the style applied to the field when I replace it, is there a way to keep it?
This is the code I'm using to replace the fields:
private void FillFields2003(string template, Dictionary<string, string> values)
{
object missing = Missing.Value;
var application = new ApplicationClass();
var document = new Microsoft.Office.Interop.Word.Document();
try
{
// Open the file
foreach (Field mergeField in document.Fields)
{
if (mergeField.Type == WdFieldType.wdFieldMergeField)
{
string fieldText = mergeField.Code.Text;
string fieldName = Extensions.GetFieldName(fieldText);
if (values.ContainsKey(fieldName))
{
mergeField.Select();
application.Selection.TypeText(values[fieldName]);
}
}
}
document.Save();
}
finally
{
// Release resources
}
}
I tried using the CopyFormat and PasteFormat methods in the selection, also using the get_style and set_style but to no exent.
Instead of using TypeText over the top of your selection use the the Result property of the Field:
if (values.ContainsKey(fieldName))
{
mergeField.Result = (values[fieldName]);
}
This will ensure any formatting in the field is retained.