How do I get the pages of a Word document? I want to make an XML file from a docx (or doc) file in this format:
<Page No="1">
<text>Text of page 1 here</text>
<footnote>text of footnote here</footnote>
</Page>
<Page No="2">
<text>Text of page 2 here</text>
<footnote>text of footnote here</footnote>
</Page>
....
Thanks
See "How to paginate a Word document from C# with Open XML".
Beyond that, if you can generate XML for it:
Create a new Word document,
set pagination in it,
save the document as XML, and look at the tags used for pagination. You might find something like footer1.xml.
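A minimal sketch of what reading those pagination tags could look like, assuming the main part (word/document.xml) has already been loaded into an XDocument. It treats w:lastRenderedPageBreak and w:br with w:type="page" as page boundaries; the first is only present if Word stored its last layout, so the split is approximate. The PageSplitter class, the ToPagedXml method and the Pages root element are hypothetical names, and footnotes (which live in a separate part) are not handled.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class PageSplitter
{
    // Main WordprocessingML namespace used by word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hypothetical helper: groups the text runs of document.xml by page break.
    public static XElement ToPagedXml(XDocument documentXml)
    {
        var pages = new List<StringBuilder> { new StringBuilder() };

        // Descendants() walks the elements in document order.
        foreach (XElement e in documentXml.Descendants())
        {
            bool isBreak =
                e.Name == W + "lastRenderedPageBreak" ||
                (e.Name == W + "br" && (string)e.Attribute(W + "type") == "page");

            if (isBreak)
            {
                // An explicit break is often followed by a rendered-break marker,
                // so only start a new page when the current one has text.
                // Assumption: genuinely blank pages are dropped by this rule.
                if (pages[pages.Count - 1].Length > 0)
                    pages.Add(new StringBuilder());
            }
            else if (e.Name == W + "t")
            {
                pages[pages.Count - 1].Append(e.Value);
            }
        }

        // "Pages" is an assumed root; the question only shows the Page elements.
        return new XElement("Pages",
            pages.Select((p, i) => new XElement("Page",
                new XAttribute("No", i + 1),
                new XElement("text", p.ToString()))));
    }
}
```

A docx file is a zip archive, so the XDocument can be obtained by opening the word/document.xml entry with System.IO.Compression and passing the stream to XDocument.Load.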
I think this is very useful:
http://www.shaunakelly.com/word/word-development/selecting-or-referring-to-a-page-in-the-word-object-model.html
I have an XML document like the one below and I need to render it as an HTML page. When I open the XML in IE, the HTML is rendered as expected, with styling. If I load the XML document from C# code and pass it to the HTML page, it just renders as plain text. What am I missing here?
XML
<?xml-stylesheet type='text/xsl' href='xslsheet.xsl'?>
<Document xmlns="org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
....
</Document>
C#
XDocument doc = XDocument.Load(@"C\SampleDocument.xml");
var result = doc.ToString();
Loading an XML document does just that - it loads the data. It won't process a transform directive.
To do that you need to do an XSLT Transform. You can find the classes to do that on MSDN.
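As a rough sketch of that step, the following uses XslCompiledTransform from System.Xml.Xsl to apply a stylesheet to an XML string and return the result as text. The XsltDemo class and Transform method are hypothetical names, and passing both inputs as strings is an assumption made to keep the example self-contained; in the question they would come from SampleDocument.xml and xslsheet.xsl.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class XsltDemo
{
    // Hypothetical helper: applies the given XSLT to the given XML
    // and returns the transformed output (for example HTML) as a string.
    public static string Transform(string xml, string xslt)
    {
        var transform = new XslCompiledTransform();

        // Compile the stylesheet.
        using (var xsltReader = XmlReader.Create(new StringReader(xslt)))
        {
            transform.Load(xsltReader);
        }

        // Run the transform and capture the output in memory.
        using (var xmlReader = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(xmlReader, null, output);
            return output.ToString();
        }
    }
}
```

Because the Document element in the question declares a default namespace, the stylesheet's match patterns need a prefix bound to that same namespace; otherwise the templates will not match anything.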
I am trying to generate a Word doc from a saved HTML file using an Open XML library.
If the HTML file does not contain an image, I can simply use the code below and write the text content to the Word doc.
HtmlDocument doc = new HtmlDocument();
doc.Load(fileName); // fileName is the HTML file
string Detail = string.Empty;
string webData = string.Empty;
HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
Detail = hcollection.InnerText;
But if the HTML file contains an embedded image, I am struggling to include that image in the Word doc.
Using hcollection.InnerText writes only the text and excludes the image.
When I use
HtmlNode hcollection = doc.DocumentNode.SelectSingleNode("//body");
Detail = hcollection.InnerHtml;
all the HTML tags get written to the Word doc, along with the image path in the tag:
<table border='0' width='100%' cellpadding='0' cellspacing='0' align='center'>
<tr><td valign='top' align="left">
<div style='width:100%'><div id="div_img">
<div>
<img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">
<span>Sample Text</span></div></div><br><br>Sample Text Content here<br><br> </div></td></tr></table>
How can I remove the HTML tags so that, instead of the path being shown like
<img src="http://www.myweb.com/web/img/2013/07/18/img_1.jpg">
the corresponding picture is loaded?
Please help.
You'll need to parse the HTML and translate it to OpenXML somehow.
I've used the open-source HtmlToOpenXml library, and it works well enough. It should handle images (inline, local or remote) and correctly insert them into the OpenXML document. I recently submitted a patch which was accepted, so the project is still somewhat active.
There are some limitations with the library though:
JavaScript (<script>), CSS (<style>), <meta> and other unsupported tags do not generate an error but are ignored.
It does handle inline style information, but it entirely ignores other CSS, which was something I needed. I ended up integrating some simple parsing of a single <style> element from another open-source project (jsonfx, using MIT license).
Note: handling multiple <style> elements, downloading CSS files, sorting out which style rules have precedence -- these are all problems which I did not address.
Actually, converting an HTML document to MS Word is a very complex task, and there are many cases besides image tags that need to be solved. The Open XML and HTML formats differ fundamentally.
If I were you, I would look for third-party tools for that. It would be cheaper to pay for them than to spend weeks investigating and learning all aspects of the task, writing the code, and then fixing multiple bugs.
Personally, I used the Aspose.Words library for that. It worked perfectly fine, but maybe you want to try another one.
I found a similar question that answers this the jQuery way:
Remove tag but leave contents - jQuery/Javascript
But I need to know how to do this in C#. Is there any API in C# that I can use to strip out the tags and leave the plain text?
From:
some content and more and more content
To:
some content and more and more content
Thanks.
Use HtmlAgilityPack ( http://htmlagilitypack.codeplex.com/ ), load your HTML document into a HtmlDocument instance and query document.DocumentNode.InnerText.
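A short sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced. TagStripper and StripTags are hypothetical names; the DeEntitize call is an optional extra step that decodes entities such as &amp;amp; in the extracted text.

```csharp
using HtmlAgilityPack;

public static class TagStripper
{
    // Hypothetical helper: returns the text content of an HTML fragment
    // with all tags removed.
    public static string StripTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText concatenates the text nodes under the document node;
        // DeEntitize converts HTML entities back to plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

To load from a file instead of a string, doc.Load(fileName) can be used in place of doc.LoadHtml(html).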
I have a C# Form with a WebBrowser object.
This object contains an HTML document.
There is a link in that document that has no markers (no id and no name).
How can I access this element?
I tried to use this:
webBrowser1.Document.GetElementsByTagName("a")[n]
But it is not very useful, because if a new link is added to the page, I'll need to rebuild the whole program.
I also cannot loop through the document or take a substring of Document.ToString(), because then I cannot click the link.
It would be great if you could give me some advice.
In this kind of situation the best idea is always to find an "anchor", meaning a place in the document that never changes.
Let's say that the link "dada" doesn't have an ID or name. Then the closest you can get is to check whether the parent of the element you're looking for has an ID.
<div id="parentDiv">
Some text
Some other stuff
The link you're looking for
</div>
That way you can get parentDiv, which you know doesn't change, and then the A tag inside that parent. That should stay stable unless the website completely changes its structure, which is one of the problems with parsing external HTML pages.
Shai.
You can use Html Agility Pack and select the links with XPath:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(/* url */);
// Select every <a> element that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
// do stuff
}
You need some information that identifies the link. It may be an id, a name, or the text. If the text is always the same, check the inner text of the link.
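The ideas in these answers (a stable parent as an anchor, XPath selection, and matching on the link text) can be combined. The sketch below does so with Html Agility Pack on an HTML string; AnchoredLinkFinder and FindHref are hypothetical names, and the parent is assumed to be a div. It only locates the link, it cannot click it. In the WebBrowser control the equivalent lookup would go through webBrowser1.Document.GetElementById and GetElementsByTagName on the returned element.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class AnchoredLinkFinder
{
    // Hypothetical helper: finds the link with the given text inside the
    // div with the given id and returns its href, or null if not found.
    public static string FindHref(string html, string parentId, string linkText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Only consider <a href> elements below the anchor element.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes(
            "//div[@id='" + parentId + "']//a[@href]");

        // SelectNodes returns null (not an empty list) when nothing matches.
        if (links == null)
            return null;

        HtmlNode match = links.FirstOrDefault(a => a.InnerText.Trim() == linkText);
        return match == null ? null : match.GetAttributeValue("href", null);
    }
}
```

Matching on the text rather than on a numeric index means that new links added elsewhere on the page do not change which element is found.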
Is there a way to get the XML presented in an XMLFormView? I'm trying to create a custom web part with a "Save as PDF" button for an InfoPath form. The idea is to combine the XML with the form's XSL and turn the resulting HTML into a PDF, which is presented to the user as a popup.
Because it is to be presented as a popup, using Workflows is not an option.
http://msdn.microsoft.com/en-us/library/microsoft.office.infopath.server.controls.xmlformview.xmllocation.aspx
This property will give you the URL of the base XML file. You can use a StreamReader to read this XML.
Take a look at my codeplex project http://ip2html.codeplex.com/
It allows you to generate HTML from the given (InfoPath) XML & (XMLFormView) XSLT.
We ended up using the XmlFormHostItem.NotifyHost method to send HTML to a custom web part in a button-clicked event, which converted the HTML to PDF using Winnovative HTML to PDF converter.
HTML generation from InfoPath code-behind:
var formData = new XmlDocument();
var xslt = new XslCompiledTransform(true);
// Load the form data
formData.LoadXml(MainDataSource.CreateNavigator().InnerXml);
// Extract the stylesheet from the package
xslt.Load(ExtractFromPackage("Print.xsl")); // (uses Template.OpenFileFromPackage(string fileName) to get xsl)
// Perform XSL-transformation
// [...]
// Send HTML to web part
this.NotifyHost(formData.InnerXml);
One drawback of this method is that the NotifyHost event only fires once per form, so if the user clicks 'Save as PDF' and then cancels, he must reload the form in order to be able to save as PDF.