Removing jQuery and CSS from an XmlDocument - C#

I'm using SgmlReader to convert HTML to XML. The output goes into an XmlDocument object, and I then use its InnerText property to extract the plain text from the website. I'm trying to get the text as clean as possible by removing any JavaScript. Looping through the XML and removing any <script type="text/javascript"> element is easy enough, but I've hit a brick wall where the jQuery or styling doesn't seem to be encapsulated in any tags. Can anybody help me out?
Sample Code:
Step one:
Once I use the WebClient class to download the HTML, I save it, then open the file with the TextReader class.
Step two:
Create the SgmlReader and set its input stream to the text reader:
// setup SGMLReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = reader;
// create document
doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
Step three:
Once I have an XmlDocument, I use doc.InnerText to get my plain text.
Step four:
I can easily remove JavaScript tags like so:
XmlNodeList nodes = document.GetElementsByTagName("text/javascript");
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
Some stuff still slips through. Here's an example of the output for one particular website I'm scraping:
Criminal and Civil Enforcement | Fraud | Office of Inspector General | U.S. Department of Health and Human Services
#fancybox-right {
right:-20px;
}
#fancybox-left {
left:-20px;
}
#fancybox-right:hover span, #fancybox-right span
#fancybox-right:hover span, #fancybox-right span {
left:auto;
right:0;
}
#fancybox-left:hover span, #fancybox-left span
#fancybox-left:hover span, #fancybox-left span {
right:auto;
left:0;
}
#fancybox-overlay {
/* background: url('/connections/images/wc-overlay.png'); */
/* background: url('/connections/images/banner.png') center center no-repeat; */
}
$(document).ready(function(){
$("a[rel=photo-show]").fancybox({
'titlePosition' : 'over',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
});
$(".title-under").fancybox({
'titlePosition' : 'outside',
'overlayColor' : '#000',
'overlayOpacity' : 0.9
})
});
That jQuery and styling needs to be removed.

I just threw this together in LINQPad based on the HTML of this page, and it properly removes the script and style tags.
void Main()
{
string htmlPath = @"C:\Users\Jschubert\Desktop\html\test.html";
var sgmlReader = new Sgml.SgmlReader();
var stringReader = new StringReader(File.ReadAllText(htmlPath));
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
sgmlReader.InputStream = stringReader;
// create document
var doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);
List<XmlNode> nodes = doc.GetElementsByTagName("script")
.Cast<XmlNode>().ToList();
var byType = doc.SelectNodes("script[@type = 'text/javascript']")
.Cast<XmlNode>().ToList();
var style = doc.GetElementsByTagName("style").Cast<XmlNode>().ToList();
nodes.AddRange(byType);
nodes.AddRange(style);
for (int i = nodes.Count - 1; i >= 0; i--)
{
nodes[i].ParentNode.RemoveChild(nodes[i]);
}
doc.DumpFormatted();
stringReader.Close();
sgmlReader.Close();
}
Casting to XmlNode to use the generic list is not ideal, but I did it for the sake of space and demonstration.
Also, you shouldn't need both
doc.GetElementsByTagName("script") and
doc.SelectNodes("script[@type = 'text/javascript']").
Again, I did that for the sake of demonstration.
If you have other kinds of script and only want to remove JavaScript, use the latter. If you're removing all script tags, use the former. Or use both if you want.
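As a minimal sketch of the same idea using only System.Xml: a single union XPath collects every script and style element once, so no node is queued for removal twice. The class and method names here (ScriptStyleStripper, Strip) are hypothetical, and the sketch assumes the document has already been loaded (for example through SgmlReader, as above) with elements that carry no XML namespace.

```csharp
using System.Linq;
using System.Xml;

public static class ScriptStyleStripper
{
    // Removes every <script> and <style> element from an already-loaded document.
    public static void Strip(XmlDocument doc)
    {
        // The union returns each matching node once; ToList() takes a snapshot
        // so the document can be modified while we iterate.
        var toRemove = doc.SelectNodes("//script | //style")
                          .Cast<XmlNode>()
                          .ToList();

        foreach (XmlNode node in toRemove)
        {
            // ParentNode is null if an ancestor was already detached.
            node.ParentNode?.RemoveChild(node);
        }
    }
}
```

Calling ScriptStyleStripper.Strip(doc) before reading doc.InnerText should leave only the text of the remaining elements.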

Related

Aspose.Words ImportNode ignores font formatting when appending child

I am currently using Aspose.Words to open a document, pull the content between a bookmark start and a bookmark end, and then place that content into another document. The issue I'm having is that the ImportNode method imports the content into my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was in the original document to 12pt.
I'm obtaining the content from the bookmark with the Aspose ExtractContent method.
Because ImportNode was stripping my font formatting, I tried saving each node to an HTML string using ToString(HtmlSaveOptions) instead. This mostly works, but it strips out the returns from the Word document, so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format:
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return to the Word document.
Here is the code I'm using. Please forgive the comments etc.; they are left over from my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
//   is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains(" ") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the link is not an email address, include the target attribute.
html = new Regex($@"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Insert the HTML string into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure things look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}
Saving each node to HTML and then inserting it into the destination document is not a good idea, because not all nodes can be properly saved to HTML and some formatting can be lost in the Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles. In this case the styles and defaults of the destination document are used, so the font might change. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting, this must be a bug, and you should report it to the Aspose.Words staff in the support forum.

How to inject styles to a page using IE addon

I am trying to convert a content script I wrote for Google Chrome into an IE add-on, mainly using the code in this answer.
I needed to inject a stylesheet, and I found a way to do it using JavaScript. I thought I might be able to do the same using C#. Here's my code:
[ComVisible(true)]
[Guid(/* replaced */)]
[ClassInterface(ClassInterfaceType.None)]
public class SimpleBHO: IObjectWithSite
{
private WebBrowser webBrowser;
void webBrowser_DocumentComplete(object pDisp, ref object URL)
{
var document2 = webBrowser.Document as IHTMLDocument2;
var document3 = webBrowser.Document as IHTMLDocument3;
// trying to add a '<style>' element to the header. this does not work.
var style = document2.createElement("style");
style.innerHTML = ".foo { background-color: red; }";// this line is the culprit!
style.setAttribute("type", "text/css");
var headCollection = document3.getElementsByTagName("head");
var head = headCollection.item(0, 0) as IHTMLDOMNode;
var result = head.appendChild(style as IHTMLDOMNode);
// trying to replace an element in the body. this part works if
// adding style succeeds.
var journalCollection = document3.getElementsByName("elem_id");
var journal = journalCollection.item(0, 0) as IHTMLElement;
journal.innerHTML = "<div class=\"foo\">Replaced!</div>";
// trying to execute some JavaScript. this part works as well if
// adding style succeeds.
document2.parentWindow.execScript("alert('Hi!')");
}
int IObjectWithSite.SetSite(object site)
{
if (site != null)
{
webBrowser = (WebBrowser)site;
webBrowser.DocumentComplete += new DWebBrowserEvents2_DocumentCompleteEventHandler(webBrowser_DocumentComplete);
}
else
{
webBrowser.DocumentComplete -= new DWebBrowserEvents2_DocumentCompleteEventHandler(webBrowser_DocumentComplete);
webBrowser = null;
}
return 0;
}
/* some code (e.g.: IObjectWithSite.SetSite) omitted to improve clarity */
}
If I just comment out the following line...
style.innerHTML = ".foo { background-color: red; }";
... the rest of the code executes perfectly (The element #elem_id is replaced and the JavaScript I injected is executed).
What am I doing wrong when trying to inject the stylesheet? Is this even possible?
EDIT: I found out that the site I'm trying to inject CSS into requests Document Mode 5, and when Compatibility View is disabled, my code works perfectly. But how do I make it work even when Compatibility View is enabled?
After a lot of experimenting, I found that the only failsafe way to inject stylesheets is to inject them using JavaScript, executed with IHTMLWindow2.execScript().
I used the following JavaScript:
var style = document.createElement('style');
document.getElementsByTagName('head')[0].appendChild(style);
var sheet = style.styleSheet || style.sheet;
if (sheet.insertRule) {
sheet.insertRule('.foo { background-color: red; }', 0);
} else if (sheet.addRule) {
sheet.addRule('.foo', 'background-color: red;', 0);
}
The above JavaScript was executed in the following fashion:
// This code is written inside a BHO written in C#
var document2 = webBrowser.Document as IHTMLDocument2;
document2.parentWindow.execScript(@"
/* Here, we have the same JavaScript mentioned above */
var style = docu....
...
}");
document2.parentWindow.execScript("alert('Hi!')");
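If the rule text varies, the script string can be assembled in C# before it is handed to execScript. The following is only a sketch under stated assumptions: StyleInjectionScript, Build and EscapeJs are hypothetical names, the generated JavaScript mirrors the insertRule/addRule snippet above, and the escaping covers just backslashes, single quotes and line breaks.

```csharp
public static class StyleInjectionScript
{
    // Builds a JavaScript snippet that appends a <style> element to <head>
    // and adds one rule, using insertRule where available and addRule otherwise.
    public static string Build(string selector, string declarations)
    {
        string sel = EscapeJs(selector);
        string decl = EscapeJs(declarations);

        return
            "(function(){" +
            "var style=document.createElement('style');" +
            "document.getElementsByTagName('head')[0].appendChild(style);" +
            "var sheet=style.styleSheet||style.sheet;" +
            "if(sheet.insertRule){sheet.insertRule('" + sel + " { " + decl + " }',0);}" +
            "else if(sheet.addRule){sheet.addRule('" + sel + "','" + decl + "',0);}" +
            "})();";
    }

    // Makes a value safe to place inside a single-quoted JavaScript string literal.
    private static string EscapeJs(string value)
    {
        return value
            .Replace("\\", "\\\\")
            .Replace("'", "\\'")
            .Replace("\r", " ")
            .Replace("\n", " ");
    }
}
```

The result of StyleInjectionScript.Build(".foo", "background-color: red;") could then be passed to document2.parentWindow.execScript(...) in place of the hard-coded string.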

How to insert custom page number in Aspose.Words

I want to add custom page numbers (like 1/2, 2/2) to a Word document using Aspose.Words, but I couldn't find any sample for C#. I tried to overwrite the footer, but I couldn't give the page numbers a format.
Please help!
Thanks!
Edit:
After I tried the first answer, it worked the way I wanted, but another problem came up. I am adding child documents to the main document, and I can only format the main document's page numbers. The child documents still have ordinary page numbers.
Here is a sample of the code:
public void AddChildDocs (System.IO.Stream parentStream, List<System.IO.Stream> childStreams)
{
doc = new Aspose.Words.Document(parentStream);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
doc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
foreach (var item in childStreams)
{
Aspose.Words.Document childDoc = new Aspose.Words.Document(item);
if (Items.Count > 0)
{
WordReplacer evaluator = new WordReplacer(this);
childDoc.Range.Replace(new Regex(ReplaceRegex), evaluator, false);
}
doc.AppendDocument(childDoc, ImportFormatMode.KeepSourceFormatting);
}
DocumentBuilder builder = new DocumentBuilder(doc);
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
builder.InsertField("PAGE", "");
builder.Write(" / ");
builder.InsertField("NUMPAGES", "");
}
You can get the idea from this page in the Aspose documentation. Below is the sample code taken from that page, limited to the part related to custom page numbers.
String src = dataDir + "Page numbers.docx";
String dst = dataDir + "Page numbers_out.docx";
// Create a new document or load from disk
Aspose.Words.Document doc = new Aspose.Words.Document(src);
// Create a document builder
Aspose.Words.DocumentBuilder builder = new DocumentBuilder(doc);
// Go to the primary footer
builder.MoveToHeaderFooter(HeaderFooterType.FooterPrimary);
// Add fields for current page number
builder.InsertField("PAGE", "");
// Add any custom text
builder.Write(" / ");
// Add field for total page numbers in document
builder.InsertField("NUMPAGES", "");
// Import new document
Aspose.Words.Document newDoc = new Aspose.Words.Document(dataDir + "new.docx");
// Link the header/footer of first section to previous document
newDoc.FirstSection.HeadersFooters.LinkToPrevious(true);
doc.AppendDocument(newDoc, ImportFormatMode.UseDestinationStyles);
// Save the document
doc.Save(dst);
I work with Aspose as Developer Evangelist.
Here is code to set a custom page number in Aspose.Words. Once you set the page margins and the starting page number, numbering moves on to the next page automatically whenever the current page area is filled. Try this:
section.PageSetup.PaperSize = PaperSize.Letter;
section.PageSetup.LeftMargin = 10;
section.PageSetup.RightMargin = 10;
section.PageSetup.TopMargin = 0;
section.PageSetup.BottomMargin = 0;
section.PageSetup.HeaderDistance = 50;
section.PageSetup.FooterDistance = 50;
section.PageSetup.Borders.Color = Color.Black;
section.PageSetup.PageStartingNumber = 1;

How can I parse this HTML to get the content I want?

I am currently trying to parse an HTML document to retrieve all of the footnotes inside it; the document contains dozens and dozens of them. I can't figure out the expressions to use to extract all of the content I want. The thing is, the classes (e.g. "calibre34") are randomized in every document. The only way to see where the footnotes are located is to search for "hide"; the footnote text always comes afterwards and is closed with a </td> tag. Below is an example of one of the footnotes in the HTML document. All I want is the text. Any ideas? Thanks guys!
<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>
Use HTMLAgilityPack to load the HTML document and then extract the footnotes with this XPath:
//td[contains(., '[hide]')]/following-sibling::td[1]
Basically, it first selects every td node whose content includes [hide] (in the sample the marker sits inside a nested span and a, so comparing the td's own text() against '[hide]' would not match), and then moves to the td immediately following each one. Once you have this collection of nodes, you can extract their inner text (in C#, with the support provided in HtmlAgilityPack).
How about using MSHTML to parse the HTML source?
Here is the demo code. Enjoy.
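To try the XPath without any third-party package, here is a sketch that runs it through System.Xml instead of HtmlAgilityPack. It assumes the fragment is well-formed enough to load as XML, which real-world HTML often is not; FootnoteExtractor and Extract are hypothetical names. The expression uses contains(., '[hide]') because, in the sample, the [hide] marker is nested inside span and a elements rather than being the td's own text.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class FootnoteExtractor
{
    // Returns the text of each <td> that immediately follows a <td> containing "[hide]".
    public static List<string> Extract(XmlDocument doc)
    {
        var footnotes = new List<string>();

        // contains(., ...) tests the cell's full string value, including nested elements.
        XmlNodeList cells = doc.SelectNodes(
            "//td[contains(., '[hide]')]/following-sibling::td[1]");

        foreach (XmlNode cell in cells)
        {
            footnotes.Add(cell.InnerText.Trim());
        }

        return footnotes;
    }
}
```

The class names on the cells play no part in the query, so it does not matter that they change from document to document.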
public class CHtmlPraseDemo
{
private string strHtmlSource;
public mshtml.IHTMLDocument2 oHtmlDoc;
public CHtmlPraseDemo(string url)
{
GetWebContent(url);
oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
oHtmlDoc.write(strHtmlSource);
}
public List<String> GetTdNodes(string TdClassName)
{
List<String> listOut = new List<string>();
IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
IHTMLElementCollection iec = (IHTMLElementCollection)ie.getElementsByTagName("td");
foreach (IHTMLElement item in iec)
{
if (item.className == TdClassName)
{
listOut.Add(item.innerHTML);
}
}
return listOut;
}
void GetWebContent(string strUrl)
{
WebClient wc = new WebClient();
strHtmlSource = wc.DownloadString(strUrl);
}
}
class Program
{
static void Main(string[] args)
{
CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");
Console.Write(oH.oHtmlDoc.title);
List<string> l = oH.GetTdNodes("x");
foreach (string n in l)
{
Console.WriteLine("new td");
Console.WriteLine(n.ToString());
}
Console.Read();
}
}

Select elements added to the DOM by a script

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers
Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
public HtmlNode FindObjectElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int objectNodeLocation = javascript.IndexOf("<object");
if (objectNodeLocation != -1)
{
string htmlStart = javascript.Substring(objectNodeLocation);
int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
if (objectNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var objectDoc = new HtmlDocument();
objectDoc.LoadHtml(unescaped);
HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
return objectNode;
}
}
}
return null;
}
public HtmlNode FindEmbedElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
if (approxEmbedNodeLocation != -1)
{
string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
int embedNodeEndLocation = htmlStart.IndexOf(">\";");
if (embedNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var embedDoc = new HtmlDocument();
embedDoc.LoadHtml(unescaped);
HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
return videoEmbedNode;
}
}
}
return null;
}
protected HtmlNodeCollection FindScriptNodes(string url)
{
var doc = new HtmlDocument();
WebRequest request = WebRequest.Create(url);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
doc.Load(stream);
}
HtmlNode root = doc.DocumentNode;
HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
return scriptNodes;
}
static string Unescape(string htmlFromJavascript)
{
// The JavaScript has escaped all of its HTML using backslashes. We need
// to reverse this.
// DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
// of this code. If you could improve it, please, I beg of you to do so. Personally,
// I tested it on a grand total of three inputs. It worked for those, at least.
return Regex.Replace(htmlFromJavascript, @"\\(.)", UnescapeFromBeginning);
}
static string UnescapeFromBeginning(Match match)
{
string text = match.ToString();
if (text.StartsWith("\\"))
{
return text.Substring(1);
}
return text;
}
}
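For reference, the unescaping step can probably be written more compactly with a substitution pattern instead of a match evaluator. This is a sketch, not a replacement tested against real YouTube pages: JsHtmlUnescaper is a hypothetical name, and the approach treats every backslash-plus-character pair the same way, so JavaScript escapes such as \n or \u0041 would lose their meaning rather than be decoded.

```csharp
using System.Text.RegularExpressions;

public static class JsHtmlUnescaper
{
    // Drops the backslash from every backslash-escaped character,
    // e.g. <\/object> becomes </object> and \" becomes ".
    public static string Unescape(string htmlFromJavascript)
    {
        // "$1" re-emits only the captured character that followed the backslash.
        return Regex.Replace(htmlFromJavascript, @"\\(.)", "$1");
    }
}
```

The unescaped string can then be loaded with HtmlDocument.LoadHtml, as the scraper above does.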
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
static void Main(string[] args)
{
var scraper = new YouTubeScraper();
HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
Console.WriteLine("David After Dentist:");
Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
Console.WriteLine("Drunk History:");
Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
Console.WriteLine();
HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
Console.WriteLine("Jessica's Daily Affirmation:");
Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
Console.WriteLine("Jazzercise - Move your Boogie Body:");
Console.WriteLine(jazzerciseObjectNode.OuterHtml);
Console.WriteLine();
Console.Write("Finished! Hit Enter to quit.");
Console.ReadLine();
}
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.
