Parsing a website data c# - HTML agility pack

Parsing a website data c# - HTML agility pack - c#

I tried to parse the results from this website.
http://www.nokia.com/in-en/store-locator/?action=storeSearch&qt=madurai&tags=Nokia_Recycling_Point&country=IN
I need specifically the contents of the div class 'result-wrapper'. That is, all the 'h4', 'category' and 'description' span classes. The following is the code I could reach upto, later on, I do not know to parse that particular div. I need help to get all the contents of that div class.
protected async override void OnNavigatedTo(NavigationEventArgs e)
{
base.OnNavigatedTo(e);
string htmlPage = "";
using (var client = new HttpClient())
{
try
{
htmlPage = await client.GetStringAsync("http://www.nokia.com/in-en/store-locator/?action=storeSearch&qt=madurai&tags=Nokia_Recycling_Point&country=IN");
}
catch (HttpRequestException exc) { }
}
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(htmlPage);

Well, you can try:
var resultWrapperDivs = htmlDocument.DocumentNode.SelectNodes("//div[#class='result-wrapper']");
foreach (var resultWrapperDiv in resultWrapperDivs)
{
// Do stuff with each div.
}
Also, to get a specific content/"html tag" you can take each resultWrapperDiv alone and get also its children nodes (resultWrapperDiv.SelectSingleNode or resultWrapperDiv.SelectNodes)

Related

Apose.Words ImportNode ignores font formatting when appendingchild

I am currently using Aspose.Words to open a document, pull content between a bookmark start and a bookmark end and then place that content into another document. The issue that I'm having is that when using the ImportNode method is imports onto my document but changes all of the fonts from Calibri to Times New Roman and changes the font size from whatever it was on the original document to 12pt.
The way I'm obtaining the content from the bookmark is by using the Aspose ExtractContent method.
Because I'm having the issue with the ImportNode stripping my font formatting I tried making some adjustments and saving each node to an HTML string using ToString(HtmlSaveOptions). This works mostly but the problem with this is it is stripping out my returns on the word document so none of my text has the appropriate spacing. My returns end up coming in as HTML in the following format
"<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>"
When using
DocumentBuilder.InsertHtml("<p style=\"margin-top:0pt; margin-bottom:8pt; line-height:108%; font-size:11pt\"><span style=\"font-family:Calibri; display:none; -aw-import:ignore\"> </span></p>");
it does not correctly add the return on the word document.
Here is the code I'm using, please forgive the comments etc... this has been my attempts at correcting this.
public async Task<string> GenerateHtmlString(Document srcDoc, ArrayList nodes)
{
// Create a blank document.
Document dstDoc = new Document();
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Open"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
// Remove the first paragraph from the empty document.
dstDoc.FirstSection.Body.RemoveAllChildren();
// Create a new Builder for the temporary document that gets generated with the header or footer data.
// This allows us to control font and styles separately from the main document being built.
var newBuilder = new DocumentBuilder(dstDoc);
Aspose.Words.Saving.HtmlSaveOptions htmlSaveOptions = new Aspose.Words.Saving.HtmlSaveOptions();
htmlSaveOptions.ExportImagesAsBase64 = true;
htmlSaveOptions.SaveFormat = SaveFormat.Html;
htmlSaveOptions.ExportFontsAsBase64 = true;
htmlSaveOptions.ExportFontResources = true;
htmlSaveOptions.ExportTextBoxAsSvg = true;
htmlSaveOptions.ExportRoundtripInformation = true;
htmlSaveOptions.Encoding = Encoding.UTF8;
// Obtain all the links from the source document
// This is used later to add hyperlinks to the html
// because by default extracting nodes using Aspose
// does not pull in the links in a usable way.
var srcDocLinks = srcDoc.Range.Fields.GroupBy(x => x.DisplayResult).Select(x => x.First()).Where(x => x.Type == Aspose.Words.Fields.FieldType.FieldHyperlink).Distinct().ToList();
var childNodes = nodes.Cast<Node>().Select(x => x).ToList();
var oldBuilder = new DocumentBuilder(srcDoc);
oldBuilder.MoveToBookmark("Header");
var allchildren = oldBuilder.CurrentParagraph.Runs;
var allChildNodes = childNodes[0].Document.GetChildNodes(NodeType.Any, true);
var headerText = allChildNodes[0].Range.Bookmarks["Header"].BookmarkStart.GetText();
foreach (Node node in nodes)
{
var html = node.ToString(htmlSaveOptions);
try
{
//   is used by aspose because it works in XML
// If we see this character and the text of the node is \r we need to insert a break
if (html.Contains(" ") && node.Range.Text == "\r")
{
newBuilder.InsertHtml(html, false);
// Change the node into an HTML string
/*var htmlString = node.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
// Get all the child nodes of the html document
var allChildNodes = tempHtmlLinkDoc.DocumentNode.SelectNodes("//*");
// Loop over all child nodes so we can make sure we apply the correct font family and size to the break.
foreach (var childNode in allChildNodes)
{
// Get the style attribute from the child node
var childNodeStyles = childNode.GetAttributeValue("style", "").Split(';');
foreach (var childNodeStyle in childNodeStyles)
{
// Apply the font name and size to the new builder on the document.
if (childNodeStyle.ToLower().Contains("font-family"))
{
newBuilder.Font.Name = childNodeStyle.Split(':')[1].Trim();
}
if (childNodeStyle.ToLower().Contains("font-size"))
{
newBuilder.Font.Size = Convert.ToDouble(childNodeStyle.Split(':')[1]
.Replace("pt", "")
.Replace("px", "")
.Replace("em", "")
.Replace("rem", "")
.Replace("%", "")
.Trim());
}
}
}
// Insert the break with the corresponding font size and name.
newBuilder.InsertBreak(BreakType.ParagraphBreak);*/
}
else
{
// Loop through the source document links so the link can be applied to the HTML.
foreach (var srcDocLink in srcDocLinks)
{
if (html.Contains(srcDocLink.DisplayResult))
{
// Now that we know the html string has one of the links in it we need to get the address from the node.
var linkAddress = srcDocLink.Start.NextSibling.GetText().Replace(" HYPERLINK \"", "").Replace("\"", "");
//Convert the node into an HTML String so we can get the correct font color, name, size, and any text decoration.
var htmlString = srcDocLink.Start.NextSibling.ToString(SaveFormat.Html);
var tempHtmlLinkDoc = new HtmlDocument();
tempHtmlLinkDoc.LoadHtml(htmlString);
var linkStyles = tempHtmlLinkDoc.DocumentNode.ChildNodes[0].GetAttributeValue("style", "").Split(';');
var linkStyleHtml = "";
foreach (var linkStyle in linkStyles)
{
if (linkStyle.ToLower().Contains("color"))
{
linkStyleHtml += $"color:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-family"))
{
linkStyleHtml += $"font-family:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("font-size"))
{
linkStyleHtml += $"font-size:{linkStyle.Split(':')[1].Trim()};";
}
if (linkStyle.ToLower().Contains("text-decoration"))
{
linkStyleHtml += $"text-decoration:{linkStyle.Split(':')[1].Trim()};";
}
}
if (linkAddress.ToLower().Contains("mailto:"))
{
// Since the link has mailto included don't add the target attribute to the link.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
else
{
// Since the links is not an email include the target attribute.
html = new Regex($#"\b{srcDocLink.DisplayResult}\b").Replace(html, $"{srcDocLink.DisplayResult}");
//html = html.Replace(srcDocLink.DisplayResult, $"{srcDocLink.DisplayResult}");
}
}
}
// Inseret the HTML String into the temporary document.
newBuilder.InsertHtml(html, false);
}
}
catch (Exception ex)
{
throw;
}
}
// This is just for debugging/troubleshooting purposes and to make sure thigns look correct
string tempDocxPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.docx");
dstDoc.Save(tempDocxPath);
// We generate this HTML file then load it back up and pass the DocumentNode.OuterHtml back to the requesting method.
ELSLogHelper.InsertInfoLog(_callContext, ELSLogHelper.AsposeLogMessage("Save"), MethodBase.GetCurrentMethod()?.Name, MethodBase.GetCurrentMethod().DeclaringType?.Name, Environment.StackTrace);
string tempHtmlPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "temp", "TemporaryCompiledDocument.html");
dstDoc.Save(tempHtmlPath, htmlSaveOptions);
var tempHtmlDoc = new HtmlDocument();
tempHtmlDoc.Load(tempHtmlPath);
var htmlText = tempHtmlDoc.DocumentNode.OuterHtml;
// Clean up our mess...
if (File.Exists(tempDocxPath))
{
File.Delete(tempDocxPath);
}
if (File.Exists(tempHtmlPath))
{
File.Delete(tempHtmlPath);
}
// Return the generated HTML string.
return htmlText;
}

Saving each node to HTML and then inserting them into the destination document is not a good idea. Because not all nodes can be properly saved to HTML and some formatting can be lost after Aspose.Words DOM -> HTML -> Aspose.Words DOM roundtrip.
Regarding the original issue, the problem might occur because you are using ImportFormatMode.UseDestinationStyles, in this case styles and default of the destination document are used and font might be changed. If you need to keep the source document formatting, you should use ImportFormatMode.KeepSourceFormatting.
If the problem occurs even with ImportFormatMode.KeepSourceFormatting this must be a bug and you should report this to Aspose.Words staff in the support forum.

No Errors yet nothing in Console?

Just after some help with some code i've written to extract data using HttpClient.
I am new to writing code so can't find my problem. Could someone pls help me troubleshoot this.
I expect to write the data of the table i'm scraping to the console line.
Any help appreciated
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using HtmlAgilityPack;
namespace weatherCheck
{
class Program
{
private static void Main(string[] args)
{
GetHtmlAsync();
Console.ReadLine();
}
protected static async void GetHtmlAsync()
{
var url = "https://www.weatherzone.com.au/vic/melbourne/melbourne";
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(url);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
//grab the rain chance, rain in mm and date
var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
.Where(table => table.Attributes.Contains("id"))
, table => table.Attributes["id"].Value == "forecast-table");
List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
foreach (var row in rows)
{
try
{
if (MyTable != null)
{
Console.WriteLine(MyTable.GetAttributeValue("forecast-table", " "));
}
}
catch (Exception)
{
}
}
}
}
}

I used your code to look up the values but it didnt produce anything for me either. When i look at the htmlDocument.DocumentNode.OuterHtml to view the entire Html it is scraping, I dont see anything in the document that reflects an attribute forecast-table.
Also, you are validating MyTable each time you loop through rows. You should validate row != null along with printing attribute from row.
var MyTable = Enumerable.FirstOrDefault(htmlDocument.DocumentNode.Descendants("table")
.Where(table => table.Attributes.Contains("id")), table => table.Attributes["id"].Value == "forecast-table");
List<HtmlNode> rows = htmlDocument.DocumentNode.SelectNodes("//tr").ToList();
foreach (var row in rows)
{
try
{
if (row != null) // Here, it should be row, not My Table along with MyTable in line below.
Console.WriteLine(row.GetAttributeValue("forecast-table", " "));
}
catch (Exception)
{
}
}
Problem is
You also should know that Html you view by using Dev Tools on chrome is not the same as the one you see in HtmlAgilityPack. Chrome renders the page after executing the scripts where HtmlAgilityPack simply provides you with default HTML of the page. This is the reason why you are not able to get the value of forecast-table.

From Doc, For GetAttributeValue(name,def) it'll return def if the attribute not found.
So, it'll print ""(empty string if the attribute not found in your case)
remove async and await as you already calling httpClient.GetStringAsync(url);
var html =httpClient.GetStringAsync(url).Result;
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
And print,
Console.WriteLine(MyTable.GetAttributeValue("forecast-table","SOME_TEXT_HERE").ToString());

C# Looking to get obtain the <div> value using HtmlAgilityPack but receiving a System.NullReferenceException

I am trying to get the value of the div class "darkgreen" which is 46.98. I tried the following code but am getting a Null exception.
Below is the code I am trying:
private void button1_Click(object sender, EventArgs e)
{
var doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlAgilityPack.HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div[#class='darkgreen']");
foreach (HtmlAgilityPack.HtmlNode node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
If I run the same code but with doc.DocumentNode.SelectNodes("//div[#class='rgt-hdr colorize']") it does pull the header data with no error.
I am thinking that maybe child nodes may be a solution but I am not sure as I am unable to get it to work still.

Your problem is that the HTML your looking it is created by a javascript. And the HTML you load into your Document variable is pre-what-ever is created by the javascript. If you look at the page source in your web browser you will see the exact HTML that gets loaded in your HtmlDocument variable.
The example below will give you the data(JSON) that is used to create the table. I don't know whether that is enough for whatever you're trying to do.
public static void Main(string[] args)
{
Console.WriteLine("Program Started!");
HtmlDocument doc;
doc = new HtmlWeb().Load("https://rotogrinders.com/grids/nba-defense-vs-position-cheat-sheet-1493632?site=fanduele");
HtmlNode node = doc.DocumentNode.SelectSingleNode("//section[#class='bdy content article full cflex reset long table-page']/following-sibling::script[1]");
int start = node.InnerText.IndexOf("[");
int length = node.InnerText.IndexOf("]") - start +1;
Console.WriteLine(node.InnerText.Substring(start, length));
Console.WriteLine("Program Ended!");
Console.ReadKey();
}
Alternative solution
Alternatively you can use Selenium with PhantomJS. And then load the HTML from the headless browser into your document variable and then your xpath will work.

Creating an xml parsing function in C#

I have an XML that's obtained from a web service, i'm using an HttpClient for it. This is what the XML looks like:
<respuesta>
<entrada>
<rut>7059099</rut>
<dv>9</dv>
</entrada>
<status>
<code>OK</code>
<descrip>Persona tiene ficha, ok</descrip>
</status>
<ficha>
<folio>3204525</folio>
<ptje>7714</ptje>
<fec_aplic>20080714</fec_aplic>
<num_integ>2</num_integ>
<comuna>08205</comuna>
<parentesco>1</parentesco>
<fec_puntaje>20070101</fec_puntaje>
<personas>
<persona>
<run>7059099</run>
<dv>9</dv>
<nombres>JOSE SANTOS</nombres>
<ape1>ONATE</ape1>
<ape2>FERNANDEZ</ape2>
<fec_nac>19521101</fec_nac>
<sexo>M</sexo>
<parentesco>1</parentesco>
</persona>
<persona>
<run>8353907</run>
<dv>0</dv>
<nombres>JUANA DEL TRANSITO</nombres>
<ape1>MEDINA</ape1>
<ape2>ROA</ape2>
<fec_nac>19560815</fec_nac>
<sexo>F</sexo>
<parentesco>2</parentesco>
</persona>
</personas>
</ficha>
I'm trying to make a function that can parse this and, right now (just for the purpose of testing my understanding of the language since i'm new to it) i just need it to find the VALUE inside an "rut" tag, the first one, or something like that. More precisely I need to find a value inside the XML and return it, so i can show it on a label that's on my .aspx page. The code of my parsing function looks like this:
public static String parseXml(String xmlStr, String tag)
{
String valor;
using (XmlReader r = XmlReader.Create(new StringReader(xmlStr)))
{
try
{
r.ReadToFollowing(tag);
r.MoveToContent();
valor = r.Value;
}
catch (Exception ex)
{
throw new Exception(ex.Message, ex.InnerException);
}
}
return valor;
}
This code is based on an example I found on youtube made by the guys from microsoft where they "explain" how to use the parser.
Also, this function is being called from inside one of the tasks of the HttpClient, this is it:
protected void rutBTN_Click(object sender, EventArgs e)
{
if (rutTB.Text != "")
{
HttpClient client = new HttpClient();
String xmlString = "";
String text = "";
var byteArray = Encoding.ASCII.GetBytes("*******:*******"); //WebService's server authentication
client.BaseAddress = new Uri("http://wschsol.mideplan.cl");
var par = "mod_perl/xml/fps-by-rut?rut=" + rutTB.Text;
client.DefaultRequestHeaders.Authorization = new System.Net.Http.Headers.AuthenticationHeaderValue("Basic", Convert.ToBase64String(byteArray));
client.GetAsync(par).ContinueWith(
(requestTask) =>
{
HttpResponseMessage resp = requestTask.Result;
try
{
resp.EnsureSuccessStatusCode();
XmlDocument xmlResp = new XmlDocument();
requestTask.Result.Content.ReadAsStreamAsync().ContinueWith(
(streamTask) =>
{
xmlResp.Load(streamTask.Result);
text = xmlResp.InnerXml.ToString();
xmlString = parseXml(text, "rut"); //HERE I'm calling the parsing function, and i'm passing the whole innerXml to it, and the string "rut", so it searches for this tag.
Console.WriteLine("BP");
}).Wait();
}
catch (Exception ex)
{
throw new Exception(ex.Message, ex.InnerException);
}
}).Wait();
testLBL.Text = xmlString; //Finally THIS is the label i want to show the "rut" tag's value to be shown.
testLBL.Visible = true;
}
else
{
testLBL.Text = "You must enter an RUT number";
testLBL.Visible = true;
}
}
The problem is that when i put some breakpoints into the parsing function i can see that it's receiving correctly the innerxml string (as a string) but it's not finding the tag called "rut", or rather not finding anything at all, since it's returning an empty string ("").
I know that maybe this is not the correct way to parse an xmlDocument, so if someone can help me out i'd be really really thankful.
EDIT:
Ok, so i won't ask for any tutorial or such (I requested that to avoid asking noob questions). But anyway, please, instead of just answering "you better do it like this", I'd appreciate if you could explain me things like "THIS is what you're doing wrong and THAT'S why your code isn't working", and THEN tell me how you guys would do it instead.
Thanks in advance!

As you only want to retrieve a single field value I would recommend using Xpath.
Basically you create a XpathNavigator from a XpathDocument or xmlDocument and then use Select to get the content of the rut node:
XPathNavigator navigator = xmlResp.CreateNavigator();
XPathNodeIterator rutNode = navigator.SelectSingleNode("/respuesta/entrada/rut");
string rut = rutNode.Value

Select elements added to the DOM by a script

I've been trying to get either an <object> or an <embed> tag using:
HtmlNode videoObjectNode = doc.DocumentNode.SelectSingleNode("//object");
HtmlNode videoEmbedNode = doc.DocumentNode.SelectSingleNode("//embed");
This doesn't seem to work.
Can anyone please tell me how to get these tags and their InnerHtml?
A YouTube embedded video looks like this:
<embed height="385" width="640" type="application/x-shockwave-flash"
src="http://s.ytimg.com/yt/swf/watch-vfl184368.swf" id="movie_player" flashvars="..."
allowscriptaccess="always" allowfullscreen="true" bgcolor="#000000">
I got a feeling the JavaScript might stop the swf player from working, hope not...
Cheers

Update 2010-08-26 (in response to OP's comment):
I think you're thinking about it the wrong way, Alex. Suppose I wrote some C# code that looked like this:
string codeBlock = "if (x == 1) Console.WriteLine(\"Hello, World!\");";
Now, if I wrote a C# parser, should it recognize the contents of the string literal above as C# code and highlight it (or whatever) as such? No, because in the context of a well-formed C# file, that text represents a string to which the codeBlock variable is being assigned.
Similarly, in the HTML on YouTube's pages, the <object> and <embed> elements are not really elements at all in the context of the current HTML document. They are the contents of string values residing within JavaScript code.
In fact, if HtmlAgilityPack did ignore this fact and attempted to recognize all portions of text that could be HTML, it still wouldn't succeed with these elements because, being inside JavaScript, they're heavily escaped with \ characters (notice the precarious Unescape method in the code I posted to get around this issue).
I'm not saying my hacky solution below is the right way to approach this problem; I'm just explaining why obtaining these elements isn't as straightforward as grabbing them with HtmlAgilityPack.
YouTubeScraper
OK, Alex: you asked for it, so here it is. Some truly hacky code to extract your precious <object> and <embed> elements out from that sea of JavaScript.
class YouTubeScraper
{
public HtmlNode FindObjectElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int objectNodeLocation = javascript.IndexOf("<object");
if (objectNodeLocation != -1)
{
string htmlStart = javascript.Substring(objectNodeLocation);
int objectNodeEndLocation = htmlStart.IndexOf(">\" :");
if (objectNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, objectNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var objectDoc = new HtmlDocument();
objectDoc.LoadHtml(unescaped);
HtmlNode objectNode = objectDoc.GetElementbyId("movie_player");
return objectNode;
}
}
}
return null;
}
public HtmlNode FindEmbedElement(string url)
{
HtmlNodeCollection scriptNodes = FindScriptNodes(url);
for (int i = 0; i < scriptNodes.Count; ++i)
{
HtmlNode scriptNode = scriptNodes[i];
string javascript = scriptNode.InnerHtml;
int approxEmbedNodeLocation = javascript.IndexOf("<\\/object>\" : \"<embed");
if (approxEmbedNodeLocation != -1)
{
string htmlStart = javascript.Substring(approxEmbedNodeLocation + 15);
int embedNodeEndLocation = htmlStart.IndexOf(">\";");
if (embedNodeEndLocation != -1)
{
string finalEscapedHtml = htmlStart.Substring(0, embedNodeEndLocation + 1);
string unescaped = Unescape(finalEscapedHtml);
var embedDoc = new HtmlDocument();
embedDoc.LoadHtml(unescaped);
HtmlNode videoEmbedNode = embedDoc.GetElementbyId("movie_player");
return videoEmbedNode;
}
}
}
return null;
}
protected HtmlNodeCollection FindScriptNodes(string url)
{
var doc = new HtmlDocument();
WebRequest request = WebRequest.Create(url);
using (var response = request.GetResponse())
using (var stream = response.GetResponseStream())
{
doc.Load(stream);
}
HtmlNode root = doc.DocumentNode;
HtmlNodeCollection scriptNodes = root.SelectNodes("//script");
return scriptNodes;
}
static string Unescape(string htmlFromJavascript)
{
// The JavaScript has escaped all of its HTML using backslashes. We need
// to reverse this.
// DISCLAIMER: I am a TOTAL Regex n00b; I make no claims as to the robustness
// of this code. If you could improve it, please, I beg of you to do so. Personally,
// I tested it on a grand total of three inputs. It worked for those, at least.
return Regex.Replace(htmlFromJavascript, #"\\(.)", UnescapeFromBeginning);
}
static string UnescapeFromBeginning(Match match)
{
string text = match.ToString();
if (text.StartsWith("\\"))
{
return text.Substring(1);
}
return text;
}
}
And in case you're interested, here's a little demo I threw together (super fancy, I know):
class Program
{
static void Main(string[] args)
{
var scraper = new YouTubeScraper();
HtmlNode davidAfterDentistEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=txqiwrbYGrs");
Console.WriteLine("David After Dentist:");
Console.WriteLine(davidAfterDentistEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode drunkHistoryObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=jL68NyCSi8o");
Console.WriteLine("Drunk History:");
Console.WriteLine(drunkHistoryObjectNode.OuterHtml);
Console.WriteLine();
HtmlNode jessicaDailyAffirmationEmbedNode = scraper.FindEmbedElement("http://www.youtube.com/watch?v=qR3rK0kZFkg");
Console.WriteLine("Jessica's Daily Affirmation:");
Console.WriteLine(jessicaDailyAffirmationEmbedNode.OuterHtml);
Console.WriteLine();
HtmlNode jazzerciseObjectNode = scraper.FindObjectElement("http://www.youtube.com/watch?v=VGOO8ZhWFR4");
Console.WriteLine("Jazzercise - Move your Boogie Body:");
Console.WriteLine(jazzerciseObjectNode.OuterHtml);
Console.WriteLine();
Console.Write("Finished! Hit Enter to quit.");
Console.ReadLine();
}
}
Original Answer
Why not try using the element's Id instead?
HtmlNode videoEmbedNode = doc.GetElementbyId("movie_player");
Update: Oh man, you're searching for HTML tags that are themselves within JavaScript? That's definitely why this isn't working. (They aren't really tags to be parsed from the perspective of HtmlAgilityPack; all of that JavaScript is really one big string inside a <script> tag.) Maybe there's some way you can parse the <script> tag's inner text itself as HTML and go from there.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Parsing a website data c# - HTML agility pack - c#

Related

Apose.Words ImportNode ignores font formatting when appendingchild

No Errors yet nothing in Console?

C# Looking to get obtain the <div> value using HtmlAgilityPack but receiving a System.NullReferenceException

Creating an xml parsing function in C#

Select elements added to the DOM by a script

Categories

Resources