Web Scraping with c# and HTMLAgilityPack

Web Scraping with c# and HTMLAgilityPack - c#

Screenshot of the code and error message+variable values So, the goal is to take a word and get the part of speech of the word from its google definition.
I've tried a few different approaches but I'm getting a null reference error every time. Is my code failing to access the webpage? Is it a firewall issue, a logic issue, an {insert-issue-here} problem? I really wish i had a vague idea of what is wrong.
Thanks for your time.
Addendum: I've tried "//[#id=\"source - luna\"]//div" and "//[#id=\"source - luna\"]/div1" as XPath values.
//attempt 1////////////////////////////////////////////////////////////////////////
var term = "Hello";
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.urbandictionary.com/define.php?term=" + term);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader stream = new StreamReader(response.GetResponseStream());
string final_response = stream.ReadToEnd();
MessageBox.Show(final_response); //doesn't execute
//attempt 2////////////////////////////////////////////////////////////////////////
var url = "https://www.google.co.za/search?q=define+position";
var content = new System.Net.WebClient().DownloadString(url);
var webGet = new HtmlWeb();
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(content);
//doc is null at runtime
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode("//*[#id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
if (ourNode != null)
{
richTextBox1.AppendText(ourNode.InnerText);
}
else
richTextBox1.AppendText("null");
//attempt 3////////////////////////////////////////////////////////////////////////
var webGet = new HtmlWeb();
var doc = webGet.Load("https://www.google.co.za/search?q=define+position");
//doc is null at runtime
HtmlNode ourNode = doc.DocumentNode.SelectSingleNode("//*[#id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
if (ourNode != null)
{
richTextBox1.AppendText(ourNode.InnerText);
}
else
richTextBox1.AppendText("null");
//attempt 4////////////////////////////////////////////////////////////////////////
string Url = "http://www.metacritic.com/game/pc/halo-spartan-assault";
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(Url);
//doc is null at runtime
string metascore = doc.DocumentNode.SelectNodes("//*[#id=\"main\"]/div[3]/div/div[2]/div[1]/div[1]/div/div/div[2]/a/span[1]")[0].InnerText;
string userscore = doc.DocumentNode.SelectNodes("//*[#id=\"main\"]/div[3]/div/div[2]/div[1]/div[2]/div[1]/div/div[2]/a/span[1]")[0].InnerText;
string summary = doc.DocumentNode.SelectNodes("//*[#id=\"main\"]/div[3]/div/div[2]/div[2]/div[1]/ul/li/span[2]/span/span[1]")[0].InnerText;
richTextBox1.AppendText(metascore + " " + userscore + " " + summary);
//attempt 5////////////////////////////////////////////////////////////////////////
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument html = web.Load("https://www.google.co.za/search?q=define+position");
//html is null
var div = html.DocumentNode.SelectNodes("//*[#id=\"uid_0\"]/div[1]/div/div[1]/div[2]/div[2]/div[1]/i/span");
richTextBox1.AppendText(Convert.ToString(div));

You are getting null because your XPATHs aren't correct or it couldn't find any node based on those XPATHs. What are you trying to achieve here?

Related

Certain website returning Null with HTML Agility Pack

So I'm trying to scrape a website and I'm using HTML Agility Pack to try and do this. I have tried my code on the html-agility-pack and google websites and it seem to work fine with this simple search.
My problem is the code is returning an error ("System.NullReferenceException: 'Object reference not set to an instance of an object.'") on this line of code:
Console.WriteLine("Node Name: " + node.Name + "\n" + node.OuterHtml);
I understand this is happening because the var node is returning Null but why is this happening on this website and not on the others?
//var html = #"http://html-agility-pack.net/";
var html = #"https://www./";
//var html = #"https://www.google.com/";
HtmlWeb web = new HtmlWeb();
HtmlDocument htmlDoc = web.Load(html);
if (web.StatusCode == HttpStatusCode.OK)
{
Console.WriteLine("CONNECTION OK");
var node = htmlDoc.DocumentNode.SelectSingleNode("//head/title");
Console.WriteLine("Node Name: " + node.Name + "\n" + node.OuterHtml);
Console.ReadLine();
}else
{
Console.WriteLine("No Connection to website");
Console.ReadLine();
}

c# htmlagilitypack strong and ul pull

enter image description here
I want to pull the areas within the pictures
Iam pulled 2 and 3 think
Uri url = new Uri("http://www.milliyet.com.tr/sondakika/");
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
var html = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument dokuman = new HtmlAgilityPack.HtmlDocument();
dokuman.LoadHtml(html);
HtmlNodeCollection basliklar = dokuman.DocumentNode.SelectNodes("//div[contains(#class,'kategoriList3')]//a");
foreach (var baslik in basliklar)
{
try
{
datacıktı.Rows.Add();
datacıktı.Rows[sayac].Cells[0].Value = baslik.Attributes["href"].Value.ToString();
datacıktı.Rows[sayac].Cells[1].Value = baslik.InnerText;
sayac++;
}
catch
{
continue;
}
}

This code can help you
Uri url = new Uri("http://www.milliyet.com.tr/sondakika/");
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
var html = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument dokuman = new HtmlAgilityPack.HtmlDocument();
dokuman.LoadHtml(html);
IEnumerable<HtmlNode> htmlNodes = dokuman.DocumentNode.Descendants("ul").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("sonDK"));
foreach (HtmlNode htmlNode in htmlNodes)
{
IEnumerable<HtmlNode> liList = htmlNode.Descendants("li").Where(l => (l.Attributes.Contains("class") && l.Attributes["class"].Value.Contains("title")) == false);
foreach (HtmlNode liNode in liList)
{
Console.WriteLine("strong:" + liNode.FirstChild.InnerText + "- link:" + liNode.LastChild.Attributes["href"].Value);
}
}

Accessing xmlNodeList with partial namespaces(Xpath)

I have been digging through every example I can find related to Xpath and XML parsing. I can't find a close enough example to the XML I have to deal with that makes any sense to me. I am having an extremely difficult time wrapping my head around Xpath in particular but also XML parsing in a more general sense. The complexity of the file I'm working with is not making it easier to understand.
I have an XML file coming from a remote source which I have no control over.
The file is:
<AssetWarrantyDTO xmlns:i="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.datacontract.org/2004/07/Dell.Support.AssetsExternalAPI.Web.Models.V1.Response">
<AdditionalInformation i:nil="true"/>
<AssetWarrantyResponse>
<AssetWarrantyResponse>
<AssetEntitlementData>
<AssetEntitlement>
<EndDate>2010-12-20T23:59:59</EndDate>
<EntitlementType>EXTENDED</EntitlementType>
<ItemNumber>983-4252</ItemNumber>
<ServiceLevelCode>ND</ServiceLevelCode>
<ServiceLevelDescription>Next Business Day Onsite</ServiceLevelDescription>
<ServiceLevelGroup>5</ServiceLevelGroup>
<ServiceProvider>UNY</ServiceProvider>
<StartDate>2008-12-21T00:00:00</StartDate>
</AssetEntitlement>
<AssetEntitlement>
<EndDate>2010-12-20T23:59:59</EndDate>
<EntitlementType>EXTENDED</EntitlementType>
<ItemNumber>987-1139</ItemNumber>
<ServiceLevelCode>TS</ServiceLevelCode>
<ServiceLevelDescription>ProSupport</ServiceLevelDescription>
<ServiceLevelGroup>8</ServiceLevelGroup>
<ServiceProvider>DELL</ServiceProvider>
<StartDate>2008-12-21T00:00:00</StartDate>
</AssetEntitlement>
<AssetEntitlement>
<EndDate>2008-12-20T23:59:59</EndDate>
<EntitlementType>INITIAL</EntitlementType>
<ItemNumber>984-0210</ItemNumber>
<ServiceLevelCode>ND</ServiceLevelCode>
<ServiceLevelDescription>Next Business Day Onsite</ServiceLevelDescription>
<ServiceLevelGroup>5</ServiceLevelGroup>
<ServiceProvider>UNY</ServiceProvider>
<StartDate>2007-12-20T00:00:00</StartDate>
</AssetEntitlement>
<AssetEntitlement>
<EndDate>2008-12-20T23:59:59</EndDate>
<EntitlementType>INITIAL</EntitlementType>
<ItemNumber>987-1308</ItemNumber>
<ServiceLevelCode>TS</ServiceLevelCode>
<ServiceLevelDescription>ProSupport</ServiceLevelDescription>
<ServiceLevelGroup>8</ServiceLevelGroup>
<ServiceProvider>DELL</ServiceProvider>
<StartDate>2007-12-20T00:00:00</StartDate>
</AssetEntitlement>
</AssetEntitlementData>
<AssetHeaderData>
<BUID>11</BUID>
<CountryLookupCode>US</CountryLookupCode>
<CustomerNumber>64724056</CustomerNumber>
<IsDuplicate>false</IsDuplicate>
<ItemClassCode>`U060</ItemClassCode>
<LocalChannel>17</LocalChannel>
<MachineDescription>Precision T3400</MachineDescription>
<OrderNumber>979857987</OrderNumber>
<ParentServiceTag i:nil="true"/>
<ServiceTag>7P3VBU1</ServiceTag>
<ShipDate>2007-12-20T00:00:00</ShipDate>
</AssetHeaderData>
<ProductHeaderData>
<LOB>Dell Precision WorkStation</LOB>
<LOBFriendlyName>Precision WorkStation</LOBFriendlyName>
<ProductFamily>Desktops & All-in-Ones</ProductFamily>
<ProductId>precision-t3400</ProductId>
<SystemDescription>Precision T3400</SystemDescription>
</ProductHeaderData>
</AssetWarrantyResponse>
</AssetWarrantyResponse>
<ExcessTags>
<BadAssets xmlns:d3p1="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
</ExcessTags>
<InvalidBILAssets>
<BadAssets xmlns:d3p1="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
</InvalidBILAssets>
<InvalidFormatAssets>
<BadAssets xmlns:d3p1="http://schemas.microsoft.com/2003/10/Serialization/Arrays"/>
</InvalidFormatAssets>
</AssetWarrantyDTO>
Here is the Final code not including the setting of the URI variable for the API URL.
protected void Unnamed1_Click(object sender, EventArgs e)
{
string Serial = TextBox1.Text.ToUpper();
URI = String.Format(URI, Serial);
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(URI);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
CookieContainer aCookie = new CookieContainer();
request.CookieContainer = aCookie;
WebResponse pageResponse = request.GetResponse();
Stream responseStream = pageResponse.GetResponseStream();
string xml = string.Empty;
using (StreamReader streamRead = new StreamReader(responseStream))
{
xml = streamRead.ReadToEnd();
}
XmlDocument doc1 = new XmlDocument();
doc1.LoadXml(xml);
string _byteOrderMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
if (xml.StartsWith(_byteOrderMarkUtf8))
{
var lastIndexOfUtf8 = _byteOrderMarkUtf8.Length - 1;
xml = xml.Remove(0, lastIndexOfUtf8);
//Label2.Text = "BOM found.";
}
XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc1.NameTable);
nsmgr.AddNamespace("j", "http://schemas.datacontract.org/2004/07/Dell.Support.AssetsExternalAPI.Web.Models.V1.Response");
XmlNodeList nodes = doc1.SelectNodes(".//j:AssetWarrantyResponse/j:AssetWarrantyResponse/j:AssetEntitlementData", nsmgr);
//Make a list to hold the start dates
System.Collections.ArrayList startDates = new System.Collections.ArrayList();
//Make a list to hold the end dates
System.Collections.ArrayList endDates = new System.Collections.ArrayList();
//Create a regex for finding just the date and discarding the time value which can alter tha date if the time is 24:00 (euro standard)
Regex r = new Regex(#"\d{4}-\d{1,2}-\d{1,2}", RegexOptions.IgnoreCase);
//Set the culture to format the date as US region
CultureInfo dtFormat = new CultureInfo("en-US", false);
foreach (XmlNode node in nodes)
{
foreach (XmlNode childNode in node.ChildNodes)
{
string startDate = childNode["StartDate"].InnerText;
if (startDate != null)
{
MatchCollection mcl1 = r.Matches(startDate);
startDates.Add(DateTime.Parse(mcl1[0].ToString(), dtFormat));
}
string endDate = childNode["EndDate"].InnerText;
if (endDate != null)
{
MatchCollection mcl2 = r.Matches(endDate);
endDates.Add(DateTime.Parse(mcl2[0].ToString(), dtFormat));
}
}
startDates.Sort();
endDates.Sort();
DateTime wStartDate = new DateTime();
DateTime wEndDate = new DateTime();
//if (dates.Count > 1) wStartDate = (DateTime)dates[dates.Count - 1];
if (startDates.Count >= 1) wStartDate = (DateTime)startDates[0];
Label1.Text = wStartDate.ToString("MM/dd/yyyy");
if (endDates.Count >= 1) wEndDate = (DateTime)endDates[endDates.Count - 1];
Label2.Text = wEndDate.ToString("MM/dd/yyyy");
//Label2.Text = tempc;
//Label3.Text = feels;
}
nodes = doc1.SelectNodes(".//j:AssetWarrantyResponse/j:AssetWarrantyResponse/j:AssetHeaderData", nsmgr);
foreach (XmlNode node in nodes)
{
try
{
string custNumber = node["CustomerNumber"].InnerText;
string model = node["MachineDescription"].InnerText;
string orderNumber = node["OrderNumber"].InnerText;
string serialNumber = node["ServiceTag"].InnerText;
Label3.Text = custNumber;
Label4.Text = model;
Label5.Text = orderNumber;
Label6.Text = serialNumber;
}
catch (Exception ex)
{
dbgLabel.Text = ex.Message;
}
}
}

You are looking for AssetWarrantyResponse in namespace http://www.w3.org/2001/XMLSchema-instance (the namespace you have bound to prefix "i") but it is actually in namespace http://schemas.datacontract.org/2004/07/Dell.Support.AssetsExternalAPI.Web.Models.V1.Response. Bind a prefix to that namespace (anything will do, e.g "p") and use that prefix in your query, e.g. p:AssetWarrantyResponse, and similarly for other element names.
I wonder if you are spending too much time trying to look for example code that exactly matches what you want to do, and not enough time studying the underlying concepts of the language so that you can apply them to your own problems. Get some good XML books and read them.
There's another problem with your XPath, which is the "/" at the end of the path. That's invalid syntax. If that's the cause of the error then I'm not very impressed with your XPath processor's diagnostics.

Trying to scrape all href from source code. I dont understand what I am doing wrong

I am trying to scrape all href from source code in tag and having class = "linked formlink" . I dont understand what I am doing wrong.I am getting null in the "links".
StreamReader sr = new StreamReader(webBrowser1.DocumentStream);
string sourceCode = sr.ReadToEnd();
sr.Close();
//removing illegal path
string regexSearch = new string(Path.GetInvalidFileNameChars()) + new string(Path.GetInvalidPathChars());
Regex r = new Regex(string.Format("[{0}]", Regex.Escape(regexSearch)));
sourceCode = r.Replace(sourceCode, "");
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(sourceCode);
var links = htmlDoc.DocumentNode
.Descendants("a")
.Where(x => x.Attributes["class"] != null
&& x.Attributes["class"].Value == "linked formlink")
.Select(x => x.Attributes["href"].Value.ToString());

the regular expression is removing the brackets plus other necessary characters used by the html-agile-pack to determine tags and classes
just remove it

HmlAgilityPack fix not open tag

I get from url html page.
in page I get table with hot opened <tr> tag
<table class="transparent">
<tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
<td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>
how to fix to
<table class="transparent">
<tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
<tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>
I tried to do
private HtmlDocument GetHtmlDocument(string link)
{
string url = "http://195.182.67.7/paslaugos/administratoriai/bankroto-administratoriai/" + link;
var web = new HtmlWeb { AutoDetectEncoding = false, OverrideEncoding = Encoding.UTF8 };
var doc = web.Load(url);
doc.OptionFixNestedTags = true;
doc.OptionAutoCloseOnEnd = true;
doc.OptionCheckSyntax = true;
// build a list of nodes ordered by stream position
NodePositions pos = new NodePositions(doc);
// browse all tags detected as not opened
foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
{
// find the text node just before this error
var last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
if (last != null)
{
// fix the text; reintroduce the broken tag
last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
}
}
doc.Save(Console.Out);
return doc;
}
but not fix

for this particular problem you could do simple regex replacing:
string wrong = "<table class=\"transparent\"><tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr></table>";
Regex reg = new Regex(#"(?<!(?:<tr>)|(?:</td>))<td>");
string right = reg.Replace(wrong, "<tr><td>");
Console.WriteLine(right);

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Web Scraping with c# and HTMLAgilityPack - c#

You are getting null because your XPATHs aren't correct or it couldn't find any node based on those XPATHs. What are you trying to achieve here?

Related

Certain website returning Null with HTML Agility Pack

c# htmlagilitypack strong and ul pull

Accessing xmlNodeList with partial namespaces(Xpath)

Trying to scrape all href from source code. I dont understand what I am doing wrong

HmlAgilityPack fix not open tag

Categories

Resources