I am using HtmlAgilityPack and I'm trying to scrape the link http://www.hundsun.co.jp/, which is stored in the data-preconnect-urls attribute. How can I get it?
<h3>
    <a style="display:none" href="/aclk?sa=L&ai=DChcSEwimnPnc5OvQAhWRl70KHcxqCEAYABAA&ei=9hZNWLqlCIyY8gXA04vACg&sig=AOD64_3SZuXd57_-qOs8nnhn8rqw8GlIgw&q=&sqi=2&ved=0ahUKEwi6-PTc5OvQAhUMjLwKHcDpAqgQ0QwIGA&adurl=" id="s0p1c0"></a>
    ブリッジSE募集中 - hundsun.co.jp
</h3>
You can do it like this:
using System;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        string html = "<html><body><h3><a style=\"display:none\" href=\"/aclk?sa=L&ai=DChcSEwimnPnc5OvQAhWRl70KHcxqCEAYABAA&ei=9hZNWLqlCIyY8gXA04vACg&sig=AOD64_3SZuXd57_-qOs8nnhn8rqw8GlIgw&q=&sqi=2&ved=0ahUKEwi6-PTc5OvQAhUMjLwKHcDpAqgQ0QwIGA&adurl=\" id=\"s0p1c0\"></a>ブリッジSE募集中 - hundsun.co.jp</h3></body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select every anchor that carries a data-preconnect-urls attribute
        var links = doc.DocumentNode.SelectNodes("//a[@data-preconnect-urls]");
        if (links == null)
        {
            Console.WriteLine("no links contain attribute data-preconnect-urls");
            return;
        }

        foreach (var htmlNode in links)
        {
            var attr = htmlNode.Attributes["data-preconnect-urls"];
            Console.WriteLine(attr.Value);
        }
    }
}
You can try it out here:
https://dotnetfiddle.net/gMTFV3
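Note that the sample HTML above does not actually contain a data-preconnect-urls attribute, so this demo prints the fallback message. Purely as a hypothetical illustration: if the live page served an anchor like the one below, the loop would print http://www.hundsun.co.jp/.
<a id="s0p1c0" data-preconnect-urls="http://www.hundsun.co.jp/" href="/aclk?...">ブリッジSE募集中 - hundsun.co.jp</a>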
The big picture is to print all the URLs in a specific location on a website to the console.
My current code gives me the text of all the links, but not the URLs. Please help; I'm very new to coding. I've been told to use a different web driver, but for my current project I want to stay with Selenium.
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support;
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using OpenQA.Selenium.Support.UI;

namespace Test_Scraper_1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize chrome driver
            using (var driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("https://www.tfrrs.org/");

                // Find elements
                var Search_Field = driver.FindElementByXPath(@"/html/body/div[3]/div/div/div[4]/div/div[2]/form/div[1]/input");
                var Search_Button = driver.FindElementByXPath(@"/html/body/div[3]/div/div/div[4]/div/div[2]/form/div[4]/button");
                var Count = 1;

                Search_Field.SendKeys("Ashley Smith");
                Search_Button.Click();

                var titles = driver.FindElementsByClassName("allRows");
                foreach (var allRows in titles)
                {
                    Console.WriteLine(allRows.Text + Count++);
                }
                Console.ReadLine();
            }
        }
    }
}
Your allRows is a tr element like the one below.
<tr class="filtered allRows ">
    <td id="col0">
        Ashley Smith
    </td>
    <td id="col1">
        Youngstown St. (F)
    </td>
</tr>
But you need the href attribute of an a element, so you need something like this, assuming you want the first link:
var column0 = allRows.FindElement(By.Id("col0"));
var aElement = column0.FindElement(By.TagName("a"));
var link = aElement.GetAttribute("href");
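Note that FindElement throws a NoSuchElementException when nothing matches. If some rows might lack a link, a small defensive variation (my sketch, not part of the original answer) is to use FindElements, which returns an empty collection instead of throwing:
// FindElements never throws; it returns an empty list when no anchor exists
var anchors = column0.FindElements(By.TagName("a"));
if (anchors.Count > 0)
{
    var link = anchors[0].GetAttribute("href");
}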
Use allRows.GetAttribute("href") instead of allRows.Text in your foreach loop to get the URL (note that this only works if the element itself carries an href; here allRows is a tr, so you first need to locate the a inside it, as the other answers do).
namespace Test_Scraper_1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize chrome driver
            using (var driver = new ChromeDriver())
            {
                driver.Navigate().GoToUrl("https://www.tfrrs.org/");

                // Find elements
                var Search_Field = driver.FindElementByXPath(@"/html/body/div[3]/div/div/div[4]/div/div[2]/form/div[1]/input");
                var Search_Button = driver.FindElementByXPath(@"/html/body/div[3]/div/div/div[4]/div/div[2]/form/div[4]/button");

                // Navigate to target page
                Search_Field.SendKeys("Ashley Smith");
                Search_Button.Click();

                var titles = driver.FindElementsByClassName("allRows"); // alternative: driver.FindElementByLinkText("Ashley Smith")
                foreach (var title in titles)
                {
                    // Drill into the row's anchor and read its href
                    var Link_Name_TFRRS = title.FindElement(By.TagName("a")).GetAttribute("href");
                    Console.WriteLine(Link_Name_TFRRS);
                }
                Console.ReadLine();
            }
        }
    }
}
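One more caveat: clicking the search button triggers a page load, so the result rows may not exist yet when FindElementsByClassName runs. A hedged sketch using WebDriverWait (OpenQA.Selenium.Support.UI is already imported in the question's code) would wait for them first:
// Wait up to 10 seconds for at least one result row to appear
var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));
wait.Until(d => d.FindElements(By.ClassName("allRows")).Count > 0);
var titles = driver.FindElementsByClassName("allRows");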
I have the XML below.
<subscription>
  <subscription_add_ons type="array">
    <subscription_add_on>
      <add_on_code>premium_support</add_on_code>
      <name>Premium Support</name>
      <quantity type="integer">1</quantity>
      <unit_amount_in_cents type="integer">15000</unit_amount_in_cents>
      <add_on_type>fixed</add_on_type>
      <usage_percentage nil="true"></usage_percentage>
      <measured_unit_id nil="true"></measured_unit_id>
    </subscription_add_on>
  </subscription_add_ons>
</subscription>
My XML parse function:
public XNode GetXmlNodes(XElement xml, string elementName)
{
    List<string> addOnCodes = new List<string>();
    //elementName = "subscription_add_ons";
    var addOns = xml.DescendantNodes().Where(x => x.Parent.Name == elementName).FirstOrDefault();
    foreach (XNode addOn in addOns)
    {
        //Needed to do something like this
        /*var len = "add_on_code".Length + 2;
        var sIndex = addOn.ToString().IndexOf("<add_on_code>") + len;
        var eIndex = addOn.ToString().IndexOf("</add_on_code>");
        var addOnCode = addOn.ToString().Substring(sIndex, (eIndex - sIndex)).Trim().ToLower();
        addOnCodes.Add(addOnCode);*/
    }
}
As mentioned in the comments by @JonSkeet, I updated my snippet as below.
var addOns = xml.Descendants(elementName).Single().Elements();
foreach (XNode addOn in addOns)
{
    /* addOn = <subscription_add_on>
                 <add_on_code>premium_support</add_on_code>
                 <name>Premium Support</name>
                 <quantity type="integer">1</quantity>
                 <unit_amount_in_cents type="integer">15000</unit_amount_in_cents>
                 <add_on_type>fixed</add_on_type>
                 <usage_percentage nil="true"></usage_percentage>
                 <measured_unit_id nil="true"></measured_unit_id>
               </subscription_add_on> */
    // How to get the addOnCode node value?
    var addOnCode = string.Empty;
    addOnCodes.Add(addOnCode);
}
But what I need is: from the passed XML, get all the nodes of type subscription_add_on, then get the value contained in add_on_code and add it to a string collection.
Or, in general, how do I get the value of a node by passing its name? I tried the methods suggested by Visual Studio IntelliSense but couldn't find one that does this.
Thanks!
Here is a solution with XML LINQ (XDocument):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication107
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            XDocument doc = XDocument.Load(FILENAME);

            // Project each subscription_add_on element into an anonymous object
            var results = doc.Descendants("subscription_add_on").Select(x => new
            {
                add_on_code = (string)x.Element("add_on_code"),
                name = (string)x.Element("name"),
                quantity = (int)x.Element("quantity"),
                amount = (int)x.Element("unit_amount_in_cents"),
                add_on_type = (string)x.Element("add_on_type")
            }).ToList();
        }
    }
}
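If all you need is the add_on_code values in a string collection, as the question asks, a shorter variant of the same XML LINQ approach (a sketch under the same file-path assumption) is:
// Collect every add_on_code value in the document into a list of strings
List<string> addOnCodes = doc.Descendants("add_on_code")
                             .Select(x => (string)x)
                             .ToList();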
I'm trying to scrape a website. I've accomplished this on other projects, but I can't seem to get this right. It could be that I've been up for over 2 days working and maybe I am missing something. Please could someone look over my code? Here it is:
using System;
using System.Collections.Generic;
using HtmlAgilityPack;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Linq;
using System.Xml.Linq;
using System.IO;

public partial class _Default : System.Web.UI.Page
{
    List<string> names = new List<string>();
    List<string> address = new List<string>();
    List<string> number = new List<string>();

    protected void Page_Load(object sender, EventArgs e)
    {
        string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + "4";
        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);
        List<List<string>> mainList = new List<List<string>>();

        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//h2//a"))
        {
            names.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-address']"))
        {
            address.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }
        foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//p[@class='result-number']"))
        {
            number.Add(Regex.Replace(node.ChildNodes[0].InnerHtml, @"\s{2,}", " "));
        }

        XDocument doccy = new XDocument(
            new XDeclaration("1.0", "utf-8", "yes"),
            new XComment("Business For Sale"),
            new XElement("Data",
                from data in mainList
                select new XElement("data", new XAttribute("data", "data"),
                    new XElement("Name : ", names[0]),
                    new XElement("Add : ", address[0]),
                    new XElement("Number : ", number[0])
                )
            )
        );

        var xml = doccy.ToString();

        Response.ContentType = "text/xml"; //Must be 'text/xml'
        Response.ContentEncoding = System.Text.Encoding.UTF8; //We'd like UTF-8
        doccy.Save(Response.Output); //Save to the text-writer
    }
}
The website lists business name, phone number and address, and they are all identified by a class name (result-address, result-number, etc.). I am trying to get XML output so I can get the business name, address and phone number from each listing on page 4 for a presentation tomorrow, but I can't get it to work at all!
The results are right in all 3 of the foreach loops, but they won't output in the XML; I get an out-of-range error.
My first piece of advice would be to keep your CodeBehind as light as possible. If you bloat it up with business logic then the solution will become difficult to maintain. That's off topic, but I recommend looking up SOLID principles.
First, I've created a custom object to work with instead of using lists of strings, which have no way of knowing which address item links up with which name:
public class Listing
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Number { get; set; }
}
Here is the heart of it, a class that does all the scraping and serializing (I've broken SOLID principles but sometimes you just want it to work right.)
using System.Collections.Generic;
using HtmlAgilityPack;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using System.Linq;

public class TheScraper
{
    public List<Listing> DoTheScrape(int pageNumber)
    {
        List<Listing> result = new List<Listing>();

        string url = "http://www.scoot.co.uk/find/" + "cafe" + " " + "-in-uk?page=" + pageNumber;
        var Webget = new HtmlWeb();
        var doc = Webget.Load(url);

        // Select top level node; this is the closest we can get to the element
        // of which all the listings are children.
        var nodes = doc.DocumentNode.SelectNodes("//*[@id='list']/div/div/div/div");

        // Loop through each child
        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                Listing listing = new Listing();

                // Get each individual listing and manually check for nulls.
                // listing.Name = node.SelectSingleNode("./div/div/div/div/h2/a")?.InnerText; -- easier way to null check if you can use the null propagating operator
                var nameNode = node.SelectSingleNode("./div/div/div/div/h2/a");
                if (nameNode != null) listing.Name = nameNode.InnerText;

                var addressNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-address']");
                if (addressNode != null) listing.Address = addressNode.InnerText.Trim();

                var numberNode = node.SelectSingleNode("./div/div/div/div/p[@class='result-number']/a");
                if (numberNode != null) listing.Number = numberNode.Attributes["data-visible-number"].Value;

                result.Add(listing);
            }
        }

        // Filter out the nulls
        result = result.Where(x => x.Name != null && x.Address != null && x.Number != null).ToList();
        return result;
    }

    public string SerializeTheListings(List<Listing> listings)
    {
        var xmlSerializer = new XmlSerializer(typeof(List<Listing>));

        using (var stringWriter = new StringWriter())
        using (var xmlWriter = XmlWriter.Create(stringWriter, new XmlWriterSettings { Indent = true }))
        {
            xmlSerializer.Serialize(xmlWriter, listings);
            return stringWriter.ToString();
        }
    }
}
Then your code behind would look something like this, plus references to the scraper class and model class:
public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        TheScraper scraper = new TheScraper();
        List<Listing> listings = new List<Listing>();

        // Quick hack to loop 5 times to get all 5 pages. If this is being run frequently,
        // you'd want to automatically identify how many pages there are, or start at page
        // one and find/use the link to the next page.
        for (int i = 0; i < 5; i++)
        {
            listings = listings.Union(scraper.DoTheScrape(i)).ToList();
        }
        string xmlListings = scraper.SerializeTheListings(listings);
    }
}
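The serialized string is left unused above. If, as in the original question, the goal is to return the XML as the page output, one way (a sketch reusing the question's own Response calls) is:
// Stream the serialized listings back as the page response
Response.ContentType = "text/xml";
Response.ContentEncoding = System.Text.Encoding.UTF8;
Response.Write(xmlListings);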
I am trying to delete/scrub a few elements from the XML using C# with the help of XPath. I am trying to replace the value of social_security_number with "Scrubbed" in both of the child tags named "Customers", but my program is running into many errors. Please correct me.
XML:
<?xml version="1.0" encoding="utf-16"?>
<LoanApplications xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="12345" bundle_id="12225" version="1.0">
  <LoanApplication payment_call="False" version="1.0" app_status="I" perform_dupe_check="1" bundle_id="12225" UpdateReviewed="True">
    <Customers id="12" name="krish" ssn="123456789" />
  </LoanApplication>
  <LoanApplication deal_type="RESPONSE" payment_call="True" version="1.0" app_status="I" perform_dupe_check="1" bundle_id="12225" UpdateReviewed="True">
    <Customers id="12" name="krish" ssn="123456789" />
  </LoanApplication>
</LoanApplications>
Program:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml;

namespace ConsoleApplication2
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument doc = new XmlDocument();
            doc.Load("mytestfile.xml");
            doc.SelectSingleNode("/LoanApplications/LoanApplication[@deal_type="%DealTypeALL%"]/LoanApplicationStates/LoanApplicationState/Customers/Customer[@customer_id="%CustIDALL%"]/").Attributes["social_security_number"].InnerText = "Scrubbed";
            doc.Save("mytestfile.xml");
        }
    }
}
var doc = XDocument.Parse(System.IO.File.ReadAllText("C:\\Users\\jason\\Desktop\\Input\\2015\\09\\03\\mytestfile.xml"));

foreach (var customer in doc.Descendants("Customer"))
{
    var ssn = customer.Attribute("social_security_number");
    if (ssn != null)
    {
        ssn.Value = "scrubbed";
    }
}

doc.Save("file.xml");
You have several matching nodes instead of one, so you should use SelectNodes instead of the SelectSingleNode method.
var doc = new XmlDocument();
doc.Load("mytestfile.xml");

var ssns = doc.SelectNodes("LoanApplications/LoanApplication/LoanApplicationStates/LoanApplicationState/Customers/Customer/@social_security_number");
foreach (XmlAttribute ssn in ssns)
    ssn.InnerText = "Scrubbed";

doc.Save("mytestfile.xml");
You can use a shorter XPath with descendants, but it performs worse.
var ssns = doc.SelectNodes("//Customer/#social_security_number");
This is really easy with XML LINQ:
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";

        static void Main(string[] args)
        {
            XDocument doc = XDocument.Load(FILENAME);

            // Find every element that carries a social_security_number attribute
            List<XElement> ss = doc.Descendants().Where(x => x.Attribute("social_security_number") != null).ToList();
            foreach (XElement s in ss)
            {
                s.Attribute("social_security_number").Value = "Scrubbed";
            }
        }
    }
}
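Note that this only scrubs the document in memory. To write the result back out (my addition, assuming the same file should be overwritten), save it at the end of Main:
// Persist the scrubbed document back to disk
doc.Save(FILENAME);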
I have an HTML document that contains multiple divs.
Example:
<div class="element">
    <div class="title">
        <a href="127.0.0.1" title="Test">Test</a>
    </div>
</div>
Now I'm using this code to extract the title attribute.
List<string> items = new List<string>();

var nodes = Web.DocumentNode.SelectNodes("//*[@title]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        foreach (var attribute in node.Attributes)
            if (attribute.Name == "title")
                items.Add(attribute.Value);
    }
}
I don't know how to adapt my code to extract the href and the title at the same time. Each div should become an object with the values of the included a tag as properties.
public class CheckBoxListItem
{
    public string Text { get; set; }
    public string Href { get; set; }
}
You can use the following XPath query to retrieve only a tags with both a title and an href:
//a[@title and @href]
Then you can use your code like this:
List<CheckBoxListItem> items = new List<CheckBoxListItem>();

var nodes = Web.DocumentNode.SelectNodes("//a[@title and @href]");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        items.Add(new CheckBoxListItem()
        {
            Text = node.Attributes["title"].Value,
            Href = node.Attributes["href"].Value
        });
    }
}
I very often use the ScrapySharp package together with HtmlAgilityPack for CSS selection (add a using statement for ScrapySharp.Extensions so you can use the CssSelect method):
using System.Linq;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
In your case, I would do:
HtmlWeb w = new HtmlWeb();
var htmlDoc = w.Load("myUrl");

var titles = htmlDoc.DocumentNode.CssSelect(".title");
foreach (var title in titles)
{
    string href = string.Empty;
    var anchor = title.CssSelect("a").FirstOrDefault();
    if (anchor != null)
    {
        href = anchor.GetAttributeValue("href");
    }
}
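To end up with the same CheckBoxListItem objects as in the earlier answer, a sketch building on the loop above (it additionally assumes using System.Collections.Generic; GetAttributeValue is the same ScrapySharp extension already used) could collect both values:
var items = new List<CheckBoxListItem>();
foreach (var title in htmlDoc.DocumentNode.CssSelect(".title"))
{
    // Take the first anchor inside each title div, if any
    var anchor = title.CssSelect("a").FirstOrDefault();
    if (anchor != null)
    {
        items.Add(new CheckBoxListItem
        {
            Text = anchor.GetAttributeValue("title"),
            Href = anchor.GetAttributeValue("href")
        });
    }
}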