Reading a text string from a webpage - C#

I'm currently trying to read text from a website with a C# program.
To be exact, I want the track and the DJ from www.hardbase.fm.
This is what the page source looks like:
<div id="Moderator">
<div id="Moderator_special">
<div style="width:158px; float:left; margin:8px"></div>
<div id="onAir" style="width:420px;overflow:hidden;">
<strong>
<a href="/member/46069" target="_top">
<span style="color:#4AA6E5">BIOCORE</span>
</a>
<span style="color:#26628B"> mit "This Is BIOCORE" (Hardstyle)</span>
</strong>
</div>
</div>
</div>
The text I want to read is "BIOCORE" and "mit "This Is BIOCORE" (Hardstyle)"
(the text seen when running the snippet).
I have tried the following:
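System.Net.WebClient wc = new System.Net.WebClient();
byte[] raw = wc.DownloadData("http://www.hardbase.fm/");
// Decode the downloaded bytes into a string; UTF-8 is an assumption about the page encoding.
string webData = System.Text.Encoding.UTF8.GetString(raw);
int first = webData.IndexOf("#4AA6E5\">") + "#4AA6E5\">".Length;
int last = webData.LastIndexOf("</span></a><span style=\"color:#26628B\">");
string hb_dj = webData.Substring(first, last - first);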
But this doesn't always work, because the page's source code sometimes changes slightly (the color, for example), and then the search fails.
So the question is: is there a better way to do this?

You should try the HTML Agility Pack:
HtmlWeb page = new HtmlWeb();
HtmlDocument document = page.Load("http://www.hardbase.fm/");
// Select every <span> below the element with id "onAir"; SelectNodes returns null when nothing matches.
var nodes = document.DocumentNode.SelectNodes("//*[@id='onAir']//span");
if (nodes != null && nodes.Count >= 2)
{
    var span1 = nodes[0].InnerText; // DJ name
    var span2 = nodes[1].InnerText; // track text
}
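For comparison, the same idea (select by document structure instead of by an exact markup string) can be sketched in Python using only the standard library. This is an illustration under assumptions, not the HTML Agility Pack API: `OnAirParser` and `read_on_air` are made-up names, the sample markup is copied from the question, and the depth counter assumes no void tags such as `<br>` appear inside the `onAir` element.

```python
from html.parser import HTMLParser


class OnAirParser(HTMLParser):
    """Collects the text of each <span> nested inside the element with id="onAir"."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside the onAir element (0 = outside)
        self.in_span = False
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
            if tag == "span":
                self.in_span = True
                self.spans.append("")
        elif dict(attrs).get("id") == "onAir":
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if tag == "span":
                self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


# Markup copied from the question; the colors can change without breaking the lookup.
SAMPLE = '''<div id="onAir" style="width:420px;overflow:hidden;">
<strong>
<a href="/member/46069" target="_top">
<span style="color:#4AA6E5">BIOCORE</span>
</a>
<span style="color:#26628B"> mit "This Is BIOCORE" (Hardstyle)</span>
</strong>
</div>'''


def read_on_air(html_text):
    """Return the stripped text of every span inside the onAir element."""
    parser = OnAirParser()
    parser.feed(html_text)
    return [text.strip() for text in parser.spans]


dj, track = read_on_air(SAMPLE)
print(dj)
print(track)
```

Because the lookup keys on the `id` and the tag names, a change to the inline colors no longer affects the result.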

Related

Problem with subnode's text when scraping data from a website with HtmlAgilityPack

I hope somebody can help this newbie.
I tried many paths for these subnodes, but I can't figure it out.
HTML part:
<div class="center-block">
<div class="match-time" id="dvStatusText">MS</div>
<div class="match-score" id="dvScoreText">4 - 0</div>
<div class="hf-match-score" id="dvHTScoreText">İY : 3- 0</div>
</div>
My code:
Uri url = new Uri("http://arsiv.mackolik.com/Mac/3213138/");
WebClient client = new WebClient();
client.Encoding = Encoding.UTF8;
try
{
    string html = client.DownloadString(url);
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    HtmlNodeCollection results = doc.DocumentNode.SelectNodes("//*[@class='center-block']");
    if (results != null)
    {
        for (int i = 0; i < results.Count; i++)
        {
            // A leading "." keeps each lookup relative to results[i] instead of the whole document.
            var t1 = results[i].SelectSingleNode(".//*[@class='match-score']").InnerText; // full time (FT)
            var t2 = results[i].SelectSingleNode(".//*[@id='dvHTScoreText']").InnerText; // half time (HT)
            listBox1.Items.Add(t2);
        }
    }
}
catch (WebException ex)
{
    // The catch block was missing from the post; showing the message in the list is an assumption.
    listBox1.Items.Add(ex.Message);
}
My problem, as seen in the InnerHtml result:
<div class="match-time" id="dvStatusText">MS</div>
<div class="match-score" id="dvScoreText">4 - 0</div>
<div class="hf-match-score" id="dvHTScoreText"></div> // in the browser this element always contains text
I tried different ways to solve this problem but got nowhere. I can scrape class="match-time" or class="match-score", but not class="hf-match-score". I have tried selecting by class and by id; different ways, same problem.
Please show me a way. Thanks a lot.
The half-time score is filled in by JavaScript after the page loads, so it is not in the HTML that WebClient downloads. You'll need Selenium or a similar tool to access this element.
As an alternative, you can fetch the data directly from the JSON that the page loads in the background.
A piece of code in Python (I suppose you can do the same in C#):
import requests
# Download URL (found in the network tab of the browser's developer tools) and the mandatory header
url = 'http://arsiv.mackolik.com/Match/MatchData.aspx?t=dtl&id=3213138&s=0'
hd = {'Referer': 'http://arsiv.mackolik.com/Mac/3213138/'}
# Download and parse the JSON
data = requests.get(url, headers=hd)
val = data.json()
# Extract the values of interest
print(val["d"]["s"], val["d"]["ht"], sep="\n")
Output :
4 - 0
3 - 0
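If the endpoint ever omits a field, indexing with `val["d"]["ht"]` raises a `KeyError`. A minimal offline sketch of a more defensive extraction follows. The payload here is hypothetical: only the keys `d`, `s` and `ht` come from the answer above, and `extract_scores` is a made-up helper name.

```python
import json

# Hypothetical payload: only the keys "d", "s" and "ht" are taken from the answer above;
# the real response likely carries more fields.
SAMPLE_PAYLOAD = '{"d": {"s": "4 - 0", "ht": "3 - 0"}}'


def extract_scores(payload_text):
    """Return (full_time, half_time), using None for any key that is missing."""
    details = json.loads(payload_text).get("d") or {}
    return details.get("s"), details.get("ht")


print(extract_scores(SAMPLE_PAYLOAD))
print(extract_scores('{"d": {}}'))
```

With `requests`, the same helper could be fed `data.text` instead of the sample string.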

Xpath to get href that contain id?

I am trying to get all links that contain ids. I have tried this for the name and price, which works perfectly, but I am not able to get the related links.
For the name I am using this code, but the code for the links is not working.
// For the name
var name = scorenodesdoc.DocumentNode.SelectNodes("//*[contains(@id,'item')]/ul[1]/li[1]/span");
// For the links
var Links = doc.DocumentNode.SelectNodes("//a[contains(@id, 'item')]/@href");
The XPath for the link is: //*[@id="item5d86882c07"]/div[1]/div/a
This is the HTML I am trying to get the href link from:
<li id="item5d86882c07" _sp="p2045573.m1686.l8" listingid="401689029639" class="sresult lvresult clearfix li" r="1">
<div class="lvpic pic img left" iid="401689029639">
<div class="lvpicinner full-width picW">
<a href="https://www.ebay.com/itm/Microsoft-Xbox-One-X-White-Console-1TB-Forza-Special-Edition-Bundle-White/401689029639?hash=item5d86882c07:g:lgwAAOSwoZJcQY5s" class="img imgWr2">
<img src="https://i.ebayimg.com/thumbs/images/g/lgwAAOSwoZJcQY5s/s-l225.jpg" class="img" alt="Microsoft Xbox One X White Console 1TB & Forza Special Edition Bundle - White'">
</a>
</div>
</div>
</li>
OK, this is how I resolved my issue. First get the anchor nodes, then use GetAttributeValue to read the value of href.
var URLnodes = doc.DocumentNode.SelectNodes("//*[contains(@id,'item')]/div[1]/div/a");
var AllURL = URLnodes.Select(node => node.GetAttributeValue("href", null));
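The same two-step idea (find the anchors under an element whose id contains "item", then read their href attribute) can be sketched in Python with the standard library only. This is an illustration, not HtmlAgilityPack code: `ItemLinkParser` and `item_links` are made-up names, and the sample markup is a shortened, hypothetical version of the listing above with example.com URLs.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ItemLinkParser(HTMLParser):
    """Collects href values of <a> tags nested in elements whose id contains 'item'."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the current item element (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
            if tag == "a" and attrs.get("href"):
                self.links.append(attrs["href"])
        elif "item" in (attrs.get("id") or ""):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical, shortened markup modelled on the question's listing.
SAMPLE = '''<ul>
<li id="item5d86882c07"><div><div>
<a href="https://example.com/itm/1"><img src="thumb1.jpg" alt="first"></a>
</div></div></li>
<li id="item5d86882c08"><div><div>
<a href="https://example.com/itm/2"><img src="thumb2.jpg" alt="second"></a>
</div></div></li>
</ul>
<a href="https://example.com/help">Help</a>'''


def item_links(html_text):
    """Return the hrefs found inside item elements, in document order."""
    parser = ItemLinkParser()
    parser.feed(html_text)
    return parser.links


print(item_links(SAMPLE))
```

The link outside the list is skipped because it is not nested in an element whose id contains "item".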

Error "Object reference not set to an instance of an object"

This code works with one website, but with some sites it returns an error message like this, and I do not know how to fix it (the error line was marked with stars).
var document = webBrowser1.Document;
var documentAsIHtmlDocument3 = (mshtml.IHTMLDocument3)document.DomDocument;
var htmlString = documentAsIHtmlDocument3.documentElement.innerHTML;
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(htmlString);
// Use nodes to get the content
HtmlNodeCollection texts = doc.DocumentNode.SelectNodes("//div[@id='footer']/p");
string kq = "";
// Loop to collect the results
foreach (var item in texts)
{
kq += item.InnerText + Environment.NewLine;
}
richTextBox1.Text = kq;
HTML code:
<div id="divTop" >
<div id="text-conent" style="width: 500px; float: right;"></div>
<div id="grid" style="margin-removed 505px; height: 700px;"></div>
</div>
It seems that on the pages where this succeeds there is a div with the id footer,
but on the pages where it fails no such div exists. SelectNodes returns null when nothing matches, so the foreach loop throws the exception.
So your logic may need to change so that the search expression passed to doc.DocumentNode.SelectNodes is more forgiving.
Alternatively, create a few more search strings to try if the original one fails:
if(texts == null){
texts = doc.DocumentNode.SelectNodes("some other search string");
}
etc.
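The fallback idea can be generalised to a list of search expressions tried in order. Below is a small Python sketch of that pattern using the standard library's ElementTree. It is an illustration only: `select_with_fallback` is a made-up helper, the `<p>` element in the sample page is invented so the second expression has something to find, and ElementTree needs well-formed markup, which real pages often are not.

```python
import xml.etree.ElementTree as ET


def select_with_fallback(root, paths):
    """Return the matches of the first path that finds anything, else an empty list."""
    for path in paths:
        found = root.findall(path)
        if found:
            return found
    return []


# Sample page modelled on the question's HTML; the <p> is a hypothetical addition.
PAGE = '''<html><body>
<div id="divTop">
<div id="text-conent"><p>fallback text</p></div>
<div id="grid"></div>
</div>
</body></html>'''

root = ET.fromstring(PAGE)
paths = [".//div[@id='footer']/p", ".//div[@id='text-conent']/p"]
texts = select_with_fallback(root, paths)

# An empty list is safe to iterate, so no null check is needed before the loop.
result = "\n".join(p.text or "" for p in texts)
print(result)
```

Returning an empty list instead of null keeps the calling loop free of special cases.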

Retrieving Text between <span>Text</span> in Selenium C#

I am having trouble retrieving the subject title of a mail from unread mails using Selenium WebDriver in C#.
Here's the HTML code:
<div class="ae4 UI UJ" gh="tl">
<div class="Cp">
<div>
<table id=":8e" class="F cf zt" cellpadding="0">
<colgroup>
<tbody>
<tr id=":8d" class="zA zE">
<td class="PF xY"></td>
<td id=":8c" class="oZ-x3 xY" style="">
<td class="apU xY">
<td class="WA xY">
<td class="yX xY ">
<td id=":87" class="xY " role="link" tabindex="0">
<div class="xS">
<div class="xT">
<div id=":86" class="yi">
<div class="y6">
**<span id=":85">
<b>hi</b>
</span>**
<span class="y2">
</div>
</div>
</div>
</td>
<td class="yf xY "> </td>
<td class="xW xY ">
</tr>
I am able to print 'emailSenderName' to the console but unable to print 'text' (the subject line, i.e. "hi" in this case), as it is between span tags. Here's my code.
//Try to Retrieve mail Senders name and Subject
IWebElement tbl_UM = d1.FindElement(By.ClassName("Cp")).FindElement(By.ClassName("F"));
IList<IWebElement> tr_ListUM = tbl_UM.FindElements(By.ClassName("zE"));
Console.WriteLine("NUMBER OF ROWS IN THIS TABLE = " + tr_ListUM.Count());
foreach (IWebElement trElement in tr_ListUM)
{
IList<IWebElement> td_ListUM = trElement.FindElements(By.TagName("td"));
Console.WriteLine("NUMBER OF COLUMNS=" + td_ListUM.Count());
string emailSenderName = td_ListUM[4].FindElement(By.ClassName("yW")).FindElement(By.ClassName("zF")).GetAttribute("name");
Console.WriteLine(emailSenderName);
string text = td_ListUM[5].FindElement(By.ClassName("y6")).FindElement(By.TagName("span")).FindElement(By.TagName("b")).Text;
Console.WriteLine(text);
}
I also tried directly selecting the text from the tag of the 5th column (td), which contains the subject text (in my case), but got no results.
I might have gone wrong somewhere, or maybe there is some other way of doing it.
Please suggest. Thanks in advance :)
The 'getText' method available in the Java implementation of Selenium Web Driver seems to do a better job than the equivalent 'Text' property available in C#.
I found a way of achieving the same end which, although somewhat convoluted, works well:
public static string GetInnerHtml(this IWebElement element)
{
var remoteWebDriver = (RemoteWebElement)element;
var javaScriptExecutor = (IJavaScriptExecutor) remoteWebDriver.WrappedDriver;
var innerHtml = javaScriptExecutor.ExecuteScript("return arguments[0].innerHTML;", element).ToString();
return innerHtml;
}
It works by passing an IWebElement as a parameter to some JavaScript executing in the Browser, which treats it just like a normal DOM element. You can then access properties on it such as 'innerHTML'.
I've only tested this in Google Chrome but I see no reason why this shouldn't work in other browsers.
Using GetAttribute("textContent") instead of the Text property did the trick for me.
Driver.FindElement(By.CssSelector("ul.list span")).GetAttribute("textContent")
Try this
findElement(By.cssSelector("div.y6>span>b")).getText();
I had the same problem, working with PhantomJS. The solution is to get the value using GetAttribute("textContent"):
Driver.FindElement(By.XPath("SomexPath")).GetAttribute("textContent");
Probably too late but could be helpful for someone.
IWebElement spanText= driver.FindElement(By.XPath("//span[contains(text(), 'TEXT TO LOOK FOR')]"));
spanText.Click();
IWebElement spanParent= driver.FindElement(By.XPath("//span[contains(text(), 'TEXT TO LOOK FOR')]/ancestor::li"));
spanParent.FindElement(By.XPath(".//a[contains(text(), 'SIBLING LINK TEXT')]")).Click();
Bonus content: how to look for siblings of this text.
Once the span element is found, look for siblings by starting from the parent. I am looking for an anchor link here. The dot at the start of the XPath means the search starts from the element spanParent.
<li>
<span> TEXT TO LOOK FOR </span>
<a>SIBLING LINK TEXT</a>
</li>
This worked for me in Visual Studio 2017 Unit test project. I'm trying to find the search result from a typeahead control.
IWebElement searchBox = this.WebDriver.FindElement(By.Id("searchEntry"));
searchBox.SendKeys(searchPhrase);
System.Threading.Thread.Sleep(3000);
IList<IWebElement> results = this.WebDriver.FindElements(By.CssSelector(".tt-suggestion.tt-selectable"));
if (results.Count > 1)
{
searchResult = results[1].FindElement(By.TagName("span")).GetAttribute("textContent");
}

How can I add "class" attributes to HTML elements?

I have the following HTML. I want to add class="last" attributes to the final li elements in each list. How can I do this?
<div class="gpbscol">
<ul class="listl">
<li>ACCESSORIES</li>
<li>AMPLIFIERS</li>
<li>ANALOG AUDIO PROCESSING</li>
<li>MICROPHONE PREAMPLIFIERS</li>
<li>MICROPHONES</li>
<li>SPEAKERS/MONITORS</li>
<li>STUDIO</li>
<li>DIGITAL AUDIO PROCESSING</li>
<li>CONSOLES, MIXERS</li>
<li>DAWS/PERIPHERALS</li>
</ul>
</div>
<div class="audio">
<ul class="listl">
<li>DAWS/PERIPHERALS</li>
<li>LOUDSPEAKERS — FOH</li>
<li>RECORDERS/PLAYERS</li>
<li>HEADPHONES</li>
<li>MICROPHONES - WIRELESS CONVERTERS</li>
<li>NETWORK AUDIO / CONTROL / SNAKES</li>
<li>COMPUTER AUDIO INTERFACES</li>
<li>INTERCONNECTS</li>
<li>LOUDSPEAKERS — STAGE MONITORS</li>
<li>ACOUSTIC TREATMENT</li>
<li>MI PRODUCTS</li>
</ul>
</div>
So the final element might be
<li class="last">MI PRODUCTS</li>
It would be a styling issue. If I had no option for client-side code, I would go for CSS styling. You may consider this:
ul.listl li:last-child { }
This would be easier to do with jQuery:
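$(function(){
// :last-child matches the final li of each list; :last would match only the last li on the whole page
$("ul.listl li:last-child").addClass("last");
});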
And that's all, folks :)
I agree with @JohnHartsock, but if you want to do it your way you can use a variety of libraries that help you query HTML elements (DOM):
Fizzler
Sharp-Query
HTML Agility Pack
string[] g=@"<div class=""gpbscol"">
<ul class=""listl"">
<li>ACCESSORIES</li>
<li>AMPLIFIERS</li>
<li>ANALOG AUDIO PROCESSING</li>
<li>MICROPHONE PREAMPLIFIERS</li>
<li>MICROPHONES</li>
<li>SPEAKERS/MONITORS</li>
<li>STUDIO</li>
<li>DIGITAL AUDIO PROCESSING</li>
<li>CONSOLES, MIXERS</li>
<li>DAWS/PERIPHERALS</li>
</ul>
</div>
<div class=""audio"">
<ul class=""listl"">
<li>DAWS/PERIPHERALS</li>
<li>LOUDSPEAKERS — FOH</li>
<li>RECORDERS/PLAYERS</li>
<li>HEADPHONES</li>
<li>MICROPHONES - WIRELESS CONVERTERS</li>
<li>NETWORK AUDIO / CONTROL / SNAKES</li>
<li>COMPUTER AUDIO INTERFACES</li>
<li>INTERCONNECTS</li>
<li>LOUDSPEAKERS — STAGE MONITORS</li>
<li>ACOUSTIC TREATMENT</li>
<li>MI PRODUCTS</li>
</ul>
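</div>".Split(new string[] { "<li>" }, StringSplitOptions.None);
// Rejoin the pieces with "<li>", giving the final piece the class instead.
// Note: this tags only the very last <li> in the whole string, not the last one of each list.
string newString = String.Join("<li>", g, 0, g.Length - 1) + "<li class='last'>" + g[g.Length - 1];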
The easiest of all is using HtmlAgilityPack:
string set = [html here];
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(set);
// "//li[last()]" matches the last <li> of every list, so use SelectNodes to get them all.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//li[last()]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        node.SetAttributeValue("class", "last");
    }
}
TextWriter text = new StringWriter();
doc.Save(text);
return text.ToString();
Anyway, thanks all for helping me.
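As a cross-check of the per-list behaviour, here is a short Python sketch that adds the class to the final li of every ul. It uses the standard library's ElementTree rather than HtmlAgilityPack, so it assumes well-formed markup. `mark_last_items` is a made-up name and the fragment is a shortened version of the question's HTML.

```python
import xml.etree.ElementTree as ET


def mark_last_items(markup):
    """Add class="last" to the final <li> of every <ul> and return the new markup."""
    # Wrap in a temporary root because the fragment has several top-level <div>s.
    root = ET.fromstring("<root>" + markup + "</root>")
    for ul in root.iter("ul"):
        items = ul.findall("li")
        if items:
            items[-1].set("class", "last")
    # Serialize the children only, dropping the temporary wrapper.
    return "".join(ET.tostring(child, encoding="unicode") for child in root)


# Shortened version of the question's two lists.
FRAGMENT = (
    '<div class="gpbscol"><ul class="listl">'
    "<li>ACCESSORIES</li><li>DAWS/PERIPHERALS</li></ul></div>"
    '<div class="audio"><ul class="listl">'
    "<li>HEADPHONES</li><li>MI PRODUCTS</li></ul></div>"
)

print(mark_last_items(FRAGMENT))
```

Each list is handled on its own, so both final items receive the class rather than only the last item in the document.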
