Get the titles and URLs of a Yahoo results page in C#

I want to get the titles and URLs from a Yahoo results page using HtmlAgilityPack:
HtmlWeb w = new HtmlWeb();
string SearchResults = "https://en-maktoob.search.yahoo.com/search?p=" + query.querytxt;
var hd = w.Load(SearchResults);
var nodes = hd.DocumentNode.SelectNodes("//a[@cite and @href]");
if (nodes != null)
{
foreach (var node in nodes)
{
string Text = node.Attributes["title"].Value;
string Href = node.Attributes["href"].Value;
}
}
It works, but not all of the links in the search results are relevant. How can I omit ad links, Yahoo's own links, and so on?
I want to get only the actual result links.

What about this:
HtmlWeb w = new HtmlWeb();
string search = "https://en-maktoob.search.yahoo.com/search?p=veverke";
//ac-algo ac-21th lh-15
var hd = w.Load(search);
var titles = hd.DocumentNode.CssSelect(".title a").Select(n => n.InnerText);
var links = hd.DocumentNode.CssSelect(".fz-15px.fw-m.fc-12th.wr-bw.lh-15").Select(n => n.InnerText);
// Count() - 1 leaves out the last entry (see the note on ads below).
for (int i = 0; i < titles.Count() - 1; i++)
{
var title = titles.ElementAt(i);
string link = string.Empty;
if (links.Count() > i)
link = links.ElementAt(i);
Console.WriteLine("Title: {0}, Link: {1}", title, link);
}
Keep in mind that I am using the extension method CssSelect from the ScrapySharp NuGet package. Install it just like you installed HtmlAgilityPack, then add using ScrapySharp.Extensions; at the top of the code and you are good to go. (I use it because it's easier to work with CSS selectors than with XPath expressions.)
Regarding skipping ads: in these Yahoo search results the ads seemed to appear only as the last record. Assuming I am correct, simply skip the last one.
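As an alternative to relying on the position of the ad, the links could be filtered by host. The following is only a sketch under assumptions: it expects the links to be absolute URLs (for example read from each anchor's href attribute), and the class name ResultLinkFilter, the method KeepExternalLinks and the blockedDomain parameter are hypothetical names made up for this illustration. It uses only System.Uri and LINQ, not HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ResultLinkFilter
{
    // Keeps absolute http(s) links whose host is neither blockedDomain
    // nor one of its subdomains. Relative or malformed links are dropped.
    public static List<string> KeepExternalLinks(IEnumerable<string> hrefs, string blockedDomain)
    {
        return hrefs
            .Where(h => Uri.TryCreate(h, UriKind.Absolute, out Uri uri)
                        && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps)
                        && !IsOnDomain(uri.Host, blockedDomain))
            .ToList();
    }

    // True for "yahoo.com" and "r.search.yahoo.com", false for "notyahoo.com".
    private static bool IsOnDomain(string host, string domain)
    {
        return host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }
}
```

Calling ResultLinkFilter.KeepExternalLinks(hrefs, "yahoo.com") would then drop Yahoo's own links. Whether ad links are also served from a Yahoo host is an assumption you would need to check against the actual page.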

Related

HTMLAgilityPack selects nodes from first iteration through divs

I'm trying to use HTMLAgilityPack to parse a website for the first time. Everything works as expected, but only for the first iteration. On each iteration I get a unique div with its own data, but SelectNodes() always returns data from the first iteration.
The code listed below shows the problem.
All the Station properties get their values from the first iteration.
static void Main(string[] args)
{
List<Station> stations = new List<Station>();
wClient = new WebClient();
wClient.Proxy = null;
wClient.Encoding = encode;
for (int i = 1; i <= 1; i++)
{
HtmlDocument html = new HtmlDocument();
string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
html.LoadHtml(wClient.DownloadString(link));
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']").First().ChildNodes.Where(x=>x.Name=="div").ToList();//get list of nodes with PowerStation Data
foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
{
Station st = new Station();
st.Name = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].InnerText;//gets name from first iteration
st.Url = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].Attributes["href"].Value;//gets url from first iteration and so on
st.Company = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;
stations.Add(st);
}
}
}
Maybe I am missing some of the essentials of OOP?
Your code can be greatly simplified by using the full power of XPath.
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']/div");
// The XPath expression could also be: "//div[@class='items'][1]/div"
// where [1] means the first node
foreach (var item in stationList)
{
Station st = new Station();
st.Name = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").InnerText;
st.Url = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").Attributes["href"].Value;
string rawText = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
stations.Add(st);
}
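Your mistake was using the XPath descendant axis (//div). An expression that starts with // searches from the document root even when it is called on a child node, so every iteration returned the first match in the whole document.
Even better, rewrite the code like this: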
var divName = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']");
var nodeA = divName.SelectSingleNode("a");
st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;
string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
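The difference between an absolute and a relative XPath expression can be seen in a small self-contained sketch. Note the assumptions: it uses System.Xml.XPath from the standard library instead of HtmlAgilityPack, the markup is a made-up two-item fragment, and XPathScopeDemo and NamesUsing are hypothetical names. It illustrates the XPath scoping rule only, not the HtmlAgilityPack API.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathScopeDemo
{
    // Made-up fragment with two items, each holding one named link.
    private const string Markup =
        "<div class='items'>" +
        "<div><div class='name'><a href='/a'>Alpha</a></div></div>" +
        "<div><div class='name'><a href='/b'>Beta</a></div></div>" +
        "</div>";

    // Evaluates perItemXPath once per item node and collects the link texts.
    public static List<string> NamesUsing(string perItemXPath)
    {
        XDocument doc = XDocument.Parse(Markup);
        return doc.XPathSelectElements("//div[@class='items']/div")
                  .Select(item => item.XPathSelectElement(perItemXPath).Value)
                  .ToList();
    }
}
```

NamesUsing("//div[@class='name']/a") yields Alpha twice, because the leading // restarts the search at the document root for every item. NamesUsing("div[@class='name']/a") yields Alpha and then Beta, because the path is resolved relative to the current item.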
This article contains some good examples of various aspects of HtmlAgilityPack.
Have a look at this article; it should give you a quick start.

Trying to compare tags in mongodb string array, with textbox input c#

Okay, so I have a MongoDB database with a collection called videos, and each video has a field called tags. I want to compare a TextBox input with the tags on all videos in the collection and return the matching videos to a GridView. When I create a new video, the tags field is a string array, so it can store more than one tag. I am trying to do this in C#. I hope some of you can help. Thanks!
Code for creating a new video document.
#region Database Connection
var client = new MongoClient();
var server = client.GetServer();
var db = server.GetDatabase("Database");
#endregion
var videos = db.GetCollection<Video>("Videos");
var name = txtVideoName.Text;
var location = txtVideoLocation.Text;
var description = txtVideoDescription.Text;
var user = txtVideoUserName.Text;
string[] lst = txtVideoTags.Text.Split(new char[] { ',' });
var index = videos.Count();
var id = 0;
if (id <= index)
{
id += (int)index;
}
videos.CreateIndex(IndexKeys.Ascending("Tags"), IndexOptions.SetUnique(false));
var newVideo = new Video(id, name, location, description, lst, user);
videos.Insert(newVideo);
Okay, so here is what the search method looks like. I have just made the syntax a little different from what Grant Winney answered.
var videos = db.GetCollection<Video>("Videos");
string[] txtInput = txtSearchTags.Text.Split(new char[] { ',' });
var query = (from x in videos.AsQueryable<Video>()
where x.Tags.ContainsAny(txtInput)
select x);
This finds all videos with tags that contain a tag specified in the TextBox, assuming the MongoDB driver can properly translate it into a valid query.
var videos = db.GetCollection<Video>("Videos")
.AsQueryable()
.Where(v => v.Tags.Split(',')
.ContainsAny(txtVideoTags.Text.Split(',')))
.ToList();
Make sure you've got using MongoDB.Driver.Linq; at the top.
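The matching rule itself (split the input on commas, then keep a video if it shares at least one tag with the input) can be sketched in memory with plain LINQ. This is not a MongoDB query and says nothing about how the driver translates ContainsAny. TagMatcher, ParseTags and HasAnyTag are hypothetical names, and trimming whitespace and ignoring case are assumptions about what the search should do.

```csharp
using System;
using System.Linq;

public static class TagMatcher
{
    // Splits "cats, dogs" into { "cats", "dogs" }, dropping blanks and padding.
    public static string[] ParseTags(string input)
    {
        return input.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries)
                    .Select(t => t.Trim())
                    .Where(t => t.Length > 0)
                    .ToArray();
    }

    // True when the two tag lists share at least one tag, ignoring case.
    public static bool HasAnyTag(string[] videoTags, string[] wantedTags)
    {
        return videoTags.Intersect(wantedTags, StringComparer.OrdinalIgnoreCase).Any();
    }
}
```

Trimming matters here because a TextBox value such as "cats, dogs" would otherwise produce the tag " dogs" with a leading space, which would not equal a stored "dogs".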

How to get data from xml, by linq, c#

Hi,
I have a problem with getting data from youtube xml:
address of youtube xml: http://gdata.youtube.com/feeds/api/videos?q=keyword&orderby=viewCount
I tried this, but the program never enters the LINQ query.
key = @"http://gdata.youtube.com/feeds/api/videos?q="+keyword+@"&orderby=viewCount";
youtube = XDocument.Load(key);
urls = (from item in youtube.Elements("feed")
select new VideInfo
{
soundName = item.Element("entry").Element("title").ToString(),
url = item.Element("entry").Element("id").ToString(),
}).ToList<VideInfo>();
Does anyone have an idea how to solve this?
Thanks!
Searching for elements in Linq to XML requires that you fully qualify with the namespace. In this case:
var keyword = "food";
var key = @"http://gdata.youtube.com/feeds/api/videos?q="+keyword+@"&orderby=viewCount";
var youtube = XDocument.Load(key);
var urls = (from item in youtube.Elements("{http://www.w3.org/2005/Atom}feed")
select new
{
soundName = item.Element("{http://www.w3.org/2005/Atom}entry").Element("{http://www.w3.org/2005/Atom}title").ToString(),
url = item.Element("{http://www.w3.org/2005/Atom}entry").Element("{http://www.w3.org/2005/Atom}id").ToString(),
});
foreach (var t in urls) {
Console.WriteLine(t.soundName + " " + t.url);
}
Works for me. To avoid writing out the namespace, one option is to search by local name, e.g. youtube.Elements().Where(e => e.Name.LocalName == "feed"). I'm not sure if there's a more elegant way to be "namespace agnostic".
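One way to avoid repeating the full namespace string is to hold it in an XNamespace and combine it with local names using the + operator. The sketch below assumes a small inline Atom document passed in as a string rather than the live feed. AtomTitles and ReadTitles are hypothetical names. Unlike the query above, it reads every entry under the root instead of only the first one.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class AtomTitles
{
    // The Atom namespace, declared once and reused for every element name.
    private static readonly XNamespace Atom = "http://www.w3.org/2005/Atom";

    // Returns the text of the <title> child of each <entry> under the root.
    public static List<string> ReadTitles(string xml)
    {
        XDocument doc = XDocument.Parse(xml);
        return doc.Root
                  .Elements(Atom + "entry")
                  // The cast gives the element text, or null if <title> is missing.
                  .Select(entry => (string)entry.Element(Atom + "title"))
                  .ToList();
    }
}
```

The cast to string returns the element's text content, whereas calling ToString() on an XElement returns its markup including the tags.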

Get href from html using mshtml in C#

I am trying to get the href link out of the following HTML code using mshtml in C# (WPF).
<a class="button_link" href="https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308" style="border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;">Confirm your account now</a>
I have tried using the following code to make this work by using mshtml in C# (WPF) but I have failed miserably.
HTMLDocument mdoc = (HTMLDocument)browser.Document;
string innerHtml = mdoc.body.outerText;
string str = "https://rhystowey.com/account/confirm_email/";
int index = innerHtml.IndexOf(str);
innerHtml = innerHtml.Remove(0, index + str.Length);
int startIndex = innerHtml.IndexOf("\"");
string str3 = innerHtml.Remove(startIndex, innerHtml.Length - startIndex);
string thelink = "https://rhystowey.com/account/confirm_email/" + str3;
Can someone please help me get this to work?
Use this:
var ex = new Regex("href=\"(.*)\" style");
var tag = "<a class=\"button_link\" href=\"https://rhystowey.com/account/confirm_email/2842S-B2EB5-136382?t=1&sig=b0dbd522380a21007d8c375iuc583f46a90365d9&iid=am-130280753913638201274485430&ac=1&uid=1284488216&nid=18+308\" style=\"border:none;color:#0084b4;text-decoration:none;color:#ffffff;font-size:13px;font-weight:bold;font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;\">Confirm your account now</a>";
var address = ex.Match(tag).Groups[1].ToString();
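But you should add a check: if the pattern does not match, Groups[1] is an unsuccessful group whose value is an empty string, so test the Success property of the match before using the result.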
In your example
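HTMLDocument mdoc = (HTMLDocument)browser.Document;
// outerText holds only the visible text; the href attribute is part of the markup, so read outerHTML.
string innerHtml = mdoc.body.outerHTML;
var ex = new Regex("href=\"([^\"\"]+)\"");
var address = ex.Match(innerHtml).Groups[1].ToString();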
will match the first href="...". Or you select all occurrences:
var matches = (from Match match in ex.Matches(innerHtml) select match.Groups[1].Value).ToList();
This will give you a List<string> with all the links in your HTML. To filter this, you can either go this way
var wantedMatches = matches.Where(m => m.StartsWith("https://rhystowey.com/account/confirm_email/"));
which is more flexible because you could check against a list of start strings or whatever. Or you do it in your regex, which should perform better:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
Bringing it all together into what you want, as far as I understand it:
var ex = new Regex("href=\"(https://rhystowey\\.com/account/confirm_email/[^\"\"]+)\"");
var matches = (from Match match in ex.Matches(innerHtml)
where match.Groups[1].Success
select match.Groups[1].Value).ToList();
var firstAddress = matches.FirstOrDefault();
firstAddress holds your link, if there is one.
If your link will always start with the same path and isn't repeated on the page, you can use this (untested):
var match = Regex.Match(html, @"href=""(?<href>https\:\/\/rhystowey\.com\/account\/confirm_email\/[^""]+)""");
if (match.Success)
{
var href = match.Groups["href"].Value;
....
}
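The named-group approach can be wrapped in a small helper that returns null when nothing matches. This is a sketch under assumptions: ConfirmLinkFinder and FindConfirmLink are hypothetical names, and the HtmlDecode step assumes the markup may contain ampersands encoded as &amp;amp;, which is common in serialized HTML but not confirmed for this page.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class ConfirmLinkFinder
{
    // Captures the confirm_email URL from an href attribute into the group "href".
    private static readonly Regex HrefPattern = new Regex(
        @"href=""(?<href>https://rhystowey\.com/account/confirm_email/[^""]+)""");

    // Returns the first matching link with HTML entities decoded, or null if none is found.
    public static string FindConfirmLink(string html)
    {
        Match match = HrefPattern.Match(html);
        return match.Success ? WebUtility.HtmlDecode(match.Groups["href"].Value) : null;
    }
}
```

Checking match.Success first keeps the "no link on the page" case explicit instead of silently returning an empty string.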

Working with HtmlAgilityPack

I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // Initialize HtmlAgilityPack's functions.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Index the whole DB.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Get every url.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
#region Get all jars
if (completeUrl.Contains(".jar")) // Check if the url contains .jar
{
fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); // Get the filename without extension.
Console.WriteLine(fileName);
}
#endregion
#region Get all Authors
if (completeUrl.Contains("?author=")) // Check if the url contains ?author=
{
authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the author name after the last '='.
Console.WriteLine(authorName);
}
#endregion
}
I am trying to get the filenames and authors listed next to each other, but right now everything seems to be placed randomly. Why?
Can someone help me with this? Thanks!
If you look at the HTML, it is unfortunately not well-formed. There are a lot of unclosed tags, and HAP does not structure the document the way a browser does: it interprets the majority of the document as deeply nested. So you can't simply iterate through the rows of the table as you would in the browser; it gets a lot more complicated than that.
When dealing with such documents, you have to change your queries quite a bit. Rather than searching through child elements, you have to search through descendants adjusting for the change.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);
// select the rows in the table
var xpath = "//div[@class='content']/div[@class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);
// unfortunately the `tr` tags are not closed so HAP interprets
// this table having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
.Skip(1); // skip header row
var query =
from row in rows
// there may be a row with an embedded ad
where row.SelectSingleNode("td/script") == null
// each row has 6 columns so we need to grab the next 6 descendants
let columns = row.Descendants("td").Take(6).ToList()
let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
select new
{
Title = titleText ?? "",
Author = authorText ?? "",
FileName = Path.GetFileName(downloadLink ?? ""),
};
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}
