Web scraper using HtmlAgilityPack

I am new to C#, so this may be obvious or it may be way over my head, but I am trying to scrape a web page using HtmlAgilityPack. My code compiles, but when I write the string to a file I only get one result, and it is always the last li in the ul. The string split is there so I can eventually output the title and description strings into a .csv for further use. I am unsure what to do next, so any help, ideas or suggestions would be appreciated. Thank you!
private void button1_Click(object sender, EventArgs e)
{
List<string> cities = new List<string>();
//var xpath = "//h2[span/@id='Cities']";
var xpath = "//h2[span/@id='Cities']" + "/following-sibling::ul[1]" + "/li";
WebClient web = new WebClient();
String html = web.DownloadString("http://wikitravel.org/en/Vietnam");
hap.HtmlDocument doc = new hap.HtmlDocument();
doc.LoadHtml(html);
foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
{
string all = node.InnerText;
//splits text between '—', '-' or ' ' into 2 parts
string[] split = all.Split(new char[] { '—', ' ', '-' }, StringSplitOptions.None);
string title;
string description;
int nodeCount;
nodeCount = node.ChildNodes.Count;
if (nodeCount == 2)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText;
}
else if (nodeCount == 4)
{
title = node.ChildNodes[0].InnerText;
description = node.ChildNodes[1].InnerText + node.ChildNodes[2].InnerText;
}
else
{
title = "Error";
description = "The node cound was not 2 or 3. Check the div section.";
}
System.IO.StreamWriter write = new System.IO.StreamWriter(@"C:\Users\cbrannin\Desktop\textTest\testText.txt");
write.WriteLine(all);
write.Close();
}
}
}

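One problem is that you create a new StreamWriter on the same path each time through the loop, which truncates the file, so only the last item survives. You probably want to open the writer once, outside the loop:
using (StreamWriter write = new StreamWriter(@"filename"))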
{
foreach (hap.HtmlNode node in doc.DocumentNode.SelectNodes(xpath))
{
// do your thing
write.WriteLine(all);
}
}
Also, have you single-stepped this to see whether you're getting more than one HtmlNode back from your SelectNodes call?
Finally, I don't see where you're doing anything with the title or description. Were you planning to use those for something else?
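A minimal sketch of that fix, carried one step further toward the .csv output the question mentions. It opens the writer once, splits each item on the first em dash only, and writes one quoted row per item. The helper names SplitEntry and ToCsvField, the sample strings, and the cities.csv path are assumptions for illustration; in the real handler the strings would come from node.InnerText. It uses top-level statements (C# 9 or later) to stay short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical sample data standing in for node.InnerText of each <li>.
var items = new List<string>
{
    "Hanoi \u2014 the capital",
    "Hue \u2014 former imperial city, on the Perfume River",
};

// Split on the first em dash only, so spaces and hyphens inside either part survive.
(string Title, string Description) SplitEntry(string text)
{
    int i = text.IndexOf('\u2014');
    if (i < 0) return (text.Trim(), string.Empty);
    return (text.Substring(0, i).Trim(), text.Substring(i + 1).Trim());
}

// Quote a field and double any embedded quotes (the usual CSV convention).
string ToCsvField(string s) => "\"" + s.Replace("\"", "\"\"") + "\"";

// Open the writer once, outside the loop, so every row ends up in the file.
string path = Path.Combine(Path.GetTempPath(), "cities.csv");
using (var writer = new StreamWriter(path))
{
    foreach (string item in items)
    {
        var (title, description) = SplitEntry(item);
        writer.WriteLine(ToCsvField(title) + "," + ToCsvField(description));
    }
}

Console.WriteLine(File.ReadAllText(path));
```

Splitting on ' ' and '-' as well, as the original code does, would cut multi-word titles apart, which is why this sketch splits on the em dash alone.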

Related

My function to get text between two strings isn't finding the correct words

I am creating an application that fetches information about a website. I have been trying several approaches on getting the information from the HTML tags. The website is who.is and I am trying to get information about Google (as a test!) Source can be found on view-source:https://who.is/whois/google.com/ < (if using Chrome browser)
Now the problem is that I am trying to get the name of the creator of the website (Mark or something) but I am not receiving the correct result. My code:
//GET name
string getName = source;
string nameBegin = "<div class=\"col-md-4 queryResponseBodyKey\">Name</div><div class=\"col-md-8 queryResponseBodyValue\">";
string nameEnd = "</div>";
int nameStart = getName.IndexOf(nameBegin) + nameBegin.Length;
int nameIntEnd = getName.IndexOf(nameEnd, nameStart);
string creatorName = getName.Substring(nameStart, nameIntEnd - nameStart);
lb_name.Text = creatorName;
(source contains html of page)
This doesn't give the correct result, though. I think it has something to do with the backslashes (\") I use to escape the double quotes inside the string...
What am I doing wrong? :(

Instead of trying to parse the HTML result manually, use a real HTML parser like HtmlAgilityPack:
using (var client = new HttpClient())
{
var html = await client.GetStringAsync("https://who.is/whois/google.com/");
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var nodes = doc.DocumentNode.SelectNodes("//*[@class='col-md-4 queryResponseBodyKey']");
var results = nodes.ToDictionary(n=>n.InnerText, n=>n.NextSibling.NextSibling.InnerText);
//print
foreach(var kv in results)
{
Console.WriteLine(kv.Key + " => " + kv.Value);
}
}

string getName = "<div class=\"col-md-4 queryResponseBodyKey\">Name</div><div class=\"col-md-8 queryResponseBodyValue\">";
string nameBegin = "<div class=\"col-md-4 queryResponseBodyKey\">";
string nameEnd = "</div>";
int nameStart = getName.IndexOf(nameBegin) + nameBegin.Length;
int nameIntEnd = getName.IndexOf(nameEnd, nameStart);
string creatorName = getName.Substring(nameStart, nameIntEnd - nameStart);
//lb_name.Text = creatorName;
Console.WriteLine(creatorName);
Console.ReadLine();
Is this what you are looking for, getting "Name" from that div?
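Whichever route is taken, the manual version in the question has one silent failure mode worth guarding against: when the start marker is not found, IndexOf returns -1, and adding nameBegin.Length to it produces an offset that looks valid but points at unrelated text. The escaped quotes themselves are not a problem. A minimal sketch of a guarded helper follows; the name Between and the sample HTML are hypothetical, not taken from the who.is page.

```csharp
using System;

// Returns the text between the first 'begin' marker and the next 'end' marker,
// or null when either marker is missing.
string Between(string source, string begin, string end)
{
    int start = source.IndexOf(begin, StringComparison.Ordinal);
    if (start < 0) return null;               // start marker not present
    start += begin.Length;
    int stop = source.IndexOf(end, start, StringComparison.Ordinal);
    if (stop < 0) return null;                // end marker not present
    return source.Substring(start, stop - start);
}

// Hypothetical markup for illustration only.
string html = "<div class=\"key\">Name</div><div class=\"value\">Example Registrant</div>";

Console.WriteLine(Between(html, "<div class=\"value\">", "</div>"));
Console.WriteLine(Between(html, "<span>", "</span>") ?? "(marker not found)");
```

If the helper returns null for the real page, the marker string does not match the downloaded source exactly (whitespace or line breaks between the two divs are a common cause).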

How to find a word from a text file and then read the next word after it on C#

Ok I've got a txt file named "info.txt" which includes the following text:
[entry]
title = Hello World
info = sometext
number = 0
available = -1
[entry]
title = All Vids
info = somemoretext
number = 1
available = 0
[entry]
title = All pics
info = somedifferenttext
number = 2
available = -1
[entry]
title = all music
info = differenttext
number = 3
available = 0
In C#, what I want to do is open this file, search for "title = ", take the text after it, and put that text in a text box.
So for example, after finding the first "title = " I want to put "Hello World" in textbox1. If there is another "title = ", which would be "All Vids", I want to put it in textbox2. The same should happen for any further instances of "title = ", which go into textbox3, textbox4 and so on.
This is what I worked on which I found from another answer:
private void button1_Click(object sender, EventArgs e)
{
List<List<string>> groups = new List<List<string>>();
List<string> current = null;
foreach (var line in File.ReadAllLines(@"C:\Users\Rohul\Documents\info.txt"))
{
if (line.Contains("title") && current == null)
current = new List<string>();
else if (line.Contains("info") && current != null)
{
groups.Add(current);
current = null;
}
if (current != null)
richTextBox1.Text = line;
}
}
The problem with this is that it reads the full line, and only the last entry is shown.
I hope someone can help me.
Thanks in advance

Consider your data is in a file named data.txt.
Logic: Read the data, split by new line, find lines containing "title =". Remove that identifier and take the rest of the line.
string data = File.ReadAllText("data.txt");
string identifier = "title =";
List<string> results =
data.Split(new string[] { Environment.NewLine }, StringSplitOptions.RemoveEmptyEntries)
.Where(x => x.Contains(identifier))
.Select(x => x.Replace(identifier, String.Empty).Trim()).ToList();
After that, you'll have list of strings in results. Do whatever you want with it.
If you need to read it line by line like you tried, then:
string identifier = "title =";
List<string> results = new List<string>();
foreach(string line in File.ReadAllLines("data.txt"))
{
if(line.Contains(identifier))
{
results.Add(line.Replace(identifier, string.Empty).Trim());
}
}

It sounds like you are trying to read an INI file. If that is your purpose, take a look at the article "An INI file handling class using C#" or the Stack Overflow question "Reading/writing an INI file".
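If a full INI library is more than is needed, the file shown in the question can be grouped by hand. The following is a minimal sketch using only the base class library: each [entry] line starts a new dictionary, and every key = value line after it is stored in that dictionary. The function name ParseEntries is hypothetical, and the sample lines are copied from the question rather than read from disk.

```csharp
using System;
using System.Collections.Generic;

// Groups "key = value" lines under each "[entry]" header.
List<Dictionary<string, string>> ParseEntries(IEnumerable<string> lines)
{
    var entries = new List<Dictionary<string, string>>();
    Dictionary<string, string> current = null;
    foreach (string raw in lines)
    {
        string line = raw.Trim();
        if (line == "[entry]")
        {
            current = new Dictionary<string, string>();
            entries.Add(current);
            continue;
        }
        int eq = line.IndexOf('=');
        // Skip blank lines, lines without '=', and anything before the first header.
        if (current == null || eq < 0) continue;
        current[line.Substring(0, eq).Trim()] = line.Substring(eq + 1).Trim();
    }
    return entries;
}

string[] sample =
{
    "[entry]", "title = Hello World", "info = sometext", "number = 0", "available = -1",
    "[entry]", "title = All Vids", "info = somemoretext", "number = 1", "available = 0",
};

foreach (var entry in ParseEntries(sample))
    Console.WriteLine(entry["title"]);
```

With the real file, File.ReadAllLines on the info.txt path can be passed in place of sample, and the nth entry's "title" value assigned to the nth text box.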

C# Trying to read a page using XmlNode

So I am trying to read the Steam store page from the lowest price to the highest. I have the URL needed, and I have written some code which has worked in the past but does not work anymore. I have spent some days trying to fix this, but I just can't seem to find the problem.
The URL I am trying to read is in the code below.
Here is the code.
//List of items from the Steam market from lowest to highest
private void priceFromMarket(int StartPage)
{
if (valueList.Count != 0)
{
valueList.Clear();
numList.Clear();
nameList.Clear();
}
string pageContent = null;
string results_html = null;
try
{
HttpWebRequest myReq = (HttpWebRequest)WebRequest.Create("http://steamcommunity.com/market/search/render/?query=appid:730&start=" + StartPage.ToString() + "&sort_column=price&sort_dir=asc&count=100&currency=1&l=english");
HttpWebResponse myRes = (HttpWebResponse)myReq.GetResponse();
using (StreamReader sr = new StreamReader(myRes.GetResponseStream()))
{
pageContent = sr.ReadToEnd();
}
}
catch { Thread.Sleep(30000); priceFromMarket(StartPage); }
if (pageContent == null) { priceFromMarket(StartPage); }
try
{
JObject user = JObject.Parse(pageContent);
bool success = (bool)user["success"];
if (success)
{
results_html = (string)user["results_html"];
string data = results_html;
data = "<root>" + data + "</root>";
XmlDocument document = new XmlDocument();
document.LoadXml(System.Net.WebUtility.HtmlDecode(data));
XmlNode rootnode = document.SelectSingleNode("root");
XmlNodeList items = rootnode.SelectNodes("./a/div");
foreach (XmlNode node in items)
{
//This does not work anymore!
//The try fails here at line 574!
string value = node.SelectSingleNode("./div[contains(concat(' ', @class, ' '), ' market_listing_their_price ')]/span/span").InnerText;
string num = node.SelectSingleNode("./div[contains(concat(' ', @class, ' '), ' market_listing_num_listings ')]/span/span").InnerText;
string name = node.SelectSingleNode("./div/span[contains(concat(' ', @class, ' '), ' market_listing_item_name ')]").InnerText;
valueList.Add(value); //Lowest price for the item
numList.Add(num); //Volume of that item
nameList.Add(name); //Name of that item
}
}
else { Thread.Sleep(60000); priceFromMarket(StartPage); }
}
catch { Thread.Sleep(60000); priceFromMarket(StartPage); }
}

Parsing HTML as XML is not reliable, because real-world HTML does not have to be well-formed XML; an XML parser throws on markup that an HTML parser handles fine.
For parsing HTML in C# I prefer CsQuery (https://www.nuget.org/packages/CsQuery/), which lets you query HTML in C# much as you would with jQuery.
Another option is HTML Agility Pack, which you could probably use without changing much of your code, since its API is similar to System.Xml.XmlDocument.

how do i handle special characters scraped from a web page (xml and strings)

using HTMLAgility, C#, XML
I am using HtmlAgilityPack to scrape a web page and populate a class structure that in turn gets serialised into an XML document.
The data I am handling are guitar chords, so I have some special characters to manage.
The special character I am struggling with is the middle character in "Aº7" (which means diminished in musical terms).
When I get the string from the web page I see a question mark in a black diamond in the watch window, and this in turn ends up in the XML.
My choices are:
a) handle the char appropriately so that it appears in the XML as the character it is.
b) convert every instance of this char in a string to the word "dim".
What is the best way to go about this, given that the char does get found in a replace statement (using char(code))?
I am not really sure how I 'should' approach this problem.
The code I am using to grab the data is below. For clarity, this is a once-and-done function that will never be used again once I have the data in a usable format; it is simply built to create an XML-serialised object structure.
public void BuildDBFromWebSite()
{
string[] chordKeys = { "A", "A#", "Ab", "B", "Bb", "C", "C#", "D", "D#", "Db", "E", "Eb", "F", "F#", "G", "G#", "Gb" };
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
foreach (string chordKeyName in chordKeys)
{
//LOOP THROUGH THE CHORD KEYS
chordKey theChordKey = new chordKey() { KeyName = chordKeyName };
_keys.Add(theChordKey);
//grab the tone page
doc = web.Load("http://www.scales-chords.com/showchbykey.php?key=" + theChordKey.KeyName);
HtmlNode chordListTable = doc.DocumentNode.SelectSingleNode("/html/body/div[@id='wrapper']/div[@id='body']/div[@id='left']/div[@id='visit']/table/tbody");
// CHORDS
HtmlNodeCollection chordRows = chordListTable.SelectNodes("tr");
for (int i = 2; i < chordRows.Count; i++)
{
//LOOP THROUGH THE CHORDS
Chord theChord = new Chord();
HtmlNodeCollection chordInfoCells = chordRows[i].SelectNodes("td");
HtmlNode chordLink = chordInfoCells[0].SelectSingleNode("a[@href]");
//each of the next 3 cells can contain a bad glyph for diminished chords
theChord.ChordName = chordInfoCells[0].InnerText;
theChord.ChordNameText = chordInfoCells[1].InnerText;
theChord.Family = chordInfoCells[2].InnerText;
theChord.Importance = chordInfoCells[3].InnerText;
//HtmlAgilityPack.HtmlAttribute href = chordLink.Attributes["href"];
//HTMLAgility tries to encode the bad glyph but uses the wrong escape and breaks the href, work around is to manually strip the href myself
string theURL = chordLink.OuterHtml;
theURL = theURL.Remove(0,9);
int startPos = theURL.IndexOf(">") - 1;
theURL = theURL.Substring(0, startPos);
const string theBadCode = "º";
theChord.ChordNameURL = HTMLEncodeSpecialChars(theURL);
theChordKey.Chords.Add(theChord);
}
//VARIATIONS ETC
foreach (Chord theChord in theChordKey.Chords)
{
//grab the tone page
doc = web.Load("http://www.scales-chords.com/" + theChord.ChordNameURL);
HtmlNode chordMoreInfoTable = doc.DocumentNode.SelectSingleNode("/html/body/div[@id='wrapper']/div[@id='body']/div[@id='left']/div[@id='visit']/center/table[1]/tbody");
HtmlNodeCollection chordMoreInfoRows = chordMoreInfoTable.SelectNodes("tr");
theChord.Notes = chordMoreInfoRows[3].SelectNodes("td")[1].InnerText;
theChord.Structure = chordMoreInfoRows[4].SelectNodes("td")[1].InnerText;
theChord.BelongsTo = chordMoreInfoRows[6].SelectNodes("td")[1].InnerText;
HtmlNodeCollection variationHTML = doc.DocumentNode.SelectNodes("/html/body/div[@id='wrapper']/div[@id='body']/div[@id='left']/div[@id='visit']/center/b");
for (int iVariation = 1; iVariation < variationHTML.Count; iVariation=iVariation+2)
{
Variation theVariation = new Variation();
theVariation.Notation = variationHTML[iVariation].NextSibling.InnerHtml;
theVariation.Difficuty = variationHTML[iVariation + 1].NextSibling.InnerText;
string[] theStrings = theVariation.Notation.Split(' ');
try
{
theVariation.String1 = theStrings[1];
theVariation.String2 = theStrings[2];
theVariation.String3 = theStrings[3];
theVariation.String4 = theStrings[4];
theVariation.String5 = theStrings[5];
theVariation.String6 = theStrings[6];
}
catch (Exception ex)
{
}
theChord.Variations.Add(theVariation);
Console.WriteLine(theChord.ChordNameText + " : " + theVariation.Notation);
}
}
}
this.SaveToDisk("C:\\chords.xml");
}
thanks
Dan

The problems you are having are most likely due to the encoding. The page declares the charset windows-1252, so downloading the page with that encoding and handing the string to the document should work:
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.GetEncoding("windows-1252");
doc.LoadHtml(client.DownloadString("http://www.scales-chords.com/showchbykey.php?key=" + theChordKey.KeyName));
Of course, if you were to use this function more than once, I'd move the declaration of the WebClient instance out of the foreach.
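The question mark in a black diamond is U+FFFD, the replacement character a decoder emits when it meets bytes that are invalid in the encoding it assumes. A minimal sketch of the effect, using only System.Text: the byte 0xBA is the chord symbol in single-byte Western code pages but is not valid on its own in UTF-8. ISO-8859-1 is used here as a stand-in for windows-1252, an assumption made because it is available on every .NET runtime without registering extra code pages; both map 0xBA to U+00BA.

```csharp
using System;
using System.Text;

// The bytes for 'A', 0xBA and '7' as a single-byte Western code page would send them.
byte[] raw = { 0x41, 0xBA, 0x37 };

// 0xBA is a stray continuation byte in UTF-8, so the decoder substitutes U+FFFD.
string asUtf8 = Encoding.UTF8.GetString(raw);

// Decoding with a matching single-byte encoding recovers U+00BA.
string asLatin1 = Encoding.GetEncoding("iso-8859-1").GetString(raw);

Console.WriteLine("UTF-8 decode contains U+FFFD: " + asUtf8.Contains("\uFFFD"));
Console.WriteLine("Latin-1 decode equals A\\u00BA7: " + (asLatin1 == "A\u00BA7"));
```

Once the replacement character is in the string the original byte is gone, so the fix has to happen at download time rather than with a later Replace call.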

String not getting decoded

I have a DevExpress report, and the datasource has a field where the data comes through like this:
PRODUCT - APPLE<BR/>ITEM NUMBER - 23454</BR>LOT NUMBER 3343 <BR/>
That is exactly how it shows in a cell, so I decided to decode it, but nothing is working. I tried HttpUtility.HtmlDecode, and here I am trying WebUtility.HtmlDecode.
private void xrTableCell9_BeforePrint(object sender, System.Drawing.Printing.PrintEventArgs e)
{
XRTableCell cell = sender as XRTableCell;
string _description = WebUtility.HtmlDecode(Convert.ToString(GetCurrentColumnValue("Description")));
cell.Text = _description;
}
How can I decode the value of this column in the datasource?
Thank you

If you need to show the description with the < /> also, you need to use HtmlEncode.
If you need to extract the text from that html
public static string ExtractTextFromHtml(this string text)
{
if (String.IsNullOrEmpty(text))
return text;
var sb = new StringBuilder();
var doc = new HtmlDocument();
doc.LoadHtml(text);
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
if (!String.IsNullOrWhiteSpace(node.InnerText))
sb.Append(HtmlEntity.DeEntitize(node.InnerText.Trim()) + " ");
}
return sb.ToString();
}
And you need HtmlAgilityPack
To replace the br tags with line breaks:
var str = Convert.ToString(GetCurrentColumnValue("Description"));
str = Regex.Replace(str, @"</?\s?br\s?/?>", System.Environment.NewLine, RegexOptions.IgnoreCase);
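HtmlDecode only converts entities such as &amp;lt; back into characters; it does not strip or interpret tags, which is why the value appears unchanged. The sketch below runs the replacement on the sample value from the question and keeps the result, since Regex.Replace returns a new string rather than modifying its input. The variable names are illustrative, and "\n" is used instead of Environment.NewLine so the split is the same on every platform.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample value copied from the question.
string description = "PRODUCT - APPLE<BR/>ITEM NUMBER - 23454</BR>LOT NUMBER 3343 <BR/>";

// Replace every <br>, <br/> or </br> variant (any case) with a newline and keep the result.
string withBreaks = Regex.Replace(description, @"</?\s?br\s?/?>", "\n", RegexOptions.IgnoreCase);

// Split into the individual lines, dropping the empty one after the trailing tag.
string[] lines = withBreaks.Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries);

foreach (string line in lines)
    Console.WriteLine(line.Trim());
```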
