Extracting text values from an HTML source file - C#

In this code, the variable TempTxt holds the HTML body content as a string.
How can I extract the inner text or inner HTML of a <table> or <td> element using lambda syntax?
public string ExtractPageValue(IWebDriver DDriver, string url = "")
{
    if (string.IsNullOrEmpty(url))
        url = @"http://www.boi.org.il/he/Markets/ExchangeRates/Pages/Default.aspx";
    var service = InternetExplorerDriverService.CreateDefaultService(directory);
    service.LogFile = directory + @"\seleniumlog.txt";
    service.LoggingLevel = InternetExplorerDriverLogLevel.Trace;
    var options = new InternetExplorerOptions();
    options.IntroduceInstabilityByIgnoringProtectedModeSettings = true;
    DDriver = new InternetExplorerDriver(service, options, TimeSpan.FromSeconds(60));
    DDriver.Navigate().GoToUrl(url);
    var TempTxt = DDriver.PageSource;
    return ""; //Math.Round(Convert.ToDouble( TempTxt.Split(' ')[10]),2).ToString();
}

If you are open to trying HtmlAgilityPack:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var table = doc.DocumentNode.SelectNodes("//table/tr")
    .Select(tr => tr.Elements("td").Select(td => td.InnerText).ToList())
    .ToList();
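One thing to watch for with the snippet above: SelectNodes returns null, not an empty collection, when the XPath matches nothing, and "//table/tr" does not match rows that sit inside a <tbody>. Below is a minimal sketch of a guarded variant, assuming the HtmlAgilityPack NuGet package is referenced. The names TableTextExtractor and ExtractRows are made up for this illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableTextExtractor
{
    // Returns one list of cell texts per <tr>; an empty list when no rows match.
    public static List<List<string>> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//table//tr" also matches rows nested inside <tbody>.
        var rows = doc.DocumentNode.SelectNodes("//table//tr");
        if (rows == null)
            return new List<List<string>>();

        return rows
            .Select(tr => tr.Elements("td")
                // Decode entities such as &amp; and trim surrounding whitespace.
                .Select(td => HtmlEntity.DeEntitize(td.InnerText).Trim())
                .ToList())
            .ToList();
    }
}
```

In the question's method this could be called as TableTextExtractor.ExtractRows(TempTxt), after which individual cells are reachable by row and column index.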

Related

Parsing site using HtmlAgilityPack in C#

For example, I have the link https://shikimori.one/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv/art
I want to get the list of divs with the class "container packery" from that page using HtmlAgilityPack in C# (in order to download the images from all the links). But this part
var httpClient = new HttpClient();
var html = await httpClient.GetStringAsync(link);
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(link);
returns the HTML of the page https://shikimori.one/animes/38256-magia-record-mahou-shoujo-madoka-magica-gaiden-tv, as far as I understand, so I can't parse anything from "/art". Because of that, the next part of the code just returns null.
var links = htmlDocument.DocumentNode.Descendants("div")
.Where(node => node.GetAttributeValue("class", "")
.Equals("menu-slide-outer x199")).ToList();
What am I missing?
Final code:
class Program
{
    static List<string> sources = new List<string>();

    [STAThread]
    static void Main(string[] args)
    {
        var link = "https://shikimori.one/animes/1577-taiho-shichau-zo/art";
        var web = new HtmlWeb();
        web.BrowserTimeout = TimeSpan.FromTicks(0);
        // LoadFromBrowser renders the page, so dynamically loaded content is included.
        var htmlDocument = web.LoadFromBrowser(link);
        var divlink = htmlDocument.DocumentNode.Descendants("div")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("container packery")).ToList();
        var alink = htmlDocument.DocumentNode.Descendants("a")
            .Where(node => node.GetAttributeValue("class", "")
            .Equals("b-image")).ToList();
        foreach (var a in alink)
        {
            sources.Add(a.GetAttributeValue("href", string.Empty));
        }
        Console.WriteLine("done");
        Console.ReadKey();
    }
}

With HttpClient:
string html = string.Empty;
using (var httpClient = new HttpClient())
{
    html = await httpClient.GetStringAsync(link);
}
var htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html); // LoadHtml expects the page source, not the URL.
Or, a simple way to parse HTML code from a URL:
var web = new HtmlAgilityPack.HtmlWeb();
var htmlDocument = await web.LoadFromWebAsync(link);
Or, load dynamic (i.e. Ajax) content:
var web = new HtmlAgilityPack.HtmlWeb();
var htmlDocument = web.LoadFromBrowser(link);

Net Core: Convert String to TagBuilder

The following code converts a TagBuilder to a string.
How do I do the reverse and convert a string to a TagBuilder?
Convert IHtmlContent/TagBuilder to string in C#
public static string GetString(IHtmlContent content)
{
    using (var writer = new System.IO.StringWriter())
    {
        content.WriteTo(writer, HtmlEncoder.Default);
        return writer.ToString();
    }
}

As an option, you can use an HTML parser like HtmlAgilityPack to get an HTML node, then create a TagBuilder from the node's name, attributes and inner HTML:
public TagBuilder GetTagBuilder(string html)
{
    var node = HtmlAgilityPack.HtmlNode.CreateNode(html);
    var tagBuilder = new TagBuilder(node.Name);
    tagBuilder.MergeAttributes(node.Attributes.ToDictionary(x => x.Name, x => x.Value));
    // In ASP.NET Core, InnerHtml is a read-only IHtmlContentBuilder, so append to it.
    // (In ASP.NET MVC 5 it is a settable string: tagBuilder.InnerHtml = node.InnerHtml;)
    tagBuilder.InnerHtml.AppendHtml(node.InnerHtml);
    return tagBuilder;
}
For example:
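var html = @"<div id=""div1"" class=""class1"">Something</div>";
var tagBuilder = GetTagBuilder(html);
// GetString is the helper from the question; in ASP.NET Core, TagBuilder is rendered through WriteTo.
var str = GetString(tagBuilder);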
Then the str value would be:
<div class="class1" id="div1">Something</div>

With XmlDocument (the input must be well-formed XML):
var doc = new System.Xml.XmlDocument();
doc.LoadXml(html);
TagBuilder tagBuilder = new TagBuilder(doc.DocumentElement.Name);
tagBuilder.MergeAttributes(doc.DocumentElement.Attributes.Cast<System.Xml.XmlAttribute>().ToDictionary(x => x.Name, x => x.Value));
// In ASP.NET Core, InnerHtml is a read-only IHtmlContentBuilder, so append to it.
tagBuilder.InnerHtml.AppendHtml(doc.DocumentElement.InnerXml);
return tagBuilder;
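The XmlDocument route only works when the markup is well-formed XML. Typical HTML such as an unclosed <br> or an undeclared entity like &nbsp; makes LoadXml throw an XmlException. The sketch below uses only the base class library to show one way to detect that case before building a TagBuilder. The names XmlFragmentReader and TryReadRoot are invented for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml;

public static class XmlFragmentReader
{
    // Reads the root element's name and attributes.
    // Returns false instead of throwing when the markup is not well-formed XML.
    public static bool TryReadRoot(string markup, out string name,
                                   out Dictionary<string, string> attributes)
    {
        name = null;
        attributes = null;

        var doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
        }
        catch (XmlException)
        {
            // For example "<div><br></div>" ends up here: <br> is never closed.
            return false;
        }

        name = doc.DocumentElement.Name;
        attributes = doc.DocumentElement.Attributes
            .Cast<XmlAttribute>()
            .ToDictionary(a => a.Name, a => a.Value);
        return true;
    }
}
```

When TryReadRoot returns false, falling back to the HtmlAgilityPack approach is a reasonable choice, since that parser tolerates markup that is not valid XML.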

Using a method in a Lambda Expression - HTMLDoc

I want to load content into a list of HtmlDocument objects using HTML Agility Pack.
I have successfully achieved what I want using:
var htmllist = new List<HtmlDocument>();
int counter = 0;
foreach (var c in content)
{
    htmllist.Add(new HtmlDocument());
    htmllist[counter].LoadHtml(c);
    counter += 1;
}
How can I write this as a lambda expression? I tried:
var htmllist = content.Select(p => new HtmlDocument() { Text = p });

You need to add ToList() to execute the query, like this:
var htmllist = content.Select(p => new HtmlDocument() { Text = p }).ToList();
Per your comment and another comment, you can change your existing code a bit by moving the loading into a helper method:
private HtmlDocument LoadHtmlFromContent(string content)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(content);
    return doc;
}
Now call it in your LINQ query:
var htmllist = content.Select(p => this.LoadHtmlFromContent(p)).ToList();

Enumerable.Select accepts an arbitrary selector as a Func<TSource,TResult>. So you can inline the conversion method, but imho it really doesn't look great…
content.Select(c => { var doc = new HtmlDocument(); doc.LoadHtml(c); return doc; });
If you're using C# 7.0 or later, you could think about using a local function for that. E.g.
void Convert(IEnumerable<string> content)
{
    var htmls = content.Select(ConvertToHtml);

    HtmlDocument ConvertToHtml(string c)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(c);
        return doc;
    }
}
That looks more maintainable to me.

Html Agility Pack get contents from table

I need to get the location, address, and phone number from "http://anytimefitness.com/find-gym/list/AL". So far I have this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(stateURLs[0].ToString());
var BlankNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='']");
var GrayNode =
    htmlDoc.DocumentNode.SelectNodes("/div[@class='segmentwhite']/table[@style='width: 100%;']//tr[@class='gray_bk']");
I have looked around Stack Overflow for a while, but none of the existing posts about HtmlAgilityPack has really helped. I have also been using http://www.w3schools.com/xpath/xpath_syntax.asp

Since the <div> you're after is not a direct child of the root node, you need to use // instead of /. Then you can combine the XPath for BlankNode and GrayNode using the or operator, for example:
var htmlweb = new HtmlWeb();
HtmlDocument htmlDoc = htmlweb.Load("http://anytimefitness.com/find-gym/list/AL");
htmlDoc.OptionFixNestedTags = true;
var AllNode =
    htmlDoc.DocumentNode.SelectNodes("//div[@class='segmentwhite']/table//tr[@class='' or @class='gray_bk']");
foreach (HtmlNode node in AllNode)
{
    var location = node.SelectSingleNode("./td[2]").InnerText;
    var address = node.SelectSingleNode("./td[3]").InnerText;
    var phone = node.SelectSingleNode("./td[4]").InnerText;
    // do something with the information above
}

Here's an example I tested in LinqPad.
string url = @"http://anytimefitness.com/find-gym/list/AL";
var client = new System.Net.WebClient();
var data = client.DownloadData(url);
var html = Encoding.UTF8.GetString(data);
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.LoadHtml(html);
var gyms = htmlDoc.DocumentNode.SelectNodes("//tbody/tr[@class='' or @class='gray_bk']");
foreach (var gym in gyms) {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
}
Since HtmlAgilityPack also supports LINQ, you could also do something like:
string[] classes = { "", "gray_bk" };
var gyms = htmlDoc
    .DocumentNode
    .Descendants("tr")
    // Attributes["class"] is null for rows without a class attribute, hence the ?. operator.
    .Where(t => classes.Contains(t.Attributes["class"]?.Value))
    .ToList();
gyms.ForEach(gym => {
    var city = gym.SelectSingleNode("./td[2]").InnerText;
    var address = gym.SelectSingleNode("./td[3]").InnerText;
    var phone = gym.SelectSingleNode("./td[4]").InnerText;
});
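The loops above read the three cells but leave open what to do with them, and they throw if SelectNodes finds nothing or a row has fewer than four cells. As a rough sketch, assuming the HtmlAgilityPack NuGet package and C# 7 tuples, the rows could be projected into a list of records. GymRowParser and Parse are hypothetical names, and the HTML in the usage note is made-up sample data, not the real page.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class GymRowParser
{
    // Projects each matching row into a (City, Address, Phone) tuple.
    public static List<(string City, string Address, string Phone)> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before using LINQ.
        var rows = doc.DocumentNode.SelectNodes("//tr[@class='' or @class='gray_bk']");
        if (rows == null)
            return new List<(string City, string Address, string Phone)>();

        return rows
            .Select(tr => tr.Elements("td").Select(td => td.InnerText.Trim()).ToList())
            // Skip rows that do not have the expected four columns.
            .Where(cells => cells.Count >= 4)
            .Select(cells => (City: cells[1], Address: cells[2], Phone: cells[3]))
            .ToList();
    }
}
```

For a sample row such as <tr class="gray_bk"><td>1</td><td>Mobile</td><td>1 Main St</td><td>555-0100</td></tr>, the first tuple's City, Address and Phone would hold the second, third and fourth cell texts.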

Load website using Html-agility-pack using xpath-syntax

I have this method that selects specific HTML and puts it in a list.
It works perfectly when I use an HTML file saved on my computer. But how can I load the content from a website?
This is my method for loading the .html file, which works:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("C:/Users/Jonathan/Desktop/laggen.html");
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}
But I want to load a website instead of a file, like this:
public void TestGetHtml()
{
    var doc = new HtmlDocument();
    doc.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen"); // <--- this is the site I want to load
    var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
    var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
        .Select(td => td.InnerText.Replace("GTIN:", ""));
}

Use HtmlWeb if you want to load HTML over HTTP from a URL:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.dabas.com/mypages/search.aspx?typ=FP&sosokord=laggen");
var xpath = "//table[@id='tableSearchArticle']/tbody/tr/td[4]";
var listOfGtins = doc.DocumentNode.SelectNodes(xpath)
    .Select(td => td.InnerText.Replace("GTIN:", ""));
foreach (string gtin in listOfGtins)
{
    Console.WriteLine(gtin);
}
