I am trying to read the URLs listed in my text file and extract all links from each page using the Html Agility Pack, but I get an error during extraction.
Can you explain why?
Code:
string[] lines = File.ReadAllLines("links");
foreach (string line in lines)
{
    HtmlWeb hw = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc = hw.Load(line.ToString());
    foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
    {
        string hrefValue = link.GetAttributeValue("href", string.Empty);
        if (!hrefValue.ToString().StartsWith("http://") && !hrefValue.ToString().StartsWith("https://"))
            continue;
        if (!crawlListbox.Items.Contains(hrefValue))
        {
            crawlListbox.Items.Add(hrefValue);
        }
    }
}
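The error text is not shown, so this is only a guess: a frequent cause with this pattern is a NullReferenceException, because SelectNodes returns null rather than an empty collection when the XPath matches nothing, and the inner foreach then fails on any page without matching links. Below is a minimal sketch with a null check. The names LinkExtractor and ExtractAbsoluteLinks are made up for illustration, and it assumes the Html Agility Pack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the distinct absolute http/https links in the given HTML.
    public static List<string> ExtractAbsoluteLinks(string html)
    {
        var result = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return result;

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            bool isAbsolute = href.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
                           || href.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
            if (isAbsolute && !result.Contains(href))
                result.Add(href);
        }
        return result;
    }
}
```

The same null check can be applied to a document returned by HtmlWeb.Load before iterating over its links.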
I want to extract the areas marked in the picture.
I think I have managed to pull areas 2 and 3 with this code:
Uri url = new Uri("http://www.milliyet.com.tr/sondakika/");
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
var html = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument dokuman = new HtmlAgilityPack.HtmlDocument();
dokuman.LoadHtml(html);
HtmlNodeCollection basliklar = dokuman.DocumentNode.SelectNodes("//div[contains(@class,'kategoriList3')]//a");
foreach (var baslik in basliklar)
{
    try
    {
        datacıktı.Rows.Add();
        datacıktı.Rows[sayac].Cells[0].Value = baslik.Attributes["href"].Value.ToString();
        datacıktı.Rows[sayac].Cells[1].Value = baslik.InnerText;
        sayac++;
    }
    catch
    {
        continue;
    }
}
This code may help:
Uri url = new Uri("http://www.milliyet.com.tr/sondakika/");
WebClient client = new WebClient();
client.Encoding = System.Text.Encoding.UTF8;
var html = client.DownloadString(url);
HtmlAgilityPack.HtmlDocument dokuman = new HtmlAgilityPack.HtmlDocument();
dokuman.LoadHtml(html);
IEnumerable<HtmlNode> htmlNodes = dokuman.DocumentNode.Descendants("ul").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("sonDK"));
foreach (HtmlNode htmlNode in htmlNodes)
{
    IEnumerable<HtmlNode> liList = htmlNode.Descendants("li").Where(l => (l.Attributes.Contains("class") && l.Attributes["class"].Value.Contains("title")) == false);
    foreach (HtmlNode liNode in liList)
    {
        Console.WriteLine("strong:" + liNode.FirstChild.InnerText + "- link:" + liNode.LastChild.Attributes["href"].Value);
    }
}
I have a long string with some tags inside:
client.Encoding = System.Text.Encoding.GetEncoding(1255);
string page = client.DownloadString("http://rotter.net/scoopscache.html");
StreamWriter w = new StreamWriter(@"d:\rotterhtml\rotterscoops.html");
w.Write(page);
w.Close();
From either the page variable or the saved HTML file, I want to get all the text between these two tags:
<a href="http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=81020&forum=scoops1"><b>test</b>
I want to extract the word test, so that in the end I have all the words between:
<a href="http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=81020&forum=scoops1"><b>
and </b>
EDIT:
This is how I save the HTML file in the constructor:
client.Encoding = System.Text.Encoding.GetEncoding(1255);
string page = client.DownloadString("http://rotter.net/scoopscache.html");
StreamWriter w = new StreamWriter(@"d:\rotterhtml\rotterscoops.html");
w.Write(page);
w.Close();
ExtractText(@"d:\rotterhtml\rotterscoops.html");
private void ExtractText(string filePath)
{
    List<string> text = new List<string>();
    var htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(filePath);
    if (htmlDoc.DocumentNode != null)
    {
        var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
        foreach (var node in nodes)
        {
            //Console.WriteLine(node.InnerText);
            text.Add(node.InnerText);
        }
    }
}
In the text list I don't see Hebrew, only gibberish.
When I open the HTML file on my hard disk, the Hebrew text displays correctly, since I set the encoding in the constructor.
But in the text list it is gibberish again.
You could use an HTML parsing library such as HtmlAgilityPack which would allow you to easily locate the information you are looking for inside the markup:
string filePath = @"d:\rotterhtml\rotterscoops.html";
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(filePath);
// SelectNodes returns null when nothing matches, so check before iterating.
var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
if (nodes != null)
{
    foreach (var node in nodes)
    {
        Console.WriteLine(node.InnerText);
    }
}
In this example I have selected the value of all <b> tags nested inside an <a> tag. You might need to adapt the selector to match your needs:
htmlDoc.DocumentNode.SelectNodes("//a/b");
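On the garbled Hebrew mentioned in the edit: a likely cause, though not confirmed here, is an encoding mismatch between writing and reading the file. new StreamWriter(path) writes UTF-8 without a byte order mark no matter which encoding was used for the download, so the file has to be read back as UTF-8. Below is a hedged sketch that passes the encoding to Load explicitly. HebrewTextSketch and ExtractBoldLinkText are illustrative names, and the Html Agility Pack NuGet package is assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class HebrewTextSketch
{
    // Hypothetical helper: reads the file with an explicit encoding and
    // returns the text of every <b> that is a direct child of an <a>.
    public static List<string> ExtractBoldLinkText(string filePath, Encoding encoding)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;

        // State the encoding instead of relying on the library's default.
        doc.Load(filePath, encoding);

        var result = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a/b");
        if (nodes == null)
            return result;

        foreach (HtmlNode node in nodes)
            result.Add(node.InnerText);
        return result;
    }
}
```

For a file written with new StreamWriter(path), the call would be ExtractBoldLinkText(path, Encoding.UTF8). Alternatively, htmlDoc.LoadHtml(page) parses the already decoded string and avoids the round trip through the file.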
I am reading a .docx file using OpenXML in C#. It reads everything correctly, but strangely the content of a text box is read three times. What could be wrong? Here is the code that reads the .docx:
public static string TextFromWord(String file)
{
    const string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    StringBuilder textBuilder = new StringBuilder();
    using (WordprocessingDocument wdDoc = WordprocessingDocument.Open(file, false))
    {
        // Manage namespaces to perform XPath queries.
        NameTable nt = new NameTable();
        XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
        nsManager.AddNamespace("w", wordmlNamespace);
        // Get the document part from the package.
        // Load the XML in the document part into an XmlDocument instance.
        XmlDocument xdoc = new XmlDocument(nt);
        xdoc.Load(wdDoc.MainDocumentPart.GetStream());
        XmlNodeList paragraphNodes = xdoc.SelectNodes("//w:p", nsManager);
        foreach (XmlNode paragraphNode in paragraphNodes)
        {
            XmlNodeList textNodes = paragraphNode.SelectNodes(".//w:t", nsManager);
            foreach (System.Xml.XmlNode textNode in textNodes)
            {
                textBuilder.Append(textNode.InnerText);
            }
            textBuilder.Append(Environment.NewLine);
        }
    }
    return textBuilder.ToString();
}
The part of the file I am talking about is:
The result is:
I read it in a test application like this:
What's wrong here?
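Without seeing the file, one plausible explanation is nested paragraphs. In WordprocessingML the content of a text box (w:txbxContent) holds its own w:p elements inside the paragraph that anchors the box, and newer files often store the box twice (a drawing plus a fallback). //w:p then matches both the outer and the inner paragraphs, and .//w:t on the outer paragraph also picks up the nested text, so the same text is appended several times. The sketch below uses only System.Xml on a hand-written fragment (not a real .docx) and keeps only those w:t nodes whose nearest w:p ancestor is the paragraph being processed. NestedParagraphSketch and its members are illustrative names.

```csharp
using System;
using System.Text;
using System.Xml;

public static class NestedParagraphSketch
{
    private const string W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Walks up from a node to the closest enclosing w:p element, or null if there is none.
    private static XmlNode NearestParagraph(XmlNode node)
    {
        for (XmlNode n = node.ParentNode; n != null; n = n.ParentNode)
        {
            if (n.LocalName == "p" && n.NamespaceURI == W)
                return n;
        }
        return null;
    }

    // Appends each w:t exactly once, under the paragraph that directly owns it.
    public static string ExtractText(XmlDocument xdoc)
    {
        var ns = new XmlNamespaceManager(xdoc.NameTable);
        ns.AddNamespace("w", W);

        var sb = new StringBuilder();
        foreach (XmlNode p in xdoc.SelectNodes("//w:p", ns))
        {
            foreach (XmlNode t in p.SelectNodes(".//w:t", ns))
            {
                // Skip text that belongs to a nested paragraph (e.g. inside a text box).
                if (ReferenceEquals(NearestParagraph(t), p))
                    sb.Append(t.InnerText);
            }
            sb.Append('\n');
        }
        return sb.ToString();
    }

    // Hand-written stand-in for a document body: one paragraph with a nested text box paragraph.
    public static XmlDocument SampleDocument()
    {
        var xdoc = new XmlDocument();
        xdoc.LoadXml(
            "<w:document xmlns:w='" + W + "'><w:body>" +
            "<w:p><w:r><w:t>Body</w:t></w:r>" +
            "<w:r><w:txbxContent><w:p><w:r><w:t>Box</w:t></w:r></w:p></w:txbxContent></w:r>" +
            "</w:p></w:body></w:document>");
        return xdoc;
    }
}
```

With the original loop, "Box" in this fragment would be appended twice (once via the outer paragraph, once via the inner one). If the file also stores a fallback copy of the text box, that copy would still need to be excluded separately.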
I am trying to parse a webpage. But it is giving an error. Please help me. Thanks.
Here's the code:
static void myMain()
{
    using (var client = new WebClient())
    {
        string data = client.DownloadString("http://www.google.com");
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(data);
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        foreach (HtmlNode link in nodes)
        {
            HtmlAttribute att = link.Attributes["href"];
            Console.WriteLine(att.Value);
        }
    }
}
It gives the error: The type 'System.Windows.Forms.HtmlDocument' has no constructors defined. I have included HAP (Html Agility Pack).
Thanks
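Change
HtmlDocument doc = new HtmlDocument();
to
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
because the unqualified name HtmlDocument resolves to System.Windows.Forms.HtmlDocument here, which has no public constructor and is not the class you want to work with.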
I am trying to parse the following data from an HTML document using the Html Agility Pack:
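<a href="http://abilene.craigslist.org/">abilene</a> <br>
<a href="http://albany.craigslist.org/"><b>albany</b></a> <br>
<a href="http://amarillo.craigslist.org/">amarillo</a> <br>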
...
I would like to parse out the URL and the name of the city into two separate files.
Example:
urls.txt
"http://abilene.craigslist.org/"
"http://albany.craigslist.org/"
"http://amarillo.craigslist.org/"
cities.txt
abilene
albany
amarillo
Here is what I have so far:
public void ParseHtml()
{
    //Clear text box
    textBox1.Clear();
    //managed wrapper around the HTML Document Object Model (DOM).
    HtmlAgilityPack.HtmlDocument hDoc = new HtmlAgilityPack.HtmlDocument();
    //Load file
    hDoc.Load(@"c:\AllCities.html");
    try
    {
        //Execute the input XPath query from text box
        foreach (HtmlNode hNode in hDoc.DocumentNode.SelectNodes(xpathText.Text))
        {
            textBox1.Text += hNode.InnerHtml + "\r\n";
        }
    }
    catch (NullReferenceException nre)
    {
        textBox1.Text += "Can't process XPath query, modify it and try again.";
    }
}
Any help would be greatly appreciated! Thanks guys!
I take it you want to parse them from craigslist.org?
Here's how I'd do it.
List<string> links = new List<string>();
List<string> names = new List<string>();
HtmlDocument doc = new HtmlDocument();
//Load the Html
doc.Load(new WebClient().OpenRead("http://geo.craigslist.org/iso/us"));
//Get all links in the div with the ID 'list' that have an href attribute
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@id='list']/a[@href]");
//or if you have only the links already saved somewhere
//HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
    foreach (HtmlNode link in linkNodes)
    {
        links.Add(link.GetAttributeValue("href", ""));
        names.Add(link.InnerText); //Get the InnerText so you don't get any HTML tags
    }
}
//Write both lists to a file
File.WriteAllText("urls.txt", string.Join(Environment.NewLine, links.ToArray()));
File.WriteAllText("cities.txt", string.Join(Environment.NewLine, names.ToArray()));