HtmlAgilityPack Parsing - C#

I am trying to parse the following data from an HTML document using HtmlAgilityPack:
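<a href="http://abilene.craigslist.org/">abilene</a> <br>
<a href="http://albany.craigslist.org/"><b>albany</b></a> <br>
<a href="http://amarillo.craigslist.org/">amarillo</a> <br>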
...
I would like to parse out the URL and the name of each city into two separate files.
Example:
urls.txt
"http://abilene.craigslist.org/"
"http://albany.craigslist.org/"
"http://amarillo.craigslist.org/"
cities.txt
abilene
albany
amarillo
Here is what I have so far:
public void ParseHtml()
{
//Clear text box
textBox1.Clear();
//managed wrapper around the HTML Document Object Model (DOM).
HtmlAgilityPack.HtmlDocument hDoc = new HtmlAgilityPack.HtmlDocument();
//Load file
hDoc.Load(@"c:\AllCities.html");
try
{
//Execute the input XPath query from text box
foreach (HtmlNode hNode in hDoc.DocumentNode.SelectNodes(xpathText.Text))
{
textBox1.Text += hNode.InnerHtml + "\r\n";
}
}
catch (NullReferenceException nre)
{
textBox1.Text += "Can't process XPath query, modify it and try again.";
}
}
Any help would be greatly appreciated! Thanks guys!

I take it you want to parse them from craigslist.org?
Here's how I'd do it:
List<string> links = new List<string>();
List<string> names = new List<string>();
HtmlDocument doc = new HtmlDocument();
//Load the Html
doc.Load(new WebClient().OpenRead("http://geo.craigslist.org/iso/us"));
//Get all Links in the div with the ID = 'list' that have an href-Attribute
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//div[@id='list']/a[@href]");
//or if you have only the links already saved somewhere
//HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (linkNodes != null)
{
foreach (HtmlNode link in linkNodes)
{
links.Add(link.GetAttributeValue("href", ""));
names.Add(link.InnerText);//Get the InnerText so you don't get any Html-Tags
}
}
//Write both lists to a File
File.WriteAllText("urls.txt", string.Join(Environment.NewLine, links.ToArray()));
File.WriteAllText("cities.txt", string.Join(Environment.NewLine, names.ToArray()));

Related

"The underlying connection was closed" when using Html Agility Pack

I am trying to load every URL listed in my text file and extract the links from each page using the Html Agility Pack, but during extraction I get the error "The underlying connection was closed".
Can you explain why?
Code:
string[] lines = File.ReadAllLines("links");
foreach (string line in lines)
{
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc = hw.Load(line.ToString());
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
string hrefValue = link.GetAttributeValue("href", string.Empty);
if (!hrefValue.ToString().StartsWith("http://") && !hrefValue.ToString().StartsWith("https://"))
continue;
if (!crawlListbox.Items.Contains(hrefValue))
{
crawlListbox.Items.Add(hrefValue);
}
}
}

C# OpenXML How to Replace \r\n with Break()?

I have a text field in my database that contains text with many lines.
When I generate an MS Word document using OpenXML and bookmarks, the text becomes one single line.
I've noticed that at each new line the bookmark value shows the characters "\r\n".
Looking for a solution, I found some answers that helped, but I still have a problem.
I used the run.Append(new Break()); solution, but the replaced text shows the name of the bookmark as well.
For example:
bookmark test = "Big text here in first paragraph\r\nSecond paragraph".
It is shown in MS Word document like:
testBig text here in first paragraph
Second paragraph
Can anyone please help me eliminate the bookmark name?
Here is my code:
public void UpdateBookmarksVistoria(string originalPath, string copyPath, string fileType)
{
string wordmlNamespace = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
// Make a copy of the template file.
File.Copy(originalPath, copyPath, true);
//Open the document as an Open XML package and extract the main document part.
using (WordprocessingDocument wordPackage = WordprocessingDocument.Open(copyPath, true))
{
MainDocumentPart part = wordPackage.MainDocumentPart;
//Setup the namespace manager so you can perform XPath queries
//to search for bookmarks in the part.
NameTable nt = new NameTable();
XmlNamespaceManager nsManager = new XmlNamespaceManager(nt);
nsManager.AddNamespace("w", wordmlNamespace);
//Load the part's XML into an XmlDocument instance.
XmlDocument xmlDoc = new XmlDocument(nt);
xmlDoc.Load(part.GetStream());
//get the URL used to display the photos
string url = HttpContext.Current.Request.Url.ToString();
string enderecoURL;
if (url.Contains("localhost"))
enderecoURL = url.Substring(0, 26);
else if (url.Contains("www."))
enderecoURL = url.Substring(0, 24);
else
enderecoURL = url.Substring(0, 20);
//Iterate through the bookmarks.
int cont = 56;
foreach (KeyValuePair<string, string> bookmark in bookmarks)
{
var res = from bm in part.Document.Body.Descendants<BookmarkStart>()
where bm.Name == bookmark.Key
select bm;
var bk = res.SingleOrDefault();
if (bk != null)
{
Run bookmarkText = bk.NextSibling<Run>();
if (bookmarkText != null) // if the bookmark has text replace it
{
var texts = bookmark.Value.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
for (int i = 0; i < texts.Length; i++)
{
if (i > 0)
bookmarkText.Append(new Break());
Text text = new Text();
text.Text = texts[i];
bookmarkText.Append(text); //HERE IS MY PROBLEM
}
}
else // otherwise append new text immediately after it
{
var parent = bk.Parent; // bookmark's parent element
Text text = new Text(bookmark.Value);
Run run = new Run(new RunProperties());
run.Append(text);
// insert after bookmark parent
parent.Append(run);
}
bk.Remove(); // we don't want the bookmark anymore
}
}
//Write the changes back to the document part.
xmlDoc.Save(wordPackage.MainDocumentPart.GetStream(FileMode.Create));
wordPackage.Close();
}
}

How do i parse a text between two tags?

I have a long string with some tags inside:
client.Encoding = System.Text.Encoding.GetEncoding(1255);
string page = client.DownloadString("http://rotter.net/scoopscache.html");
StreamWriter w = new StreamWriter(@"d:\rotterhtml\rotterscoops.html");
w.Write(page);
w.Close();
From either the page variable or the HTML file, I want to get all the text between two tags, as in:
<a href="http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=81020&forum=scoops1"><b>test</b>
Here I want to parse out the word test. So in the end I will have all the words between:
<a href="http://rotter.net/cgi-bin/forum/dcboard.cgi?az=read_count&om=81020&forum=scoops1"><b>
and </b>
EDIT:
This is how I save the HTML file in the constructor:
client.Encoding = System.Text.Encoding.GetEncoding(1255);
string page = client.DownloadString("http://rotter.net/scoopscache.html");
StreamWriter w = new StreamWriter(@"d:\rotterhtml\rotterscoops.html");
w.Write(page);
w.Close();
ExtractText(@"d:\rotterhtml\rotterscoops.html");
private void ExtractText(string filePath)
{
List<string> text = new List<string>();
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(filePath);
if (htmlDoc.DocumentNode != null)
{
var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
foreach (var node in nodes)
{
//Console.WriteLine(node.InnerText);
text.Add(node.InnerText);
}
}
}
In the text list I don't see Hebrew, only gibberish.
In the HTML file on my hard disk I do see Hebrew, since I set the encoding in the constructor.
But in the text list it shows up as gibberish again.

You could use an HTML parsing library such as HtmlAgilityPack which would allow you to easily locate the information you are looking for inside the markup:
string filePath = @"d:\rotterhtml\rotterscoops.html";
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(filePath);
if (htmlDoc.DocumentNode != null)
{
var nodes = htmlDoc.DocumentNode.SelectNodes("//a/b");
if (nodes != null) // SelectNodes returns null when nothing matches
{
foreach (var node in nodes)
{
Console.WriteLine(node.InnerText);
}
}
}
In this example I have selected the value of all <b> tags nested inside an <a> tag. You might need to adapt the selector to match your needs:
htmlDoc.DocumentNode.SelectNodes("//a/b");
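Regarding the gibberish described in the EDIT: one likely cause (an assumption, not something confirmed in the question) is an encoding mismatch. new StreamWriter(path) writes UTF-8 by default, so the file on disk is no longer in code page 1255, and a reader that decodes it with a different default encoding will produce garbage. A minimal sketch of the usual remedy is to write and read with the same explicit encoding. The class and method names below are hypothetical.

```csharp
using System.IO;
using System.Text;

public static class EncodingRoundTrip
{
    // Writes the text with an explicit encoding so the bytes on disk are known.
    public static void Save(string path, string text, Encoding encoding)
    {
        File.WriteAllText(path, text, encoding);
    }

    // Reads the file back with the same encoding it was written with.
    public static string LoadText(string path, Encoding encoding)
    {
        return File.ReadAllText(path, encoding);
    }
}
```

The string returned by LoadText can then be handed to htmlDoc.LoadHtml(...) instead of calling htmlDoc.Load(filePath). Since the page variable already holds the correctly decoded text, passing it straight to LoadHtml avoids the round trip through the file entirely.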

how to remove the error type has no constructors defined

I am trying to parse a webpage. But it is giving an error. Please help me. Thanks.
Here's the code:
static void myMain()
{
using (var client = new WebClient())
{
string data = client.DownloadString("http://www.google.com");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(data);
var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (HtmlNode link in nodes)
{
HtmlAttribute att = link.Attributes["href"];
Console.WriteLine(att.Value);
}
}
}
It gives the error: The type 'System.Windows.Forms.HtmlDocument' has no constructors defined. I have included HAP.
Thanks

Change
HtmlDocument doc = new HtmlDocument();
to
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
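Because you don't want to work with System.Windows.Forms.HtmlDocument; the unqualified name HtmlDocument is resolving to that type instead of the HtmlAgilityPack one.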

best way to find end of body tag in html

I'm writing a program to add some code to HTML files.
I was going to use a series of IndexOf calls and loops to find what is essentially "<body ...>"X
(where X is the spot I'm looking for).
It occurred to me that there might be a more elegant way of doing this.
Does anyone have any suggestions?
What it looks like currently:
<body onLoad="JavaScript:top.document.title='Abraham L Barbrow'; if (self == parent) document.getElementById('divFrameset').style.display='block';">
What it should look like when I'm done:
<body onLoad="JavaScript:top.document.title='Abraham L Barbrow'; if (self == parent) document.getElementById('divFrameset').style.display='block';">
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
try {
var pageTracker = _gat._getTracker("UA-9xxxxxx-1");
pageTracker._trackPageview();
} catch(err) {}</script>

I'm not sure I'm understanding you, but do you mean this?
// Given an HTML document in "htmlDocument", and new content in "newContent"
string newHtmlDocument = htmlDocument.Replace("</body>", newContent+"</body>");
And it's probably obvious I don't know C#... You'd probably want to match the "body" tag case-insensitively, for example with a regular expression.
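As a rough illustration of the case-insensitive variant, here is a minimal sketch that avoids regular expressions and uses an ordinal, case-insensitive search instead. The class and method names are hypothetical, and it assumes the closing tag is written without inner whitespace (it will not match "</body >").

```csharp
using System;

public static class BodyTagInserter
{
    // Inserts newContent immediately before the last closing body tag,
    // matching the tag case-insensitively. Returns the input unchanged
    // when no closing body tag is found.
    public static string InsertBeforeBodyEnd(string html, string newContent)
    {
        int index = html.LastIndexOf("</body>", StringComparison.OrdinalIgnoreCase);
        if (index < 0)
        {
            return html;
        }
        return html.Insert(index, newContent);
    }
}
```

Unlike string.Replace, this touches only one occurrence and leaves the original casing of the tag intact.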

I would recommend using HtmlAgilityPack to parse the HTML into a DOM and working with that.

public string AddImageLink(string emailBody, string imagePath)
{
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(emailBody);
// get body using xpath query ("//body")
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body");
if (body == null)
return emailBody; // no <body> element to append to
// create the link node
HtmlNode linkNode = doc.CreateElement("a");
linkNode.SetAttributeValue("href", "www.splash-solutions.co.uk");
// create the image node and nest it inside the link
HtmlNode imgNode = doc.CreateElement("img");
imgNode.SetAttributeValue("src", imagePath);
linkNode.AppendChild(imgNode);
// append the link as the last child of the body, just before </body>
body.AppendChild(linkNode);
StringWriter writer = new StringWriter();
doc.Save(writer);
return writer.ToString();
}

If the HTML files are valid XHTML you could always use the XmlDocument class to interpret it. You could then easily look for the body element and append a child element to it. This would place the element right before the closing </body> tag.
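A minimal sketch of that approach, assuming the input really is well-formed XML. The class and method names are hypothetical. The resolver is set to null so that a DOCTYPE does not trigger a download of the external DTD, which also means named entities such as &nbsp; would not resolve.

```csharp
using System;
using System.Xml;

public static class XhtmlBodyAppender
{
    // Appends a <script> element as the last child of <body> and
    // returns the serialized document.
    public static string AppendScript(string xhtml, string scriptText)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null;        // do not fetch external DTDs
        doc.PreserveWhitespace = true; // keep the original layout
        doc.LoadXml(xhtml);

        // GetElementsByTagName matches on the qualified name, so this also
        // finds <body> when the document declares a default XHTML namespace.
        XmlNodeList bodies = doc.GetElementsByTagName("body");
        if (bodies.Count == 0)
        {
            throw new InvalidOperationException("No <body> element found.");
        }
        XmlNode body = bodies[0];

        // Create the element in the same namespace as <body>.
        XmlElement script = doc.CreateElement("script", body.NamespaceURI);
        script.SetAttribute("type", "text/javascript");
        script.AppendChild(doc.CreateTextNode(scriptText));
        body.AppendChild(script);

        return doc.OuterXml;
    }
}
```

Because the script body is added as a text node, characters such as < and & are escaped on output, which is what XHTML expects but differs from how plain HTML treats script content.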

You might want to look at using the Html Agility Pack
http://www.codeplex.com/htmlagilitypack

I'm not sure whether the example content you want to add after the <body> tag is the right one, but if it is, I see two problems:
1. The Google Analytics code should be added just before the closing </body> tag, not right after the opening tag. That way the page doesn't have to wait for it to load before loading your own code.
2. If you're adding some other JavaScript, why not put it in an external file and execute that on load instead?
Hope that's of some help :)

This is what I ended up with.
Feel free to make suggestions.
private void button1_Click(object sender, EventArgs e)
{
OpenFileDialog OFD = new OpenFileDialog();
OFD.Multiselect = true;
OFD.Filter = "HTML Files (*.htm*)|*.HTM*|" +
"All files (*.*)|*.*";
if (OFD.ShowDialog() == DialogResult.OK)
{
foreach (string s in OFD.FileNames)
{
Console.WriteLine(s);
AddAnalytics(s);
}
MessageBox.Show("done!");
}
}
private void AddAnalytics(string filename)
{
string Htmlcode = "";
using (StreamReader sr = new StreamReader(filename))
{
Htmlcode = sr.ReadToEnd();
}
if (!Htmlcode.Contains(textBox1.Text))
{
Htmlcode = Htmlcode.Replace("</body>", CreateCode(textBox1.Text) + "</body>");
using (StreamWriter sw = new StreamWriter(filename))
{
sw.Write(Htmlcode);
}
}
}
private string CreateCode(string Number)
{
StringBuilder sb = new StringBuilder();
sb.AppendLine();
sb.AppendLine("<script type=\"text/javascript\">");
sb.AppendLine("var gaJsHost = ((\"https:\" == document.location.protocol) ? \"https://ssl.\" : \"http://www.\");");
sb.AppendLine("document.write(unescape(\"%3Cscript src='\" + gaJsHost + \"google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E\"));");
sb.AppendLine("</script>");
sb.AppendLine("<script type=\"text/javascript\">");
sb.AppendLine("try {");
sb.AppendLine(string.Format("var pageTracker = _gat._getTracker(\"{0}\");", Number)); // e.g. "UA-9909000-1"
sb.AppendLine("pageTracker._trackPageview();");
sb.AppendLine("} catch(err) {}</script>");
sb.AppendLine();
return sb.ToString();
}
