Working with HtmlAgilityPack - C#

I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // Initialize HtmlAgilityPack's functions.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Index the whole DB.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Get every url.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
#region Get all jars
if (completeUrl.Contains(".jar")) // Check if the url contains .jar
{
fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); // Get the filename without extension.
Console.WriteLine(fileName);
}
#endregion
#region Get all Authors
if (completeUrl.Contains("?author=")) // Check if the url contains ?author=
{
authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the author name.
Console.WriteLine(authorName);
}
#endregion
}
I am trying to get all the filenames and authors printed next to each other, but the output appears in seemingly random order. Why?
Can someone help me with this? Thanks!

If you look at the HTML, it is unfortunately not well-formed. There are a lot of unclosed tags, and HAP does not structure the document the way a browser does; it ends up interpreting most of the document as deeply nested. So you can't simply iterate through the rows of the table the way you would in a browser; it gets a lot more complicated than that.
When dealing with such documents, you have to change your queries quite a bit: rather than searching through child elements, search through descendants, adjusting for the skewed structure.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);
// select the rows in the table
var xpath = "//div[@class='content']/div[@class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);
// unfortunately the `tr` tags are not closed so HAP interprets
// this table having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
.Skip(1); // skip header row
var query =
from row in rows
// there may be a row with an embedded ad
where row.SelectSingleNode("td/script") == null
// each row has 6 columns so we need to grab the next 6 descendants
let columns = row.Descendants("td").Take(6).ToList()
let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
select new
{
Title = titleText ?? "",
Author = authorText ?? "",
FileName = Path.GetFileName(downloadLink ?? ""),
};
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}

Related

How to Read and Write to an XML file using XDocument and XElements? [duplicate]

I'm using LINQ together with XDocument to read a XML File. This is the code:
XDocument xml = XDocument.Load(filename);
var q = from b in xml.Descendants("product")
select new
{
name = b.Element("name").Value,
price = b.Element("price").Value,
extra = b.Element("extra1").Value,
deeplink = b.Element("deepLink").Value
};
Now the problem is that the extra1 field is not always present; there are items in the XML file without that node. When that happens, the code crashes with a NullReferenceException.
Is there any way to add a null check so I can prevent it from crashing?
Use a (string) cast instead of .Value:
var q = from b in xml.Descendants("product")
select new
{
name = (string)b.Element("name"),
price = (double?)b.Element("price"),
extra = (string)b.Element("extra1"),
deeplink = (string)b.Element("deepLink")
};
This also works with other data types, including many nullable types for cases where the element is not always present.
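As a minimal, self-contained sketch of that behavior (the product data below is made up for illustration), casting a missing element yields null instead of throwing:

```csharp
using System;
using System.Xml.Linq;

class CastDemo
{
    static void Main()
    {
        // The second product deliberately has no <extra1> element.
        var xml = XDocument.Parse(
            "<products>" +
            "  <product><name>A</name><price>9.99</price><extra1>x</extra1></product>" +
            "  <product><name>B</name><price>19.99</price></product>" +
            "</products>");

        foreach (var p in xml.Descendants("product"))
        {
            // (string) on a missing element returns null; .Value would throw.
            string extra = (string)p.Element("extra1");
            // Nullable value types behave the same way.
            double? price = (double?)p.Element("price");
            Console.WriteLine("{0}: extra={1}, price={2}",
                (string)p.Element("name"), extra ?? "<none>", price);
        }
    }
}
```

Note that `(double?)` would likewise return null for a missing `<price>` element, so none of these lines can throw a NullReferenceException.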
You can use the null-coalescing operator:
var q = from b in xml.Descendants("product")
select new
{
name = (string)b.Element("name") ?? "Default Name",
price = (double?)b.Element("price") ?? 0.0,
extra = (string)b.Element("extra1") ?? String.Empty,
deeplink = (string)b.Element("deepLink") ?? String.Empty
};
This way, you have full control over the default value used when the element is missing.
You can also check that the element exists before using it:
if (b.Elements("extra1").Any())
{
extra = b.Element("extra1").Value;
}
Here is a sample that reads an XML file using XDocument.
XDocument objBooksXML = XDocument.Load(Server.MapPath("books.xml"));
var objBooks = from book in
objBooksXML.Descendants("Book")
select new {
Title = book.Element("Title").Value,
Pages = book.Element("Pages").Value
};
Response.Write(String.Format("Total {0} books.", objBooks.Count()));
gvBooks.DataSource = objBooks;
gvBooks.DataBind();

Get the titles and URLs of Yahoo result page in c#

I want to get titles and URLs of Yahoo result page with htmlagility pack
HtmlWeb w = new HtmlWeb();
string SearchResults = "https://en-maktoob.search.yahoo.com/search?p=" + query.querytxt;
var hd = w.Load(SearchResults);
var nodes = hd.DocumentNode.SelectNodes("//a[@cite and @href]");
if (nodes != null)
{
foreach (var node in nodes)
{
string Text = node.Attributes["title"].Value;
string Href = node.Attributes["href"].Value;
}
}
It works, but not all of the links in the results are appropriate ones. How do I omit ad links, Yahoo's own links, and so on, so that I only get the actual result links?
What about this:
HtmlWeb w = new HtmlWeb();
string search = "https://en-maktoob.search.yahoo.com/search?p=veverke";
//ac-algo ac-21th lh-15
var hd = w.Load(search);
var titles = hd.DocumentNode.CssSelect(".title a").Select(n => n.InnerText);
var links = hd.DocumentNode.CssSelect(".fz-15px.fw-m.fc-12th.wr-bw.lh-15").Select(n => n.InnerText);
for (int i = 0; i < titles.Count() - 1; i++)
{
var title = titles.ElementAt(i);
string link = string.Empty;
if (links.Count() > i)
link = links.ElementAt(i);
Console.WriteLine("Title: {0}, Link: {1}", title, link);
}
Keep in mind that I am using the extension method CssSelect from the ScrapySharp NuGet package. Install it just like you installed HtmlAgilityPack, add using ScrapySharp.Extensions; at the top of the code, and you are good to go. (I use it because it's easier to refer to CSS selectors than to XPath expressions.)
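If you'd rather stay with plain XPath than add ScrapySharp, CSS selectors translate roughly as shown below. Since a class attribute holds all of an element's classes in one string, multi-class CSS selectors become contains() checks. This is a toy sketch against a made-up, well-formed fragment using System.Xml.XPath; a real page still needs an HTML parser such as HtmlAgilityPack, but the XPath expressions carry over:

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

class CssToXPathDemo
{
    static void Main()
    {
        // Hypothetical stand-in for one search result entry.
        var doc = XDocument.Parse(
            "<li>" +
            "  <h3 class='title'><a href='http://a.example'>First hit</a></h3>" +
            "  <span class='fz-15px fw-m fc-12th wr-bw lh-15'>a.example</span>" +
            "</li>");

        // CSS ".title a" corresponds to XPath "//*[@class='title']/a"
        var title = doc.XPathSelectElement("//*[@class='title']/a");

        // CSS ".fz-15px.lh-15" needs contains() checks on the class attribute,
        // because the attribute value is the whole space-separated class list.
        var link = doc.XPathSelectElement(
            "//*[contains(@class,'fz-15px') and contains(@class,'lh-15')]");

        Console.WriteLine(title.Value); // First hit
        Console.WriteLine(link.Value);  // a.example
    }
}
```

One caveat: contains(@class,'fz-15px') would also match a hypothetical class like fz-15pxx; for exact token matching you would test concat(' ', normalize-space(@class), ' ') against ' fz-15px '.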
Regarding skipping ads: I noticed that ads in these Yahoo search results only appear as the last record. Assuming that holds, simply skip the last one (which is why the loop above stops before the final title).

HTMLAgilityPack selects nodes from first iteration through divs

I'm trying to use HtmlAgilityPack to parse a website for the first time. Everything works as expected, but only for the first iteration: each iteration returns a unique div with its data, yet SelectNodes() always pulls the data from the first iteration.
The code listed below demonstrates the problem.
All the properties of each station get values from the first iteration.
static void Main(string[] args)
{
List<Station> stations = new List<Station>();
wClient = new WebClient();
wClient.Proxy = null;
wClient.Encoding = encode;
for (int i = 1; i <= 1; i++)
{
HtmlDocument html = new HtmlDocument();
string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
html.LoadHtml(wClient.DownloadString(link));
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']").First().ChildNodes.Where(x => x.Name == "div").ToList(); //get list of nodes with PowerStation data
foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
{
Station st = new Station();
st.Name = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].InnerText; //gets name from first iteration
st.Url = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].Attributes["href"].Value; //gets url from first iteration and so on
st.Company = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;
stations.Add(st);
}
}
}
Maybe I am not getting some of the essentials of OOP?
Your code can be greatly simplified by using the full power of XPath.
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']/div");
// the XPath expression may also be written as "//div[@class='items'][1]/div",
// where [1] means first node
foreach (var item in stationList)
{
Station st = new Station();
st.Name = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").InnerText;
st.Url = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").Attributes["href"].Value;
string rawText = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
stations.Add(st);
}
Your mistake was using the XPath descendant axis (//div): an expression that starts with // always searches from the document root, not from the current node, so every iteration matched the same first element.
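The same trap can be demonstrated without HtmlAgilityPack at all, since any XPath implementation treats a // expression as absolute. A tiny sketch using System.Xml.XPath on a made-up two-row document:

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

class AxisDemo
{
    static void Main()
    {
        // Toy document with two "rows"; the real page is HTML,
        // but the XPath axis rules are identical.
        var doc = XDocument.Parse(
            "<items><row><name>first</name></row><row><name>second</name></row></items>");

        foreach (var row in doc.XPathSelectElements("//row"))
        {
            // "//name" restarts from the DOCUMENT root, so every
            // iteration finds "first":
            Console.WriteLine(row.XPathSelectElement("//name").Value);

            // "name" (or ".//name") is relative to the current row:
            Console.WriteLine(row.XPathSelectElement("name").Value);
        }
    }
}
```

The loop prints first/first for the first row and first/second for the second row, which is exactly the bug in the question: the absolute queries kept returning data from the first station.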
Even better rewrite code like this:
var divName = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']");
var nodeA = divName.SelectSingleNode("a");
st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;
string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
This article contains some good examples on various aspects of Html Agility Pack.
Have a look at this article; it will give you a quick start.

Trying to compare tags in mongodb string array, with textbox input c#

Okay, so I have a MongoDB database with a collection called Videos, and each video has a field called Tags. What I want to do is compare a textbox input with the tags on all videos in the collection and return the videos to a GridView if a tag matches the input from the textbox. When I create a new video, the Tags field is a string array, so it is possible to store more than one tag. I am trying to do this in C#. Hope some of you can help, thanks!
Code for creating a new video document.
#region Database Connection
var client = new MongoClient();
var server = client.GetServer();
var db = server.GetDatabase("Database");
#endregion
var videos = db.GetCollection<Video>("Videos");
var name = txtVideoName.Text;
var location = txtVideoLocation.Text;
var description = txtVideoDescription.Text;
var user = txtVideoUserName.Text;
string[] lst = txtVideoTags.Text.Split(new char[] { ',' });
var index = videos.Count();
var id = 0;
if (id <= index)
{
id += (int)index;
}
videos.CreateIndex(IndexKeys.Ascending("Tags"), IndexOptions.SetUnique(false));
var newVideo = new Video(id, name, location, description, lst, user);
videos.Insert(newVideo);
Okay, so here is what the search method looks like; I have just made the syntax a little different from what Grant Winney answered.
var videos = db.GetCollection<Video>("Videos");
string[] txtInput = txtSearchTags.Text.Split(new char[] { ',' });
var query = (from x in videos.AsQueryable<Video>()
where x.Tags.ContainsAny(txtInput)
select x);
This finds all videos with tags that contain a tag specified in the TextBox, assuming the MongoDB driver can properly translate it into a valid query.
var videos = db.GetCollection<Video>("Videos")
.AsQueryable()
.Where(v => v.Tags.ContainsAny(txtVideoTags.Text.Split(',')))
.ToList();
Make sure you've got using MongoDB.Driver.Linq; at the top.
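ContainsAny comes from the legacy driver's LINQ helpers, and whether it translates into a server-side query depends on the driver version. The matching semantics themselves are simply "any tag in common", which you can sanity-check in memory with plain LINQ; the Video class and data below are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Video
{
    public string Name;
    public string[] Tags;
}

class TagSearchDemo
{
    static void Main()
    {
        // In-memory stand-in for the Videos collection.
        var videos = new List<Video>
        {
            new Video { Name = "Intro", Tags = new[] { "csharp", "mongodb" } },
            new Video { Name = "Cats",  Tags = new[] { "funny" } },
        };

        // Same semantics as ContainsAny: keep a video if any of its
        // tags appears in the comma-separated search input.
        string[] searchTags = "mongodb,linq".Split(',');
        var matches = videos.Where(v => v.Tags.Intersect(searchTags).Any());

        foreach (var v in matches)
            Console.WriteLine(v.Name); // prints "Intro"
    }
}
```

If the driver version you use cannot translate ContainsAny, an Intersect/Any predicate like this still works after materializing the collection, at the cost of filtering client-side.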

No hits when searching for "mvc2" with lucene.net

I am indexing and searching with Lucene.Net. The only problem I am having is that no hits are found when searching for "mvc2" (it seems to work with all the other words I search). I have tried a different analyzer (see the comments by the analyzer) and older Lucene code. Here is my index and search code; I would really appreciate it if someone could show me where I am going wrong. Thanks.
////Indexing code
public void DoIndexing(string CvContent)
{
//state the file location of the index
const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";
//if directory does not exist, create it, and create new index for it.
//if directory does exist, do not create the directory and do not create a new index (append to the previous index).
bool createNewDirectory; //to pass into Lucene's GetDirectory
bool createNewIndex; //to pass into Lucene's IndexWriter
if (!Directory.Exists(indexFileLocation))
{
createNewDirectory = true;
createNewIndex = true;
}
else
{
createNewDirectory = false;
createNewIndex = false;
}
Lucene.Net.Store.Directory dir =
Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, createNewDirectory); //creates if true
//create an analyzer to process the text
Lucene.Net.Analysis.Analyzer analyzer = new
Lucene.Net.Analysis.SimpleAnalyzer(); //this analyzer gets all hits except mvc2
//Lucene.Net.Analysis.Standard.StandardAnalyzer(); //this leaves out sql once and mvc2 once
//create the index writer with the directory and analyzer defined.
Lucene.Net.Index.IndexWriter indexWriter = new
Lucene.Net.Index.IndexWriter(dir, analyzer,
/*true to create a new index*/ createNewIndex);
//create a document, add in a single field
Lucene.Net.Documents.Document doc = new
Lucene.Net.Documents.Document();
Lucene.Net.Documents.Field fldContent =
new Lucene.Net.Documents.Field("content",
CvContent,//"This is some text to search by indexing",
Lucene.Net.Documents.Field.Store.YES,
Lucene.Net.Documents.Field.Index.ANALYZED,
Lucene.Net.Documents.Field.TermVector.YES);
doc.Add(fldContent);
//write the document to the index
indexWriter.AddDocument(doc);
//optimize and close the writer
indexWriter.Optimize();
indexWriter.Close();
}
////search code
private void button2_Click(object sender, EventArgs e)
{
string SearchString = textBox1.Text;
///after creating an index, search
//state the file location of the index
const string indexFileLocation = @"C:\RecruitmentIndexer\IndexedCVs";
Lucene.Net.Store.Directory dir =
Lucene.Net.Store.FSDirectory.GetDirectory(indexFileLocation, false);
//create an index searcher that will perform the search
Lucene.Net.Search.IndexSearcher searcher = new
Lucene.Net.Search.IndexSearcher(dir);
SearchString = SearchString.Trim();
SearchString = QueryParser.Escape(SearchString);
//build a query object
Lucene.Net.Index.Term searchTerm =
new Lucene.Net.Index.Term("content", SearchString);
Lucene.Net.Search.Query query = new Lucene.Net.Search.TermQuery(searchTerm);
//execute the query
Lucene.Net.Search.Hits hits = searcher.Search(query);
label1.Text = hits.Length().ToString();
//iterate over the results.
for (int i = 0; i < hits.Length(); i++)
{
Lucene.Net.Documents.Document docMatch = hits.Doc(i);
MessageBox.Show(docMatch.Get("content"));
}
}
I believe that StandardAnalyzer actually strips the "2" out of "mvc2", so the indexed word is only "mvc". I'm not sure about SimpleAnalyzer, though. You could try WhitespaceAnalyzer, which I believe doesn't strip out numbers.
You should also process your search input the same way you process the text at indexing time. A TermQuery is an exact match: if you search for "mvc2" while the string in your index is always "mvc", you won't get a hit.
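The difference is easy to demonstrate without Lucene. SimpleAnalyzer is built on a letter-only tokenizer, so digits act as separators, while a whitespace tokenizer keeps "mvc2" intact. Here's a plain C# sketch of the two tokenization behaviors (the CV text is made up):

```csharp
using System;
using System.Linq;

class TokenizeDemo
{
    static void Main()
    {
        string text = "experience with asp.net mvc2 and sql";

        // LetterTokenizer-style: anything that is not a letter is a
        // separator, so "mvc2" is indexed as just "mvc".
        var letterTokens = new string(
                text.Select(c => char.IsLetter(c) ? c : ' ').ToArray())
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join("|", letterTokens));
        // experience|with|asp|net|mvc|and|sql

        // WhitespaceTokenizer-style: "mvc2" survives intact.
        var wsTokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join("|", wsTokens));
        // experience|with|asp.net|mvc2|and|sql
    }
}
```

So with the letter-based analyzer a TermQuery for the literal string "mvc2" can never match, because that term was never written to the index.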
I haven't found a way to actually make use of an analyzer unless I use the QueryParser, and even then I always got odd results.
You could try the following to "tokenize" your search string the same way your document is indexed, and make a boolean AND search over all the terms:
// We use a boolean query to combine all prefix queries
var analyzer = new SimpleAnalyzer();
var query = new BooleanQuery();
using ( var reader = new StringReader( queryTerms ) )
{
// This is what we need to do in order to get the terms one by one, kind of messy but seemed to be the only way
var tokenStream = analyzer.TokenStream( "why_do_I_need_this", reader );
var termAttribute = tokenStream.GetAttribute( typeof( TermAttribute ) ) as TermAttribute;
// This will return false when all tokens have been processed.
while ( tokenStream.IncrementToken() )
{
var token = termAttribute.Term();
query.Add( new PrefixQuery( new Term( KEYWORDS_FIELD_NAME, token ) ), BooleanClause.Occur.MUST );
}
// I don't know if this is necessary, but can't hurt
tokenStream.Close();
}
You can replace the PrefixQuery with a TermQuery if you only want full matches (a PrefixQuery matches anything starting with the term, e.g. "search*").
