How can I get data from multiple pages with HTML Agility Pack - C#

Hello, I've got a problem with HTML Agility Pack in C#. Maybe I'm not seeing what I'm doing wrong.
I want to get movies from multiple pages, but when I run my app it gets everything from the first page and repeats it n times (n is a number I supply). For example, the 10 titles from the first page are written 4 times by the loop:
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while(looping)
{
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
if(inc < 4)
{
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
inc++;
}
else
{
looping = false;
}
}
And below is my view code:
@for (int i = 0; i < Model.Count; i++)
{
<p>@Model[i].Tytul</p>
}
I tried different loops and every time the result was the same. Can you help me? I think I'm not seeing my mistake.
Thank you in advance!

There is a logic problem in your code. You collect the film titles by looping over tags, but tags always holds the nodes from the first page: you never reassign it after loading the next page. I made some changes to your code and got the right results:
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while (looping)
{
if (inc < 4)
{
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
inc++;
}
else
{
looping = false;
}
}
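The underlying pattern is that anything selected from a page has to be selected again after every load. A minimal sketch of that loop follows, using an in-memory stand-in for the site so it runs without network access. The Page class, the Pager.CollectTitles method, and the load delegate are hypothetical names made up for illustration; they are not part of HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a downloaded page.
public sealed class Page
{
    public List<string> Titles = new List<string>();
    public string NextUrl; // null on the last page
}

public static class Pager
{
    // Walks at most maxPages pages starting at startUrl.
    // 'load' plays the role of web.Load(url) and is an assumption of this sketch.
    public static List<string> CollectTitles(string startUrl, int maxPages, Func<string, Page> load)
    {
        var titles = new List<string>();
        string url = startUrl;
        for (int i = 0; i < maxPages && url != null; i++)
        {
            Page page = load(url);        // load the current page
            titles.AddRange(page.Titles); // read titles from THIS page, not from a cached first page
            url = page.NextUrl;           // follow the "next" link found on the same page
        }
        return titles;
    }
}
```

Calling Pager.CollectTitles("p1", 4, url => site[url]) with a dictionary of fake pages returns the titles of each page once, in order, and stops early when a page has no next link.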

Related

How to make a for loop for XPath in C#?

I'm trying to make a for loop to get all the data in the div. It worked for one element but not for the other.
My code:
if(websearch != mainSearchUrl) {
var webGet = new HtmlWeb();
var doc = webGet.Load(websearch);
var webnode = doc.DocumentNode.SelectNodes("/html/body/div/div[1]/div/div[2]");
foreach (HtmlNode node in webnode)
{
for (int i = 1; i < 15; i++)
{
var title = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div["+i+"]/div/a");
var chapters = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div[1]/div/div[4]"); // this is where it fails: when I put i in place of the second-to-last index, the result is null
comboBox1.Items.Add(title.InnerText + chapters.InnerText); // error: chapters is null
}
}
}

Why are my link clicks for web crawling from a list of links really slow? C#

I want to click on all links with the text "300". My web-scraping code clicks each link really slowly. I store the links in a list and click them one by one.
I count the links for indexing, then use for(int pos = 0; pos < numberOfElementsFound; pos++). I tried this code to count and click By.PartialLinkText("3600") on [https://www.w3schools.com/html/default.asp] and it is very responsive, but it is very slow on another site.
class Program
{
private static IWebDriver driver = null;
static void Main(string[] args)
{
driver = new InternetExplorerDriver();
driver.Manage().Window.Maximize();
driver.Navigate().GoToUrl("https://arbitrary.com/");
clickAllLinks("300");
}
//clicking links AND get data
public static void clickAllLinks(string tagName)
{
IWebElement element =
driver.FindElement(By.XPath("//div[@class='data']"));
int elements =
element.FindElements(By.PartialLinkText(tagName)).Count();
for (int pos = 0; pos < elements; pos++)
{
getElementWithIndex(By.PartialLinkText(tagName), pos).Click();
//fetchdata();
}
}
public static IWebElement getElementWithIndex(By by, int pos)
{
IWebElement element =
driver.FindElement(By.XPath("//div[@class='data']"));
IList<IWebElement> elements =
element.FindElements(By.PartialLinkText("300"));
return elements.ElementAt(pos);
}
//scrape data
public static async void fetchdata()
{
string currentURL = driver.Url; //url to string
Console.WriteLine("URL: " + currentURL);
var httpclient = new HttpClient();
var html = await httpclient.GetStringAsync(currentURL);
var htmldoc = new HtmlDocument();
htmldoc.LoadHtml(html); //html to htmldoc
List<List<string>> Receipt =
htmldoc.DocumentNode.SelectSingleNode("//table[@class='classname']")
//htmldoc into list TABLE->TR->TD->InnerText
.Descendants("tr")
.Where(tr => tr.Elements("td").Count() > 0)
.Select(tr => tr.Elements("td")
.ToList())
.ToList();
}
}
Here is a simplified version of your clickAllLinks method. It reduces the overhead of your current methods (fetching the elements and storing them unnecessarily, which may slow execution).
//clicking links AND get data
public static void clickAllLinks(string tagName)
{
// XPath that matches every link inside the data div whose text contains tagName
string linksXPath = "//div[@class='data']//a[contains(., '" + tagName + "')]";
int elements = driver.FindElements(By.XPath(linksXPath)).Count;
// XPath positions are 1-based, so iterate from 1 through elements inclusive
for (int pos = 1; pos <= elements; pos++)
{
driver.FindElement(By.XPath("(" + linksXPath + ")[" + pos + "]")).Click();
//fetchdata();
}
}
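XPath positions start at 1, not 0, which is easy to get wrong when mixing them with C# loop counters. The sketch below checks the indexing with the built-in System.Xml.XPath support on a small XML fragment instead of a live browser. The XPathIndexDemo class and its LinkTexts method are hypothetical names for illustration, and Selenium is not involved.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathIndexDemo
{
    // Returns the text of every matching link, fetched one at a time by 1-based position.
    public static List<string> LinkTexts(string xml, string text)
    {
        XDocument doc = XDocument.Parse(xml);
        // Same shape of expression as in the answer: links inside the data div containing 'text'.
        string all = "//div[@class='data']//a[contains(., '" + text + "')]";
        int count = doc.XPathSelectElements(all).Count();
        var result = new List<string>();
        for (int pos = 1; pos <= count; pos++) // 1..count inclusive
        {
            // The parentheses make [pos] index the whole result set, not each parent's children.
            XElement link = doc.XPathSelectElement("(" + all + ")[" + pos + "]");
            result.Add(link.Value);
        }
        return result;
    }
}
```

Starting the loop at 0 would make the first lookup return nothing, and stopping at pos < count would skip the last link.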

Using Aspose.Words to replace page numbers with barcodes

This may be a silly question, but I cannot work out an answer to it, and after a day I am turning to the community at large for help.
I am using Aspose.Words (C# / .NET) and I am trying to replace the generated page numbering with barcode images of my own creation. I can currently do it with fonts, but I have found they are less reliable with my barcode reader, so I need to be able to read the value from the page numbering and replace it with an image of my own creation.
So really I need to find the numbering container, read the value in it, and replace it. Once I have that, creating the barcode and inserting it is easy.
Can anyone help?
The current method (sorry it's messy, but I keep trying new things):
internal static void SetFooters(ref Document doc)
{
doc.FirstSection.HeadersFooters.LinkToPrevious(false);
var builder = new DocumentBuilder(doc);
builder.MoveToDocumentStart();
Section currentSection = builder.CurrentSection;
PageSetup pageSetup = currentSection.PageSetup;
int totalPages = doc.PageCount;
int j = 1;
foreach (Section sect in doc.Sections)
{
//Loop through all headers/footers
foreach (HeaderFooter hf in sect.HeadersFooters)
{
if (
hf.HeaderFooterType == HeaderFooterType.FooterPrimary || hf.HeaderFooterType == HeaderFooterType.FooterEven || hf.HeaderFooterType == HeaderFooterType.FooterFirst)
{
builder.MoveToHeaderFooter(hf.HeaderFooterType);
Field page = builder.InsertField("PAGE");
builder.Document.UpdatePageLayout();
try
{
page.Update();
}
catch { }
int pageNumber = j;
if (int.TryParse(page.Result, out pageNumber))
{ j++; }
// Remove PAGE field.
page.Remove();
builder.Write(string.Format("{0}/{1}", pageNumber, totalPages));
}
}
}
}
HeaderFooter is a section-level node and can only be a child of Section. The PAGE field inside a header/footer returns the most recently updated value, and that value will be the same for all pages of a section.
In your case, I suggest inserting a text box at the top/bottom of each page and putting the desired content in it. The following code example inserts a text box on each page of the document and writes the page number and some text into it. Hope this helps.
public static void InsertTextBoxAtEachPage()
{
string filePathIn = MyDir + @"input.docx";
string filePathOut = MyDir + @"output.docx";
Document doc = new Document(filePathIn);
DocumentBuilder builder = new DocumentBuilder(doc);
LayoutCollector collector = new LayoutCollector(doc);
int pageIndex = 1;
foreach (Section section in doc.Sections)
{
NodeCollection paragraphs = section.Body.GetChildNodes(NodeType.Paragraph, true);
foreach (Paragraph para in paragraphs)
{
if (collector.GetStartPageIndex(para) == pageIndex)
{
builder.MoveToParagraph(paragraphs.IndexOf(para), 0);
builder.StartBookmark("BM_Page" + pageIndex);
builder.EndBookmark("BM_Page" + pageIndex);
pageIndex++;
}
}
}
collector = new LayoutCollector(doc);
LayoutEnumerator layoutEnumerator = new LayoutEnumerator(doc);
const int PageRelativeY = 0;
const int PageRelativeX = 0;
foreach (Bookmark bookmark in doc.Range.Bookmarks)
{
if (bookmark.Name.StartsWith("BM_"))
{
Paragraph para = (Paragraph)bookmark.BookmarkStart.ParentNode;
Shape textbox = new Shape(doc, Aspose.Words.Drawing.ShapeType.TextBox);
textbox.Top = PageRelativeY;
textbox.Left = PageRelativeX;
int currentPageNumber = collector.GetStartPageIndex(para);
string barcodeString = string.Format("page {0} of {1}", currentPageNumber, doc.PageCount);
string barcodeEncodedString = "some barcode string";
Paragraph paragraph = new Paragraph(doc);
ParagraphFormat paragraphFormat = paragraph.ParagraphFormat;
paragraphFormat.Alignment = ParagraphAlignment.Center;
Aspose.Words.Style paragraphStyle = paragraphFormat.Style;
Aspose.Words.Font font = paragraphStyle.Font;
font.Name = "Tahoma";
font.Size = 12;
paragraph.AppendChild(new Run(doc, barcodeEncodedString));
textbox.AppendChild(paragraph);
paragraph = new Paragraph(doc);
paragraphFormat = paragraph.ParagraphFormat;
paragraphFormat.Alignment = ParagraphAlignment.Center;
paragraphStyle = paragraphFormat.Style;
font = paragraphStyle.Font;
font.Name = "Arial";
font.Size = 10;
paragraph.AppendChild(new Run(doc, barcodeString));
textbox.AppendChild(paragraph);
//Set the width height according to your requirements
textbox.Width = doc.FirstSection.PageSetup.PageWidth;
textbox.Height = 50;
textbox.BehindText = false;
para.AppendChild(textbox);
textbox.RelativeHorizontalPosition = Aspose.Words.Drawing.RelativeHorizontalPosition.Page;
textbox.RelativeVerticalPosition = Aspose.Words.Drawing.RelativeVerticalPosition.Page;
bool isInCell = bookmark.BookmarkStart.GetAncestor(NodeType.Cell) != null;
if (isInCell)
{
var renderObject = collector.GetEntity(bookmark.BookmarkStart);
layoutEnumerator.Current = renderObject;
layoutEnumerator.MoveParent(LayoutEntityType.Cell);
RectangleF location = layoutEnumerator.Rectangle;
textbox.Top = PageRelativeY - location.Y;
textbox.Left = PageRelativeX - location.X;
}
}
}
doc.Save(filePathOut, SaveFormat.Docx);
}
I work with Aspose as a Developer Evangelist.

Html Agility Pack xpath IEnumerable

I cannot add the HTML code because it is very, very big: 5 scrolls or more. Please follow the link in htmlWeb.Load().
I have been looking at this code for 2 hours already and I cannot figure out what is wrong.
HtmlWeb htmlWeb = new HtmlWeb {OverrideEncoding = Encoding.Default};
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("https://www.parimatch.com/en/sport/futbol/germanija-bundesliga");
var matches = document.DocumentNode.SelectNodes("//tr[@class='bk']").
Select(tr => new FootballMatch()
{
Number = string.Join(" ", tr.SelectNodes("./td[1]//text()[normalize-space()]").Select(t =>t.InnerText)),
Time = string.Join(" ", tr.SelectNodes("./td[2]//text()[normalize-space()]").Select(t => t.InnerText)),
Teams = string.Join(" ", tr.SelectNodes("./td[3]//text()[normalize-space()]").Select(t => t.InnerText)),
Allowance = string.Join(" ", tr.SelectNodes("./td[4]//text()[normalize-space()]").Select(t => t.InnerText)),
CoefficientAllowance = string.Join(" ", tr.SelectNodes("./td[5]//text()[normalize-space()]").Select(t => t.InnerText)),
Total = tr.SelectSingleNode("./td[7]//text()[normalize-space()]").InnerText,
P1 = tr.SelectSingleNode("./td[10]//text()[normalize-space()]").InnerText,
X = tr.SelectSingleNode("./td[11]//text()[normalize-space()]").InnerText,
/*P2 = tr.SelectSingleNode("./td[12]//text()[normalize-space()]").InnerText,
P1X = tr.SelectSingleNode("./td[13]//text()[normalize-space()]").InnerText,
P1P2 = tr.SelectSingleNode("./td[14]//text()[normalize-space()]").InnerText,
P2X = tr.SelectSingleNode("./td[15]//text()[normalize-space()]").InnerText*/
});
P2, P1X, P1P2 and P2X are always null.
Also, is it possible to make this code neater?
When you click on an event, a popup menu appears. This data is read too, but I do not need it. How can I disable this?
This is also not the prettiest, but it works. Some work still needs to be done on separating certain cells, since some <td>s contain <br> to separate lines. Hope this helps you move on.
string xpath = "//tr[@class='bk']";
HtmlNodeCollection matches = htmlDoc.DocumentNode.SelectNodes(xpath);
List<List<string>> footballMatches = new List<List<string>>();
foreach (HtmlNode x in matches)
{
List<string> mess = new List<string>();
HtmlNodeCollection hTC = x.SelectNodes("./td");
if (hTC.Count > 15)
{
for (int i = 0; i < 15; i++)
{
if (i != 5)
{
mess.Add(hTC[i].InnerText);
}
}
}
footballMatches.Add(mess);
}

changing a node type to #text whilst keeping the innernodes with the HtmlAgilityPack

I'm using the HtmlAgilityPack to parse an XML file that I'm converting to HTML. Some of the nodes will be converted to an HTML equivalent. The others that are unnecessary I need to remove while maintaining the contents. I tried converting it to a #text node with no luck. Here's my code:
private HtmlNode ConvertElementsPerDatabase(HtmlNode parentNode, bool transformChildNodes)
{
var listTagsToReplace = XmlTagMapping.SelectAll(string.Empty); // Custom Dataobject
var node = parentNode;
if (node != null)
{
var bNodeFound = false;
if (node.Name.Equals("xref"))
{
bNodeFound = true;
node = NodeXref(node);
}
if (node.Name.Equals("graphic"))
{
bNodeFound = true;
node = NodeGraphic(node);
}
if (node.Name.Equals("ext-link"))
{
bNodeFound = true;
node = NodeExtLink(node);
}
foreach (var infoTagToReplace in listTagsToReplace)
{
if (node.Name.Equals(infoTagToReplace.XmlTag))
{
bNodeFound = true;
node.Name = infoTagToReplace.HtmlTag;
if (!string.IsNullOrEmpty(infoTagToReplace.CssClass))
node.Attributes.Add("class", infoTagToReplace.CssClass);
if (node.HasAttributes)
{
var listTagAttributeToReplace = XmlTagAttributeMapping.SelectAll_TagId(infoTagToReplace.Id); // Custom Dataobject
for (int i = 0; i < node.Attributes.Count; i++ )
{
var bDeleteAttribute = true;
foreach (var infoTagAttributeToReplace in listTagAttributeToReplace)
{
if (infoTagAttributeToReplace.XmlName.Equals(node.Attributes[i].Name))
{
node.Attributes[i].Name = infoTagAttributeToReplace.HtmlName;
bDeleteAttribute = false;
}
}
if (bDeleteAttribute)
node.Attributes.Remove(node.Attributes[i].Name);
}
}
}
}
if (transformChildNodes)
for (int i = 0; i < parentNode.ChildNodes.Count; i++)
parentNode.ChildNodes[i] = ConvertElementsPerDatabase(parentNode.ChildNodes[i], true);
if (!bNodeFound)
{
// Replace with #text
}
}
return parentNode;
}
At the end I need to do the node replacement (where you see the "Replace with #text" comment) if the node is not found. I've been ripping my hair (what's left of it) out all day and it's probably something silly. I'm unable to get the help to compile and there is no online version. Help Stackoverflow! You're my only hope. ;-)
I would think you could just do this:
return new HtmlNode(HtmlNodeType.Text, parentNode.OwnerDocument, 0);
This of course adds the node to the head of the document, but I assume you have some sort of code in place to handle where in the document the node should be added.
Regarding the documentation comment, the current (as of this writing) download of the Html Agility Pack documentation contains a CHM file which doesn't require compilation in order to view.
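If the goal is to drop an element but keep what is inside it, another way to look at the problem is to "unwrap" the node, that is, replace it with its own children. The sketch below shows this with System.Xml.Linq from the .NET base library rather than with Html Agility Pack, on the assumption that the input is well-formed XML. The Unwrapper class and the sample tag names are hypothetical. Html Agility Pack's HtmlNode.RemoveChild(oldChild, keepGrandChildren) overload appears to serve the same purpose, but it is not exercised here.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class Unwrapper
{
    // Removes every descendant element named tagName but keeps its child nodes
    // (text and elements) in the same position.
    public static string Unwrap(string xml, string tagName)
    {
        XElement root = XElement.Parse(xml, LoadOptions.PreserveWhitespace);
        // Reverse() buffers the matches (a snapshot) and yields inner elements before
        // outer ones, so nested occurrences are unwrapped from the inside out.
        foreach (XElement el in root.Descendants(tagName).Reverse().ToList())
        {
            el.ReplaceWith(el.Nodes());
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, Unwrapper.Unwrap("<p>a <junk>b <i>c</i></junk> d</p>", "junk") keeps the text and the <i> element while the <junk> wrapper disappears.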
