I'm trying to get the author name from this site. The site simply shows a result of 25 rows, and each row contains different info like the author's name, title, etc.
I tried lots of solutions to select the author name for each tr, but failed to retrieve it. Here is my code; can someone help me figure out what I missed?
var documentx = new HtmlWeb().Load(post.ExtLink);
var div = documentx.DocumentNode.SelectNodes("//*//table[2]//tr");
if (div != null)
{
    foreach (var item in div)
    {
        Book model = new Book();
        var author = item.SelectSingleNode("//td[1]//a").InnerText.ToString();
        //var title = item.SelectNodes("//td").Skip(2).FirstOrDefault().InnerText;
        //var img = item.Descendants("img").Select(a1 => a1.GetAttributeValue("src", null)).FirstOrDefault();
        model.Book_Description = author;
    }
}
I want to get the author name for each row; this photo explains exactly what I want:
I tried to debug the code. It runs fine before the foreach and shows a 25-row result, but once the foreach starts executing it doesn't produce the expected value.
Try using:
var div = documentx.DocumentNode.SelectNodes("//*//table[3]//tr");
instead of:
var div = documentx.DocumentNode.SelectNodes("//*//table[2]//tr");
and use it like this:
var author = item.ChildNodes[0].InnerText;
var series = item.ChildNodes[1].InnerText;
var title = item.ChildNodes[2].InnerText;
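A related pitfall worth noting: in HtmlAgilityPack, an XPath that starts with // is evaluated against the whole document even when SelectSingleNode is called on a single row, so item.SelectSingleNode("//td[1]//a") returns the first match in the document on every iteration. A minimal sketch of the node-relative form, assuming the author link really sits in the first cell:
foreach (var item in div)
{
    // the leading "." anchors the query to the current <tr>
    var authorNode = item.SelectSingleNode(".//td[1]//a");
    if (authorNode != null)
    {
        var author = authorNode.InnerText;
    }
}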
I used the Google Books API through .NET with MVC Core 2. I retrieve an IEnumerable of data and, with a simple loop over it, assign every book to a Book object inside a list.
Here is the code, which I customized from here.
public IEnumerable<Book> GetBooks(string query, int offset, int count)
{
    var queryList = _booksService.Volumes.List(query);
    queryList.MaxResults = count;
    queryList.StartIndex = offset;
    var result = queryList.Execute();
    if (result != null)
    {
        var books = result.Items.Select(b => new Book
        {
            // id = b.Id,
            title = b.VolumeInfo.Title,
            subtitle = b.VolumeInfo.Subtitle,
            authors = string.Join(",", b.VolumeInfo.Authors),
            publisher = b.VolumeInfo.Publisher,
            publishedDate = b.VolumeInfo.PublishedDate,
            description = b.VolumeInfo.Description,
            pageCount = (int)b.VolumeInfo.PageCount,
            category = string.Join(",", b.VolumeInfo.Categories),
            maturityRating = b.VolumeInfo.MaturityRating,
            language = b.VolumeInfo.Language,
            smallThumbnail = b.VolumeInfo.ImageLinks.SmallThumbnail,
            thumbnail = b.VolumeInfo.ImageLinks.Thumbnail
        }).ToList();
        return books;
    }
    else
    {
        return null;
    }
}
It was working fine before, but now, after the 10th iteration, I get an "InvalidOperationException: Nullable object must have a value." in my view,
in the following line:
var books = result.Items.Select(b => new Book
This is very weird, because the debugger shows that the request executed successfully and that I get results back.
What could possibly have gone wrong?
Thanks in advance.
One of the property values is null, e.g. PageCount. When the debugger halts, check whether the current item has data for every property that you try to assign or use.
Also, the InnerException might contain the name of the property that is failing.
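For example, the nullable and possibly-null properties can be guarded before assignment. A sketch of the defensive version of the affected initializers (assuming PageCount is the nullable int and that Authors/ImageLinks can be missing on some volumes):
pageCount = b.VolumeInfo.PageCount ?? 0, // PageCount is int?; the (int) cast throws when it is null
authors = b.VolumeInfo.Authors != null ? string.Join(",", b.VolumeInfo.Authors) : "",
smallThumbnail = b.VolumeInfo.ImageLinks?.SmallThumbnail,
thumbnail = b.VolumeInfo.ImageLinks?.Thumbnail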
We have a MailMerge docx which has the following table:
_____________________________________________________________________________
Date                          Id        Description        Amount
_____________________________________________________________________________
{{TableStart:Lines}}          {{Id}}    {{Description}}    € {{Amount \# 0,00}}
{{Date \#"dd-MM-yyyy"}}                                    {{TableEnd:Lines}}
_____________________________________________________________________________
Total                                                      € {{Total \# 0,00}}
_____________________________________________________________________________
Here is an example result row:
____________________________________________________________________________
Date          Id      Description                        Amount
____________________________________________________________________________
03-09-2015    0001    Company Name                       € 25,00
                      Buyer Name 1, Buyer Name 2
                      Product description
                      Extra description line
As you can see, the description has multiple lines. When the end of a page is reached, the row just continues on the next page. So with the example above, the row could look like this at the end of page 1:
03-09-2015    0001    Company Name                       € 25,00
                      Buyer Name 1, Buyer Name 2
And like this at the start of page 2:
                      Product description
                      Extra description line
What I'd like instead is the following: When an item doesn't fit on the page anymore, the entire item must go to the start of the next page. Basically I want to prevent items from splitting between pages. Is there any way to accomplish this with MailMerge?
Also, we use C# in our project. I think it's a bit too ambitious to ask whether the MailMerge libraries have a setting that allows the behavior I want, but anyway, here is the code we use to convert the data & docx to a PDF:
var pdf = _documentService.CreateTableFile(new TableFileData(date, companyId,
    dataList.Select(x => new TableRowData
    {
        Description = x.Description,
        Amount = x.Amount,
        Date = x.Date,
        Id = x.Id
    }).ToList()));
var path = Path.Combine(FileService.GetTemporaryPath(), Path.GetRandomFileName());
var file = Path.ChangeExtension(path, "pdf");
using (var fs = File.OpenWrite(file))
{
    fs.Write(pdf, 0, pdf.Length);
}
Process.Start(file);
The CreateTableFile method:
public byte[] CreateTableFile(TableFileData data)
{
    if (data == null) throw new ArgumentNullException("data");
    const string fileName = "TableFile.docx";
    var path = Path.Combine(_templatePath, fileName);
    using (var fs = File.OpenRead(path))
    {
        var dataSource = new DocumentDataSource(data);
        return GenerateDocument(fs, dataSource);
    }
}
And the GenerateDocument method:
private static byte[] GenerateDocument(Stream template, DocumentDataSource dataSource, IFieldMergingCallback fieldMergingCallback = null)
{
    var doc = new Document(template);
    doc.MailMerge.FieldMergingCallback = fieldMergingCallback;
    doc.MailMerge.UseNonMergeFields = true;
    doc.MailMerge.CleanupOptions = MailMergeCleanupOptions.RemoveContainingFields |
                                   MailMergeCleanupOptions.RemoveUnusedFields |
                                   MailMergeCleanupOptions.RemoveUnusedRegions |
                                   MailMergeCleanupOptions.RemoveEmptyParagraphs;
    doc.MailMerge.Execute(dataSource);
    doc.MailMerge.ExecuteWithRegions((IMailMergeDataSourceRoot)dataSource);
    doc.UpdateFields();
    using (var ms = new MemoryStream())
    {
        var options = new PdfSaveOptions { WarningCallback = new AsposeWarningCallback() };
        doc.Save(ms, options);
        return ms.ToArray();
    }
}
After #bibadia's suggestion in the first comment of the question, I've unchecked the suggested checkbox in the table's row settings in the docx (the option that allows rows to break across pages):
This did the trick, so thanks a lot bibadia!
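If editing the template by hand ever becomes impractical, the same effect can presumably be set programmatically after the merge; a minimal sketch, assuming the Aspose.Words API used in GenerateDocument above (run before doc.Save):
// keep each data row on a single page: disable "break across pages" for every row
foreach (Aspose.Words.Tables.Table table in doc.GetChildNodes(NodeType.Table, true))
{
    foreach (Aspose.Words.Tables.Row row in table.Rows)
    {
        row.RowFormat.AllowBreakAcrossPages = false;
    }
}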
Can anyone explain to me in simple language why I get a file of about 65 KB when using foreach, and more than 3 GB when using Parallel.ForEach?
The code for the foreach:
// start node xml document
var logItems = new XElement("log", new XAttribute("start", DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss")));
var products = new ProductLogic().SelectProducts();
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
// loop through all products
foreach (var product in products)
{
    // is in a specific group
    var id = Convert.ToInt32(product["ProductID"]);
    var isInGroup = productGroupLogic.GetProductGroups(new int[] { id }.ToList(), groupId).Count > 0;
    // get product stock per option
    var productSizes = productOptionLogic.GetProductStockByProductId(id).ToList();
    // any stock available
    var stock = productSizes.Sum(ps => ps.Stock);
    var hasStock = stock > 0;
    // get webpage for this product
    var productUrl = string.Format(url, id);
    var htmlPage = Html.Page.GetWebPage(productUrl);
    // check if there is anything to log
    var addToLog = false;
    XElement sizeElements = null;
    // if has no stock or in group
    if (!hasStock || isInGroup)
    {
        // page shows => not ok => LOG!
        if (!htmlPage.NotFound) addToLog = true;
    }
    // if page is ok
    if (htmlPage.IsOk)
    {
        sizeElements = GetSizeElements(htmlPage.Html, productSizes);
        addToLog = sizeElements != null;
    }
    if (addToLog) logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
}
// save
var xDocument = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("log", logItems));
xDocument.Save(fileName);
Using the parallel version is a minor change: just replace the foreach with Parallel.ForEach:
// loop through all products
Parallel.ForEach(products, product =>
{
    ... code ...
});
The methods GetSizeElements and CreateElement are both static.
Update 1
I made the methods GetSizeElements and CreateElement thread-safe with a lock; that didn't help either.
Update 2
I got an answer that solves the problem, which is nice and fine. But I would like some more insight into why this code creates a file that is so much bigger than the foreach version. I am trying to get a better sense of how the code behaves when using threads, so that I can learn to avoid the pitfalls.
One thing stands out:
if (addToLog)
    logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
logItems is not thread-safe. That could be your core problem, but there are lots of other possibilities.
You have the output files; look for the differences.
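For what it's worth: XElement keeps its children in a plain, unsynchronized structure, so concurrent Add calls can corrupt that list and duplicate content, which would explain the runaway file size. A minimal sketch of serializing just the writes, assuming the rest of the loop body stays unchanged:
var logLock = new object();
Parallel.ForEach(products, product =>
{
    // ... build productUrl, htmlPage, stock, isInGroup, sizeElements as before ...
    if (addToLog)
    {
        lock (logLock) // only one thread may mutate the shared XElement at a time
        {
            logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
        }
    }
});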
Try defining the following variables inside the Parallel.ForEach loop:
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
I think these two are shared by all of your threads inside the parallel loop, and the result gets multiplied unnecessarily.
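A sketch of that suggestion (assuming the two constructors are cheap enough to run once per iteration):
Parallel.ForEach(products, product =>
{
    // per-iteration instances, so no unsynchronized state is shared between threads
    var productGroupLogic = new ProductGroupLogic();
    var productOptionLogic = new ProductOptionLogic();
    // ... rest of the loop body unchanged ...
});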
I'm trying to get a link and another element from an HTML page, but I don't really know what to do. This is what I have right now:
var client = new HtmlWeb(); // Initialize HtmlAgilityPack's functions.
var url = "http://p.thedgtl.net/index.php?tag=-1&title={0}&author=&o=u&od=d&page=-1&"; // The site/page we are indexing.
var doc = client.Load(string.Format(url, textBox1.Text)); // Index the whole DB.
var nodes = doc.DocumentNode.SelectNodes("//a[@href]"); // Get every url.
string authorName = "";
string fileName = "";
string fileNameWithExt;
foreach (HtmlNode link in nodes)
{
    string completeUrl = link.Attributes["href"].Value; // The complete plugin download url.
    #region Get all jars
    if (completeUrl.Contains(".jar")) // Check if the url contains .jar
    {
        fileNameWithExt = completeUrl.Substring(completeUrl.LastIndexOf('/') + 1); // Get the filename with extension.
        fileName = fileNameWithExt.Remove(fileNameWithExt.LastIndexOf('.')); // Get the filename without extension.
        Console.WriteLine(fileName);
    }
    #endregion
    #region Get all Authors
    if (completeUrl.Contains("?author=")) // Check if the url contains ?author=
    {
        authorName = completeUrl.Substring(completeUrl.LastIndexOf('=') + 1); // Get the author name.
        Console.WriteLine(authorName);
    }
    #endregion
}
I am trying to get all the filenames and authors next to each other, but right now everything comes out as if randomly placed. Why?
Can someone help me with this? Thanks!
If you look at the HTML, it's very unfortunate that it is not well-formed. There are a lot of unclosed tags, and HAP does not structure it the way a browser does; it interprets the majority of the document as deeply nested. So you can't simply iterate through the rows of the table like you would in the browser; it gets a lot more complicated than that.
When dealing with such documents, you have to change your queries quite a bit. Rather than searching through child elements, you have to search through descendants, adjusting for the change.
var title = System.Web.HttpUtility.UrlEncode(textBox1.Text);
var url = String.Format("http://p.thedgtl.net/index.php?title={0}", title);
var web = new HtmlWeb();
var doc = web.Load(url);
// select the rows in the table
var xpath = "//div[@class='content']/div[@class='pluginList']/table[2]";
var table = doc.DocumentNode.SelectSingleNode(xpath);
// unfortunately the `tr` tags are not closed, so HAP interprets
// this table as having a single row with multiple descendant `tr`s
var rows = table.Descendants("tr")
    .Skip(1); // skip header row
var query =
    from row in rows
    // there may be a row with an embedded ad
    where row.SelectSingleNode("td/script") == null
    // each row has 6 columns so we need to grab the next 6 descendants
    let columns = row.Descendants("td").Take(6).ToList()
    let titleText = columns[1].Elements("a").Select(a => a.InnerText).FirstOrDefault()
    let authorText = columns[2].Elements("a").Select(a => a.InnerText).FirstOrDefault()
    let downloadLink = columns[5].Elements("a").Select(a => a.GetAttributeValue("href", null)).FirstOrDefault()
    select new
    {
        Title = titleText ?? "",
        Author = authorText ?? "",
        FileName = Path.GetFileName(downloadLink ?? ""),
    };
So now you can just iterate through the query and write out what you want for each of the rows.
foreach (var item in query)
{
    Console.WriteLine("{0} ({1})", item.FileName, item.Author);
}
Hiya, I have the following code, but when I try to create a new IQueryable I get an error that the interface cannot be implemented, and if I take away the new I get a NotImplementedException. I've had to jump back and work on some old classic ASP sites for the past month, and for the life of me I cannot wake my brain up into C# mode.
Could you please have a look below and give me some clues on where I'm going wrong?
The code is meant to create a list of price items, but instead of a categoryID (int) I am going to show the category name as a string.
public ActionResult ViewPriceItems(int? page)
{
    var crm = 0;
    page = GetPage(page);
    // try and create items2
    IQueryable<ViewPriceItemsModel> items2 = new IQueryable<ViewPriceItemsModel>();
    // the data to be paged, but unmodified
    var olditems = PriceItem.All().OrderBy(x => x.PriceItemID);
    foreach (var item in olditems)
    {
        // set category as the name not the ID for easier reading
        items2.Concat(new[] { new ViewPriceItemsModel
        {
            ID = item.PriceItemID,
            Name = item.PriceItem_Name,
            Category = PriceCategory.SingleOrDefault(
                x => x.PriceCategoryID == item.PriceItem_PriceCategory_ID).PriceCategory_Name,
            Display = item.PriceItems_DisplayMethod
        } });
    }
    crm = olditems.Count() / MaxResultsPerPage;
    ViewData["numtpages"] = crm;
    ViewData["curtpage"] = page + 1;
    // return a paged result set
    return View(new PagedList<ViewPriceItemsModel>(items2, page ?? 0, MaxResultsPerPage));
}
many thanks
You do not need to create items2; remove the line with the comment "try and create items2". IQueryable is an interface, so it cannot be instantiated with new, and Concat does not modify items2 in place anyway; it returns a new sequence, which your loop discards. Use the following code instead. I have not tested this, but I hope it works.
var items2 = (from item in olditems
select new ViewPriceItemsModel
{
ID = item.PriceItemID,
Name = item.PriceItem_Name,
Category = PriceCategory.SingleOrDefault(
x => x.PriceCategoryID == item.PriceItem_PriceCategory_ID).PriceCategory_Name,
Display = item.PriceItems_DisplayMethod
}).AsQueryable();
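items2 can then be handed to the paged view exactly as in your original code:
return View(new PagedList<ViewPriceItemsModel>(items2, page ?? 0, MaxResultsPerPage));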