Html Agility Pack xpath IEnumerable - c#

I can't include the HTML here because it is far too long (five screens or more). Please follow the link passed to htmlWeb.Load().
I have been looking at this code for two hours and cannot figure out what is wrong.
HtmlWeb htmlWeb = new HtmlWeb {OverrideEncoding = Encoding.Default};
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("https://www.parimatch.com/en/sport/futbol/germanija-bundesliga");
var matches = document.DocumentNode.SelectNodes("//tr[@class='bk']").
Select(tr => new FootballMatch()
{
Number = string.Join(" ", tr.SelectNodes("./td[1]//text()[normalize-space()]").Select(t =>t.InnerText)),
Time = string.Join(" ", tr.SelectNodes("./td[2]//text()[normalize-space()]").Select(t => t.InnerText)),
Teams = string.Join(" ", tr.SelectNodes("./td[3]//text()[normalize-space()]").Select(t => t.InnerText)),
Allowance = string.Join(" ", tr.SelectNodes("./td[4]//text()[normalize-space()]").Select(t => t.InnerText)),
CoefficientAllowance = string.Join(" ", tr.SelectNodes("./td[5]//text()[normalize-space()]").Select(t => t.InnerText)),
Total = tr.SelectSingleNode("./td[7]//text()[normalize-space()]").InnerText,
P1 = tr.SelectSingleNode("./td[10]//text()[normalize-space()]").InnerText,
X = tr.SelectSingleNode("./td[11]//text()[normalize-space()]").InnerText,
/*P2 = tr.SelectSingleNode("./td[12]//text()[normalize-space()]").InnerText,
P1X = tr.SelectSingleNode("./td[13]//text()[normalize-space()]").InnerText,
P1P2 = tr.SelectSingleNode("./td[14]//text()[normalize-space()]").InnerText,
P2X = tr.SelectSingleNode("./td[15]//text()[normalize-space()]").InnerText*/
});
P2, P1X, P1P2 and P2X are always null.
Is it possible to make this code neater?
Also, when you click on an event a popup menu appears, and its data is read too, but I do not need it. How can I exclude it?

This is not the prettiest either, but it works. Some work is still needed to separate the contents of certain cells, since some <td>s use <br> to separate lines. I hope this helps you move on.
string xpath = "//tr[@class='bk']";
HtmlNodeCollection matches = htmlDoc.DocumentNode.SelectNodes(xpath);
List<List<string>> footballMatches = new List<List<string>>();
foreach (HtmlNode x in matches)
{
List<string> mess = new List<string>();
HtmlNodeCollection hTC = x.SelectNodes("./td");
if (hTC.Count > 15)
{
for (int i = 0; i < 15; i++)
{
if (i != 5)
{
mess.Add(hTC[i].InnerText);
}
}
}
footballMatches.Add(mess);
}

Related

How to make a for loop for XPath in C#?

I'm trying to write a for loop to get all the data in the div. It worked for one element but not for the other.
My code:
if(websearch != mainSearchUrl) {
var webGet = new HtmlWeb();
var doc = webGet.Load(websearch);
var webnode = doc.DocumentNode.SelectNodes("/html/body/div/div[1]/div/div[2]");
foreach (HtmlNode node in webnode)
{
for (int i = 1; i < 15; i++)
{
var title = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div["+i+"]/div/a");
var chapters = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div[1]/div/div[4]"); // here is the error: when I put i in place of the second-to-last index, the result is null
comboBox1.Items.Add(title.InnerText + chapters.InnerText); // error: chapters is null
}
}
}

How I can get data from multiple pages with HTML Agility Pack

Hello, I have a problem with Html Agility Pack in C#. Maybe I am not seeing what I am doing wrong.
I want to get movies from multiple pages, but when I run my app it gets everything from the first page and repeats it n times (n is a number I specify). For example, the 10 titles from the first page are written 4 times by the loop.
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while(looping)
{
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
if(inc < 4)
{
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
inc++;
}
else
{
looping = false;
}
}
And below is my view code:
@for (int i = 0; i < Model.Count; i++)
{
<p>@Model[i].Tytul</p>
}
I tried different loops and the result was the same every time. Can you help me? I think I am not seeing my mistake.
Thank you in advance!
There is a logic problem in your code. You collect the film titles by looping over tags, but tags always comes from the first page: it is never reassigned after a new page is loaded. I made some changes to your code and got the right results:
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while (looping)
{
if (inc < 4)
{
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
inc++;
}
else
{
looping = false;
}
}
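The same fix can also be written without the looping flag. The sketch below is only an illustration, not the answerer's code: Page, Pager.CollectTitles and the loadPage delegate are hypothetical names, and loadPage stands in for the web.Load and SelectNodes calls above so that the loop logic can be read on its own. It reads the titles of the page it has just loaded before it follows the next-page link.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for what one page yields.
public sealed class Page
{
    public List<string> Titles = new List<string>();
    public string NextHref; // null or empty when there is no next page
}

public static class Pager
{
    // loadPage is an assumed delegate: given a URL, it returns that page's
    // titles and the href of its "next" link.
    public static List<string> CollectTitles(string startUrl, int maxPages, Func<string, Page> loadPage)
    {
        var titles = new List<string>();
        var url = new Uri(startUrl);
        for (int i = 0; i < maxPages; i++)
        {
            Page page = loadPage(url.AbsoluteUri);
            // Read the titles of the page that was just loaded,
            // never a collection captured from an earlier page.
            titles.AddRange(page.Titles);
            if (string.IsNullOrEmpty(page.NextHref)) break;
            // Resolve the next link against the current URL.
            url = new Uri(url, page.NextHref);
        }
        return titles;
    }
}
```

With Html Agility Pack, loadPage would wrap web.Load(url) and the two SelectNodes queries. Resolving the link with new Uri(url, href) avoids concatenating URL strings by hand.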

The recursive method gets into an infinite loop

I am writing a simple crawler based on HtmlAgilityPack and Fizzler to check whether a keyword is contained anywhere on a web page and its sublinks. The same procedure is then repeated for all of the sublinks, up to 50 levels deep, so the number of pages grows exponentially.
The issue is that I wanted to convert the method I had written into a recursive one, but it doesn't work: it gets stuck after the first link and is also really slow.
This is what I've done so far:
public static void GetAllLinks(HtmlWeb web, List<string> relevantLinks, string inputLink)
{
string mainLink = "http://www.cnet.com";
Console.WriteLine("Current count of links: " + relevantLinks.Count + "\tCurrent link: " + inputLink);
HtmlDocument html = web.Load(inputLink);
HtmlDocument htmlInner = new HtmlDocument();
html.DocumentNode.Descendants()
.Where(n => n.Name == "script" || n.Name == "style")
.ToList()
.ForEach(n => n.Remove());
var text = htmlInner.DocumentNode.InnerText.ToLower();
text = Regex.Replace(text, @"\r\n?|\n", "");
text = Regex.Replace(text, " {2,}", " ");
text = text.Trim();
if (text.Contains("microsoft"))
{
if (!relevantLinks.Contains(inputLink))
{
relevantLinks.Add(inputLink);
}
}
var linkTagsList = html.DocumentNode.QuerySelectorAll("a").ToList();
foreach (var linkTag in linkTagsList)
{
if (linkTag.Attributes["href"] != null)
{
var link = linkTag.Attributes["href"].Value;
// Check if the link found is the sublink of the main link
if (!link.Contains(mainLink))
{
// Check if only partial link then concat with main one
if (link.Substring(0, 1) == "/")
{
if (inputLink.Substring(inputLink.Length - 1, 1) == "/")
inputLink = inputLink.Substring(0, inputLink.Length - 1);
link = inputLink + link;
}
else
{
link = String.Empty;
}
}
if (!string.IsNullOrEmpty(link))
{
Console.WriteLine(link);
GetAllLinks(web, relevantLinks, link);
}
}
}
}
Any hint or advice is highly appreciated.
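One thing that can keep a recursive crawler from terminating is that pages link back to each other, so the same URLs are visited again and again. Below is a minimal sketch of one way to guard against that, assuming a visited set and an explicit depth limit are acceptable. Crawler.Visit and the getHrefs delegate are hypothetical names; getHrefs stands in for loading the page and reading the href attributes as in the code above, and the keyword check is left out.

```csharp
using System;
using System.Collections.Generic;

public static class Crawler
{
    // getHrefs is an assumed delegate: it returns the raw href values found on a page.
    public static void Visit(Uri page, int depth, int maxDepth,
                             HashSet<string> visited,
                             Func<Uri, IEnumerable<string>> getHrefs)
    {
        // HashSet.Add returns false for a URL that was already seen,
        // which stops cycles such as A -> B -> A.
        if (depth > maxDepth || !visited.Add(page.AbsoluteUri)) return;

        foreach (string href in getHrefs(page))
        {
            Uri child;
            // Resolve relative links against the current page; skip malformed ones.
            if (!Uri.TryCreate(page, href, out child)) continue;
            // Skip mailto:, javascript: and other non-web schemes.
            if (child.Scheme != Uri.UriSchemeHttp && child.Scheme != Uri.UriSchemeHttps) continue;
            // Stay on the host of the current page.
            if (child.Host != page.Host) continue;
            Visit(child, depth + 1, maxDepth, visited, getHrefs);
        }
    }
}
```

A call such as Crawler.Visit(new Uri("http://www.cnet.com"), 0, 50, new HashSet<string>(), getHrefs) would start the crawl. Since each page is fetched at most once, the amount of work is bounded by the number of distinct URLs rather than by the number of paths through them.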

Find values which sum to 0 in Excel with many items

I have to find every subset that sums to 0 in a fairly large list of 500 to 1000 items, which are positive and negative decimals. I'm not an expert, so I read many articles and solutions and then wrote my own code. The data comes from an Excel worksheet, and I would like to mark the sums I find there.
The code works this way:
First I find all pairs that sum to 0.
Then I put the remaining sums into a list and build combinations of up to 20 items, because I know that no larger combination can sum to 0.
Within these combinations I check whether a combination sums to 0 and, if so, save it in the result list; otherwise I save its sum as a key in a dictionary and later check whether the dictionary contains the negation of a subsequent sum (so I check pairs of these subsets).
I keep track of the indices so I can reach and modify the cells.
Finding the solutions is fast enough, but processing the results in Excel is really slow. I don't need to find all solutions, but I want to find as many as possible in a short time.
What do you think of this solution? How can I improve the speed? How can I easily skip the sums that have already been used? And how can I mark the cells quickly in my worksheet, since that is currently the bottleneck of the program?
I hope this is clear enough :) Thanks to everybody for any help.
Here is the code for the combination part:
List<decimal> listDecimal = new List<decimal>();
List<string> listRange = new List<string>();
List<decimal> resDecimal = new List<decimal>();
List<IEnumerable<decimal>> resDecimal2 = new List<IEnumerable<decimal>>();
List<IEnumerable<string>> resIndex = new List<IEnumerable<string>>();
Dictionary<decimal, int> dicSumma = new Dictionary<decimal, int>();
foreach (TarkistaSummat.CellsRemain el in list)
{
decimal sumDec = Convert.ToDecimal(el.Summa.Value);
listDecimal.Add(sumDec);
string row = el.Summa.Cells.Row.ToString();
string col = el.Summa.Cells.Column.ToString();
string range = el.Summa.Cells.Row.ToString() + ":" + el.Summa.Cells.Column.ToString();
listRange.Add(range);
}
var subsets = new List<IEnumerable<decimal>> { new List<decimal>() };
var subsetsIndex = new List<IEnumerable<string>> { new List<string>() };
for (int i = 0; i < list.Count; i++)
{
if (i > 20)
{
List<IEnumerable<decimal>> parSubsets = subsets.GetRange(i, i + 20);
List<IEnumerable<string>> parSubsetsIndex = subsetsIndex.GetRange(i, i + 20);
var Z = parSubsets.Select(x => x.Concat(new[] { listDecimal[i] }));
//var Zfound = Z.Select(x => x).Where(w => w.Sum() ==0);
subsets.AddRange(Z.ToList());
var Zr = parSubsetsIndex.Select(x => x.Concat(new[] { listRange[i] }));
subsetsIndex.AddRange(Zr.ToList());
}
else
{
var T = subsets.Select(y => y.Concat(new[] { listDecimal[i] }));
//var Tfound = T.Select(x => x).Where(w => w.Sum() == 0);
//resDecimal2.AddRange(Tfound);
//var TnotFound = T.Except(Tfound);
subsets.AddRange(T.ToList());
var Tr = subsetsIndex.Select(y => y.Concat(new[] { listRange[i] }));
subsetsIndex.AddRange(Tr.ToList());
}
for (int i = 0; i < subsets.Count; i++)
{
decimal sumDec = subsets[i].Sum();
if (sumDec == 0m)
{
resDecimal2.Add(subsets[i]);
resIndex.Add(subsetsIndex[i]);
continue;
}
else
{
if(dicSumma.ContainsKey(sumDec * -1))
{
dicSumma.TryGetValue(sumDec * -1, out int index);
IEnumerable<decimal> addComb = subsets[i].Union(subsets[index]);
resDecimal2.Add(addComb);
var indexComb = subsetsIndex[i].Union(subsetsIndex[index]);
resIndex.Add(indexComb);
}
else
{
if(!dicSumma.ContainsKey(sumDec))
{
dicSumma.Add(sumDec, i);
}
}
}
}
for (int i = 0; i < resIndex.Count; i++)
{
//List<Range> ranges = new List<Range>();
foreach(string el in resIndex[i])
{
string[] split = el.Split(':');
Range cell = actSheet.Cells[Convert.ToInt32(split[0]), Convert.ToInt32(split[1])];
cell.Interior.ColorIndex = 6;
}
}
}
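For the first step described above (finding pairs that sum to 0) and for skipping values that have already been used, one possible approach is a dictionary of still-unmatched indices keyed by value. This is a minimal sketch under that assumption, not the asker's code; ZeroSumPairs.Find is a hypothetical name, and the returned indices refer to positions in the input list (for example listDecimal).

```csharp
using System;
using System.Collections.Generic;

public static class ZeroSumPairs
{
    // Returns index pairs (i, j) with values[i] + values[j] == 0.
    // Each index appears in at most one pair.
    public static List<Tuple<int, int>> Find(IList<decimal> values)
    {
        // value -> indices holding that value which have not been paired yet
        var waiting = new Dictionary<decimal, Stack<int>>();
        var pairs = new List<Tuple<int, int>>();

        for (int i = 0; i < values.Count; i++)
        {
            decimal v = values[i];
            Stack<int> partners;
            if (waiting.TryGetValue(-v, out partners) && partners.Count > 0)
            {
                // Popping removes the partner, so it cannot be reused later.
                pairs.Add(Tuple.Create(partners.Pop(), i));
                continue;
            }

            Stack<int> same;
            if (!waiting.TryGetValue(v, out same))
            {
                same = new Stack<int>();
                waiting[v] = same;
            }
            same.Push(i);
        }
        return pairs;
    }
}
```

Whatever indices are not returned in a pair are the remaining items that would go on to the combination step.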

Finding values within certain tags using regex

I have a sample string:
<num>1.</num> <Ref>véase anomalía de Ebstein</Ref> <num>2.</num> <Ref>-> vascularización</Ref>
I wish to make a comma-separated string from the values inside the Ref tags.
I have tried the following:
Regex r = new Regex("<ref>(?<match>.*?)</ref>");
Match m = r.Match(csv[4].ToLower());
if (m.Groups.Count > 0)
{
if (m.Groups["match"].Captures.Count > 0)
{
foreach (Capture c in m.Groups["match"].Captures)
{
child.InnerText += c.Value + ", ";
}
child.InnerText = child.InnerText.Substring(0, child.InnerText.Length - 2).Replace("-> ", "");
}
}
But this only ever seems to find the value inside the first ref tag.
Where am I going wrong?
You want to use Matches rather than Match to get all the matches that occur, something like:
// IgnoreCase lets the lowercase pattern match <Ref> without calling ToLower() on the input.
Regex r = new Regex("<ref>(?<match>.*?)</ref>", RegexOptions.IgnoreCase);
List<string> values = new List<string>();
foreach (Match m in r.Matches(csv[4]))
{
    // Each Match covers one <Ref>...</Ref> pair; the named group holds its inner text.
    values.Add(m.Groups["match"].Value.Replace("-> ", ""));
}
// Join once after the loop, so no trailing separator has to be trimmed.
child.InnerText = string.Join(", ", values);
I strongly recommend using XPath over regular expressions to search XML documents.
string xml = @"<test>
<num>1.</num> <Ref>véase anomalía de Ebstein</Ref> <num>2.</num> <Ref>-> vascularización</Ref>
</test>";
XmlDocument d = new XmlDocument();
d.LoadXml(xml);
var list = from XmlNode n in d.SelectNodes("//Ref") select n.InnerText;
Console.WriteLine(String.Join(", ", list.ToArray()));
Regex quantifiers are greedy by default, so a greedy .* would match from the first tag to the last tag. If your XML is well formed, you can change the regex to something like:
Regex r = new Regex("<ref>(?<match>[^<]*?)</ref>");
This matches any run of characters other than < between the tags.
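Putting the regex suggestions together, a minimal self-contained sketch run against the sample string from the question might look like this. RefExtractor and JoinRefs are hypothetical names used only for illustration, and RegexOptions.IgnoreCase is used here in place of lowercasing the input.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RefExtractor
{
    // Matches one <Ref>...</Ref> pair; [^<]* stops at the next tag.
    private static readonly Regex RefTag =
        new Regex("<ref>(?<match>[^<]*)</ref>", RegexOptions.IgnoreCase);

    // Returns the text of every Ref element, comma-separated, without the "-> " prefix.
    public static string JoinRefs(string input)
    {
        var values = RefTag.Matches(input)
            .Cast<Match>()
            .Select(m => m.Groups["match"].Value.Replace("-> ", "").Trim());
        return string.Join(", ", values);
    }
}
```

For the sample string, RefExtractor.JoinRefs returns "véase anomalía de Ebstein, vascularización".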
