C# loop through two variables - c#

I have two variables, div1 and div2, and want to get all the values from them.
I can loop through one variable with foreach, but is it possible to get the InnerHtml of both divs?
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var div1 = doc.DocumentNode.SelectNodes("//div[contains(@class,'class1')]");
var div2 = doc.DocumentNode.SelectNodes("//div[contains(@class,'class2')]");
foreach (HtmlNode div in div1)
{
String text = div.InnerHtml;
Debug.WriteLine(text);
}

The answers from @mcjmzn, @Jonathan, and @Nenad are correct as far as printing all the InnerHtml values goes.
I'm guessing you want to print the first div1 InnerHtml, then the first div2 InnerHtml, then the second div1 InnerHtml, the second div2 InnerHtml, and so on. You'll want a regular for loop instead of a foreach, with checks to make sure you don't run past the end of either div1 or div2:
var div1Max = div1.Count;
var div2Max = div2.Count;
var overallMax = Math.Max(div1Max, div2Max);
for(var i = 0; i < overallMax; i++)
{
if (i < div1Max)
{
String text1 = div1[i].InnerHtml;
Debug.WriteLine(text1);
}
if (i < div2Max)
{
String text2 = div2[i].InnerHtml;
Debug.WriteLine(text2);
}
}

You can use the Concat extension method on IEnumerable<T> (it needs using System.Linq;) to combine both collections of nodes.
HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load(url);
var div1 = doc.DocumentNode.SelectNodes("//div[contains(@class,'class1')]");
var div2 = doc.DocumentNode.SelectNodes("//div[contains(@class,'class2')]");
var allNodes = div1.Concat(div2);
foreach (HtmlNode div in allNodes)
{
String text = div.InnerHtml;
Debug.WriteLine(text);
}
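One thing to watch with Concat: HtmlAgilityPack's SelectNodes returns null rather than an empty collection when nothing matches, and Concat throws on a null source. A minimal sketch of a guard, written as top-level statements (C# 9 or later). OrEmpty is a hypothetical helper name, and plain string lists stand in for the node collections so the snippet runs without the library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// OrEmpty is a hypothetical helper: it swaps a null sequence for an empty one.
IEnumerable<T> OrEmpty<T>(IEnumerable<T> source) => source ?? Enumerable.Empty<T>();

// Stand-ins for the two SelectNodes results (assumption: the second query matched nothing).
List<string> div1 = new List<string> { "first", "second" };
List<string> div2 = null;

// Concat is now safe even though div2 is null.
List<string> allNodes = OrEmpty(div1).Concat(OrEmpty(div2)).ToList();

foreach (string text in allNodes)
{
    Console.WriteLine(text);
}
```

With the real collections the same pattern would read OrEmpty(div1).Concat(OrEmpty(div2)), assuming div1 and div2 come from SelectNodes as in the question.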

Why don't you simply iterate one after another, instead of concatenating, etc?
foreach (HtmlNode div in div1)
{
String text = div.InnerHtml;
Debug.WriteLine(text);
}
foreach (HtmlNode div in div2)
{
String text = div.InnerHtml;
Debug.WriteLine(text);
}

Create a container list at the beginning, add the results to it, and then loop through the container list:
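// Add on HtmlNodeCollection expects a single HtmlNode, so gather the results in a List<HtmlNode>.
var nodes = new List<HtmlNode>();
// SelectNodes returns null when nothing matches, hence the fallback to an empty sequence (needs System.Linq).
nodes.AddRange(doc.DocumentNode.SelectNodes("//div[contains(@class,'class1')]") ?? Enumerable.Empty<HtmlNode>());
nodes.AddRange(doc.DocumentNode.SelectNodes("//div[contains(@class,'class2')]") ?? Enumerable.Empty<HtmlNode>());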
foreach(HtmlNode node in nodes){
Debug.WriteLine(node.InnerHtml);
}
It is also possible to build up a different query that will get all the class1s and class2s at the same time:
doc.DocumentNode.SelectNodes("//div[contains(@class,'class1') or contains(@class,'class2')]");
Edit after comment @ 22:24:56Z:
If there is only one result for each selector, you could simplify your approach to something like this:
var text1 = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'class1')]")?.InnerHtml ?? String.Empty;
var text2 = doc.DocumentNode.SelectSingleNode("//div[contains(@class,'class2')]")?.InnerHtml ?? String.Empty;
Those question marks are two different operators: ?. is the null-conditional operator (it skips the member access when the node is null) and ?? is the null-coalescing operator (it supplies the fallback value). See:
https://learn.microsoft.com/en-us/dotnet/csharp/language-reference/operators/null-coalescing-operator
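To see how ?. and ?? behave together without pulling in HtmlAgilityPack, here is a small sketch written as top-level statements (C# 9 or later). Find and the pages dictionary are hypothetical stand-ins for SelectSingleNode, which likewise yields null when nothing matches:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical lookup table standing in for the parsed document.
var pages = new Dictionary<string, string> { ["class1"] = "  <b>first</b>  " };

// Hypothetical stand-in for SelectSingleNode: returns null when the key is missing.
string Find(string key) => pages.TryGetValue(key, out var html) ? html : null;

// ?. skips Trim() when Find returns null; ?? then supplies the fallback value.
string text1 = Find("class1")?.Trim() ?? string.Empty;
string text2 = Find("class2")?.Trim() ?? string.Empty;

Console.WriteLine(text1);
Console.WriteLine(text2.Length);
```

Here text1 holds the trimmed markup, while text2 ends up as an empty string instead of throwing a NullReferenceException.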

Related

how to make a for loop for xpath in C#?

I'm trying to write a for loop to get all the data in the div. It worked for one element but not for the other.
My code:
if(websearch != mainSearchUrl) {
var webGet = new HtmlWeb();
var doc = webGet.Load(websearch);
var webnode = doc.DocumentNode.SelectNodes("/html/body/div/div[1]/div/div[2]");
foreach (HtmlNode node in webnode)
{
for (int i = 1; i < 15; i++)
{
var title = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div["+i+"]/div/a");
var chapters = node.SelectSingleNode("/html/body/div/div[1]/div/div[2]/div[1]/div/div[4]"); // error here: when I put i in place of the second-to-last index, the result is null
comboBox1.Items.Add(title.InnerText + chapters.InnerText); // error: chapters is null
}
}
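A likely contributor here is that an XPath starting with / is evaluated from the document root even when it is called on a child node, so every query in the loop ignores the node it was called on. A common alternative is to select the repeating rows once and then use relative paths (no leading slash) inside the loop, checking for null before reading the text. The sketch below is only an illustration under assumptions: it uses System.Xml.Linq and System.Xml.XPath from the base class library instead of HtmlAgilityPack, and the element layout is invented because the real page is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;
using System.Xml.XPath;

// Invented markup: each outer <div> is one row holding a title link and a chapter count.
var doc = XDocument.Parse(
    "<list>" +
    "<div><div><a>One</a><div>ch 1</div></div></div>" +
    "<div><div><a>Two</a><div>ch 7</div></div></div>" +
    "</list>");

var items = new List<string>();
foreach (XElement row in doc.XPathSelectElements("/list/div"))
{
    // No leading slash: both paths are evaluated relative to the current row.
    XElement title = row.XPathSelectElement("div/a");
    XElement chapters = row.XPathSelectElement("div/div");

    // Skip rows where either element is missing instead of dereferencing null.
    if (title != null && chapters != null)
    {
        items.Add(title.Value + " " + chapters.Value);
    }
}

foreach (string item in items)
{
    Console.WriteLine(item);
}
```

With HtmlAgilityPack the same idea would be to call SelectNodes once for the rows and then SelectSingleNode with a relative path such as "./div/a" on each row; the exact paths depend on the real page.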

How I can get data from multiple pages with HTML Agility Pack

Hello, I've got a problem with HTML Agility Pack in C#. Maybe I'm not seeing something I'm doing wrong.
I want to get movies from multiple pages, but when I run my app it gets everything from the first page and repeats that n times (n is a number I supply). For example, the 10 titles from the first page are written 4 times by the loop.
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while(looping)
{
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
if(inc < 4)
{
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
inc++;
}
else
{
looping = false;
}
}
And below is my view code:
@for (int i = 0; i < Model.Count; i++)
{
<p>@Model[i].Tytul</p>
}
I tried different loops and every time the result was the same. Can you help me? I think I'm not seeing my mistake.
Thank you in advance!
There is a logic problem in your code. You get the film titles by looping over tags, but tags always holds the nodes from the first page, because you never reassign it after loading a new page. I made some changes to your code and got the right results:
bool looping = true;
string mainUrl = "https://www.filmweb.pl/films/search";
HtmlWeb web = new HtmlWeb();
HtmlDocument docu = web.Load(mainUrl);
int inc = 0;
var tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
while (looping)
{
if (inc < 4)
{
foreach (var item in tags)
{
Movie mv = new Movie();
mv.Tytul = item.InnerText;
tytuly.Add(mv);
}
var nextPage = docu.DocumentNode.SelectNodes("//a[@title='następna']/@href").ToList();
string link = mainUrl + nextPage[0].Attributes["href"].Value;
var urlDecode = HttpUtility.HtmlDecode(link);
docu = web.Load(urlDecode);
tags = docu.DocumentNode.SelectNodes("//h2[@class='filmPreview__title']");
inc++;
}
else
{
looping = false;
}
}

Html Agility Pack xpath IEnumerable

I cannot include the HTML here because it is very large (five screens or more). Please follow the link in htmlWeb.Load().
I have been looking at this code for two hours and cannot figure out what is wrong.
HtmlWeb htmlWeb = new HtmlWeb {OverrideEncoding = Encoding.Default};
HtmlAgilityPack.HtmlDocument document = htmlWeb.Load("https://www.parimatch.com/en/sport/futbol/germanija-bundesliga");
var matches = document.DocumentNode.SelectNodes("//tr[@class='bk']").
Select(tr => new FootballMatch()
{
Number = string.Join(" ", tr.SelectNodes("./td[1]//text()[normalize-space()]").Select(t =>t.InnerText)),
Time = string.Join(" ", tr.SelectNodes("./td[2]//text()[normalize-space()]").Select(t => t.InnerText)),
Teams = string.Join(" ", tr.SelectNodes("./td[3]//text()[normalize-space()]").Select(t => t.InnerText)),
Allowance = string.Join(" ", tr.SelectNodes("./td[4]//text()[normalize-space()]").Select(t => t.InnerText)),
CoefficientAllowance = string.Join(" ", tr.SelectNodes("./td[5]//text()[normalize-space()]").Select(t => t.InnerText)),
Total = tr.SelectSingleNode("./td[7]//text()[normalize-space()]").InnerText,
P1 = tr.SelectSingleNode("./td[10]//text()[normalize-space()]").InnerText,
X = tr.SelectSingleNode("./td[11]//text()[normalize-space()]").InnerText,
/*P2 = tr.SelectSingleNode("./td[12]//text()[normalize-space()]").InnerText,
P1X = tr.SelectSingleNode("./td[13]//text()[normalize-space()]").InnerText,
P1P2 = tr.SelectSingleNode("./td[14]//text()[normalize-space()]").InnerText,
P2X = tr.SelectSingleNode("./td[15]//text()[normalize-space()]").InnerText*/
});
P2, P1X, P1P2 and P2X are always null.
Also, is it possible to make this code neater?
When you click on an event, a popup menu appears. That data is read too, but I do not need it. How can I exclude it?
This is also not the prettiest, but it works. Some work still needs to be done to separate certain cells, since some <td>s contain <br> to separate lines. Hope this helps you move on.
string xpath = "//tr[@class='bk']";
HtmlNodeCollection matches = htmlDoc.DocumentNode.SelectNodes(xpath);
List<List<string>> footballMatches = new List<List<string>>();
foreach (HtmlNode x in matches)
{
List<string> mess = new List<string>();
HtmlNodeCollection hTC = x.SelectNodes("./td");
if (hTC.Count > 15)
{
for (int i = 0; i < 15; i++)
{
if (i != 5)
{
mess.Add(hTC[i].InnerText);
}
}
}
footballMatches.Add(mess);
}

How to extract the src attribute of an HTML tag?

I have a string in HTML format
<div class="ExternalClass6FC23FEAF7454B3A8006CF7E1D2257B8">
<audio src="/sites/audioblogs/Group2Doc/0.021950338035821915.wav" controls="controls"></audio><br/><img src="/sites/audioblogs/Group2Doc/20140103_152938.jpg" alt=""/></div>
I need only the source (src) attribute.
I'm trying to use Regex.Match.
Is there any other alternative?
Thanks,
Sachin
I'd use HtmlAgilityPack to parse HTML, not regex:
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html); // html is your string
var audio = doc.DocumentNode.Descendants("audio")
.FirstOrDefault(n => n.Attributes["src"] != null);
string src = null;
if (audio != null)
src = audio.Attributes["src"].Value;
Result: /sites/audioblogs/Group2Doc/0.021950338035821915.wav
string yourFullHtmlstring = ".....";
//will make sure all of your double quotes are single quotes
yourFullHtmlstring= yourFullHtmlstring.Replace("\"", "'");
//will turn it into array
string[] arr = yourFullHtmlstring.Split( new string[] {"src='"}, StringSplitOptions.None);
//this will trim the sources found only to the source value.
//start from 1 because we skip the first entry before the first src
for (int i = 1; i < arr.Length; i++)
{
arr[i] = arr[i].Substring(0, arr[i].IndexOf("'"));
}

Read first 3 paragraphs of a long string. [C#, HTML AgilityPack]

I would like to read from a long string and just output the first 3 paragraphs of the string. How do I achieve this? I wanted to use this code to show (n) number of words but I have since changed to paragraphs.
public string MySummary(string html, int max)
{
string summaryHtml = string.Empty;
// load our html document
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
int wordCount = 0;
foreach (var element in htmlDoc.DocumentNode.ChildNodes)
{
// inner text will strip out all html, and give us plain text
string elementText = element.InnerText;
// we split by space to get all the words in this element
string[] elementWords = elementText.Split(new char[] { ' ' });
// and if we haven't used too many words ...
if (wordCount <= max)
{
// add the *outer* HTML (which will have proper
// html formatting for this fragment) to the summary
summaryHtml += element.OuterHtml;
wordCount += elementWords.Count() + 1;
}
else
{
break;
}
}
return summaryHtml ;
}
If by paragraphs you mean <p> tags, get all the child nodes of the document that are <p>s and pull the inner text of the first three.
Edit re comment:
RTFM?
http://htmlagilitypack.codeplex.com/wikipage?title=Examples&referringTitle=Home
something like:
string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(3).Select(n => n.InnerText));
Why don't you just use a string tokenizer and read up to just before where the fourth paragraph starts?
I've just had to do this myself and have come up with a very simplistic but forgiving way of doing this that works fine for our particular scenario:
public string GetParagraphs(string html, int numberOfParagraphs)
{
const string paragraphSeparator = "</p>";
var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
I realise how naive this is regarding the structure of the document, it will also get any non <p> tags between <p>, however in my use case that is actually what I want - maybe that will work for you too?
This is a better answer, but if we want to take paragraphs 2 to 5, what would the code be?
public string GetParagraphs(string html, int numberOfParagraphs) {
const string paragraphSeparator = "</p>";
var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);
return string.Join("", paragraphs.Take(numberOfParagraphs).Select(paragraph => paragraph + paragraphSeparator));
}
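For a range such as paragraphs 2 to 5, one option is to put Skip in front of Take in the same split-based approach. A minimal sketch, written as top-level statements (C# 9 or later). GetParagraphRange is a hypothetical name, first and last are assumed to be 1-based and inclusive, and the sample HTML is made up:

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns paragraphs first..last (1-based, inclusive).
string GetParagraphRange(string html, int first, int last)
{
    const string paragraphSeparator = "</p>";
    var paragraphs = html.Split(new[] { paragraphSeparator }, StringSplitOptions.RemoveEmptyEntries);

    return string.Join("", paragraphs
        .Skip(first - 1)           // drop the paragraphs before the range
        .Take(last - first + 1)    // keep only the requested count
        .Select(paragraph => paragraph + paragraphSeparator));
}

// Made-up input with six paragraphs.
string sample = "<p>1</p><p>2</p><p>3</p><p>4</p><p>5</p><p>6</p>";
string middle = GetParagraphRange(sample, 2, 5);
Console.WriteLine(middle);
```

It has the same limitation as the original split on </p>: anything between two paragraphs is carried along with the paragraph that follows it.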
You have to use HtmlAgilityPack.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(HtmlContent);
string Html = string.Join(" ", doc.DocumentNode.SelectNodes("//p").Take(2).Select(n => n.OuterHtml).ToArray());
