I am working with an XML document that contains RSS feeds, and I want to parse it so that any div tag with the class name "feedflare" is removed entirely.
I could not find an example of doing this as the search for it is polluted with "HTML editor errors" and other irrelevant data.
Would anyone here be kind enough to share a method for doing this?
I must state that I DO NOT want to use HtmlAgilityPack if I can avoid it.
This is my process:
Load the XML, parse through the elements and pick out Title, Description and Link.
Then save all of this as HTML (with tags added programmatically to build a web page). Once all of the tags are added, I want to parse the resulting "HTML text" and remove the annoying DIV tag.
Let's assume "string HTML = textBox1.Text", where textBox1 is where the resulting HTML is pasted after parsing the main XML document.
How would I then loop through the contents of textBox1.Text and remove ONLY the div tag with the class "feedflare" (see below)?
<div class="feedflare">
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:yIl2AUoC8zA">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=yIl2AUoC8zA" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:H0mrP-F8Qgo">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=H0mrP-F8Qgo" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU" border="0"></img></a>
</div>
Thank you in advance.
Using this xml library, do:
XElement root = XElement.Load(file); // or .Parse(string);
XElement div = root.XPathElement("//div[@class={0}]", "feedflare");
div.Remove();
root.Save(file); // or string = root.ToString();
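The XPathElement call above comes from the linked library, not from the framework. As a rough sketch using only the built-in LINQ to XML classes, and assuming the generated HTML is well-formed XML (real-world HTML often is not), something like the following could remove every matching div. The class and method names here are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class FeedflareRemover
{
    // Hypothetical helper: removes every <div> whose class attribute equals className.
    public static string RemoveDivsByClass(string xml, string className)
    {
        XElement root = XElement.Parse(xml);

        // Extensions.Remove() snapshots the sequence first, so it is safe
        // to remove the elements while the query is being enumerated.
        root.Descendants("div")
            .Where(d => (string)d.Attribute("class") == className)
            .Remove();

        return root.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        string html = "<body><p>Story text</p><div class=\"feedflare\">" +
                      "<a href=\"#\"><img src=\"x\" border=\"0\"></img></a></div></body>";
        Console.WriteLine(RemoveDivsByClass(html, "feedflare"));
        // prints <body><p>Story text</p></body>
    }
}
```

Note that the comparison is an exact match on the whole class attribute, so a div with class="feedflare extra" would be left alone.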
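Try this:
System.Xml.XmlDocument d = new System.Xml.XmlDocument();
d.LoadXml(Your_XML_as_String);
// Select only the feedflare divs and copy them to a list (needs using System.Linq;) so removal does not disturb the enumeration.
var divs = d.SelectNodes("//div[@class='feedflare']").Cast<System.Xml.XmlNode>().ToList();
foreach (System.Xml.XmlNode n in divs)
    n.ParentNode.RemoveChild(n); // remove each div from its own parent, not from the document
Then use d.OuterXml to retrieve the new XML.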
My solution in Javascript is:
function unrichText(texto) {
var n = texto.indexOf("\">"); //Finding end of "<div class="ExternalClass...">
var sub = texto.substring(0, n+2); //Adding first char and last two (">)
var tmp = texto.replace(sub, ""); //Removing it
tmp = replaceAll(tmp, "</div>", ""); //Removing last "div"
tmp = replaceAll(tmp, "<p>", ""); //Removing other stuff
tmp = replaceAll(tmp, "</p>", "");
tmp = replaceAll(tmp, " ", "");
return (tmp);
}
function replaceAll(str, find, replace) {
return str.replace(new RegExp(find, 'g'), replace);
}
Related
I want to match the first empty P tag in each DIV and insert some text. I am using the regular expression (<p[^>]*>)(</p>), but it matches every empty P tag inside the DIV.
var yourDivString = "<DIV WITH Paragraph Tag(s) and many other tags>";
yourDivString = Regex.Replace(yourDivString, "(<p[^>]*>)(</p>)", "THIS IS FIRST EMPTY P TAG in EACH DIV");
Example:
<div>
<p></p>
<p></p>
</div>
Expected output:
<div>
<p>THIS IS FIRST EMPTY P TAG in EACH DIV</p>
<p></p>
</div>
Note: we are not parsing any HTML files. It's only a few strings.
We can achieve this using HtmlAgilityPack.
Code explanation: create an instance of HtmlDocument and load the HTML string, select the first p node in the string, and set its text through InnerHtml. If you do not need to create or save a document, you can read OuterHtml directly to see the output.
using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(yourString);
var p = doc.DocumentNode.SelectSingleNode("//p");
p.InnerHtml = "THIS IS FIRST EMPTY P TAG in EACH DIV";
yourString = doc.DocumentNode.OuterHtml;
Console.WriteLine(yourString);
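If HtmlAgilityPack is not available, a regex-only variant is also possible for simple strings. The sketch below is an illustration rather than a general HTML parser: it assumes the div blocks are not nested, and the class and method names are made up. It relies on the instance overload Regex.Replace(input, replacement, count), which stops after count replacements.

```csharp
using System;
using System.Text.RegularExpressions;

class FirstEmptyParagraph
{
    // Matches one <div>...</div> block (assumption: divs are not nested).
    static readonly Regex DivBlock = new Regex(
        @"<div\b[^>]*>.*?</div>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Matches an empty <p></p> pair, capturing the opening and closing tags.
    static readonly Regex EmptyP = new Regex(
        @"(<p\b[^>]*>)(</p>)", RegexOptions.IgnoreCase);

    public static string FillFirstEmptyP(string html, string text)
    {
        // Escape '$' so the text is not read as a substitution pattern.
        string replacement = "${1}" + text.Replace("$", "$$") + "${2}";

        // For every div block, replace only the first empty paragraph (count = 1).
        return DivBlock.Replace(html, m => EmptyP.Replace(m.Value, replacement, 1));
    }

    static void Main()
    {
        string input = "<div>\n<p></p>\n<p></p>\n</div>";
        Console.WriteLine(FillFirstEmptyP(input, "THIS IS FIRST EMPTY P TAG in EACH DIV"));
    }
}
```

For the example input from the question, only the first of the two empty paragraphs receives the text; the second stays empty.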
I'm trying to build a web scraping tool for a news website. I'm having problems selecting the relevant text since the text is divided into multiple different elements. I'm using HTML Agility Pack and I have tried to select text ( //text() ) from the main div, but when I do this I get a lot of garbage text I don't want, like javascript code.
How can I select text from some nested elements and ignore other elements?
<div class="texto_container paywall">
Some text I want
<a href="https://www.sabado.pt/sabermais/ana-gomes" target="_blank" rel="noopener">
Text I want
</a>
sample of text I want
<em>
another text i want
</em>
<aside class="multimediaEmbed contentRight">
A lot of nested elements here with some text I dont want
</aside>
<div class="inContent">
A lot of nested elements here with some text I don't want
</div>
Back to the text I want!
<twitter-widget class="twitter-tweet twitter-tweet-rendered" id="twitter-widget-0" >
Don't want any of this text located in nested elements!
</twitter-widget>
<p>
Final revelant text i want to collect!
</p>
</div>
EDIT
I tried to use XPath to exclude the tags I don't want, but I still get text nodes from those tags in the result.
var parse_me = htmlDoc.DocumentNode.SelectNodes("//div[@class='texto_container paywall']//text()[not(parent::aside)][not(parent::div[@class='inContent'])][not(parent::twitter-widget)]");
I think this doesn't work because, inside the tags I want to exclude, the text's parent node isn't the "main" tag itself; the text sits inside many nested tags.
EDIT
After some thinking and some research, I fixed the previous problem by using ancestor:: instead of parent::, which got rid of some of the unwanted text.
But I still can't get rid of the twitter-widget text, because the lookup always returns a null node, even with the XPath copied from Chrome's inspect element tool.
var Twitter_Node = htmlDoc.DocumentNode.SelectSingleNode("//*[@id='twitter - widget - 0']");
This returns null. How is this possible? The XPath was copied from Chrome.
You can try to exclude the text from specific tags:
//body//text()[not(parent::aside)][not(parent::div[@class="inContent"])][not(parent::twitter-widget)]
You could use concat, but it's more complicated since you have to know the number and the position of each tag in the "chain":
concat(//body//div[@class="texto_container paywall"]/text()[1],//body//a[@href]/text(),//body//div[@class="texto_container paywall"]/text()[2],//body//em/text(),//body//div[@class="texto_container paywall"]/text()[5],//body//p/text())
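When the unwanted text is nested several levels below the excluded tag, parent:: only checks the immediate parent; the ancestor:: axis mentioned in the question's second edit checks every enclosing element. Here is a minimal sketch of that filter. It uses System.Xml instead of HtmlAgilityPack so that it runs without extra packages, and it therefore assumes the markup is well-formed XML. The class and method names are made up.

```csharp
using System;
using System.Linq;
using System.Xml;

class AncestorFilterSketch
{
    // Returns the trimmed, non-empty text nodes under the main div,
    // skipping anything nested inside the excluded containers.
    public static string[] WantedText(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList nodes = doc.SelectNodes(
            "//div[@class='texto_container paywall']//text()" +
            "[not(ancestor::aside)]" +
            "[not(ancestor::div[@class='inContent'])]" +
            "[not(ancestor::twitter-widget)]");

        return nodes.Cast<XmlNode>()
                    .Select(n => n.Value.Trim())
                    .Where(s => s.Length > 0)
                    .ToArray();
    }

    static void Main()
    {
        string xml =
            "<div class='texto_container paywall'>Some text I want" +
            "<aside><span>unwanted</span></aside>" +
            "<div class='inContent'><b>unwanted too</b></div>" +
            "<em>another text I want</em>" +
            "<twitter-widget><p>tweet</p></twitter-widget></div>";

        foreach (string s in WantedText(xml))
            Console.WriteLine(s);
    }
}
```

The same XPath string should carry over to HtmlAgilityPack's SelectNodes, though that is untested here.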
I am using the ScrapySharp NuGet package, which adds the CssSelect method used in my sample below. (It's possible HtmlAgilityPack offers the same functionality built in; I am just used to ScrapySharp from years ago.)
You can simply extract each text you do NOT want, then replace its occurrences in the main div's text with an empty string, which removes it from the final result.
var doc = new HtmlDocument();
doc.Load(@"C:\Desktop\z.html"); // I created an HTML file with your sample HTML set as the body
List<string> textsIWant = new List<string>();
var textsIdoNotWant = new List<string>();
//text I do not want
var aside = doc.DocumentNode.CssSelect(".multimediaEmbed.contentRight").FirstOrDefault();
if (aside != null)
{
textsIdoNotWant.Add(aside.InnerText);
}
var inContent = doc.DocumentNode.CssSelect(".inContent").FirstOrDefault();
if (inContent != null)
{
textsIdoNotWant.Add(inContent.InnerText);
}
var twitterWidget = doc.DocumentNode.CssSelect("#twitter-widget-0").FirstOrDefault();
if (twitterWidget != null)
{
textsIdoNotWant.Add(twitterWidget.InnerText);
}
var div = doc.DocumentNode.CssSelect(".texto_container.paywall").FirstOrDefault();
if (div != null)
{
var text = div.InnerText;
foreach (var textIDoNotWant in textsIdoNotWant)
{
text = text.Replace(textIDoNotWant, string.Empty);
}
textsIWant.Add(text);
}
foreach (var text in textsIWant)
Console.WriteLine(text);
Is it possible to remove the whole div with a specific class name? For example;
<body>
<div class="head">...</div>
<div class="container">...</div>
<div class="foot">...</div>
</body>
I would like to remove the div with the "container" class.
A C# code example would be very useful, thank you.
The proper way (I suppose) to do this is via built in Gecko DOM classes and methods.
So, in your case something like:
var containers = yourDocument.GetElementsByClassName("container");
// This returns an IEnumerable of elements with this class. If you will only ever have one, you can do it like this:
var yourContainer = containers.FirstOrDefault();
yourContainer.Parent.RemoveChild(yourContainer);
Obviously, you can also do loops etc.
If you want to parse HTML in C#, the best way is to use Html Agility Pack:
https://htmlagilitypack.codeplex.com/
HtmlDocument document = new HtmlDocument();
document.Load(@"C:\yourfile.html");
// SelectNodes returns null when nothing matches; ToList() (using System.Linq;) snapshots the matches before removal.
var nodesToRemove = document.DocumentNode.SelectNodes("//div[@class='container']").ToList();
foreach (var node in nodesToRemove)
    node.Remove();
Well, with the help of a regex, you can remove the div you want:
var data = "<body>\n<div class=\"head\">...</div>\n" +
"<div class=\"container\">...</div>\n" +
"<div class=\"foot\">...</div>\n</body>";
var rxStr = "<div[^<]+class=([\"'])container\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var nStr = rx.Replace (data, "");
Console.WriteLine (nStr);
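This will reduce your string to the following (a blank line remains where the div was, because the regex does not consume the newline):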
<body>
<div class="head">...</div>
<div class="foot">...</div>
</body>
I've got this code repeated in a div tag and want to write an XPath expression to find the dsd link so that I can click on it, based on the text in the h4 tag. Changing the HTML isn't an option.
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>
At the moment I'm trying something like this, where name is the text of the h4 tag:
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("../div/a[contains(@href,'dsd')]"));
submitButton.Click();
But I'm unable to get this to work. Any suggestions would be gratefully received.
I do not see an issue with your xpaths. The HTML you supplied is invalid due to your placeholders, but your xpaths appear to work with this:
void Main()
{
var xml = @"
<div>
<h4>Test Block</h4>
<br/>
<div>
Option 1
Option 2
</div>
</div>";
var xmldoc = new XmlDocument();
xmldoc.LoadXml(xml);
var node = xmldoc.DocumentElement.SelectSingleNode("//div/h4[contains(text(),'Test Block')]");
node = node.SelectSingleNode("../div/a[contains(@href,'dsd')]");
Console.WriteLine(node.InnerText);
}
I don't have a working machine, so I can't test this, but you said any feedback would be well received. I'm pretty sure that with XPath you can grab an individual child element by position. If you know for sure that this HTML will always be the same, you could do:
../div[1] //(First div child of the parent; XPath positions start at 1, not 0)
You could use //div[h4[contains(., 'Test Block')]]//a[contains(@href, 'dsd')]. Also something like //div[h4[contains(., 'Test Block')]]//a[contains(., 'Option 1')] should work.
Why don't you use the following-sibling axis?
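To illustrate the single-expression form above outside Selenium, here is a small sketch using System.Xml. The anchor tags and their href values are invented for the example, since the links were not shown in the question, and the helper name is hypothetical.

```csharp
using System;
using System.Xml;

class H4LinkLookup
{
    // Finds the link whose href contains 'dsd' inside the div headed by the given h4 text.
    // NOTE: heading is concatenated into the XPath, so it must not contain a single quote.
    public static string FindLinkText(string xml, string heading)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode link = doc.SelectSingleNode(
            "//div[h4[contains(., '" + heading + "')]]//a[contains(@href, 'dsd')]");

        return link?.InnerText; // null when no such link exists
    }

    static void Main()
    {
        // Assumed markup: the hrefs below are placeholders.
        string xml = @"<div>
  <h4>Test Block</h4>
  <br/>
  <div>
    <a href='/abc/1'>Option 1</a>
    <a href='/dsd/2'>Option 2</a>
  </div>
</div>";

        Console.WriteLine(FindLinkText(xml, "Test Block")); // prints Option 2
    }
}
```

In Selenium the same expression could be passed to By.XPath in a single FindElement call, which avoids the two-step lookup.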
var findSubmitButton = Driver.FindElement(By.XPath("//div/h4[contains(text(), '" + name + "')]"));
var submitButton = findSubmitButton.FindElement(By.XPath("following-sibling::div/a[contains(@href,'dsd')]"));
Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page, so I created a parent div named TwitterSpot, put its InnerHtml into a string, and loaded that as a new HtmlDocument.
Next I want to find a string value of "#XXXX+n " within it and replace it on the page with some cool formatting.
I am stuck on my foreach loop: I do not know how to search for a # or how to look through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I know I could probably do this a lot more easily in JavaScript, but I am adamant about seeing how I can get ASP.NET C# to do it.
The # is a string value within the HTML; I am not referring to it as a Control ID.
Assuming you're using HtmlAgilityPack, you could use xpath to find text nodes which contain your value:
var matchedNodes = document.DocumentNode
.SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you could just iterate through these nodes and make all the necessary replacements:
foreach (HtmlTextNode node in matchedNodes)
{
node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
}
You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.
I guess you could use RegEx to find all matches and loop through them.
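A rough sketch of that regex approach follows. It assumes a tag is a '#' followed by word characters, which is a guess since the question only shows the placeholder "#XXXX+n ". The class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

class TagHighlighter
{
    // Assumption: a tag is '#' followed by one or more word characters.
    static readonly Regex Tag = new Regex(@"#\w+");

    // Wraps every match in <b>...</b>; the MatchEvaluator lambda runs once per match.
    public static string Highlight(string html)
    {
        return Tag.Replace(html, m => "<b>" + m.Value + "</b>");
    }

    static void Main()
    {
        Console.WriteLine(Highlight("Hello #world and #csharp!"));
        // prints Hello <b>#world</b> and <b>#csharp</b>!
    }
}
```

Because this works on the raw string, it would also match a '#' inside attribute values such as href="#top" or colour codes, so restricting it to text nodes (as in the HtmlAgilityPack answer) is safer for real pages.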
You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...
When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like you don't really need to do that for your purposes, though. I'd recommend making the divs into server controls and programmatically looping through their child controls, like this:
foreach (Control c in divSomething.Controls)
{
    // Pattern matching replaces the invalid GetType == "TextBox" comparison.
    if (c is TextBox textBox && textBox.ID == "txtSomething")
    {
        textBox.Attributes.Add("style", "font-family: Arial; color: Red;");
    }
}