Process HTML Markup in C#

I want to process some HTML markup. For example, this anchor:
<a id="flFileList_gvDoItFiles_btnContent_1" href="javascript:__doPostBack('flFileList$gvDoItFiles$ctl03$btnContent','')">Untitled.png.3154ROGG635264188946573079.png</a>
should be changed to:
<a id="flFileList_gvDoItFiles_btnContent_1" href="javascript:__doPostBack('flFileList$gvDoItFiles$ctl03$btnContent','')">Untitled.png</a>
I want to achieve this using C# string processing, but I have no idea how to approach it.
I already have the logic that converts
Untitled.png.3154ROGG635264188946573079.png to
Untitled.png
What I am stuck on is how to identify and replace that string within the markup.
Should I use String.Split()?

I suggest using HtmlAgilityPack to parse the HTML. You can easily get an element by its id and then replace its inner text:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html_string);
// XPath attribute tests use @
string xpath = "//a[@id='flFileList_gvDoItFiles_btnContent_1']";
var a = doc.DocumentNode.SelectSingleNode(xpath); // null if no such element exists
a.InnerHtml = ConvertValue(a.InnerHtml); // call your logic for converting value
string result = a.OuterHtml; // use doc.DocumentNode.OuterHtml for the whole document
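If pulling in a parser is not an option and the markup is as regular as the sample, a plain Regex.Replace with a MatchEvaluator can rewrite the anchor text directly. This is only a sketch under assumptions: ConvertValue here is a hypothetical stand-in for the asker's own conversion logic (it assumes names of the form name.ext.TOKEN.ext), and AnchorTextRewriter and RewriteAnchorText are made-up names. It will not cope with nested tags inside the anchor.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorTextRewriter
{
    // Hypothetical stand-in for the asker's conversion logic:
    // "name.ext.TOKEN.ext" becomes "name.ext"; any other input is returned unchanged.
    public static string ConvertValue(string fileName)
    {
        return Regex.Replace(fileName, @"^(.+?\.(\w+))\.\w+\.\2$", "$1");
    }

    // Rewrites the plain text that sits directly between '>' and '</a>'.
    public static string RewriteAnchorText(string html)
    {
        return Regex.Replace(
            html,
            @">([^<>]+)</a>",
            m => ">" + ConvertValue(m.Groups[1].Value) + "</a>");
    }

    static void Main()
    {
        string html =
            "<a id=\"flFileList_gvDoItFiles_btnContent_1\" " +
            "href=\"javascript:__doPostBack('flFileList$gvDoItFiles$ctl03$btnContent','')\">" +
            "Untitled.png.3154ROGG635264188946573079.png</a>";

        // Prints the same anchor with "Untitled.png" as its text.
        Console.WriteLine(RewriteAnchorText(html));
    }
}
```

The MatchEvaluator overload is used so that the replacement is computed per match; the parser-based approach above remains the safer choice for markup you do not control.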

Related

HtmlAgilityPack: parse HTML and take a value out of a span tag by class name

I have an HTML page that I download via my web request client. Out of the entire HTML I want to parse only this part:
<span class="sku">
<span class="fb">SKU :</span>118880101
</span>
I'm using Html Agility Pack to retrieve this value: 118880101
I've written something like this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
return htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']").ElementAt(0).InnerText;
This returns the following value from the HTML:
SKU :118880101
Literally like this, whitespace included. How can I fix this logic with Html Agility Pack so that I get only the 118880101 value?
Can someone help me out?
Edit: a string operation like this would do the job:
Substring(skuRaw.LastIndexOf(':') + 1);
It takes everything after the ':' sign in the string that I receive, but I'm not sure whether it is safe to rely on that.

Try this:
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);
var innerText = htmlDoc.DocumentNode.SelectNodes("//span[@class='sku']")
    .ElementAt(0).InnerText;
// Strip every non-digit character (needs using System.Text.RegularExpressions;)
return Regex.Replace(innerText, @"\D", "");
If you want to use only Html Agility Pack, remove the label span and read what is left:
// SelectSingleNode returns null when nothing matches
var child = htmlDoc.DocumentNode.SelectSingleNode("//span[@class='sku']/span[@class='fb']");
if (child != null)
{
    var parent = child.ParentNode;
    parent.RemoveChild(child);
    var innerText = parent.InnerText.Trim();
}
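For comparison, here is a small sketch of the two string-only clean-ups discussed in this thread, applied to the InnerText value. SkuText, DigitsOnly and AfterLastColon are names invented for this illustration, and the sample input only approximates what InnerText returns for the markup in the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class SkuText
{
    // Keeps only the characters 0-9 of the input.
    public static string DigitsOnly(string text)
    {
        return Regex.Replace(text, "[^0-9]", "");
    }

    // Takes everything after the last ':' and trims surrounding whitespace.
    // Without a ':' the whole (trimmed) input is returned, because LastIndexOf gives -1.
    public static string AfterLastColon(string text)
    {
        return text.Substring(text.LastIndexOf(':') + 1).Trim();
    }

    static void Main()
    {
        string innerText = "\n  SKU :118880101\n";
        Console.WriteLine(DigitsOnly(innerText));     // 118880101
        Console.WriteLine(AfterLastColon(innerText)); // 118880101
    }
}
```

DigitsOnly breaks if the label ever contains a digit, and AfterLastColon breaks if the value ever contains a colon, so which one is safer depends on the data.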

Retrieve the last match (or a list of matches) with a regular expression and then work with it

My issue is that I download the HTML content of a page into a string with
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://prices.shufersal.co.il/");
and I am trying to retrieve the last page number from the navigation menu:
<a data-swhglnk="true" href="/?page=2">2</a>
So in the end I want to find the last data-swhglnk link and retrieve the last page number from it.
I tried
Regex.Match(webData, @"swhglnk", RegexOptions.RightToLeft);
I would be happy to understand the right approach to problems like this.

If you are going to parse HTML and look for information in it, you should use a method more reliable than regex, e.g.:
- HtmlAgilityPack https://htmlagilitypack.codeplex.com/
- CsQuery https://github.com/jamietre/CsQuery
and operate on objects, not strings.
Update
If you decide to use HtmlAgilityPack, you will have to write code like this:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(webData);
// Select every <a> that has a data-swhglnk attribute (XPath uses @ for attributes)
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//a[@data-swhglnk]"))
{
    HtmlAttribute data = node.Attributes["data-swhglnk"];
    // do your processing here
}

Get text that lies after pattern without class or id

I am using the HtmlAgilityPack.
It is an excellent tool for parsing data; however, every time I have used it I have had either a class or an id to aim at, e.g.
string example = doc.DocumentNode.SelectSingleNode("//div[@class='target']").InnerText.Trim();
Now I have come across a piece of text that isn't nested in any element with a class or id I can aim at, e.g.
<p>Example Header</p>: This is the text I want!<br>
The example does always follow the same pattern, i.e. the text always comes after </p>: and before <br>.
I can extract the text using a regular expression, but I would prefer to use the Agility Pack since the rest of the code does. Is there a way of doing this with the pack?

This XPath works for me:
var html = @"<div class=""target"">
<p>Example Header</p>: This is the text I want!<br>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/div[@class='target']/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
/text() selects all text nodes that are direct children of the <div>.
[(normalize-space())] excludes all text nodes that contain only whitespace (two newlines are excluded in this HTML sample: one before <p> and the other after <br>).
Result:
: This is the text I want!
UPDATE I:
Every element must have a parent, like the <div> in the example above. If it is the root node you are talking about, the same approach should still work. The key is to use the /text() XPath step to get the text node:
var html = @"<p>Example Header</p>: This is the text I want!<br>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
var result = doc.DocumentNode.SelectSingleNode("/text()[(normalize-space())]").OuterHtml;
Console.WriteLine(result);
UPDATE II:
OK, so you want to select the text node after the <p> element and before the <br> element. You can use this XPath then:
var result =
doc.DocumentNode
.SelectSingleNode("/text()[following-sibling::br and preceding-sibling::p]")
.OuterHtml;

Removing DIV from a text file if it contains a certain classname

I am currently working with an XML document that contains RSS feeds. I want to parse it so that if a div tag with the class name "feedflare" is found, the code removes the whole div.
I could not find an example of doing this, as searches for it are polluted with "HTML editor errors" and other irrelevant results.
Would anyone here be kind enough to share a method for reaching my goal?
I must state that I DO NOT want to use HtmlAgilityPack if I can avoid it.
This is my process:
Load the XML, parse through the elements and pick out Title, Description and Link.
Then save all this as HTML (with tags being added programmatically to build a web page). Once all of the tags are added, I want to parse the resulting "HTML text" and remove the annoying div tag.
Let's assume string HTML = textBox1.Text, where textBox1 is where the resulting HTML is pasted after parsing the main XML document.
How would I then loop through the contents of textBox1.Text and remove ONLY the div tag with the class "feedflare" (see below)?
<div class="feedflare">
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:yIl2AUoC8zA">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=yIl2AUoC8zA" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:H0mrP-F8Qgo">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?d=H0mrP-F8Qgo" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:D7DqB2pKExk" border="0"></img></a>
<a href="http://feeds.gawker.com/~ff/kotaku/full?a=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU">
<img src="http://feeds.feedburner.com/~ff/kotaku/full?i=lB-zYAGjzDU:1zqeSgzxt90:V_sGLiPBpWU" border="0"></img></a>
</div>
Thank you in advance.

Using this xml library, do:
XElement root = XElement.Load(file); // or .Parse(string);
XElement div = root.XPathElement("//div[@class={0}]", "feedflare");
div.Remove();
root.Save(file); // or string = root.ToString();
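The same removal can also be sketched with nothing but the standard System.Xml.Linq types, in case the extension library above is not available. FeedCleaner and RemoveDivsByClass are names made up for this example, and it assumes the input is well-formed XML without a default namespace on the div elements.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class FeedCleaner
{
    // Removes every <div> whose class attribute equals className
    // and returns the remaining XML as a string.
    public static string RemoveDivsByClass(string xml, string className)
    {
        XElement root = XElement.Parse(xml);

        root.Descendants("div")
            .Where(d => (string)d.Attribute("class") == className)
            .Remove(); // Extensions.Remove snapshots the sequence before detaching nodes

        return root.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        string xml =
            "<item><p>keep</p>" +
            "<div class=\"feedflare\"><a href=\"x\"><img src=\"y\" border=\"0\"></img></a></div>" +
            "</item>";

        Console.WriteLine(RemoveDivsByClass(xml, "feedflare")); // <item><p>keep</p></item>
    }
}
```

The cast (string)d.Attribute("class") yields null for a div without a class attribute, so such elements are simply skipped.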

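Try this:
System.Xml.XmlDocument d = new System.Xml.XmlDocument();
d.LoadXml(Your_XML_as_String);
// Snapshot the matches first so that removing nodes does not disturb the iteration (needs using System.Linq;)
var divs = d.SelectNodes("//div[@class='feedflare']").Cast<System.Xml.XmlNode>().ToList();
foreach (System.Xml.XmlNode n in divs)
    n.ParentNode.RemoveChild(n); // remove from the node's actual parent, not from the document
Then use d.OuterXml to retrieve the new XML.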

My solution in JavaScript is:
function unrichText(texto) {
var n = texto.indexOf("\">"); //Finding end of "<div class="ExternalClass...">
var sub = texto.substring(0, n+2); //Adding first char and last two (">)
var tmp = texto.replace(sub, ""); //Removing it
tmp = replaceAll(tmp, "</div>", ""); //Removing last "div"
tmp = replaceAll(tmp, "<p>", ""); //Removing other stuff
tmp = replaceAll(tmp, "</p>", "");
tmp = replaceAll(tmp, " ", "");
return (tmp);
}
function replaceAll(str, find, replace) {
return str.replace(new RegExp(find, 'g'), replace);
}

Parse through current page

Is there a way to get a page to parse through itself?
So far I have:
string whatever = TwitterSpot.InnerHtml;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(whatever);
foreach("this is where I am stuck")
{
}
I want to parse the page, so I created a parent div named TwitterSpot, put its InnerHtml into a string, and loaded that as a new HtmlDocument.
Next I want to find a string value of "#XXXX+n " within it and replace it in the page with some cool formatting.
I am stuck on my foreach loop: I do not know how to search for a # or how to look through the loaded HtmlDocument.
The next step is to apply the change wherever I have seen a # tag. I could probably do this a lot more easily in JavaScript, I know, but I am adamant about seeing how I can get ASP.NET C# to do it.
The # is a string value within the HTML; I am not referring to it as a control ID.

Assuming you're using HtmlAgilityPack, you could use XPath to find the text nodes which contain your value:
var matchedNodes = doc.DocumentNode
    .SelectNodes("//text()[contains(.,'#XXXX+n ')]");
Then you could just iterate through these nodes and make all the necessary replacements:
if (matchedNodes != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlTextNode node in matchedNodes)
    {
        node.Text = node.Text.Replace("#XXXX+n ", "brand new text");
    }
}

You can use http://htmlagilitypack.codeplex.com/ to parse HTML and manipulate its content; works very well.

I guess you could use RegEx to find all matches and loop through them.

You could just change it to be:
string whatever = TwitterSpot.InnerHtml;
whatever = whatever.Replace("#XXXX+n ", String.Format("<b>{0}</b>", "#XXXX+n "));
No parsing required...
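When the tag text is not one fixed string, the same no-parser idea can be sketched with Regex.Replace. The pattern, the <b> wrapper and the names HashtagFormatter and BoldHashtags are assumptions made for this illustration, not something taken from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class HashtagFormatter
{
    // Wraps every "#word" in <b>...</b>. The lookbehind skips a '#' that follows
    // '&' or a word character, so character references such as "&#39;" are left alone.
    public static string BoldHashtags(string html)
    {
        return Regex.Replace(html, @"(?<![&\w])#\w+", m => "<b>" + m.Value + "</b>");
    }

    static void Main()
    {
        string html = "Loving #dotnet &#39;today&#39; #csharp!";
        // Prints: Loving <b>#dotnet</b> &#39;today&#39; <b>#csharp</b>!
        Console.WriteLine(BoldHashtags(html));
    }
}
```

Be aware that a plain string or regex replacement cannot tell text from attribute values: a link such as href="#top" would also be rewritten and the markup broken. Replacing only inside text nodes avoids that problem.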

When I did this before, I stored the HTML in an XML doc and looped through each node. You can then apply XSLT or just parse the nodes.
It sounds like for your purposes though that you don't really need to do that. I'd recommend making the divs into server controls and programmatically looping through their child controls, as such:
foreach (Control c in divSomething.Controls)
{
    // Pattern matching checks the type and casts in one step
    if (c is TextBox tb && tb.ID == "txtSomething")
    {
        tb.Attributes.Add("style", "font-family: Arial; color: Red;");
    }
}
