Match and replace string in text using regular expressions - c#

I have a large string and it might have the following:
<div id="Specs" class="plinks">
<div id="Specs" class="plinks2">
<div id="Specs" class="sdfsf">
<div id="Specs" class="ANY-OTHER_NAME">
How can I replace any of the above in the string with:
<div id="Specs" class="">
This is what I came up with, but it does not work:
string source = "bunch of text";
string regex = "<div id=\"Specs\" class=[\"']([^\"']*)[\"']>";
string regexReplaceTo = "<div id=\"Specs\" class=\"\">";
string output = Regex.Replace(source, regex, regexReplaceTo);

What about...
Regex to match : class=\"[A-Za-z0-9_\-]+\"
Replace with : class=\"\"
This way, we ignore the first part (id="Specs", etc) and
just replace the class name... with nothing.
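A minimal sketch of the regex route, assuming the attributes always appear in the order id, then class, as in the question. Unlike the pattern above, it only touches divs with id="Specs" and accepts either quote style. The names `ClassStripper` and `ClearSpecsClass` are hypothetical, chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ClassStripper
{
    // Hypothetical helper: blanks the class value, but only in <div id="Specs" class="..."> tags.
    public static string ClearSpecsClass(string source)
    {
        // Group 1 keeps everything up to "class="; group 2 remembers which quote character was used.
        const string pattern = @"(<div\s+id=""Specs""\s+class=)([""'])[^""']*\2";
        // Re-emit the prefix followed by an empty pair of the same quotes.
        return Regex.Replace(source, pattern, "$1$2$2");
    }

    public static void Main()
    {
        Console.WriteLine(ClearSpecsClass("<div id=\"Specs\" class=\"plinks2\">"));
    }
}
```

This still breaks as soon as the attribute order or spacing changes, which is the weakness the next answer points out.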

Looks like another case of http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html. What happens to the following valid tags with a Regex?
<div class="reversed" id="Specs">
<div id="Specs" class="additionalSpaces" >
<div id="Specs" class="additionalAttributes" style="" >
I don't see how using Linq2Xml wouldn't work with any combination:
XElement root = XElement.Parse(xml); // or XDocument.Load(xmlFile).Root
// Attribute(name) returns null when the attribute is missing, so no Any()/First() pair is needed
var specsDivs = root.Descendants("div")
    .Where(e => (string)e.Attribute("id") == "Specs"
             && e.Attribute("class") != null);
foreach (var div in specsDivs)
{
    div.Attribute("class").Value = string.Empty;
}
string newXml = root.ToString();
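To check the claim against the three tag variants listed above, here is a small self-contained sketch. It assumes the input is well-formed XML, so the tags are written as self-closing elements under a made-up `root` wrapper; `SpecsDivDemo` and `ClearSpecsClasses` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class SpecsDivDemo
{
    // Hypothetical helper: empties the class attribute of every div whose id is "Specs".
    public static string ClearSpecsClasses(string xml)
    {
        XElement root = XElement.Parse(xml);
        var specsDivs = root.DescendantsAndSelf("div")
            .Where(e => (string)e.Attribute("id") == "Specs" && e.Attribute("class") != null)
            .ToList();
        foreach (var div in specsDivs)
        {
            div.SetAttributeValue("class", string.Empty);
        }
        return root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Reversed attribute order, extra spaces, and an extra attribute.
        string xml = "<root>"
            + "<div class=\"reversed\" id=\"Specs\" />"
            + "<div id=\"Specs\" class=\"additionalSpaces\" />"
            + "<div id=\"Specs\" class=\"additionalAttributes\" style=\"\" />"
            + "</root>";
        Console.WriteLine(ClearSpecsClasses(xml));
    }
}
```

Because the parser exposes attributes by name, the order and spacing in the source text no longer matter.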

If your input isn't XML-compliant, which most HTML isn't, you can use the HTML Agility Pack to parse the HTML and manipulate its contents. With the HTML Agility Pack, combined with LINQ or XPath, the order of your attributes no longer matters (which it does when you use a regex), and the solution becomes much more robust.
Using the HTML Agility Pack (project page, nuget), this does the trick:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("your html here");
// or doc.Load(stream);
var nodes = doc.DocumentNode.Descendants("div").Where(div => div.Id == "Specs");
foreach (var node in nodes)
{
var classAttribute = node.Attributes["class"];
if (classAttribute != null)
{
classAttribute.Value = string.Empty;
}
}
var fixedText = doc.DocumentNode.OuterHtml;
//doc.Save(/* stream */);

Related

c# Regex inside some html tag

I have been trying for a few hours to use a regex to get the text inside some HTML tags:
<div class="ewok-rater-header-section">
<ul class="header">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class="work-weight">9.0 minutes</span></h1></li>
</ul>
</div>
I can get meow with:
var regexpost = new System.Text.RegularExpressions.Regex(@"<h1(.*?)>(.*?)</h1>");
var mpost = regexpost.Match(reqpost);
string lechat = mpost.Groups[2].Value;
but not the others.
I would like to put meow in one textbox, meow2 in a second textbox, and 9.0 (minutes) in a third.
In situations like this an HTML parser can help a lot, and it is also more precise and robust than a regex:
Html Agility Pack
Example
var html = @"<div class=""ewok-rater-header-section"">
<li><h1>meow</h1></li>
<li><h1>meow2</h1></li>
<li><h1>Time = <span class=""work-weight"">9.0 minutes</span></h1></li>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
// you can search for the heading
foreach (var node in doc.DocumentNode.SelectNodes("//li//h1"))
{
Console.WriteLine("Found heading : " + node.InnerText);
}
// or you can be more specific
var someSpan = doc.DocumentNode
.SelectNodes("//span[@class='work-weight']")
.FirstOrDefault();
Console.WriteLine("Found span : " + someSpan.InnerText);
Output
Found heading : meow
Found heading : meow2
Found heading : Time = 9.0 minutes
Found span : 9.0 minutes
This is for parsing an HTTP response. Isn't it slow to use an HTML parser to build a document?

Find specific link in html doc c# using HTML Agility Pack

I am trying to parse an HTML document in order to retrieve a specific link within the page. I know this may not be the best way, but I'm trying to find the HTML node I need by its inner text. However, there are two instances in the HTML where this occurs: the footer and the navigation bar. I need the link from the navigation bar. The "footer" in the HTML comes first. Here is my code:
public string findCollegeURL(string catalog, string college)
{
//Find college
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(catalog);
var root = doc.DocumentNode;
var htmlNodes = root.DescendantsAndSelf();
// Search through fetched html nodes for relevant information
int counter = 0;
foreach (HtmlNode node in htmlNodes) {
string linkName = node.InnerText;
if (linkName == colleges[college] && counter == 0)
{
counter++;
continue;
}
else if(linkName == colleges[college] && counter == 1)
{
string targetURL = node.Attributes["href"].Value; //"found it!"; //
return targetURL;
}/* */
}
return "DID NOT WORK";
}
The program is entering into the if else statement, but when attempting to retrieve the link, I get a NullReferenceException. Why is that? How can I retrieve the link I need?
Here is the code in the HTML doc that I'm trying to access:
<tr class>
<td id="acalog-navigation">
<div class="n2_links" id="gateway-nav-current">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
<div class="n2_links">...</div>
College of Science ==$0
</div>
This is the link that I want: /content.php?catoid=10&navoid=1210
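The NullReferenceException most likely occurs because InnerText also matches nodes that are not <a> elements (the text node inside the link, for example), and those have no href attribute. I find XPath easier than writing a lot of code: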
var link = doc.DocumentNode.SelectSingleNode("//a[text()='College of Science']")
.Attributes["href"].Value;
If two links have the same text, select the second one like this:
var link = doc.DocumentNode.SelectSingleNode("(//a[text()='College of Science'])[2]")
.Attributes["href"].Value;
The LINQ version, which returns the href of every matching link:
var links = doc.DocumentNode.Descendants("a")
.Where(a => a.InnerText == "College of Science")
.Select(a => a.Attributes["href"].Value)
.ToList();

AngleSharp Parsing

Can't find many examples of using AngleSharp for parsing when you don't have a class name or id to use.
HTML
<span><span class="icon icon_none"></span></span>
<span><a href="bing.com" title="Bing"><span class="icon icon_none"></span></a></span>
<span><span class="icon icon_none"></span></span>
I want to find the href from any <a> tags that have a title = Bing
In Python BeautifulSoup I would use
item_needed = a_row.find('a', {'title': 'Bing'})
and then grab the href attribute
or jQuery
a[title='Bing']
But, I'm stuck using AngleSharp
eg. following example
https://github.com/AngleSharp/AngleSharp/wiki/Examples#getting-certain-elements
c# AngleSharp
var parser = new AngleSharp.Parser.Html.HtmlParser();
var document = parser.Parse(@"<span><span class=""icon icon_none""></span></span>< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//Do something with LINQ
var blueListItemsLinq = document.All.Where(m => m.LocalName == "a" && //stuck);
It looks like a problem in your HTML markup prevented AngleSharp from finding the target element, namely the spaces around the angle brackets:
< span >< a href = ""bing.com"" title = ""Bing"" >< span class=""icon icon_none"">
With the HTML fixed, both LINQ and a CSS selector successfully select the target link:
var parser = new AngleSharp.Html.Parser.HtmlParser(); // AngleSharp 0.10+; 0.9.x used AngleSharp.Parser.Html.HtmlParser and Parse()
var document = parser.ParseDocument(@"<span><span class=""icon icon_none""></span></span><span><a href=""bing.com"" title=""Bing""><span class=""icon icon_none""></span></a></span><span><span class=""icon icon_none""></span></span>");
//LINQ example
var blueListItemsLinq = document.All
.Where(m => m.LocalName == "a" &&
m.GetAttribute("title") == "Bing"
);
//LINQ equivalent CSS selector example
var blueListItemsCSS = document.QuerySelectorAll("a[title='Bing']");
//print href attributes value to console
foreach (var item in blueListItemsCSS)
{
Console.WriteLine(item.GetAttribute("href"));
}

Regex to remove and replace characters

I have the following
<option value="Abercrombie">Abercrombie</option>
My file has about 2000 rows, and each row has a different location. I'm trying to understand regex, but nothing I learn seems to stick, and I'm not sure this is even possible.
I want to run a regex that strips the HTML above and leaves the following:
Abercrombie
I then want to prefix a particular number to the front so the result would be for example
2,Abercrombie
Is this possible?
Don't use a regular expression since HTML is not a regular language. You can use Linq's XML parser. If you want to process the entire file, you can replace the elements inline:
int myNumber = 2;
var html = @"<html><body><option value=""Abercrombie"">Abercrombie</option><div><option value=""Forever21"">Forever21</option></div></body></html>";
var doc = XDocument.Load(new StringReader(html));
var options = doc.Descendants().Where(o => o.Name == "option").ToList();
foreach (var element in options)
{
element.ReplaceWith(string.Format("{0},{1}", myNumber, element.Value));
}
var result = doc.ToString();
This gives:
<html>
<body>2,Abercrombie<div>2,Forever21</div></body>
</html>
If you just want to grab the text for a specific tag, you can use the following:
int myNumber = 2;
var html = @"<option value=""Abercrombie"">Abercrombie</option>";
var doc = XDocument.Load(new StringReader(html));
var element = doc.Descendants().FirstOrDefault(o => o.Name == "option");
var attribute = element.Attribute("value").Value;
var result = string.Format("{0},{1}", myNumber, attribute);
//result == "2,Abercrombie"
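For the 2000-row file in the question, the same idea can be applied row by row. This is a rough sketch, assuming every row is a single well-formed <option> element and that the same prefix applies to all rows; `OptionRows` and `ToCsvRows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class OptionRows
{
    // Hypothetical helper: turns each <option> row into "prefix,text".
    public static List<string> ToCsvRows(IEnumerable<string> rows, int prefix)
    {
        return rows
            .Where(row => !string.IsNullOrWhiteSpace(row))       // skip blank lines
            .Select(row => XElement.Parse(row.Trim()))           // throws if a row is not well-formed XML
            .Select(option => prefix + "," + option.Value)       // Value is the element's text content
            .ToList();
    }

    public static void Main()
    {
        var rows = new[]
        {
            "<option value=\"Abercrombie\">Abercrombie</option>",
            "<option value=\"Forever21\">Forever21</option>"
        };
        foreach (var line in ToCsvRows(rows, 2))
        {
            Console.WriteLine(line);
        }
    }
}
```

In practice the rows could come from `File.ReadLines(path)`, which streams the file instead of loading it all at once.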

remove only some html tags on c#

I have a string:
string hmtl = "<DIV><B> xpto </B></DIV>";
and I need to remove the <DIV> and </DIV> tags, giving the result: <B> xpto </B>
Only <DIV> and </DIV> should be removed; I don't want to strip all HTML tags, and <B> xpto </B> must be kept.
Use the HTML Agility Pack:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html>yourHtml</html>");
// "//div" is an XPath expression that selects div nodes anywhere in the document
foreach (var item in doc.DocumentNode.SelectNodes("//div"))
{
    var divContent = item.InnerHtml; // the content of the div, without the div tags
}
If you want only the B tags:
// HTML Agility Pack lowercases element names, so the XPath uses "b"
foreach (var item in doc.DocumentNode.SelectNodes("//b"))
{
    var bTag = item.OuterHtml; // the B tag and its content
}
If you are just removing div tags, this will match the div tags along with any attributes they may have.
var html =
    "<DIV><B> xpto <div text='abc'/></B></DIV><b>Other text <div>test</div>";
var pattern = @"(\</?DIV(.*?)/?\>)";
// Replace any match with an empty string
var result = Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);
Result
<B> xpto </B><b>Other text test
Use Regex:
var result = Regex.Replace(html, @"</?DIV>", "");
UPDATED
As you mentioned, this regex removes all tags except B:
var hmtl = "<DIV><B> xpto </B></DIV>";
var remainTag = "B";
var pattern = String.Format("(</?(?!{0})[^<>]*(?<!{0})>)", remainTag );
var result = Regex.Replace(hmtl , pattern, "");
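A variation on the div-only patterns in this thread, offered as a sketch rather than a robust HTML solution. It assumes the goal is to drop opening and closing div tags in any letter case, with or without attributes, while leaving every other tag alone. `DivStripper` and `RemoveDivTags` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivStripper
{
    // Hypothetical helper: removes <div ...> and </div> tags but keeps their content.
    public static string RemoveDivTags(string html)
    {
        // \b keeps the pattern from matching longer tag names that merely start with "div".
        // [^>]* swallows any attributes up to the closing angle bracket.
        return Regex.Replace(html, @"</?div\b[^>]*>", string.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveDivTags("<DIV><B> xpto </B></DIV>"));
    }
}
```

An attribute value that itself contains a `>` character would still defeat this pattern, which is one reason the parser-based answers are safer for arbitrary HTML.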
You can use this regular expression:
<\s*/?\s*(body|html)\s*>
In C#:
var result = Regex.Replace(html, @"<\s*/?\s*(body|html)\s*>", "");
It matches tags such as:
<html>
<body>
< / html>
< / body>
html = Regex.Replace(html, @"</?DIV>", String.Empty);
