Regex to catch a specific anchor pattern - C#

I have an HTML file and I want to select all anchors that follow a specific pattern, like the following:
<a href="AnyTextHereFollowingByThatChar/" target="_blank">
I wrote this regex:
\<a\s*href\=\"(.*?)"\s*target\="_blank"
But this regex starts at the first anchor it finds and keeps going until it reaches the keyword target, even if that keyword belongs to a later anchor, so it selects every character in between.
I would appreciate any help catching only anchors of the form <a href="AnyTextHereFollowingByThatChar/" target="_blank">

I eventually arrived at the regex I needed:
\<a\s*href\="(?<value>[a-zA-Z0-9]+[^/])*\/"\s*target\="_blank">
This regex selects only the anchors described in the question above.
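As a possible alternative, here is a minimal C# sketch that restricts the href value to characters other than a double quote, so the match cannot run past the closing quote into a later anchor. The class and method names (AnchorFinder, FindHrefs) are made up for illustration, and the pattern assumes double-quoted attributes in exactly the order shown in the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchorFinder
{
    // [^""]* cannot cross the closing quote of href, and the trailing / requires
    // the value to end with a slash. Assumes href comes first, then target="_blank".
    private static readonly Regex BlankTargetAnchor = new Regex(
        @"<a\s+href=""(?<value>[^""]*/)""\s+target=""_blank""\s*>",
        RegexOptions.IgnoreCase);

    // Returns the href values of all anchors that match the pattern.
    public static List<string> FindHrefs(string html)
    {
        var hrefs = new List<string>();
        foreach (Match m in BlankTargetAnchor.Matches(html))
        {
            hrefs.Add(m.Groups["value"].Value);
        }
        return hrefs;
    }
}
```

Calling AnchorFinder.FindHrefs on a string that mixes matching and non-matching anchors should return only the href values that end in a slash and are followed by target="_blank".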

Related

REGEX to match relative urls inside href attribute in a html file

I am writing C# code that converts relative URLs to absolute URLs in the href and src attributes of HTML entered in a RichTextBox. The conversion runs when the user clicks a button and uses a base path that the user supplies. I need a regex that matches only relative URLs inside href and src attributes so that they can be converted to absolute ones. This is what I am trying to achieve:
Example:
If the path the user entered is: https://example.com/page
and the HTML code in the RichTextBox is:
click
click
<img src="/img1.png" />
<img src="../img2.png" />
This is the result I want for the HTML code:
click //this doesn't change
click
<img src="https://example.com/page/img1.png" />
<img src="https://example.com/img2.png" />
So far I have only been able to come up with a regex that matches href attributes: .href=(["])(.?)\1
I can't come up with a regex that does the conversion above (relative to absolute).
A couple of tips for you:
Please don't use regex to parse HTML. It could break the universe. See here for more info: RegEx match open tags except XHTML self-contained tags
Instead, you can use HTML Agility Pack, as suggested in this SO answer

Regex removes too much text

In our CMS we use some tags that should be stripped out when exporting to other systems.
The code that does the replacing is shown below:
var rxStr = "<div[^<]+class=([\"'])related-document-content\\1.*</div>";
var rx = new System.Text.RegularExpressions.Regex(rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
bodyText = rx.Replace(bodyText, "");
Our problem occurs when there are two instances of the tag matched by rxStr:
<p>First paragraph</p>
<div class='related-document-content' id='457'>First related text</div>
<p>Second paragraph</p>
<div class='related-document-content' id='458'>Second related text</div>
<p>Third paragraph</p>
When the code runs it also removes the second paragraph, and the output is:
<p>First paragraph</p>
<p>Third paragraph</p>
Can anyone help me adjust the code so that only the div tags get removed?
Besides the obvious "use an HTML parser/writer instead":
The paragraph disappears because .* is greedy: when both divs are on one line it runs to the last </div>, so everything between the first <div and that final </div> is removed.
Your rxStr also uses <div[^<]+, "anything but the next opening tag", to skip over the attributes.
It is tighter to look for "anything but the current tag's end", <div[^>]+, and to match the > that closes the opening tag explicitly.
Then make the content match lazy so that it stops at the first </div>. See below:
// Added [^>]*> after the class attribute and made .* lazy.
// The () around the content make it easier to debug which matches were found.
var rxStr = "<div[^>]+class=([\"'])related-document-content\\1[^>]*>(.*?)</div>";
If the inner HTML of your div is actually text-only, use [^<]* instead of .*?:
var rxStr = "<div[^>]+class=([\"'])related-document-content\\1[^>]*>([^<]*)</div>";
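To try the text-only pattern against the sample from the question, a small C# sketch along these lines may help. The class and method names (RelatedContentStripper, Strip) are invented for the example, and it assumes the related divs contain plain text with no nested tags.

```csharp
using System.Text.RegularExpressions;

public static class RelatedContentStripper
{
    // [^>]* consumes the remaining attributes of the opening tag, and [^<]* keeps
    // the content match from running past the div's own closing tag.
    private static readonly Regex RelatedDiv = new Regex(
        @"<div[^>]+class=([""'])related-document-content\1[^>]*>[^<]*</div>",
        RegexOptions.IgnoreCase);

    // Removes every related-document-content div and leaves the rest untouched.
    public static string Strip(string bodyText)
    {
        return RelatedDiv.Replace(bodyText, string.Empty);
    }
}
```

With the five-line sample from the question joined into one string, Strip should leave all three paragraphs in place and remove only the two divs.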

Remove <span> tag Using Regex Only C#

I want to remove only the span tag whose class is "neighborhood", and leave the rest of the span tags alone. I have to do this with regex only; I don't have any other choice.
<span class="latitude">34.008253</span>
<span class="longitude">-118.414593</span>
<span class="neighborhood">Neighborhood: Clarkdale</span>
Please help me out. Thank you.
Try this one:
@"<span class=""neighborhood"">[^<]*</span>"
That will work as long as the span contains only text, with no other tag opened or closed before its </span>. Also, HTML allows a lot of optional whitespace inside tags, so you might need to adjust the pattern for that.
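As a rough illustration of the whitespace adjustment mentioned above, the following C# sketch tolerates extra spaces around the class attribute. The names SpanRemover and RemoveNeighborhood are hypothetical, and the pattern still assumes that class is the span's only attribute, that it is double-quoted, and that the span holds plain text.

```csharp
using System.Text.RegularExpressions;

public static class SpanRemover
{
    // \s+ and \s* allow for optional whitespace inside the opening tag;
    // [^<]* only matches spans without nested tags.
    private static readonly Regex NeighborhoodSpan = new Regex(
        @"<span\s+class\s*=\s*""neighborhood""\s*>[^<]*</span>",
        RegexOptions.IgnoreCase);

    // Removes the neighborhood span and keeps every other span.
    public static string RemoveNeighborhood(string html)
    {
        return NeighborhoodSpan.Replace(html, string.Empty);
    }
}
```

Run against the three spans from the question, this should drop the neighborhood span and keep the latitude and longitude spans.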

Make link unclickable with regex unless from the certain domain

I want to write a regex that makes a link unclickable by removing the anchor tag and leaving the link URL as plain text. If anchor text is given, it should be ignored and only the URL kept.
The regex should only make links unclickable if they are not from a certain domain. I want this for my private messaging system.
So the regex should do the following:
If the link does not target the given domain, replace the link with its URL as text and ignore any anchor text.
ASP.NET 4.0, C# 4.0
Example
My domain
This will be parsed as http://www.monstermmorpg.com/ it will be text
Something like this works with jQuery:
// Pass a function to ready(); calling removeLink(...) inline would run it immediately instead.
$(document).ready(function () {
    removeLink("domain.com");
});
function removeLink(domain) {
    $('a').each(function () {
        var href = $(this).attr("href");
        // Remove the href when it contains the given domain.
        if (href != null && href.indexOf(domain) >= 0) {
            $(this).removeAttr("href");
        }
    });
}
Explanation:
It gets all anchor elements in the rendered HTML, iterates through them, and removes the href attribute from every hyperlink that has domain.com as part of its URL.
jsfiddle demo.
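Since the question asks for a server-side regex in C#, here is a hedged sketch of one way it could be approached with Regex.Replace and a match evaluator. The names LinkNeutralizer, Neutralize, and IsOnDomain are made up for this example. It assumes double-quoted href values and anchors that are not nested, and it treats relative links as not belonging to the allowed domain.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkNeutralizer
{
    // Matches a whole <a ... href="...">...</a> element and captures the href value.
    private static readonly Regex Anchor = new Regex(
        @"<a\s[^>]*?href\s*=\s*""([^""]*)""[^>]*>.*?</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Replaces anchors that point outside allowedDomain with their bare URL text.
    public static string Neutralize(string html, string allowedDomain)
    {
        return Anchor.Replace(html, m =>
        {
            string href = m.Groups[1].Value;
            return IsOnDomain(href, allowedDomain) ? m.Value : href;
        });
    }

    // True when href is an absolute URL whose host is the domain or a subdomain of it.
    private static bool IsOnDomain(string href, string domain)
    {
        Uri uri;
        if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
        {
            return false;
        }
        return uri.Host.Equals(domain, StringComparison.OrdinalIgnoreCase)
            || uri.Host.EndsWith("." + domain, StringComparison.OrdinalIgnoreCase);
    }
}
```

Comparing the parsed host rather than searching the whole URL for the domain string avoids treating a URL such as http://evil.example/?q=domain.com as trusted.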

Regex to replace href values in anchor tags of HTML

I have the HTML in the form of a string and, before I display it in the browser, I want to change all the relative URLs on the page to absolute URLs. What is the best way to do it? I was thinking of using a regex to get the href attributes of anchor tags and prepend the base URL to them, but I am not sure how to do it. Can someone help or suggest a better solution?
PS: I want to exclude all the links that have only "#" symbol in the link. For example: I want to replace <a href="/dir/file1.htm" /> with <a href="http://mysite/dir/file1.htm" /> but I want to exclude <a href="#A1" />
I would appreciate any help on this.
In general, using RegEx to parse HTML is a bad idea - see here for why.
You can use an HTML parser like the HTML Agility Pack in order to extract URLs from HTML:
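// Requires the HtmlAgilityPack package (using HtmlAgilityPack;).
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when no node matches the XPath.
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        att.Value = FixLink(att); // FixLink is your own helper that builds the absolute URL
    }
}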
You can then exclude any URLs that start with #.
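The FixLink helper is not shown in the answer. One possible shape for it, as a sketch only, uses System.Uri to resolve the link against a base address. The signature below (a string href plus a base Uri) and the class name LinkFixer are assumptions for illustration, so the call site would pass att.Value and a base URI rather than the attribute itself.

```csharp
using System;

public static class LinkFixer
{
    // Returns href unchanged when it is empty, a fragment-only link such as "#A1",
    // or cannot be resolved; otherwise returns the URL resolved against baseUri.
    public static string FixLink(string href, Uri baseUri)
    {
        if (string.IsNullOrEmpty(href) || href.StartsWith("#", StringComparison.Ordinal))
        {
            return href;
        }
        Uri absolute;
        if (Uri.TryCreate(baseUri, href, out absolute))
        {
            return absolute.AbsoluteUri;
        }
        return href;
    }
}
```

With a base of http://mysite/, the href /dir/file1.htm from the question resolves to http://mysite/dir/file1.htm, while #A1 is returned as-is.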
