REGEX to match relative urls inside href attribute in a html file - c#

I am writing C# code that converts relative URLs to absolute URLs in the href and src attributes of HTML entered in a RichTextBox when the user clicks a button, using a base path that the user supplies. I need a regex that matches only relative URLs inside href and src attributes so that they can be converted to absolute ones. This is what I am trying to achieve:
Example:
If the path the user entered is: https://example.com/page
and the HTML code in the RichTextBox is:
click
click
<img src="/img1.png" />
<img src="../img2.png" />
This is the result that I want for the HTML code:
click //this doesn't change
click
<img src="https://example.com/page/img1.png" />
<img src="https://example.com/img2.png" />
I have only been able to come up with a regex that matches href attributes: .href=(["])(.?)\1
but I can't come up with a regex that does the work described above (relative to absolute).

A couple of tips for you:
Please don't use regex to parse HTML. It could break the universe. See here for more info: RegEx match open tags except XHTML self-contained tags
Instead, you can use HTML Agility Pack, as suggested in this SO answer
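A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced. The names `UrlRewriter` and `MakeAbsolute` are hypothetical and not part of any library. Resolution is delegated to `System.Uri`, which follows standard URL rules: a root-relative value such as `/img1.png` resolves against the host (`https://example.com/img1.png`), not against the `/page` path shown in the question, so treat this as a starting point rather than an exact match for the desired output.

```csharp
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class UrlRewriter // hypothetical helper name
{
    // Matches values that already start with a scheme such as "https:" or "mailto:".
    private static readonly Regex HasScheme =
        new Regex(@"^[a-zA-Z][a-zA-Z0-9+.\-]*:");

    // Rewrites relative href/src values in `html` against `basePath`.
    public static string MakeAbsolute(string html, string basePath)
    {
        var baseUri = new Uri(basePath, UriKind.Absolute);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var attributeName in new[] { "href", "src" })
        {
            // SelectNodes returns null (not an empty list) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes("//*[@" + attributeName + "]");
            if (nodes == null) continue;

            foreach (var node in nodes)
            {
                string value = node.GetAttributeValue(attributeName, "");

                // Leave empty values, fragment-only links and absolute URLs alone.
                if (value.Length == 0 || value.StartsWith("#")) continue;
                if (HasScheme.IsMatch(value)) continue;

                // Combine with the base; skip values that cannot be combined.
                if (Uri.TryCreate(baseUri, value, out Uri absolute))
                    node.SetAttributeValue(attributeName, absolute.AbsoluteUri);
            }
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

In the button handler this could be called as `richTextBox.Text = UrlRewriter.MakeAbsolute(richTextBox.Text, pathTextBox.Text);`, where both control names are placeholders.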

Related

Regex to catch specific anchor formula

I have an HTML file, and I want to select all anchors that follow a specific pattern, like the following:
<a href="AnyTextHereFollowingByThatChar/" target="_blank">
I wrote the following regex:
\<a\s*href\=\"(.*?)"\s*target\="_blank"
but this regex starts at the first anchor it matches and keeps going until it finds the keyword target on some later anchor, selecting all the characters in between.
I would appreciate any help catching those anchors: <a href="AnyTextHereFollowingByThatChar/" target="_blank">
Finally, I arrived at the regex that I need:
\<a\s*href\="(?<value>[a-zA-Z0-9]+[^/])*\/"\s*target\="_blank">
This regex selects only the anchors that I need, as described in the question above.
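The first attempt overshoots because a lazy `.*?` still expands past the closing quote whenever the rest of the pattern fails to match at that point. An alternative to the pattern above is a negated character class, which cannot cross a quote at all. The sketch below uses only `System.Text.RegularExpressions`; the class name and the sample HTML are made up for illustration, and it accepts any characters other than a double quote in the href, which is looser than the alphanumeric pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorMatcher // hypothetical name
{
    // [^"]* cannot cross a closing quote, so a match never runs into the next anchor.
    private static readonly Regex BlankTargetAnchor = new Regex(
        @"<a\s+href=""(?<value>[^""]*/)""\s+target=""_blank""\s*>",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        // Sample input: only the second anchor has an href ending in "/" and target="_blank".
        string html =
            @"<a href=""other.html"">x</a> " +
            @"<a href=""docs/"" target=""_blank"">y</a>";

        foreach (Match m in BlankTargetAnchor.Matches(html))
            Console.WriteLine(m.Groups["value"].Value); // prints "docs/"
    }
}
```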

How to render invalid HTML tags as text alongside valid HTML tags?

I have an HTML string in my code behind which has some valid html tags such as <br/>, <p></p>, and some invalid tags such as <test>. I want to render both sets of tags in the browser in such a way that invalid tags are rendered as plain text.
For example, the string
<test><br/>hi mark.<br/>how are you.My email is <test@test.com>
would need to be output in the browser as
<test>
hi mark.how are you.My email is <test@test.com>
You would need to whitelist the elements you consider to be valid, and then encode those which do not match the whitelist so that they are converted to &lt;test&gt; as opposed to <test>. The encoded values will render as the literal text <test> on the page.
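One way to sketch that idea in C# is to encode the whole string first and then restore only the whitelisted tags. This assumes the whitelist is just attribute-free `<br>`, `<br/>`, `<p>` and `</p>`; the class and method names are hypothetical, and tags with attributes would need a more careful approach (ideally a real HTML sanitizer).

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class TagWhitelist // hypothetical name
{
    // Assumption: only attribute-free <br>, <br/>, <p> and </p> count as valid.
    private static readonly Regex EncodedAllowedTag = new Regex(
        @"&lt;(/?)(br|p)\s*(/?)&gt;", RegexOptions.IgnoreCase);

    public static string EncodeExceptWhitelist(string input)
    {
        // Encode everything first, so unknown tags become &lt;...&gt; text.
        string encoded = WebUtility.HtmlEncode(input);

        // Then turn the encoded whitelisted tags back into real markup.
        return EncodedAllowedTag.Replace(encoded, "<$1$2$3>");
    }

    public static void Main()
    {
        Console.WriteLine(EncodeExceptWhitelist(
            "<test><br/>hi mark.<br/>how are you.My email is <test@test.com>"));
    }
}
```

With the question's sample string, the two `<br/>` tags survive as markup, while `<test>` and the bracketed email address are emitted as entity-encoded text.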

Regular expression to find anchor link with special href?

I just need to find a regular expression for the following:
I have some content in a div tag that includes a lot of anchor links. My task is to find the anchor links whose href has the format "components/showdoc.aspx?docid=", add an onclick event to those anchor links only, and leave the rest of the anchor links alone.
<div id="content" runat="server">
test doc
</div>
This expression matches every anchor link and adds a target attribute to it.
RegEx.Replace(inputString, "<(a)([^>]+)>", "<$1 target=""_blank""$2>")
Thanks
If you are looking to make permanent changes to your HTML file, first manage your HTML parsing by loading it into a System.Windows.Forms.WebBrowser control. From there you can perform DOM-like modifications to the HTML without the dangerous repercussions of parsing corruption that can be caused by performing Regex.Replace on the raw file. (Apparently RegEx + HTML is a serious issue for some).
So first in your code you would:
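WebBrowser myBrowser = new WebBrowser();
// Url is a System.Uri property, so wrap the path in a Uri.
myBrowser.Url = new Uri(@"C:\MyPath\MyFile.HTML");
// Document is only populated once loading has completed (see the DocumentCompleted event).
HtmlElement myDocBody = myBrowser.Document.Body;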
Then you can navigate through your document body, seeking out your div tag and looking for your anchor tags by using the HtmlElement.Id property and HtmlElement.GetAttribute method.
Note: feel free to still use RegEx matching on the URL strings but only after extracting them from a GetAttribute("href") method.
To add the onClick method, simply invoke the HtmlElement.SetAttribute method.
When you have finished all your modifications, save the changes by writing the WebBrowser.DocumentText to file.
Here is a reference:
http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.aspx
Don't use regex to parse html, it's evil.
You could use the HTML Agility Pack, it even has a nice NuGet Package.
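As a rough server-side sketch of the HTML Agility Pack route, assuming the NuGet package is referenced: the XPath below selects only anchors inside the `content` div whose href contains the showdoc pattern. `DocLinkTagger`, `AddOnClick` and the `handler` argument are hypothetical names used for illustration.

```csharp
using HtmlAgilityPack; // assumption: HtmlAgilityPack NuGet package is installed

public static class DocLinkTagger // hypothetical name
{
    // Adds onclick="handler" to matching anchors and returns the modified HTML.
    public static string AddOnClick(string html, string handler)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Anchors inside div#content whose href contains the showdoc pattern.
        var links = doc.DocumentNode.SelectNodes(
            "//div[@id='content']//a[contains(@href, 'components/showdoc.aspx?docid=')]");

        if (links != null) // SelectNodes returns null when nothing matches
        {
            foreach (var link in links)
                link.SetAttributeValue("onclick", handler);
        }
        return doc.DocumentNode.OuterHtml;
    }
}
```

A call might look like `DocLinkTagger.AddOnClick(inputString, "myClickFunction(this)")`; all other anchors are left untouched.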
Alternatively, you could do this on the client side with a single line of jQuery:
$('a[href*="components/showdoc.aspx?docid="]').on('click', myClickFunction);
This is making use of the Attribute Contains Selector.
If you want to find the docid in your click function, you could write something like this in your click function:
function myClickFunction(e) {
    var href = $(this).attr('href');
    var docId = href.split('=')[1];
    alert(docId);
}
Note that this assumes there's only ever one query string value. If you wanted to make this more robust, you could do something like in this answer: https://stackoverflow.com/a/1171731/21200

Make link unclickable with regex unless from the certain domain

I want to write a regex that makes a link unclickable by removing the anchor tag and leaving the link URL as plain text. Any anchor text should be ignored, so that only the URL is left.
But this regex should make links unclickable only if they are not from a certain domain. I want this for my private messaging system.
So the regex will do the following:
If the link does not target the given domain, replace the link with its URL as text, ignoring the anchor text.
asp.net 4.0 , C# 4.0
Example
My domain
This will be parsed as http://www.monstermmorpg.com/ (it will be plain text).
Something like this works on jQuery:
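// Pass a function to ready(); calling removeLink(...) directly would run it before the DOM is ready.
$(document).ready(function () {
    removeLink("domain.com");
});
function removeLink(domain)
{
    $('a').each(function () {
        var href = $(this).attr("href");
        if (href != null && href.indexOf(domain) >= 0)
            $(this).removeAttr("href");
    });
}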
Explanation:
Gets all anchor elements in the rendered HTML, iterates through them, and removes the href attribute from every hyperlink that has domain.com as part of it.
jsfiddle demo.

Regex to replace href values in anchor tags of HTML

I have the HTML in the form of a string, and before I display it in the browser, I want to change all the relative URLs on the page to absolute URLs. What is the best way to do this? I was thinking of using a regex to get the href attributes of anchor tags and prepend the base URL to them, but I am not sure how to do it. Can someone help or suggest a better solution?
PS: I want to exclude all the links that have only "#" symbol in the link. For example: I want to replace <a href="/dir/file1.htm" /> with <a href="http://mysite/dir/file1.htm" /> but I want to exclude <a href="#A1" />
I would appreciate any help on this.
In general, using RegEx to parse HTML is a bad idea - see here for why.
You can use an HTML parser like the HTML Agility Pack in order to extract URLs from HTML:
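HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every anchor that has an href attribute.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
    HtmlAttribute att = link.Attributes["href"];
    // FixLink is a placeholder for your own helper that returns the absolute URL.
    att.Value = FixLink(att);
}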
You can then exclude any URLs that start with #.
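The answer leaves `FixLink` undefined. One hypothetical way to write such a helper, using only `System.Uri`, is sketched below. Unlike the call above, this version takes the attribute's string value and an assumed base URI rather than the `HtmlAttribute` itself, and it returns fragment-only links such as `#A1` unchanged.

```csharp
using System;

public static class LinkFixer // hypothetical name
{
    // Returns `href` resolved against `baseUri`, or unchanged when it should be skipped.
    public static string FixLink(string href, Uri baseUri)
    {
        if (string.IsNullOrWhiteSpace(href)) return href;

        // Fragment-only links such as "#A1" are left alone.
        if (href.StartsWith("#", StringComparison.Ordinal)) return href;

        // If the value cannot be combined with the base, keep it as written.
        return Uri.TryCreate(baseUri, href, out Uri absolute)
            ? absolute.AbsoluteUri
            : href;
    }

    public static void Main()
    {
        var baseUri = new Uri("http://mysite/"); // base URL taken from the question's example
        Console.WriteLine(FixLink("/dir/file1.htm", baseUri));
        Console.WriteLine(FixLink("#A1", baseUri));
    }
}
```

Inside the loop it could be used as `att.Value = LinkFixer.FixLink(att.Value, baseUri);`. Values that are already absolute pass through `Uri.TryCreate` and come back as absolute URLs, possibly in normalized form.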
