Regular expression to find anchor link with special href? - c#

I need a regular expression for the following:
I have some content in a div tag that includes a lot of anchor links. My task is to find the anchor links whose href has the format "components/showdoc.aspx?docid=" and add an onclick event to those links only, leaving the rest of the anchor links untouched.
<div id="content" runat="server">
test doc
</div>
This expression matches every anchor tag and adds a target attribute to it:
Regex.Replace(inputString, "<(a)([^>]+)>", @"<$1 target=""_blank""$2>")
Thanks

If you are looking to make permanent changes to your HTML file, first load it into a System.Windows.Forms.WebBrowser control. From there you can perform DOM-like modifications to the HTML without the risk of corrupting the markup, which is what can happen when you run Regex.Replace on the raw file. (Apparently RegEx + HTML is a serious issue for some.)
So first in your code you would:
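WebBrowser myBrowser = new WebBrowser();
myBrowser.Url = new Uri(@"C:\MyPath\MyFile.HTML");
// Navigation is asynchronous: read Document only after the DocumentCompleted event has fired.
HtmlElement myDocBody = myBrowser.Document.Body;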
Then you can navigate through your document body, seeking out your div tag and looking for your anchor tags by using the HtmlElement.Id property and HtmlElement.GetAttribute method.
Note: feel free to still use RegEx matching on the URL strings but only after extracting them from a GetAttribute("href") method.
To add the onclick handler, simply call the HtmlElement.SetAttribute method.
When you have finished all your modifications, save the changes by writing the WebBrowser.DocumentText to file.
Here is a reference:
http://msdn.microsoft.com/en-us/library/system.windows.forms.htmlelement.aspx

Don't use regex to parse html, it's evil.
You could use the HTML Agility Pack, it even has a nice NuGet Package.
Alternatively, you could do this on the client side with a single line of jQuery:
$('a[href*="components/showdoc.aspx?docid="]').on('click', myClickFunction);
This is making use of the Attribute Contains Selector.
If you want to find the docid in your click function, you could write something like this in your click function:
function myClickFunction(e) {
    var href = $(this).attr('href');
    var docId = href.split('=')[1];
    alert(docId);
}
Note that this assumes there is only ever one query string value. If you want to make it more robust, you could do something like in this answer: https://stackoverflow.com/a/1171731/21200
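As a minimal sketch of the more robust approach, the docid can be read by parameter name rather than by position. The helper name `getDocId` is hypothetical and not part of the original answer; it assumes an environment that provides the standard `URLSearchParams` class (modern browsers and Node.js).

```javascript
// Hypothetical helper: returns the value of the "docid" query parameter
// from an href, or null if the href has no such parameter.
function getDocId(href) {
  // Everything after the first "?" is the query string.
  const queryStart = href.indexOf('?');
  if (queryStart === -1) return null;
  // Drop any "#fragment" that follows the query string.
  const query = href.slice(queryStart + 1).split('#')[0];
  // URLSearchParams copes with several parameters and percent-decoding.
  return new URLSearchParams(query).get('docid');
}
```

Inside the click function this would replace the `split('=')` line, for example `var docId = getDocId($(this).attr('href'));`.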

Related

How to remove html tag and just leave text in C#?

I found a similar case that answers my question the jQuery way:
Remove tag but leave contents - jQuery/Javascript
But I need to know how to do this in C#. Is there any API in C# that I can use to strip out the tags and leave the plain text?
From:
some content and more and more content
To:
some content and more and more content
Thanks.
Use HtmlAgilityPack ( http://htmlagilitypack.codeplex.com/ ), load your HTML document into a HtmlDocument instance and query document.DocumentNode.InnerText.

format string containing html

I have a simple string variable that contains a portion of HTML inside. For example:
string contents = "<div><p>Hi how are you. Click here if you want to know more";
I want to include this HTML in page:
<div class="description">
@contents
</div>
However, it messes up the rest of the page because of unclosed tags.
Is there a function (or a helper) that reads the HTML and repairs it, for example by closing the open tags so that it renders without errors?
@Html.DisplayProperHTML(contents)
This will render as:
<div><p>Hi how are you. Click here if you want to know more</p></div>
There is no such functionality built-in.
You can use the HTML Agility Pack to parse and fix broken HTML.

How to find a link in HTML document? (C#)

I have a C# Form with WebBrowser object.
This object contains HTML Document.
And there is a link in that document that has no markers (no id and no name)
How can I access this element??
I tried to use this:
webBrowser1.Document.GetElementsByTagName("a")[n]
But this is not very useful, because if a new link is added to the page, I'll need to rebuild the whole program.
I also cannot loop through the document or take a substring of Document.ToString(), because then I cannot click the link.
Would be great if you could give me some advice.
In this kind of situation the best idea is always to find an "anchor", meaning a place in the document that never changes.
Let's say that
dada
doesn't have an ID or Name, so the closest you can get is to check whether the parent of the element you're looking for has an ID.
<div id="parentDiv">
Some text
Some other stuff
The link you're looking for
</div>
That way you could get the parentDiv, which you know doesn't change, and then the A tag inside that parent. That should be stable unless the website completely changes its structure, which is one of the problems with parsing external HTML pages.
Shai.
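You can use the Html Agility Pack and select links by XPath:
HtmlWeb htmlWeb = new HtmlWeb();
HtmlDocument doc = htmlWeb.Load(/* url */);
// SelectNodes returns null when nothing matches, so check for null before looping on arbitrary pages.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))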
{
// do stuff
}
You should have some information on how to identify the link. It may be the id, the name, or the text. If the text is always the same, check the inner text of that link.

Make link unclickable with regex unless from the certain domain

I want to write a regex that makes a link unclickable by removing the anchor tag and its href and leaving the link URL as text. It should ignore the anchor text, if any is given, and leave only the URL.
But this regex should make links unclickable only if they are not from a certain domain. I want this for my private messaging system.
So the regex will do the following:
If the link does not target the given domain, replace the link with its URL as text. Ignore the given anchor text.
asp.net 4.0 , C# 4.0
Example
My domain
This will be parsed as http://www.monstermmorpg.com/ it will be text
Something like this works on jQuery:
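$(document).ready(function () {
    // Pass a function to ready(); calling removeLink here directly would run it before the DOM is ready.
    removeLink("domain.com");
});
function removeLink(domain) {
    $('a').each(function () {
        var href = $(this).attr("href");
        // Matches links that contain the domain; use < 0 instead to target links from other domains.
        if (href != null && href.indexOf(domain) >= 0) {
            $(this).removeAttr("href");
        }
    });
}
Explanation:
This gets all anchor elements in the rendered HTML, iterates through them, and removes the href attribute from every hyperlink that has domain.com as part of its URL.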
jsfiddle demo.
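The `indexOf` check above matches the domain anywhere in the URL, including the path or query string. A minimal sketch of a stricter test compares the hostname instead. The helper name `isFromDomain` is hypothetical, it assumes the standard `URL` class is available, and it treats relative or malformed URLs as not matching, which may or may not suit a given messaging system.

```javascript
// Hypothetical helper: true when href points at `domain` or one of its subdomains.
function isFromDomain(href, domain) {
  let host;
  try {
    // new URL() throws on relative or malformed URLs when no base is given.
    host = new URL(href).hostname.toLowerCase();
  } catch (e) {
    return false; // assumption: anything unparseable is treated as not matching
  }
  const wanted = domain.toLowerCase();
  // Exact host match, or a subdomain such as www.<domain>.
  return host === wanted || host.endsWith('.' + wanted);
}
```

In the jQuery loop, the condition could then become `!isFromDomain(href, domain)` to pick out links that point elsewhere.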

How to find a matching closing tag in html string?

Imagine the following HTML:
<div>
<b></b>
<div>
<table>...</table>
</div>
</div> <!-- this one -->
...
How could I find the matching closing tag for the first opening div tag? Is there a reg ex that could find it? I guess this is quite a common requirement but I'm struggling to find anything straightforward, just full blown HTML parsers.
No.
Use a full blown HTML parser. There's a reason they exist.
Use Html Agility Pack.
I'm assuming that you have tokenized the HTML tags. Now create a stack: every time you see an opening tag, push it, and every time you see a closing tag, pop, and check that the tag you pop matches the closing tag.
But there are already HTML parsers for this so search for one on codeplex.
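The push/pop idea can be sketched as follows. This is only an illustration under simplifying assumptions, not a substitute for a parser: it tracks a single tag name, so a depth counter stands in for the stack, and it does not understand comments, scripts, or attribute values containing `>`. The function name `findMatchingClose` is hypothetical.

```javascript
// Hypothetical helper: returns the index in `html` where the closing tag that
// matches the first opening <tagName> begins, or -1 if there is none.
function findMatchingClose(html, tagName) {
  // Matches opening and closing tags of the given name, e.g. <div ...> or </div>.
  // Assumes tagName is a plain name with no regex metacharacters.
  const tagPattern = new RegExp('<(/?)' + tagName + '\\b[^>]*>', 'gi');
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    if (match[1] === '/') {
      depth -= 1; // closing tag: pop
      if (depth === 0) return match.index; // back at the first tag's level
      if (depth < 0) return -1; // closing tag before any opening tag
    } else if (!match[0].endsWith('/>')) {
      depth += 1; // opening tag: push (self-closing tags are skipped)
    }
  }
  return -1; // ran out of input while tags were still open
}
```

For the sample in the question, `findMatchingClose(html, 'div')` would return the position of the last `</div>`, the one marked "this one".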
Well, you need to have a clear view of the syntax! However, regular expressions are very limited in scope, and I wouldn't recommend using them for multi-line or multi-tag syntax.
Instead, you need to track each tag (open/close) and use a 'handler' to deal with your request. You could use Lex/Yacc tools, but that may be overkill. Depending on the language you use, you may already have modules for this purpose (like HTMLParser in Python).
There's always LINQ to XML if you want to parse HTML and don't need every little detail, provided the markup is well-formed XML.
