Not able to make a valid Regex - c#

I am new to Regex and started playing with it a few days ago. But now I am stuck at one string.
For example:
I have this string -> <a href="http://somelink.com", example1="", example2="">
I am trying to replace everything in that string from <a to >, but I want to keep the href attribute and its link. I have been working on this at https://regex101.com, but to no avail. The Regex pattern that I am trying is <a(\s?)(?!.*?href=\".*?\").*?>. This pattern does not find anything in the string. I am using C#.
Any help would be greatly appreciated. Thanks
Update:
The actual string looks like
<a href="http://somelink.com", example1="", example2="">
and I want to remove this part
, example1="", example2=""
But then I want to keep this part
<a href="http://somelink.com">

This might work <a(?:.*?)(href=\"(.*?)\").*?>
Group(1) = href="..." and Group(2) = link
Demo https://regex101.com/r/Yz0kPc/1
Added more attributes https://regex101.com/r/Yz0kPc/2

I think you can search by this:
<a[^>]*?(href="[^"]+")[^>]*>
and replace by this:
<a $1>
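For reference, here is a minimal C# sketch of the search-and-replace above, assuming the input is the sample string from the question. The variable names are only for illustration, and the top-level statements need C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Sample input taken from the question.
string input = "<a href=\"http://somelink.com\", example1=\"\", example2=\"\">";

// Group 1 captures the href attribute; [^>]* skips everything else up to '>'.
string pattern = "<a[^>]*?(href=\"[^\"]+\")[^>]*>";

// Rebuild the tag from the captured href attribute only.
string result = Regex.Replace(input, pattern, "<a $1>");

Console.WriteLine(result); // <a href="http://somelink.com">
```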

Related

How to change rel to nofollow with Regex - c#

I want to add rel="nofollow" to some URLs, while some other URLs should stay dofollow.
I tried to do it with this Regex:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!blogs.cc)[^"]+)"([^>]*)>
With this I can keep one URL dofollow (in this example, "blogs.cc").
What do I do if I want to keep more than one URL dofollow?
I tried:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)(((?!blogs.cc)[^"]+)||((?!wikipedia.org)[^"]+))"([^>]*)>
but I didn't get a correct result.
What's the solution?
I resolved it and am putting my solution here for everybody who has the same question.
Just list the alternatives in one group inside the negative lookahead:
(<a\s*(?!.*\brel=)[^>]*)(href="https?://)((?!(?:blogs.cc|wikipedia.org|moreUrls.com))[^"]+)"([^>]*)>
C# sample code (note that \b must be written as \\b in a regular C# string, otherwise it is a backspace character):
Regex.Replace(str, "(<a\\s*(?!.*\\brel=)[^>]*)(href=\"https?://)((?!(?:blogs.cc|wikipedia.org))[^\"]+)\"([^>]*)>", "<a $2$3\" $4 rel=\"nofollow\">")
I hope it is useful.
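A small sketch of the same idea with the dofollow hosts kept in a list, in case more need to be added later. The doFollowHosts array, the sample HTML, and the replacement that keeps $1 are my assumptions and not part of the answer above; Regex.Escape is used so the dots in the host names are matched literally.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical list of hosts that should stay dofollow.
string[] doFollowHosts = { "blogs.cc", "wikipedia.org" };

// Escape each host and join them into one alternation: blogs\.cc|wikipedia\.org
string hosts = string.Join("|", doFollowHosts.Select(Regex.Escape));

// Same structure as the pattern above, with the alternation inside one negative lookahead.
string pattern = "(<a\\s*(?!.*\\brel=)[^>]*)(href=\"https?://)((?!(?:" + hosts + "))[^\"]+)\"([^>]*)>";

// Made-up sample input: one external link and one link to a dofollow host.
string html = "<a href=\"http://example.com/page\">x</a> <a href=\"http://blogs.cc/post\">y</a>";

// Keep everything that was captured and append rel="nofollow" before the closing '>'.
string result = Regex.Replace(html, pattern, "$1$2$3\"$4 rel=\"nofollow\">");

Console.WriteLine(result);
// <a href="http://example.com/page" rel="nofollow">x</a> <a href="http://blogs.cc/post">y</a>
```

Because the lookahead (?!.*\brel=) scans to the end of the line, a link is left untouched whenever any later tag on the same line already has a rel attribute.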

Find all <a> tags in which the first attribute isn't 'title'

I'm trying to fix my site to meet WCAG 2.0. This means that all links in my site must have a title. To do it right and not miss any <a> tags, I'm adding title to each link as the first attribute:
<a title="..."
But this site has a lot of links and I'm struggling to find all the links without a title. Can anyone help me with a regular expression that I could use to find all tags that start with <a but the next letter isn't 't'?
If someone has an answer on how to find a specific tag without a specific attribute, that would be even better! I'm working in Visual Studio 2015.
Generally, it is not a good idea to use regular expressions on a complex system like a DOM. However, in your (simple) example, you might get along with:
<a(?:(?!\btitle\b)[^>])*>
This ignores links with a title attribute, regardless where they are. See a demo on regex101.com.
Remember that it will fail on e.g. This one fails, not matching it in the way you intended.
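As a rough sketch, the pattern above could be run from C# like this. The sample HTML is made up for illustration, and the \b after <a is my addition so that tags such as <abbr> are not picked up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up sample: the second link has a title attribute, the other two do not.
string html = "<a href=\"a.html\">A</a> <a title=\"B\" href=\"b.html\">B</a> <a class=\"x\" href=\"c.html\">C</a>";

// Match an opening <a ...> tag only if the word "title" appears nowhere inside it.
string pattern = @"<a\b(?:(?!\btitle\b)[^>])*>";

var hits = Regex.Matches(html, pattern).Cast<Match>().Select(m => m.Value).ToList();

foreach (var tag in hits)
{
    Console.WriteLine(tag);
}
// <a href="a.html">
// <a class="x" href="c.html">
```

A tag such as <a href="title.html"> is skipped as well, because the lookahead rejects the word title anywhere inside the tag, not only as an attribute name.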
How about this one?
(<a )(?!title)
Matches:
<a >
But not:
<a title="..."
Try it here.

Regex for extracting URL from string fails when string contains multiple double quotes?

I am using a regex to extract URLs from a string, and it mostly works:
var regex=new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")",RegexOptions.IgnoreCase);
The following strings work fine:
"This is Test page <a href='test.aspx'>test page</a>"
"This is Test page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Tests\"s page <a href='test1.aspx'>test</a> another one <a href='test2.aspx'>test</a>"
"This is Test page"
"This is Test page\"s without problem"
But sometimes it does not return a good result. The following code returns a bad result (the string contains 2 double quotes):
using System;
using System.Linq;
using System.Text.RegularExpressions;

var inputString = "This string create \"problem\" for me";
var regex = new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")", RegexOptions.IgnoreCase);
var urls = regex.Matches(inputString).OfType<Match>().Select(m => m.Groups["href"].Value);
foreach (var url in urls)
{
    Console.WriteLine(url);
}
Demo with problem
Could anyone help me to solve this problem?
Maybe you can change your regex like this: <a .*?href=(?:['"](?<href>[^'"]*?)['"])
In C#: "<a .*?href=(?:['\"](?<href>[^'\"]*?)['\"])"
Solution:
You should use an HTML parser to get rid of current and future headaches. A tested and working example can be found here.
Regex explanation:
As for your regex, it currently fails because the alternation is not enclosed in a group. Thus, it can return strings that have no <a... href inside them. Moreover, there are other issues that you can run into with your current regex.
A "fixed" regex (meaning it will be capable of handling escaped entities and both double and single quotes) would look like:
(?i)<a\b[^<]*href=(?:(?:'(?<href>[^'\\]*(?:\\.[^'\\]*)*)')|(?:\"(?<href>[^\"\\]*(?:\\.[^\"\\]*)*)\"))
But it is unlikely you can fully rely on regex when parsing HTML. Use the solution, not a workaround.
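To illustrate the grouping issue described above, here is a small C# sketch that compares the pattern from the question with a grouped variant. The grouped pattern is a simplified assumption of mine (no escape handling), and the sample string is made up.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Made-up input: quoted text outside any link, followed by one real link.
string input = "This string create \"problem\" for me and <a href='test.aspx'>test</a>";

// Pattern from the question: the '|' splits the WHOLE pattern,
// so the second branch matches any double-quoted text on its own.
var loose = new Regex("<a [^>]*href=(?:'(?<href>.*?)')|(?:\"(?<href>.*?)\")", RegexOptions.IgnoreCase);

// Grouped variant: both quote styles sit inside one group after href=.
var grouped = new Regex("<a\\b[^>]*href=(?:'(?<href>[^']*)'|\"(?<href>[^\"]*)\")", RegexOptions.IgnoreCase);

var looseHrefs = loose.Matches(input).Cast<Match>().Select(m => m.Groups["href"].Value).ToList();
var groupedHrefs = grouped.Matches(input).Cast<Match>().Select(m => m.Groups["href"].Value).ToList();

Console.WriteLine(string.Join(", ", looseHrefs));   // problem, test.aspx
Console.WriteLine(string.Join(", ", groupedHrefs)); // test.aspx
```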

Extracts all sub strings between string separators in a string (C#)

I'm trying to parse the content of a string to see if it includes URLs, and then convert the full string to HTML so that the URLs become clickable.
I'm not sure if there is a smarter way of doing this, but I started by trying to create a parser with the Split method for strings, or Regex.Split in C#. But I can't find a good way of doing it.
(It is an ASP.NET MVC application, so perhaps there is some smarter way of doing this.)
For example, I want to convert the string
"Customer office is responsible for this. Contact info can be found {link}{www.customerservice.com}{here!}{link} More info can be found {link}{www.customerservice.com/moreinfo}{here!}{link}"
Into
"Customer office is responsible for this. Contact info can be found <a href=www.customerservice.com>here!</a> More info can be found <a href=www.customerservice.com/moreinfo>here!</a>"
i.e.
{link}{url}{text}{link} --> <a href=url>text</a>
Anyone have a good suggestion? I can also change the way the input string is formatted.
You can use the following to match:
{link}{([^}]*)}{([^}]*)}{link}
And replace with:
<a href=$1>$2</a>
See DEMO
Explanation:
{link} match {link} literally
{([^}]*)} match all characters except } in capturing group 1 (for url)
{([^}]*)} match all characters except } in capturing group 2 (for value)
{link} match {link} literally again
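A minimal C# sketch of the match and replace above, assuming the four-part {link}{url}{text}{link} format from the question. The braces are escaped to be explicit, the shortened sample text is taken from the question, and the quotes around the href value are my addition.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Contact info can be found {link}{www.customerservice.com}{here!}{link} " +
               "More info can be found {link}{www.customerservice.com/moreinfo}{here!}{link}";

// Group 1 = url, group 2 = link text; [^}]* stops at the next closing brace.
string pattern = @"\{link\}\{([^}]*)\}\{([^}]*)\}\{link\}";

string output = Regex.Replace(input, pattern, "<a href=\"$1\">$2</a>");

Console.WriteLine(output);
// Contact info can be found <a href="www.customerservice.com">here!</a> More info can be found <a href="www.customerservice.com/moreinfo">here!</a>
```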
You can use the regex
{link}{(.*?)}{(.*?)}{link}
and the substitution (.NET writes the groups as $1 and $2 in the replacement; some other flavors use \1 and \2)
<a href=$1>$2</a>
Regex
For your simple link format {link}{url}{text} you can use simple Regex.Replace:
Regex.Replace(input, @"\{link\}\{([^}]*)\}\{([^}]*)\}", @"<a href=$1>$2</a>");
Also this non-regex idea may help
var input = "Customer office is responsible for this. Contact info can be found {link}{www.customerservice.com}{here!}{link} More info can be found {link}{www.customerservice.com/moreinfo}{here!}{link}";
var output = input.Replace("{link}{", "<a href=")
.Replace("}{link}", "</a>")
.Replace("}{", ">");

Using regex to split a formatted string to URL like StackOverFlow

I'm trying to write a parser that will create links found in posted text that are formatted like so:
[Site Description](http://www.stackoverflow.com)
to be rendered as a standard HTML link like this:
<a href="http://www.stackoverflow.com">Site Description</a>
So far what I have is the expression listed below. It works on the example above, but it will not work if the URL has anything after the ".com". Obviously there is no single regex expression that will find every URL, but I would like to be able to match as many as I can.
(\[)([A-Za-z0-9 -_]*)(\])(\()((http|https|ftp)\://[A-Za-z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?)(\))
Any help would be greatly appreciated. Thanks.
Darn. It seems @Jerry and @MikeH beat me to it. My answer is best, however, as the link tags are all uppercase ;)
Find what: \[([^]]+)\]\(([^)]+)\)
Replace with: <A HREF="$2">$1</A>
http://regex101.com/r/cY7lF0
Well, you could try negated classes so you don't have to worry about the parsing of the url itself?
\[([^]]+)\]\(([^)]+)\)
And replace with:
<a href="$2">$1</a>
regex101 demo
Or maybe use only the beginning parts to identify a url?
\[([^]]+)\]\(((?:https?|ftp)://[^)]+)\)
The replace is the same.
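Here is a short C# sketch of the last pattern, assuming a made-up input whose URL has a path and query string after the ".com". The replacement string is my reading of what the answers intend, and [^\]] is written with an escaped bracket to be explicit.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up input with a path and query string after the host name.
string input = "See [Site Description](http://www.stackoverflow.com/questions?tab=newest) for details.";

// Group 1 = link text, group 2 = URL; only the scheme is checked, the rest is "anything up to ')'".
string pattern = @"\[([^\]]+)\]\(((?:https?|ftp)://[^)]+)\)";

string output = Regex.Replace(input, pattern, "<a href=\"$2\">$1</a>");

Console.WriteLine(output);
// See <a href="http://www.stackoverflow.com/questions?tab=newest">Site Description</a> for details.
```

A URL that itself contains a closing parenthesis would be cut off at that point, since [^)]+ stops at the first ')'.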
