I want to locate all image tags in my html with src not containing http:// and prepend http:// to the src attribute.
I have a regex that finds all img tags whose src does not start with http://, but I'm having trouble prepending http:// to the src attribute alone. How can I achieve this using a regex replace?
<img [^<]*src="(?!http://)(?<source>[^"]*)"[^<]*/>
Source will contain the src value. I just need it to say $2 = "http://" + $2. How can I write this in C# code?
Since you don't want to break existing tags, you will need to assign groups to the parts of the string you are not interested in, so that you can include those parts of the match in the replacement pattern:
(<img [^<]*src=")(?!http://)(?<source>[^"]*)("[^<]*/>)
Then the replace is trivial:
regex.Replace(input, "$1http://$3$2");
(Also, this might work for your use case, but I should mention that, in general, it is not considered a good idea to parse HTML with regex.)
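The group numbers in that replacement can look odd at first: .NET numbers unnamed groups first and named groups after them, so the two unnamed groups are $1 and $2 and the named group source becomes $3. Below is a minimal sketch that sidesteps the ordering by referring to the group by name; the class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ImgSrcFixer
{
    // Group 1: everything up to and including src="
    // Group "source": the src value, only if it does not start with http://
    // Group 2: the closing quote and the rest of the tag
    private static readonly Regex ImgRegex = new Regex(
        @"(<img [^<]*src="")(?!http://)(?<source>[^""]*)(""[^<]*/>)");

    public static string PrependHttp(string html)
    {
        // ${source} names the group explicitly instead of relying on $3.
        return ImgRegex.Replace(html, "$1http://${source}$2");
    }

    public static void Main()
    {
        string html = "<img alt=\"logo\" src=\"example.com/logo.png\" />"
                    + "<img src=\"http://example.com/ok.png\" />";
        // Only the first tag is rewritten; the second already has http://.
        Console.WriteLine(PrependHttp(html));
    }
}
```

Note that, like the original pattern, this only handles self-closing tags written with double quotes and leaves https:// sources untouched by the lookahead check.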
Related
I have the following C# regex
@"(?:https?:\/\/)?(?:www\.)?(?:(?:(?:youtube\.com\/watch\?[^?]*v=|youtu\.be\/)))([\w-]+)";
How can I correct this so the regex won't match URLs that have a double quote immediately before them? That way, if the URL is in the href attribute of a hyperlink, it will be ignored and not captured.
I've used this expression in my other Twitter Regex pattern, but I can't make it work in this one.
(?<!"")
It worked on the Twitter pattern:
(?<!"")https?://twitter\.com/(?:#!/)?(\w+)/status(?:es)?/(\d+)
So the YouTube Regex should grab only URLs that are not with double quotes in the beginning of the URL.
To answer the question: (?<!") will fail the match if there is a " immediately before the current location. If there must be no " followed by 0+ other chars before the current location, you may leverage .NET infinite-width lookbehind.
In this case, you might want to turn your lookbehind into
(?<!"[^"<>]*)
See the regex demo. Note that [^"<>]* matches 0+ chars other than ", < and >, so, the " will be checked only when inside an element node if the HTML is perfectly serialized. If it contains plain < or > inside attribute values, this approach won't work.
That is why you should think about using an appropriate HTML parser for this task, too, since you are already using one in the project. If you let me know what you are trying to achieve, I will update the answer.
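As a rough sketch of how the infinite-width lookbehind combines with the YouTube pattern from the question (the class name, method name, and sample input are assumptions made for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class YouTubeIds
{
    // (?<!"[^"<>]*) rejects a position when a double quote precedes it with
    // no other ", < or > in between, i.e. when we are inside an attribute value.
    private static readonly Regex VideoRegex = new Regex(
        @"(?<!""[^""<>]*)(?:https?://)?(?:www\.)?(?:youtube\.com/watch\?[^?]*v=|youtu\.be/)([\w-]+)");

    public static List<string> Extract(string text)
    {
        return VideoRegex.Matches(text)
                         .Cast<Match>()
                         .Select(m => m.Groups[1].Value)
                         .ToList();
    }

    public static void Main()
    {
        string text = "<a href=\"https://youtu.be/aaa111\">x</a> and "
                    + "https://www.youtube.com/watch?v=bbb222";
        // The URL inside href is skipped; the plain-text one is captured.
        foreach (string id in Extract(text))
            Console.WriteLine(id);
    }
}
```

The same limitation described above applies: a URL that follows an ordinary quotation mark in plain text is also skipped, because the lookbehind cannot tell that quote from an attribute delimiter.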
I have a text field for user comments, a user may or may not insert a URL into this field.
e.g. they could have any of the following (plus other variations):
Look at http://www.google.com some more text here possibly
Look at https://www.google.com some more text here possibly
Look at ftp://www.google.com some more text here possibly
Look at http://google.com some more text here possibly
Look at www.google.com some more text here possibly
What I want to do is match on these and change the string to include an HTML anchor tag.
Using the various other Stack Overflow answers about this subject I have come up with the below:
text = text.Trim();
text = Regex.Replace(text,
@"((https?|ftp):\/\/(?:www\.|(?!www))[^\s\.]+\.[^\s]{2,}|www\.[^\s]+\.[^\s]{2,})",
"<a target='_blank' href='$1'>$1</a>");
This works almost perfectly and matches all the required patterns, BUT when it matches www.google.com (without the http(s):// part), the anchor tag created isn't correct: the href of the anchor needs the http:// part, or the link is treated as a URL relative to the site.
How can I change the code above so that if the match doesn't contain the http:// part, it will add it to the href part of the anchor?
Interestingly, as I'm typing this question, the preview part is creating links out of my URLs above - all except my "trouble" one - the one without the http/ftp:// prefix.
Use a match evaluator to check if Group 2 ((https?|ftp)) matched. If it did not, use one logic, else, use another.
var text = "Look at http://google.com some more text here possibly,\nLook at www.google.com some more text here possibly";
text = text.Trim();
text = Regex.Replace(text,
@"((https?|ftp)://(?:www\.|(?!www))[^\s.]+\.\S{2,}|www\.\S+\.\S{2,})", m =>
m.Groups[2].Success ?
string.Format("<a target='_blank' href='{0}'>{0}</a>", m.Groups[1].Value) :
string.Format("<a target='_blank' href='http://{0}'>{0}</a>", m.Groups[1].Value));
Console.WriteLine(text);
See the C# demo, output:
Look at <a target='_blank' href='http://google.com'>http://google.com</a> some more text here possibly,
Look at <a target='_blank' href='http://www.google.com'>www.google.com</a> some more text here possibly
Note I replaced [^\s] with \S everywhere in the pattern to make it look "prettier".
You may also remove the outer capturing group (and use the @"(https?|ftp)://(?:www\.|(?!www))[^\s.]+\.\S{2,}|www\.\S+\.\S{2,}" pattern) and then check if m.Groups[1].Success is true and use m.Value in the replacements.
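That variant without the outer group might look like the sketch below, where the scheme group becomes Group 1 and the whole match is read from m.Value. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class Linkifier
{
    // Group 1 (the scheme) participates only in the first alternative.
    private static readonly Regex UrlRegex = new Regex(
        @"(https?|ftp)://(?:www\.|(?!www))[^\s.]+\.\S{2,}|www\.\S+\.\S{2,}");

    public static string Linkify(string text)
    {
        return UrlRegex.Replace(text, m =>
        {
            // Prepend http:// only when no scheme was matched.
            string href = m.Groups[1].Success ? m.Value : "http://" + m.Value;
            return string.Format(
                "<a target='_blank' href='{0}'>{1}</a>", href, m.Value);
        });
    }

    public static void Main()
    {
        Console.WriteLine(Linkify("Look at www.google.com some more text"));
        Console.WriteLine(Linkify("Look at ftp://files.example.org/x here"));
    }
}
```

Building href once and reusing m.Value for the link text keeps the two branches from drifting apart if the anchor markup changes later.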
This question already has answers here:
RegEx match open tags except XHTML self-contained tags
I have a HTML file and I am trying to retrieve valid innertext from each tag. I am using Regex for this with the following pattern:
(?<=>).*?(?=<)
It works fine for simple innertext. But, I recently encountered following HTML pieces:
<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>
I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?
Thanks
I'd use a parser, but this is possible with RegEx using something like:
<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>
Then you can grab the inner text with capture group 2.
You can always strip the HTML tags themselves: a single tag can be described by a regular grammar, even though HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.
That regex should match any valid HTML tag.
EDIT:
If you do not want to obtain a concatenated result, you can use "<" instead of string.Empty and then split by '<', since '<' in HTML always starts a tag and should never be displayed. Or you can use the overload of Regex.Replace that takes a delegate and use the match index and match length (it may turn out to be more efficient that way). Or, even better, use Regex.Match and go from matched tag to matched tag. substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should provide the inner text.
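The "go from matched tag to matched tag" idea can be sketched roughly as follows. The tag pattern used here is a deliberately simplified stand-in (an assumption, not the regex quoted above), and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class InnerTextWalker
{
    // Simplified assumption of what a tag looks like: '<', optional '/',
    // a letter, then anything up to the next '>'.
    private static readonly Regex TagRegex = new Regex(@"</?[a-zA-Z][^<>]*>");

    public static List<string> TextBetweenTags(string html)
    {
        var result = new List<string>();
        int previousEnd = -1; // end of the previous tag, -1 before the first
        foreach (Match tag in TagRegex.Matches(html))
        {
            if (previousEnd >= 0 && tag.Index > previousEnd)
            {
                // Length is the gap between the previous tag's end
                // and the current tag's start.
                result.Add(html.Substring(previousEnd, tag.Index - previousEnd));
            }
            previousEnd = tag.Index + tag.Length;
        }
        return result;
    }

    public static void Main()
    {
        string html = "<div id=\"mainDiv\"> << Generate Report>> </div>";
        foreach (string text in TextBetweenTags(html))
            Console.WriteLine("[" + text + "]");
    }
}
```

Because the stand-in pattern requires a letter after '<', the stray "<<" and ">>" in the sample are left inside the text; input where a '<' in the text is followed directly by a letter would still be misread as a tag.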
That's exactly why you don't use regex for parsing HTML. You can, however, get around this problem by using a backreference in the regex:
(?<=<(\w+)[^<>]*>).*?(?=</\1>)
Though that won't always work, because
tags won't always have a closing tag
tag attributes can contain <>
there can be arbitrary spaces around the tag's name
Use an HTML parser like HtmlAgilityPack
Your code would be as simple as this
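// Requires the HtmlAgilityPack NuGet package.
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of all divs (SelectNodes returns null when nothing matches)
List<string> divs = doc.DocumentNode
    .SelectNodes("//div")
    .Select(x => x.InnerText).ToList();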
I want to extract all the texts between the specified opening and closing tags including the tags.
For eg:
Input : I am <NAME>Kai</NAME>
Text Extracted: <NAME>Kai</NAME>
It should extract the text based on the tag.
What is the regex for the above?
If the tag in question can't be nested (and assuming case insensitivity):
Regex regexObj = new Regex("<NAME>(?:(?!</NAME>).)*</NAME>", RegexOptions.Singleline | RegexOptions.IgnoreCase);
Be advised that this is a quick-and-dirty solution which might work fine for your needs, but might also blow up in your face (for example if tags occur within comments, if there is whitespace inside the tags, if there are any attributes inside the tags etc.). If any of these might be a problem for you, please edit your question with the exact specifications you need the regex to comply with.
Here is a regex which accepts any tag name: <(\w+)>.*?</\1>
\1 is back-referencing the group (\w+) and ensures that the closing tag must have the same name as the opening tag.
If you want to search for the special tag NAME then you could use this regex: <NAME>.*?</NAME>
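A small sketch of the backreference version in C#, assuming plain tags without attributes as in the pattern above (the class and method names are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // \1 must repeat whatever (\w+) captured, so <NAME> only pairs with </NAME>.
    // Singleline lets '.' cross line breaks inside the element.
    private static readonly Regex ElementRegex = new Regex(
        @"<(\w+)>.*?</\1>", RegexOptions.Singleline);

    public static List<string> Extract(string text)
    {
        return ElementRegex.Matches(text)
                           .Cast<Match>()
                           .Select(m => m.Value)
                           .ToList();
    }

    public static void Main()
    {
        foreach (string element in Extract("I am <NAME>Kai</NAME>"))
            Console.WriteLine(element);
    }
}
```

As with the other regex answers here, nested tags of the same name and tags with attributes are outside what this pattern handles.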
http://www.regular-expressions.info/reference.html You might find something useful here; they have a lot of material, especially for tags. Combine the examples to meet your requirements.
If I use this
string showPattern = @"return new_lightox\(this\);"">[a-zA-Z0-9(\s),!\?\-:'&%]+</a>";
MatchCollection showMatches = Regex.Matches(pageSource, showPattern);
I get some matches, but I want to get rid of [a-zA-Z0-9(\s),!\?\-:'&%]+ and use any char (.+) instead,
but if I do this I get no match at all.
What am I doing wrong?
By default "." does not match newlines, but the class \s does.
To let . match newlines, turn on Singleline (DOTALL) mode, either using a flag in the function call (as Abel's answer shows), or using the inline modifier (?s), like this for the whole expression:
@"(?s)return new_lightox\(this\);"">.+</a>"
Or for just the specific part of it:
@"return new_lightox\(this\);"">(?s:.+)</a>"
It might be better to take that a step further and do this:
@"return new_lightox\(this\);"">(?s:(?:(?!</?a).)+)</a>"
Which should prevent the closing </a> from belonging to a different link.
However, you need to be very wary here - it's not clear what you're doing overall, but regex is not a good tool for parsing HTML with, and can cause all sorts of problems. Look at using a HTML DOM parser instead, such as HtmlAgilityPack.
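To see the difference between these variants side by side, here is a minimal sketch. The class name, method names, and sample markup are assumptions; only the new_lightox fragment comes from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LinkTextDemo
{
    // Verbatim string: "" is an escaped double quote.
    private const string Prefix = @"return new_lightox\(this\);"">";

    // Default mode: '.' stops at a line break.
    public static bool MatchesDefault(string html)
    {
        return Regex.IsMatch(html, Prefix + ".+</a>");
    }

    // Singleline mode: '.' also matches line breaks.
    public static bool MatchesSingleline(string html)
    {
        return Regex.IsMatch(html, Prefix + ".+</a>", RegexOptions.Singleline);
    }

    // The lookahead is checked before every character, so the match
    // cannot run past the first opening or closing a tag.
    public static string FirstLinkText(string html)
    {
        Match m = Regex.Match(html, Prefix + "(?s:((?:(?!</?a).)+))</a>");
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<a onclick=\"return new_lightox(this);\">Show\nOne</a>"
                    + " <a href=\"#\">Two</a>";
        Console.WriteLine(MatchesDefault(html));
        Console.WriteLine(MatchesSingleline(html));
        Console.WriteLine(FirstLinkText(html));
    }
}
```

Keeping the lookahead inside the repeated group is what matters: placed once in front of .+, it would only be tested at the first character.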
You're matching a tag, so you probably want something along these lines, instead of .+:
string showPattern = @"return new_lightox\(this\);"">[^<]+</a>";
The reason that the match doesn't hit is possibly that you are missing the Singleline flag and the closing tag is on the next line. In other words, this should work too:
// The Singleline option changes the dot (.) to match newlines too
MatchCollection showMatches = Regex.Matches(
pageSource,
showPattern,
RegexOptions.Singleline);