Retrieving special InnerText from HTML using Regex in C# [duplicate] - c#

This question already has answers here:
RegEx match open tags except XHTML self-contained tags
(35 answers)
Closed 9 years ago.
I have an HTML file and I am trying to retrieve the valid inner text from each tag. I am using Regex for this with the following pattern:
(?<=>).*?(?=<)
It works fine for simple inner text. But I recently encountered the following HTML pieces:
<div id="mainDiv"> << Generate Report>> </div>
<input id="name" type="text">Your Name->></input>
I am not sure how to retrieve these inner texts with regular expressions. Can someone please help?
Thanks

I'd use a parser, but this is possible with RegEx using something like:
<([a-zA-Z0-9]+)(?:\s+[^>]+)?>(.+?)<\/\1>
Then you can grab the inner text with capture group 2.

You can always strip the tags themselves: a single HTML tag can be described by a regular grammar, even though HTML as a whole cannot. Replace "<[a-zA-Z][a-zA-Z0-9]*\s*([a-zA-Z]+\s*=\s*("|')(?("|')(?<=).|.)("|')\s*)*/?>" with string.Empty.
That regex should match any valid HTML tag.
EDIT:
If you do not want a concatenated result, you can replace with "<" instead of string.Empty and then split on '<', since '<' in HTML always starts a tag and should never be displayed. Or you can use the overload of Regex.Replace that takes a delegate and use the match index and match length (that may turn out to be more efficient). Or, even better, use Regex.Match and go from matched tag to matched tag: Substring(PreviousMatchIndex + PreviousMatchLength, CurrentMatchIndex - (PreviousMatchIndex + PreviousMatchLength)) should provide the inner text.
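The tag-to-tag walk described above can be sketched roughly as follows. This is only an illustration: the tag pattern is a deliberately simplified stand-in (an assumption, not the regex from this answer), and the names TagWalker and TextBetweenTags are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagWalker
{
    // Simplified tag pattern (an assumption): "<", an optional "/",
    // a letter, then everything up to the next ">".
    static readonly Regex Tag = new Regex(@"</?[a-zA-Z][^>]*>");

    // Returns the non-empty pieces of text found between consecutive tags.
    public static List<string> TextBetweenTags(string html)
    {
        var pieces = new List<string>();
        int previousEnd = 0; // PreviousMatchIndex + PreviousMatchLength
        foreach (Match m in Tag.Matches(html))
        {
            if (m.Index > previousEnd)
                pieces.Add(html.Substring(previousEnd, m.Index - previousEnd));
            previousEnd = m.Index + m.Length;
        }
        // Text after the last tag, if any.
        if (previousEnd < html.Length)
            pieces.Add(html.Substring(previousEnd));
        return pieces;
    }

    static void Main()
    {
        string html = "<div id=\"mainDiv\"> << Generate Report>> </div>";
        foreach (string text in TextBetweenTags(html))
            Console.WriteLine("[" + text + "]"); // [ << Generate Report>> ]
    }
}
```

Because the simplified pattern requires a letter right after "<" or "</", the stray "<<" in the sample is left alone and stays part of the text. Attribute values that contain ">" would still break it.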

That's exactly why you don't use regex for parsing HTML. You can get around this particular problem by using a backreference in the regex:
(?<=<(\w+)[^>]*>).*?(?=</\1>)
Though that won't always work, because:
tags won't always have a closing tag
tag attributes can contain < and >
there can be arbitrary spaces around the tag's name
Use an HTML parser like HtmlAgilityPack.
Your code would be as simple as this:
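// Requires the HtmlAgilityPack package, plus using System.Linq and System.Collections.Generic
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// InnerText of all divs (SelectNodes returns null when nothing matches)
List<string> divs = doc.DocumentNode
    .SelectNodes("//div")
    .Select(x => x.InnerText)
    .ToList();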

Related

C# Regex multiple matching

I have this regex to extract paragraphs that are outside of a table
((?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:tbl>)|(?<=<\/w:tbl>)<w:p [^>]*>.*?<\/w:p>(?=<w:sectPr .*>))
The problem is that it reads all the paragraphs as if they were one paragraph (from the first opening tag to the last closing tag, swallowing the intermediate paragraphs).
Below is an example of the text. In this case it matches one paragraph instead of 3:
</w:tr></w:tbl><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00F24C60" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:p w:rsidR="00346D4D" w:rsidRPr="00AC7B53" w:rsidRDefault="00F24C60" w:rsidP="009D46A1"><w:pPr><w:spacing w:before="240" w:after="240"/></w:pPr><w:r><w:t></w:t></w:r></w:p><w:tbl><w:tblPr>
Any help to make it match each paragraph alone (3 paragraphs)?
Thanks.
I think you can't do this with a single regex, because you want to match repeated groups inside other tags. A regex knows nothing about structure; it just scans the string from beginning to end. Take the string eabcabce: if I need all the abc groups I can use (abc), but I can't say that I want only the abc groups between the e's.
You can use an XML parser.
Alternatively, you can try two regexes for this particular case:
Get the block of paragraphs between the tbl tags with your regex
Get the individual paragraphs from that block with this regex: (<w:p [^>]*>.*?<\/w:p>)
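The two steps above might look roughly like this. It is a sketch under assumptions: the first pattern is a shortened form of the regex from the question (the two lookaheads are merged and the sectPr one is simplified), the sample XML is abbreviated, and the names ParagraphSplitter and Split are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ParagraphSplitter
{
    // Step 1: the whole run of paragraphs that follows a table and is
    // followed by another table or by the section properties.
    static readonly Regex Block = new Regex(
        @"(?<=</w:tbl>)<w:p [^>]*>.*?</w:p>(?=<w:tbl>|<w:sectPr )",
        RegexOptions.Singleline);

    // Step 2: each individual paragraph inside such a run.
    static readonly Regex Paragraph = new Regex(
        @"<w:p [^>]*>.*?</w:p>", RegexOptions.Singleline);

    public static List<string> Split(string xml)
    {
        var paragraphs = new List<string>();
        foreach (Match block in Block.Matches(xml))
            foreach (Match p in Paragraph.Matches(block.Value))
                paragraphs.Add(p.Value);
        return paragraphs;
    }

    static void Main()
    {
        string xml = "</w:tr></w:tbl>"
            + "<w:p w:rsidR=\"1\"><w:pPr/><w:r><w:t>A</w:t></w:r></w:p>"
            + "<w:p w:rsidR=\"2\"><w:pPr/><w:r><w:t>B</w:t></w:r></w:p>"
            + "<w:p w:rsidR=\"3\"><w:pPr/><w:r><w:t>C</w:t></w:r></w:p>"
            + "<w:tbl><w:tblPr>";
        Console.WriteLine(Split(xml).Count); // 3
    }
}
```

Like the original pattern, this only sees paragraphs whose opening tag has attributes ("<w:p " with a trailing space), and it would be confused by paragraphs nested inside other paragraphs.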
Some links:
Why not to parse HTML with regex (I think your XML is close to HTML :)): RegEx match open tags except XHTML self-contained tags
https://www.regextester.com/

Regular expression for anchor tag in c# [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 7 years ago.
My anchor tag looks like this:-
<a href="/as" title="asd" page="as" name="asd" reference="Yes" type="relativepath">as
</a>
I tried in this way:-
<a [^>]*?>(?<text>.*?)</a>
It works fine when the closing anchor tag </a> is on the same line.
But in my case the closing anchor tag comes on the next line.
I need a regular expression that also works when the closing anchor tag is on the next line.
Suggestions welcome.
You should use the (?s) inline option:
(?s)<a [^>]*?>(?<text>.*?)</a>
In C#, you can also use RegexOptions.Singleline option the following way:
var input = "<a href=\"/as\" title=\"asd\" page=\"as\" name=\"asd\" reference=\"Yes\" type=\"relativepath\">as\r\n</a>";
var regex = new Regex(@"<a [^>]*?>(?<text>.*?)</a>", RegexOptions.Singleline);
var result2 = regex.Match(input).Value;
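Output (the whole anchor tag from the question, including the line break):
<a href="/as" title="asd" page="as" name="asd" reference="Yes" type="relativepath">as
</a>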
EDIT:
This is an updated version of the regex that takes into account <a> tags that do not have attributes (which is next to impossible, but let's imagine :)), and also makes it case-insensitive (who knows, maybe <A HREF="SOMETHING_HERE"> can also occur):
var regex = new Regex(@"(?i)<a\b[^>]*?>(?<text>.*?)</a>", RegexOptions.Singleline);
Just use the DOTALL modifier, which makes the dot in your regex match line breaks as well.
@"(?s)<a [^>]*?>(?<text>.*?)</a>"
OR
You could use a negated character class.
@"<a [^>]*?>(?<text>[^<>]*)</a>"
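To get just the link text rather than the whole match, the named group can be read from the match. A minimal sketch, assuming a shortened version of the anchor from the question; the names AnchorText and Extract are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorText
{
    // Singleline lets "." match line breaks; IgnoreCase also accepts <A ...>.
    static readonly Regex Anchor = new Regex(
        @"<a\b[^>]*>(?<text>.*?)</a>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the trimmed text of the first anchor, or null if there is none.
    public static string Extract(string html)
    {
        Match m = Anchor.Match(html);
        return m.Success ? m.Groups["text"].Value.Trim() : null;
    }

    static void Main()
    {
        string input = "<a href=\"/as\" title=\"asd\">as\r\n</a>";
        Console.WriteLine(Extract(input)); // as
    }
}
```

Match.Value gives the entire matched tag, while Groups["text"] gives only what sits between the opening and closing tags.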

How to get text "out of" by Regex

I have a small problem. I'm trying to get the text which is outside of HTML elements.
Example input:
I want this text I want this text I want this text <I don't want this text/>
I want this text I wan this text <I don't>want this</text>
Does anybody know how this can be done with regex? I thought I could do it by deleting the element text. Does anybody know another solution for this problem? Please help me.
Instead of regex, which is not suitable for parsing HTML in general (especially malformed HTML), use an HTML parser like the HTML Agility Pack.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
I agree that anything non-trivial should be done with an HTML parser (the Agility Pack is excellent if you use .NET), but for a small requirement like this it's more than likely overkill.
Then again, an HTML parser knows more about the quirks and edge cases that HTML is full of. Be sure to test well before using a regex.
Here you go
<.*?>.*?<.*?>|<.*?/>
It also correctly ignores
<I don't>want this</text>
and not just the tags
In C# this becomes
string resultString = null;
resultString = Regex.Replace(subjectString, "<.*?>.*?<.*?>|<.*?/>", "");
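As a quick check of that replacement, here is a small sketch that runs it on the sample lines from the question, plus one extra input (made up for illustration) where the pattern removes too much. The names OutsideText and Strip are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class OutsideText
{
    // Removes "<...>...<...>" pairs and "<.../>" elements from the input.
    public static string Strip(string input)
    {
        return Regex.Replace(input, "<.*?>.*?<.*?>|<.*?/>", "");
    }

    static void Main()
    {
        // Self-closing element: only the element is removed.
        Console.WriteLine(Strip("I want this text <I don't want this text/>"));
        // Paired element: tags and their content are removed.
        Console.WriteLine(Strip("I want this text <I don't>want this</text>"));
        // Caveat: a self-closing tag followed by a paired tag on the same
        // line is treated as one pair, so this prints "a c</i>".
        Console.WriteLine(Strip("a <br/> b <i>c</i>"));
    }
}
```

The last call shows why it is worth testing against real input first: the first alternative pairs any two tags on a line, whether or not they belong together.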
Try this
(?<!<.*?)([^<>]+)
Explanation
@"
(?<! # Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind)
< # Match the character “<” literally
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
( # Match the regular expression below and capture its match into backreference number 1
[^<>] # Match a single character NOT present in the list “<>”
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
"

Modifying a regex match value

I want to locate all image tags in my HTML whose src does not contain http:// and prepend http:// to the src attribute.
I have got the regex to find all img tags whose src does not start with http://. I'm having some trouble prepending http:// to the src attribute alone. How can I achieve this using a regex replace?
<img [^<]*src="(?!http://)(?<source>[^"]*)"[^<]*/>
Source will contain the src value. I just need it to say $2 = "http://" + $2. How can I write this in C# code?
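Since you don't want to break existing tags, you will need to put the parts of the string you are not interested in into groups of their own, so that you can include those parts of the match in the replacement pattern:
(<img [^<]*src=")(?!http://)(?<source>[^"]*)("[^<]*/>)
Then the replace is trivial:
regex.Replace(input, "$1http://$3$2");
Note that .NET numbers unnamed groups first, so the named group source is $3 here and the closing part of the tag is $2; "$1http://${source}$2" is equivalent.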
(Also, this might work for your application use case, but I should mention, that in general it is not considered a good idea to parse HTML with regex)
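Put together, the replacement could look like the sketch below. It uses ${source} instead of the group number to sidestep the numbering question; the sample img tags and the names ImgSrcFixer and Fix are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImgSrcFixer
{
    // Group 1: everything up to and including src="
    // Group "source": the src value, only if it does not start with http://
    // Group 2: the closing quote and the rest of the tag
    static readonly Regex Img = new Regex(
        @"(<img [^<]*src="")(?!http://)(?<source>[^""]*)(""[^<]*/>)");

    public static string Fix(string html)
    {
        return Img.Replace(html, "$1http://${source}$2");
    }

    static void Main()
    {
        Console.WriteLine(Fix("<img alt=\"x\" src=\"images/a.png\" />"));
        // <img alt="x" src="http://images/a.png" />
        Console.WriteLine(Fix("<img src=\"http://example.com/a.png\" />"));
        // unchanged, because the negative lookahead rejects it
    }
}
```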

A regular expression for anchor html tag in C#?

I need a regular expression in C# for an anchor tag in HTML source code, as general as possible. Consider this HTML code:
<a id="[constant]"
href="[specific]"
>GlobalPlatform Card Specification 2.2
March, 2006</a>
By [constant] I mean the value is a constant string, so there is no problem with it. By [specific] I mean the address is a simple and specific string, so the regular expression for it is simple. The main problem is that I cannot handle the newline character in the middle of the title of the anchor tag. I previously wrote this regular expression, which works well except for handling the newline character inside the title of the anchor tag.
<a[\\s\\n\\r]+id=\"[constant]"[\\s\\n\\r]+href=\"[specific]"[\\s\\n\\r]*>[\\s\\n\\r]*[^\\n\\r]+[\\s\\n\\r]*</a>
Please help me
You should stay away from regular expressions when it comes to parsing HTML and use an HTML parser like the HTML Agility Pack.
To help you get started, here is how simple it can be to parse that single anchor tag:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<a id=""[constant]""
href=""[specific]""
>GlobalPlatform Card Specification 2.2
March, 2006</a>
");
var anchor = doc.DocumentNode.Element("a");
Console.WriteLine(anchor.Id);
Console.WriteLine(anchor.Attributes["href"].Value);
Beats regular expressions, don't you think? :)
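If you are using C#, you can pass an option when creating the Regex. Note that RegexOptions.Multiline only changes how ^ and $ match; to let . match newline characters you need RegexOptions.Singleline:
Regex r = new Regex(pattern, RegexOptions.Singleline);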
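If a regex is used anyway, one way to tolerate the line break inside the title is to match the title with a negated character class, which needs no option at all. This is a sketch under assumptions: the id and href values in the sample are invented placeholders for [constant] and [specific], and the names AnchorTitle and Title are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class AnchorTitle
{
    // \s covers spaces and line breaks between attributes;
    // [^<]* lets the title span several lines.
    static readonly Regex Anchor = new Regex(
        @"<a\s+id=""(?<id>[^""]*)""\s+href=""(?<href>[^""]*)""\s*>(?<title>[^<]*)</a>");

    // Returns the title with runs of whitespace collapsed to single spaces,
    // or null if no anchor of this shape is found.
    public static string Title(string html)
    {
        Match m = Anchor.Match(html);
        if (!m.Success) return null;
        return Regex.Replace(m.Groups["title"].Value.Trim(), @"\s+", " ");
    }

    static void Main()
    {
        string html = "<a id=\"spec\"\r\nhref=\"/docs/spec.pdf\"\r\n>"
            + "GlobalPlatform Card Specification 2.2\r\nMarch, 2006</a>";
        Console.WriteLine(Title(html));
        // GlobalPlatform Card Specification 2.2 March, 2006
    }
}
```

This assumes the attributes appear in exactly this order and that the title contains no nested tags; an HTML parser avoids both restrictions.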
