Ignoring &nbsp; when parsing with HtmlAgilityPack - c#

I'm parsing an HTML table in C# using Html Agility Pack, and the table contains non-breaking spaces.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(page);
Where page is a string containing a table with the special character &nbsp; within the text.
<td>&nbsp;test</td>
<td>number = 123&nbsp;</td>
Using SelectSingleNode(".//td").InnerText returns text that still contains these special characters, but I want to ignore them.
Is there an elegant way to ignore them (with or without the help of Html Agility Pack) without modifying the source table?

You could use HtmlDecode
string foo = HttpUtility.HtmlDecode("Special char: &nbsp;");
This gives you the decoded string, in which the entity has become the actual non-breaking space character (U+00A0):
Special char:

The "Special Character" non-breaking-space of which you speak is a valid character which can perfectly legitimately appear in text, just as "fancy quotes", em-dash etc can.
Often we want to treat certain characters as being equivalent.
So you might want to treat an em-dash, en-dash and minus sign/dash as
being the same.
Or fancy quotes as the same as straight quotes.
Or the non-breaking-space as an ordinary space.
However this is not something HTML Agility pack can help with. You need to use something like string.Replace or your own canonicalization function to do this.
I would suggest something like:
static string CleanupStringForMyApp(string s){
// replace characters with their equivalents
s = s.Replace('\u00A0', ' '); // U+00A0 is the non-breaking space
// Add any more replacements you want to do here
return s;
}
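A slightly fuller sketch of such a canonicalization function, combining the HtmlDecode suggestion above with character replacement. The class and method names are made up for illustration, and it uses System.Net.WebUtility.HtmlDecode rather than HttpUtility so that it does not need a System.Web reference. Which characters count as "equivalent" is an assumption; adjust the mapping to your application.

```csharp
using System.Net;
using System.Text;

static class TextCanonicalizer
{
    // Decodes HTML entities, then maps look-alike characters to plain equivalents.
    // The mapping below is only an example set, not an exhaustive list.
    public static string Canonicalize(string s)
    {
        if (string.IsNullOrEmpty(s)) return s;

        string decoded = WebUtility.HtmlDecode(s); // "&nbsp;" becomes U+00A0
        var sb = new StringBuilder(decoded.Length);
        foreach (char c in decoded)
        {
            switch (c)
            {
                case '\u00A0': sb.Append(' '); break;                  // non-breaking space
                case '\u2013': case '\u2014': case '\u2212':
                    sb.Append('-'); break;                             // en dash, em dash, minus sign
                case '\u2018': case '\u2019': sb.Append('\''); break;  // curly single quotes
                case '\u201C': case '\u201D': sb.Append('"'); break;   // curly double quotes
                default: sb.Append(c); break;
            }
        }
        return sb.ToString();
    }
}
```

You would call it on the node text, e.g. TextCanonicalizer.Canonicalize(node.InnerText), so the source table itself is never modified.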

Related

Regex to find iframe tags and retrieve attributes

I am trying to retrieve iframe tags and attributes from an HTML input.
Sample input
<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>
I have been trying to collect them using the following regex:
<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>
This results in
This is exactly the format I want.
The problem is, if the HTML attributes are in a different order this regex won't work.
Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?
Here is a regex that will ignore the order of attributes:
(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>
RegexStorm demo
C# sample code:
var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
var matches = rx.Matches(input).Cast<Match>().ToList();
Output:
Regular expressions match patterns, and the structure of your string defines which pattern to use. So if you want to use regular expressions, the order of the attributes matters.
You can deal with this in 2 ways:
The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.
The second, bad, and not recommended way is to break your matching into two parts. First use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use multiple smaller regular expressions to find the settings you are after. The above regex obviously fails if your iframe is structured like so: <iframe.../>. This should give you a hint as to why you should not do HTML parsing through regular expressions.
As stated, you should go with the first option.
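For illustration only, the two-step regex approach described above might look roughly like this. The class name, method name, and both patterns are assumptions of mine, not taken from the answer, and it is still fragile in the usual ways (unquoted attribute values, a > inside an attribute value, commented-out markup).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class IframeAttributeScanner
{
    // Step 1: grab each opening <iframe ...> tag; "attrs" holds the raw attribute text.
    static readonly Regex TagRx =
        new Regex(@"<iframe\b(?<attrs>[^>]*)>", RegexOptions.IgnoreCase);

    // Step 2: pull name="value" or name='value' pairs out of that text, in any order.
    static readonly Regex AttrRx =
        new Regex(@"(?<name>[\w-]+)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)')");

    // Returns one dictionary of attribute name -> value per iframe found.
    public static List<Dictionary<string, string>> Scan(string html)
    {
        var result = new List<Dictionary<string, string>>();
        foreach (Match tag in TagRx.Matches(html))
        {
            var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match a in AttrRx.Matches(tag.Groups["attrs"].Value))
                attrs[a.Groups["name"].Value] = a.Groups["value"].Value;
            result.Add(attrs);
        }
        return result;
    }
}
```

Because the attributes are collected into a dictionary, their order in the markup no longer matters; you look up "width", "height" and "src" by name for each iframe.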
You can use this regex
<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>
It matches each 'name'='value' pair repeatedly and stores them in the same order in the matches, so you can iterate through the matches to get names and values sequentially. It caters for most characters in the value, but you may add a few more if needed.
With Html Agility Pack (available via NuGet):
using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html"); // or doc.LoadHtml(/* content string */);
            // SelectNodes returns null when nothing matches, so guard before iterating.
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            if (iframes == null) return;
            foreach (HtmlNode iframe in iframes)
            {
                // The second argument is the default returned when the attribute is missing.
                Console.WriteLine(iframe.GetAttributeValue("width", "null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src", "null"));
            }
        }
    }
}
You need to use an OR operator (|). See changes below
<iframe.+?((width=[\"'](?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>

Removing HTML from a string

I have a table (a Wijmo Grid). The column Log takes some text.
The user is allowed to write HTML in the text, because the same text is also used when mailed to make it look pretty and well styled.
Let's say the text is:
var text = "Hello friend <br> How are you? <h1> from me </h1>";
Is there any method, such as JSON.stringify() or HTML.encode(), that I can/should use to get:
var textWithoutHtml = magic(text); // "Hello friend How are you? from me"
One of the problems is that if the text includes "<br>", it breaks onto the next line in the table row, and the top half of the second line is visible in the row, which doesn't look good.
var text = "Hello friend <br> How are you? <h1> from me </h1>";
var newText = text.replace(/(<([^>]+)>)/ig, "");
fiddle: http://jsfiddle.net/EfRs6/
As far as I understood your question, you can encode the values like this in C#:
string encodedValue= HttpUtility.HtmlEncode(txtInput.Text);
Note: here txtInput is the id of TextBox on your page.
You may try like this:
string s = Regex.Replace("Hello friend <br> How are you? <h1> from me </h1>", @"<[^>]+>|&nbsp;", "").Trim();
You can also check the HTML Agility Pack
This is an agile HTML parser that builds a read/write DOM and supports
plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is a .NET code library that allows
you to parse "out of the web" HTML files. The parser is very tolerant
with "real world" malformed HTML. The object model is very similar to
what proposes System.Xml, but for HTML documents (or streams).
/<[^>]+>|&nbsp;/
1st Alternative: <[^>]+>
  < matches the character < literally
  [^>]+ matches a single character not present in the list below
    Quantifier: between one and unlimited times, as many times as possible, giving back as needed [greedy]
    > a single character in the list, > literally (case sensitive)
  > matches the character > literally
2nd Alternative: &nbsp;
  &nbsp; matches the characters &nbsp; literally (case sensitive)
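Note that removing the tags outright leaves doubled spaces behind ("Hello friend  How are you?"). A rough sketch that builds on the pattern explained above, but replaces each tag with a space and then collapses whitespace, could look like this. The class and method names are hypothetical, and it uses System.Net.WebUtility.HtmlDecode to turn entities such as &nbsp; into characters before normalizing.

```csharp
using System.Net;
using System.Text.RegularExpressions;

static class HtmlStripper
{
    // Removes anything that looks like a tag, decodes entities,
    // and collapses runs of whitespace into single spaces.
    public static string StripTags(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;

        // Replace each tag with a space so "a<br>b" does not become "ab".
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");

        // "&nbsp;" decodes to U+00A0; map it to an ordinary space.
        string decoded = WebUtility.HtmlDecode(noTags).Replace('\u00A0', ' ');

        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For the sample text this yields "Hello friend How are you? from me". The trade-off is that inline tags inside a word (for example <b>H</b>ello) would introduce a space, so this is only suitable for display purposes like the grid cell in the question.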

Regex Strip Span tags completely

I want to strip span tags from an HTML string.
I have an HTML string:
<span>Roskilde</span><span>Festival</span>
I need to strip it down to : Roskilde Festival.
At the moment I have a regex which should be able to find all span tags, but it's failing:
System.Collections.Specialized.StringCollection sc = new System.Collections.Specialized.StringCollection();
sc.Add(@"/<\s*\/?\s*span\s*.*?>/g");
foreach (string s in sc)
{
k = System.Text.RegularExpressions.Regex.Replace(pContent, s, "", System.Text.RegularExpressions.RegexOptions.IgnoreCase);
}
k = System.Text.RegularExpressions.Regex.Replace(pContent, @"&nbsp;", @" ");
Any Ideas?
P.S. I don't want to use Html Agility Pack.
Regexes are not the best way to process HTML. Use an HTML parser that understands nesting, because regexes do not.
Consider looking at negated character classes, i.e. <whatever[^>]*>
And I guess you copied this from somewhere, but your regex is probably not proper C# syntax (the extra / and /g). Reread a tutorial on regexes in C#! Try this string:
Example /<span>/g does this tag get removed?
What you probably meant to use was:
sc.Add(@"</?span( [^>]*|/)?>");
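Putting that together, a minimal sketch that produces "Roskilde Festival" from the sample might look like the following. The class and method names are invented here, and the pattern is a small variation on the one above (\s instead of a literal space). It replaces each span tag with a space and then collapses whitespace, which is an assumption about the desired output.

```csharp
using System.Text.RegularExpressions;

static class SpanStripper
{
    // Matches <span>, </span>, <span attr="..."> and <span/>, but not e.g. <spanner>.
    static readonly Regex SpanTag =
        new Regex(@"</?span(\s[^>]*|/)?>", RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        // Each tag becomes a space so adjacent spans do not run together.
        string spaced = SpanTag.Replace(html, " ");
        // Collapse the resulting runs of whitespace and trim the ends.
        return Regex.Replace(spaced, @"\s+", " ").Trim();
    }
}
```

Note that the pattern is passed without the surrounding / and the trailing /g from the question; in .NET, options such as case-insensitivity go in RegexOptions, and Replace already replaces every match.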

Use RegEx to Find and Replace Specific HTML Tags

I have a string that contains dynamic HTML content.
I want to be able to find all occurrences of specific HTML tags and replace them, but not the content within them.
The specific HTML tags would be for a table - i.e. TABLE, TR, and TD. The tags may contain attributes, or they may not. How would one go about doing this in C#?
Thanks in advance for any help!
This function might be sufficient:
public static string ReplaceTag(string input, string soughtTag, string replacementTag)
{
return Regex.Replace(input, "(</?)" + soughtTag + @"((?:\s+.*?)?>)", "$1" + replacementTag + "$2");
}
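A slightly more defensive variant of the same idea, as a sketch rather than a drop-in replacement: it escapes the tag name, matches case-insensitively (so TABLE, TR and TD are found as well as table, tr and td), and uses ${1} so a replacement name that starts with a digit is not misread as part of the group number. The class name is hypothetical, and it still breaks on attribute values that contain >.

```csharp
using System.Text.RegularExpressions;

static class TagRenamer
{
    // Renames opening and closing tags, keeping attributes and content untouched.
    public static string ReplaceTag(string input, string soughtTag, string replacementTag)
    {
        // The lookahead ensures "td" does not match the start of a longer name like "tdx".
        string pattern = "(</?)" + Regex.Escape(soughtTag) + @"(?=[\s/>])([^>]*>)";

        // Escape "$" so the replacement name is always treated literally.
        string replacement = "${1}" + replacementTag.Replace("$", "$$") + "${2}";

        return Regex.Replace(input, pattern, replacement, RegexOptions.IgnoreCase);
    }
}
```

For example, TagRenamer.ReplaceTag(html, "td", "div") turns <TD class="a">x</td> into <div class="a">x</div> and leaves the text between the tags alone.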
Don't use Regexs. Use the Html Agility Pack.
See this question for why not.
e = "(< *?/*)div( +?|>)";
repl = "\\1boo\\2";
Frankly I am befuddled by this mantra being imposed on everyone to never use regex for html.

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot@somedomain.com wrote:<br/><div class=\"quotedReply\">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast@somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class="quotedReply">>
should become this:
<div class="quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?
Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup; if this is under your control, you should consider dealing with it there.
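As a rough sketch of what encoding before concatenation could look like, assuming the quoted lines are available as plain strings before the markup is built (which the question does not confirm). The class and method names are made up, and System.Net.WebUtility.HtmlEncode is used here in place of HttpUtility.HtmlEncode to avoid a System.Web reference.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

static class QuotedReplyBuilder
{
    // Encodes each plain-text line first, then wraps it in markup,
    // so only the text content is escaped and the tags stay intact.
    public static string Build(IEnumerable<string> lines)
    {
        var sb = new StringBuilder("<div class=\"quotedReply\">");
        foreach (string line in lines)
        {
            sb.Append(WebUtility.HtmlEncode(line)).Append("<br/>");
        }
        return sb.Append("</div>").ToString();
    }
}
```

Done this way, the > characters in the quoted text come out as &gt; and there is nothing left to fix up with a regex afterwards.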
The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
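Wrapped in a small helper, the capture-and-plug-back-in technique might be used like this. The class and method names are hypothetical; the pattern is the one from this answer. It assumes every < in the input opens a real tag and that no attribute value contains >, and it makes no attempt to handle comments or CDATA.

```csharp
using System.Text.RegularExpressions;

static class StrayBracketEscaper
{
    // \G anchors each match where the previous one ended. Group 1 swallows any run of
    // plain text and complete tags; the final > is therefore one that is outside a tag.
    static readonly Regex Stray =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    public static string Escape(string html)
    {
        // Put the captured prefix back unchanged and swap only the stray > for its entity.
        return Stray.Replace(html, "${1}&gt;");
    }
}
```

Once no stray > remains, the regex stops matching and the rest of the string is left as it is, so the tags survive untouched.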
Maybe read your HTML into an XML parser which should take care of the conversions for you.
Are you talking about the > characters inside the content of an HTML element (like in JavaScript's innerText), or in the argument list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple. Just locate any > character and replace it with &gt; (I'd also do the same for < with &lt;), but the HTML rendering engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry
Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.
Steve_C, you may try this regex. It captures the HTML tag name in group 1, and the text between the tags in group 2. I didn't fully test this; I'm just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
