How to truncate an HTML string without leaving it malformed?

I have to display the first N (say 50 or 100) characters of an HTML string, and the result must be well-formed HTML. If I apply a simple substring, I get a malformed HTML string.
E.g.
Sample string: "<html><body><a href="http://foo.com">foo</a></body></html>"
Truncated string: "<html><body><a href="http://foo.com">foo<"
This gives me malformed HTML :(
Any ideas on how to achieve this?

You can try using the HTML Agility Pack - it will parse out the HTML for you, but you will need to figure out how to produce a truncated version yourself. It should make things a lot easier though.

Parse the HTML into a DOM tree. Start with the deepest/innermost elements: remove the content of the innermost node (or the node itself, if it has no content), then check the string length.
Rinse, lather, repeat.
This may truncate your string to the empty string, if your desired length is small enough.
For extra kicks, you could try removing attributes of the nodes as you go.

I've seen some forum systems simply append a </b></u></i></s> after every single post. You could approach this in a similar fashion.
Of course, it's ugly, and it wouldn't fix that trailing <.
That is by far the simplest method. A better method would be to actually generate a tree and kick nodes off until you meet the requirement.
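
Neither answer above comes with code. As a rough illustration of the "keep it well-formed while cutting" idea, here is a minimal sketch that does not build a DOM at all: it scans the string, keeps a stack of open tags, counts only text characters against the limit, and closes whatever is still open when the limit is reached. This is a swapped-in, simpler technique than the DOM pruning described above, not the HTML Agility Pack approach. The names HtmlTruncator, Truncate and maxTextChars are hypothetical, and the sketch assumes simple markup with no comments, doctype, script blocks, or stray < characters in the text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

static class HtmlTruncator
{
    // Elements that never take a closing tag (partial list, assumption).
    static readonly HashSet<string> VoidTags =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Matches either one whole tag (groups "close" and "name") or one run of text.
    static readonly Regex Token =
        new Regex(@"<(?<close>/)?(?<name>[a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    public static string Truncate(string html, int maxTextChars)
    {
        var output = new StringBuilder();
        var open = new Stack<string>();   // tags opened but not yet closed
        int remaining = maxTextChars;     // text characters we may still emit

        foreach (Match m in Token.Matches(html))
        {
            if (m.Groups["name"].Success)
            {
                string name = m.Groups["name"].Value;
                if (m.Groups["close"].Success)
                {
                    // Only accept a closing tag that matches the innermost open tag.
                    if (open.Count == 0 || !open.Peek().Equals(name, StringComparison.OrdinalIgnoreCase))
                        continue;
                    open.Pop();
                }
                else if (!m.Value.EndsWith("/>", StringComparison.Ordinal) && !VoidTags.Contains(name))
                {
                    open.Push(name);
                }
                output.Append(m.Value);
            }
            else if (m.Value.Length >= remaining)
            {
                // Limit reached inside this text run: keep the allowed part and stop.
                output.Append(m.Value, 0, remaining);
                break;
            }
            else
            {
                output.Append(m.Value);
                remaining -= m.Value.Length;
            }
        }

        // Close everything still open, innermost first.
        while (open.Count > 0)
            output.Append("</").Append(open.Pop()).Append('>');
        return output.ToString();
    }
}
```

For example, HtmlTruncator.Truncate("<p><b>foobar</b> baz</p>", 3) keeps the first three text characters and returns "<p><b>foo</b></p>". Counting text rather than markup is a design choice here; for real-world HTML a proper parser is still the safer route.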

Related

Remove everything except src in Image Tag using Regex

I want to remove everything except src from an image tag using regex. I am using C#, but I don't want to use HTMLAgilityPack; I want to do it with regex only.
How can I do this?
If the string is <img id="image" class="header" src="test.png">, then it should return <img src="test.png">
Image tag may contain many other extra properties.

To clarify my comments: normally I wouldn't recommend parsing HTML using regex. However, this is one of the few times when it's possible without ending up with a disastrously complicated regex string, because here you have a single node with one pair of matching angle brackets. In addition, the OP only needs a single attribute from this string. If he needed to do anything more complicated, I'd agree that he should use HTMLAgilityPack, but this is perfectly doable.
What you do is extract the src attribute from the string using this regex: (src=['\"].+?['\"]). Then you take what you extracted and paste it into a new string:
String newImgTag = String.Format("<img {0}>", srcMatch);
Again, if this were any more complicated (or if I had to do other HTML manipulation), I would just skip the regex and go for the established solutions like the aforementioned HTMLAgilityPack, because it offers far more support for HTML manipulation.
However, I don't view this as HTML manipulation, because you've got a single tag without even a matching closing tag. This is more like basic string manipulation. It's similar to raising a number to the second power: I doubt anyone would import an entire math library just for that; they'd just do N * N.
I fully expect and accept that people will downvote me for even considering to use Regex for this. Before you do so, however, read the post and think about it. This is one of those borderline cases where HTMLAgilityPack would make the project far more complicated without actually adding anything except that you're not using Regex. Regex has its uses, it's only when you abuse it that it becomes a monster to work with.
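
Put together, the two steps from this answer might look like the sketch below. The class and method names (ImgTagCleaner, KeepOnlySrc) are made up for illustration, and returning the input unchanged when no src is found is an assumption, not something the answer specifies.

```csharp
using System;
using System.Text.RegularExpressions;

static class ImgTagCleaner
{
    // The pattern from the answer, written as a C# verbatim string ("" is an escaped quote).
    static readonly Regex SrcAttribute = new Regex(@"src=['""].+?['""]");

    public static string KeepOnlySrc(string imgTag)
    {
        Match srcMatch = SrcAttribute.Match(imgTag);
        // No src attribute found: hand the tag back unchanged (assumed behavior).
        if (!srcMatch.Success)
            return imgTag;
        return String.Format("<img {0}>", srcMatch.Value);
    }
}
```

Note that this simple pattern would also match inside an attribute such as data-src, so it only suits input you control.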

Safely split/paginate an HTML String with ASP.NET

I have rendered one of my controls into a string. I want to safely split the html string. I don't want any hanging html tags. I am working on a pagination control adapter.
How can I split my string (at roughly, but no more than, a set number of chars) safely, taking HTML into account?

Take a look at HtmlAgilityPack. You can use it to parse and manipulate the html in your string without having to resort to regex.

If you're looking for a nice way to show the HTML code you should try HTML Tidy.
I have not used it with a limit on the number of characters per line, but I think HTML Tidy's wrap option might get you close to your target.

How to find a repeated string and the value between them using regexes?

How would you find the value of string that is repeated and the data between it using regexes? For example, take this piece of XML:
<tagName>Data between the tag</tagName>
What would be the correct regex to find these values? (Note that tagName could be anything).
I have found a way that works: find all the tagNames that sit in between a set of < >, then search from the opening tag to the end of the string for the first instance of the tagName, find the closing </tagName>, and work out the data between them. However, this is extremely inefficient and complex. There must be an easier way!
EDIT: Please don't tell me to use XMLReader; I doubt I will ever use my custom class for reading XML, I am trying to learn the best way to do it (and the wrong ways) through attempting to make my own.
Thanks in advance.

You can use: <(\w+)>(.*?)<\/\1>
Group #1 is the tag, Group #2 is the content.
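
In C#, that pattern could be applied roughly as follows. TagRegexDemo and Extract are hypothetical names; the \/ escape is dropped because .NET does not need it.

```csharp
using System;
using System.Text.RegularExpressions;

static class TagRegexDemo
{
    // Returns { tagName, content } for the first <x>...</x> pair, or an empty array.
    public static string[] Extract(string xml)
    {
        // \1 is a backreference: the closing tag must repeat whatever group 1 captured.
        Match m = Regex.Match(xml, @"<(\w+)>(.*?)</\1>");
        if (!m.Success)
            return new string[0];
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }
}
```

For the sample in the question, Extract("<tagName>Data between the tag</tagName>") yields "tagName" and "Data between the tag". It will not cope with attributes, nesting of the same tag name, or CDATA.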

Using regular expressions to parse XML is a terrible error.
This is efficient (it doesn't parse the XML into a DOM) and simple enough:
// Requires: using System; using System.IO; using System.Xml;
string s = "<tagName>Data between the tag</tagName>";
using (XmlReader xr = XmlReader.Create(new StringReader(s)))
{
    xr.Read();  // move to the <tagName> element
    Console.WriteLine(xr.ReadElementContentAsString());
}
Edit:
Since the actual goal here is to learn something by doing, and not to just get the job done, here's why using regular expressions doesn't work:
Consider this fairly trivial test case:
<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>
There are two elements with a tag name of "a" in that XML. The first has one text-node child with a value of "text1", and the second has one text-node child with a value of "text3". Also, there's a "b" element that contains a string of text that looks like an "a" element but isn't because it's enclosed in a CDATA section.
You can't parse that with simple pattern-matching. Finding <a> and looking ahead to find </a> doesn't begin to do what you need. You have to put start tags on a stack as you find them, and pop them off the stack as you reach the matching end tag. You have to stop putting anything on the stack when you encounter the start of a CDATA section, and not start again until you encounter the end.
And that's without introducing whitespace, empty elements, attributes, processing instructions, comments, or Unicode into the problem.
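
To make the stack idea concrete without hand-writing a tokenizer, the sketch below lets XmlReader do the tokenizing and only maintains the stack of open elements itself, recording the element path of every text or CDATA node. XmlPathWalker and TextWithPaths are hypothetical names, and the input used in the usage note is one well-formed reading of the test case (the exact placement of the CDATA section is an assumption).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;

static class XmlPathWalker
{
    // Returns one "path: text" entry per text or CDATA node, in document order.
    public static List<string> TextWithPaths(string xml)
    {
        var results = new List<string>();
        var open = new Stack<string>();   // start tags seen but not yet closed

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Empty elements (<x/>) produce no EndElement, so don't push them.
                        if (!reader.IsEmptyElement)
                            open.Push(reader.Name);
                        break;
                    case XmlNodeType.EndElement:
                        open.Pop();
                        break;
                    case XmlNodeType.Text:
                    case XmlNodeType.CDATA:
                        // Stack enumerates innermost first; reverse it to print outermost first.
                        results.Add(string.Join("/", open.Reverse()) + ": " + reader.Value);
                        break;
                }
            }
        }
        return results;
    }
}
```

Fed "<a><b><a>text1<b><![CDATA[<a>text2</a>]]></b></a></b>text3</a>", this reports "a/b/a: text1", "a/b/a/b: <a>text2</a>" and "a: text3": the CDATA content comes back as plain text, never as an "a" element.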

You can use a backreference like \1 to refer to an earlier match:
@"<([^>]*)>(.*)</\1>"
The \1 will match what was captured by the first parenthesized group.

With Perl:
my $tagName = 'some tag';
my $i; # some line of XML
$i =~ /\<$tagName\>(.+)\<\/$tagName\>/;
where $1 is now filled with the data you captured.

Going forward, if you get stuck, check out regexlib.com.
It's the first place I go when I get stuck on regex.

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff

I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// SelectNodes returns null when nothing matches the XPath expression.
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (var link in nodes)
    {
        Console.WriteLine(link.Attributes["href"].Value);
    }
}

Regular Expression would be my way. ;)

If the HTML you're using is XHTML compliant, you could load it as an XML document and then use XPath/XSL. Long-winded, but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.

In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
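
A middle ground between the plain replace and a full renderer is to let the regex consume whole tags and only wrap matches found outside them. The sketch below is one way to do that, under the assumption that tags contain no > inside attribute values; Highlighter and Highlight are hypothetical names. It still cannot find a word that is split across elements, as in the firstLetter example above.

```csharp
using System;
using System.Text.RegularExpressions;

static class Highlighter
{
    public static string Highlight(string html, string term)
    {
        // Alternative 1 swallows an entire tag; alternative 2 (group 1) is the search term.
        // Because tags are consumed first, the term is never matched inside markup.
        string pattern = @"<[^>]*>|(" + Regex.Escape(term) + ")";
        return Regex.Replace(html, pattern, m =>
            m.Groups[1].Success ? "<strong>" + m.Value + "</strong>" : m.Value);
    }
}
```

With this, Highlight("<a href=\"go.html\">let's go</a>", "go") leaves the href alone and wraps only the visible word.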

You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.

Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.

Convert > to HTML entity equivalent within HTML string

I'm trying to convert all instances of the > character to its HTML entity equivalent, &gt;, within a string of HTML that contains HTML tags. The furthest I've been able to get with a solution for this is using a regex.
Here's what I have so far:
public static readonly Regex HtmlAngleBracketNotPartOfTag = new Regex("(?:<[^>]*(?:>|$))(>)", RegexOptions.Compiled | RegexOptions.Singleline);
The main issue I'm having is isolating the single > characters that are not part of an HTML tag. I don't want to convert any existing tags, because I need to preserve the HTML for rendering. If I don't convert the > characters, I get malformed HTML, which causes rendering issues in the browser.
This is an example of a test string to parse:
"Ok, now I've got the correct setting.<br/><br/>On 12/22/2008 3:45 PM, jproot@somedomain.com wrote:<br/><div class"quotedReply">> Ok, got it, hope the angle bracket quotes are there.<br/>><br/>> On 12/22/2008 3:45 PM, > sbartfast@somedomain.com wrote:<br/>>> Please someone, reply to this.<br/>>><br/>><br/></div>"
In the above string, none of the > characters that are part of HTML tags should be converted to &gt;. So, this:
<div class"quotedReply">>
should become this:
<div class"quotedReply">&gt;
Another issue is that the expression above uses a non-capturing group, which is fine except for the fact that the match is in group 1. I'm not quite sure how to do a replace only on group 1 and preserve the rest of the match. It appears that a MatchEvaluator doesn't really do the trick, or perhaps I just can't envision it right now.
I suspect my regex could do with some lovin'.
Anyone have any bright ideas?

Why do you want to do this? What harm are the > doing? Most parsers I've come across are quite happy with a > on its own without it needing to be escaped to an entity.
Additionally, it would be more appropriate to properly encode the content strings with HttpUtility.HtmlEncode before concatenating them with strings containing HTML markup, so if this is under your control you should consider dealing with it there.
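
As a small sketch of "encode first, then add markup": the example below uses System.Net.WebUtility.HtmlEncode, which performs the same kind of encoding without a reference to System.Web. ReplyFormatter and QuoteLine are hypothetical names, and appending <br/> per line is only an assumption about how the quoted reply is assembled.

```csharp
using System;
using System.Net;

static class ReplyFormatter
{
    // Encodes one raw line of user text, then appends the markup that should stay markup.
    public static string QuoteLine(string rawLine)
    {
        // HtmlEncode turns >, <, & and quotes into entities, so the text can no longer break the HTML.
        return WebUtility.HtmlEncode(rawLine) + "<br/>";
    }
}
```

Here QuoteLine("> Ok, got it") gives "&gt; Ok, got it<br/>", so the quoting bracket is escaped before it ever meets a tag.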

The trick is to capture everything that isn't the target, then plug it back in along with the changed text, like this:
Regex.Replace(str, @"\G((?>[^<>]+|<[^>]*>)*)>", "$1&gt;");
But Anthony's right: right angle brackets in text nodes shouldn't cause any problems. And matching HTML with regexes is tricky; for example, comments and CDATA can contain practically anything, so a robust regex would have to match them specifically.
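
For reference, here is that replacement as a compilable sketch, written with a verbatim string literal and &gt; as the replacement text. AngleBracketEscaper and Escape are hypothetical names; as noted above, comments and CDATA sections are not handled.

```csharp
using System;
using System.Text.RegularExpressions;

static class AngleBracketEscaper
{
    // \G anchors each match where the previous one ended, so nothing is skipped.
    // Group 1 consumes runs of text and whole tags; the final > is therefore a stray one.
    static readonly Regex StrayGt =
        new Regex(@"\G((?>[^<>]+|<[^>]*>)*)>", RegexOptions.Compiled);

    public static string Escape(string html)
    {
        // Put back everything captured in group 1 and swap the stray > for its entity.
        return StrayGt.Replace(html, "$1&gt;");
    }
}
```

For instance, Escape("<div class='q'>> Ok<br/>><br/>") returns "<div class='q'>&gt; Ok<br/>&gt;<br/>": the brackets that close tags are untouched, the two quoting brackets are converted.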

Maybe read your HTML into an XML parser which should take care of the conversions for you.

Are you talking about the > chars inside an HTML element's text (like in JavaScript's innerText), or in the arguments list of an HTML tag?
If you just want to sanitize the text between the opening and closing tags, that should be rather simple. Just locate any > char and replace it with &gt; (I'd also replace < with &lt;), but the HTML render engine SHOULD take care of this for you...
Give an example of what you are trying to sanitize, and maybe we can find the best solution for it.
Larry

Could you read the string into an XML document, look at the values, and replace the > with &gt; in the values? This would require recursively going into each node in the document, but that shouldn't be too hard to do.

Steve_C, you may try this RegEx. It captures the name of any HTML tag in group 1 and the text between the tags in group 2. I didn't fully test this, just throwing it out there in case it might help.
<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>
