I have rendered one of my controls into a string. I want to safely split the html string. I don't want any hanging html tags. I am working on a pagination control adapter.
How can I split my string, around less than a set number of chars) safely taking HTML into account?
Take a look at HtmlAgilityPack. You can use it to parse and manipulate the html in your string without having to resort to regex.
If you're looking for a nice way to show the HTML code you should try HTML Tidy.
I did not use it with a limitation on the number of charecters per line, but I think HTML Tidy wrap option might get you close to your target.
Related
I need to retrieve some info from an html doc since the web service to get a json or an xml is still not ready. Im working with c# and using regular expressions to get the data i need from the html string. I've managed to get the div i want to work with from the whole html string but now i'm having trouble getting the info between the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag but what i really want is the content between the first span tag.
Here's the regular expression i've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this but didnt work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use Htmlagilitypack since my client does not want us to use any external library. I've also heard about using the XmlReader but i'm not sure the structure of the html will match an xml one accordingly.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
Although, I agree with other answers, you should use something like Path instead of Regexes for parsing html.
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting some external 3rd party library in your solution, the solution to that is to go fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
I am trying to get a page title from page source of different pages. But lets say some pages have title like this:
"This is an example," ABC.
It has some html in it like """. If i use string in c# to get this title i get the whole thing and while displaying it displays it like above which is wrong. Is there any way to ignore or to take into account html values in c#?
I am also using htmlagilitypack so anything in that will do too.
You can use WebUtility.HtmlDecode to decode html, link on MSDN:
WebUtility.HtmlDecode(""This is an example," ABC.");
just use:
using System.Net;
The result will be: "\"This is an example,\" ABC."
You also can use HtmlEntity.DeEntitize in HTML Agility Pack:
HtmlEntity.DeEntitize(string text)
You don't know what you can find in the page title. Sometimes is a whole mess there. My suggestion is to get the string as it is and process it before to show/save it.
In this case, the solution is simple: replace the
"
with corresponding char.
Each time you read a HTML document to extract some tags, take care to tags never closed. If the user forget to close the title tag... you'll get in that line the whole page!
I have to display first N (for example say 50 or 100) characters out of entire html string. I have to display well formated html.If i apply simple substring that will get me a malformated html string
E.g.
Sample string : "<html><body>foo</body></html>"
trucated string: "<html><body><a href="http://foo.com">foo<"
This will get me malformated html :(
Any ideas on how to achieve this ??
You can try using the HTML Agility Pack - it will parse out the HTML for you, but you will need to figure out how to produce a truncated version yourself. It should make things a lot easier though.
Parse the HTML into a DOM tree. Start with the deepest/innermost elements and
remove the content of the innermost node, or the node if it has no content
check the string length.
Rinse, lather, repeat.
This may truncate your string to the empty string, if your desired length is small enough.
For extra kicks, you could try removing attributes of the nodes as you go.
I've seen some forum systems simply append a </b></u></i></s> after every single post. You could approach this in a similar fashion.
Of course, its ugly and it wouldn't fix that trailing <
That is by far the simplest method. Better method would actually be generating a tree and... kicking nodes off until you meet the requirement.
I want to strip all tags, remove the [show][Hide] stuffs from wikipedia, or is there some website that makes pages in more readable format.
Please I am aware of the Wikipedia printable version, but I don't need any tags in that, as I have some other use. So please answer the original question only, about any website or webservice or code snippets in php/C# to remove the tags from a webpages.
Also like when I copy some list from firefox it replaces <li> with the *, is it possible to set something in firefox to return some other non readable character like some kind of dot
You can start by taking a look at the strip_tags function.
You could use an HTML parser, BeautifulSoup (Python) or Simple HTML DOM for example. Or you could try using an XML parser.
I want to strip all tags, remove the
[show][Hide] stuffs from wikipedia, or
is there some website that makes pages
in more readable format.
You should take a look at DBpedia, Wikipedia, but just the data.
http://dbpedia.org/About
What about htmlagilitypack
htmlagilitypackt
Similar thread available in stackoverflow
Is there a Wikipedia API?
Try this function.
Dim pattern As String = "<(.|\n)*?>"
Return System.Text.RegularExpressions.Regex.Replace(strHtmlString, pattern, String.Empty).Trim()
What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack very easy to use, although there hasn't been much updates lately, it is still usable. For example grabbing all the links
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
HtmlNodeCollection Nodes = doc.DocumentNode.SelectNodes("//a[#href]");
foreach (var link in Nodes)
{
Console.WriteLine(link.Attributes["href"].Value);
}
Regular Expression would be my way. ;)
If the HTML you're using XHTML compliant, you could load it as an XML document, and then use XPath/XSL - long winded but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.