Parsing HTML string using C# - c#

I have a string containing HTML text, as shown below.
string htmlText = "<h1>This is heading 1</h1><p>This is some text.</p>" +
    "<hr><h2>This is heading 2</h2><p>This is some other text.</p><hr>";
Can we convert this HTML string into the text we would see in a browser after it has been parsed, so that we can later use the parsed string wherever required?
Later I want to copy this data to a SharePoint list multiline rich text column. There I don't need these tags to appear, but

This answer provides an example using HtmlAgilityPack, which is much more robust than rolling your own parsing or regular expressions.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
// "//text()" selects every text node in the document
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}

Your question isn't entirely clear and cuts off at the end. But you can actually parse the data if you want. Just examine each character to find the tags using string indexes (e.g. htmlText[i]).
If you need something a little more robust, use HtmlMonkey or HtmlAgilityPack to parse it for you.
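The character-by-character idea above can be sketched roughly as follows. This is only a minimal illustration, not a real parser: the TagStripper class and StripTags method are names made up for this example, and the code assumes every '<' opens a tag and every '>' closes one, so comments, scripts and entities such as &amp; are not handled.

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Copies every character that lies outside <...> markup into the result.
    // Assumption: '<' always opens a tag and '>' always closes one.
    public static string StripTags(string html)
    {
        var text = new StringBuilder();
        bool insideTag = false;

        for (int i = 0; i < html.Length; i++)
        {
            char c = html[i];
            if (c == '<')
            {
                insideTag = true;
            }
            else if (c == '>')
            {
                insideTag = false;
                // Separate the text of adjacent elements with a single space.
                if (text.Length > 0 && text[text.Length - 1] != ' ')
                    text.Append(' ');
            }
            else if (!insideTag)
            {
                text.Append(c);
            }
        }
        return text.ToString().Trim();
    }
}
```

Calling TagStripper.StripTags(htmlText) on the string from the question should leave only the heading and paragraph text. For anything beyond simple, trusted markup, one of the parsers mentioned above is the safer choice.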

The best way is to use a regular expression to extract the inner text between the HTML tags.
Something like this might work:
((.+?)</h.?>)+((.+?)</p.?>)

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to do parsing so that each answer (p) belongs to the category(h2) above.
If nothing is above, then the category will be null.
The result should look like this:
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse the HTML and allow you to use LINQ to select whatever you need from the DOM structure.
In addition to HTMLAgilityPack, I've also written a lightweight HTML parser for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.
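For the exact fragment in the question, the grouping logic itself is small. The sketch below does not use either of the parsers mentioned above. Instead it assumes the markup is well-formed enough to load as XML and uses System.Xml.Linq from the standard library. CategorizedAnswer and AnswerGrouper are hypothetical names for this example. With real-world HTML the XML parse would likely throw, and a proper HTML parser would be needed for the loading step.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public sealed class CategorizedAnswer
{
    public string Category { get; set; }
    public string Answer { get; set; }
}

public static class AnswerGrouper
{
    // Walks the top-level elements in document order and remembers the most
    // recent <h2>; every <p> is paired with that heading (null if none yet).
    // Assumption: the fragment is well-formed XML.
    public static List<CategorizedAnswer> Group(string html)
    {
        // Wrap the fragment in a dummy root so it forms a single XML document.
        XElement root = XElement.Parse("<root>" + html + "</root>");

        var result = new List<CategorizedAnswer>();
        string currentCategory = null;

        foreach (XElement element in root.Elements())
        {
            string name = element.Name.LocalName.ToLowerInvariant();
            if (name == "h2")
            {
                currentCategory = element.Value;
            }
            else if (name == "p")
            {
                result.Add(new CategorizedAnswer
                {
                    Category = currentCategory,
                    Answer = element.Value
                });
            }
        }
        return result;
    }
}
```

The same loop should carry over to a parsed HTML document: iterate the sibling nodes in order, update the current category on each h2, and emit an object for each p.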

Regex to find iframe tags and retrieve attributes

I am trying to retrieve iframe tags and attributes from an HTML input.
Sample input
<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>
I have been trying to collect them using the following regex:
<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>
This results in
This is exactly the format I want.
The problem is, if the HTML attributes are in a different order this regex won't work.
Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?
Here is a regex that will ignore the order of attributes:
(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>
RegexStorm demo
C# sample code:
using System.Linq;
using System.Text.RegularExpressions;

// The closing quote sits outside the src group so it is not captured with the URL
var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
var matches = rx.Matches(input).Cast<Match>().ToList();
Output:
Regular expressions match patterns, and the structure of your string defines which pattern to use. So if you want to use regular expressions, the order of the attributes matters.
You can deal with this in 2 ways:
The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather to use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.
The 2nd, bad, and not recommended way is to break your matching into 2 parts. You first use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use multiple, smaller regular expressions to find the settings you are after. The above regex obviously fails if your iframe is structured like so: <iframe.../>. This should give you a hint as to why you should not do HTML parsing through regular expressions.
As stated, you should go with the first option.
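For completeness, the two-part approach could look roughly like the sketch below. It uses slightly different patterns from the one quoted above: the first matches only the opening tag, so the <iframe.../> form is covered as well, and the second pulls out quoted name/value pairs in whatever order they appear. IframeScanner and Scan are hypothetical names for this example. The code assumes attribute values are quoted and contain no '>' characters, which real HTML does not guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframeScanner
{
    // Step 1: capture the attribute section of each opening <iframe ...> tag.
    private static readonly Regex IframeTag =
        new Regex(@"<iframe\b([^>]*)>", RegexOptions.IgnoreCase);

    // Step 2: capture name="value" or name='value' pairs inside that section.
    private static readonly Regex Attribute =
        new Regex(@"([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)')");

    public static List<Dictionary<string, string>> Scan(string html)
    {
        var iframes = new List<Dictionary<string, string>>();
        foreach (Match tag in IframeTag.Matches(html))
        {
            var attributes = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match attr in Attribute.Matches(tag.Groups[1].Value))
            {
                // Exactly one of groups 2 and 3 participates in each match.
                string value = attr.Groups[2].Success
                    ? attr.Groups[2].Value
                    : attr.Groups[3].Value;
                attributes[attr.Groups[1].Value] = value;
            }
            iframes.Add(attributes);
        }
        return iframes;
    }
}
```

Each dictionary in the returned list corresponds to one iframe, so the width, height and src values can be looked up by name regardless of their order in the tag.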
You can use this regex
<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>
It matches each name=value pair repeatedly and stores them in the same order in the matches, so you can iterate through the matches to get the names and values sequentially. It caters for most characters in a value, but you may add a few more if needed.
With Html Agility Pack (available via NuGet):
using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html"); // or doc.LoadHtml(/* content string */);
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            if (iframes == null) return; // SelectNodes returns null when nothing matches
            foreach (HtmlNode iframe in iframes)
            {
                // The second argument is the default returned when the attribute is missing
                Console.WriteLine(iframe.GetAttributeValue("width", "null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src", "null"));
            }
        }
    }
}
You need to use an OR operator (|). See the changes below:
<iframe.+?((width=[\"'](?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>

How can I grab text before a tag with HTMLAgilityPack

Suppose I have this HTML string:
These are some links<br>1234 - <a id="1234" href="#">My Number 1</a><br>4321 - My Number 2...
I want to extract the text after the <br> tag (1234 -), the inner text of the <a> tag (My Number 1), and the id attribute of the <a> tag (1234) as well. I am using the HTMLAgilityPack to help parse the HTML data that I get.
So far I have tried doing this:
// mNodes = code to get html string I want to parse
HtmlNode mNumberListNodes = mNodes[1]; // mNodes[1] is equal to a string as shown above
List<HtmlNode> mNumberNodes = mNumberListNodes.Descendants("a").ToList();
I am using breakpoints to stop and inspect the previous and child nodes of each HtmlNode, but I am not having any luck finding the corresponding number text.
Anyone have any experience using the HTMLAgilityPack in C# that could help me?
I believe the
mNumberListNodes.InnerText
property will give you all the text that is not in HTML tags, including the "1234" you want. That text is not an element, so it will not show up in Descendants("a").
Assuming the code above is correct, to get the id value, use:
mNumberListNodes.Descendants("a").ToList()[0].Attributes["id"].Value
I've had pretty good success using XPath with this library, and also regular expressions.

Parse CSS out from <style> elements

Can someone tell me an efficient method of retrieving the CSS between <style> tags on a page of markup in .NET?
I've come up with a method which uses recursion, Split() and CompareTo(), but it is really long-winded, and I feel sure that there must be a far shorter (and more clever) method of doing the same.
Please keep in mind that it is possible to have more than one <style> element on a page, and that the element can be either or .
I'd probably go for HTML Agility Pack, which gives you DOM-style access to pages. It would be able to pick out your chunks of CSS data, but not actually parse this data into key/value pairs. You can get the relevant pieces of HTML using XPath-style expressions.
Edit: An example of typical usage of Html Agility Pack is shown below.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Selects every <a> element that has a style attribute
var nodes = doc.DocumentNode.SelectNodes("//a[@style]");
// now you can iterate over the selected nodes etc.
Here's a C# CSS parser. Should do what you need.
http://www.codeproject.com/KB/recipes/CSSParser.aspx
Try Regex.
Go to http://gskinner.com/RegExr/,
paste in the HTML with CSS, and use this expression at the top:
<style type=\"text/css\">(.*?)</style>
Here is the C# version:
using System.Text.RegularExpressions;

// Singleline lets '.' match newlines, so multi-line CSS is captured
Match m = Regex.Match(this.textBox1.Text, "<style type=\"text/css\">(.*?)</style>", RegexOptions.Singleline);
if (m.Success)
{
    string css = m.Groups[1].Value;
    // do stuff
}
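Since the question mentions more than one style element per page, a variant that collects every block may be closer to what is needed. The sketch below is only an illustration: StyleExtractor and ExtractCss are hypothetical names, and the pattern is loosened to accept any attributes on the opening tag. It still shares the usual weakness of regex-based HTML handling, for example a literal </style> inside a CSS comment or string would end the match early.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StyleExtractor
{
    // Matches <style>, <style type="text/css"> and similar opening tags.
    // Singleline lets '.' span newlines; the lazy (.*?) stops at the first </style>.
    private static readonly Regex StyleBlock = new Regex(
        @"<style\b[^>]*>(.*?)</style\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the trimmed CSS text of every style element, in document order.
    public static List<string> ExtractCss(string html)
    {
        var blocks = new List<string>();
        foreach (Match m in StyleBlock.Matches(html))
        {
            blocks.Add(m.Groups[1].Value.Trim());
        }
        return blocks;
    }
}
```

The returned strings are raw CSS text. Turning them into selectors and properties would still need a CSS parser such as the one linked above.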

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// Selects every <a> element that has an href attribute
HtmlNodeCollection Nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var link in Nodes)
{
    Console.WriteLine(link.Attributes["href"].Value);
}
Regular Expression would be my way. ;)
If the HTML you're using is XHTML-compliant, you could load it as an XML document, and then use XPath/XSL - long-winded but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
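One way to keep the replacement away from tag names and attribute values is to split the string into tags and text first, and only wrap matches inside the text pieces. The sketch below is a rough illustration of that idea. Highlighter and Highlight are hypothetical names, and the strong element is just one possible wrapper. It assumes tags contain no '>' inside attribute values, and it cannot find a phrase that is split across tags.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class Highlighter
{
    // The capturing group makes Regex.Split keep the tags in its result,
    // so the output can be reassembled without losing any markup.
    private static readonly Regex TagSplitter = new Regex(@"(<[^>]*>)");

    public static string Highlight(string html, string phrase)
    {
        // Escape the phrase so characters like '.' or '+' are matched literally.
        var phrasePattern = new Regex(Regex.Escape(phrase), RegexOptions.IgnoreCase);
        var output = new StringBuilder();

        foreach (string piece in TagSplitter.Split(html))
        {
            bool isTag = piece.StartsWith("<", StringComparison.Ordinal);
            // Tags are copied unchanged; only text pieces get the wrapper.
            output.Append(isTag ? piece : phrasePattern.Replace(piece, "<strong>$0</strong>"));
        }
        return output.ToString();
    }
}
```

For example, highlighting "go" in a link whose href also contains "go" should leave the attribute untouched and wrap only the visible text.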
