Parse CSS out from <style> elements - c#

Can someone tell me an efficient method of retrieving the CSS between <style> tags on a page of markup in .NET?
I've come up with a method which uses recursion, Split() and CompareTo(), but it is really long-winded, and I feel sure that there must be a far shorter (and more clever) method of doing the same.
Please keep in mind that it is possible to have more than one <style> element on a page, and that the element can be either <style> or <style type="text/css">.

I'd probably go for HTML Agility Pack, which gives you DOM-style access to pages. It would be able to pick out your chunks of CSS data, but not actually parse this data into key/value pairs. You can get the relevant pieces of HTML using XPath-style expressions.
Edit: An example of typical usage of Html Agility Pack is shown below.
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has a style attribute
var nodes = doc.DocumentNode.SelectNodes("//a[@style]");
// now you can iterate over the selected nodes (SelectNodes returns null when nothing matches)

Here's a C# CSS parser. Should do what you need.
http://www.codeproject.com/KB/recipes/CSSParser.aspx

Try Regex.
Go to http://gskinner.com/RegExr/,
paste the HTML with CSS, and use this expression at the top:
<style type=\"text/css\">(.*?)</style>
Here is the C# version:
using System.Text.RegularExpressions;

Match m = Regex.Match(this.textBox1.Text, "<style type=\"text/css\">(.*?)</style>", RegexOptions.Singleline);
if (m.Success)
{
    string css = m.Groups[1].Value;
    //do stuff
}
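The snippet above only finds the first element, and only when it is written exactly as <style type="text/css">. A rough sketch of how the same idea could cover several elements, with or without attributes, is below. The sample HTML and the variable names are made up for illustration, and it is written as top-level statements (C# 9 or later). It is still a regex, so it will not cope with every page a real parser would.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample input: two <style> elements, one with a type attribute and one without.
string html =
    "<html><head>" +
    "<style type=\"text/css\">body { color: red; }</style>" +
    "<STYLE>\np { margin: 0; }\n</STYLE>" +
    "</head><body></body></html>";

// \b stops the pattern from matching a tag that merely starts with "style";
// [^>]* skips any attributes; Singleline lets '.' span newlines inside the CSS.
var styleBlock = new Regex(
    @"<style\b[^>]*>(.*?)</style\s*>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

var cssChunks = new List<string>();
foreach (Match m in styleBlock.Matches(html))
{
    cssChunks.Add(m.Groups[1].Value.Trim());
}

foreach (string css in cssChunks)
{
    Console.WriteLine(css);
}
```

Regex.Matches returns every non-overlapping match, so each <style> element ends up as one entry in cssChunks.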

Related

Parsing Html tags using c#

I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to parse this so that each answer (p) belongs to the category (h2) above it.
If there is no h2 above it, the category should be null.
Like this:
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse HTML and allow you to use LINQ to select whatever you need from the DOM structure.
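For what it's worth, the grouping logic itself is just "remember the last h2 seen". Here is a minimal sketch of that loop. To keep it runnable without extra packages it uses System.Xml.Linq instead of HtmlAgilityPack, which only works because the fragment in the question happens to be well-formed XML; the <root> wrapper and the tuple names are my own. With HtmlAgilityPack the same loop should apply to the child nodes of the parsed document.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

// The fragment from the question, wrapped in a hypothetical <root> element
// so that it parses as a single well-formed XML document.
string html = "<p>Answer1</p><h2>Category1</h2><p>Answer2</p><p>Answer3</p>";
XElement root = XElement.Parse("<root>" + html + "</root>");

var items = new List<(string? Category, string Answer)>();
string? currentCategory = null; // stays null until the first <h2> is seen

// Walk the top-level elements in document order.
foreach (XElement el in root.Elements())
{
    if (el.Name.LocalName == "h2")
    {
        currentCategory = el.Value;          // a new category starts here
    }
    else if (el.Name.LocalName == "p")
    {
        items.Add((currentCategory, el.Value)); // answer belongs to the last category seen
    }
}

foreach (var item in items)
{
    Console.WriteLine((item.Category ?? "null") + " -> " + item.Answer);
}
```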
In addition to HTMLAgilityPack, I've also written a lightweight HTML parser for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.

Parsing HTML string using C#

I have a string with html text as shown below.
string htmlText = "<h1>This is heading 1</h1><p>This is some text.</p>" +
    "<hr><h2>This is heading 2</h2><p>This is some other text.</p><hr>";
Can we convert this HTML string into the text we see in the browser after it has been parsed, so that we can later use the parsed string wherever required?
Later I want to copy this data to a SharePoint list multiline rich text column. There I don't need these tags to come, but
This answer provides an example using HtmlAgilityPack, which is much more robust than rolling your own parsing or regular expressions.
XPATH is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
// "//text()" selects every text node in the document
foreach (HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}
Your question isn't entirely clear and cuts off at the end. But you can actually parse the data if you want. Just examine each character to find the tags using string indexes (e.g. htmlText[i]).
If you need something a little more robust, use HtmlMonkey or HtmlAgilityPack to parse it for you.
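To give an idea of what the character-by-character approach looks like, here is a small sketch. StripTags is a made-up helper, not part of any library. It assumes tags never contain a literal '>' inside an attribute value and it knows nothing about <script> or <style> content, which is exactly the kind of detail a real parser handles for you.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical helper: copies every character that is outside a <...> tag
// and replaces each run of tags with a single space.
static string StripTags(string html)
{
    var text = new StringBuilder();
    bool insideTag = false;
    for (int i = 0; i < html.Length; i++)
    {
        char c = html[i];
        if (c == '<')
        {
            insideTag = true;
        }
        else if (c == '>' && insideTag)
        {
            insideTag = false;
            // Separate the text on either side of the tag, without doubling spaces.
            if (text.Length > 0 && text[text.Length - 1] != ' ') text.Append(' ');
        }
        else if (!insideTag)
        {
            text.Append(c);
        }
    }
    // Turn entities such as &amp; back into plain characters.
    return WebUtility.HtmlDecode(text.ToString()).Trim();
}

string htmlText = "<h1>This is heading 1</h1><p>This is some text.</p>" +
    "<hr><h2>This is heading 2</h2><p>This is some other text.</p><hr>";

string plain = StripTags(htmlText);
Console.WriteLine(plain);
// prints: This is heading 1 This is some text. This is heading 2 This is some other text.
```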
The best way is to use a regular expression to extract the inner text between the HTML tags.
Something like this might work:
((.+?)</h.?>)+((.+?)</p.?>)

Regex to find iframe tags and retrieve attributes

I am trying to retrieve iframe tags and attributes from an HTML input.
Sample input
<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>
I have been trying to collect them using the following regex:
<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>
This results in
This is exactly the format I want.
The problem is, if the HTML attributes are in a different order this regex won't work.
Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?
Here is a regex that will ignore the order of attributes:
(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>
RegexStorm demo
C# sample code:
using System.Linq;
using System.Text.RegularExpressions;

var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
// Each Match exposes the named groups, e.g. m.Groups["src"].Value
var matches = rx.Matches(input).Cast<Match>().ToList();
Output:
Regular expressions match patterns, and the structure of your string defines which pattern to use. So if you want to use regular expressions, order is important.
You can deal with this in 2 ways:
The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.
The 2nd, bad, and non-recommended way to do this is to break your matching into 2 parts. You first use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use multiple, smaller regular expressions to seek out and find the settings you are after. The above regex obviously fails if your iframe is structured like so: <iframe.../>. This should give you a hint as to why you should not do HTML parsing through regular expressions.
As stated, you should go with the first option.
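For completeness, the two-part approach could look roughly like the sketch below. The patterns and variable names are my own rather than anything from the question, and the second iframe in the sample input is made up to show attributes in a different order. It assumes quoted attribute values with no '>' inside them, which is one more reason a real parser is the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical input: the second iframe lists its attributes in a different order.
string input =
    "<div class=\"1\"><iframe width=\"100%\" height=\"427px\" src=\"https://www.youtube.com/embed/1\" frameborder=\"0\"></iframe></div>" +
    "<div class=\"2\"><iframe src=\"https://www.youtube.com/embed/2\" height=\"300px\" width=\"50%\"></iframe></div>";

// Step 1: grab each opening <iframe ...> tag (this also matches the <iframe .../> form).
var tagRx = new Regex(@"<iframe\b[^>]*>", RegexOptions.IgnoreCase);
// Step 2: pull name="value" or name='value' pairs out of a single tag.
var attrRx = new Regex(@"([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)')");

var frames = new List<Dictionary<string, string>>();
foreach (Match tag in tagRx.Matches(input))
{
    var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (Match a in attrRx.Matches(tag.Value))
    {
        // Exactly one of groups 2 and 3 takes part, depending on the quote style.
        attrs[a.Groups[1].Value] = a.Groups[2].Success ? a.Groups[2].Value : a.Groups[3].Value;
    }
    frames.Add(attrs);
}

foreach (var f in frames)
{
    Console.WriteLine(f["width"] + " x " + f["height"] + " : " + f["src"]);
}
```

Because the attributes land in a dictionary keyed by name, their order in the markup no longer matters.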
You can use this regex
<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>
It matches each name=value pair in turn by repeating the group. In .NET every repetition is kept, in order, in the group's Captures collection, so you can iterate through the captures to get names and values sequentially. It caters for most characters in the value, but you may add a few more if needed.
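Reading the repeated groups back out could look like this. It is only a sketch: the regex is the one above, unchanged, while the input string and the variable names are assumptions for the demo.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string input = "<iframe width=\"100%\" height=\"427px\" src=\"https://www.youtube.com/embed/1\" frameborder=\"0\" allowfullscreen=\"\"></iframe>";

// The regex from the answer above; in a verbatim string, "" stands for one quote.
var rx = new Regex(@"<iframe[ ]+(([a-z]+) *= *['""]*([a-zA-Z0-9\/:\.%]*)['""]*[ ]*)*>");

var pairs = new List<KeyValuePair<string, string>>();
foreach (Match m in rx.Matches(input))
{
    // Group 2 holds the names and group 3 the values; each repetition of the
    // outer group adds one entry to both Captures collections, in order.
    CaptureCollection names = m.Groups[2].Captures;
    CaptureCollection values = m.Groups[3].Captures;
    for (int i = 0; i < names.Count; i++)
    {
        pairs.Add(new KeyValuePair<string, string>(names[i].Value, values[i].Value));
    }
}

foreach (var p in pairs)
{
    Console.WriteLine(p.Key + " = " + p.Value);
}
```

Note that m.Groups[2].Value on its own would only give the last name matched; the full list is only available through Captures.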
With Html Agility Pack (to be had via nuget):
using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load("HTMLPage1.html"); //or .LoadHtml(/*contentstring*/);
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            // SelectNodes returns null (not an empty collection) when nothing matches
            if (iframes == null) return;
            foreach (HtmlNode iframe in iframes)
            {
                // the second argument is the default used when the attribute is missing
                Console.WriteLine(iframe.GetAttributeValue("width", "null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src", "null"));
            }
        }
    }
}
You need to use an OR operator (|). See the changes below:
<iframe.+?(width=[\"']((?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>

Extract id style from html page using Html agility pack

I have a C# application. I need to extract data from an HTML page and add it to my database. The HTML page contains some CSS code and I am interested in all of the id attributes from the CSS. How can I pull the ids' info into my code? I tried something like this but it doesn't seem to work:
var styles = document.DocumentNode.SelectNodes("//style");
foreach (HtmlNode node in styles)
{
    var text = node.Attributes["id"];
}
I really appreciate any help!
More of a fishing rod than a fish, but that's all I got time to do ATM.
First, look at this tutorial: xpath on w3schools. I've done some work with XPath, and it was only after going through their tutorial that things started to make a bit of sense.
Then, please get this html agility test pack, it will let you quickly test your code against the page you're trying to parse.
From here, it should be a short way to get what you want.
Try this, access Id property directly :
var styles = document.DocumentNode.SelectNodes("//*[@style]");
foreach (HtmlNode node in styles)
{
    var text = node.Id;
}
Edit: expression changed to "//*[@style]", which gets you only elements with a style attribute.
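If what you are after is the id selectors written inside the CSS itself (which is how I read the question), you could take the text of each <style> node, for example via node.InnerText, and scan it. Below is a rough sketch of that scan only. The CSS string and the variable names are made up, and the approach is an assumption on my part: it blanks out flat { ... } blocks first, so it will miss rules nested inside @media and similar at-rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical CSS, standing in for the text of one <style> node.
string css = "#header { color: #fff; } .nav #menu, #footer { margin: 0; } #header a { color: #000; }";

// Blank out the declaration blocks first, so hex colours such as #fff are not mistaken for ids.
string selectorsOnly = Regex.Replace(css, @"\{[^}]*\}", " ");

var ids = new List<string>();
foreach (Match m in Regex.Matches(selectorsOnly, @"#([A-Za-z_][\w-]*)"))
{
    string id = m.Groups[1].Value;
    if (!ids.Contains(id)) ids.Add(id); // keep first-seen order, drop repeats
}

Console.WriteLine(string.Join(", ", ids));
// prints: header, menu, footer
```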

What is the best way to search through HTML in a C# string for specific text and mark the text?

What would be the best way to search through HTML inside a C# string variable to find a specific word/phrase and mark (or wrap) that word/phrase with a highlight?
Thanks,
Jeff
I like using Html Agility Pack. It is very easy to use, and although there haven't been many updates lately, it is still usable. For example, grabbing all the links:
HtmlWeb client = new HtmlWeb();
HtmlDocument doc = client.Load("http://yoururl.com");
// every <a> element that has an href attribute
HtmlNodeCollection Nodes = doc.DocumentNode.SelectNodes("//a[@href]");
foreach (var link in Nodes)
{
    Console.WriteLine(link.Attributes["href"].Value);
}
Regular Expression would be my way. ;)
If the HTML you're using is XHTML-compliant, you could load it as an XML document, and then use XPath/XSL - long-winded but kind of elegant?
An approach I used in the past is to use HTMLTidy to convert messy HTML to XHTML, and then use XSL/XPath for screen scraping content into a database, to create a reverse content management system.
Regular expressions would do it, but could be complicated once you try stripping out tags, image names etc, to remove false positives.
In simple cases, regular expressions will do.
string input = "ttttttgottttttt";
string output = Regex.Replace(input, "go", "<strong>$0</strong>");
will yield: "tttttt<strong>go</strong>ttttttt"
But when you say HTML, if you're referring to final text rendered, that's a bit of a mess. Say you've got this HTML:
<span class="firstLetter">B</span>ook
To highlight the word 'Book', you would need the help of a proper HTML renderer. To simplify, one can first remove all tags and leave only contents, and then do the usual replace, but it doesn't feel right.
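A middle ground between the two is to leave the tags in place but refuse any match that falls inside one. The sketch below does that with a negative lookahead. Highlight is a made-up helper, and it assumes attribute values never contain a literal '<' or '>'. It does not solve the split-word case above, where 'Book' is spread across a span.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps each whole-word occurrence of `word` in <strong>,
// skipping occurrences that sit inside a tag (for example in an attribute value).
static string Highlight(string html, string word)
{
    // (?![^<>]*>) rejects a match when the next angle bracket after it is '>',
    // which would mean the match lies between '<' and '>'.
    string pattern = @"\b" + Regex.Escape(word) + @"\b(?![^<>]*>)";
    return Regex.Replace(html, pattern, "<strong>$0</strong>", RegexOptions.IgnoreCase);
}

string input = "<a href=\"book.html\" title=\"book\">Read the book</a>";
string output = Highlight(input, "book");
Console.WriteLine(output);
// prints: <a href="book.html" title="book">Read the <strong>book</strong></a>
```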
You could look at using Html DOM, an open source project on SourceForge.net.
This way you could programmatically manipulate your text instead of relying on regular expressions.
Searching for strings, you'll want to look up regular expressions. As for marking it, once you have the position of the substring it should be simple enough to use that to add in something to wrap around the phrase.
