There are many times I need to extract value of an element from a HTML page. Something like this:
<!-- many html here -->
<input type="hidden" name="id" value="ExtractMe!">
<!-- many html here -->
How can extract the value easily?
Have a look at the HTMLAgility pack, it makes this type of task very easy and regex-free.
If you need to parse HTML within your C# application consider using HTMLAgilityPack from here http://htmlagilitypack.codeplex.com/
If you just want to pluck values you're probably best to parse this as XML.
You have a choice of standard XML or LINQ.
have a look here or here for some examples.
Why don't you use regular expressions? This the MSDN Regular Expression Documentation, in there you can look for The section Extracting a Single Match or the First Match.
Related
I have html code:
<p>Answer1</p>
<h2>Category1</h2>
<p>Answer2</p>
<p>Answer3</p>
I need to do parsing so that each answer (p) belongs to the category(h2) above.
If nothing is above, then the category will be null.
Look like this :
obj1.category = null; obj1.answer = "Answer1";
obj2.category ="Category1"; obj2.answer = "Answer2";
obj3.category ="Category1"; obj3.answer = "Answer3";
I tried to solve this, but it was useless.
Use HTMLAgilityPack. It will parse HTML and allow you do use LINQ to SELECT whatever you need from the DOM structure.
In addition to HTMLAgilityPack, I've also written a light weight HTML parse for C#.
There's no big secret to the technique, but it's sort of detailed work. You just go through the text character by character and pull out HTML elements.
My parser is on Github as HtmlMonkey.
UPDATE:
I just added support for fairly advanced selectors to easily find nodes within a parsed document.
I want to download a html source, then search for the username and other information, and then display this in my program.
I'm pretty new to programming, but a straight noob when it comes to things like this (Regex) so I hope you can explain it to me.
I used Regex before extracting a K/D ratio from a html source, for that I used this code:
string pattern = #"<span class=""kdratio"">\d+\.\d+";
But I have no idea how to start on this one...
This is the line of the source that contains the information:
<section class="profile-header" profile="true" motto="user's motto" user="User" figure="hr-3322-45.hd-190-1.ch-3342-64-66.lg-285-64.sh-3068-82-66.ea-1404-64">
I only need the parts user="User" and figure="x", I couldn't try anything because I really wouldn't know how to start, because the html line looks so different from what I have experience with.
Regular expressions are not a good idea for matching HTML unless it's very simple, single, tag matching. See here: RegEx match open tags except XHTML self-contained tags
I recommend using an HTML DOM-parsing library and use XPath or CSS selectors to get the information you want. For .NET, HtmlAgilityPack is recommended. For CSS Selectors you'll want Fizzler (an add-on for HtmlAgilityPack).
In JavaScript (easily rewritten to C# and HtmlAgilityPack) it would be this:
document.querySelector(
"section[class=profile-header][profile=true][user=User]"
).textContent
HtmlAgilityPack: http://html-agility-pack.net
Fizzler: https://www.nuget.org/packages/Fizzler.Systems.HtmlAgilityPack/
Generally for parsing HTML, Regex is not a good choice! HTML tends to be so complicated and it is so hard to write a single Regex to be able to match everything! Instead use a parser like Html Agility Pack.
I need to retrieve some info from an html doc since the web service to get a json or an xml is still not ready. Im working with c# and using regular expressions to get the data i need from the html string. I've managed to get the div i want to work with from the whole html string but now i'm having trouble getting the info between the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag but what i really want is the content between the first span tag.
Here's the regular expression i've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this but didnt work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use Htmlagilitypack since my client does not want us to use any external library. I've also heard about using the XmlReader but i'm not sure the structure of the html will match an xml one accordingly.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
Although, I agree with other answers, you should use something like Path instead of Regexes for parsing html.
Here's how it is done with a regex in Javascript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting some external 3rd party library in your solution, the solution to that is to go fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool, it could technically do the job for you but what you're more after is "use XmlReader to do XPath" which is talked about here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
I'm trying to match tags with C# and I'm having some trouble getting it to work. I have these tags:
<categories=1></categories=1>
The =1 could be really any number. It could be 1, 2, 3 or any other given number. Is there a way to match this tag in C# using IndexOf or RegEx or a better method.
So to give an example of how I want to use it. I would have something like:
if (PUT WORKING CODE HERE ONCE FIGURED OUT)
{
Do Something
}
Is there an easy way to do this?
Thanks!
I would suggest to first make the document valid XML by replacing those equation signs, then use any XML parser.
there is only one valid answer to this need, unless you are doing homeworks and need to learn how to code this yourself...
avoid reinventing things from scratch and use Html Agility Pack
it is called Html but also handles XML files, in case you have to do more complex things, like parsing, and don't want or cannot use pure XPath and XML related .NET Framework classes.
see here for some examples: How to use HTML Agility pack
I've been trying to formulate a regular expression to remove any attributes that may be present in html tags but I'm having trouble doing this and Google doesn't seem to provide any answers either.
Basically my input string looks something like
<p style="font-family:Arial;" class="x" onclick="doWhatever();">this text</p>
<img style="border:0px" src="pic.gif" />
and I would like to remove any attributes inside the tag to produce a string like:
<p>this text</p>
<img src="pic.gif" />
Does anybody know a regex for doing this? I'm using Regex.Replace in C# by the way.
There are really excellent tools for handling this sort of task in .NET without having to resort to the regex hammer. This will also be more reliable than a regular expression based solution.
I'd suggest that you take a look at HTML Agility Pack.
HTML is easiest interfaced with using a DOM, but if you really want to do this using a regex you could probably take advantage of that you want to remove all attributes, e.g. leave nothing left but the tag. IMO you should use a DOM parser instead.
either that or using jquery each to go trough all html elements and remove attr. or from particular element. Why would you be doing that anyway?