I am trying to get the page title from the page source of different pages. But let's say some pages have a title like this:
&quot;This is an example,&quot; ABC.
It has some HTML entities in it, like "&quot;". If I use a string in C# to get this title, I get the whole thing, and when displayed it looks like the above, which is wrong. Is there any way to ignore or take into account HTML entities in C#?
I am also using HtmlAgilityPack, so anything in that will do too.
You can use WebUtility.HtmlDecode to decode HTML entities:
WebUtility.HtmlDecode("&quot;This is an example,&quot; ABC.");
Just add:
using System.Net;
The result will be: "\"This is an example,\" ABC."
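For reference, a minimal sketch of the decoding step as a complete program. It assumes the raw title contains the &quot; entity and is written as a C# 9+ top-level program; the variable names are only illustrative.

```csharp
using System;
using System.Net;

// The raw title as it appears in the page source, entities still encoded.
string rawTitle = "&quot;This is an example,&quot; ABC.";

// HtmlDecode turns named and numeric entities into the characters they stand for.
string title = WebUtility.HtmlDecode(rawTitle);

Console.WriteLine(title); // "This is an example," ABC.
```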
You can also use HtmlEntity.DeEntitize in HTML Agility Pack:
HtmlEntity.DeEntitize(string text)
You don't know what you will find in a page title. Sometimes it is a whole mess. My suggestion is to get the string as it is and process it before showing or saving it.
In this case, the solution is simple: replace
&quot;
with the corresponding character.
Each time you read an HTML document to extract tags, watch out for tags that are never closed. If someone forgets to close the title tag, you'll get the whole page in that line!
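One possible way to guard against an unclosed title tag, as a rough sketch rather than a full parser: capture only up to the next "<" and cap the length. ExtractTitle and its maxLength default are hypothetical names and values chosen for this example.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the decoded title, or "" when no <title> tag is found.
// maxLength is an arbitrary cap chosen for this sketch.
static string ExtractTitle(string html, int maxLength = 200)
{
    // Capture text up to the next '<' (normally the one that opens </title>),
    // so a missing closing tag cannot pull in the rest of the page.
    Match m = Regex.Match(html, @"<title[^>]*>(?<t>[^<]*)", RegexOptions.IgnoreCase);
    if (!m.Success) return "";
    string title = WebUtility.HtmlDecode(m.Groups["t"].Value).Trim();
    return title.Length <= maxLength ? title : title.Substring(0, maxLength);
}

string closed = ExtractTitle("<html><head><title>&quot;This is an example,&quot; ABC.</title></head></html>");
string unclosed = ExtractTitle("<html><head><title>Broken page<body><p>Lots of content</p></body></html>");

Console.WriteLine(closed);   // "This is an example," ABC.
Console.WriteLine(unclosed); // Broken page
```

The trade-off is that a title containing a literal "<" would be cut short, which seems acceptable for display purposes.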
I need to retrieve some info from an HTML document, since the web service that will provide JSON or XML is not ready yet. I'm working with C# and using regular expressions to get the data I need from the HTML string. I've managed to get the div I want to work with from the whole HTML string, but now I'm having trouble getting the content of the first span tag.
I've attempted to retrieve the data between ; and the first closing span tag, but what I really want is the content inside the first span tag.
Here's the regular expression I've written so far, but it's not working:
".*;(?<Content>(\r|\n|.)*)</span>"
I also tried this, but it didn't work either:
"<span class=""type"">(?<Content>(\r|\n|.)*)</span>"
Here is the div i want to retrieve the data from:
<div class="main">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class="type">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class="type2">11,398.70</span> +0.65 % +74.10</div>
EDIT: I can't use HtmlAgilityPack since my client does not want us to use any external library. I've also heard about using XmlReader, but I'm not sure the structure of the HTML will match that of well-formed XML.
This regex will capture the string:
"<span class=\"type\">(?<Content>([^<]*))</span>"
That said, I agree with the other answers: you should use something like XPath instead of regexes for parsing HTML.
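Here is a small sketch of that regex applied to the div from the question, written as a C# 9+ top-level program. The variable names are illustrative, and the Trim call is an addition to drop the trailing space inside the span.

```csharp
using System;
using System.Text.RegularExpressions;

string div = "<div class=\"main\">ABASASDFÓ 18/06/2014 17:38h Blabla Balbal <span class=\"type\">15.80€ </span>+1.94 % +0.30€ | HOME <SPAN class=\"type2\">11,398.70</span> +0.65 % +74.10</div>";

// [^<]* cannot run past the next '<', so the match ends at the first </span>.
Match m = Regex.Match(div, "<span class=\"type\">(?<Content>[^<]*)</span>");

string content = m.Success ? m.Groups["Content"].Value.Trim() : "";
Console.WriteLine(content); // 15.80€
```

The earlier attempts used (\r|\n|.)*, which is greedy and runs on to the last </span> in the string; excluding "<" from the captured text avoids that.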
Here's how it is done with a regex in JavaScript. You should be able to adapt this for C# pretty easily.
var inner = html.match( /<span class="type"(?:\s+[a-z]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)))*\s*>([\S\s]*)<\/span>/i)[1];
Fiddle: http://jsfiddle.net/GarryPas/uk32r8vz/
You want to use XPath for that. Something like this:
div/span/text()
I understand not wanting an external third-party library in your solution. The answer to that is to fetch the source code of the entire library:
https://htmlagilitypack.codeplex.com/
Now you don't have an external library, you have an internal library and you can use the right tool for the job!
XmlReader is a fairly low-level tool. It could technically do the job for you, but what you're really after is "use XmlReader to do XPath", which is discussed here: https://msdn.microsoft.com/en-us/library/ms950778.aspx
The XPathReader class is the result of all that, which has been superseded by LINQ to XML: https://msdn.microsoft.com/en-ca/library/bb387098.aspx
So another option here is to try to use some LINQ to process your HTML file, but that might be tricky since HTML isn't good XML. Still, it's another option if you're looking for those.
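To illustrate both the option and the catch, here is a hedged sketch using LINQ to XML with the XPath extension methods from System.Xml.XPath. It assumes a shortened, well-formed variant of the div; the markup from the question mixes <SPAN> with </span>, and because XML is case-sensitive that fails to parse. Note it selects the span element rather than its text() node.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

// Well-formed variant of the div: the second span's tags use matching case.
string wellFormed = "<div class=\"main\">ABASASDFÓ <span class=\"type\">15.80€ </span>+1.94 % | HOME <span class=\"type2\">11,398.70</span> +0.65 %</div>";

XElement root = XElement.Parse(wellFormed);

// XPath relative to the div: the child span whose class attribute is "type".
var span = root.XPathSelectElement("span[@class='type']");
string price = span == null ? "" : span.Value.Trim();
Console.WriteLine(price); // 15.80€

// The original markup pairs <SPAN ...> with </span>, so the XML parser rejects it.
bool originalParses;
try
{
    XElement.Parse("<div><SPAN class=\"type2\">11,398.70</span></div>");
    originalParses = true;
}
catch (XmlException)
{
    originalParses = false;
}
Console.WriteLine(originalParses); // False
```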
I don't know what it is called, but I think this is possible.
I am looking to write something (I don't know the exact name) that will
go to a webpage, select a value from a drop-down box on that page, and read values from the page after the selection. I am not sure whether it is called a crawler or an activity; I am new to this, but I heard a long time ago from a friend that this can be done.
Can anyone please give me a head start?
Thanks
You need an HTTP client library (perhaps libcurl in C, or some C# wrapper for it, or a native C# HTTP client library).
You also need to parse the retrieved HTML content, so you probably need an HTML parsing library (maybe HTML Agility Pack).
If the targeted webpage is nearly fixed and has e.g. some comments to ease finding the relevant part, you might use simpler or ad-hoc parsing techniques.
Some sites might send a nearly empty static HTML page, with the actual content being constructed dynamically by JavaScript (Ajax). In that case, you are out of luck.
Maybe you want some web service ....
One simple way (but not the most efficient way) is to read the webpage as a string using WebClient, for example:
WebClient Web = new WebClient();
String Data = Web.DownloadString("Address");
HTML is not guaranteed to be well-formed XML, but if the page is (XHTML, for example), you can parse the string into an XDocument and look up the tag that represents the drop-down box. Parsing the string into an XDocument is done this way:
XDocument xdoc = XDocument.Parse(Data);
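As a rough sketch of the lookup step, assuming the downloaded page is well-formed XML (XDocument.Parse throws an XmlException otherwise). The page content and the select name "Name" below are made up for illustration and stand in for the string returned by DownloadString.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Stand-in for the string returned by WebClient.DownloadString; hypothetical page content.
string data = "<html><body><select name=\"Name\"><option value=\"a\">Alpha</option><option value=\"b\">Beta</option></select></body></html>";

XDocument xdoc = XDocument.Parse(data);

// Collect the value attribute of every <option> under the <select> named "Name".
string[] values = xdoc.Descendants("select")
    .Where(s => (string)s.Attribute("name") == "Name")
    .Descendants("option")
    .Select(o => (string)o.Attribute("value"))
    .ToArray();

Console.WriteLine(string.Join(",", values)); // a,b
```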
Update:
If you want to read the result of the selected value, and that result is displayed within the page, do this:
Get all the items as I explained.
If the page does not make use of models, then you can pass your selected value as a query argument, for example:
www.somepage.com/?Name=YourItem
Read the page again and find the value.
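A small sketch of building that address, assuming the page really does accept the selection as a query parameter called Name (that is taken from the example URL and may not match the real site). The selected value is made up.

```csharp
using System;

// Hypothetical base address and parameter name, taken from the example URL above.
string baseAddress = "http://www.somepage.com/";
string selected = "Your Item & Co";

// Escape the value so spaces and '&' do not break the query string.
string address = baseAddress + "?Name=" + Uri.EscapeDataString(selected);

Console.WriteLine(address); // http://www.somepage.com/?Name=Your%20Item%20%26%20Co
```

The resulting address can then be passed to DownloadString in the same way as the first request.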
Within a C# project I'm sending a WebRequest to a PHP website, which takes the values, uses a SELECT statement to query the DB, and returns an HTML page. Since only one value comes back from that query, I need to assign this value to a variable in my C# code.
The source of the body tag of the returned HTML page (as read with the StreamReader in my C# code) looks like this:
<table border='1'>
<tr><td>ValueINeed</td></tr>
</table>
How do I access the value inside this in order to assign it to a string in my C# code?
Thank you.
If you are the author of the PHP code as well, I would suggest that you make another page that returns JSON or similar instead; that way you can avoid parsing HTML.
But if this really is what you are stuck with, I would suggest that you take a look at Html Agility Pack. There is another question on Stack Overflow about how to use it.
If the result always has the same shape, you could just split the string or use a regex.
If not, you may use an HTML parser: http://htmlagilitypack.codeplex.com/
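For the fixed-shape case, a minimal regex sketch might look like this. It assumes the response always contains exactly one table cell as in the question; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

// The body returned by the PHP page, as shown in the question.
string body = "<table border='1'>\n<tr><td>ValueINeed</td></tr>\n</table>";

// Lazy match so only the first cell's content is captured.
Match m = Regex.Match(body, @"<td[^>]*>(?<v>.*?)</td>",
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

string cell = m.Success ? m.Groups["v"].Value.Trim() : "";
Console.WriteLine(cell); // ValueINeed
```

If the PHP page ever starts returning more markup inside the cell, this will capture that markup too, which is where a real parser becomes the safer choice.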
I have the following two examples of HTML:
User: <a style="color:#333" href="http://foo.com/word"></a> blue elephant ·
User: <a style="color:#333" href="http://foo.com/word">#<b>word</b></a> blue elephant ·
I am trying to parse this using C# to put it into a CSV file. It works to an extent; however, when the HTML contains the '#' symbol, it either leaves the CSV cell blank or omits the word preceded by '#'. The main part I am trying to get is "#word blue elephant", but this brings back a blank cell, whereas the first HTML example brings back "blue elephant" as desired.
I am using the following technique to do this:
string[] comm = System.Text.RegularExpressions.Regex.Split(content[1], "<a");
How can I alter this to work for the second html example?
You want to use a proper HTML parser like the one in HTML Agility Pack in this situation (and save yourself from invoking the wrath of Cthulhu).
Some examples of how to use it:
Getting started
Easily extracting links from a snippet of html with HtmlAgilityPack
I have rendered one of my controls into a string. I want to safely split the HTML string without leaving any hanging HTML tags. I am working on a pagination control adapter.
How can I split my string into pieces shorter than a set number of characters, safely taking HTML into account?
Take a look at HtmlAgilityPack. You can use it to parse and manipulate the html in your string without having to resort to regex.
If you're looking for a nice way to show the HTML code you should try HTML Tidy.
I have not used it with a limit on the number of characters per line, but I think HTML Tidy's wrap option might get you close to your target.