I am sending an HttpWebRequest and reading the response.
I am getting the full page in the response.
I want to get one div, named rate, out of the response.
So how can I match that pattern?
My code looks like this:
HttpWebRequest WebReq = (HttpWebRequest)WebRequest.Create("http://www.domain.com/");
HttpWebResponse WebResp = (HttpWebResponse)WebReq.GetResponse();
Stream response = WebResp.GetResponseStream();
StreamReader data = new StreamReader(response);
string result = data.ReadToEnd();
I am getting a response like:
<HTML><BODY><div id="rate">Todays rate 55 Rs.</div></BODY></HTML>
I want to read the data of the div rate, i.e. I should get the content "Todays rate 55 Rs."
So how can I write a regex for this?
The HTML Agility Pack can load and parse the file for you, no need for messy streams and responses:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://jsbin.com/owobe3");
HtmlNode rateNode = doc.DocumentNode.SelectSingleNode("//div[@id='rate']");
string rate = rateNode.InnerText;
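SelectSingleNode returns null when nothing matches, so if the div might be missing, add a guard before reading InnerText:
string rate = rateNode != null ? rateNode.InnerText : null;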
You should read the entire response and then use something like the Html Agility Pack to parse the response and extract the bits you want with XPath-like syntax:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(result);
var output = doc.DocumentNode.SelectSingleNode("//div[@id='rate']").InnerHtml;
Don't use regular expressions!
If you have only one "Todays rate" text, then you can match it like this:
Todays rate \d+ Rs.
Otherwise, you can add the div tag to your regex.
Edit: Sorry, I don't have a regex tool installed locally to test with.
You need to use grouping and get the value from the group. It will look something like this:
<div id="rate">(?<group>[^<]+)</div>
I don't know if it works as written, but use this idea.
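A minimal sketch of applying that grouped pattern in C# (untested against the real page, and the pattern assumes the div's content contains no nested tags):

using System;
using System.Text.RegularExpressions;

string html = "<HTML><BODY><div id=\"rate\">Todays rate 55 Rs.</div></BODY></HTML>";
Match match = Regex.Match(html, "<div id=\"rate\">(?<group>[^<]+)</div>");
if (match.Success)
{
    // Prints: Todays rate 55 Rs.
    Console.WriteLine(match.Groups["group"].Value);
}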
I searched but could not find anything that worked for me.
A while ago I started with C# and my first personal project was a simple WebCrawler.
It should check the source code for particular strings, to identify whether, for example, Google Analytics or something similar is included.
It works fine, but of course I'm missing the JS and iframes, since HttpWebRequest does not render the website, as far as I know.
So I wanted to check for "<script src=" for example, and then get the URL through a split.
But this does not work as expected, and I don't think this is a clean and good way.
Since I'm checking for literal strings, the approach can be broken by simply changing the string from "<script" to "< script", for example, so I have no idea how to reliably get a specific string out of a big string.
I found regular expressions (regex) and Split, but I'm not sure they would be good here, since there could be more variations of "src=" than something like split("\"", "\"", text) can handle.
I don't want a "here you go" solution, of course; I want to understand it and do it myself, but I have no idea where to go from here.
Sorry for the long text and the lack of examples, but at the moment I have no access to my code, and there is not really much in it except the regex and Split attempts.
EDIT: I think I'll create a class which checks every char for a special sequence like "
Best,
Mike
Try the Html Agility Pack.
I haven't used it personally, but something like this should work (I haven't tested it):
string url = "some/url";
var request = (HttpWebRequest)HttpWebRequest.Create(url);
var webResponse = (HttpWebResponse)request.GetResponse();
var responseStream = webResponse.GetResponseStream();
var streamReader = new StreamReader(responseStream);
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(streamReader.ReadToEnd());
var scripts = doc.DocumentNode.Descendants()
.Where(n => n.Name == "script");
This should get you all the script nodes, to do with them what you want =)
So I found a way to get the JS URLs; here is my code:
List<string> srcurl = new List<string>();
HtmlWeb hw = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = hw.Load("some/url");
HtmlNodeCollection linkNodes = doc.DocumentNode.SelectNodes("//script[@src]");
foreach (HtmlNode linkNode in linkNodes)
{
HtmlAttribute link = linkNode.Attributes["src"];
srcurl.Add(link.Value);
}
Regarding the code from @avidenic, if you want to use it, be aware that you should use
doc.LoadHtml(streamReader.ReadToEnd());
Best,
Mike
I'll try to explain exactly what I mean. I'm working on a program, and I'm trying to download a bunch of images automatically from this site.
Namely, I want to download the big square icons from the page you get when you click on a hero name there; for example, on the Darius page, the image in the top left named DariusSquare.png, and save it into a folder.
Is this possible or am I asking too much from C#?
Thank you very much!
In general, everything is possible given enough time and money. In your case, you need very little of the former and none of the latter :)
What you need to do can be described in the following high-level steps (sketched in code below):
Get all <a> tags within the table with heroes.
Use the WebClient class to navigate to the URLs these <a> tags point to (i.e. the values of their href attributes) and download the HTML.
You will need to find some wrapper element that is present on each hero page and that contains the image. Then you should be able to get to the image's src attribute and download it. Alternatively, perhaps each image has a common ID you can use?
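A bare skeleton of the first two steps might look like this (the URL and the XPath are assumptions about the page structure that you will need to adapt):

using System.Net;
using HtmlAgilityPack;

var client = new WebClient();
var listDoc = new HtmlDocument();
// Hypothetical list page URL; replace with the real page holding the heroes table.
listDoc.LoadHtml(client.DownloadString("http://example.com/heroes"));
// Assumes the heroes are links inside a table; adjust the XPath to the real markup.
var heroLinks = listDoc.DocumentNode.SelectNodes("//table//a[@href]");
if (heroLinks != null)
{
    foreach (var a in heroLinks)
    {
        string heroUrl = a.GetAttributeValue("href", null);
        var heroDoc = new HtmlDocument();
        heroDoc.LoadHtml(client.DownloadString(heroUrl));
        // Step 3: find the wrapper element on heroDoc and read the image's src here.
    }
}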
I don't think anyone will provide you with exact code that performs all these steps for you. Instead, you need to do some research of your own.
Yes, it's possible: do a web request in C# and use the HTML Agility Pack to find the image URL.
Then you can use another web request to download the image.
Example of downloading an image from a URL:
public static Image LoadImage(string url)
{
    var request = WebRequest.Create(url);
    var response = request.GetResponse();
    var stream = response.GetResponseStream();
    // Note: the stream must remain open for the lifetime of the Image.
    return Image.FromStream(stream);
}
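To save the image into a folder, as the question asks, something like this should work (the URL and path here are just examples):

// Hypothetical image URL and target path.
using (var image = LoadImage("http://example.com/DariusSquare.png"))
{
    image.Save(@"C:\images\DariusSquare.png"); // assumes the folder exists
}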
Example using the Html Agility Pack and getting some other data:
string result;
var request = (HttpWebRequest)WebRequest.Create(profileurl);
request.Method = "GET";
using (var response = request.GetResponse())
{
using (var stream = response.GetResponseStream())
{
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
result = reader.ReadToEnd();
}
var doc = new HtmlDocument();
doc.Load(new StringReader(result));
var root = doc.DocumentNode;
HtmlNode profileHeader = root.SelectSingleNode("//*[@id='profile-header']");
HtmlNode profileRight = root.SelectSingleNode("//*[@id='profile-right']");
string rankHtml = profileHeader.SelectSingleNode("//*[@id='best-team-1']").OuterHtml.Trim();
#region GetPlayerAvatar
var avatarMatch = Regex.Match(profileHeader.SelectSingleNode("/html/body/div/div[2]/div/div/div/div/div/span").OuterHtml, @"(portraits[^(h3)]+).*no-repeat;", RegexOptions.IgnoreCase);
if (avatarMatch.Success)
{
battleNetPlayerFromDB.PlayerAvatarCss = avatarMatch.Value;
}
#endregion
}
}
I'm just trying to learn about HtmlAgilityPack and XPath. I'm attempting to get a list of company links from the NASDAQ website:
http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx
I currently have the following code:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// Create a request for the URL.
WebRequest request = WebRequest.Create("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx");
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Get the stream containing content returned by the server.
Stream dataStream = response.GetResponseStream();
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream);
// Read the content.
string responseFromServer = reader.ReadToEnd();
// Read into a HTML store read for HAP
htmlDoc.LoadHtml(responseFromServer);
HtmlNodeCollection tl = htmlDoc.DocumentNode.SelectNodes("//*[@id='indu_table']/tbody/tr[*]/td/b/a");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
Debug.Write(node.InnerText);
}
// Cleanup the streams and the response.
reader.Close();
dataStream.Close();
response.Close();
I've used an XPath addon for Chrome to get the XPath of:
//*table[@id='indu_table']/tbody/tr[*]/td/b/a
When running my project, I get an unhandled XPath exception about an invalid token.
I'm a little unsure what's wrong with it. I've tried to put a number in the tr[*] section above, but I still get the same error.
I've been looking at this for the last hour; is it something simple?
Thanks
Since the data comes from JavaScript, you have to parse the JavaScript and not the HTML, so the Agility Pack doesn't help that much, but it makes things a bit easier. The following is how it could be done using the Agility Pack and Newtonsoft Json.NET to parse the JavaScript.
HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load(new WebClient().OpenRead("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx"));
List<string> listStocks = new List<string>();
HtmlNode scriptNode = htmlDoc.DocumentNode.SelectSingleNode("//script[contains(text(),'var table_body =')]");
if (scriptNode != null)
{
//Using Regex here to get just the array we're interested in...
string stockArray = Regex.Match(scriptNode.InnerText, "table_body = (?<Array>\\[.+?\\]);").Groups["Array"].Value;
JArray jArray = JArray.Parse(stockArray);
foreach (JToken token in jArray.Children())
{
listStocks.Add("http://www.nasdaq.com/symbol/" + token.First.Value<string>().ToLower());
}
}
To explain in a bit more detail, the data comes from one big JavaScript array on the page: var table_body = [....
Each stock is one element in the array and is an array itself.
["ATVI", "Activision Blizzard, Inc", 11.75, 0.06, 0.51, 3058125, 0.06, "N", "N"]
So by parsing the array, taking the first element, and appending it to the fixed URL, we get the same result as the JavaScript.
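If you need more than the symbol, the other columns can be read the same way (a sketch, assuming the array layout shown above holds for every row):

foreach (JToken token in jArray.Children())
{
    string symbol = token[0].Value<string>();      // "ATVI"
    string company = token[1].Value<string>();     // "Activision Blizzard, Inc"
    decimal lastPrice = token[2].Value<decimal>(); // 11.75
}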
Why don't you just use the Descendants("a") method?
It's much simpler and more object-oriented. You'll just get a bunch of objects.
Then you can just get the "href" attribute from those objects.
Sample code:
htmlDoc.DocumentNode.Descendants("a").Select(a => a.GetAttributeValue("href", ""))
If you just need a list of links from a certain webpage, this method will do just fine.
If you look at the page source for that URL, there's not actually an element with id=indu_table. It appears to be generated dynamically (i.e. in JavaScript); the HTML that you get when loading directly from the server will not reflect anything that's changed by client script. This is probably why it's not working.
I am working on a simple web crawler that takes a URL, crawls the first-level links on the site, and extracts emails from all pages using regex...
I know it's kind of sloppy and it's just the beginning, but I always get "operation timed out" after 2 minutes of running the script.
private void button1_Click(object sender, System.EventArgs e)
{
string url = textBox1.Text;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
StreamReader sr = new StreamReader(response.GetResponseStream());
string code = sr.ReadToEnd();
string re = "href=\"(.*?)\"";
MatchCollection href = Regex.Matches(code, re, RegexOptions.Singleline);
foreach (Match h in href)
{
string link = h.Groups[1].Value;
if (!link.Contains("http://"))
{
HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url + link);
HttpWebResponse response2 = (HttpWebResponse)request2.GetResponse();
StreamReader sr2 = new StreamReader(response2.GetResponseStream());
string innerlink = sr2.ReadToEnd();
MatchCollection m2 = Regex.Matches(innerlink, @"([\w-]+(\.[\w-]+)*@([a-z0-9-]+(\.[a-z0-9-]+)*?\.[a-z]{2,6}|(\d{1,3}\.){3}\d{1,3})(:\d{4})?)", RegexOptions.Singleline);
foreach (Match m in m2)
{
string email = m.Groups[1].Value;
if (!listBox1.Items.Contains(email))
{
listBox1.Items.Add(email);
}
}
}
}
sr.Close();
}
Don't parse HTML using regex. Use the Html Agility Pack for that.
What is exactly the Html Agility Pack (HAP)?
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
More Information
Html Agility Pack on Codeplex
How to use HTML Agility pack
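For example, the link-extraction part of your crawler could look like this (a sketch, not tested against your pages):

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(code); // the page HTML you already downloaded
var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
if (anchors != null)
{
    foreach (var a in anchors)
    {
        string link = a.GetAttributeValue("href", "");
        // crawl the link as before...
    }
}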
The comment by Oded is correct; we need to know what you need help with specifically. However, I can at least point you to the Html Agility Pack, as it will solve most of your web-scraping woes.
Good Luck!
Matt
I am trying to grab data from a webpage, specifically from divs of a particular class, <div class="personal_info">. Each page has 10 or so similar divs of the same class, "personal_info" (as shown in the HTML below), and I want to extract all the divs of class personal_info, of which there are 10 - 15 on every webpage.
<div class="personal_info"><span class="bold">Rama Anand</span><br><br> Mobile: 9916184586<br>rama_asset@hotmail.com<br> Bangalore</div>
To do this, I started using the Html Agility Pack, as suggested by someone on Stack Overflow,
and I'm stuck at the very beginning because of my lack of knowledge of HtmlAgilityPack. My C# code goes like this:
HtmlAgilityPack.HtmlDocument docHtml = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlWeb docHFile = new HtmlWeb();
docHtml = docHFile.Load("http://127.0.0.1/2.html");
How do I code further so that the data from the divs whose class is "personal_info" can be grabbed? A suggestion with an example would be appreciated.
I can't check this right now, but isn't it:
var infos = from info in docHtml.DocumentNode.SelectNodes("//div[@class='personal_info']") select info;
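Then you can loop over the result. Note that SelectNodes returns null when nothing matches (which would make the query above throw), so it's worth guarding for that:

var nodes = docHtml.DocumentNode.SelectNodes("//div[@class='personal_info']");
if (nodes != null)
{
    foreach (HtmlNode info in nodes)
    {
        Console.WriteLine(info.InnerText); // e.g. "Rama Anand Mobile: 9916184586 ..."
    }
}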
To get a url loaded you can do something like:
var document = new HtmlAgilityPack.HtmlDocument();
var url = "http://www.google.com";
var request = (HttpWebRequest)WebRequest.Create(url);
using (var responseStream = request.GetResponse().GetResponseStream())
{
document.Load(responseStream, Encoding.UTF8);
}
Also note there is a fork that lets you use jQuery selectors with the Agility Pack:
IEnumerable<HtmlNode> myList = document.QuerySelectorAll(".personal_info");
http://yosi-havia.blogspot.com/2010/10/using-jquery-selectors-on-server-sidec.html
What happened to Where?
node.DescendantNodes().Where(node_it => node_it.Name=="div");
If you want the top node (root), use page.DocumentNode as "node".
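For example, to grab the personal_info divs this way (a sketch; GetAttributeValue returns the supplied default when the attribute is missing):

var divs = page.DocumentNode.DescendantNodes()
    .Where(n => n.Name == "div" && n.GetAttributeValue("class", "") == "personal_info");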