New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
To get the HTML you can use the WebClient class.
To parse the HTML you can use the HTML Agility Pack library.
using System.IO;
using System.Net;
using System.Text;

// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.stackoverflow.com");

// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

// we will read data via the response stream
Stream resStream = response.GetResponseStream();

// a read buffer and a StringBuilder to accumulate the page text
byte[] buf = new byte[8192];
StringBuilder sb = new StringBuilder();
int count;

do
{
    // fill the buffer with data
    count = resStream.Read(buf, 0, buf.Length);

    // make sure we read some data
    if (count != 0)
    {
        // translate from bytes to ASCII text and keep building the string
        sb.Append(Encoding.ASCII.GetString(buf, 0, count));
    }
}
while (count > 0); // any more data to read?
Then use XQuery expressions or a Regex to grab the element you need.
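For example, here is a quick Regex that pulls the <title> element out of the string built above. Regex is fragile for real-world HTML, so treat it as an illustration only; it needs using System.Text.RegularExpressions;.

string html = sb.ToString();
Match m = Regex.Match(html, @"<title[^>]*>\s*(.*?)\s*</title>",
                      RegexOptions.IgnoreCase | RegexOptions.Singleline);
if (m.Success)
{
    Console.WriteLine(m.Groups[1].Value); // the page title
}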
You could use System.Net.WebClient or System.Net.HttpWebRequest to fetch the page, but parsing for elements is not supported by those classes.
Use HtmlAgilityPack (http://html-agility-pack.net/)
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);

// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
    // item matches your requirement
    // take the item
}
I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you LINQ access, which is A Good Thing (tm). You can download the file with System.Net.WebClient.
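For illustration, a minimal sketch of that LINQ-style access, combining WebClient with the HtmlAgilityPack; the URL here is just an example.

using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

string html;
using (var client = new WebClient())
{
    html = client.DownloadString("http://www.stackoverflow.com");
}

var doc = new HtmlDocument();
doc.LoadHtml(html);

// e.g. the href of every anchor that has one
var links = doc.DocumentNode.Descendants("a")
               .Where(a => a.Attributes["href"] != null)
               .Select(a => a.Attributes["href"].Value);

foreach (string link in links)
{
    Console.WriteLine(link);
}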
You can use the Html Agility Pack to load the HTML and find the element you need.
To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specifically designed to read HTML.
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.
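For what it's worth, here is a small sketch of the parsing point above: a strict XML parser rejects typical real-world HTML, while an HTML-specific parser such as the HtmlAgilityPack (roughly a .NET analogue of Nokogiri) copes with it. The sample markup is made up.

using System;
using System.Xml;
using HtmlAgilityPack;

// unclosed <br> and an unquoted attribute: common in real pages
string html = "<html><body><p>Hello<br>world<img src=x></body></html>";

try
{
    new XmlDocument().LoadXml(html); // throws: not well-formed XML
}
catch (XmlException ex)
{
    Console.WriteLine("XmlDocument failed: " + ex.Message);
}

var doc = new HtmlDocument(); // HtmlAgilityPack takes it in stride
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p").InnerText);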
Look into using the Html Agility Pack, which is one of the more common libraries for parsing HTML.
http://htmlagilitypack.codeplex.com/
Related
I built an application to read a feed, but even though my connection is fast, the page takes several seconds to load. I would like to know how to read only the first records of this .xml.
string rssURL = "http://www.cnt.org.br/Paginas/feed.aspx?t=n";
System.Net.WebRequest myRequest = System.Net.WebRequest.Create(rssURL);
System.Net.WebResponse myResponse = myRequest.GetResponse();
System.IO.Stream rssStream = myResponse.GetResponseStream();
System.Xml.XmlDocument rssDoc = new System.Xml.XmlDocument();
rssDoc.Load(rssStream);
System.Xml.XmlNodeList rssItems = rssDoc.SelectNodes("rss/channel/item");
Thanks..
As the previous posters mention, you can't download part of a web request. But you can start parsing the XML before the request has finished. Using XmlDocument is the wrong approach for your use case, because it needs the complete response to create the object. Try using XmlTextReader.
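A minimal sketch of that streaming approach, reusing the feed URL from the question. XmlReader.Create is the modern factory for XmlTextReader, and the item count of 5 is just an example.

using System;
using System.Xml;

string rssURL = "http://www.cnt.org.br/Paginas/feed.aspx?t=n";
int itemsRead = 0;

using (XmlReader reader = XmlReader.Create(rssURL))
{
    // the reader yields nodes as the response streams in,
    // so we can stop after the first few <item> elements
    while (itemsRead < 5 && reader.ReadToFollowing("item"))
    {
        if (reader.ReadToDescendant("title"))
        {
            Console.WriteLine(reader.ReadElementContentAsString());
        }
        itemsRead++;
    }
} // disposing the reader abandons the rest of the download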
There is no easy way to download part of a web request and ensure it is what you want. One workaround would be to use the Google Feed API.
You'd have to use the JSON interface since they don't provide a library for C#, but since it's going through Google's servers it will be much faster. You'd have to modify your code a little bit, since it returns JSON by default instead of XML, but that is a trivial change to make. You can also change the parameter output=xml to retrieve the XML representation of the data.
Try going to this page; it is your same feed with fewer elements, and it loads much faster. That only returns a few elements, but if you want 10 elements, all you have to do is add num=10 to the URL. For example, this url has 10 elements. Read the API documentation a little more to see what variables you can add to tailor the request to what you want to do.
In the project I have in mind I want to be able to look at a website, retrieve text from that website, and do something with that information later.
My question is: what is the best way to retrieve the data (text) from the website? I am unsure about how to do this when dealing with a static page vs. dealing with a dynamic page.
From some searching I found this:
WebRequest request = WebRequest.Create("http://anysite.com");
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Display the status.
Console.WriteLine(response.StatusDescription);
Console.WriteLine();
// Get the stream containing content returned by the server.
using (Stream dataStream = response.GetResponseStream())
{
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream, Encoding.UTF8);
// Read the content.
string responseString = reader.ReadToEnd();
// Display the content.
Console.WriteLine(responseString);
reader.Close();
}
response.Close();
So from running this on my own I can see it returns the HTML code from a website, not exactly what I'm looking for. I eventually want to be able to type in a site (such as a news article) and return the contents of the article. Is this possible in C# or Java?
Thanks
I hate to break this to you, but that's how web pages look: a long stream of HTML markup/content. This gets rendered by the browser into what you see on your screen. The only way I can think of is to parse the HTML yourself.
After a quick search on Google I found this Stack Overflow article:
What is the best way to parse html in C#?
I'm betting you figured this would be a bit easier than you expected, but that's the fun in programming: always challenging problems.
You can just use a WebClient:
using (var webClient = new WebClient())
{
    string htmlFromPage = webClient.DownloadString("http://myurl.com");
}
In the above example htmlFromPage will contain the HTML which you can then parse to find the data you're looking for.
What you are describing is called web scraping, and there are plenty of libraries that do just that for both Java and C#. It doesn't really matter if the target site is static or dynamic, since both output HTML in the end. JavaScript- or Flash-heavy sites, on the other hand, tend to be problematic.
Please try this,
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("http://anysite.com");
I want to extract some HTML elements from the "tablerow" contents of an HTML page and create an automated application. Can HttpWebRequest and HttpWebResponse help me do that? If yes,
could anyone show me a sample of doing that... Thanking you in advance
I would go get HtmlAgilityPack from NuGet. WebClient is easier, but HttpWebRequest is more powerful and allows for more control. Regex can work, but is generally a pain. If you think the document will be well enough formatted, a quick XPath query to the elements in question is usually much easier and cleaner, so try something like this:
using System;
using System.Net;
using HtmlAgilityPack;

var client = new WebClient();
//var html = client.DownloadString("YOURURL");
var html = "<html><body><table><tr><td></td></tr></table></body></html>";

var document = new HtmlDocument();
document.LoadHtml(html);

var nodes = document.DocumentNode.SelectNodes("//body/table");
Console.WriteLine(nodes[0].InnerHtml);
Console.ReadLine();
Scenario:
I need to parse millions of HTML files/pages (as fast as I can) and then read only the Title or Meta part of each and dump it to a database.
What I am doing is using the System.Net.WebClient class's DownloadString(url_path) to download, and then saving it to the database with LINQ to SQL.
But this DownloadString function gives me the complete HTML source; I just need the Title part and the META tag part.
Any ideas, to download only that much content?
I think you can open a stream for this URL and use the stream to read the first x bytes. I can't tell you the exact number, but I think you can set it to a reasonable number so as to get the title and the description.
HttpWebRequest fileToDownload = (HttpWebRequest)WebRequest.Create("YourURL");
using (WebResponse fileDownloadResponse = fileToDownload.GetResponse())
{
    using (Stream fileStream = fileDownloadResponse.GetResponseStream())
    {
        using (StreamReader fileStreamReader = new StreamReader(fileStream))
        {
            const int Number = 4096; // pick a size that covers <title> and the meta tags
            char[] x = new char[Number];
            int charsRead = fileStreamReader.Read(x, 0, Number);
            string data = new string(x, 0, charsRead);
        }
    }
}
I suspect that WebClient will try to download the whole page first, in which case you'd probably want a raw client socket. Send the appropriate HTTP request (manually, since you're using raw sockets), start reading the response (which will not arrive all at once), and kill the connection when you've read enough. However, the rest will probably already have been sent from the server and be winging its way to your PC whether you want it or not, so you might not save much - if anything - of the bandwidth.
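A rough sketch of that raw-socket idea, assuming a plain HTTP (not HTTPS) site; the host name and byte budget here are made up.

using System;
using System.Net.Sockets;
using System.Text;

const string host = "www.example.com"; // hypothetical target
const int maxBytes = 2048;             // stop after this many bytes

using (TcpClient client = new TcpClient(host, 80))
using (NetworkStream stream = client.GetStream())
{
    // send the HTTP request by hand
    byte[] request = Encoding.ASCII.GetBytes(
        "GET / HTTP/1.1\r\nHost: " + host + "\r\nConnection: close\r\n\r\n");
    stream.Write(request, 0, request.Length);

    // read until we have enough, then drop the connection
    byte[] buffer = new byte[maxBytes];
    int total = 0, read;
    while (total < maxBytes &&
           (read = stream.Read(buffer, total, maxBytes - total)) > 0)
    {
        total += read;
    }
    Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, total));
} // disposing the TcpClient kills the connection early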
Depending on what you want it for, many half decent websites have a custom 404 page which is a lot simpler than a known page. Whether that has the information you're after is another matter.
You can use the verb "HEAD" in an HttpWebRequest to return only the response headers (not the element). To get the full element with the meta data you'll need to download the page and parse out the meta data you want.
var request = System.Net.WebRequest.Create(uri);
request.Method = "HEAD"; // the server returns headers only, no body
Does anyone know an easy way to import a raw XML RSS feed into C#? I'm looking for an easy way to get the XML as a string so I can parse it with a Regex.
Thanks,
-Greg
This should be enough to get you going...
using System.IO;
using System.Net;

WebClient wc = new WebClient();
Stream st = wc.OpenRead("http://example.com/feed.rss");
using (StreamReader sr = new StreamReader(st))
{
    string rss = sr.ReadToEnd();
}
If you're on .NET 3.5 you get built-in support for syndication feeds (RSS and Atom). Check out this MSDN Magazine article for a good introduction.
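For example, a minimal sketch using those syndication classes; the feed URL is illustrative, and you may need a reference to the System.ServiceModel.Web assembly on 3.5.

using System;
using System.ServiceModel.Syndication;
using System.Xml;

using (XmlReader reader = XmlReader.Create("http://example.com/feed.rss"))
{
    SyndicationFeed feed = SyndicationFeed.Load(reader);
    foreach (SyndicationItem item in feed.Items)
    {
        Console.WriteLine("{0}: {1}", item.PublishDate, item.Title.Text);
    }
}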
If you really want to parse the string using regex (and parsing XML is not what regex was intended for), the easiest way to get the content is to use the WebClient class. It has a DownloadString method which is straightforward to use. Just give it the URL of your feed. Check this link for an example of how to use it.
I would load the feed into an XmlDocument and use XPath instead of regex, like so:
XmlDocument doc = new XmlDocument();
HttpWebRequest request = WebRequest.Create(feedUrl) as HttpWebRequest;
using (HttpWebResponse response = request.GetResponse() as HttpWebResponse)
{
    StreamReader reader = new StreamReader(response.GetResponseStream());
    doc.Load(reader);
    // e.g. select every item in the feed
    XmlNodeList items = doc.SelectNodes("/rss/channel/item");
}
What are you trying to accomplish?
I found the System.ServiceModel.Syndication classes very helpful when working with feeds.
You might want to have a look at this: http://www.codeproject.com/KB/cs/rssframework.aspx
XmlDocument (located in System.Xml; you will need to add a reference to the DLL if it isn't added for you) is what you would use to get the XML into C#. At that point, just call the InnerXml property, which gives you the inner XML as a string, then parse it with the Regex.
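A small sketch of that route, with a made-up feed URL and a deliberately naive pattern; regex on XML is fragile, as noted elsewhere in this thread.

using System;
using System.Text.RegularExpressions;
using System.Xml;

XmlDocument doc = new XmlDocument();
doc.Load("http://example.com/feed.rss"); // XmlDocument.Load accepts a URL

string xml = doc.InnerXml; // the whole document as one string
foreach (Match m in Regex.Matches(xml, "<title>(.*?)</title>"))
{
    Console.WriteLine(m.Groups[1].Value);
}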
The best way to grab an RSS feed as the requested string would be to use the System.Net.HttpWebRequest class. Once you've set up the HttpWebRequest's parameters (URL, etc.), call the HttpWebRequest.GetResponse() method. From there, you can get a Stream with WebResponse.GetResponseStream(). Then, you can wrap that stream in a System.IO.StreamReader, and call the StreamReader.ReadToEnd(). Voila.
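In code, the chain described above looks roughly like this; the feed URL is illustrative.

using System;
using System.IO;
using System.Net;

HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/feed.rss");
using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
using (Stream stream = response.GetResponseStream())
using (StreamReader reader = new StreamReader(stream))
{
    string rss = reader.ReadToEnd(); // the raw feed XML as a string
    Console.WriteLine(rss);
}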
The RSS is just XML and can be streamed to disk easily. Go with Darrel's example - it's all you'll need.