I have to download and parse a website that is rendered by ASP.NET. If I use the code below, I only get half of the page, without the rendered "content" that I need. I would like to get the full content that I can see with Firebug or the IE Developer Tool.
How can I do this? I didn't find a solution.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream());
string code = streamReader.ReadToEnd();
Thank you!
UPDATE
I tried the WebBrowser control solution, but it didn't work. I have a WPF project, use the following code, and don't even get the content of the website. I don't see my mistake right now :( .
System.Windows.Forms.WebBrowser webBrowser = new System.Windows.Forms.WebBrowser();
Uri uri = new Uri(myAdress);
webBrowser.AllowNavigation = true;
webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
webBrowser.Navigate(uri);
private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
System.Windows.Forms.WebBrowser wb = sender as System.Windows.Forms.WebBrowser;
string tmp = wb.DocumentText;
}
UPDATE 2
That's the code I came up with in the meantime.
However, I don't get any output: my elementCollection doesn't return any values.
If I can get the HTML source as a string, I'd be happy to parse it with the HtmlAgilityPack.
(I don't want to incorporate the browser into my XAML code.)
Sorry for getting on your nerves!
Thank you!
WebBrowser wb = new WebBrowser();
wb.Source = new Uri(MyURL);
HTMLDocument doc = (HTMLDocument)wb.Document;
IHTMLElementCollection elementCollection = doc.getElementsByName("body");
foreach (IHTMLElement element in elementCollection)
{
    tb.Text = element.toString();
}
If the page you're referring to has IFrames or other dynamic loading mechanisms, HttpWebRequest wouldn't be enough. A better solution would be (if possible) to use a WebBrowser control.
The answer might be that the content of the web site is rendered with JavaScript - probably with some AJAX calls that fetch additional data from the server to build the content. Firebug and the IE Developer Tool show you the rendered HTML, but if you choose 'view source', you should see the same HTML as the one you fetch with your code.
I would use a tool like the Fiddler Web Debugger to monitor what the page downloads when it is rendered. You might be able to get the needed content by simulating the AJAX requests that the page makes.
Note that it can be a pain to simulate browsing an ASP.NET web site if the navigation is done with postbacks, because you will need to include the values of all the form elements (including the hidden view state) when simulating clicks on links.
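For illustration, here is a rough, untested sketch of simulating such a postback with WebClient. Everything in it (the URL, the control ID passed as __EVENTTARGET, the way the hidden fields are scraped) is a placeholder - take the real names and values from what Fiddler shows you, and note that WebClient does not carry cookies between requests, so a session-based site needs extra handling:
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

class PostBackSketch
{
    // Naive extraction of a hidden input's value; HtmlAgilityPack would be more robust.
    static string GetHiddenField(string html, string name)
    {
        var match = Regex.Match(html, "id=\"" + name + "\" value=\"(?<v>[^\"]*)\"");
        return match.Success ? match.Groups["v"].Value : string.Empty;
    }

    static string SimulatePostBack(string url, string eventTarget)
    {
        using (var client = new WebClient())
        {
            // First fetch the page to obtain the hidden ASP.NET form fields.
            string page = client.DownloadString(url);

            var fields = new NameValueCollection();
            fields["__EVENTTARGET"] = eventTarget;   // the control the simulated "click" comes from
            fields["__EVENTARGUMENT"] = "";
            fields["__VIEWSTATE"] = GetHiddenField(page, "__VIEWSTATE");
            fields["__EVENTVALIDATION"] = GetHiddenField(page, "__EVENTVALIDATION");

            // Then post the form back, as the browser would do for the link click.
            byte[] response = client.UploadValues(url, "POST", fields);
            return Encoding.UTF8.GetString(response);
        }
    }
}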
Probably not an answer, but you might use the WebClient class to simplify your code:
WebClient client = new WebClient();
string html = client.DownloadString(URL);
Your code should be downloading the entire page. However, the page may, through JavaScript, add content after it's been loaded. Unless you actually run that JavaScript in a web browser, you won't see the entire DOM you see in Firebug.
You can try this:
public override void Render(HtmlTextWriter writer)
{
    // Render the page into a StringWriter so the HTML ends up in a string
    StringBuilder renderedOutput = new StringBuilder();
    StringWriter strWriter = new StringWriter(renderedOutput);
    HtmlTextWriter tWriter = new HtmlTextWriter(strWriter);
    base.Render(tWriter);
    string html = tWriter.InnerWriter.ToString();

    // Dump the rendered HTML to a file
    string filename = Server.MapPath(".") + "\\data.txt";
    FileStream outputStream = new FileStream(filename, FileMode.Create);
    StreamWriter sWriter = new StreamWriter(outputStream);
    sWriter.Write(renderedOutput.ToString());
    sWriter.Flush();

    // Render for output as usual
    writer.Write(renderedOutput.ToString());
}
I recommend using the following rendering engine instead of the WebBrowser control:
https://github.com/cefsharp/CefSharp
Related
I recently started working with the CefSharp browser in WinForms, using the Load method. Sometimes it works fine, but sometimes I am not able to render my HTML file. Can someone please help me?
BrowserSettings settings = new BrowserSettings();
Cef.Initialize(new CefSettings());
CefSharp.WinForms.ChromiumWebBrowser webBrowser = new CefSharp.WinForms.ChromiumWebBrowser(string.Empty);
webBrowser.Load(@"C:\kiranprac\CEFExample\CEFExample\HTMLResources\html\RTMTables_GetOrder.html");
OrderDetailsPnl.Controls.Add(webBrowser);
This is one of many timing issues in Chromium. You sometimes have to wait until the browser finishes the previous step before issuing another command.
In this case, you are constructing the browser with "about:blank", and then changing URL straight afterwards.
The easiest solution here is to supply your URL in the ChromiumWebBrowser constructor instead of calling Load separately.
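For instance, adapting the code from the question (only the constructor argument changes; the file path is the one from the question):
Cef.Initialize(new CefSettings());

// Pass the target URL directly to the constructor instead of string.Empty + Load().
var webBrowser = new CefSharp.WinForms.ChromiumWebBrowser(
    @"C:\kiranprac\CEFExample\CEFExample\HTMLResources\html\RTMTables_GetOrder.html");

OrderDetailsPnl.Controls.Add(webBrowser);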
When you create the browser object, give it a valid URL, then load your HTML text right after. It works at CEF v49!
This works:
var browser = new ChromiumWebBrowser("http://google.com"); //workaround!! yess!!!
var htmlText = "<html>hello world- this my html</html>";
browser.LoadHtml(htmlText, "http://example/");
This doesn't work:
var browser = new ChromiumWebBrowser("randomstring"); // silent fail
var htmlText = "<html>hello world- this my html</html>";
browser.LoadHtml(htmlText, "http://example/");
I am developing an application which shows web pages through a web browser control.
When I click the save button, the web page with images should be stored in local storage. It should be saved in .html format.
I have the following code:
WebRequest request = WebRequest.Create(txtURL.Text);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Now the string html contains the webpage content. I need to save this into D:\Cache\.
How do I save the HTML contents to disk?
You can use this code to write your HTML string to a file:
var path = @"D:\Cache\myfile.html";
File.WriteAllText(path, html);
Further refinement: Extract the filename from your (textual) URL.
Update:
See Get file name from URI string in C# for details. The idea is:
var uri = new Uri(txtUrl.Text);
var filename = uri.IsFile
? System.IO.Path.GetFileName(uri.LocalPath)
: "unknown-file.html";
You have to write the code below in your save button handler:
File.WriteAllText(path, browser.Document.Body.Parent.OuterHtml, Encoding.GetEncoding(browser.Document.Encoding));
Using 'Body.Parent' saves the whole page instead of just a part of it.
Check it.
There is nothing built into the .NET Framework for this as far as I know. So my approach would be:
1. Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy - which you have done already).
2. Load this into an HtmlAgilityPack document, where you can easily query the document to get lists of all image elements, stylesheet links, etc.
3. Then make a separate web request for each of these files and save them to a subdirectory.
4. Finally, update all relevant links in the main page to point to the items in the subdirectory (see the sketch below).
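A rough sketch of steps 2-4, assuming the HtmlAgilityPack NuGet package is referenced. It only handles <img> tags - stylesheets and scripts would be handled the same way - and the D:\Cache path and txtURL text box are just the examples used earlier in this thread:
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

// html    = the page source you already downloaded
// baseUri = the address of the page, used to resolve relative image links
var baseUri = new Uri(txtURL.Text);
var doc = new HtmlDocument();
doc.LoadHtml(html);

var saveDir = @"D:\Cache";
var assetDir = Path.Combine(saveDir, "assets");
Directory.CreateDirectory(assetDir);

var images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    using (var client = new WebClient())
    {
        foreach (var img in images)
        {
            // Download each image and rewrite its src to point at the local copy.
            var src = new Uri(baseUri, img.GetAttributeValue("src", ""));
            var localName = Path.Combine("assets", Path.GetFileName(src.LocalPath));
            client.DownloadFile(src, Path.Combine(saveDir, localName));
            img.SetAttributeValue("src", localName);
        }
    }
}

doc.Save(Path.Combine(saveDir, "page.html"));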
In the project I have in mind I want to be able to look at a website, retrieve text from that website, and do something with that information later.
My question is: what is the best way to retrieve the data (text) from the website? I am unsure how to do this when dealing with a static page vs. a dynamic page.
From some searching I found this:
WebRequest request = WebRequest.Create("anysite.com");
// If required by the server, set the credentials.
request.Credentials = CredentialCache.DefaultCredentials;
// Get the response.
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// Display the status.
Console.WriteLine(response.StatusDescription);
Console.WriteLine();
// Get the stream containing content returned by the server.
using (Stream dataStream = response.GetResponseStream())
{
// Open the stream using a StreamReader for easy access.
StreamReader reader = new StreamReader(dataStream, Encoding.UTF8);
// Read the content.
string responseString = reader.ReadToEnd();
// Display the content.
Console.WriteLine(responseString);
reader.Close();
}
response.Close();
From running this on my own, I can see it returns the HTML code of a website, which is not exactly what I'm looking for. I eventually want to be able to type in a site (such as a news article) and get back the contents of the article. Is this possible in C# or Java?
Thanks
I hate to break this to you, but that's how web pages look: one long stream of HTML markup/content. This gets rendered by the browser into what you see on your screen. The only way I can think of is to parse the HTML yourself.
After a quick search on Google I found this Stack Overflow question:
What is the best way to parse html in C#?
I'm betting you figured this would be a bit easier than you expected, but that's the fun in programming - always challenging problems.
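If you go the HTML-parsing route, a small HtmlAgilityPack sketch might look like this (the URL and the //p XPath are only examples - a real news site needs its own selectors):
using System;
using HtmlAgilityPack;

var web = new HtmlWeb();
var doc = web.Load("http://example.com/some-article");

// Print the visible text of every paragraph; adjust the XPath to the site you scrape.
var paragraphs = doc.DocumentNode.SelectNodes("//p");
if (paragraphs != null)
{
    foreach (var p in paragraphs)
    {
        Console.WriteLine(p.InnerText.Trim());
    }
}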
You can just use a WebClient:
using(var webClient = new WebClient())
{
string htmlFromPage = webClient.DownloadString("http://myurl.com");
}
In the above example htmlFromPage will contain the HTML which you can then parse to find the data you're looking for.
What you are describing is called web scraping, and there are plenty of libraries that do just that for both Java and C#. It doesn't really matter if the target site is static or dynamic since both output HTML in the end. JavaScript or Flash heavy sites on the other hand tend to be problematic.
Please try this:
System.Net.WebClient wc = new System.Net.WebClient();
string webData = wc.DownloadString("anysite.com");
HtmlDocument doc = webBrowser1.Document;
I can only get the HTML document if I browse to a page.
Is it possible to get the HTML document:
without navigating to a webpage?
without using the Html Agility Pack?
This is one way of doing that:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
WebResponse response = request.GetResponse();
WebBrowser wb = new WebBrowser();
wb.DocumentStream = response.GetResponseStream();
wb.ScriptErrorsSuppressed = true;
HtmlDocument doc = wb.Document;
As with navigating the WebBrowser control, it takes a few seconds for the contents of the stream to populate the control. Also make sure to do proper disposing after you are done.
You need a document loaded for there to be a root element. Try loading "about:blank" to get an empty document without relying on any other URL or file.
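For example (WinForms; the busy-wait loop is only there to keep the sketch short):
var wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;
wb.Navigate("about:blank");

// Pump messages until the empty document is ready; after that wb.Document
// (and its Body) exist and DocumentStream/DocumentText can be assigned.
while (wb.ReadyState != WebBrowserReadyState.Complete)
{
    System.Windows.Forms.Application.DoEvents();
}

HtmlDocument doc = wb.Document;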
I've been working on a WebCrawler written in C# using System.Windows.Forms.WebBrowser. I am trying to download a file off a website and save it on a local machine. More importantly, I would like this to be fully automated. The file download can be started by clicking a button that calls a javascript function that sparks the download displaying a “Do you want to open or save this file?” dialog. I definitely do not want to be manually clicking “Save as”, and typing in the file name.
I am aware of HttpWebRequest and WebClient's download functions, but since the download is started with JavaScript, I do not know the URL of the file. FYI, the JavaScript is a doPostBack function that changes some values and submits a form.
I’ve tried getting focus on the save as dialog from WebBrowser to automate it from in there without much success. I know there’s a way to force the download to save instead of asking to save or open by adding a header to the http request, but I don’t know how to specify the filepath to download to.
I think you should prevent the download dialog from even showing. Here might be a way to do that:
1. The JavaScript code causes your WebBrowser control to navigate to a specific URL, which is what makes the download dialog appear.
2. To prevent the WebBrowser control from actually navigating to this URL, attach an event handler to the Navigating event.
3. In your Navigating event, analyze whether this is the navigation you want to stop (is this the download URL? Perhaps check for a file extension - there must be a recognizable format). Use WebBrowserNavigatingEventArgs.Url to do so.
4. If this is the right URL, stop the navigation by setting the WebBrowserNavigatingEventArgs.Cancel property to true.
5. Continue the download yourself with the HttpWebRequest or WebClient classes (see the sketch after the links below).
Have a look at this page for more info on the event:
http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrowser.navigating.aspx
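A rough sketch of that approach - the file-extension check and the target folder are only examples, and a site behind a login may also require forwarding cookies to the WebClient:
webBrowser.Navigating += (sender, e) =>
{
    // Is this the navigation that would pop up the open/save dialog?
    if (e.Url.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
    {
        // Stop the WebBrowser from navigating there...
        e.Cancel = true;

        // ...and fetch the file ourselves instead.
        using (var client = new WebClient())
        {
            var target = System.IO.Path.Combine(@"C:\Downloads",
                System.IO.Path.GetFileName(e.Url.LocalPath));
            client.DownloadFile(e.Url, target);
        }
    }
};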
A similar solution is available at
http://social.msdn.microsoft.com/Forums/en/csharpgeneral/thread/d338a2c8-96df-4cb0-b8be-c5fbdd7c9202/?prof=required
This works perfectly if there is a direct URL that includes the file name to download.
But sometimes a URL generates the file dynamically, so the URL doesn't contain a file name; only after requesting the URL does the website create the file and show the open/save dialog.
For example, some links generate a PDF file on the fly.
How do I handle that type of URL?
Take a look at Erika Chinchio's article at http://www.codeproject.com/Tips/659004/Download-of-file-with-open-save-dialog-box
I have successfully used it for downloading dynamically generated PDF URLs.
Assuming the System.Windows.Forms.WebBrowser was used to access a protected page with a protected link that you want to download:
This code retrieves the actual link you want to download using the web browser. It will need to be changed for your specific action. The important part is the documentLinkUrl field, which is used below.
var documentLinkUrl = default(Uri);
browser.DocumentCompleted += (object sender, WebBrowserDocumentCompletedEventArgs e) =>
{
var aspForm = browser.Document.Forms[0];
var downloadLink = browser.Document.ActiveElement
.GetElementsByTagName("a").OfType<HtmlElement>()
.Where(atag =>
atag.GetAttribute("href").Contains("DownloadAttachment.aspx"))
.First();
var documentLinkString = downloadLink.GetAttribute("href");
documentLinkUrl = new Uri(documentLinkString);
};
browser.Navigate(yourProtectedPage);
Now that the protected page has been navigated to by the web browser and the download link has been acquired, this code downloads the link:
private async Task DownloadLinkAsync(Uri documentLinkUrl)
{
var cookieString = GetGlobalCookies(documentLinkUrl.AbsoluteUri);
var cookieContainer = new CookieContainer();
using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
using (var client = new HttpClient(handler) { BaseAddress = documentLinkUrl })
{
cookieContainer.SetCookies(documentLinkUrl, cookieString);
var response = await client.GetAsync(documentLinkUrl);
if (response.IsSuccessStatusCode)
{
var responseAsString = await response.Content.ReadAsStreamAsync();
// Response can be saved from Stream
}
}
}
The code above relies on the GetGlobalCookies method from Erika Chinchio, which can be found in the excellent article provided by @Pedro Leonardo (available here):
[System.Runtime.InteropServices.DllImport("wininet.dll", CharSet = System.Runtime.InteropServices.CharSet.Auto, SetLastError = true)]
static extern bool InternetGetCookieEx(string pchURL, string pchCookieName,
System.Text.StringBuilder pchCookieData, ref uint pcchCookieData, int dwFlags, IntPtr lpReserved);
const int INTERNET_COOKIE_HTTPONLY = 0x00002000;
private string GetGlobalCookies(string uri)
{
uint uiDataSize = 2048;
var sbCookieData = new System.Text.StringBuilder((int)uiDataSize);
if (InternetGetCookieEx(uri, null, sbCookieData, ref uiDataSize,
INTERNET_COOKIE_HTTPONLY, IntPtr.Zero)
&&
sbCookieData.Length > 0)
{
return sbCookieData.ToString().Replace(";", ",");
}
return null;
}