HtmlDocument doc = webBrowser1.Document;
I can only get the HTML document if I browse to a page. Is it possible to get the HTML document:
without navigating to the page?
without using Html Agility Pack?
This is one way of doing it:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
WebResponse response = request.GetResponse();
WebBrowser wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;                 // avoid script error dialogs
wb.DocumentStream = response.GetResponseStream(); // the control parses this asynchronously
HtmlDocument doc = wb.Document;
As with a navigated WebBrowser control, it takes a few seconds for the contents of the stream to populate the control. Also make sure to dispose of everything properly after you are done.
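For example, a minimal sketch that waits for the document before reading it and cleans up afterwards (this assumes a WinForms message loop is running, since the control will not parse the stream without one):

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
WebResponse response = request.GetResponse();
WebBrowser wb = new WebBrowser();
wb.ScriptErrorsSuppressed = true;
wb.DocumentCompleted += (s, e) =>
{
    HtmlDocument doc = wb.Document;  // the document is populated by this point
    // ... work with doc ...
    response.Close();                // clean up once parsing is done
    wb.Dispose();
};
wb.DocumentStream = response.GetResponseStream();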
You need a document loaded for there to be a root element. Try loading "about:blank" to get an empty document without relying on any other URL or file.
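A minimal sketch of that approach, assuming this runs inside a WinForms form:

webBrowser1.DocumentCompleted += (s, e) =>
{
    HtmlDocument doc = webBrowser1.Document;              // the root element now exists
    doc.Write("<html><body><p>Hello</p></body></html>");  // build content by hand
};
webBrowser1.Navigate("about:blank");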
Related
I want to extract a couple of links from an HTML page downloaded from the internet, and I think that using LINQ to XML would be a good solution for my case.
My problem is that I can't create an XmlDocument from the HTML. Using Load(string url) didn't work, so I downloaded the HTML into a string using:
public static string readHTML(string url)
{
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
HttpWebResponse res = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(res.GetResponseStream());
string html = sr.ReadToEnd();
sr.Close();
return html;
}
When I try to load that string using LoadXml(string xml), I get the exception:
'--' is an unexpected token. The expected token is '>'
What approach should I take to read the HTML into parsable XML?
HTML simply isn't the same as XML (unless the HTML actually happens to be conforming XHTML, or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform it with LINQ to XML, or process it directly.
I haven't used it myself, but I suggest you take a look at SgmlReader. Here's a sample from their home page:
// setup SgmlReader
Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
{
DocType = "HTML",
WhitespaceHandling = WhitespaceHandling.All,
CaseFolding = Sgml.CaseFolding.ToLower,
InputStream = reader
};
// create document
XmlDocument doc = new XmlDocument()
{
PreserveWhitespace = true,
XmlResolver = null
};
doc.Load(sgmlReader);
return doc;
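Note that reader in the sample is an undeclared TextReader over the HTML input. A sketch of wiring it up, reusing the readHTML helper from the question (the wrapper name FromHtml is my own, not part of SgmlReader):

// Sketch: wrap the sample above in a helper and feed it a string.
public static XmlDocument FromHtml(TextReader reader)
{
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader()
    {
        DocType = "HTML",
        WhitespaceHandling = WhitespaceHandling.All,
        CaseFolding = Sgml.CaseFolding.ToLower,
        InputStream = reader
    };
    XmlDocument doc = new XmlDocument()
    {
        PreserveWhitespace = true,
        XmlResolver = null
    };
    doc.Load(sgmlReader);
    return doc;
}

// Usage:
string html = readHTML("http://example.com");
using (var reader = new StringReader(html))
{
    XmlDocument doc = FromHtml(reader);
    // doc is now queryable as ordinary XML
}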
If you want to extract some links from a page, as you mentioned, try using HTML Agility Pack.
This code gets a page from the web and extracts all links:
HtmlWeb web = new HtmlWeb();
HtmlDocument document = web.Load("http://www.stackoverflow.com");
HtmlNode[] links = document.DocumentNode.SelectNodes("//a").ToArray();
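To get at the actual URLs, read each node's href attribute, for example:

// note: SelectNodes returns null when nothing matches, so guard for that
foreach (HtmlNode link in links)
{
    // GetAttributeValue returns the fallback value if href is missing
    Console.WriteLine(link.GetAttributeValue("href", string.Empty));
}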
Open an HTML file from disk and get the URL for a specific link:
HtmlDocument document2 = new HtmlDocument();
document2.Load(@"C:\Temp\page.html");
HtmlNode link = document2.DocumentNode.SelectSingleNode("//a[@id='myLink']");
Console.WriteLine(link.Attributes["href"].Value);
HTML is not XML. HTML is based on SGML, and as such does not ensure that the markup is well-formed XML (XML is itself a subset of SGML). You can only parse XHTML, i.e. XML-compatible HTML, as XML, and that is not the case for most websites.
To work with HTML, you need to use an HTML parser.
If you know the nodes you're interested in, I would use a regex to extract the links from the string.
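If you go that route, a rough sketch might look like the following. The pattern is an assumption that only handles double-quoted href values, which is exactly why a real HTML parser is usually safer:

// Rough sketch only: matches href="..." inside <a> tags.
// Requires: using System.Text.RegularExpressions;
MatchCollection matches = Regex.Matches(
    html,
    "<a\\s+[^>]*href=\"([^\"]*)\"",
    RegexOptions.IgnoreCase);
foreach (Match match in matches)
{
    Console.WriteLine(match.Groups[1].Value);  // the captured link target
}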
I am developing an application which shows web pages through a web browser control.
When I click the save button, the web page, with its images, should be stored in local storage, saved in .html format.
I have the following code:
WebRequest request = WebRequest.Create(txtURL.Text);
WebResponse response = request.GetResponse();
Stream data = response.GetResponseStream();
string html = String.Empty;
using (StreamReader sr = new StreamReader(data))
{
html = sr.ReadToEnd();
}
Now string html contains the webpage content. I need to save this into D:\Cache\
How do I save the HTML contents to disk?
You can use this code to write your HTML string to a file:
var path = @"D:\Cache\myfile.html";
File.WriteAllText(path, html);
Further refinement: Extract the filename from your (textual) URL.
Update:
See Get file name from URI string in C# for details. The idea is:
var uri = new Uri(txtUrl.Text);
var filename = uri.IsFile
? System.IO.Path.GetFileName(uri.LocalPath)
: "unknown-file.html";
You have to write the below code in the save button's handler:
File.WriteAllText(path, browser.Document.Body.Parent.OuterHtml, Encoding.GetEncoding(browser.Document.Encoding));
Using 'Body.Parent' saves the whole page instead of just a part of it. Check it.
There is nothing built-in to the .NET Framework for this, as far as I know. So my approach would be as follows (a sketch of the later steps comes after this list):
1. Use System.Net.HttpWebRequest to get the main HTML document as a string or stream (easy, and you have done this already).
2. Load this into an HtmlAgilityPack document, where you can easily query for all image elements, stylesheet links, etc.
3. Make a separate web request for each of these files and save them to a subdirectory.
4. Finally, update all relevant links in the main page to point to the items in the subdirectory.
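A sketch of steps 2-4 with HtmlAgilityPack; the XPath, the folder layout, and the use of txtURL.Text to resolve relative URLs are my assumptions, not a complete implementation:

// Sketch: download <img> resources and rewrite the page to use local copies.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);                             // the html string you already have
Directory.CreateDirectory(@"D:\Cache\assets");

var images = doc.DocumentNode.SelectNodes("//img[@src]");
if (images != null)
{
    using (var client = new WebClient())
    {
        foreach (var img in images)
        {
            string src = img.GetAttributeValue("src", string.Empty);
            Uri absolute = new Uri(new Uri(txtURL.Text), src);   // resolve relative URLs
            string localName = Path.GetFileName(absolute.LocalPath);
            client.DownloadFile(absolute, Path.Combine(@"D:\Cache\assets", localName));
            img.SetAttributeValue("src", "assets/" + localName); // point at the local copy
        }
    }
}
doc.Save(@"D:\Cache\page.html");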
I want to extract some HTML elements from the "tablerow" contents of an HTML page and create an automated application. Can HttpWebRequest and HttpWebResponse help me do that? If yes, could anyone show me a sample of doing that? Thank you in advance.
I would go and get HtmlAgilityPack from NuGet. WebClient is easier, but HttpWebRequest is more powerful and allows for more control. Regex can work, but is generally a pain. If you think the document will be well enough formatted, a quick XPath query for the elements in question is usually much easier and cleaner, so try something like this:
var client = new WebClient();
//var html = client.DownloadString("YOURURL");
var html = "<html><body><table><tr><td></td></tr></table></body></html>";
var document = new HtmlDocument();
document.LoadHtml(html);
var nodes = document.DocumentNode.SelectNodes("//body/table");
Console.WriteLine(nodes[0].InnerHtml);
Console.ReadLine();
I have to download and parse a website which is rendered by ASP.NET. If I use the code below, I only get half of the page, without the rendered "content" that I need. I would like to get the full content that I can see with Firebug or the IE Developer Tool.
How can I do this? I didn't find a solution.
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
HttpWebResponse response = (HttpWebResponse)req.GetResponse();
StreamReader streamReader = new StreamReader(response.GetResponseStream());
string code = streamReader.ReadToEnd();
Thank you!
UPDATE
I tried the web control solution, but it didn't work. I am in a WPF project and use the following code, and I don't even get the content of a website. I don't see my mistake right now :(.
System.Windows.Forms.WebBrowser webBrowser = new System.Windows.Forms.WebBrowser();
Uri uri = new Uri(myAdress);
webBrowser.AllowNavigation = true;
webBrowser.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wb_DocumentCompleted);
webBrowser.Navigate(uri);
private void wb_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
System.Windows.Forms.WebBrowser wb = sender as System.Windows.Forms.WebBrowser;
string tmp = wb.DocumentText;
}
UPDATE 2
That's the code I came up with in the meantime.
However, I don't get any output; my elementCollection doesn't return any values.
If I can get the HTML source as a string, I'd be happy to parse it with the HtmlAgilityPack.
(I don't want to incorporate the browser into my XAML code.)
Sorry for getting on your nerves!
Thank you!
WebBrowser wb = new WebBrowser();
wb.Source = new Uri(MyURL);
HTMLDocument doc = (HTMLDocument)wb.Document;
IHTMLElementCollection elementCollection = doc.getElementsByName("body");
foreach (IHTMLElementCollection element in elementCollection)
{
tb.Text = element.toString();
}
If the page you're referring to has iframes or other dynamic loading mechanisms, HttpWebRequest alone won't be enough. A better solution would be (if possible) to use a WebBrowser control.
The answer might be that the content of the web site is rendered with JavaScript, probably with some AJAX calls that fetch additional data from the server to build the content. Firebug and the IE Developer Tool will show you the rendered HTML, but if you choose 'view source', you should see the same HTML as the one that you fetch with the code.
I would use a tool like the Fiddler Web Debugger to monitor what the page downloads when it is rendered. You might be able to get the needed content by simulating the AJAX requests that the page makes.
Note that it can be a pain to simulate browsing an ASP.NET web site if the navigation is done with postbacks, because you will need to include the values of all the form elements (including the hidden view state) when simulating clicks on links.
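If Fiddler shows the data coming from a separate request, you can often just replay it. A sketch with a placeholder URL and form body (substitute whatever Fiddler actually shows; nothing here is the real endpoint):

// Sketch: replay an AJAX POST observed in Fiddler.
var request = (HttpWebRequest)WebRequest.Create("http://example.com/ajax/data"); // placeholder URL
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

byte[] body = Encoding.UTF8.GetBytes("page=1");  // placeholder form fields
request.ContentLength = body.Length;
using (var stream = request.GetRequestStream())
{
    stream.Write(body, 0, body.Length);
}

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    string data = reader.ReadToEnd();  // the content the page would have built from this
}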
Probably not an answer, but you might use the WebClient class to simplify your code:
WebClient client = new WebClient();
string html = client.DownloadString(URL);
Your code should be downloading the entire page. However, the page may, through JavaScript, add content after it's been loaded. Unless you actually run that JavaScript in a web browser, you won't see the entire DOM you see in Firebug.
You can try this:
public override void Render(HtmlTextWriter writer)
{
    StringBuilder renderedOutput = new StringBuilder();
    StringWriter strWriter = new StringWriter(renderedOutput);
    HtmlTextWriter tWriter = new HtmlTextWriter(strWriter);
    base.Render(tWriter);

    // save the rendered markup to disk
    string filename = Server.MapPath(".") + "\\data.txt";
    using (FileStream outputStream = new FileStream(filename, FileMode.Create))
    using (StreamWriter sWriter = new StreamWriter(outputStream))
    {
        sWriter.Write(renderedOutput.ToString());
        sWriter.Flush();
    }

    // render for output
    writer.Write(renderedOutput.ToString());
}
I recommend using the following rendering engine instead of the WebBrowser control:
https://github.com/cefsharp/CefSharp
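For reference, a minimal off-screen sketch with the CefSharp.OffScreen package; Cef.Initialize(new CefSettings()) must be called once at startup, and the exact calls are assumptions about the version you install:

// Sketch: render a page in headless Chromium, then grab the post-JavaScript HTML.
public static async Task<string> GetRenderedHtmlAsync(string url)
{
    using (var browser = new ChromiumWebBrowser(url))
    {
        var tcs = new TaskCompletionSource<bool>();
        browser.LoadingStateChanged += (s, e) =>
        {
            if (!e.IsLoading) tcs.TrySetResult(true);  // fires when loading finishes
        };
        await tcs.Task;
        return await browser.GetSourceAsync();         // the rendered DOM as HTML
    }
}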