C# WebClient only downloads partial html - c#

I am working on some scraping app, i wanted to try to get it to work but ran into a problem. I have replaced the original scraping destination in the below code with googles webpage, just for testing. It seems that my download doesnt get everything, i note that the body and the html tags are missing their close tags. How do i get it to download everything? Whats wrong with my sample code:
string filename = "test.html";
WebClient client = new WebClient();
string searchTerm = HttpUtility.UrlEncode(textBox2.Text);
client.QueryString.Add("q", searchTerm);
client.QueryString.Add("hl", "en");
string data = client.DownloadString("http://www.google.com/search");
StreamWriter writer = new StreamWriter(filename, false, Encoding.Unicode);
writer.Write(data);
writer.Flush();
writer.Close();

Google's web pages are now in HTML 5, meaning the BODY and HTML tags can be self-closed - which is why Google omits them (believe it or not, it saves them bandwidth.)
See this article.
You can write HTML5 in either "HTML/SGML" mode (which allows the omitting of closing tags like HTML did prior to XHTML) or in "XHTML" which follows the rules of XML, requiring all tags to be closed.
Which the browser chooses to parse the page depends on whether you send a Content-type header of text/html for HTML/SGML syntax or application/xhtml+xml for XHTML syntax. (Source: HTML5 syntax - HTML vs XHTML)

...Google's page doesn't have the closing tags for <body> and <html>. Talk about crazy optimization...

http://www.google.com/search doesn't have closing tags.

Related

C# convert string to HTML

I have a string that has HTML formatted content.
Now I want to convert that string to HTML, May I use HtmlElementCollection
Is it possible? If yes, then how?
Kindly explain. Thanks!
Take a look at the HtmlAgilityPack. More information can be found on the answers of other similar questions.
A string will be handled as HTML when you push this string into an environment that will render the content as HTML.
When you push the content of the string to an environment that doesn't handle HTML or you explicitly say that you don't want it to render as HTML. It will be rendered as plain text.
Use HTMLAGILITYPACK and use the following:
var st1 = stringdata; // your html formatted string
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(st1); // this is now html doc
If someone still locking for this
For that just type code below in the place you what to show the HTML
#((MarkupString)htmlString)
//htmlString =>string that has HTML formatted content
//tested with blazor server last ver of dotnet(now 6.0)

how to save xml file from html page

I have requirement to save an XML document which is displayed in a HTML page.
The scenario is like this: I am sending search request to the server and in return I am getting xml file but displayed in html page. What I want is to save client-side the xml file displayed inside the html form, using javascript, asp.net (C#).
Please see this link (server return file like this)
http://www.w3schools.com/dom/books.xml
You need to save the stream from your request into a xmldoc.
Should be really straight forward.
Have a look here:
http://msdn.microsoft.com/en-us/library/system.xml.xmldocument.save(v=vs.100).aspx
Also, here there is a microsoft article about it:
http://support.microsoft.com/kb/307643
The sent file should have a HTTP header indicating it's meant to be saved, not shown.
Do it this way:
context.Response.ContentType = "application/force-download";
Assuming you are writing a client app to consume this file
using System.Net;
new WebClient().DownloadFile("http://www.w3schools.com/dom/books.xml", #"c:\books.xml");
As Rutesh said you can't go with javascript. Save it on the server with C#.NET and then access it from server. Secondly, the page you are talking about is not an html page its an xml file. I hope you know how to build an asp.net page, if yes then use this:
XmlDocument xdoc = new XmlDocument();
xdoc.Load("http://www.w3schools.com/dom/books.xml");
xdoc.Save("myfilename.xml");
Happy coding!
sourceResponse.ContentType = "application/octet-stream";
byte[] responseContent = "XML File content"
sourceResponse.AddHeader("content-disposition", String.Format("attachment; filename={0}.xml", "yourFileName"));
sourceResponse.BinaryWrite(responseContent);
sourceResponse.End();

Grabbing content from a website in C#

New to C# here, but I've used Java for years. I tried googling this and got a couple of answers that were not quite what I need. I'd like to grab the (X)HTML from a website and then use DOM (actually, CSS selectors are preferable, but whatever works) to grab a particular element. How exactly is this done in C#?
To get the HTML you can use the WebClient object.
To parse the HTML you can use HTMLAgility librrary.
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
WebRequest.Create("http://www.stackoverflow.com");
// execute the request
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
string tempString = null;
int count = 0;
do
{
// fill the buffer with data
count = resStream.Read(buf, 0, buf.Length);
// make sure we read some data
if (count != 0)
{
// translate from bytes to ASCII text
tempString = Encoding.ASCII.GetString(buf, 0, count);
// continue building the string
sb.Append(tempString);
}
}
while (count > 0); // any more data to read?
Then use Xquery expressions or Regex to grab the element you need
You could use System.Net.WebClient or System.Net.HttpWebrequest to fetch the page but parsing for the elements is not supported by the classes.
Use HtmlAgilityPack (http://html-agility-pack.net/)
HtmlWeb htmlWeb = new HtmlWeb();
htmlWeb.UseCookies = true;
HtmlDocument htmlDocument = htmlWeb.Load(url);
// after getting the document node
// you can do something like this
foreach (HtmlNode item in htmlDocument.DocumentNode.Descendants("input"))
{
// item mathces your req
// take the item.
}
I hear you want to use the HtmlAgilityPack for working with HTML files. This will give you Linq access, with is A Good Thing (tm). You can download the file with System.Net.WebClient.
You can use Html Agility Pack to load html and find the element you need.
To get you started, you can fairly easily use HttpWebRequest to get the contents of a URL. From there, you will have to do something to parse out the HTML. That is where it starts to get tricky. You can't use a normal XML parser, because many (most?) web site HTML pages aren't 100% valid XML. Web browsers have specially implemented parsers to work around the invalid portions. In Ruby, I would use something like Nokogiri to parse the HTML, so you might want to look for a .NET port of it, or another parser specificly designed to read HTML.
Edit:
Since the topic is likely to come up: WebClient vs. HttpWebRequest/HttpWebResponse
Also, thanks to the others that answered for noting HtmlAgility. I didn't know it existed.
Look into using the html agility pack, which is one of the more common libraries for parsing html.
http://htmlagilitypack.codeplex.com/

c# find image in html and download them

i want download all images stored in html(web page) , i dont know how much image will be download , and i don`t want use "HTML AGILITY PACK"
i search in google but all site make me more confused ,
i tried regex but only one result ... ,
People are giving you the right answer - you can't be picky and lazy, too. ;-)
If you use a half-baked solution, you'll deal with a lot of edge cases. Here's a working sample that gets all links in an HTML document using HTML Agility Pack (it's included in the HTML Agility Pack download).
And here's a blog post that shows how to grab all images in an HTML document with HTML Agility Pack and LINQ
// Bing Image Result for Cat, First Page
string url = "http://www.bing.com/images/search?q=cat&go=&form=QB&qs=n";
// For speed of dev, I use a WebClient
WebClient client = new WebClient();
string html = client.DownloadString(url);
// Load the Html into the agility pack
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Now, using LINQ to get all Images
List<HtmlNode> imageNodes = null;
imageNodes = (from HtmlNode node in doc.DocumentNode.SelectNodes("//img")
where node.Name == "img"
&& node.Attributes["class"] != null
&& node.Attributes["class"].Value.StartsWith("img_")
select node).ToList();
foreach(HtmlNode node in imageNodes)
{
Console.WriteLine(node.Attributes["src"].Value);
}
First of all I just can't leave this phrase alone:
images stored in html
That phrase is probably a big part of the reason your question was down-voted twice. Images are not stored in html. Html pages have references to images that web browsers download separately.
This means you need to do this in three steps: first download the html, then find the image references inside the html, and finally use those references to download the images themselves.
To accomplish this, look at the System.Net.WebClient() class. It has a .DownloadString() method you can use to get the html. Then you need to find all the <img /> tags. You're own your own here, but it's straightforward enough. Finally, you use WebClient's .DownloadData() or DownloadFile() methods to retrieve the images.
You can use a WebBrowser control and extract the HTML from that e.g.
System.Windows.Forms.WebBrowser objWebBrowser = new System.Windows.Forms.WebBrowser();
objWebBrowser.Navigate(new Uri("your url of html document"));
System.Windows.Forms.HtmlDocument objDoc = objWebBrowser.Document;
System.Windows.Forms.HtmlElementCollection aColl = objDoc.All.GetElementsByName("IMG");
...
or directly invoke the IHTMLDocument family of COM interfaces
In general terms
You need to fetch the html page
Search for img tags and extract the src="..." portion out of them
Keep a list of all these extracted image urls.
Download them one by one.
Maybe this question about C# HTML parser will help you a little bit more.

Generate PDF from ASP.NET from raw HTML/CSS content?

I'm sending emails that have invoices attached as PDFs. I'm already - elsewhere in the application - creating the invoices in an .aspx page. I'd like to use Server.Execute to return the output HTML and generate a PDF from that. Otherwise, I'd have to use a reporting tool to "draw" the invoice on a PDF. That blows for lots of reasons, not the least of which is that I'd have to update both the .aspx page and the report for every minor change. What to do...
There is no way to generate a PDF from an HTML string directly within .NET, but there are number of third party controls that work well.
I've had success with this one: http://www.html-to-pdf.net
and this: http://www.htmltopdfasp.net
The important questions to ask are:
Does it render correctly as compared to the 3 major browsers: IE, FF and Safari/Chrome?
Does it handle CSS fine?
Does the control have it's own rendering engine? If so, bounce it. You don't want to trust a home grown rendering engine - the browsers have a hard enough problem getting everything pixel perfect.
What dependencies does the third party control require? The fewer, the better.
There are a few others but they deal with ActiveX displays and such.
We use a product called ABCPDF for this and it works fantastic.
http://www.websupergoo.com/abcpdf-1.htm
This sounds like a job for Prince. It can take HTML and CSS and generate a PDF, which you can then present to your users. It supports CSS3 better than most web browsers (staff include HÃ¥kon Wium Lie, the inventor of CSS).
See the samples, especially the ones for Wikipedia pages, for the beautiful output it can generate. There's also an interesting Google Tech Talk with the authors.
Edit: There is a .NET wrapper available.
wkhtmltopdf is a free and cool exe to generate pdf from html. Its written in c++. But nReco htmltopdf is a wrapper dotnet library for this awesome tool. I implemented using this dotnet library and it was just so good it does everything by its own you just need to give html as a data source.
/// <summary>
/// Converts html into PDF using nReco dll and wkhtmltopdf.exe.
/// </summary>
private byte[] ConvertHtmlToPDF()
{
HtmlToPdfConverter nRecohtmltoPdfObj = new HtmlToPdfConverter();
nRecohtmltoPdfObj.Orientation = PageOrientation.Portrait;
nRecohtmltoPdfObj.PageFooterHtml = CreatePDFFooter();
nRecohtmltoPdfObj.CustomWkHtmlArgs = "--margin-top 35 --header-spacing 0 --margin-left 0 --margin-right 0";
return nRecohtmltoPdfObj.GeneratePdf(CreatePDFScript() + ShowHtml() + "</body></html>");
}
The above function is an excerpt from the below link post which explains it in detail.
HTML to PDF in ASP.Net
The initial question is about converting another aspx page containing an invoice to a PDF document. The invoice is probably using some session data and the user suggests to use Server.Execute() to obtain the invoice page HTML code and then to convert that code to PDF. Converting the invoice page URL directly is not possible because a new session would be created during conversion and the session data would be lost.
This is actually a good technique to preserve session data during conversion which is applied in Convert a HTML Page to PDF in Same Session ASP.NET Demo of the EvoPdf library. The complete C# code to get the HTML string rendered by the invoice page and to convert that string to PDF is:
// Execute the invoice page and get the HTML string rendered by this page
TextWriter outTextWriter = new StringWriter();
Server.Execute("Invoice.aspx", outTextWriter);
string htmlStringToConvert = outTextWriter.ToString();
// Create a HTML to PDF converter object with default settings
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
// Use the current page URL as base URL
string baseUrl = HttpContext.Current.Request.Url.AbsoluteUri;
// Convert the page HTML string to a PDF document in a memory buffer
byte[] outPdfBuffer = htmlToPdfConverter.ConvertHtml(htmlStringToConvert, baseUrl);
// Send the PDF as response to browser
// Set response content type
Response.AddHeader("Content-Type", "application/pdf");
// Instruct the browser to open the PDF file as an attachment or inline
Response.AddHeader("Content-Disposition", String.Format("attachment; filename=Convert_Page_in_Same_Session.pdf; size={0}", outPdfBuffer.Length.ToString()));
// Write the PDF document buffer to HTTP response
Response.BinaryWrite(outPdfBuffer);
// End the HTTP response and stop the current page processing
Response.End();
As long as you can make sure to use proper XHTML, you could also use a product like Alt-Soft's Xml2PDF to convert XML (XHTML) into PDF by means of XSLT/XSL-FO.
It takes a bit of a learning curve to master, but it works very well once you've "got" it!
Marc
Since you are producing the answer, you can use a tool like Report.NET:
http://sourceforge.net/projects/report/
I disagree with the answers that say you cannot convert directly from output to PDF, however, as you can "re-call" the page and get the HTML as a stream and convert it. I am not sure what tool you would want to use to do this, however. In other words, it is possible, but I am not sure it is worth it. The PDF creation libs, like Report.NET, even though they force reusing some logic and no automagic converrsion, it is easier.
I have not tried this component, but I have heard good things about it from those who have. The model is more like HTML, but I am not sure you can simply send a rendered ASPX to it to create PDF:
http://www.websupergoo.com/abcpdf-8.htm
If you try to find some html to pdf software via GOOGLE you'll get a pile of this stuff.
There are about 10 leaders but most of them use IE dlls in background mode.
Just couple of them use their own parsing engine.
Please try PDF Duo .NET component in your ASP.NET project if you wish to create a PDF programaticaly.
It is light component for a cool generating of PDF invoces, reports e.g.
I'd go a different route. Assuming you are using SQL Server, use SSRS and generate the PDF that way.
A possible minimal solution to use Server.Execute() to obtain the HTML of the invoice page and convert that code to a PDF using winnovative html to pdf api for .net is:
TextWriter outTextWriter = new StringWriter();
Server.Execute("Invoice.aspx", outTextWriter);
HtmlToPdfConverter htmlToPdfConverter = new HtmlToPdfConverter();
byte[] pdfBytes = htmlToPdfConverter.ConvertHtml(outTextWriter.ToString(),
httpContext.Current.Request.Url.AbsoluteUri);
You can use PDFSharp or iTextSharp to convert html to pdf. PDFSharp is not free.

Categories