I have some code that connects to an HTTP API, and is supposed to get an XML response. When the API link is placed in a browser, the browser downloads the XML as a file. However, when the code connects to the same API, HTML is returned. I've told the API owner but they don't think anything is wrong. Is there a way to capture the downloaded file instead of the HTML?
I've tried setting headers so my code looks like a browser, and I've also tried WebRequest instead of WebClient, but nothing works.
Here is the code; the URL works in the browser (the file is downloaded) but not with WebClient:
WebClient webClient = new WebClient();
string result = webClient.DownloadString(url);
The code should somehow receive the XML file instead of the page's HTML (in fact, the HTML never appears in the browser; only the file download does).
The URI you are accessing may actually be an HTML page with its own download mechanism (for example, it generates the real download address dynamically on the server and redirects to it, in order to prevent hotlinking).
In that case you would need an embedded browser engine such as CefSharp to run the HTML and its JavaScript so the page can navigate on its own, and you would probably want to hook the download event to handle the file yourself.
I think you need to add an Accept header to the WebClient object.
using (var client = new WebClient())
{
    // Tell the server we want XML rather than its default HTML representation.
    client.Headers[HttpRequestHeader.Accept] = "application/xml;q=1";
    string result = client.DownloadString(url);
}
Thank you all for your input. In the end, it was caused by our vendor switching to TLS 1.2. I just had to force my code to use TLS 1.2 and then it worked.
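For anyone else hitting this, a minimal sketch of forcing TLS 1.2 on .NET Framework (assuming version 4.5 or later), run once before the request:
// Opt in to TLS 1.2 for all subsequent HTTPS connections in this process.
ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;
string result = new WebClient().DownloadString(url);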
Related
I am learning to build a Windows service, but it also takes requests (e.g. localhost:8081/index) from the browser. Therefore, the HTTP response should contain an HTML page.
The page looks okay when I double-click the index.html file, but it loses all its CSS and JS when I request it through the web browser. I opened the developer tools in Chrome and found that all the CSS and JS files were "corrupted" and contained the code from my HTML page (weird).
I used the HttpListenerContext class to listen for the http://localhost/index request, then opened the index.html file with File.ReadAllBytes(file). When composing the response, I used the following code:
// Read the requested file and send its bytes as the response body.
responseBytes = File.ReadAllBytes(file);
response.ContentLength64 = responseBytes.Length;
await response.OutputStream.WriteAsync(responseBytes, 0, responseBytes.Length);
response.OutputStream.Close();
Can anyone help me to figure out why this is happening?
So I figured out the answer myself when I traced the code. When the request for localhost/index comes in from the browser, the first GET fetches index.html, and then further GET requests come in for the CSS, images, and JS as well (it's odd that I didn't see those later requests before).
Then I updated the handler to compose responses for the CSS, image, and JS files too. Everything works perfectly now.
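A minimal sketch of such a handler; the MIME map, the webRoot variable, and the file layout are my assumptions, not the original code:
// Map the request path to a file on disk and pick a Content-Type by extension,
// so CSS/JS/image requests no longer get index.html back.
string path = context.Request.Url.AbsolutePath.TrimStart('/');
if (path == "" || path == "index") path = "index.html";
string file = Path.Combine(webRoot, path); // webRoot: folder holding the site files
switch (Path.GetExtension(file).ToLowerInvariant())
{
    case ".css": response.ContentType = "text/css"; break;
    case ".js": response.ContentType = "application/javascript"; break;
    case ".png": response.ContentType = "image/png"; break;
    default: response.ContentType = "text/html"; break;
}
byte[] responseBytes = File.ReadAllBytes(file);
response.ContentLength64 = responseBytes.Length;
await response.OutputStream.WriteAsync(responseBytes, 0, responseBytes.Length);
response.OutputStream.Close();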
I need to be able to download a file from a URL. The URL does not link to an actual file; instead, it generates the file on the server first and then presents a download dialog, probably by returning an MVC FileResult.
I'm just interested in getting the byte[] from the file.
I've tried:
using (var webClient = new WebClient())
{
    System.Uri uri = new System.Uri(Document.Url);
    // Downloads whatever the server returns for this URL as raw bytes.
    bytes = await webClient.DownloadDataTaskAsync(uri);
}
This works but I get a corrupted file as expected.
I do not have control over how the server generates or serves the file.
Any way to wait for the file to complete generating and then get the file content?
TIA
Never mind. It turns out the link returns some JavaScript that auto-authenticates the user and then does a jQuery GET to another URL and port to generate the file. So I was basically downloading that script and saving it as a PDF. Doh.
So a workaround would be to mimic that flow in code.
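A hypothetical sketch of that workaround; the URLs, the cookie-based authentication, and the generator endpoint are all assumptions about how such a page might behave:
// Share cookies between the auth request and the file request.
var cookies = new CookieContainer();
using (var handler = new HttpClientHandler { CookieContainer = cookies })
using (var http = new HttpClient(handler))
{
    // Step 1: hit the original link so the server authenticates us (hypothetical URL).
    await http.GetAsync("https://example.com/document/123");
    // Step 2: call the endpoint the script's jQuery GET targets (hypothetical URL and port).
    byte[] bytes = await http.GetByteArrayAsync("https://example.com:8443/generate/123");
    File.WriteAllBytes("document.pdf", bytes);
}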
I have 100 pages on my site, but I want to download only part of each page instead of the full page content.
I want to download just one box from each page; the full page is about 10 KB.
For this I use WebClient and HtmlAgilityPack.
WebClient client = new WebClient();
// DownloadData returns raw bytes; decode them as UTF-8 to get the page HTML.
string result = Encoding.UTF8.GetString(client.DownloadData(url));
Unfortunately, that's not possible, because HTTP is not designed to deliver a specific part of a web page. It does support range requests, but for that you would need to know where exactly (in terms of bytes) the desired content is located.
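For illustration, a byte-range request with HttpWebRequest looks like this, assuming you somehow knew those offsets:
var request = (HttpWebRequest)WebRequest.Create(url);
// Ask the server for bytes 0-9999 only; it must support ranges, and you
// would need to know in advance where the box's markup falls in the file.
request.AddRange(0, 9999);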
You can
download the whole page and then
use an HTML parsing library to extract the part you need, as sketched below.
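Since the question already mentions HtmlAgilityPack, a minimal sketch using it; the XPath and the element id "box" are assumptions about the page structure:
// Download the whole page, then parse it and keep only the desired box.
var client = new WebClient { Encoding = Encoding.UTF8 };
string html = client.DownloadString(url);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
var box = doc.DocumentNode.SelectSingleNode("//div[@id='box']");
string boxHtml = box != null ? box.OuterHtml : null;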
You cannot achieve this.
The only solution is changing the website structure itself, if you have control of the server:
Change the architecture of your website so the data in the box is accessible via an AJAX call.
Now you can get the data via the WebClient.
If that data is already served via an API call, you can point your WebClient at that URI instead.
Here is an example of structuring your website around AJAX:
AJAX with jQuery and ASP.NET
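For instance, once such an endpoint exists, the client side becomes a one-liner (the URL here is hypothetical):
// "api/box" is a hypothetical endpoint that returns only the box's data.
string boxData = new WebClient().DownloadString("https://example.com/api/box?page=1");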
I am currently trying to do a screen scrape using the following code:
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
// Dispose the response as well, so the connection is released.
using (HttpWebResponse theResponse = (HttpWebResponse)request.GetResponse())
using (StreamReader reader = new StreamReader(theResponse.GetResponseStream(), Encoding.UTF8))
{
    string s = reader.ReadToEnd();
}
However, the data I am concerned with (an HTML table) is not part of the result. When I right-click the page and View Source, I also do not see the HTML table I care about; however, I do see it in the DOM when I use Firebug to inspect it.
It doesn't seem to be loaded via ajax either.
So, is there another way, using C#, to get the DOM as it exists in the developer-tools view rather than in the View Source result?
Unfortunately, this page is not publicly available so I can't paste the URL.
It doesn't seem to be loaded via ajax either.
You don't need AJAX to dynamically add data to the DOM. You can do it perfectly well with standard JavaScript running in the page.
To scrape such a page you need a scraper that processes JavaScript. The WebBrowser control in WinForms does that: it lets you load a web page and explore the DOM, just as you do in Firebug (except that the snapshot comes from IE, because the WebBrowser control is just a wrapper around IE).
But since the WebBrowser control is not designed for multithreaded environments (such as a web application), you would have to use a third-party library to achieve that scraping task there.
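As an illustration, a minimal WebBrowser sketch for a desktop (WinForms) context; the element id "resultsTable" is hypothetical:
var browser = new WebBrowser { ScriptErrorsSuppressed = true };
browser.DocumentCompleted += (s, e) =>
{
    // Fires once the embedded IE engine has loaded the page; scripts that keep
    // modifying the DOM afterwards may require a delay or a retry loop.
    HtmlElement table = browser.Document.GetElementById("resultsTable");
    if (table != null)
        Console.WriteLine(table.OuterHtml);
};
browser.Navigate(url);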
Have you used Fiddler or Ethereal to see what URLs are being connected to in the background? If you find the HTML table in the response from one of those background URLs, you can scrape the data from that URL directly. Which URL/table are you trying to parse?
I'm sending the value of a variable via POST to a PHP page from C#. I get back a data stream from the server containing the whole web page as HTML, reflecting the POSTed value, and I store it in a string variable.
I would like to open a browser and show that web page (maybe using System.Diagnostics.Process.Start("URL")) without having to save it to a file; that is, show the page on the spot so that, when the browser is closed, no file remains stored.
Any idea?
Drop a WebBrowser control (webBrowser1) onto a new form and set its DocumentText property to your result HTML:
webBrowser1.DocumentText = "<html><body>hello world</body></html>";
source:
<html><body>hello world</body></html>
You aren't going to be able to do that in an agnostic way.
If you simply wanted to open the URL in a browser, then using the Process class would work.
Unfortunately, in your case, you already have the content from creating the POST to the server, and you really want to stream that response in your application to the browser.
It's possible with some browsers, but it can't be done in an agnostic way (and it's complicated even when targeting a specific browser).
To complicate matters, you want the browser to believe that the stream you are sending it is really coming from the server, when in reality, it's not.
I believe that your best bet would be to save the response to the file system in a temp file. However, before you do, add the <base> tag to the file with the URL that the file came from. This way, relative URLs will resolve correctly when rendered in the browser.
Then, just open the temporary file in the browser using the Process class.
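A minimal sketch of that approach; responseHtml and postUrl stand in for the response string and the original request URL, both assumptions:
// Inject a <base> tag so relative links in the saved page resolve against the server.
string html = responseHtml.Replace("<head>", "<head><base href=\"" + postUrl + "\" />");
// Write to a temp .html file and open it in the default browser.
string tempFile = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".html");
File.WriteAllText(tempFile, html);
Process.Start(tempFile);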