Simulating a web browser in C#

I'm trying to extract data with HtmlAgilityPack from various websites by downloading their HTML with client.DownloadStringAsync(). This has worked flawlessly so far, but I recently hit a problem with one of the websites I tried to download this way: instead of the actual content that is visible in a browser, it just says "Loading...". After checking the network tab in Chrome, it seems the GET request only returns a raw HTML shell that doesn't include any of the data I need; the content is mostly filled in by JavaScript. I found Selenium WebDriver, which should do what I want, but I suspect it won't run well on a Raspberry Pi (which is what I'm designing this project for). If there are alternatives, or if Selenium does work on Unix-based systems, please let me know. Here is the HTTP getter I've used so far:
private async Task<string> HttpGet(string URL) {
    using var client = new HttpClient();
    using var request = new HttpRequestMessage { RequestUri = new Uri(URL), Method = HttpMethod.Get };
    using var response = await client.SendAsync(request, HttpCompletionOption.ResponseContentRead);
    if (response.StatusCode == HttpStatusCode.NotFound) return "404";
    return await response.Content.ReadAsStringAsync();
}
After attempting to use the aforementioned NuGet package with Chrome, I got heaps of different errors, all of which were foreign to me. Here's my code:
private string SeleniumGet(string URL) {
    ChromeOptions options = new ChromeOptions();
    options.AddArgument("headless");
    options.BinaryLocation = Directory.GetCurrentDirectory() + @"\chromedriver.exe"; // Probably not gonna work on an RPi
    using ChromeDriver driver = new ChromeDriver(options);
    driver.Navigate().GoToUrl(URL);
    return driver.PageSource; // Unsure if this is the source code; doesn't really matter for now - the program crashes when initiating the Chrome driver
}
Currently I'm stuck on it saying Invalid --log-level value., but I've had others, such as unknown error: DevToolsActivePort file doesn't exist, which I can't reproduce anymore. Right now only the invalid log level error pops up, and it crashes while constructing the ChromeDriver. Thanks in advance for the help. If the site I'm trying to access matters, this is it; looking at the GET request in the network tab just shows the blank page with "Loading..." on it.
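For what it's worth, here is a minimal, hedged sketch of how a headless Chromium setup on a Raspberry Pi might look with the Selenium C# bindings. The chromium-browser and chromedriver paths below are assumptions (typical for an apt install on Raspberry Pi OS), not taken from the question. One detail worth noting: ChromeOptions.BinaryLocation should point at the browser binary itself, not at chromedriver, which may be related to the startup errors above.

// Sketch only, not the original poster's code. Assumes chromium-browser and
// chromium-chromedriver installed via apt; adjust the paths for your system.
using OpenQA.Selenium.Chrome;

public static class HeadlessScraper
{
    public static string GetRenderedHtml(string url)
    {
        var options = new ChromeOptions();
        options.AddArgument("--headless");
        options.AddArgument("--no-sandbox");                  // often required on Linux SBCs
        options.BinaryLocation = "/usr/bin/chromium-browser"; // assumed path to the browser itself

        // Directory containing the chromedriver executable (assumed path).
        using var service = ChromeDriverService.CreateDefaultService("/usr/lib/chromium-browser");
        using var driver = new ChromeDriver(service, options);
        driver.Navigate().GoToUrl(url);
        return driver.PageSource;                             // HTML after JavaScript has run
    }
}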

Related

Adding SSL cert causes 404 only in browser calls

I am working in an internal corporate environment. We have created a webapi installed on iis on port 85. We call this from another MVC HelperApp on port 86. It all works as expected. Now we want to tighten security and add an SSL cert to iis on port 444 and bind it to our API.
Initially we test it with Postman, SoapUI, and a C# console app and it all works. Now we try calling it from our MVC HelperApp and it returns a 404 sometimes.
Deeper debugging: I put the code into a C# DLL (see below). Using the console app I call the Dll.PostAPI and it works as expected. When I call that same Dll.PostAPI from the MVC HelperApp, it won't work. Stepping through the code, I make it as far as the line await client.PostAsync(url, data); and the code bizarrely ends: it doesn't return and it doesn't throw an exception. The same happens for Post and Get. I figure it makes the call and nothing comes back, no response and no error.
Also, if I change the url to "https://httpbin.org/post" or to the open HTTP port 85 on IIS, it works. I have concluded that the C# code is not the problem (but I'm open to being wrong).
Therefore I have come to the conclusion that for some reason the port or cert is refusing calls from browsers.
We are looking at:
the "Subject Alternative Name" but all the examples show
WWW.Addresses which we are not using.
the "Friendly Name" on the cert creation.
and CORS Cross-Origin Resource Sharing.
These are all subjects we lack knowledge in.
This is the calling code used exactly the same in the console app and the web app:
var lib = new HttpsLibrary.ApiCaller();
lib.makeHttpsCall();
This is what's in the DLL that gets called:
public async Task<string> makeHttpsCall()
{
    try
    {
        List<Quote> quotes = new List<Quote>();
        quotes.Add(CreateDummyQuote());
        var json = JsonConvert.SerializeObject(quotes);
        var data = new StringContent(json, Encoding.UTF8, "application/json");
        var url = "https://httpbin.org/post"; //this works in Browser
        //url = "https://thepath:444//api/ProcessQuotes"; //444 DOES NOT WORK in browsers only. OK in console app.
        //url = "http://thepath:85/api/ProcessQuotes"; //85 works.
        var client = new HttpClient();
        var response = await client.PostAsync(url, data); //<<< this line never returns when called from browser.
        //var response = await client.GetAsync(url); //same outcome for Get or Post
        var result = await response.Content.ReadAsStringAsync();
        return result;
    }
    catch (Exception ex)
    {
        throw;
    }
}
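One thing that might be worth ruling out (a guess based only on the snippets above, not a confirmed cause): the calling code invokes lib.makeHttpsCall() without awaiting the Task<string> it returns, so the caller never observes the response or any exception. A hedged sketch of an awaited call; the PostQuotes action name is hypothetical:

// Hedged sketch, assuming an ASP.NET MVC controller; "PostQuotes" is a hypothetical name.
// The point is simply to await the library call so its result (or exception) surfaces.
public async Task<ActionResult> PostQuotes()
{
    var lib = new HttpsLibrary.ApiCaller();
    string result = await lib.makeHttpsCall();   // await instead of fire-and-forget
    return Content(result);
}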

In .NET, failure to retrieve HTTP resource from W3C web site

Retrieving the resource at http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd takes around 10 seconds using the following mechanisms:
web browser
curl
Java URL.openConnection()
It's possible that the W3C site is applying some "throttling" - deliberately slowing the response to discourage bulk requests.
Trying to retrieve the same resource from a C# application on .NET, I get a timeout after about 60-70 seconds. I've tried a couple of different approaches, both with the same result:
System.Xml.XmlUrlResolver.GetEntity()
new WebClient().OpenRead(uri)
Anyone have any idea what's going on? Would another API, or some configuration options, solve the problem?
The problem is they are (probably) checking for a User-Agent string. If it's not present, they send you to purgatory. .NET's http clients do not set this by default.
So, give this a shot:
private static readonly HttpClient _client = new HttpClient();

public static async Task TestMe()
{
    using (var req = new HttpRequestMessage(HttpMethod.Get,
        "http://www.w3.org/TR/xmlschema11-1/XMLSchema.xsd"))
    {
        req.Headers.Add("user-agent",
            "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X)");
        using (var resp = await _client.SendAsync(req))
        {
            resp.EnsureSuccessStatusCode();
            var data = await resp.Content.ReadAsStringAsync();
        }
    }
}
No idea why they do this; maybe it's a bug in their back-end? (I sure wouldn't want to leave a socket open longer than it needs to be for no good reason.) The request still takes 10-15 seconds, but that's better than the 120+ second timeout.
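As a side note, since all requests go through the same static HttpClient anyway, the User-Agent could also be set once as a default header instead of per request. A minimal sketch (the UA string is only an example):

// Sketch: set the User-Agent once on a shared HttpClient so every request carries it.
private static readonly HttpClient _client = CreateClient();

private static HttpClient CreateClient()
{
    var client = new HttpClient();
    // Any realistic UA string works; this one is just an example.
    client.DefaultRequestHeaders.UserAgent.ParseAdd("Mozilla/5.0 (compatible; MyApp/1.0)");
    return client;
}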

Autodesk Forge Error trying to access the API online

I have a problem loading a 3D model on an online server; the error is related to accessing the Forge API. Locally it works smoothly, but when it is mounted on the server or a website, it shows the following error: "Failed to load resource: the server responded with a status of 404 (Not Found)", followed by "onDocumentLoadFailure() - errorCode:7".
As I said, what I find strangest is that it works locally. Attached is the segment of the code where the error occurs.
function getAccessToken() {
    var xmlHttp = null;
    xmlHttp = new XMLHttpRequest();
    xmlHttp.open("GET", '/api/forge/toke', false); //Address not found
    xmlHttp.send(null);
    return xmlHttp.responseText;
}
Thank you very much in advance.
Are you sure the code you're running locally and the code you've deployed are really the same?
The getAccessToken function doesn't seem to be correct, for several reasons:
First of all, there seems to be a typo in the URL - shouldn't it be /api/forge/token instead of /api/forge/toke?
More importantly, the HTTP request is asynchronous, meaning that it cannot return the response immediately after calling xmlHttp.send(). You can find more details about the usage of XMLHttpRequest in https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Using_XMLHttpRequest.
And finally, assuming that the function is passed to Autodesk.Viewing.Initializer options, it should return the token using a callback parameter passed to it (as shown in https://forge.autodesk.com/en/docs/viewer/v7/developers_guide/viewer_basics/initialization/#example).
With that, your getAccessToken should probably look more like this (using the more modern fetch and async/await):
async function getAccessToken(callback) {
    const resp = await fetch('/api/forge/token');
    const json = await resp.json();
    callback(json.access_token, json.expires_in);
}
I've already found the issue. When I deploy, I have to change the URL where the request is made to the public address or the domain name, for example: mywebsite.com/aplication-name/api/forge/token.

HttpClient GET request fails, POST Succeeds

In a Xamarin application (backed by ASP.NET WebApi), I'm having trouble getting [all of] my GET requests to succeed -- they return 404. In fact, when watching network traffic in Fiddler, I don't even see the request happen.
Here is [basically] what I'm doing:
public async Task<bool> ValidateSponsor(string attendeeId, string sponsorId)
{
    string address = String.Format("{0}/Sponsors/?attendeeId={1}&sponsorId={2}", BASE_URI, attendeeId, sponsorId);
    var response = await client.GetAsync(address);
    var content = response.Content;
    if (!response.IsSuccessStatusCode)
        throw new HttpRequestException("Check your network connection and try again.");
    string result = await content.ReadAsStringAsync();
    return Convert.ToBoolean(result);
}
If I copy the address variable out and paste it into a browser, it succeeds. POST requests (to different methods, of course) succeed. I've also tried using the PCL version of RestSharp but get the same results -- POST succeeds and GET fails.
Edit:
This also looks like it may only be a problem when deployed to Azure; it works fine locally.
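Not a confirmed cause, but since the GET URL is assembled by hand, one thing worth ruling out is unescaped characters in the query string. A small sketch reusing the names from the snippet above (BASE_URI is assumed to hold the service root, as in the question):

// Sketch only: escape the query-string values so unusual attendee/sponsor IDs cannot
// break the GET URL.
string address = String.Format(
    "{0}/Sponsors/?attendeeId={1}&sponsorId={2}",
    BASE_URI,
    Uri.EscapeDataString(attendeeId),
    Uri.EscapeDataString(sponsorId));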

System.Net.WebClient unreasonably slow

When using the System.Net.WebClient.DownloadData() method I'm getting an unreasonably slow response time.
When fetching a URL using the WebClient class in .NET, it takes around 10 seconds before I get a response, while the same page is fetched by my browser in under 1 second.
And this is with data that's 0.5kB or smaller in size.
The request involves POST/GET parameters and a user agent header, in case that could be the cause of the problem.
I haven't (yet) tried whether other ways to download data in .NET give me the same problems, but I suspect I would get similar results. (I've always had a feeling web requests in .NET are unusually slow...)
What could be the cause of this?
Edit:
I tried doing the exact same thing using System.Net.HttpWebRequest instead, with the following method, and all requests finish in under 1 sec.
public static string DownloadText(string url)
{
    var request = (HttpWebRequest)WebRequest.Create(url);
    var response = (HttpWebResponse)request.GetResponse();
    using (var reader = new StreamReader(response.GetResponseStream()))
    {
        return reader.ReadToEnd();
    }
}
While this (old) method using System.Net.WebClient takes 15-30s for each request to finish:
public static string DownloadText(string url)
{
    var client = new WebClient();
    byte[] data = client.DownloadData(url);
    return client.Encoding.GetString(data);
}
I had that problem with WebRequest. Try setting Proxy = null;
WebClient wc = new WebClient();
wc.Proxy = null;
By default, WebClient and WebRequest try to determine what proxy to use from the IE settings, and sometimes this results in a delay of around 5 seconds before the actual request is sent.
This applies to all classes that use WebRequest, including WCF services with HTTP binding.
In general you can use this static code at application startup:
WebRequest.DefaultWebProxy = null;
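Putting the two suggestions together, a minimal sketch (example URL only): disable automatic proxy detection either globally at startup or per WebClient instance.

// Sketch: skip automatic proxy detection globally at application startup...
WebRequest.DefaultWebProxy = null;

// ...or per instance, for a single WebClient.
using (var wc = new WebClient())
{
    wc.Proxy = null;
    string html = wc.DownloadString("http://example.com/");  // example URL only
}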
Download Wireshark here http://www.wireshark.org/
Capture the network packets and filter the "http" packets.
It should give you the answer right away.
There is nothing inherently slow about .NET web requests; that code should be fine. I regularly use WebClient and it works very quickly.
How big is the payload in each direction? Silly question maybe, but is it simply bandwidth limitations?
IMO the most likely thing is that your web-site has spun down, and when you hit the URL the web-site is slow to respond. This is then not the fault of the client. It is also possible that DNS is slow for some reason (in which case you could hard-code the IP into your "hosts" file), or that some proxy server in the middle is slow.
If the web-site isn't yours, it is also possible that they are detecting atypical usage and deliberately injecting a delay to annoy scrapers.
I would grab Fiddler (a free, simple web inspector) and look at the timings.
WebClient may be slow on some workstations when Automatic Proxy Settings is checked in the IE settings (Connections tab - LAN Settings).
Setting WebRequest.DefaultWebProxy = null; or client.Proxy = null didn't do anything for me, using Xamarin on iOS.
I did two things to fix this:
First, I wrote a downloadString function which does not use WebRequest and System.Net:
public static async Task<string> FnDownloadStringWithoutWebRequest(string url)
{
    using (var client = new HttpClient())
    {
        //Define Headers
        client.DefaultRequestHeaders.Accept.Clear();
        client.DefaultRequestHeaders.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
        var response = await client.GetAsync(url);
        if (response.IsSuccessStatusCode)
        {
            string responseContent = await response.Content.ReadAsStringAsync();
            //dynamic json = Newtonsoft.Json.JsonConvert.DeserializeObject(responseContent);
            return responseContent;
        }
        Logger.DefaultLogger.LogError(LogLevel.NORMAL, "GoogleLoginManager.FnDownloadString", "error fetching string, code: " + response.StatusCode);
        return "";
    }
}
This is however still slow with Managed HttpClient.
So secondly, in Visual Studio Community for Mac, right click on your Project in the Solution -> Options -> set HttpClient implementation to NSUrlSession, instead of Managed.
Screenshot: Set HttpClient implementation to NSUrlSession instead of Managed
Managed is not fully integrated into iOS, doesn't support TLS 1.2, and thus does not support the ATS standards set as default in iOS9+, see here:
https://learn.microsoft.com/en-us/xamarin/ios/app-fundamentals/ats
With both these changes, string downloads are always very fast (<<1s).
Without both of these changes, on every second or third try, downloadString took over a minute.
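If changing the project-wide option isn't desirable, the native handler can also be picked per client in code; a hedged sketch, assuming a Xamarin.iOS project where NSUrlSessionHandler is available:

// Sketch, assuming Xamarin.iOS (NSUrlSessionHandler lives in System.Net.Http there):
// back a specific HttpClient with the native NSURLSession stack directly, instead of
// switching the project-wide "HttpClient implementation" option.
static readonly HttpClient nativeClient = new HttpClient(new NSUrlSessionHandler());

public static async Task<string> DownloadWithNativeHandler(string url)
{
    return await nativeClient.GetStringAsync(url);
}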
Just FYI, there's one more thing you could try, though it shouldn't be necessary anymore:
//var authgoogle = new OAuth2Authenticator(...);
//authgoogle.Completed...
if (authgoogle.IsUsingNativeUI)
{
    // Step 2.1 Creating Login UI
    // In order to access the SFSafariViewController API the cast is necessary
    SafariServices.SFSafariViewController c = null;
    c = (SafariServices.SFSafariViewController)ui_object;
    PresentViewController(c, true, null);
}
else
{
    PresentViewController(ui_object, true, null);
}
Though in my experience, you probably don't need the SafariController.
Another alternative (also free) to Wireshark is Microsoft Network Monitor.
What browser are you using to test?
Try using the default IE install. System.Net.WebClient uses the local IE settings, proxy etc. Maybe that has been mangled?
Another cause for extremely slow WebClient downloads is the destination media to which you are downloading. If it is a slow device like a USB key, this can massively impact download speed. To my HDD I could download at 6MB/s, to my USB key, only 700kb/s, even though I can copy files to this USB at 5MB/s from another drive. wget shows the same behavior. This is also reported here:
https://superuser.com/questions/413750/why-is-downloading-over-usb-so-slow
So if this is your scenario, an alternative solution is to download to HDD first and then copy files to the slow medium after download completes.
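A tiny sketch of that workaround (the paths and URL are placeholders, not taken from the answer): download to a temp file on the fast drive first, then copy the finished file to the slow medium.

// Sketch of the workaround described above; paths and URL are placeholders.
string tempPath = Path.Combine(Path.GetTempPath(), "download.bin");   // fast local drive
string usbPath = @"E:\download.bin";                                  // hypothetical slow USB destination

using (var wc = new WebClient())
{
    wc.DownloadFile("http://example.com/file.bin", tempPath);         // example URL only
}
File.Copy(tempPath, usbPath, overwrite: true);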
