Scraping with HtmlAgilityPack - C#

I am using HtmlAgilityPack to scrape pages in C# ASP.NET. So far I have had no problems scraping several websites; however, when I try to execute the following code I get an error:
var getHtmlWeb = new HtmlWeb();
var homePage = getHtmlWeb.Load("https://www.corfo.cl/sites/cpp/home");
The error that appears is:
"The underlying connection was closed: An unexpected error occurred on a send."
The only website giving me problems is Corfo, and I do not know how to solve this.
I would appreciate your help.

This site relies on cookies to work. For example, one of the URLs it requests is
https://www.corfo.cl/sites/Satellite;jsessionid=T8w78ZolfWgr3ZoEBBvE81nBiXbXIdjfF1In3bgpZiYvL_w8TF4p!1081543155!-596930586?c=Page&cid=1456408322328&pagename=CorfoPortalPublico/Page/corfoListadoOfertaInteligenteWebLayout
So, when you request www.corfo.cl, it first forwards to www.corfo.cl/sites/cpp/home; then, under the /sites/ folder, it sets the cookie jsessionid=OHS_1~T8w78ZolfWgr3ZoEBBvE81nBiXbXIdjfF1In3bgpZiYvL_w8TF4p!1081543155!-596930586 etc.
With this cookie, the page builds itself from all or some of the components tied to this jsessionid.
If the client code doesn't handle this logic (the redirect and the cookie described above), the server resets the connection, as expected, because it doesn't know how to build the page without the jsessionid.
The inner exception from System.Net.WebException is
{"Authentication failed because the remote party has closed the transport stream."}
Hope this helps!

Related

intermittent 412 error when calling Google Directory API to patch a user's password

My application's main function is to change a Google G-Suite user's password using Google's Google.Apis.Admin.Directory.directory_v1 NuGet package.
The API call works 95% of the time (and resets a target user's password), but intermittently, the API call throws an exception with the Message text:
Precondition Failed [412] Errors [ Message[Precondition Failed] Location[If-Match - header] Reason[conditionNotMet] Domain[global] ]
I've done lots of research and it seems that there is a client-specified pre-condition being included in a (REST?) call that the API is making toward the Google API server and the server is determining that the condition is not being met (see https://www.rfc-editor.org/rfc/rfc7232#section-4.2) or the state of the object being changed is bad (https://developers.google.com/calendar/v3/errors). The strange thing is, everything does work nearly all of the time, but then fails every now and then. It really seems like it is some kind of a resource-based error (too many calls submitted recently, too many users licensed in the domain) or maybe bad data (bad or missing password, bad user) or even permissions (user is in a group/OU that can't be managed). But the error message gives nothing to go on and I've mostly ruled out the most obvious of the possibilities. I've googled the exact message and found numerous people with similar complaints, but no documented causes.
Correction from original: I am able to capture REST calls with Fiddler (with https capture configured), but I can't reproduce the original error while capturing, so it doesn't help much.
Any suggestions for how to reproduce and/or troubleshoot the issue?
Here is the code (please ignore any obvious typos-I had to cut/paste/merge from a few sources to assemble a small simple example)--the real code definitely works nearly all of the time:
try
{
    string userEmail = googleUser + "@" + domain; // e.g. BobSmith@myGoogleDomain.com
    // service is an instance of Google.Apis.Admin.Directory.directory_v1.DirectoryService
    var userget = service.Users.Get(userEmail);
    User userob = userget.Execute();
    userob.ChangePasswordAtNextLogin = false;
    userob.Password = password;
    var patchRequest = service.Users.Patch(userob, userEmail);
    patchRequest.Execute();
}
catch (Exception e)
{
    // exception handling omitted in this example
}

C# Selenium, can't find any elements after logging in

I use Selenium with ChromeDriver. When I open up a driver and manually go to a webpage, and then start running my code, everything works fine. However, if I log in to the website and manually go to the same page as before (it's the exact same page but logged in), I won't even find the html element:
var el = driver.FindElement(By.CssSelector("html"));
returns
OpenQA.Selenium.WebDriverException: The HTTP request to the remote WebDriver server for URL http://localhost:........../elements timed out after 60 seconds.
The assignment
var el = driver.FindElements(By.CssSelector("*"));
returns the same thing, there's literally nothing. Shouldn't I at least get a "cannot find element" exception?
Now if I log out it works again (everything is done in the same instance of ChromeDriver)

Detect Internet Connectivity on client machine

I have a web application (MVC using C#) that serves as an advertisement in my client's office. My client opens this "advertisement page" and displays it on a big screen to their customers.
What happens is that every 30 minutes or so the page automatically refreshes to fetch the latest data from the database. However, they use WiFi to connect to our server, and sometimes the connection is very slow (or lost completely). My client asked me to write code that prevents the page from refreshing if the connectivity is bad or there is no internet connection. (They do not want to show "No Internet Connection" on their advertisement TV.)
I know I cannot do anything in the server-side code, because it is the client's machine that needs to detect the internet connection, which leaves client-side code as the only option. I am not good at this; can anyone help me out?
I'd suggest a "ping" sent via ajax:
var timeStart = new Date().getTime();
$.ajax({
    url: "url-to-ping-response-file",
    success: function () {
        var timeNow = new Date().getTime();
        var ping = timeNow - timeStart;
        // reload only if the round trip took less than one second
        if (ping < 1000) {
            window.location.reload();
        }
    }
});
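A possible variation on the same idea, in case it helps: the sketch below uses fetch with an explicit timeout, so a hung request is aborted instead of left pending, and any failure simply means "don't reload". The function names, the reload callback and the injectable fetchImpl parameter are my own assumptions for illustration, not part of jQuery or of the code above.

```javascript
// Hypothetical helper: resolves with the round-trip time in ms, or rejects
// if the request fails, returns a non-2xx status, or exceeds timeoutMs.
function measurePing(url, timeoutMs, fetchImpl = fetch, now = Date.now) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  const start = now();
  return fetchImpl(url, { cache: "no-store", signal: controller.signal })
    .then((response) => {
      if (!response.ok) {
        throw new Error("ping failed: HTTP " + response.status);
      }
      return now() - start;
    })
    .finally(() => clearTimeout(timer));
}

// Calls reload() only when the ping succeeded in under maxPingMs.
// Resolves to true if a reload was triggered, false otherwise.
function refreshIfConnected(url, maxPingMs, reload, fetchImpl = fetch) {
  return measurePing(url, maxPingMs, fetchImpl)
    .then((ping) => {
      if (ping < maxPingMs) {
        reload();
        return true;
      }
      return false;
    })
    .catch(() => false); // offline, timed out, or server error: keep the page
}
```

In the page this could be scheduled with something like setInterval(function () { refreshIfConnected("url-to-ping-response-file", 1000, function () { window.location.reload(); }); }, 30 * 60 * 1000), keeping the 30-minute interval from the question.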
You can use the Circuit Breaker Pattern to gracefully handle intermittently connected environments.
Here are 2 open source JavaScript implementations. I have never used either of them, so I cannot attest to their quality.
https://github.com/yammer/circuit-breaker-js
https://github.com/mweagle/circuit-breaker
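To give an idea of what the pattern does, here is a minimal hand-rolled sketch. It is not the API of either library above; the class name, method names and parameters are made up for illustration. After maxFailures consecutive failures the breaker "opens" and calls are skipped until cooldownMs has passed, at which point one trial call is allowed again.

```javascript
// Minimal circuit breaker sketch; names are illustrative only.
class CircuitBreaker {
  constructor(maxFailures, cooldownMs, now = Date.now) {
    this.maxFailures = maxFailures;
    this.cooldownMs = cooldownMs;
    this.now = now;          // injectable clock, handy for testing
    this.failures = 0;
    this.openedAt = null;    // null means the circuit is closed
  }

  // True while calls should be skipped because the circuit is open.
  isOpen() {
    if (this.openedAt === null) return false;
    // Once the cooldown has elapsed, allow a trial call (half-open state).
    return this.now() - this.openedAt < this.cooldownMs;
  }

  // A successful call closes the circuit and resets the failure count.
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failed call counts toward the threshold and (re)opens the circuit
  // once the threshold is reached.
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.maxFailures) {
      this.openedAt = this.now();
    }
  }
}
```

For the advertisement page, the refresh timer would check isOpen() before attempting a reload, and call recordSuccess() or recordFailure() depending on whether the connectivity check worked.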
You can also make use of
if (navigator.onLine) {
location.reload();
}
This will not detect a slow connection. I don't know how your pages are laid out, but for the sites I work on I tend to fetch the HTML content and the data as separate calls. I do this with an MVVM/MVC pattern, which is worth learning; I use AngularJS and like it a lot.
You can also use good old jQuery to replace the content; have a read of "Replace HTML page with contents retrieved via AJAX". You could couple this with the .onLine check.
http://www.w3schools.com/jsref/prop_nav_online.asp
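A rough sketch of how the two ideas could be combined, using plain fetch rather than jQuery: check the online flag first, then swap the content only after the new HTML has downloaded completely. The function name, the container argument and the injectable isOnline and fetchImpl parameters are assumptions of mine for illustration.

```javascript
// Replaces the container's content only when the new HTML was fetched
// successfully. Resolves to true if the content was replaced.
async function refreshContent(container, url, isOnline, fetchImpl = fetch) {
  if (!isOnline()) return false;        // in a browser: () => navigator.onLine
  try {
    const response = await fetchImpl(url, { cache: "no-store" });
    if (!response.ok) return false;     // server error: keep the old content
    const html = await response.text();
    container.innerHTML = html;         // swap only after a complete download
    return true;
  } catch (err) {
    return false;                       // network error: keep the old content
  }
}
```

Since the old content stays on screen whenever the check or the request fails, the TV never shows a browser error page. Note that navigator.onLine only tells you the machine has some network connection, so the fetch can still fail and has to be handled.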

How to catch 500 internal server error in c#

I'm using the Google Analytics Dashboard Controls, which are available at
http://gadashboardcontrols.codeplex.com/
The issue is that it works fine when I'm connected to the internet, but if I use it on a machine that doesn't have internet access, it shows
Server Error in '/' Application.
The remote name could not be resolved: 'www.google.com'
I want to catch this exception and show a friendly message to the user. I'm calling these dashboard controls from my view in an iframe like this
<iframe src="../../GoogleAnalytics/Visitor.aspx" height="275"></iframe>
and if I place a try/catch in the Visitor.aspx page, it does not catch the exception. How should I catch this exception? I'm using ASP.NET MVC 2 with C#.
You cannot catch this exception because the problem occurs in the browser and not on the server. You do not have control over this from the aspx code.
What you can do instead is check network connectivity and serve alternate content in the page if the user is offline. Look into
System.Net.NetworkInformation.NetworkInterface.GetIsNetworkAvailable() for this.
There's no good way to catch this exception, since you're using an iframe and the page is loaded in the browser and not via code. There are some tricks for this, but they are not as reliable.
The error is server-side, so you should actually try to fix it in your code.

Accessing Google Spreadsheets with C# using Google Data API fails with Mono

I'm trying to access my Google spreadsheets using the GData API. I have followed the example which looks like:
var service = new SpreadsheetsService("myTest");
service.setUserCredentials(username, password);
var query = new SpreadsheetQuery();
var feed = service.Query(query);
This should return a feed with a list of spreadsheets. However this fails with:
Google.GData.Client.GDataRequestException: Execution of request failed: http://spreadsheets.google.com/feeds/spreadsheets/private/full ---> System.Net.WebException: The remote server returned an error: (404) Not Found.
When I try the above link directly in my browser I'm able to download the feed, as long as I'm logged in to my Google account.
Some further information:
I'm not behind a firewall
I have checked my username (maurits.rijk at gmail.com) and password several times
I am using Mandriva in VirtualBox on a MacBook
All my code is compiled with Mono
I tried the same functionality in Java on OS-X. That code runs as expected.
Looks like a Mono problem to me.
Could you test with Fiddler to check whether your call reaches the server?
I found the problem and the solution on Google Code, in comment 8 of Issue 88.
In short, using
mozroots --import --sync --quiet
solves this problem. For me it now works.
