So I'm still trying to find the feature (if it exists) in CefSharp 3 where one can inspect the headers from the response of a request. If it's not there, is that because it is not there in CEF 3? And/or where should I start looking if I'm to implement it?
This feature is not in CEF 3 yet. Here's the outstanding issue for it:
https://code.google.com/p/chromiumembedded/issues/detail?id=515
There is a workaround noted...
There's no great way to filter response contents with CEF3 currently. You can use CefResourceHandler via CefRequestHandler::GetResourceHandler and execute the request/return the response contents yourself using CefURLRequest.
... however this workaround is not possible in CefSharp 3 because CefURLRequestClient and friends are not implemented.
At this stage, depending on how comfortable you are with C++, you might consider:
contributing to the (C++) CEF project and implementing the response filtering feature - this will be all C++;
contributing C# wrappers for CefURLRequestClient and friends to the CefSharp project - a combination of light C++ and C#.
You might also be interested to know that there is a way to get HTTP headers in JavaScript, as long as you have initiated the request yourself using AJAX:
Accessing the web page's HTTP Headers in JavaScript
This type of solution could easily be done with CefSharp 3 by injecting JavaScript into the current page.
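For example, something along these lines should work. This is a rough sketch: it assumes a ChromiumWebBrowser instance and an ExecuteScriptAsync method, whose exact name varies between CefSharp versions, so check your version's script-execution API. Note that the XHR issues a second request for the page, so the headers you see belong to that request, not to the original navigation.

// Inject an XHR so the page itself re-requests its own URL and reads the headers.
// ExecuteScriptAsync is an assumption - verify the method name in your CefSharp version.
string script = @"
    var xhr = new XMLHttpRequest();
    xhr.open('HEAD', document.location.href, true);
    xhr.onload = function () {
        // getAllResponseHeaders returns the raw header block as a single string
        console.log(xhr.getAllResponseHeaders());
    };
    xhr.send();";
browser.ExecuteScriptAsync(script);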
An alternative that provides more control is to use scheme handlers (it's cleaner IMO).
Add a scheme handler that intercepts your resource loading:
CEF.RegisterScheme("ascheme", new HandlerFactory());
then (once you've created a trivial factory or two) you have this override available:
public bool ProcessRequestAsync(IRequest request, ISchemeHandlerResponse response, OnRequestCompletedHandler requestCompletedCallback)
The Response contains Headers/MimeType and Stream to allow more control. I hope this helps.
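To make that concrete, here is a rough sketch of a factory and handler against the CefSharp 3 API of the time. Member names such as Create, ResponseStream and MimeType are taken from that era's interfaces but may differ in your version, so treat this as a sketch rather than a definitive implementation.

// Needs: using System.IO; using System.Text;
public class HandlerFactory : ISchemeHandlerFactory
{
    public ISchemeHandler Create()
    {
        return new AHandler();
    }
}

public class AHandler : ISchemeHandler
{
    public bool ProcessRequestAsync(IRequest request, ISchemeHandlerResponse response,
        OnRequestCompletedHandler requestCompletedCallback)
    {
        // Inspect request.Url and request.Headers here, then supply the content yourself.
        byte[] bytes = Encoding.UTF8.GetBytes("<html><body>intercepted</body></html>");
        response.ResponseStream = new MemoryStream(bytes);
        response.MimeType = "text/html";
        requestCompletedCallback(); // tell CefSharp the response is ready
        return true; // true = we are handling this request
    }
}

Pages would then load the intercepted resources via ascheme://... URLs.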
Related
I am using the Abot library to crawl a web page. The crawler can request the pages correctly but the problem is that almost all of the content is loaded dynamically through knockout.js. The crawler currently has no way of requesting this content which results in only a small part of the page being loaded.
I've tried making the program wait, in the hope that the requests for the dynamic content would be sent anyway, but that doesn't seem to work.
I want the entire page to be loaded but instead, only the base of the page is loaded.
What can I do to make the crawler request all data?
Thanks!
Short answer:
Not possible this way; you need something that can handle the JS for you, like browsers do.
I would recommend Splash from Scrapy (it can be integrated with any language through its REST API).
But in my humble opinion, if you don't need an enterprise solution, don't use C# for web crawling; there are easier solutions and more complete libraries in Python, for example.
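For completeness, calling Splash from C# is just an HTTP call. A minimal sketch, assuming a Splash instance running on its default port 8050:

// Splash renders the page (including its JavaScript) and returns the final HTML.
using (var client = new WebClient())
{
    string target = Uri.EscapeDataString("http://example.com/");
    // 'wait' gives the page's scripts time to run before the HTML snapshot is taken
    string html = client.DownloadString(
        "http://localhost:8050/render.html?url=" + target + "&wait=2");
    Console.WriteLine(html);
}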
Has anyone had success with making scraping software in an Azure Function? It needs to be performed with some kind of dynamic content loading, like the web browser control or Selenium, where all content is loaded before scraping starts. It seems like Selenium is not an option due to the nature of Azure Functions.
I am trying to scrape some web pages and extract content. The pages are pretty dynamic: first the HTML is loaded, and then data is lazy-loaded through JavaScript. Using a standard HTTP request I will not get the data. I could use the BrowserControl in .NET and wait for the Ready state, but the browser control requires a browser and cannot be used in an Azure Function. It could be that HtmlAgilityPack is the right answer. I tried it 5 years ago, and at that point it was pretty terrible at handling HTML formatting. I can see they have some kind of JavaScript library that could be worth a try. Have you tried using that part of HtmlAgilityPack?
Your question is purely .NET/C#-ish (at least I assume you use .NET C#).
Refer to this answer, please. If you can achieve your goal in some way via .NET, you can do it in an Azure Function - there are no restrictions on this side of the road.
For sure you will need an external third-party library that somehow simulates a web browser. I know that Selenium uses browser "drivers" in some way (not sure exactly how) - this could be an idea to research more thoroughly.
I was (and soon will be again) challenged with a similar request and found no obvious solution. My personal expectation is that a dedicated external service (or something like it) will have to be developed, which could then send its result to an Azure HTTP-trigger function that proceeds with the analysis. This so-called "service" could even expose a Web API interface so it can be consumed from anywhere (e.g. an Azure Function).
We are downloading a full web page using the System.Net.WebClient class, but we only want less than half of the page. Is there a way to download a portion of the page - say a third, a half, etc. - using the .NET library, so that we can save network bandwidth and space? If so, please share your ideas, thanks.
You need to send a Range header with your GET (or POST) request. That can be done by using the AddRange method of your HttpWebRequest:
HttpWebRequest myHttpWebRequest =
    (HttpWebRequest)WebRequest.Create("http://www.foo.com");
myHttpWebRequest.AddRange(0, 100); // adds the header "Range: bytes=0-100"
That would yield the first 101 bytes (bytes 0 through 100, inclusive). The server, however, needs to support range requests.
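To see whether the server actually honored the range, check the status code: a server that supports range requests answers 206 Partial Content, while one that doesn't simply returns 200 with the full body. A minimal sketch continuing the example above:

using (var response = (HttpWebResponse)myHttpWebRequest.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
{
    if (response.StatusCode == HttpStatusCode.PartialContent)
        Console.WriteLine(reader.ReadToEnd()); // only the requested byte range
    else
        Console.WriteLine("Server ignored the Range header; full page returned.");
}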
The short answer is no, unless the web app supports some way to tailor its response to what you want it to return.
This could take the form of a:
query string parameter
header field value
The simplest way would be to add a query string parameter and, when it is detected, write out only the necessary HTML to the response object. If you are unable to make changes to the web app, then you won't be able to control how much of a page is returned to you.
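On the client side, passing such a hint is trivial; the real work is teaching the web app to honor it. A hypothetical sketch follows - the "partial" parameter and "X-Partial" header are made-up names, and the server must be modified to recognize whichever mechanism you choose:

// Both mechanisms shown; in practice you would pick one.
// "partial=1" and "X-Partial" are hypothetical names the server must be taught to honor.
var request = (HttpWebRequest)WebRequest.Create("http://www.foo.com/page?partial=1");
request.Headers.Add("X-Partial", "true"); // custom header variant
using (var response = request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
    Console.WriteLine(reader.ReadToEnd());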
You might want to read up on how HTTP works, since the question and its answer rely upon this. Specifically, the header definitions should be helpful.
I want to find a decent solution to track the URLs and HTML content that users are visiting and provide more information to the user. The solution should have minimal impact on end users.
I don't want to write plugins for different browsers. It's hard to maintain.
A proxy-based method is not acceptable, since I don't want to change any of the user's proxy settings.
My application is written in C# and targets Windows. It would be best if the solution could support other OSes as well.
Based on my research, I found the following methods that look workable for me, but all of them have drawbacks, and I can't determine which one is best.
Use WinPcap
WinPcap sniffs all TCP packets without changing any user settings and only requires installing the WinPcap setup, which is acceptable to me. But I have two questions:
a. How do I convert TCP packets into URLs and HTML?
b. Does it really impact performance? I don't know whether sniffing all TCP traffic is too much overhead for this requirement.
Find history files for different browsers
This looks like the easiest way, but I wonder if the solution is stable. I am not sure whether the browser will reliably write the history, or when it writes it. My application will pop up information before the user leaves the current page. The solution won't work for me if the browser only writes to the history file when the user closes it.
Use FindWindow, an accessibility object, or a COM interface to find the UI element which contains the URL
I find this approach incomplete; for example, Chrome will only show the active tab's URL, not the URLs of all tabs.
Another drawback is that I would have to request the URL a second time to get its HTML content.
Any comment or suggestion is welcome.
BTW, I am not writing spyware. The application tries to find all RSS feeds on a web page and show them to end users. I could easily do that in a browser plugin, but I really want to support multiple browsers with a single UI. Thanks.
Though this is a very old post, I thought I'd give my input.
Approach 1, WinPcap, is the best one. It will work for any browser, and even for the built-in browser of any other installed application. It is also the least resource-consuming approach.
There is a library, Pcap.Net, that has an HTTP parser. You can reconstruct the HTTP stream and use its HttpResponseDatagram to parse the body, which your application can then consume.
This link gave me more insight:
Tcp Session Reconstruction with Winpcap
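As a starting point, here is a rough sketch of the capture loop with Pcap.Net. Member names follow the library's samples, but verify them against the version you use, and note that this only sees plain HTTP - HTTPS traffic is encrypted on the wire.

using System;
using PcapDotNet.Core;
using PcapDotNet.Packets;
using PcapDotNet.Packets.Http;

class HttpSniffer
{
    static void Main()
    {
        // Pick the right NIC in practice; [0] is just for the sketch.
        var device = LivePacketDevice.AllLocalMachine[0];
        using (PacketCommunicator communicator =
            device.Open(65536, PacketDeviceOpenAttributes.Promiscuous, 1000))
        {
            communicator.SetFilter("tcp port 80"); // plain HTTP only
            communicator.ReceivePackets(0, packet =>
            {
                HttpDatagram http = packet.Ethernet.IpV4.Tcp.Http;
                if (http != null && http.IsRequest)
                {
                    var request = (HttpRequestDatagram)http;
                    // The Host header plus the request URI give the full URL.
                    Console.WriteLine(request.Uri);
                }
            });
        }
    }
}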
I want to create software that enters data into web forms automatically (like a robot) and accepts the input data.
How can I create this software in C# (as a Windows application)?
What technologies must be used?
What open-source projects exist that I could use?
Sample code etc. would be appreciated.
Please help me.
I hope you're doing something within the acceptable terms of use with the content you automatically post, i.e. you are not asking how to create yet another spam bot...
To grab the HTTP form you can use WebRequest. This returns the content of the page (including the form) as a response stream. You can then parse the response using HtmlAgilityPack for the forms you are interested in. Once you know the forms and fields in the page, you can set values for the fields and post them back, again using a WebRequest but changing the method to POST and encoding the form fields as application/x-www-form-urlencoded content; see How to: Send Data Using the WebRequest Class.
This method uses almost the most basic building blocks; going lower-level than this would mean using sockets and formatting the HTTP request yourself. At this low level you'll have a great deal of freedom and flexibility in how you parse the form and send back the request, at the cost of actually having to understand how web forms and HTTP work.
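As a starting point, here is a minimal sketch of the POST half of that flow. The URL and field names are placeholders; in practice the field names must match what the parsed form actually contains.

// Needs: using System.IO; using System.Net; using System.Text;
var request = (HttpWebRequest)WebRequest.Create("http://example.com/form.aspx");
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

// Encode the field values exactly as a browser would.
string postData = "field1=" + Uri.EscapeDataString("value1") +
                  "&field2=" + Uri.EscapeDataString("value2");
byte[] bytes = Encoding.UTF8.GetBytes(postData);
request.ContentLength = bytes.Length;

using (Stream stream = request.GetRequestStream())
    stream.Write(bytes, 0, bytes.Length);

using (var response = (HttpWebResponse)request.GetResponse())
using (var reader = new StreamReader(response.GetResponseStream()))
    Console.WriteLine(reader.ReadToEnd()); // the server's reply to the submission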