Scraping with cefsharp

Scraping with cefsharp - c#

I have a problem I want to scraping with CefSharp, with what statement can I insert data into an input type = text in its property "value" and how can I get the contents of a table, all this using CefSharp
I have scraping with webbrowser native of .net but in the page that I want to realize the data collection when I load the page this hangs my program and closes it, the solution I found is to change the browser and use CefSharp since it does the task without problems, but now I do not know how to get the data, much less insert.
The code that I use with webbroser is the following
Insert information
foreach (HtmlElement etiqueta in webInterno.Document.GetElementsByTagName("input"))
{
//MessageBox.Show("input");
if (etiqueta.GetAttribute("name").Contains("strCurp"))
{
etiqueta.SetAttribute("value", txtCurp.Text);
}
}
Get information
webInterno.Document.GetElementsByTagName("input"))
{
//MessageBox.Show("input");
if (etiqueta.GetAttribute("name").Contains("strCurp"))
{
Info = etiqueta.InnerText;
}
}
this same thing that I do with the webbrowser control as I can do with cefSharp??
Thanks

Related

c# fail to get element in content page with getelementbyid, getelementsbytagname

I have posted the same question but I post it again since I haven't got any answers to that post yet.
I am trying to get some information (such as tagName, id using GetElementsByTagName method or GetElementById method) from a content page in a website using winforms.
as you see the pictures attached, no matter which selection you make (select1, select2, select3 etc) web address stays same. however, contents under those selections are different in content page.
I am trying to access to a tagName(or id) from one of them(not selections but contents under a specific selection).
I have debugged and figured out(or seems like) I can not access to tagName(or id) from any of those contents under a specific selection.
It seems like I can only access tagName(or id) from main page. picture 3 will help better explanation of some terms such as main page, content page.
I tried to explain in detail, if my question seems still not clear, let me know plz.
My code looks like this.
var countGetFile = webBrowser1.Document.GetElementsByTagName("IFRAME");
foreach (HtmlElement l in countGetFile)
{
if (l.GetAttribute("width").Equals("100%"))
{
MessageBox.Show(l.GetAttribute("height").ToString());
MessageBox.Show(l.GetAttribute("outerText").ToString());
}
}

I was not able to grab information under 2 down level of #document from html.
html looks something like
...
<src="..." id="A" ... >
#document
...
<src="..." id="B" ... >
#document
...
<span="C" ...>
...
I could grab span information (third curly brackets) with codes looking like
HtmlWindow frame1 = webBrowser1.Document.GetElementById("A").Document.Window.Frames["A"];
HtmlWindow frame2 = frame1.Document.GetElementById("B").Document.Window.Frames["B"];
foreach (HtmlElement elm in frame2.Document.All)
{
if (elm.GetAttribute("tagName").Equals("C"))
{
// your command
}
}
to use Document.Window.Frames you need a header using "System.Collections";
btw, there is a problem. When I try to access to the information in third curly bracket, I need to do some kinds of work between frame1 and frame2 such as delaying for frame2 to have enough time to be able to access to next level after frame1.
I figured a kind of hack to get it through. Place a messagebox to pop up for short time delay, or place a delay function( not freeze ) with async code looking like,
async Task PutTaskDelay()
{
await Task.Delay(5000);//5 secs
}
I just found a temporary solution for accessing to second level. I will appreciate anyone who knows some ways to solve this problem.

Automating Google Maps in C# Web Browser (issue executing javascript properly)

NOT using API
I am currently attempting to use a web browser in C# to load google maps and automatically focus on my current location, however, for some reason I cannot get this to work properly. The idea is simple. Load Google maps, and either execute the script to focus on my current location:
mapBrowser.Document.InvokeScript("mylocation.onButtonClick");
Or, invoke the button click through an HtmlElement:
HtmlElement myLocationButton = mapBrowser.Document.GetElementById("mylocation");
myLocationButton.InvokeMember("click");
But, of course neither of these methods actually work correctly, the coordinates returned are incorrent and the map never actually focuses. Any ideas on how I can fix this issue properly? The scripts aren't invoked until after the document is actually loaded:
private void mapBrowser_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
if(mapBrowser.Url.ToString() == "https://www.google.com/maps/preview/")
{
try
{
//HtmlElement myLocationButton = mapBrowser.Document.GetElementById("mylocation");
//myLocationButton.InvokeMember("click");
mapBrowser.Document.InvokeScript("mylocation.onButtonClick");
//mapBrowser.Document.InvokeScript("focus:mylocation.main");
}
catch (Exception ex)
{
MessageBox.Show("Error Invoking Script: " + ex.Message);
}
}
}
so I don't believe that is the cause of my problem. Even more frustratingly, the auto-focus works fine if I click the button manually.
Any help is appreciated, thank you!
(NOTE, you may have to go into IE and allow Google maps access to your location in order to replicate this issue properly)

I've had problem few times that WebBrowser control uses too old version of IE. You need to modify registry to get it to use newer version of IE.
I tried "https://www.google.com/maps/preview/" with both IE 8 and 9 and it gave me an error, but it works on IE 10.
See: http://weblog.west-wind.com/posts/2011/May/21/Web-Browser-Control-Specifying-the-IE-Version

How to get JavaScript stack trace with WebBrowser control?

Knowing how to get notified about script errors when hosting a WebBrowser control using OLECMDID_SHOWSCRIPTERROR inside my WinForms C# application, I currently do it successfully this way:
private void handleError(mshtml.IHTMLDocument2 htmlDocument)
{
var htmlWindow = htmlDocument.parentWindow;
var htmlEventObject = htmlWindow.#event as mshtml.IHTMLEventObj2;
_lineNumber = (int)htmlEventObject.getAttribute(#"errorLine");
_characterNumber = (int)htmlEventObject.getAttribute(#"errorCharacter");
_errorCode = (int)htmlEventObject.getAttribute(#"errorCode");
_errorMessage = htmlEventObject.getAttribute(#"errorMessage") as string;
_url = htmlEventObject.getAttribute(#"errorUrl") as string;
}
This works as expected.
What I currently cannot solve is to get the JavaScript call stack.
I've tried several things in the example above:
_callStack = htmlEventObject.getAttribute(#"stack") as string;
_callStack = htmlEventObject.getAttribute(#"errorStack") as string;
_callStack = htmlEventObject.getAttribute(#"stackTrace") as string;
...
All those do return an empty/NULL string.
Whether I'm unsure if this information can be retrieved at all, still my question is:
How to get the call stack of a JavaScript error from within the application hosting an Internet Explorer web browser control?

I'm not entirely sure if that's possible at all either, but I might have some useful information related to your question. Back in IE7 days I worked on a custom host for WebBrowser control in C++, and I still keep the list of service GUIDs the control was requesting from my OLE site object through IServiceProvider. One of those interfaces was IDebugApplication, which might open a door to access JavaScript stack frame via IDebugApplication::AddStackFrameSniffer. I had not tried it back then. If you're ready to do further research, you could use this project as a starting point to implement a custom WebBrowser host in C#.

Searching browser contents after automated form submission

Im using WatiN to automate form submissions on my website. The problem Im having is I cant find any documentation on how to search the page after submission and verify if a string is present or not. I need it to look for a string in the resulting page, and if it is present continue inserting the next variable in a list, if it is NOT present I need it to log what string was inserted that caused the text not to show up, preferably in a text file. Heres my code so far.
browser.GoTo("https://site.com");
// locate the form by name and fill it
browser.TextField(Find.ByName("form")).TypeText("example");
// click the submit button
browser.Button(Find.ByName("submit")).Click();
Unfortunately that's all I've gotten. Help is appreciated, I just need to be pushed in the right direction.

Use browser.ContainsText("text to find") to verify if a string is present or not.
string[] textToAddList = File.ReadAllLines(#"C:\Users\Syd\Desktop\list.txt");
browser.GoTo("https://site.com");
foreach (string textToAdd in textToAddList)
{
browser.TextField(Find.ByName("form")).TypeText(textToAdd);
browser.Button(Find.ByName("submit")).Click();
browser.WaitUntilComplete();
if (!browser.ContainsText("text to find"))
{
//Log here
break;
}
}

Webbrowser control is not showing Html but shows webpage

I am automating a task using webbrowser control , the site display pages using frames.
My issue is i get to a point , where i can see the webpage loaded properly on the webbrowser control ,but when it gets into the code and i see the html i see nothing.
I have seen other examples here too , but all of those do no return all the browser html.
What i get by using this:
HtmlWindow frame = webBrowser1.Document.Window.Frames[1];
string str = frame.Document.Body.OuterHtml;
Is just :
The main frame tag with attributes like SRC tag etc, is there any way how to handle this?Because as i can see the webpage completely loaded why do i not see the html?AS when i do that on the internet explorer i do see the pages source once loaded why not here?
ADDITIONAL INFO
There are two frames on the page :
i use this to as above:
HtmlWindow frame = webBrowser1.Document.Window.Frames[0];
string str = frame.Document.Body.OuterHtml;
And i get the correct HTMl for the first frame but for the second one i only see:
<FRAMESET frameSpacing=1 border=1 borderColor=#ffffff frameBorder=0 rows=29,*><FRAME title="Edit Search" marginHeight=0 src="http://web2.westlaw.com/result/dctopnavigation.aspx?rs=WLW12.01&ss=CXT&cnt=DOC&fcl=True&cfid=1&method=TNC&service=Search&fn=_top&sskey=CLID_SSSA49266105122&db=AK-CS&fmqv=s&srch=TRUE&origin=Search&vr=2.0&cxt=RL&rlt=CLID_QRYRLT803076105122&query=%22LAND+USE%22&mt=Westlaw&rlti=1&n=1&rp=%2fsearch%2fdefault.wl&rltdb=CLID_DB72585895122&eq=search&scxt=WL&sv=Split" frameBorder=0 name=TopNav marginWidth=0 scrolling=no><FRAME title="Main Document" marginHeight=0 src="http://web2.westlaw.com/result/dccontent.aspx?rs=WLW12.01&ss=CXT&cnt=DOC&fcl=True&cfid=1&method=TNC&service=Search&fn=_top&sskey=CLID_SSSA49266105122&db=AK-CS&fmqv=s&srch=TRUE&origin=Search&vr=2.0&cxt=RL&rlt=CLID_QRYRLT803076105122&query=%22LAND+USE%22&mt=Westlaw&rlti=1&n=1&rp=%2fsearch%2fdefault.wl&rltdb=CLID_DB72585895122&eq=search&scxt=WL&sv=Split" frameBorder=0 borderColor=#ffffff name=content marginWidth=0><NOFRAMES></NOFRAMES></FRAMESET>
UPDATE
The two url of the frames are as follows :
Frame1 whose html i see
http://web2.westlaw.com/nav/NavBar.aspx?RS=WLW12.01&VR=2.0&SV=Split&FN=_top&MT=Westlaw&MST=
Frame2 whose html i do not see:
http://web2.westlaw.com/result/result.aspx?RP=/Search/default.wl&action=Search&CFID=1&DB=AK%2DCS&EQ=search&fmqv=s&Method=TNC&origin=Search&Query=%22LAND+USE%22&RLT=CLID%5FQRYRLT302424536122&RLTDB=CLID%5FDB6558157526122&Service=Search&SRCH=TRUE&SSKey=CLID%5FSSSA648523536122&RS=WLW12.01&VR=2.0&SV=Split&FN=_top&MT=Westlaw&MST=
And the properties of the second frame whose html i do not get are in the picture below:
Thank you

I paid for the solution of the question above and it works 100 %.
What i did was use this function below and it returned me the count to the tag i was seeking which i could not find :S.. Use this to call the function listed below:
FillFrame(webBrowser1.Document.Window.Frames);
private void FillFrame(HtmlWindowCollection hwc)
{
if (hwc == null) return;
foreach (HtmlWindow hw in hwc)
{
HtmlElement getSpanid = hw.Document.GetElementById("mDisplayCiteList_ctl00_mResultCountLabel");
if (getSpanid != null)
{
doccount = getSpanid.InnerText.Replace("Documents", "").Replace("Document", "").Trim();
break;
}
if (hw.Frames.Count > 0) FillFrame(hw.Frames);
}
}
Hope it helps people .
Thank you

For taking html you have to do it that way:
WebClient client = new WebClient();
string html = client.DownloadString(#"http://stackoverflow.com");
That's an example of course, you can change the address.
By the way, you need using System.Net;

This works just fine...gets BODY element with all inner elements:
Somewhere in your Form code:
wb.Url = new Uri("http://stackoverflow.com");
wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(wbDocumentCompleted);
And here is wbDocumentCompleted:
void wb1DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
var yourBodyHtml = wb.Document.Body.OuterHtml;
}
wb is System.Windows.Forms.WebBrowser
UPDATE:
The same as for the document, I think that your second frame is not loaded at the time you check for it's content...You can try solutions from this link. You will have to wait for your frames to be loaded in order to see its content.

The most likely reason is that frame index 0 has the same domain name as the main/parent page, while the frame index 1 has a different domain name. Am I correct?
This creates a cross-frame security issue, and the WB control just leaves you high and dry and doesn't tell you what on earth went wrong, and just leaves your objects, properties and data empty (will say "No Variables" in the watch window when you try to expand the object).
The only thing you can access in this situation is pretty much the URL and iFrame properties, but nothing inside the iFrame.
Of course, there are ways to overcome teh cross-frame security issues - but they are not built into the WebBrowser control, and they are external solutions, depending on which WB control you are using (as in, .NET version or pre .NET version).
Let me know if I have correctly identified your problem, and if so, if you would like me to tell you about the solution tailored to your setup & instance of the WB control.
UPDATE: I have noticed that you're doing a .getElementByTagName("HTML")(0).outerHTML to get the HTML, all you need to do is call this on the document object, or the .body object and that should do it. MyDoc.Body.innerHTML should get the the content you want. Also, notice that there are additional iFrames inside these documents, in case that is of relevance. Can you give us the main document URL that has these two URL's in it so we / I can replicate what you're doing here? Also, not sure why you are using DomElement but you should just cast it to the native object it wants to be cast to, either a IHTMLDocument2 or the object you see in the watch window, which I think is IHTMLFrameElement (if i recall correctly, but you will know what i mean once you see it). If you are trying to use an XML object, this could be the reason why you aren't able to get the HTML content, change the object declaration and casting if there is one, and give it a go & let us know :). Now I'm curious too :).

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.