I have two JavaScript variables whose values I need to retrieve. The HTML consists of a series of nested DIVs with no id/name attributes. Is it possible to retrieve the data from these variables using HtmlAgilityPack? If so, how would I go about it? If not, what would be required: regular expressions? If the latter, please help me create a regular expression that would allow me to do this. Thank you.
<div style="margin: 12px 0px;" align="left">
<script type="text/javascript">
variable1 = "var1";
variable2 = "var2";
</script>
</div>
I'm assuming you are trying to scrape this information from a website, most likely one you don't have direct control over. There are several ways to do this; I'll go from easy to hard (at least as I see them):
Ask the owner of the site. Most of the time they can give you direct access to the information, and if you ask nicely they might just let you have it for free.
Use the WebBrowser control, run the JavaScript, and then parse the values from the DOM afterwards. Unlike HttpWebRequest, this lets the page's scripts run, so all the proper values are loaded on the page before you scrape them.
Steal the source with Firebug. Inspect the website with Firebug to see which URLs are called in the background. Most likely, it's using an asynchronous request to retrieve the updated information from a web service. In Firebug, you can view this under Net -> XHR. Look at the request and the values returned; you can then request the values yourself and parse the contents from the source rather than scrape the page.
I think this is the information you were looking for, but if not, let me know and I can clarify or fix the answer.
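Since the question also asks for a regular expression, here is a minimal sketch of that route in C#. It assumes the page source has already been downloaded into a string and that the variables are plain double-quoted string assignments, as in the snippet from the question. The class and method names (ScriptVariableScraper, Extract) are made up for illustration and are not part of HtmlAgilityPack or any other library.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ScriptVariableScraper
{
    // Captures the body of every <script> element in the page.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(?<body>.*?)</script>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches simple assignments of the form: name = "value";
    // (assumption: values are double-quoted and contain no escaped quotes)
    private static readonly Regex Assignment = new Regex(
        @"\b(?<name>[A-Za-z_]\w*)\s*=\s*""(?<value>[^""]*)""\s*;");

    // Returns a map from variable name to its string value.
    public static Dictionary<string, string> Extract(string html)
    {
        var values = new Dictionary<string, string>();
        foreach (Match script in ScriptBlock.Matches(html))
        {
            foreach (Match m in Assignment.Matches(script.Groups["body"].Value))
            {
                // A later assignment to the same name overwrites the earlier one.
                values[m.Groups["name"].Value] = m.Groups["value"].Value;
            }
        }
        return values;
    }
}
```

Calling ScriptVariableScraper.Extract(html)["variable1"] on the markup from the question should give the text between the quotes. This only works when the values are present in the raw HTML; values that are computed or fetched at runtime still need one of the approaches listed above.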
Is there a way to find all the src="" URLs when rendering an ASP.NET MVC page in the view, so that DNS prefetch tags can be generated on the fly?
https://www.chromium.org/developers/design-documents/dns-prefetching
If I understood correctly, I can tell you the following:
Option #1: (Not a pretty solution, but it would work.)
NOTE: for this, try to use plain JavaScript and not rely on jQuery or another library, since you would then still need to load the .js file for it, which defeats the point of your question.
Put the target of each src/href into some other predefined attribute, using your own logic to define the "base target", but in a way that keeps the browser from initiating the request for that image or other file.
Example:
<img url="" class="DNS_BaseTarget" DNS_BaseTarget="smiley.gif||myCDNPointerInfo" alt="">
Then, with JavaScript, get a list of all elements that use the class DNS_BaseTarget, read the attribute value, and update the "src" attribute.
At the same time, you can use JavaScript to inject all the '<link rel="dns-prefetch" href="https://cdn.yourTargetDomain.com">' tags that you will need, based on the information you just processed.
I have not tested this concept, so some lag or delay on the client might be expected (though maybe not noticeable by the user).
Option #2:
The View Result Execution Process (in the MVC life cycle) tells us that the 'Render()' method is the last one to be executed.
With that said, you can create your own custom override logic.
Example: How to intercept view rendering to add HTML/JS on all partial views?
With this approach of processing the final HTML before sending it to the user, you could parse the output, collect all the 'src/href' values, and then inject all the '<link rel="dns-prefetch" href="https://cdn.yourTargetDomain.com">' tags that you will need.
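The parsing step of Option #2 could look roughly like the sketch below. It assumes you already have the rendered HTML as a string (how you capture it depends on how you intercept rendering) and it uses a regular expression rather than a real HTML parser, so treat it as a starting point. The names DnsPrefetch and BuildLinkTags are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DnsPrefetch
{
    // Finds quoted src="..." and href="..." attribute values.
    private static readonly Regex UrlAttribute = new Regex(
        @"\b(?:src|href)\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns one dns-prefetch <link> tag per distinct external host.
    public static List<string> BuildLinkTags(string html)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var tags = new List<string>();
        foreach (Match m in UrlAttribute.Matches(html))
        {
            string url = m.Groups["url"].Value;

            // Protocol-relative URLs (//host/path) get a scheme so Uri can parse them.
            if (url.StartsWith("//", StringComparison.Ordinal)) url = "https:" + url;

            // Relative URLs point at our own host and need no prefetch, so skip them.
            Uri uri;
            if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) continue;
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;

            string origin = uri.Scheme + "://" + uri.Host;
            if (seen.Add(origin))
            {
                tags.Add("<link rel=\"dns-prefetch\" href=\"" + origin + "\">");
            }
        }
        return tags;
    }
}
```

The resulting tags would then be written into the head of the response. Note that running this on every request adds work on the server, so caching the result per view may be worth considering.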
I'm trying to recreate an old C# application of mine that streams from an online radio station. The problem with the old one is that it loaded an entire web page just to display a certain area of it, which takes more resources than I deem necessary. So I'm now rewriting the entire application, and am looking for a way to retrieve text from the following code on the website:
<div id="now" style="visibility: visible; display: block;">
<div class="scroll" style="margin-left: 0.000px;">
<div id="title">SONG_NAME</div>
<div id="artist">SONG_ARTIST</div>
</div>
</div>
This piece is constantly updated on the page, with the name and artist of the current song.
id="title" is the name of the song and id="artist" is the artist of the song.
I would like to retrieve the name and artist every 10 seconds or so.
Any idea what code to use for this?
You'll probably want to pull the entire page back. The main considerations are:
You could request the HTML uncompressed, open the stream using HttpWebResponse.GetResponseStream, read up to the end of the block you need (you'll need to analyse the text as you go), and finally call HttpWebResponse.Close to close the stream and release the connection.
If the entire response is compressed, it may be more efficient to get the whole thing anyway before decompressing.
You need to test which is more efficient for the specific page you are scraping.
So the usual way is to retrieve the whole HTML stream, then use a regex to find the block you need, and just keep your code simple.
Recommendation
If you want to keep it really simple, look at HtmlAgilityPack, which is available on NuGet for use with Visual Studio 2012. It makes HTML scraping very simple.
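As a rough sketch of the "retrieve the whole page, then use a regex" route: the code below downloads the page every 10 seconds and pulls out the two divs by id. It uses HttpClient rather than HttpWebResponse purely for brevity, and the names NowPlaying, ReadDiv and PollAsync, as well as whatever URL you pass in, are assumptions for illustration. It also assumes the song text is present in the HTML the server returns; if the page fills those divs in with JavaScript, the raw download will not contain them.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

static class NowPlaying
{
    private static readonly HttpClient Http = new HttpClient();

    // Returns the decoded text inside <div id="..."> or null if no such div is found.
    // Only suitable for divs without nested divs, such as "title" and "artist".
    public static string ReadDiv(string html, string id)
    {
        Match m = Regex.Match(
            html,
            "<div\\s+id=\"" + Regex.Escape(id) + "\"[^>]*>(?<text>.*?)</div>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? WebUtility.HtmlDecode(m.Groups["text"].Value.Trim()) : null;
    }

    // Downloads the page and prints "title - artist" every 10 seconds.
    // Cancelling the token ends the loop (Task.Delay throws TaskCanceledException).
    public static async Task PollAsync(string url, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            string html = await Http.GetStringAsync(url);
            Console.WriteLine(ReadDiv(html, "title") + " - " + ReadDiv(html, "artist"));
            await Task.Delay(TimeSpan.FromSeconds(10), token);
        }
    }
}
```

With HtmlAgilityPack the ReadDiv part would be replaced by a node lookup on the parsed document, which tends to hold up better when the markup changes.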
Hi, I want to show or hide duplicate records according to a query, so I need to know how to call a JavaScript function from the C# code-behind.
<a onclick="Grid1.insertRecord(); return false;" id="a2" href="javascript:">Save</a>
When I click Save, I need to show a popup, which I have written in JavaScript.
if (!exist)//exists is the query
{
System.Web.UI.Control my = FindControl("a2");
a2.Attributes.Add("onclick", "return HideDuplicate()");
This line returns an error saying "a2 does not exist in the current context."
Why not use an ASP.NET LinkButton? It has a server-side Click event and is accessible from the C# code-behind.
The basic <a> tag is not turned into a control by ASP.NET unless you add runat="server" to it. It's then turned into an HtmlAnchor control.
<a onclick="Grid1.insertRecord(); return false;" id="a2" href="#" runat="server">Save</a>
This might work for you. It's not clear whether you have more than one of these links on the page (such as in a row of a GridView) or whether it's just on there once.
Also, the way you have used JavaScript does not follow best practices, but that's a discussion for another day :)
The MSDN documentation on programmatically creating client-side callbacks without a postback, which includes an example with the code-behind in C#, might give a good overview of how it is supposed to work in general.
In your case, the corresponding code-behind should implement the 'ICallbackEventHandler' interface and the two methods it describes. In addition, you would need two more client-side JavaScript functions to prepare and handle the callback, besides the executor/invoker (in your case, a 'save' method). However, one of the two additional JavaScript functions could be registered in the code-behind, as the example shows.
When I render a page in ASP.NET, the following happens
</head>
<NOSCRIPT>
<meta http-equiv="REFRESH" content="0;URL=/Default.aspx?id=84&epslanguage=en-GB&jse=0" />
</NOSCRIPT>
<title>Page title goes here.</title>
<body>
My masterpage looks like this:
<title>Page title goes here.</title>
</head>
<body>
So what I'm asking is, where the heck has this refresh meta tag come from, why has it put it between my head tag and body tag, and why has my page title jumped outside of the head?!
When viewing the page's generated source in Firebug, it shows the title tag and this new meta tag within the head tag, but when viewing the source in any browser, it looks like the above. When using wget to scrape the page, it also comes out incorrectly, as displayed above.
Any ideas why browsers may be interpreting this in different ways, and more importantly where this new meta tag has come from?
Thanks! Karl.
Edit:
Hi!
Thanks for your replies guys, very informative!
I've discovered that the problem is this line of code:
Page.Header.Controls.Add(ctrl);
Adding the mysterious meta tag with this line puts it outside the head tag. When I comment it out, the title tag drops back into the right place, and all is well!
Any further thoughts?
Thanks!
Karl.
As for why browsers are interpreting it differently, there are two answers. Firstly, the Firebug output is, as you say, generated source. That means it's already gone through a certain amount of processing, and clearly Firefox is doing some magic to say "Well, it's a meta and a title tag, they should be in the head, so I'll put them there."
For the other browsers, it sounds like you are comparing their raw source, which is what the browser receives before it has tried to make sense of it. I suspect you'd get the same if you viewed the raw source in Firefox (Ctrl+U).
I'd have expected all browsers to do much the same thing as you have described Firefox doing, but if not, that's not really something to be concerned about. When invalid HTML like this is received, the browsers have no real rules about what to do. This means that browsers are welcome to do whatever they want, from trying to guess what you meant to just ignoring it entirely.
As for what is causing it, the epslanguage query parameter is from EPiServer. I don't know whether it was in the request URL, so it may just be being persisted, or it may be EPiServer trying to redirect to a page with an explicit language instead of assuming the default. Unfortunately I'm not familiar with EPiServer, so I can't say anything more specific about that.
It is of course definitely the case that there is something on your server side that is causing this to happen.
Out of interest, do you get that for all pages, for just one specific page, or only in one specific circumstance?
Quite often it's a case of an element not being properly closed. Most browsers will try to adjust the markup so that it makes sense, but in most cases the markup will be incorrectly parsed.
You should probably share more of your master page (and the web form using it)!
Maybe your HEAD-tag doesn't have runat="server"?
I want to create a small web browser, tiny and fast, but I have a problem. Let me explain:
1 - The user enters a site: google.com
2 - The C# program fetches google.com
3 - It finds <td nowrap="" align="center">
4 - The web browser shows only that area
I don't know where to start.
Thanks
OK, I'm going to try to answer your question, but I am deciphering it as well.
Create a WebBrowser control on your form (2.0 is fine for what you need) and call .Navigate("http://www.google.com");
Get the source code from the Document. You can do this as follows: string source = _WebBrowser.Document.Body.OuterHtml;
Use string manipulation to get to the area of the page you need, for instance the .Substring() function.
Save the text into a file or stream and load it into the WebBrowser control, or replace the page's Document HTML with just the HTML you want to show.
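The string-manipulation step above can be sketched as follows, using IndexOf and Substring to cut out one element. FragmentExtractor and ExtractElement are hypothetical names, and the sketch assumes the opening tag appears in the source exactly as you pass it in and that the element contains no nested element of the same type. Be aware that OuterHtml may serialize attributes differently from the original page source, so check the actual string first.

```csharp
using System;

static class FragmentExtractor
{
    // Returns the text from the first occurrence of openTag through the
    // next closeTag (inclusive), or null if either one is not found.
    public static string ExtractElement(string source, string openTag, string closeTag)
    {
        int start = source.IndexOf(openTag, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return null;

        // Search for the closing tag only after the opening tag.
        int end = source.IndexOf(closeTag, start + openTag.Length, StringComparison.OrdinalIgnoreCase);
        if (end < 0) return null;

        return source.Substring(start, end + closeTag.Length - start);
    }
}
```

For the question's case the call would be ExtractElement(source, "<td nowrap=\"\" align=\"center\">", "</td>"). The returned fragment can then be assigned to the WebBrowser control's DocumentText property so that only that area is displayed.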
Okay! Looking at the comment, it seems you want to request a page using C# and show only one part of it; in your case, that specific <td>. Please correct me if I am wrong.
In addition to what Kyle has mentioned, check out HTML Agility Pack. It might be of interest to you.