I'm currently trying to get the online status of a user on a website. I'm able to get the website content into a string, which shows me the online status of the users:
</path></g></svg><div class="name-2WpE7M">Friends</div></a></div><header>Direct Messages</header><div class="channel-2QD9_O selected-1HYmZZ" style="height: 42px; opacity: 1;"><div class="wrapper-2F3Zv8 small-5Os1Bb forceDarkTheme-2cI4Hb avatar-small"><div user="Dennis" status="online" class="avatar-small stop-animation" style="background-image: url("/assets/322c936a8c8be1b803cd94861bdfa868.png");"></div><div class="online-2S838R status-oxiHuE small-5Os1Bb status"></div></div><span class="name-2WpE7M">Dennis</span><button class="close-3hZ5Ni"></button></div></span></div></div></div><div class="container-2Thooq"><div class="wrapper-2F3Zv8 small-5Os1Bb avatar-small"><div user="halaldi" status="online" class="avatar-3JE4B3 avatar-small stop-animation" style="background-image: url("/assets/0e291f67c9274a1abdddeb3fd919cbaa.png");"></div><div class="online-2S838R status-oxiHuE small-5Os1Bb status status-2kJpnA"></div></div><div class="accountDetails-3k9g4n nameTag-m8r81H"><span
https://pastebin.com/Hb1mp1fq — search for status="online" and you will find two users here who are online.
But here comes the part where I'm stuck: what's the best way to get that information out of the string? I could split the string down until only the parts I need are left, but I guess that's crude and not the best way to do it, so I would love to learn a new and better way to do this. :)
It seems your data is HTML.
If so, you can use an HTML parsing library such as:
https://www.nuget.org/packages/HtmlAgilityPack
It provides many ways to parse HTML (by tags, CSS selectors, XPath, etc.).
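HtmlAgilityPack is a .NET library, so a full example belongs in C#; as a language-neutral sketch of the extraction itself, here is a Java snippet that pulls the user/status attribute pairs out of a string like the one above. (Regex over HTML is brittle and only reasonable for a fixed, known snippet; a real parser is the better tool.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatusScrape {
    // Extract (user, status) attribute pairs from an HTML snippet.
    static List<String[]> extract(String html) {
        List<String[]> result = new ArrayList<>();
        Matcher m = Pattern.compile("user=\"([^\"]+)\"\\s+status=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            result.add(new String[] { m.group(1), m.group(2) });
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div user=\"Dennis\" status=\"online\" class=\"a\"></div>"
                    + "<div user=\"halaldi\" status=\"online\" class=\"b\"></div>";
        for (String[] pair : extract(html)) {
            System.out.println(pair[0] + " is " + pair[1]);
        }
    }
}
```

With HtmlAgilityPack in C#, the equivalent would be to load the string with HtmlDocument.LoadHtml and query it with an XPath such as //div[@status='online'], then read each node's user attribute.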
I am coming back to work on a BOT that scraped data from a site once a day for my personal use.
However they have changed the code during COVID and now it seems they are loading in a lot of the content with Ajax/JavaScript.
I thought that if I did a WebRequest and obtained the response HTML from a URL, it should match the same content that I see in a browser (FF/Chrome) when I right click and "view source". I thought the actual DOM and generated source code would come later when those files were loaded as onload events fired, scripts lazily loaded and so on.
However, the source HTML I obtain with my BOT is NOT the same as the HTML I see when viewing the source code, so the regular expressions that find certain links no longer match anything.
Why am I seeing a difference between "view source" and a download of the HTML?
I can only think that when the page loads, SCRIPTS run that load other content into the page and that when I view source I am actually seeing a partial generated source rather than the original source code. Therefore is there a way I can call the page with my BOT, wait X seconds before obtaining the response to get this "onload" generated HTML?
Or, even better, a way for MY BOT (not someone else's) to view the generated source.
This BOT runs as a web service. I can find another site to scrape but it's just painful when I have all the regular expressions working on the source I see, except it's NOT the source my BOT obtains.
A bit confused as to why my browser is showing me more content with a view source (not generated source) than my BOT gets when making a valid request.
Any help would be much appreciated; this is an almost eight-year project that I have been doing on and off, and this change has ruined one of the core parts of the system.
In response to the OP's comment, here is the Java code for how to click at different parts of the screen to do this:
You could use Java's Robot class. I just learned about it a few days ago:
// Imports
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.event.InputEvent;
// Code
void click(int x, int y, int buttonMask) throws AWTException {
    Robot robot = new Robot(); // throws AWTException if low-level input control is unavailable
    robot.mouseMove(x, y);
    // Robot expects InputEvent button masks, e.g. InputEvent.BUTTON1_DOWN_MASK
    robot.mousePress(buttonMask);
    robot.mouseRelease(buttonMask);
}
You would then call the click function with the x and y position to click, as well as the button mask. Note that Robot takes InputEvent button masks (InputEvent.BUTTON1_DOWN_MASK for the left button, InputEvent.BUTTON3_DOWN_MASK for the right, etc.), not the MouseEvent.BUTTON1/BUTTON2 constants.
After stringing together the right positions (this will vary depending on the screen) you could do just about anything.
To use keyboard shortcuts, just use the keyPress and keyRelease methods. Here is a good way to do this:
// Requires: import java.awt.AWTException; import java.awt.Robot; import java.awt.event.KeyEvent;
void key(int keyCode, boolean ctrl, boolean alt, boolean shift) throws AWTException {
    Robot robot = new Robot();
    // Hold down the requested modifiers first
    if (ctrl)
        robot.keyPress(KeyEvent.VK_CONTROL);
    if (alt)
        robot.keyPress(KeyEvent.VK_ALT);
    if (shift)
        robot.keyPress(KeyEvent.VK_SHIFT);
    // Tap the main key
    robot.keyPress(keyCode);
    robot.keyRelease(keyCode);
    // Release the modifiers again
    if (ctrl)
        robot.keyRelease(KeyEvent.VK_CONTROL);
    if (alt)
        robot.keyRelease(KeyEvent.VK_ALT);
    if (shift)
        robot.keyRelease(KeyEvent.VK_SHIFT);
}
Thus, something like Ctrl+Shift+I to open the inspect menu would look like this:
key(KeyEvent.VK_I, true, false, true);
Here are the steps to copy a website's code (from the inspector) with Google Chrome:
Ctrl + Shift + I
Right click the HTML tag
Select "Edit as HTML"
Ctrl + A
Ctrl + C
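The keyboard part of those steps can be sketched as data plus a tiny player; the class name is illustrative, the right-click and "Edit as HTML" selection between steps still need screen-specific mouse coordinates, and the sequence is built as a list first so it can be inspected without a display:

```java
import java.awt.GraphicsEnvironment;
import java.awt.Robot;
import java.awt.event.KeyEvent;
import java.util.ArrayList;
import java.util.List;

public class CopyPageSource {
    // Each chord is a set of key codes pressed together, released in reverse order.
    static List<int[]> chords() {
        List<int[]> steps = new ArrayList<>();
        steps.add(new int[] { KeyEvent.VK_CONTROL, KeyEvent.VK_SHIFT, KeyEvent.VK_I }); // open inspector
        // (right-click the HTML tag and choose "Edit as HTML" between these steps)
        steps.add(new int[] { KeyEvent.VK_CONTROL, KeyEvent.VK_A }); // select all
        steps.add(new int[] { KeyEvent.VK_CONTROL, KeyEvent.VK_C }); // copy
        return steps;
    }

    static void play(Robot robot, List<int[]> steps) {
        for (int[] chord : steps) {
            for (int code : chord) robot.keyPress(code);
            for (int i = chord.length - 1; i >= 0; i--) robot.keyRelease(chord[i]);
        }
    }

    public static void main(String[] args) throws Exception {
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("No display; skipping playback.");
            return;
        }
        play(new Robot(), chords());
    }
}
```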
Then, you can use the technique from this StackOverflow answer to get the content from the clipboard:
// Requires: import java.awt.Toolkit; import java.awt.datatransfer.Clipboard; import java.awt.datatransfer.DataFlavor;
Clipboard c = Toolkit.getDefaultToolkit().getSystemClipboard();
String text = (String) c.getData(DataFlavor.stringFlavor); // may throw UnsupportedFlavorException or IOException
Using something like FileOutputStream, you can put the info into a file:
// Requires: import java.io.File; import java.io.FileOutputStream;
try (FileOutputStream output = new FileOutputStream(new File( PATH HERE ))) {
    output.write(text.getBytes());
}
I hope this helps!
I seem to have fixed it just by turning on the ability to store cookies in my custom HTTP (Bot/Scraper) class that was being called from the class trying to obtain the data. The site probably has a defence system to detect visitors who request pages, but not the JS/CSS, with a different session ID on each request.
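The bot in question is C#, but the same fix in Java's standard library is a CookieManager installed as the default CookieHandler, after which HttpURLConnection replays cookies automatically. A network-free sketch of the store at work (example.com and the cookie values are placeholders):

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpCookie;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class CookieDemo {
    // Build a cookie-aware manager and return the headers it would send back.
    static Map<String, List<String>> offeredHeaders() throws Exception {
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        // java.net.CookieHandler.setDefault(manager); // install globally for HttpURLConnection
        URI site = new URI("http://example.com/");
        // Simulate the Set-Cookie a server would return with a session id.
        HttpCookie session = new HttpCookie("SESSIONID", "abc123");
        session.setPath("/");
        session.setVersion(0);
        manager.getCookieStore().add(site, session);
        // On the next request to the same site, the manager offers the cookie back.
        return manager.get(site, Map.of());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(offeredHeaders());
    }
}
```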
However, I would like to see some other examples, because if it is just cookies, then they could use JavaScript to test for JavaScript, e.g. an AJAX call to log whether JS is actually on, or some DOM manipulation to determine whether you are really human or not, which would break it again.
Every site uses different methods to prevent scrapers, email harvesters, job rapists, link harvesters, etc., including working out the standard time between requests for 100% verifiable humans and BOTS, and then using those values to help detect spoofed user-agents and the like. I wrote a whole system to stop BOTS at my last place of work, and it's a layered approach. I'm just glad that enabling cookies solved it on this site, but it could easily be beefed up with other tricks to test for BOTS vs HUMANS.
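One of those layers, the request-timing check, can be sketched in a few lines: humans produce irregular gaps between requests, while naive bots fire at near-constant intervals. The thresholds here are invented purely for illustration:

```java
public class TimingCheck {
    // Flag a client whose inter-request gaps are both short and suspiciously regular.
    static boolean looksLikeBot(long[] requestMillis) {
        if (requestMillis.length < 3) return false;
        int n = requestMillis.length - 1;
        double[] gaps = new double[n];
        double mean = 0;
        for (int i = 0; i < n; i++) {
            gaps[i] = requestMillis[i + 1] - requestMillis[i];
            mean += gaps[i];
        }
        mean /= n;
        double variance = 0;
        for (double g : gaps) variance += (g - mean) * (g - mean);
        double stddev = Math.sqrt(variance / n);
        // Illustrative thresholds: sub-second average gap with under 10% jitter.
        return mean < 1000 && stddev < mean * 0.1;
    }

    public static void main(String[] args) {
        long[] bot = { 0, 500, 1000, 1500, 2000 };     // metronome-regular
        long[] human = { 0, 3200, 4100, 9800, 12400 }; // irregular
        System.out.println(looksLikeBot(bot));   // true
        System.out.println(looksLikeBot(human)); // false
    }
}
```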
I do know some Java, enough to work out what is going on anyway. My BOT is in C#.
I have posted the same question before, but I am posting it again since I haven't received any answers yet.
I am trying to get some information (such as a tagName or id, using the GetElementsByTagName or GetElementById method) from a content page in a website, using WinForms.
As you can see in the attached pictures, no matter which selection you make (select1, select2, select3, etc.), the web address stays the same; however, the contents under those selections differ in the content page.
I am trying to access a tagName (or id) from one of them (not the selections, but the contents under a specific selection).
I have debugged and figured out (or so it seems) that I cannot access a tagName (or id) from any of those contents under a specific selection.
It seems I can only access a tagName (or id) from the main page. Picture 3 will help explain some terms, such as "main page" and "content page".
I have tried to explain this in detail; if my question still seems unclear, please let me know.
My code looks like this.
var countGetFile = webBrowser1.Document.GetElementsByTagName("IFRAME");
foreach (HtmlElement l in countGetFile)
{
if (l.GetAttribute("width").Equals("100%"))
{
MessageBox.Show(l.GetAttribute("height").ToString());
MessageBox.Show(l.GetAttribute("outerText").ToString());
}
}
I was not able to grab information two levels down, under the second #document in the HTML.
The HTML looks something like this (two nested iframes, each with its own #document, and the target element at the third level):
...
<iframe src="..." id="A" ... >
  #document
    ...
    <iframe src="..." id="B" ... >
      #document
        ...
        <span ... >  <!-- the element ("C") I want -->
        ...
I could grab the span information (at the third, innermost level) with code like this:
HtmlWindow frame1 = webBrowser1.Document.GetElementById("A").Document.Window.Frames["A"];
HtmlWindow frame2 = frame1.Document.GetElementById("B").Document.Window.Frames["B"];
foreach (HtmlElement elm in frame2.Document.All)
{
if (elm.GetAttribute("tagName").Equals("C"))
{
// your command
}
}
To use Document.Window.Frames you need the header using System.Windows.Forms; (HtmlWindow and the Frames collection live there, not in System.Collections).
By the way, there is a problem: when I try to access the information at the third level, I need to do some kind of work between frame1 and frame2, such as delaying, so that frame2 has enough time to become accessible after frame1.
I figured out a kind of hack to get through it: pop up a MessageBox for a short delay, or use a delay function (that does not freeze the UI) with async code like this:
async Task PutTaskDelay()
{
await Task.Delay(5000);//5 secs
}
For now I have only found this temporary workaround for reaching the second level; I would appreciate hearing from anyone who knows a proper way to solve this problem.
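The message box and Task.Delay are both standing in for a more general poll-until-ready pattern: retry a readiness check until it passes or a timeout expires, instead of sleeping a fixed 5 seconds. A generic sketch in Java (the readiness condition is whatever "frame2 is loaded" means for your page):

```java
import java.util.function.BooleanSupplier;

public class WaitFor {
    // Poll until the condition holds or the timeout elapses; true if it held.
    static boolean until(BooleanSupplier ready, long timeoutMillis, long pollMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (ready.getAsBoolean()) return true;
            Thread.sleep(pollMillis);
        }
        return ready.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Simulate a frame that becomes ready after roughly 200 ms.
        boolean ok = until(() -> System.currentTimeMillis() - start > 200, 5000, 50);
        System.out.println(ok); // true
    }
}
```

This bounds the wait (no permanent freeze on a page that never loads) while returning as soon as the frame is actually ready, rather than always paying the full delay.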
I would like to know how I can get the replies to a tweet.
I am not quite sure whether this can be accomplished by using a trend, or by passing a different API URL in an options file to the Retweets method; I don't know how to do it offhand, so any assistance will be well received.
To solve this, you need to do a Search:
TwitterResponse<TwitterSearchResultCollection> replies = TwitterSearch.Search(tokens, "term", options);
And loop through the results:
foreach (var reply in replies.ResponseObject)
{ }
Be sure to use:
if (reply.InReplyToScreenName != null && reply.InReplyToScreenName.ToLower().Equals("term")) { }
to get the replies to the right user (the one that you searched for).
"term" is to be replaced by the screen name that you are looking for, e.g. rodbh08.
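Stripped of the Twitterizer types, the filtering step is just a null-safe, case-insensitive comparison on the in-reply-to screen name. In this Java sketch the Reply record is a hypothetical stand-in for TwitterStatus:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ReplyFilter {
    // Hypothetical stand-in for Twitterizer's TwitterStatus.
    record Reply(String text, String inReplyToScreenName) {}

    // Keep only the results that are replies to the given screen name.
    static List<Reply> repliesTo(List<Reply> results, String screenName) {
        return results.stream()
                .filter(r -> r.inReplyToScreenName() != null
                        && r.inReplyToScreenName().equalsIgnoreCase(screenName))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Reply> results = List.of(
                new Reply("hi!", "rodbh08"),
                new Reply("unrelated", null),
                new Reply("hello", "someoneElse"));
        System.out.println(repliesTo(results, "rodbh08").size()); // 1
    }
}
```

The null check matters because search results that are not replies carry no in-reply-to screen name at all.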
I'm trying to read the value of a session ID that is served up to a client page (a PIN that can then be given to other users who want to join the session), which, according to Chrome developer tools, is located within this element:
<input type="text" size="18" autocomplete="off" id="idSession" name="idSession" class="lots of stuff here" title="">
So far I've been using C# and XPath to navigate around the site successfully for testing purposes, but I just can't get hold of the PIN that is generated within id="idSession", or via any other identifier, through XPath. There's a bunch of jQuery going on in the background, but it isn't showing up there either (the code knows about the on-screen locations of the ID in the .js files, but that's it).
I'm new to all of this, so I would really appreciate a nudge in the right direction, i.e. what tools I need for this, what I am missing, and what I need to read up on.
Thanks a lot.
What about //input[@id='idSession']/@value to get the content?
Also, here is a link to a helper library for creating XPath using a LINQ-esque syntax:
var xpath = CreateXpath.Where(e => e.TargetElementName == "input" &&
e.Attribute("id").Text == "idSession").Select(e => e.Attribute("value"));
http://unit-testing.net/CurrentArticle/How-to-Create-Xpath-From-Lambda-Expressions.html
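For what it's worth, the same expression can be evaluated with Java's built-in javax.xml.xpath, provided the markup is well-formed XML; the fragment below is a made-up stand-in for the real page:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class SessionId {
    // Parse the markup and evaluate the attribute-selecting XPath against it.
    static String readSessionId(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("//input[@id='idSession']/@value", doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<form><input type=\"text\" id=\"idSession\" value=\"1234-5678\"/></form>";
        System.out.println(readSessionId(xml)); // 1234-5678
    }
}
```

Note that if the PIN is injected by jQuery after the page loads, it will not be present in the statically served HTML at all, which matches what the question describes; in that case you need something that actually executes the page's JavaScript.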
I know that this question has been raised several times, and I have even read most of the questions on the topic.
But there has been a gap of about a month since then, and I'd like to know whether there has been any progress on changing the Timeline cover picture from the API (via an app).
Do you know any new information about this?
Or is there a solution out there? (except for uploading to cover album or profile album)
Example C# code would be excellent.
Thanks in advance.
You can dynamically build the URL by appending the profile user name, then fetch it:
string ProfileURL = string.Empty;
ProfileURL = "http://graph.facebook.com/" + username + "/picture";
Here username will be your profile name.