Finding links in Google source code with Regex

I'm trying to use Regex to grab the links to the 10 websites Google shows on the first page of results when you search for something. I'm quite new to Regex and am having a lot of trouble getting this to work:
MatchCollection links = Regex.Matches(indexPage, @"<h3 class=""r""><a href=""\s*(.+?)\s*"" class=l", RegexOptions.Multiline);
Once I have the links in a collection, I add them to a list here:
foreach (Match link in links) {
    string result = link.Groups[1].Value;
    results.Add(result);
}
It isn't finding any links. Any help would be great, thanks.

This finds all URLs:
"#^((?#
the scheme:
)(?:https?://)(?#
second level domains and beyond:
)(?:[\S]+\.)+((?#
top level domains:
)MUSEUM|TRAVEL|AERO|ARPA|ASIA|EDU|GOV|MIL|MOBI|(?#
)COOP|INFO|NAME|BIZ|CAT|COM|INT|JOBS|NET|ORG|PRO|TEL|(?#
)A[CDEFGILMNOQRSTUWXZ]|B[ABDEFGHIJLMNORSTVWYZ]|(?#
)C[ACDFGHIKLMNORUVXYZ]|D[EJKMOZ]|(?#
)E[CEGHRSTU]|F[IJKMOR]|G[ABDEFGHILMNPQRSTUWY]|(?#
)H[KMNRTU]|I[DELMNOQRST]|J[EMOP]|(?#
)K[EGHIMNPRWYZ]|L[ABCIKRSTUVY]|M[ACDEFGHKLMNOPQRSTUVWXYZ]|(?#
)N[ACEFGILOPRUZ]|OM|P[AEFGHKLMNRSTWY]|QA|R[EOSUW]|(?#
)S[ABCDEGHIJKLMNORTUVYZ]|T[CDFGHJKLMNOPRTVWZ]|(?#
)U[AGKMSYZ]|V[ACEGINU]|W[FS]|Y[ETU]|Z[AMW])(?#
the path, can be there or not:
)(/[a-z0-9\._/~%\-\+&\#\?!=\(\)#]*)?)$#i"
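
For the pattern in the question, a minimal sketch of how it behaves on markup that does match it. The sample HTML below is made up for illustration; the markup Google actually serves changes often and may look nothing like this, in which case the pattern finds nothing.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sample markup (an assumption); real result pages will differ.
string indexPage =
    "<h3 class=\"r\"><a href=\"http://example.com/a\" class=l>First</a></h3>\n" +
    "<h3 class=\"r\"><a href=\" http://example.org/b \" class=l>Second</a></h3>";

// Verbatim string (@"..."): a doubled "" stands for one literal quote.
var pattern = @"<h3 class=""r""><a href=""\s*(.+?)\s*"" class=l";

var results = new List<string>();
foreach (Match link in Regex.Matches(indexPage, pattern))
{
    // Group 1 is the URL; the \s* parts absorb surrounding whitespace.
    results.Add(link.Groups[1].Value);
}

Console.WriteLine(string.Join(Environment.NewLine, results));
```

If the list stays empty on a real page, it can help to dump indexPage and compare it character by character with the literal parts of the pattern.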

Related

C# Selenium GetElements get error data

Sorry, my English is not good.
I use Selenium to get data from the web.
Here is my code:
var workGroups = e.WebDriver.FindElements(By.XPath("//div[@class='workgroup']"));
Console.WriteLine($"Item List: {workGroups.Count} Items");
foreach (var workgroup in workGroups)
{
    string workName = workgroup.FindElement(By.XPath("//div[@class='worktitle']/label")).Text;
    var detail = workgroup.FindElements(By.XPath("//div[@class='col-4 high']"));
    Console.WriteLine($"Item Name: {workName}, Number of Pictures: {detail.Count}");
}
And this is the result (screenshot not preserved):
For every item it seems to return the first item's name and the pictures of all items.
I use chromedriver.
I don't know what is wrong.
Please help me, brothers and sisters.
Thank you very much.

Try using:
string workName = workgroup.FindElement(By.XPath("./div[@class='worktitle']/label")).Text;
var detail = workgroup.FindElements(By.XPath("./div[@class='col-4 high']"));
I haven't tested this, but since you are calling FindElement on the workgroup element, I assume you only want elements inside that workgroup. For that you need the current-node notation (./) instead of the root notation (//). An XPath that starts with // searches from the root node of the HTML document, so it goes through the entire document no matter which element you call it on. Note that ./ only matches direct children of workgroup; if the target elements are nested deeper, .// searches all descendants.
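
The difference between the two notations can be sketched without a browser. This example uses System.Xml instead of Selenium, on the assumption that both follow the same XPath 1.0 rule for //; the markup is hypothetical and only mirrors the class names from the question.

```csharp
using System;
using System.Xml;

// Hypothetical markup mirroring the class names in the question.
var doc = new XmlDocument();
doc.LoadXml(
    "<body>" +
    "<div class='workgroup'><div class='worktitle'><label>A</label></div>" +
    "<div class='col-4 high'/><div class='col-4 high'/></div>" +
    "<div class='workgroup'><div class='worktitle'><label>B</label></div>" +
    "<div class='col-4 high'/></div>" +
    "</body>");

var workGroups = doc.SelectNodes("//div[@class='workgroup']");
var second = workGroups[1];

// "//" starts at the document root even when called on a child node.
int fromRoot = second.SelectNodes("//div[@class='col-4 high']").Count;
// ".//" starts at the context node and searches its descendants.
int fromNode = second.SelectNodes(".//div[@class='col-4 high']").Count;
// "./" only looks at direct children of the context node.
string title = second.SelectSingleNode("./div[@class='worktitle']/label").InnerText;

Console.WriteLine($"{title}: fromRoot={fromRoot}, fromNode={fromNode}");
```

Here the second workgroup has one picture, but the // query counts the pictures of both workgroups, which is the symptom described in the question.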

I need to strip a Google Alerts URL

To preface, I know there are similar threads about this, but I am using C#, not Java, Python, or PHP. Some threads provide a solution for a single URL, which is not universal. Thanks for not flagging me.
So I am using Google Alerts to get links to articles via email. I have already written a program that can strip the URLs out of the email as well as another program to scrape the websites. My issue is that the links in the google alerts email look like this:
https://www.google.com/url?rct=j&sa=t&url=http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung. Yeah, ugly.
Because these links redirect to the actual article through Google, my scraping program does not work on them. I have tried a million different regexes from questions here and other sources. I managed to strip off everything up to the http:// of the actual article, but the tail end is still attached and breaks it. The links now look like this, followed by the code I have so far:
http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung
private List<string> GetLinks(string message)
{
    List<string> list = new List<string>();
    Regex urlRx = new Regex(@"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)", RegexOptions.IgnoreCase);
    MatchCollection matches = urlRx.Matches(message);
    foreach (Match match in matches)
    {
        // Skip Google's own news and alert-management links.
        if (!match.ToString().Contains("news.google.com/news") && !match.ToString().Contains("google.com/alerts"))
        {
            // Drop everything up to and including the "=" before the embedded article URL.
            string find = "=http";
            int ind = match.ToString().IndexOf(find);
            list.Add(match.ToString().Substring(ind + 1));
        }
    }
    return list;
}
Some help getting rid of the endings would be awesome, be it a new RegEx or some extra code. Thanks in advance.

You can use HttpUtility.ParseQueryString to retrieve the url part of the query string. It is located in the System.Web namespace (reference required).
var uri = new Uri("https://www.google.com/url?rct=j&sa=t&url=http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html&ct=ga&cd=CAEYACoTOTc2NjE4NjYyNzMzNzc3NDcyODIaODk2NWUwYzRjMzdmOGI4Nzpjb206ZW46VVM&usg=AFQjCNGyK2EyVBLoKnNkdxIBDf8a_B3Ung");
var queries = HttpUtility.ParseQueryString(uri.Query);
var foxNews = queries["url"]; //http://www.foxnews.com/health/2016/08/19/virtual-reality-treadmills-help-prevent-falls-in-elderly.html
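
A minimal sketch of how this could be wrapped for use on every extracted link. UnwrapRedirect is a hypothetical helper name, and the sample URLs are made up; the only assumption about the redirect format is that the article address sits in the url query parameter, as in the question.

```csharp
using System;
using System.Web;

// Hypothetical helper: returns the "url" query parameter when present,
// otherwise the link unchanged.
static string UnwrapRedirect(string link)
{
    if (!Uri.TryCreate(link, UriKind.Absolute, out Uri uri))
        return link; // not an absolute URL, leave it alone

    string target = HttpUtility.ParseQueryString(uri.Query)["url"];
    return string.IsNullOrEmpty(target) ? link : target;
}

// Made-up sample links for illustration.
string wrapped = "https://www.google.com/url?rct=j&sa=t&url=http://example.com/article.html&ct=ga";
Console.WriteLine(UnwrapRedirect(wrapped));
Console.WriteLine(UnwrapRedirect("http://example.com/plain"));
```

Falling back to the original link means the helper can be applied to every match from GetLinks, whether or not it is a redirect.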

How to get replies with twitterizer C#

I would like to know how I can get the replies to a tweet.
I am not sure whether this could be accomplished by using a trend, or maybe by passing a different API URL in an options object to the Retweets method. I don't know offhand how to do it; any assistance would be appreciated.

To solve this, you need to do a Search:
TwitterResponse<TwitterSearchResultCollection> replies = TwitterSearch.Search(tokens, "term", options);
And loop through the results:
foreach (var reply in replies.ResponseObject)
{ }
Inside the loop, make sure to check:
if (reply.InReplyToScreenName != null && reply.InReplyToScreenName.ToLower().Equals("term")) { }
This keeps only the replies to the right user (the one you searched for).
Replace "term" with the ScreenName you are looking for, e.g. @rodbh08.

Regex Issue in C#

I am trying to create a C# routine that removes all of the following prefixes and suffixes and returns just the root word of a domain:
var stripChars = new List<string> { "http://", "https://", "www.", "ftp.", ".com", ".net", ".org", ".info", ".co", ".me", ".mobi", ".us", ".biz" };
I do this with the following code:
originalDomain = stripChars.Aggregate(originalDomain, (current, repl) => Regex.Replace(current, repl, @"", RegexOptions.IgnoreCase));
Which seems to work in almost all cases. Today, however, I discovered that setting "originalDomain" to "NameCheap.com" does not return:
NameCheap
Like it should, but rather:
NCheap
Can anyone look at this and tell me what is going wrong? Any help would be appreciated.

This is normal: the dot in a regex means any character.
Therefore, .me matches ame in NameCheap.
Escape the dots with a backslash.
Also, you'd be better off using a dedicated URI API for this kind of operation.
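
A minimal sketch of the escaping fix, using Regex.Escape so the dots do not have to be escaped by hand. The variable names broken and plain are only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

var stripChars = new List<string> { "http://", "https://", "www.", "ftp.", ".com", ".net", ".org", ".info", ".co", ".me", ".mobi", ".us", ".biz" };

// Unescaped: "." matches any character, so ".me" also matches "ame".
string broken = stripChars.Aggregate("NameCheap.com",
    (current, repl) => Regex.Replace(current, repl, "", RegexOptions.IgnoreCase));

// Escaped: Regex.Escape turns "." into "\." so only a literal dot matches.
string plain = stripChars.Aggregate("NameCheap.com",
    (current, repl) => Regex.Replace(current, Regex.Escape(repl), "", RegexOptions.IgnoreCase));

Console.WriteLine(broken); // NCheap
Console.WriteLine(plain);  // NameCheap
```

Escaping fixes the reported case, but the patterns are still unanchored, so a token such as ".co" is removed wherever it appears in the string, not only at the end.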

I know this doesn't answer your question directly, but given the specific task you are trying to accomplish I would recommend trying something like this:
Uri uri = new Uri(originalDomain);
originalDomain = uri.Host;
EDIT:
If your input may not contain a scheme, you can use UriBuilder, as noted in this post:
var hostName = new UriBuilder(input).Host;
Hope this helps.
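
A minimal sketch of the UriBuilder approach. GetHost is a hypothetical helper name and the inputs are examples. Note that the host still contains the www. prefix and the top-level domain, so this only replaces the scheme and path stripping, not the whole routine.

```csharp
using System;

// Hypothetical helper: returns the host for inputs with or without a scheme.
static string GetHost(string input)
{
    // UriBuilder assumes "http://" when the string has no scheme,
    // whereas new Uri(input) would throw a UriFormatException.
    return new UriBuilder(input).Host;
}

Console.WriteLine(GetHost("NameCheap.com"));
Console.WriteLine(GetHost("https://www.example.com/some/path"));
```

Host names are case-insensitive and Uri may normalize their casing, so compare the result without regard to case.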

how can I use js/coffee to screen scrape an asp page?

I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.
I'd like to just screen scrape it and use js (coffeescript) to automate that. I wonder if this is possible. I could do this with C# and linqpad but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus if I do it with js or coffeescript I'll get much more comfortable with those languages and I'll be able to use jQuery for pulling elements out of the DOM.
I see two possibilities here:
use C# and find a library that will do things like Jquery but in C# code
use coffeescript (js) and use jquery to find the elements that I'm looking for in the page
I'd also like to automate the page a bit (get next set of results). This is strictly for personal use -- I'm not pulling results of someone's search to use in my business. I just want to make a crappy search engine do what I want.

I wrote a class that allows you to supply a bunch of urls and a code block to scrape pages inside a chrome extension. You can find the github repo here: https://github.com/jkarmel/Executor. It could use some more testing and I need to work on the documentation, but it looks like it might be what you are looking for.
Here is how you would use it to get all the links from a few different pages:
/*
 * background.js by Jeremy Karmel.
 */
URLS = ['http://www.apple.com/',
        'http://www.google.com/',
        'http://www.facebook.com/',
        'http://www.stanford.edu'];

// Function that will be provided to the executor to collect information
var getLinks = function() {
  var links = [];
  var $links = $('a');
  $links.each(function(i, val) { links.push(val.href); });
  // Send the collected hrefs back, keyed by the page they came from
  var request = {data: links, url: window.location.href};
  chrome.extension.sendRequest(request);
};

var main = function() {
  var specForUsersTopics = {
    urls : URLS,
    code : getLinks,
    callback : function(results) {
      for (var url in results) {
        console.log(url + ' has ' + results[url].length + ' links.');
        var links = results[url];
        for (var i = 0; i < links.length; i++)
          console.log(' ' + links[i]);
      }
      console.log('all done!!!!');
    }
  };
  var exec = Executor(specForUsersTopics);
  exec.start();
};

main();
So basically the code to collect the links is supplied to the Executor instance, and then you do whatever you want with the results in the callback. It can deal with longish lists of URLs (~1000) and it will work on more than one at a time (default == 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.

I'm liking Curtain A) "use C# and find a library..."
"HTML Agility Pack" might be just what you're looking for:
http://htmlagilitypack.codeplex.com/

You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).
