Finding Link Text with Regular Expressions


Team:
I need some help with some regular expressions. The goal is to be able to identify three different ways that users might express links in a note, and those are as follows.
<a href="http://www.msn.com">MSN</a>
possibilities
http://www.msn.com OR
https://www.msn.com OR
www.msn.com
Then by being able to find them I can change each one of them to real A tags as necessary. I realize the first example is already an A tag but I need to add some attributes to it specific to our application -- such as TARGET and ONCLICK.
Now, I have regular expressions that can find each one of those individually, and those are as follows, respective to the examples above.
<a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>
(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?
But the problem is that I can't run all of them on the string because the second one will match a part of the first one and the third one will match a part of both the first and second. At any rate -- I need to be able to find the three permutations distinctly so I can replace each one of them individually -- because the third expression for example will need http:// added to it.
I look forward to everybody's assistance!

Assuming that the link starts and ends either with whitespace or at the beginning/end of the line (or sits inside an existing A tag), I came up with the following code, which also includes some sample texts:
string regexPattern = "((?:<a (?:.*?)href=\")|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:\\S+?))+?)((?:\"(?:.*?)>(.*?)</a>)|\\s|$)";
string[] examples = new string[] {
"some text MSN more text",
"some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
"some text http://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
"some text https://www.msn.com/path/file?page=some.page&subpage=9#jump more text",
"some text www.msn.com/path/file?page=some.page&subpage=9#jump",
"www.msn.com/path/file?page=some.page&subpage=9#jump more text"
};
Regex re = new Regex(regexPattern);
foreach (string s in examples) {
MatchCollection mc = re.Matches(s);
foreach (Match m in mc) {
string prePart = m.Groups[1].Value;
string actualLink = m.Groups[2].Value;
string postPart = m.Groups[3].Value;
string linkText = m.Groups[4].Value;
MessageBox.Show(" prePart: '" + prePart + "'\n actualLink: '" + actualLink + "'\n postPart: '" + postPart + "'\n linkText: '" + linkText + "'");
}
}
As this code uses groups with numbers it should be possible to use the regular expression in JavaScript too.
Depending on what you need to do with the existing A tag you need to parse the particular first group as well.
Update:
Modified the regex as requested so that the link Text becomes group no. 4
Update 2:
To better catch malformed links you might try this modified version:
pattern = "((?:<a (?:.*?)href=\"?)|^|\\s)((?:http[s]?://)?(?:\\S+)(?:\\.(?:[^>\"\\s]+))+)((?:\"?(?:.*?)>(.*?)</a>)|\\s|$)";

Well, if we want to do it in a single pass, you could create named groups for each scenario:
(?<full><a?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*)/?>.*</a>)|
(?<url>(http|https)://[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)|
(?<www>[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)
Then you would have to check which was the matched group:
Match match = regex.Match(input); // input is the text to scan; the pattern was passed to the Regex constructor
if (match.Success)
{
if (match.Groups["full"].Success)
Console.WriteLine(match.Groups["full"].Value);
else if (match.Groups["url"].Success)
....
}
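The single-pass idea can be sketched with Regex.Replace and a match evaluator, which decides per match what to emit. This is only a sketch under assumptions: the three patterns are simplified stand-ins for the ones above, the LinkRewriter class and its Rewrite method are hypothetical names, and existing anchors are returned unchanged rather than given extra attributes.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriter
{
    // Simplified stand-ins for the three patterns (assumptions, not the originals).
    // Alternatives are tried left to right at each position, so a complete
    // anchor is consumed before its inner URL can be matched on its own.
    static readonly Regex Links = new Regex(
        @"(?<full><a\s[^>]*>.*?</a>)" +
        @"|(?<url>https?://[^\s<>""]+)" +
        @"|(?<www>\bwww\.[^\s<>""]+)",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Rewrite(string note)
    {
        return Links.Replace(note, m =>
        {
            if (m.Groups["full"].Success)
                return m.Value; // existing anchor: left untouched in this sketch
            if (m.Groups["url"].Success)
                return "<a href=\"" + m.Value + "\" target=\"_blank\">" + m.Value + "</a>";
            // bare www host: prepend the scheme for the href
            return "<a href=\"http://" + m.Value + "\" target=\"_blank\">" + m.Value + "</a>";
        });
    }

    static void Main()
    {
        Console.WriteLine(Rewrite("see www.msn.com or https://www.msn.com/news"));
    }
}
```

Because each match belongs to exactly one named group, the second and third patterns never fire on text that the first has already consumed, which is the overlap problem described in the question.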

Related

How do I search for a text in a string and grab everything after the search text until it reaches a character?

I am searching a huge html document on the web that will have multiple instances of names. Each section throughout the page source will contain something like this
{"keyword_text":"kathy smith","item_logging_id":"2021-05-16:yakMrD","item_logging_info":"{"source":"entity_bootstrap_connected_user_suggestion",{"keyword_text":"courtney lee","item_logging_id":"2021-05-16:lX1LC2","item_logging_info":"{"source":"entity_bootstrap_connected_user_suggestion",
I want to grab all the names in the source and put them into a text box.
Search the string for "keyword_text":" then grab all text after until it reaches " excluding the "
I want the end result to be
kathy smith
courtney lee

Considering the "until it reaches the character" you can use the regex ([^\"]*)
Inside a character class, ^ means NOT, and * means zero or more times. So [^\"]* reads everything up to (but not including) the first appearance of ". The \ is there to escape the quote inside a C# string.
So in your case this is the regex:
\"keyword_text\":\"([^\"]*) to get the name-part without quotes.
And in c# context:
var matches= new Regex("\"keyword_text\":\"([^\"]*)").Matches(yourInputText);
foreach (Match match in matches)
{
Console.WriteLine(match.Groups[1].Value);
}
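To put the names somewhere other than the console, the same pattern can be wrapped in a small helper that returns a list. A minimal sketch: KeywordExtractor and ExtractNames are hypothetical names, and the sample string is a shortened version of the fragment from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class KeywordExtractor
{
    // Collects the text that follows "keyword_text":" up to the next quote.
    public static List<string> ExtractNames(string source)
    {
        var names = new List<string>();
        foreach (Match m in Regex.Matches(source, "\"keyword_text\":\"([^\"]*)"))
            names.Add(m.Groups[1].Value);
        return names;
    }

    static void Main()
    {
        string sample = "{\"keyword_text\":\"kathy smith\",\"item_logging_id\":\"x\"," +
                        "{\"keyword_text\":\"courtney lee\",\"item_logging_id\":\"y\",";
        // Prints one name per line: kathy smith, then courtney lee
        Console.WriteLine(string.Join(Environment.NewLine, ExtractNames(sample)));
    }
}
```

The joined string could be assigned to a text box's Text property instead of being written to the console.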

string str = "Would \"you\" like to have responses to your \"questions\" sent to you via email?";
var reg = new Regex("\".*?\"");
var matches = reg.Matches(str);
foreach (var item in matches)
{
MessageBox.Show(item.ToString());
}

How to find the third element value using Regex

All, I am currently trying to parse each element that has the format below using regex and C# to find the value inside the () below. For example, I would like to extract 2002_MAX_ALLOW_DATE. Note that not all the names in here will be alphanumeric.
I initially have the pattern: Regex regex = new Regex(@"(\w\d\d\d.[A-Z])\w+");
However, this only returns names that contain digits.
Following a reply, I tried the approach below and am trying to format it so that I do not get a syntax error; I would also prefer not to change the regex query.
Can someone please assist me in finding the name located in the third position, for example in this,'46032','46032','2002_MAX_ALLOW_DATE'
<button class="longlist-cb longlist-cb-yes" id="cb46032"
onclick="$ll.CATG.toggleCb(this,'46032','46032','2002_MAX_ALLOW_DATE')"
</button>

Please try this
Regex rex = new Regex("'[^']+','[^']+','(?<ThirdElement>[^']+)'");
String data = "'46032','46032','2002_MAX_ALLOW_DATE'";
Match match = rex.Match(data);
Console.WriteLine(match.Groups["ThirdElement"]); // Output: 2002_MAX_ALLOW_DATE
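Since the pattern is not anchored, it can also be run directly against the whole button markup. The sketch below is a variation, not the answer's exact code: it allows optional whitespace around the commas (the markup appears both with and without it), and ThirdArgument and Find are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class ThirdArgument
{
    // Three single-quoted values in a row; the third one is captured by name.
    static readonly Regex Quoted = new Regex(
        @"'[^']+'\s*,\s*'[^']+'\s*,\s*'(?<ThirdElement>[^']+)'");

    // Returns the third quoted value, or null when the markup has no such triple.
    public static string Find(string html)
    {
        Match m = Quoted.Match(html);
        return m.Success ? m.Groups["ThirdElement"].Value : null;
    }

    static void Main()
    {
        string button = "<button id=\"cb46032\" " +
            "onclick=\"$ll.CATG.toggleCb(this,'46032','46032','2002_MAX_ALLOW_DATE')\"></button>";
        Console.WriteLine(Find(button)); // prints 2002_MAX_ALLOW_DATE
    }
}
```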

SECOND EDIT:
I've written some code that provides all the elements inside the onclick as capture groups:
Regex regex = new Regex("onclick=\"\\$ll.CATG.toggleCb\\((.*),\\s?(.*),\\s?(.*),\\s?(.*)\\)");
string x = "<button class=\"longlist - cb longlist - cb - yes\" id=\"cb46032\" onclick=\"$ll.CATG.toggleCb(this, '46032', '46032', '2002_MAX_ALLOW_DATE')\"></button>";
Match match = regex.Match(x);
if (match.Success)
{
Console.WriteLine("match.Value returns: " + match.Value);
foreach (Group y in match.Groups)
{
Console.WriteLine("the current capture group: " + y.Value);
}
}
else
{
Console.Write("No match");
}
Console.ReadKey();
will print:
EDIT: After trying with VS, this worked for me: Regex regex = new Regex("onclick=\"\\$ll.CATG.toggleCb\\((.*),.*,.*,.*\\)");
ORIGINAL ANSWER:
If you were to use Regex regex = new Regex(@"onclick=""\$ll.CATG.toggleCb\(.*,.*,(.*),.*\)"); on your provided text, the capture group should return '46032'.
You could alter this regex by moving the capturing ( and ) to a different .* to capture, say, the fourth element, like this: onclick="\$ll.CATG.toggleCb\((.*),.*,.*,.*\) would capture this.

Why not read just the value of the onclick attribute instead of the whole HTML of the button? Working on the full HTML makes the problem more complex than it needs to be.
String.Split can solve this simply, even though you chose to use a regex:
the_button_element.GetAttribute("onclick").Split(',')[3]
Or use a regex:
new Regex(@".*?,'(\w+)'\)$")

How do I remove url from text

I have this sample texts like
EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL
How do I remove urls http://t.co/Wpwj0R1EQm, http://t.co/yPZXvNnugL etc from text. I need to perform sentiment analysis and want clean words.
I am able to get rid of bad characters using simple regex.
The pattern is to remove http://t.co/{Whatever-first-word}

Regular Expressions are your friend.
Simplifying your requirement to "remove all URLs in a given string": if we accept that a URL is anything that starts with http and runs up to the next whitespace (URLs cannot contain spaces), then something like the following should suffice. This regex finds any run of non-whitespace characters that starts with http (which also catches https) and replaces it with an empty string.
string text = "EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";
string cleanedText = Regex.Replace(text, @"http[^\s]+", "");
//cleanedText is now "EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay  Tibet snake.... " (the spaces around each removed URL remain)
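For sentiment analysis the leftover whitespace is usually unwanted as well. A minimal sketch that removes the URLs and then collapses the gaps; UrlStripper and RemoveUrls are hypothetical names, and the stricter https?:// prefix is a small variation on the pattern above.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlStripper
{
    public static string RemoveUrls(string text)
    {
        // Drop anything that starts with http:// or https:// up to the next whitespace.
        string withoutUrls = Regex.Replace(text, @"https?://\S+", "");
        // Collapse the whitespace runs left behind and trim both ends.
        return Regex.Replace(withoutUrls, @"\s{2,}", " ").Trim();
    }

    static void Main()
    {
        string text = "EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay " +
                      "http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL";
        // Prints: EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay Tibet snake....
        Console.WriteLine(RemoveUrls(text));
    }
}
```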

text = Regex.Replace(text, @"((http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)", "");
The pattern above will match a URL like you want, for example
http://this.com/ah.aspx?id=1
in:
this is a url http://this.com/ah.aspx?id=1 sdfsdf
You can see this in action in a regex fiddle for it.

You can use this function https://stackoverflow.com/a/17253735/2577248
Step1. sub = the substring between "http://" and " " (white space)
Step2. Replace "http://" + sub with @""
Step3. Repeat until the original string no longer contains any "http://t.co/any"
string str = @"EA SPORTS UFC (Microsoft Xbox One, 2014) $40.00 via eBay http://t.co/Wpwj0R1EQm Tibet snake.... http://t.co/yPZXvNnugL" + " ";
while(str.Contains("http://")){
// Substring(from, to) is the helper from the linked answer; string has no built-in overload taking two strings
string removedStr = str.Substring("http://", @" ");
str = str.Replace("http://" + removedStr , @"");
}

Regex.Replace
And I would try this pattern, written here in .NET syntax (no delimiters, \uXXXX escapes, case-insensitivity passed as a RegexOptions flag) and without ^ and $ anchors so that it can match URLs inside longer text:
var regex_url_pattern = @"(?:(?:https?|ftp)://)(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:/[^\s]*)?";
Combined:
string output = Regex.Replace(input, regex_url_pattern, "", RegexOptions.IgnoreCase);

Regular expressions in C# to detect more than one space between words

I need to use regular expressions in C# to validate a TextBox. I need to add this rule in the ValidationExpression of a validation control in ASP.NET. The regular expression should not allow this:
more than ONE space between words
considering also the beginning and end of the string
Any ideas?

No need for regexps.
if (s.StartsWith(" ") || s.EndsWith(" ") || s.Contains("  ")) throw...

If you are limited to validating a string, try the following pattern (example):
^[ ]?([^ ]+[ ])*[^ ]*$
It doesn't allow strings with two spaces anywhere in the string. This pattern ignores tabs and newlines, by the way. I've picked [ ] so you can see the spaces, but a simple space is the same. \s may not be right for you. For one, it might match a windows new line, \r\n.
Similarly, you can use a negative lookahead (example):
^(?!.*[ ]{2})
If you're using a client side validator you need to match from start to end, so use the pattern (?!.*[ ]{2}).*. It implicitly adds ^...$ around your pattern.
Either way, consider using a custom validator and writing a simple line of code that checks for two consecutive spaces. Here's how it's done. First, look at the documentation, then add a JavaScript function to your page:
function noTwoSpaces(source, arguments) {
arguments.IsValid = (arguments.Value.indexOf('  ') == -1);
}
Next, add the CustomValidator control to use it:
<asp:CustomValidator ID="CustomValidator1" runat="server"
ControlToValidate="TextBox1" Display="Dynamic" ErrorMessage=":-("
ClientValidationFunction="noTwoSpaces"></asp:CustomValidator>
And that's it. Much easier than an elusive regex.

I believe you want:
if (myText.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length !=
myText.Split(" ").Length)
{
//String contains multiple spaces
}
As you've now said you DO want regexes I'd use
@"\s\s+"
to match two or more whitespaces or:
@"\ \ +"
to match two or more spaces.

var rex = new Regex(@"\s{2,}");
Edit: Didn't see the start/end of the string.
string poo = "a b";
string poo1 = "a  b";
string poo2 = "a   b";
string poo3 = "a    b";
string poo4 = " a b";
string poo5 = "a b ";
string poo6 = " a b ";
string poo7 = " a b ";
var rex = new Regex(@"^\s{0}.\s{0,1}.\s{0}$");
Console.WriteLine(rex.IsMatch(poo));
Console.WriteLine(rex.IsMatch(poo1));
Console.WriteLine(rex.IsMatch(poo2));
Console.WriteLine(rex.IsMatch(poo3));
Console.WriteLine(rex.IsMatch(poo4));
Console.WriteLine(rex.IsMatch(poo5));
Console.WriteLine(rex.IsMatch(poo6));
Console.WriteLine(rex.IsMatch(poo7));
This returns:
True
False
False
False
False
False
False
False
Since the only valid string is the first one.

If you want to use regex, use this:
 {2,}
(NB: There is a space before the brackets).

How about something like
string s = " tada bling zap ";
Regex reg = new Regex(@"\s\s+");
MatchCollection m = reg.Matches(s);

Can be done without regex, but imho it's less obvious to not use regex:
\s{2,} or \s\s+ if .NET doesn't support the match-min-max syntax (I don't recall).

For beginning and ending no white spaces ^[^\s](.*\s.*)+[^\s]$
Mary bought a shoe match
MarySqeezeIntoAshoe no match
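The two requirements from the question (no repeated spaces, and nothing extra at the start or end) can also be combined in one anchored pattern. This is one possible sketch, not taken from any answer above: SpacingValidator and IsValid are hypothetical names, only the plain space character is treated as a separator, and an empty string is rejected.

```csharp
using System;
using System.Text.RegularExpressions;

static class SpacingValidator
{
    // One or more non-space runs, each pair separated by exactly one space,
    // with no space allowed at the very start or end.
    static readonly Regex SingleSpaced = new Regex(@"^[^ ]+(?: [^ ]+)*$");

    public static bool IsValid(string value)
    {
        return SingleSpaced.IsMatch(value);
    }

    static void Main()
    {
        Console.WriteLine(IsValid("Mary bought a shoe"));   // True
        Console.WriteLine(IsValid("Mary  bought a shoe"));  // False: two spaces
        Console.WriteLine(IsValid(" Mary bought a shoe"));  // False: leading space
        Console.WriteLine(IsValid("Mary bought a shoe "));  // False: trailing space
    }
}
```

The same pattern string could be placed in a RegularExpressionValidator's ValidationExpression; unlike the example above, it also accepts a single word with no spaces at all.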

Get the number of an href url parameter from downloaded html page?

I am trying to get an ID from a url parameter inside an href that looks like this:
MyItemName
I want the 71312 only, and at the moment I am trying to do it using regex (but if you have a better approach I would be glad to try):
string html,itemID;
using (var client = new WebClient())
{
html = client.DownloadString("http://www.mysite.com/search.php?search_text=" + myItemName);
}
string pattern = "" + myItemName + "";
Match m = Regex.Match(html, pattern, RegexOptions.IgnoreCase);
if (m.Success)
{
itemID = m.Groups[1].Value;
MessageBox.Show(itemID);
}
Example of the html:
more html body
<h1>Items - List</h1>
<p>MyItemNameTest, MyItemNameTestB, MYItemNameOther</p>
</div>
more html body

To show where your regex went wrong:
. and ? are special characters in regular expressions. . means "any character" and ? means "zero or one occurrences of the previous expression". Therefore your regex fails to match. Also, you need to use verbatim strings in C# (unless you want to escape every backslash):
@"" + myItemName + "";
will probably work.
That said, unless all the links you're examining follow exactly this format, you might run into problems. It's kind of a running gag here on SO that parsing HTML with regular expressions will earn you the wrath of Cthulhu.

Use:
Uri u = new Uri("http://www.mysite.com/myitem.php?id=12313");
string s = u.Query;
string id = HttpUtility.ParseQueryString(s).Get("id");
The variable id now holds the number. Figure out the rest of the function :)
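Wrapped up as a complete program, the Uri approach might look like the sketch below. ItemIdReader and GetId are hypothetical names; HttpUtility lives in the System.Web namespace, so on the .NET Framework the project needs a reference to the System.Web assembly.

```csharp
using System;
using System.Web;

static class ItemIdReader
{
    // Returns the value of the "id" query parameter, or null if it is absent.
    public static string GetId(string absoluteUrl)
    {
        var uri = new Uri(absoluteUrl);
        // uri.Query includes the leading '?', which ParseQueryString tolerates.
        return HttpUtility.ParseQueryString(uri.Query).Get("id");
    }

    static void Main()
    {
        Console.WriteLine(GetId("http://www.mysite.com/myitem.php?id=12313")); // prints 12313
    }
}
```

This still requires pulling the href value out of the downloaded page first; the query-string parsing only replaces the part of the regex that dealt with the id itself.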
