Extract all URLs in a free text block using Regex [duplicate] - C#

This question already has answers here:
Extract Url using Regex
(2 answers)
Closed 8 years ago.
I'm attempting to detect all URLs listed in a free text block. I'm using .NET's Regex.Matches call with the following regex: (http|https)://[^\s "']{4,}
Now, I've put in the following text:
here is a link http://somelink.com
here is a link that I didn't space withhttp://nospacelink.com/something?something=&39358235
http://nospacelink.com/something?something=&12233454
here is a link I already handled.
Here is some secret t&cs you're not allowed to know https://somethingbad.com
Just to be a little annoying I've put in a new address thingy capture type of 'http://somethinginspeechmarks.com' and what are you going to do now?
here is a link http://postTextLink.com at then some post text
Here is a link with a full stop http://alinkwithafullstoplink.com. And then some more.
and I get the following output:
http://somelink.com
http://nospacelink.com?something=&39358235
http://nospacelink.com?something=&12233454
http://alreadyhandledlink.com
https://somethingbad.com
http://somethinginspeechmarks.com
http://postTextLink.com
http://alinkwithafullstoplink.com.
Please notice the full stop on the last entry. How can I update my regex to say "If there is a full stop at the end, please ignore it?"
Also, please note that "Getting parts of a URL (Regex)" has nothing to do with my question, as that question is about how to break down a particular URL. I want to extract multiple, complete URLs. Please see my input and current output for clarification!
I have got a regex already that does most of what I want, but isn't quite right. Could you please explain where my approach might be improved?

I would add something like [^\.] to the end of the pattern.
This says that the last character can't be a full stop.
So (http|https)://[^\s "']{4,}[^\.] will try to match all addresses that don't end with a full stop.
Edit:
As pointed out in the comments, this one should be better, because [^\.] on its own would also accept whitespace or a quote as the final character: [^.\s"']
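A minimal sketch of how that suggestion could be tried out, assuming the final character class is simply appended to the question's pattern. The class name UrlExtractor and the method ExtractUrls are made up for illustration, and the redundant space inside the original character class is dropped because \s already covers it.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class UrlExtractor
{
    // The question's pattern plus one final character that must not be
    // a full stop, whitespace, or a quote. "" is an escaped quote inside
    // a verbatim string.
    private static readonly Regex UrlPattern =
        new Regex(@"https?://[^\s""']{4,}[^.\s""']");

    // Hypothetical helper: returns every URL-like match found in the text.
    public static List<string> ExtractUrls(string text)
    {
        var urls = new List<string>();
        foreach (Match m in UrlPattern.Matches(text))
        {
            urls.Add(m.Value);
        }
        return urls;
    }
}
```

Because the quantifier backtracks, a URL followed by a full stop should be returned without it. Other trailing punctuation, such as a comma, would still be captured by this sketch.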

Updated:
Consider the following minor change to your pattern, which adds a negative lookbehind so that the match cannot end with a full stop:
(http|https)://[^\s "']{4,}(?<!\.)

Related

How to split text file by comments symbols in C# [duplicate]

This question already has answers here:
Extract comments from .cs file
(2 answers)
Closed 1 year ago.
I've been trying for a while to split a code file (treated as a text file) by the comments in it.
For example, for the input:
// Hi guys, I am trying to get some help here.
// I really tried to do this alone.
/* But i still search for help
in our bes friend Google.*/
I expect to get the output:
Hi guys, I am trying to get some help here.
I really tried to do this alone.
But i still search for help in our bes friend Google.
So basically I want to recognize the comments in the file (by the symbols // and /* */) and put them in a list (each comment in a different cell).
I am trying to do so with the code line: codeFile.Split('//', '/*', '*/');
But with no success.
Also, since a multi-line comment is possible with the "/* */" symbols, how can I add the entire string between them to my list, given that I go through the file line by line?
Thanks in advance.
I would do something like this:
Read your file line by line.
Check whether each line matches a regex for each of your criteria.
Build a new string (with line breaks if needed) from what those checks tell you. That lets you handle the multi-line case.
Hope this helps.
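As one possible variation on the steps above, the sketch below runs a single regex over the whole file content instead of going line by line, which sidesteps the multi-line problem. The names CommentExtractor and Extract are hypothetical. It assumes that // and /* never appear inside string literals; a real C# parser would be needed to handle that case correctly.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CommentExtractor
{
    // Matches either a single-line comment (// up to the end of the line)
    // or a block comment (/* ... */, possibly spanning several lines).
    private static readonly Regex CommentPattern = new Regex(
        @"//(?<line>[^\r\n]*)|/\*(?<block>[\s\S]*?)\*/");

    // Hypothetical helper: returns one list entry per comment found.
    public static List<string> Extract(string source)
    {
        var comments = new List<string>();
        foreach (Match m in CommentPattern.Matches(source))
        {
            string body = m.Groups["line"].Success
                ? m.Groups["line"].Value
                : m.Groups["block"].Value;
            // Collapse line breaks and runs of whitespace into single spaces
            // so a multi-line comment ends up as one entry on one line.
            comments.Add(Regex.Replace(body, @"\s+", " ").Trim());
        }
        return comments;
    }
}
```

Calling CommentExtractor.Extract(File.ReadAllText(path)) on the sample input from the question should give three entries, with the block comment joined into a single line.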

Search string for specific characters and embed it with code

My previous question was closed because it was said that a similar question had already been asked and answered. I tried to find that question but couldn't, so I will try to give more details.
I have a file which I am reading line by line.
In every line I am searching for specific phrase: "see {number}." or "See {number}."
For example:
see 112.
See 4.
etc.
The problem here is that I can't hard-code that specific phrase, because I need to find different strings, of which I only know that they start with "see", followed by any number, and end with a dot.
At first I thought it would be best to use some kind of regular expression, but I am a little worried about performance.
Anyway, once I have found the phrase, I need to embed it in markup,
adding a specific tag before and after the string.
So, for example, when I find "see 112." I want to replace it with <link={found_number}>see 112.</link>, for example: <link=112>see 112.</link>
I would be grateful for any suggestions on how to achieve that.
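One way this could be approached is sketched below, assuming the phrase is always "see" or "See", a single space, digits, and a dot. The names SeeLinker and AddLinks are invented for the example. Keeping the Regex in a static field means the pattern is built once and reused for every line, which is usually the main thing to watch for performance.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SeeLinker
{
    // "see" or "See", one space, one or more digits, then a literal dot.
    // RegexOptions.Compiled is optional; it trades startup cost for faster
    // matching when the same pattern is applied to many lines.
    private static readonly Regex SeePattern =
        new Regex(@"\b[Ss]ee (?<number>[0-9]+)\.", RegexOptions.Compiled);

    // Hypothetical helper: wraps every "see N." phrase in a link tag.
    public static string AddLinks(string line)
    {
        // ${number} inserts the captured digits, $& inserts the whole match.
        return SeePattern.Replace(line, "<link=${number}>$&</link>");
    }
}
```

For a line such as "Please see 112. Also See 4." this should wrap both phrases, leaving the rest of the line unchanged.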

C# Regex.Matches - Quotation Marks Problem

I am new to the community and first want to introduce myself. My name is Ben and I love to code, but I only started about 2 years ago and have not coded much since then (maybe 50 hours in total).
So my question is the following. I want to scrape some data from a website, and it works almost perfectly, but the problem is that I have a string in the source code like this:
-> "key":"Name","role" and I want to grab only the Name without any quotation marks.
Now my code looks like the following:
MatchCollection AllChampionName = Regex.Matches(html, @"key\s*(.+?)\s*role", RegexOptions.Singleline);
But the result in my textbox is the following text: ":"Name","
I know why, but I don't know how to handle it, because I don't know how regex works.
Can someone please tell me the right code, so that I only get Name without quotation marks, and maybe a source where I can read more about regex and how its syntax works? I found no good source :(
Edit: I am programming in C#.
Thanks a lot!
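A possible fix, sketched under the assumption that the source always looks like "key":"Name","role": put the quotation marks, colon and comma into the pattern as literal text and capture only what sits between the quotes. The names KeyNameScraper and ExtractNames are made up for the example. Inside a C# verbatim string a quotation mark is written as "".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeyNameScraper
{
    // Regex as the engine sees it:  "key"\s*:\s*"(?<name>[^"]*)"\s*,\s*"role"
    // The quotes are matched literally, so only the text between them
    // ends up in the named group.
    private static readonly Regex KeyPattern =
        new Regex(@"""key""\s*:\s*""(?<name>[^""]*)""\s*,\s*""role""");

    // Hypothetical helper: returns every captured name without quotes.
    public static List<string> ExtractNames(string html)
    {
        var names = new List<string>();
        foreach (Match m in KeyPattern.Matches(html))
        {
            names.Add(m.Groups["name"].Value);
        }
        return names;
    }
}
```

The important part is reading m.Groups["name"].Value rather than m.Value, since the latter is the whole match including the surrounding "key" and "role" text.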

Is there a better way to check if an entire string was matched? [duplicate]

This question already has answers here:
Match exact string
(3 answers)
Closed 3 years ago.
I'm parsing a text file line by line, and for each line I have a special regex. However, in one case a pattern matches two lines: one is a correct match, and the other matches only partially because a couple of values are optional.
Invalid match:
BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD
matches a partial string (shouldn't match this at all):
BNE1010/1000
Correct match (matches the entire string):
RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP
The regex for this is quite long and contains several optional groups:
^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?
I think there is no need to study the entire regex because it's built dynamically from smaller patterns at runtime and all the parts work correctly. Also, lots of combinations are covered by unit tests and they all work... as long as I try to parse only the line that should be matched by the pattern.
Currently I'm checking whether the entire string is matched with
match.Groups[0].Value == line
but I find it quite ugly. I know that in JavaScript the regex engine provides an Index property telling where the engine stopped. So my idea was to compare the index with the length of the string. Unfortunately, I wasn't able to find such a property in C#.
Another idea would be to modify the regex so that it matches only one line and no partial lines.
Example: https://regex101.com/r/dM5wU4/1
The example contains only two cases because there aren't actually any combinations that would change its behavior. I could remove some parameters but it wouldn't change anything.
EDIT:
I've edited my question. Sorry to everyone for not providing all the information the first time. I won't ask any more questions when writing on the phone :) It wasn't a good idea. Hopefully it won't get closed now.
You asked whether I could simplify the regex. I would do it if I could and knew how. If it were easy I wouldn't have asked. The problem started as the regex and string became bigger during development. Now they are at production length and I can't actually make them shorter, even for the sake of the question, sorry.
EDIT-2:
I found the reason why I couldn't find the inherited Index and Length properties of the Match class.
For some strange reason, when selecting the Match class and pressing F1, Visual Studio opened the wrong help page (Match Properties) even though I'm not working with the Micro Framework. I didn't notice that, but I was indeed wondering why there was so little information. Thanks to @Jamiec for the correct link. I won't trust Visual Studio anymore when hitting F1.
Disclaimer: I'm going to add this, but I doubt it's the solution. If it's not, this part will get deleted in short order.
You can add a $ at the end of your regular expression. This stops your first example matching but continues to match the second example.
As you've not provided any more than 2 examples, it's unclear whether this actually solves all your cases or just that one specific false positive.
My question is whether it is possible to check if a regular expression matched the entire string without checking the first group against the original line?
If you're not averse to comparing the length of the match with the length of the string, you can do that too:
var regex = new Regex(@"^(?<FlightDesignator>([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))(?<OperationalSuffix>[A-Z])?(?<FlightIdentifierDate>\/(\d{2})([A-Z]{3})?(\d{2})?)?(\s(?<FlightLegsChangeIdentifier>(\/?[A-Z]{3})+)(?=(\s|$)))?(\s1(?<JointOperationAirlineDesignators>(\/.{2}[A-Z]?)+))?(\s3\/(?<AircraftOwner>([A-Z]{2}|.)))?(\s4\/(?<CockpitCrewEmployer>(.+?)(?=(?: \d\/|$))))?(\s5\/(?<CabinCrewEmployer>([A-Z]{2}|.)))?(?<OnwardFlight>\s6\/(([A-Z0-9]{2}[A-Z]?)([0-9]{3,4}))([A-Z])?(\/(\d{2})([A-Z]{3})?(\d{2})?)?)?(\s7\/(?<MealServiceNote>(\/?[A-Z]{0,3})+))?(\s9\/(?<OperatingAirlineDisclosure>(.{2}[A-Z]?)))?");
var input1 = @"BNE1010/1000 HKG1955/2005 7/PLD/CLD/YLD";
var input2 = @"RG878A/21AUG15 GIG/BOG 1/RG/AV 3/AV 4/AV 5/RG 6/AV081C/22 7/CDC/YD 9/TP";
var match1 = regex.Match(input1);
var match2 = regex.Match(input2);
Console.WriteLine(match1.Length == input1.Length); // False
Console.WriteLine(match2.Length == input2.Length); // True
Live example: http://rextester.com/NIBE6349
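The two ideas discussed here (comparing the match length with the input, or anchoring the pattern at the end) could be wrapped in small helpers along these lines. This is only a sketch: FullMatchCheck, CoversWholeInput and AnchorToEnd are hypothetical names, and AnchorToEnd assumes the pattern is passed as a plain string with no special RegexOptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FullMatchCheck
{
    // Option 1: the first match must start at index 0 and be as long
    // as the input, using the Index and Length properties of Match.
    public static bool CoversWholeInput(Regex regex, string input)
    {
        Match m = regex.Match(input);
        return m.Success && m.Index == 0 && m.Length == input.Length;
    }

    // Option 2: wrap the pattern so it can only succeed if it reaches
    // the very end of the input (\z), which forces the engine to
    // backtrack into the optional groups instead of stopping early.
    public static Regex AnchorToEnd(string pattern)
    {
        return new Regex("^(?:" + pattern + @")\z");
    }
}
```

With a shortened stand-in pattern such as ^[A-Z]{3}[0-9]{4}(/[0-9]{4})? (not the production regex), both options should reject "BNE1010/1000 HKG1955/2005" and accept "BNE1010/1000". Option 2 is the stricter of the two, because Option 1 only inspects the first match the engine happens to return.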

How to replace entire URL using Regex? [duplicate]

This question already has answers here:
C# regex pattern to extract urls from given string - not full html urls but bare links as well
(3 answers)
Closed 9 years ago.
So far I have
messageText1 = Regex.Replace(messageText1, "(www|http|https)*?(com|.co.uk|.org)", "[URL OMITTED]");
With only the www, and without the brackets or http or https, it works as intended.
For example, an input of Hey check out this site, www.google.com, it's really cool would output hey check out this site, [URL OMITTED], it's really cool
But if I put back the or operators for the start of the URL, it only replaces the .com part of the input.
Why won't it work?
Thanks
(www|http|https)*?(com|.co.uk|.org)
means www or http or https, 0 to many times, immediately followed by com, .co.uk or .org.
So it would match, for example, httphttphttp.co.uk
Your intention was probably to have a . before the *. That then means it looks for (www|http|https) only once, then matches . (any character) 0 to many times.
You are also missing the . in .com. However, if you want to match a literal . you need to use \., since a . on its own means 'any character'.
With that in mind, the regex I think you were going for is:
(www|http|https).*?(\.com|\.co\.uk|\.org)
This should work better. It will also work for other TLDs that don't end with .com, .co.uk or .org:
messageText1 = Regex.Replace(messageText1, @"\b(?:http://|https://|www\.)\S+", "[URL OMITTED]");
Your expression is missing a . somewhere or (possibly better) a \S+
(www|http|https)\S*(com|\.co\.uk|\.org)
In C#:
Regex.Replace(messageText1, @"(www|http|https)\S*(com|\.co\.uk|\.org)", "[URL OMITTED]");
Note: you probably want to escape the .'s as well.
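One thing to watch with a \S-based pattern is that it also swallows punctuation glued to the end of the URL, such as the comma in the question's example. The sketch below is one way to avoid that, assuming URLs start with http://, https:// or www. and never legitimately end in common punctuation. UrlRedactor and Redact are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRedactor
{
    // Scheme or "www.", then any run of non-whitespace, ending on a
    // character that is not whitespace or common trailing punctuation.
    private static readonly Regex UrlPattern = new Regex(
        @"\b(?:https?://|www\.)\S*[^\s.,;:!?)]",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: replaces every URL-like run with a placeholder.
    public static string Redact(string message)
    {
        return UrlPattern.Replace(message, "[URL OMITTED]");
    }
}
```

For the input Hey check out this site, www.google.com, it's really cool this should keep the comma after the placeholder, which is the output the question asks for.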
A simple version which I tried is as follows.
messageText1 = Regex.Replace(messageText1, @"(www)?(.)?[a-z]*.(com)", "[URL OMITTED]");
I tried this with each of the following inputs:
string messageText1 = " Hey check this out, http://www.google.com,its cool";
string messageText1 = " Hey check this out, www.google.com,its cool";
string messageText1 = " Hey check this out, google.com,its cool";
