Regex to detect url containing other urls

Regex to detect url containing other urls - c#

XXXXXXhttp://something/something-http://directedto.com/XXXXXXX
I have a list of strings like that where X stands for a random extended ASCII character. I can't find any web source of regex that help me to get
http://something/something-http://directedto.com/
out of the string. Could you provide me a regex pattern that really helps ?
EDIT; the above string is just an example.
as other cases e.g
XXXXXhttp://something/somehttp/qausiehfiuhakjh-/http://directedto.net/soemthignelseXXXXXXX
XXXXXXXXXXhttp://www.yahoo.com/_ylt=Asq0NTMqTVFcCmnB3eR857SbvZx4;_ylu=X3oDMTNvZ2dtNnI1BGEDMQRjY29kZQNwemJ1YWxsY2FoNQRjcG9zAzIEZwMxBGludGwDdXMEbWNvZGUDcHpidWFsbGNhaDUEbXBvcwMzBHBrZ3QDMgRwb3MDMQRzZWMDdGQtbG9jBHNsawN0aXRsZQR0ZXN0AzcwMQR3b2UDMjQ1OTExNQ--/SIG=14l1h2t2v/EXP=1322779228/**http://www.nytimes.com/2011/12/01/nyregion/told-to-diversify-dock-union-offers-nearly-all-white-list.html%3Fsrc=me%26ref=nyregionXXXXXXXXXXXXXX

Detecting a URL is actually very difficult, because it can contain almost any character including "random extended ascii" ones. A good explanation of why it's so hard is here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls. Unfortunately that example assumes there is some kind of "word boundary" around the URL, which is not the case for your problem.
There isn't any way to reliably detect any possible url, but you could make some assumptions, perhaps your urls all start with 'http:' or 'https:' and only contain alpha-numeric characters, underscores and periods? This would work for that:
https?:[a-zA-Z0-9./]+
If you update your question with better examples of the actual text you're trying to search in, I can improve my pattern as necessary.

Related

Regex for ^ | in C#

I am working on HL7 messages and I need a regex. This doesn't work:
HL7 message=MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
My regex is:
MSH|^~\&|DATACAPTOR|\d{3}|\d{3}|(\d{4}\d{2}\d{2}\d{2}\d{2}\d{2})|ORU\\^R01|\d{20}|P|2.3|8859/1
Can anybody suggest a regex for special characters?
I am using this code:
strRegex = "\\vMSH|^~\\&|DATACAPTOR|\\d{3}|\\d{3}|
(\\d{4}\\d{2}\\d{2}\\d{2}\\d{2}\\d{2})|ORU\\^R01|\\d{20}|P|2.3|8859/1";
Regex rx = new Regex(strRegex, RegexOptions.Compiled | RegexOptions.IgnoreCase );

|, ^, and \ are all special characters in regular expressions, so you'd have to escape them with \. Remember \ is also an escape character within a regular string literal so you'd have to escape that, too:
var strRegex = "\\vMSH\\|\\^~\\\\&\\|DATACAPTOR\\|…
But it's generally a lot easier to use a verbatim string literal (#"…"):
var strRegex = #"\vMSH\|\^~\\&\|DATACAPTOR\|…
Finally, note that (\d{4}\d{2}\d{2}\d{2}\d{2}\d{2}) can be simplified to (\d{14}).
However, for a structure like this, it's probably easier to just use the Split method.
var segment = "MSH|^~\&|DATACAPTOR…";
var fields = segment.Split('|');
var timestamp = fields[5];
Warning: HL7 messages may use different control characters—starting the 4th character in the MSH segment as a field separator (in this case |^~\& are the control characters). It's best to parse the control characters first if you don't control your input and these control characters may change.

For me your question describes two distinct problems.
Problem 1) "..I need a regex..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the good part of your question.
As already pointed out by #p-s-w-g some special characters in regular expressions must be escaped. Page Microsoft Developer Network: Character Escapes in Regular Expressions tells you which characters are special and how to escape them.
In order to easily test if your regex recognizes the grammar you may find useful some interactive regex testing tools, e.g. Regex Hero or The Regulator
Problem 2) "I am working on HL7 messages..this doesn't work..My regex is..anybody suggest a (better) regex..?"
This is the bad part of your question.
The
MSH|^~\&|DATACAPTOR|123|123|20100816171948|ORU^R01|081617194802900|P|2.3|8859/1
example shown in your question is already not valid HL7 message fragment. It is something similar to HL7 but it is was already damaged probably by some text pre-processing code. HL7 v2 messages are not transmitted using text protocol that can be manipulated using text tools. The protocol is binary but at the same time partially readable and thus controllable by humans without any special tools. But it is binary protocol and must be processed as such. Regex is a tool for working with text strings not binary strings. And although it may seem possible to outsmart some ancient 20 years old protocol by a new-age regex one-liner, it is not good approach. I have tried to explain the why not in the comment part of your question.
Basic decoding of the fragment is:
MSH-0: MSH
MSH-1: |
MSH-2: ^~\&
MSH-3: DATACAPTOR
MSH-4: 123
MSH-5: 123
MSH-6: ! missing !
MSH-7: 20100816171948
MSH-8: ! missing !
MSH-9: ORU^R01
MSH-10: 081617194802900
MSH-11: P
MSH-12: 2.3
MSH-13: ! missing !
MSH-14: ! missing !
MSH-15: ! missing !
MSH-16: ! missing !
MSH-17: ! missing !
MSH-18: 8859/1
The ! missing ! pieces are really missing. In normal MSH segment they should be there at their corresponding positions, just having default empty value.
By reading Health Level Seven, Version 2.3.1 © 1999 - Chapter 2.24.1 MSH - message header segment we can see that
The message was created 4 years ago in 2010, probably by Capsule Tech, Inc.'s DataCaptor™ and formatted by rules defined by Health Level Seven, Version 2.3© 1997 that is by 17 years old and several times updated standard and was supposed to be used by one of the countries listed in Wikipedia: ISO/IEC 8859-1
From your question I can't see more, but whatever you are trying to do and whatever data you are going to process for whatever reason, the code fragment you are starting with is already wrong, in general the HL7 regex parsing approach is strange and if you're working on a serious software to be used anywhere in the healthcare industry, please consider writing or using a serious and tested parser, e.g. the one used by NHapi library http://sourceforge.net/p/nhapi/code/HEAD/tree/NHapi20/NHapi.Base/Parser/PipeParser.cs

Parse directories from a string

Firstly i have spent Three hours trying to solve this. Also please don't suggest not using regex. I appreciate other comments and can easily use other methods but i am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
Works fine. How ever what if the directory name contains a white space? I have tried to match all strings that start with 1 uppercase letter followed by a colon then any thing else. This needs to be matched up until a whitespace, 1 upper letter and a colon. But then match the same sequence again.
Hope i have made sense.

How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead" which, according to Regular expression to match a line that doesn't contain a word? is the way to get around the lack of inverse matching in regexes.

Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.

Regex for ANY string except "www"? (subdomain)

I was wondering if someone out there could help me with a regex in C#. I think it's fairly simple but I've been wracking my brain over it and not quite sure why I'm having such a hard time. :)
I've found a few examples around but I can't seem to manipulate them to do what I need.
I just need to match ANY alphanumeric+dashes subdomain string that is not "www", and just up to the "."
Also, ideally, if someone were to type "www.subdomain.domain.com" I would like the www to be ignored if possible. If not, it's not a huge issue.
In other words, I would like to match:
(test).domain.com
(test2).domain.com
(wwwasdf).domain.com
(asdfwww).domain.com
(w).domain.com
(wwwwww).domain.com
(asfd-12345-www-bananas).domain.com
www.(subdomain).domain.com
And I don't want to match:
(www).domain.com
It seems to me like it should be easy, but I'm having troubles with the "not match" part.
For what it's worth, this is for use in the IIS 7 URL Rewrite Module, to rewrite for all non-www subdomains.
Thanks!

Is the remainder of the domain name constant, like .domain.com, as in your examples? Try this:
\b(?!www\.)(\w+(?:-\w+)*)(?=\.domain\.com\b)
Explanation:
\w+(?:-\w+)* matches a generic domain-name component as you described (but a little more rigorously).
(?=\.domain\.com\b) makes sure it's the first subdomain (i.e., the last one before the actual domain name).
\b(?!www\.) makes sure it isn't www. (without the \b, it could skip over the first w and match just the ww.).
In my tests, this regex matches precisely the parts you highlighted in your examples, and does not match the www. in either of the last two examples.
EDIT: Here's another version which matches the whole name, capturing the pieces in different groups:
^((?:\w+(?:-\w+)*\.)*)((?!www\.)\w+(?:-\w+)*)(\.domain\.com)$
In most cases, group $1 will contain an empty string because there's nothing before the subdomain name, but here's how it breaks down www.subdomain.domain.com:
$1: "www."
$2: "subdomain"
$3: ".domain.com"

^www\.
And invert the logic for this bit, so if it matches, then your string does not meet your requirements.

This works:
^(?!www\.domain\.com)(?:[a-z\-\.]+\.domain\.com)$
Or, with the necessary backslashes for Java (or C#?) strings:
"^(?!www\\.domain\\.com)(?:[a-z\\-\\.]+\\.domain\\.com)$"
There may be a more concise way (i.e. only typing domain.com once), but this works ..

Just substitute the original with everything after the www, if present (pseudocode):
str = re.sub("(www\.)?(.+)", "\2", str)
Or if you just want to match those which are "wrong" use this:
(www\.([^.]+)\.([^.]+))
And if you must match all those which are good use this:
(([^w]|w[^w]|ww[^w]|www[^.]|www\.([^.]+)\.([^.]+)\.).+)

Just thinking aloud here:
^(?:www\.)?([^\.]+)\.([^\.]+)\.
where...
(?:www\.)? looks for a possible "www" at the start, non-capturing
([^\.]+)\. looks for the sub-domain (anything except a dot at least once until a dot)
([^\.]+)\. looks for the domain, ending with a dot (anything except a dot at least once until a dot)
Note: This expression will not work with double sub-domains:
www.subsub.sub.domain.com

This:
^(?:www\.)?([^.]*)
It matches exactly what you put in parentheses in your question. You will find your answers sitting in group(1). You have to anchor it to the beginning of the line. Use this:
^(?:www\.)?(.*)
If you want everything in the URL except the "www.". One example you did not include in your test cases was "alpha.subdomain.domain.com". In the event you need to match everything, except "www.", that is not in the "domain.com" part of the string, use this:
^(?:www\.)?(.+)((?:\.(?:[^./\?]+)){2})
It will solve all of your cases, but in addition, will also return "alpha.subdomain" from my additional test case. And, for an encore, places ".domain.com" in group 2 and will not match beyond that if there are directories or parameters in the url.
I verified all of these responses here.
Finally, for the sake of overkill, if you want to reject addresses that begin with "www.", you can use negative lookbehind:
^....(?<!www\.).*

Thought i'd share this.
(\\.[A-z]{2,3}){1,2}$
Removes any '.com.au' '.co.uk' from the end. Then you can do an additional lookup to detect whether a URL contains a subdomain.
E.g.
subdaomin1.sitea.com.au
subdaomin2.siteb.co.uk
subdaomin3.sitec.net.au
all become:
subdomain1.sitea
subdomain2.siteb
subdomain3.sitec

Split text file at sentence boundary

I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this task using sed the UNIX utility? Does it have a symbol for "sentence boundary" like a symbol for "word boundary" (I think the GNU version has that). Please note that the sentence can end in a period, ellipsis, question or exclamation mark, the last two in combination (for example, ?, !, !?, !!!!! are all valid "sentence terminators"). The input file is formatted in such a way that some sentences contain newlines that have to be removed.
I thought about a script like s/...|. |[!?]+ |/\n/g (unescaped for better reading). But it does not remove the newlines from inside the sentences.
How about in C#? Would it be remarkably faster if I use regular expressions like in sed? (I think not). Is there an other faster way?
Either way (sed or C#) is fine. Thank you.

Regex is a good option that I was using for a long time.
A very good regex that worked fine for me is
string[] sentences = Regex.Split(sentence, #"(?<=['""A-Za-z0-9][\.\!\?])\s+(?=[A-Z])");
However, regex is not efficient. Also, though the logic works for ideal cases, it does not work good in production environment.
For example, if my text is,
U.S.A. is a wonderful nation. Most people feel happy living there.
The regex method will classify it as 5 sentences by splitting at each period. But we know that logically that it should be split as only two sentences.
This is what made me to look for a Machine Learning Technique and at last the SharpNLP worked pretty fine for me.
private string mModelPath = #"C:\Users\ATS\Documents\Visual Studio 2012\Projects\Google_page_speed_json\Google_page_speed_json\bin\Release\";
private OpenNLP.Tools.SentenceDetect.MaximumEntropySentenceDetector mSentenceDetector;
private string[] SplitSentences(string paragraph)
{
if (mSentenceDetector == null)
{
mSentenceDetector = new OpenNLP.Tools.SentenceDetect.EnglishMaximumEntropySentenceDetector(mModelPath + "EnglishSD.nbin");
}
return mSentenceDetector.SentenceDetect(paragraph);
}
Here in this example, I have made use of SharpNLP, in which I have used EnglishSD.nbin - a pre-trained model for sentence detection.
Now if I apply the same input on this method, it will perfectly split text into two logical sentences.
You can even tokenize, POSTag, Chuck etc., using the SharpNLP project.
For step by step integration of SharpNLP into your C# application, read through the detailed article I have written. It will explain to you the integration with code snippets.
Thanks

Sentence splitting is a non-trivial problem for which machine learning algorithms have been developed. But splitting on whitespace between [.\?!]+ and a capital letter [A-Z] might be a good heuristic. Remove the newlines first with tr, then apply the RE:
tr '\r\n' ' ' | sed 's/\([.?!]\)\s\s*\([A-Z]\)/\1\n\2/g'
The output should be one sentence per line. Inspect the output and refine the RE if you find errors. (E.g., mr. Ed would be handled incorrectly. Maybe compile a list of such abbreviations.)
Whether C# or sed is faster can only be determined experimentally.

You could use something like this to extract the sentences:
var sentences = Regex.Matches(input, #"[\w ,]+[\.!?]+")
foreach (Match match in sentences)
{
Console.WriteLine(match.Value);
}
This should match sentences containing words, spaces and commas and ending with (any number of) periods, exclamation and question marks.

You can check my tutorial http://code.google.com/p/graph-expression/wiki/SentenceSplitting
Basic idea is to have split chars and impossible pre/post condition at every split. Tjis simple heuristic works very well.

The task you're interested in is often referred to as 'sentence segmentation'. As larsmans said, it's a non-trivial problem, but heuristic approaches often perform reasonably well, at least for English.
It sounds like you're primarily interested in English, so the regex heuristics already presented may perform adequately for your needs. If you'd like a somewhat more accurate solution (at the cost of just a little more complexity), you might consider using LingPipe, an open-source NLP framework. I've had pretty good luck with LingPipe, the few times I've used it.
See http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html for a detailed tutorial on sentence segmentation.

Remove spam url in text

Input:
dsfdsf www. cnn .com dksfj kdsfjkdjfdf
www.google.com dkfjkdjfk w w w . ya
hoo .co mdfdd
Output:
dsfdsf dksfj kdsfjkdjfdf dkfjkdjfk mdfdd
How do I write a function that does this in C#?

Basically you would have to implement two steps:
Normalization
Filtering
Normalization means that you would remove all whitespace and other noise characters from your input, then you do a transcoding of all diacritics, special characters etc into the basic latin alphabet (this is to map identical- or similar-looking glyphs to one single char, e.g. omicron and o look identical). You would need to retain a one-to-one mapping from the normalized version of the input to the original input.
Then you would search the normalized input for blocked patterns, retrieve the same pattern in the original input and remove it.
Of course, this approach is not fail-safe, you might get false positives actually.
A good answer describing how the simple filtering is doomed can be found here:
How do you implement a good profanity filter?

Start with learning about the RegEx (Regular Expression) facilities in C#, then you'll need a good RegEx that matches a URL. You'd need to change this to manage URLs with spaces though.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.