Improve regex to Split large text into sentences [duplicate] - c#

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What is a regular expression for parsing out individual sentences?
I want to split a large text into sentences. The regex I got from an answer here is:
string[] sentences = Regex.Split(mytext, @"(?<=[\.!\?])\s+");
So I thought of using a pattern that splits like this:
if a . ? or ! is followed by a space and a capital letter, then do the split.
A capital letter indicates the start of a sentence.
text = " Sentence one . Sentence e.g. two ? Sentence three.";
sentence[1] = Sentence one
sentence[2] = Sentence e.g. two
For problematic cases like abbreviations, I intend to replace them first:
mytext = mytext.Replace("e.g.", "eg");
How can I implement this in regex?

\p{Lu} matches a Unicode uppercase letter (including accented letters; note that \p{Lt} is the titlecase category, which is not what you want here), so
string[] sentences = Regex.Split(mytext, @"(?<=[.!?])\s+(?=\p{Lu})");
should do what you want.
(Note that I don't think . or ? need to be escaped inside a character class, so I've removed the backslashes, but do check that this still works with those characters.)
However, note that this will still split on e.g. Mr. Jones...
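A minimal sketch of the split described above, assuming the uppercase test uses \p{Lu} (the Unicode uppercase-letter category). The class name, method name, and sample text are made up for illustration. It shows both the intended behaviour (no split after "e.g." because a lowercase letter follows) and the remaining problem with "Mr. Jones".

```csharp
using System;
using System.Text.RegularExpressions;

public static class SentenceSplitDemo
{
    // Split on whitespace that follows . ! or ? and precedes an uppercase letter.
    // Both conditions are lookarounds, so the punctuation stays with its sentence.
    public static string[] SplitSentences(string text)
    {
        return Regex.Split(text.Trim(), @"(?<=[.!?])\s+(?=\p{Lu})");
    }

    public static void Main()
    {
        // Hypothetical sample text, not taken from the question.
        string text = "Sentence one. Sentence e.g. two? Sentence three. Ask Mr. Jones!";
        foreach (string sentence in SplitSentences(text))
        {
            Console.WriteLine("[" + sentence + "]");
        }
        // "Sentence e.g. two?" stays in one piece,
        // but "Ask Mr." and "Jones!" are still split apart.
    }
}
```

Abbreviations followed by a capitalised word need separate handling, for example the replacement step the question proposes, or a list of known abbreviations checked before splitting.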

Related

Need Regular expression for the following [duplicate]

This question already has answers here:
Regular expression to validate numbers separated by commas or dashes [closed]
(2 answers)
Closed 2 years ago.
I have to write a regex that accepts only digits and hyphens: each entry is either a single four-digit number or two four-digit numbers separated by a hyphen, as shown below:
2751, 2759-2764, 2766-2774, 2776-2777, 2890-2897
3945-3974, 3979, 3984-3999
I have used the regex ^[0-9_,]+ but the line Regex.IsMatch(line, @"^[0-9_,]+$") returns false.
Regards,
Nagasree
The pattern you tried does not match because the character class contains neither a hyphen nor a space. Even if you add those, the pattern still does not enforce any format.
You could match 4 digits followed by an optional hyphen and 4 more digits, then repeat that group preceded by a comma and a space:
^[0-9]{4}(?:-[0-9]{4})?(?:, [0-9]{4}(?:-[0-9]{4})?)*$
var s = "2751, 2759-2764, 2766-2774, 2776-2777, 2890-2897";
Console.WriteLine(Regex.IsMatch(s, @"^[0-9]{4}(?:-[0-9]{4})?(?:, [0-9]{4}(?:-[0-9]{4})?)*$"));
Output
True

How to get all possible Regex Matches [duplicate]

This question already has answers here:
C# Code to generate strings that match a regex [closed]
(4 answers)
Closed 3 years ago.
Based off a regex string I would like to get a list of all the possible strings that would match the regex.
Example:
Given a regex string like...
^(en/|)resources/case(-| )studies/
I want to get a list of all the possible strings that would match the regex expression. Like...
^en/resources/case-studies/
or
^/resources/case-studies/
or
^en/resources/case studies/
or
^/resources/case studies/
Thank you
Note that in regex, ^ denotes the beginning of the line. To match a literal ^ you must escape it.
Try
\^(en)?/resources/case(-|\s)studies/
explanation:
\^ is ^ escaped.
(en)? is optionally en, where ? means zero or one times.
/resources/case the text as is.
(-|\s) minus sign or white space.
studies/ the text as is.
See: https://dotnetfiddle.net/PO4wKV
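For a pattern this small, the possible strings can be listed by expanding each alternation group by hand. The sketch below does not parse the regex: the two arrays are hard-coded from the groups (en/|) and (-| ) in the question, and it assumes ^ is meant as the start-of-string anchor rather than a literal character. The class and method names are hypothetical. Each candidate is then checked against the original pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EnumerateDemo
{
    // Build every combination of the two alternation groups by hand.
    public static List<string> Expand()
    {
        string[] prefixes = { "en/", "" };    // from (en/|)
        string[] separators = { "-", " " };   // from (-| )
        var results = new List<string>();
        foreach (string prefix in prefixes)
        {
            foreach (string separator in separators)
            {
                results.Add(prefix + "resources/case" + separator + "studies/");
            }
        }
        return results;
    }

    public static void Main()
    {
        // The pattern from the question, with ^ treated as an anchor.
        var pattern = new Regex(@"^(en/|)resources/case(-| )studies/");
        foreach (string candidate in Expand())
        {
            Console.WriteLine(candidate + " -> " + pattern.IsMatch(candidate));
        }
    }
}
```

This only works when the pattern is finite; anything containing * or + matches an unbounded set of strings and cannot be listed exhaustively.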

Regular expression for underscore C# [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
I need a regular expression to identify a string in the following format in my C# code. The string will always start with "REG" and contain 3 underscores, with 2 words and one number between the underscores. See the example below:
Example: "REG_SOFTWARE_SECURITY_1234"
I used the regex below, suggested on your forums:
"\b[a-zA-Z0-9_]+\b"
But it also passes incorrect inputs such as:
REG_1234
So it should only pass input in the format "REG_SOFTWARE_SECURITY_1234". Any suggestions?
I don't see an issue with your regex. You might be using it incorrectly in C#.
Try this:
var str = "ALPHABET_ ALPHABET _ ALPHABET _99";
var res = Regex.Matches(str, @"\b[a-zA-Z0-9_]+\b");
foreach (Match match in res)
{
    Console.WriteLine(match.Value);
}
The backslashes in your search string are not being interpreted as you expect. The C# string "\b[a-zA-Z0-9_]+\b" starts and ends with a "backspace" 0x0008 character.
To match the example string you need to use either "\\b[a-zA-Z0-9_]+\\b" or @"\b[a-zA-Z0-9_]+\b" as the regular expression.
If the task is to match a string with three words (including the REG prefix) and one number separated by three underscores, then the following should work:
@"\b[a-zA-Z]+_[a-zA-Z]+_[a-zA-Z]+_[0-9]+\b"
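A minimal sketch of both points, assuming the whole input must be exactly REG, two letter-only words, and a run of digits. The anchors ^ and $, the literal REG prefix, and the class and method names are my assumptions, not part of the answers above. The first two lines of output show why the non-verbatim string fails: "\b" is a single backspace character, while @"\b" is a backslash followed by b.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegCodeDemo
{
    // Assumed format: REG_<letters>_<letters>_<digits>, matching the entire input.
    private static readonly Regex Strict =
        new Regex(@"^REG_[A-Za-z]+_[A-Za-z]+_[0-9]+$");

    public static bool IsValid(string input)
    {
        return Strict.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine("\b".Length);   // 1: a regular string turns \b into backspace (U+0008)
        Console.WriteLine(@"\b".Length);  // 2: a verbatim string keeps the backslash and the b

        Console.WriteLine(IsValid("REG_SOFTWARE_SECURITY_1234")); // True
        Console.WriteLine(IsValid("REG_1234"));                   // False
    }
}
```

Using ^ and $ instead of \b rejects inputs that merely contain a valid code somewhere inside a longer string.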
Did you mean a letter by ALPHABET?
Then you could try \w+_ \w+ _ \w+ _\d+, if the whitespace is intended and you are searching for words, not just single letters.
If you want to find a pattern like letter, underscore, blank, and so on, try \w_ \w _ \w _\d.

I still don't understand regular expression [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 8 years ago.
I have this piece of code but have no idea what it is supposed to be matching. I have looked at many different sites to try and learn the keywords, but I just don't understand regex.
string key = @"^(.*)\s*=\s*(.*)\s*$";
Match value = Regex.Match(line, key);
This looks for the start of a line (^), captures any number of characters ((.*)), followed by optional whitespace (\s*), an equals sign (=), more optional whitespace (\s*), a second captured run of any characters ((.*)), optional trailing whitespace (\s*), and the end of the line ($).
Some valid example lines:
a=a
abc = xyz
value=5
etc
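The two capture groups can be read back through Match.Groups. The sketch below wraps the pattern in a hypothetical Parse helper (the class and method names are mine) and shows one consequence of the greedy (.*): the first group keeps the space before the equals sign, and when a line has several equals signs the split happens at the last one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeyValueDemo
{
    // Returns { key, value } as captured by the pattern, or null if the line has no '='.
    public static string[] Parse(string line)
    {
        Match m = Regex.Match(line, @"^(.*)\s*=\s*(.*)\s*$");
        if (!m.Success)
        {
            return null;
        }
        return new[] { m.Groups[1].Value, m.Groups[2].Value };
    }

    public static void Main()
    {
        foreach (string line in new[] { "a=a", "abc = xyz", "a=b=c", "no equals sign" })
        {
            string[] kv = Parse(line);
            Console.WriteLine(kv == null
                ? "'" + line + "' -> no match"
                : "'" + line + "' -> key='" + kv[0] + "', value='" + kv[1] + "'");
        }
        // 'abc = xyz' gives key='abc ' (trailing space kept) and value='xyz'.
        // 'a=b=c' gives key='a=b' and value='c'.
    }
}
```

Calling Trim() on the captured values, or using (.*?) for the first group, are two common ways to avoid the stray whitespace.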

Regex - At least one alphanumeric character and allow spaces [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 3 years ago.
I want to make sure a first name field has at least one alphanumeric character and also allow spaces and dashes.
**VALID**
David
Billie Joe
Han-So
**INVALID**
-
Empty is also invalid
To ensure the dashes and spaces happen in legitimate places, use this:
(?i)^[a-z]+(?:[ -]?[a-z]+)*$
(?i) puts us in case-insensitive mode
^ ensures we're at the beginning of the string
[a-z]+ matches one or more letters
[ -]?[a-z]+ matches an optional single space or dash followed by letters...
(?:[ -]?[a-z]+)* and this is allowed zero or more times
$ asserts that we have reached the end of the string
You mentioned alphanumeric, so in case you also want to allow digits:
(?i)^[a-z0-9]+(?:[ -]?[a-z0-9]+)*$
Use this pattern:
^(?=.*[a-zA-Z])[a-zA-Z -]+$
For alphanumeric, use:
^(?=.*[a-zA-Z0-9])[a-zA-Z 0-9-]+$
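The two letter-only patterns above agree on the examples from the question but differ on separator placement. The following sketch runs both against the question's samples plus one extra input, "a--b", which I added to show the difference; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NameDemo
{
    // First answer: letters, with single spaces or dashes only between letter runs.
    private static readonly Regex Structured =
        new Regex(@"(?i)^[a-z]+(?:[ -]?[a-z]+)*$");

    // Second answer: any mix of letters, spaces and dashes containing at least one letter.
    private static readonly Regex Lookahead =
        new Regex(@"^(?=.*[a-zA-Z])[a-zA-Z -]+$");

    public static bool IsStructuredName(string s) { return Structured.IsMatch(s); }
    public static bool IsLookaheadName(string s) { return Lookahead.IsMatch(s); }

    public static void Main()
    {
        string[] samples = { "David", "Billie Joe", "Han-So", "-", "", "a--b" };
        foreach (string s in samples)
        {
            Console.WriteLine("'" + s + "': structured=" + IsStructuredName(s)
                + ", lookahead=" + IsLookaheadName(s));
        }
        // Both accept the three valid names and reject "-" and the empty string.
        // Only the lookahead pattern accepts "a--b".
    }
}
```

Which behaviour is preferable depends on whether doubled, leading, or trailing separators should count as a valid name.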
