C# Using Regex to find words, any tips?


I am studying programming at university and I have a task for which I specifically need to use Regex.
Basically, I need to write a program that copies text from the first file until it meets the first non-copied word from the second file, or reaches the end of the file. When it finds that non-copied word (or reaches the end of the file), it copies text from the second file until it meets the first non-copied word from the first file, or reaches the end of that file. This repeats until both files end. Upper and lower case don't matter.
For example:
File1: You are very beautiful, can you give me your number?
File2: Beautiful is Beyonce, not me.
Result: You are very Beautiful is Beyonce, not me. beautiful, can you give me your number?
Yes, I know, it is a confusing result, but I need to produce it. Do you have any ideas or tips on how I could write this program?

First off, get yourself a Regex designer that follows .Net Regex rules.
I use: https://rad-software-regular-expression-designer.software.informer.com/
This is how you find words:
\b\w+\b
This reads as "a word boundary, followed by at least one word character, followed by a closing boundary". Make sure you use the case-insensitive option (RegexOptions.IgnoreCase).
Then loop through both strings, using a case-insensitive comparison, to find matching words.
Once done, store the first matching word in a variable.
Now create another Regex whose match string is a "positive lookahead zero-width assertion" around the first matching word. Don't forget the case-insensitive switch.
If you match the word itself, the word gets replaced. A lookaround instead returns just a zero-width location, so the original word is kept and you get two "beautiful"s in the result.
Use Regex's instance method "Replace(left, right)".
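The steps above can be sketched roughly as follows. This is only an illustration of a single step (find the first shared word, then insert the second text in front of it), not the full assignment. The class and method names are made up for this example, and a MatchEvaluator is used instead of a replacement string so that a "$" in the second file is not interpreted as a substitution.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, named for illustration only.
public static class FirstCommonWordDemo
{
    // \b\w+\b: one or more word characters between two word boundaries.
    static readonly Regex WordRegex = new Regex(@"\b\w+\b");

    // Returns every word in the text, in order of appearance.
    public static string[] FindWords(string text) =>
        WordRegex.Matches(text).Cast<Match>().Select(m => m.Value).ToArray();

    // Returns the first word of 'left' that also occurs in 'right'
    // (ignoring case), or null when the texts share no word.
    public static string FirstCommonWord(string left, string right)
    {
        var rightWords = new HashSet<string>(
            FindWords(right), StringComparer.OrdinalIgnoreCase);
        return FindWords(left).FirstOrDefault(w => rightWords.Contains(w));
    }

    // Inserts 'right' in front of the first occurrence of 'word' in 'left'.
    // The lookahead matches a zero-width position, so 'word' itself is kept.
    public static string InsertBeforeWord(string left, string right, string word)
    {
        var lookahead = new Regex(
            @"(?=\b" + Regex.Escape(word) + @"\b)", RegexOptions.IgnoreCase);
        return lookahead.Replace(left, m => right + " ", 1);
    }
}
```

With File1 and File2 from the question, FirstCommonWord returns "beautiful", and InsertBeforeWord(File1, File2, "beautiful") produces the Result line shown in the question.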
Now, I've also thrown the code together but you're supposed to be doing this yourself.
// So I've hidden the code in this pastebin:
https://pastebin.com/fBAu1zBY
// Only go there if you've not managed to figure this out for yourself!

Related

How to check if string has letter in second character with regex

Say I wanted to create an ID number such as 1A45 or 4F01.
What would the regex be to make sure that the string had exactly one letter as the second character?
I am unsure how to check for specific combinations of characters.
What I have so far is:
if(!Regex.IsMatch(txtTrainID.Text, @"^[\w,\d,\w,\w]+$"))
Which is obviously completely wrong, I've had trouble trying to find a decent simple answer to this anywhere.

If that's the only requirement (and I am sure it's not), use anchors and a character class in the second position as in
^.[A-Za-z]
See a demo on regex101.com.
What you probably mean comes down to:
^\d[a-zA-Z]\d{2}$
The latter means one digit, one of a-zA-Z, followed by two other digits and the end of the string. See another demo on the same site.
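A minimal sketch of how the second pattern could be used from C#. The class and method names are hypothetical, and the "digit, letter, two digits" format is the answer's guess at the real requirement.

```csharp
using System.Text.RegularExpressions;

// Hypothetical validator, named for illustration only.
public static class TrainIdValidator
{
    // One digit, one ASCII letter, then exactly two more digits.
    static readonly Regex IdPattern = new Regex(@"^\d[a-zA-Z]\d{2}$");

    public static bool IsValid(string id) => IdPattern.IsMatch(id);
}
```

IsValid("1A45") and IsValid("4F01") return true, while IsValid("AB45") and IsValid("1A456") return false.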

C# regular expression filter bad word with dot

Is it possible to remove a dot in a bad word with regex in C#?
With a dot you can bypass the filter, for example: co.okies
Text:
Co.okies are tasty.
should look like
Cookies are tasty.

It's partly possible. You could copy your text and remove every character that's not in the a-z0-9 range, except for spaces. Then you could match every word in the result against your dictionary. This, however, wouldn't solve the space issue. The space issue could partially be resolved by building a list of permutations that joins each word with the words before and after it, at every word position in the text, and then matching those permutations against your dictionary again. Once a permutation matches, you can figure out where to start removing and where to stop. (Don't forget the characters you removed in the first step...)
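A simplified sketch of the first part of that idea, working token by token rather than on a stripped copy of the whole text. It does not attempt the space issue at all. The dictionary content and all names are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical filter, named for illustration only.
public static class DottedWordFilter
{
    // Hypothetical dictionary of words to detect; case is ignored.
    static readonly HashSet<string> Dictionary =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "cookies" };

    // Word characters with punctuation between them, e.g. "Co.okies".
    // Punctuation at the end of a word ("tasty.") is not part of the match.
    static readonly Regex SplitWord = new Regex(@"\w+(?:[^\w\s]+\w+)+");

    // Removes the interior punctuation only when the result is a dictionary word.
    public static string Normalize(string text) =>
        SplitWord.Replace(text, m =>
        {
            string stripped = Regex.Replace(m.Value, @"\W", "");
            return Dictionary.Contains(stripped) ? stripped : m.Value;
        });
}
```

Normalize("Co.okies are tasty.") returns "Cookies are tasty.", and a token such as "e.g" is left alone because "eg" is not in the dictionary.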
But to be honest, the link Kobi gave you in the comments explains perfectly why you shouldn't attempt this, unless we are talking about a very small scenario with finite possibilities. Whatever security you try to implement, it will always be hacked in the end.
Kobi's link

Parse directories from a string

Firstly, I have spent three hours trying to solve this. Also, please don't suggest not using regex. I appreciate other comments and can easily use other methods, but I am practicing regex as much as possible.
I am using VB.Net
Example string:
"Hello world this is a string C:\Example\Test E:\AnotherExample"
Pattern:
"[A-Z]{1}:.+?[^ ]*"
Works fine. However, what if the directory name contains white space? I have tried to match all strings that start with one uppercase letter followed by a colon and then anything else. This needs to be matched up until a whitespace, one uppercase letter and a colon, and then the same sequence should be matched again.
I hope I have made sense.

How about "[A-Z]{1}:((?![A-Z]{1}:).)*", which should stop before the next drive letter and colon?
That "?!" is a "negative lookaround" or "zero-width negative lookahead", which, according to "Regular expression to match a line that doesn't contain a word?", is the way to get around the lack of inverse matching in regexes.
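Here is a small sketch of that pattern in use, written in C# (the pattern itself is the same in VB.NET, since both use the .NET regex engine). The class and method names are made up for this example. Each match runs up to the next drive letter, so it can end in the separating space, which is trimmed here.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical parser, named for illustration only.
public static class DrivePathParser
{
    // A drive letter and colon, then any characters up to (but not
    // including) the next "letter + colon" sequence.
    static readonly Regex PathPattern = new Regex(@"[A-Z]:((?![A-Z]:).)*");

    public static string[] FindPaths(string text) =>
        PathPattern.Matches(text)
                   .Cast<Match>()
                   .Select(m => m.Value.TrimEnd())
                   .ToArray();
}
```

For the example string this yields C:\Example\Test and E:\AnotherExample, and a directory name containing spaces stays in one piece. Any ordinary text that follows the last path would be included in that match.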

Not to be too picky, but most filesystems disallow a small number of characters (like <>/\:?"), so a correct pattern for a file path would be more like [A-Z]:\\((?![A-Z]{1}:)[^<>/:?"])*.
The other important point that has been raised is how you expect to parse input like "hello path is c:\folder\file.extension this is not part of the path:P"? This is a problem you commonly run into when you start trying to parse without specifying the allowed range of inputs, or the grammar that a parser accepts. This particular problem seems pretty ad hoc and so I don't really expect you to come up with a grammar or to define how particular messages are encoded. But the next time you approach a parsing problem, see if you can first define what messages are allowed and what they mean (syntax and semantics). I think you'll find that once you've defined the structure of allowed messages, parsing can be almost trivial.

Regex - Lookahead for pattern through newlines

Still learning Regex, and am having trouble getting my head wrapped around the lookahead concept. Similar data to my question here - Matching multiple lines up until a sepertor line? , say I have the following lines handed to me by the user:
0000AA.The horizontal coordinates are valid at the epoch date displayed above.
0000AA.The epoch date for horizontal control is a decimal equivalence
0000AA.of Year/Month/Day.
0000AA
[..]
So a really simple Regex is ^[0-9]{4}[A-Z]{2}\.(?<noteline>.*), which would give me every line. Fantastic. :) However, I'd like a lookahead (or a condition?) that would look at the next line and tell me if that line has the code WITHOUT a '.' (i.e. if the NEXT line would match ^[0-9]{4}[A-Z]{2}[^\.]).
Trying the lookahead, I get hits on the first two lines (because the following line has '.' after the code) but not on the last.
Edit: Using the regex above, or the one offered below, gives me all lines, but I'd like to know IF a blank line (a line with the 0000AA code, but no '.' afterwards) follows. For example, when I get to the match on the line of Year/Month/Day, I'd like to know IF that line is followed by a blank line (or not). (Like with a group name that's not spaces or empty, as a high-level example.)
Edit 2: I may be misusing the term 'lookahead'. Going back over .NET's regex, I see something referred to as an alternation construct, but I'm not sure if that could be used here.
Thanks! Mike.

Apply the option RegexOptions.Multiline. It changes the meaning of ^ and $, making them match the beginning and the end of every line instead of the beginning and end of the entire string.
var matches = Regex.Matches(input,
@"^[0-9]{4}[A-Z]{2}\..*$(?!^[0-9]{4}[A-Z]{2}[^.])",
RegexOptions.Multiline);
The negative lookahead is
find(?!suffix)
It matches a position not preceding a suffix. Don't escape the dot within the brackets [ ]. The brackets disable the special meaning of most characters anyway.
I also added .*$ to make the pattern match until the end of the current line. No lazy quantifier is needed here: unless RegexOptions.Singleline is set, . does not match a newline, so .* cannot run past the end of the line.
If you need only the number part, you can capture it in a group by enclosing it within parentheses.
(^[0-9]{4}[A-Z]{2})\..*$(?!^[0-9]{4}[A-Z]{2}[^.])
You can then get the group like this
string number = match.Groups[1].Value;
Note: Group #0 represents the entire match.
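For the "is this line followed by a code-only line?" flag, one possible sketch is below. It is an assumption-laden variant, not the pattern from this answer: a lookahead that starts at the end of a line has to step over the line break before it can test the next line, and a capture made inside a positive lookahead is kept by .NET, so the Success property of that group can serve as the flag. The group name 'lastline' and the class name are chosen for this example.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical matcher, named for illustration only.
public static class NoteLineMatcher
{
    // noteline: the text after "code + dot" on the current line.
    // lastline: set only when the next line is a bare code with no dot.
    static readonly Regex NoteLine = new Regex(
        @"^[0-9]{4}[A-Z]{2}\.(?<noteline>[^\r\n]*)" +
        @"(?:(?=\r?\n(?<lastline>[0-9]{4}[A-Z]{2})\r?$)|)",
        RegexOptions.Multiline);

    // Returns each note line together with the "next line is blank" flag.
    public static List<KeyValuePair<string, bool>> Parse(string input)
    {
        var result = new List<KeyValuePair<string, bool>>();
        foreach (Match m in NoteLine.Matches(input))
        {
            result.Add(new KeyValuePair<string, bool>(
                m.Groups["noteline"].Value,
                m.Groups["lastline"].Success));
        }
        return result;
    }
}
```

For the input "0000AA.First line\n0000AA.Second line\n0000AA\n" this gives two entries: "First line" with the flag false and "Second line" with the flag true.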

After doing a lot of research, and hits and misses, I'm now certain that it can't be done - or rather, it CAN be, but it would be prohibitively difficult - so it's easier to do it in code.
To recap, I was looking at a multiline string (document) where every line is preceded by a 6-character code. Some lines - the lines I'm interested in - have a '.' after the code, followed by free text. I was hoping there would be a way to get each line in a group, along with a flag letting me know if the next line has no free-text entry (no '.' after the code). I.e. a two-line data entry would give me two matches on the document. The first match would have the line's text in the group called 'notetext', and the group 'lastline' would be empty. The second match would have the second part of the entered note in 'notetext', and the group 'lastline' would have something (anything, the content wouldn't matter).
From what I understand, lookaheads are zero-width assertions, so even if one matches, the returned value is still empty. Without using a lookahead, the match for 'lastline' would consume the next line's code, making 'notetext' skip that line (giving me every other line of text). So I would need some back-reference to revert to.
At this point, it'd be easier (code-wise) to simply get all the lines and add up text until I get to the end of each note. (Looping over the entire document, which can't be more than 200 lines, as opposed to looping through the regex-matched lines; the ease of reading the code for future modifications would outweigh any slight speed advantage the regex could get me.)
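A rough sketch of that line-by-line approach, under the assumption that a bare code line (no '.') marks the end of a note and that lines not starting with a code can be skipped. All names are invented for this example.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical collector, named for illustration only.
public static class NoteCollector
{
    // A code, optionally followed by a dot and free text.
    static readonly Regex CodeLine =
        new Regex(@"^[0-9]{4}[A-Z]{2}(?:\.(?<text>.*))?$");

    // Joins consecutive text lines into one note; a bare code ends the note.
    public static List<string> CollectNotes(string document)
    {
        var notes = new List<string>();
        var current = new StringBuilder();
        foreach (string rawLine in document.Split('\n'))
        {
            Match m = CodeLine.Match(rawLine.TrimEnd('\r'));
            if (!m.Success) continue;            // not a coded line
            if (m.Groups["text"].Success)
            {
                if (current.Length > 0) current.Append(' ');
                current.Append(m.Groups["text"].Value);
            }
            else if (current.Length > 0)         // bare code: note is complete
            {
                notes.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0) notes.Add(current.ToString());
        return notes;
    }
}
```

The regex is still doing the per-line parsing here; only the "what comes next" decision moves into ordinary code, which keeps the pattern short.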
Thanks guys -
-Mike.

Regular expression need to identify where sentences don't have a space between them

I need a regular expression to identify all instances where a sentence begins without a space following the previous period.
For example, this is a bad sentence:
I'm sentence one.This is sentence two.
this needs to be fixed as follows:
I'm sentence one. This is sentence two.
It's not simply a case of doing a string replace of '.' with '. ' because there are also a lot of instances where the rest of the sentences in the paragraph have the correct spacing, and this would give those an extra space.

\.(?!\s) will match dots not followed by a space. You probably want exclamation marks and question marks as well though: [\.\!\?](?!\s)
Edit:
If C# supports it, try this: [\.\!\?](?!\s|$). It won't match the punctuation at the end of the string.
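.NET does support this kind of lookahead. A minimal sketch of the replacement, with hypothetical class and method names, where $0 in the replacement string stands for the whole match (the punctuation mark itself):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class SentenceSpacing
{
    // Adds a space after '.', '!' or '?' unless whitespace or the
    // end of the string already follows.
    public static string AddMissingSpaces(string text) =>
        Regex.Replace(text, @"[.!?](?!\s|$)", "$0 ");
}
```

AddMissingSpaces("I'm sentence one.This is sentence two.") returns "I'm sentence one. This is sentence two.", and text that is already spaced correctly comes back unchanged. Abbreviations are not special-cased, so "i.e." becomes "i. e.".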

You could search for \w\s{1}\.[A-Z] to find a word character, followed by a single space character, followed by a period, followed by a capital letter, to identify these. For a find/replace: find (\w\s{1}\.)([A-Z]) and replace with $1 $2.

I doubt that you can create a regular expression that will work in the general case.
Any regex solution you come up with is going to have some interesting edge cases that you'll have to look at carefully. For example, the abbreviation "i.e." would become "i. e." (i.e., it will have an extra space and, if this parenthetical comment were run through the regex, it would become "i. e. ,").
Also, the proper way to quote text is to include the punctuation inside the quotes, as in "He said it was okay." If you had ["He said it was okay."This is a new sentence.], your regex solution might put a space before the final quote, or might ignore the error altogether.
Those are just two cases that come to mind immediately. There are plenty of others.
Whereas a regular expression will work in a limited set of simple sentences, real written language will quickly show that regular expressions are insufficient to provide a general solution to this problem.

If a sentence ends with, e.g., "..." you probably don't want to change this to ". . .".
I think the previous answers don't consider this case.
Try inserting a space wherever you find a word followed by a new word starting with an uppercase letter:
find (\w+[\.!?])([A-Z]'?\w+) replace $1 $2
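A short sketch of this find/replace in C#, with made-up class and method names. Because the pattern is case-sensitive and needs a word directly before the punctuation mark, neither an ellipsis nor a lowercase abbreviation is touched.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class CapitalizedSentenceSpacing
{
    // $1 is the word plus its punctuation, $2 is the capitalized word.
    public static string AddMissingSpaces(string text) =>
        Regex.Replace(text, @"(\w+[.!?])([A-Z]'?\w+)", "$1 $2");
}
```

AddMissingSpaces("I'm sentence one.This is sentence two.") returns "I'm sentence one. This is sentence two.", while "Wait...What" and "i.e. is" are returned as they are.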

Best website ever: http://www.regular-expressions.info/reference.html
