Regular Expression to Replace Unwanted Letters - c#

I wrote a small program in C# to capture in-game text.
My issue is that the text also contains color codes, which I don't want. I read about the function Regex.Replace,
which I think is going to suit that.
I have the following string (line) that I want to clean. I used the small tool Expresso to play a bit with regular expressions, but I never really figured it out.
This is the string I am going to work with:
|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R
I tried to use ^|( [a-zA-Z0-9]{9})
which gave me these matches:
c001177ff
cff00AA00
cff00AA00
cff00AA00
cffff69b4
cff00AA00
cff40e0d0
cffffff00
cffffff00
cff40e0d0
cffff69b4
cff00AA00
Well, I am not good at regex; I have only just started with it. I am not asking anybody to present me a complete solution (though you are more than welcome to), just a little help on how I can solve this issue. I want to filter the text.
Input code:
|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R
should be filtered to this:
Save Code = AGQg R9$# 4fR
I think these are hexadecimal color codes: the |c marks the beginning and the |r the end of the string. I think the |r | just indicates that the first color string ends, then we get a space, and the | indicates the next start.

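How about a simple LINQ query? (Note that it keeps only the last character of each 10-character piece, so the leading "Save Code =" text is dropped.)
var output = String.Join("", input.Split('|')
    .Select(s => s.Length != 10 ? ' ' : s.Last()))
    .Trim();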

So I think the problem you were having was not escaping your |. The following regex works for me:
var replaced = Regex.Replace(input, @"\|c[0-9a-zA-Z]{8}|\|r", "");
\|c[0-9a-zA-Z]{8} - match starting with "|c" and then any 8 letters or numbers
| - or
\|r - match "|r"
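As a minimal sketch of the breakdown above, the pattern can be wrapped in a small helper and run against the string from the question. The class name ColorCodeStripper and the method name Strip are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ColorCodeStripper
{
    // "|c" followed by exactly 8 letters or digits, or the reset marker "|r".
    private static readonly Regex ColorCode = new Regex(@"\|c[0-9a-zA-Z]{8}|\|r");

    // Hypothetical helper: removes every color code and reset marker.
    public static string Strip(string input)
    {
        return ColorCode.Replace(input, "");
    }

    public static void Main()
    {
        string input =
            "|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r " +
            "|cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R";

        Console.WriteLine(Strip(input)); // Save Code = AGQg R9$# 4fR
    }
}
```

Because the quantifier is exactly {8}, the character that follows each color code (the visible letter) is left in place.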

You're on the right track. Your regex
^|( [a-zA-Z0-9]{9})
has two problems: the ^ start-of-line anchor forces the match to occur only at the start of your input string, and the | needs to be escaped because, unescaped, it is the special "or" (alternation) operator, which completely changes the meaning of your regex.
In addition, the space inside the group is unwanted, and the capture group is unnecessary, as you only want to eliminate this portion.
If you replace all instances of this
\|[a-zA-Z0-9]{9}
with nothing (the empty string)
You will achieve most of your goal. Try it here: http://regex101.com/r/rF6yB6/1
But it seems you really want to eliminate not just nine characters after the pipe, but up through nine characters. So use the {1,9} range quantifier instead:
\|[a-zA-Z0-9]{1,9}
Try it: http://regex101.com/r/rF6yB6/2
This seems to achieve your goal exactly.
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.
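To illustrate the point about the unescaped pipe, here is a small sketch. It uses Regex.Escape, which backslash-escapes regex metacharacters, and compares an alternation with a literal pipe; the class name PipeEscaping and the sample strings are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PipeEscaping
{
    public static void Main()
    {
        // Regex.Escape turns the pipe into a literal by prefixing a backslash.
        Console.WriteLine(Regex.Escape("|c")); // \|c

        // Unescaped, "|" means "or": this pattern reads "a OR r",
        // so it matches text that contains no pipe at all.
        Console.WriteLine(Regex.IsMatch("abc", "a|r"));    // True

        // Escaped, the pattern requires the literal characters "a|r".
        Console.WriteLine(Regex.IsMatch("abc", @"a\|r"));  // False
        Console.WriteLine(Regex.IsMatch("xa|r", @"a\|r")); // True
    }
}
```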

string input = "[The example input from your question]";
string output = input.Replace("|r", "");
while (output.Contains("|c"))
    output = output.Remove(output.IndexOf("|c"), 10);
// output = "Save Code = AGQg R9$# 4fR"
I like this much more than using regexes, just because it is so much clearer to me.

var str1 = "|c001177ffSave Code =|r |cff00AA00A|cff00AA00G|cff00AA00Q|cffff69b4g|r |cff00AA00R|cff40e0d09|cffffff00$|cffffff00#|r |cff40e0d04|cffff69b4f|cff00AA00R";
var str2 = Regex.Replace(str1, @"\|(r|[a-zA-Z0-9]{9})", ""); // "Save Code = AGQg R9$# 4fR"

In addition to this answer re: escaping the "pipe" character, you're starting your regex with the caret (^) character. This matches the beginning of a line.
A correct regex would be:
\|c[0-9a-zA-Z]{8}

This regex should match all of the characters you want to remove:
([|]c([0-9]|[a-f]|[A-F]){8})|[|]r
Here's the breakdown...
The vertical pipe is an OR marker, so to search for a literal pipe, place it in square brackets: [|].
The parentheses make a group. So you're searching for ([|]c([0-9]|[a-f]|[A-F]){8}) OR [|]r, which is any of your color codes OR |r.
A color code is the group that begins with |c and is followed by exactly 8 characters, each of which can be 0 through 9, a through f, or A through F.
I tested it at RegexPal.com.

Related

Splitting a string with some characters with some ignored characters as well

There is a string: "QARR_1 * QARR_1 * NPSH[*] + NPSH0". I want to split it into a string array (exactly of 4 items) to get output as: QARR_1, QARR_1, NPSH[*], NPSH0.
I understand that I should use regex lookarounds here, but I am not able to achieve the desired result. Kindly help.
I think you could do it like this without lookarounds:
(\w+(?:\[\*\])?)
Test
http://rextester.com/YHNRC51736
a captured group ( ... )
one or more word characters \w+
with an optional non-capturing group (?:\[\*\])?
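A rough C# sketch of this pattern in use: collect every match with Regex.Matches instead of splitting. The class name TokenExtractor and the method name Extract are hypothetical, and the outer capturing group is dropped here because only the whole match value is read.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // Hypothetical helper: returns each run of word characters,
    // including a trailing "[*]" when one is present.
    public static string[] Extract(string input)
    {
        return Regex.Matches(input, @"\w+(?:\[\*\])?")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string[] items = Extract("QARR_1 * QARR_1 * NPSH[*] + NPSH0");
        Console.WriteLine(string.Join(", ", items)); // QARR_1, QARR_1, NPSH[*], NPSH0
    }
}
```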
import re
a = "QARR_1 * QARR_1 * NPSH[*] + NPSH0"
x = re.split(r' \* | \+ ', a)
print(x)
['QARR_1', 'QARR_1', 'NPSH[*]', 'NPSH0']
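The same split can be sketched in C# with Regex.Split, assuming the operators are always surrounded by single spaces as in the sample input. The class name OperatorSplitter and the method name SplitOperands are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OperatorSplitter
{
    // Hypothetical helper: splits on " * " or " + ". The pattern has no
    // capturing groups, so the separators are not included in the result.
    public static string[] SplitOperands(string input)
    {
        return Regex.Split(input, @" \* | \+ ");
    }

    public static void Main()
    {
        string[] parts = SplitOperands("QARR_1 * QARR_1 * NPSH[*] + NPSH0");
        Console.WriteLine(string.Join(", ", parts)); // QARR_1, QARR_1, NPSH[*], NPSH0
    }
}
```

The "*" inside "NPSH[*]" survives because it has no spaces around it.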
Hmmm, well... this works in the regex tool I used:
\w+\[?\*?\]?
Not the most elegant, but pretty simple, so long as the input isn't broken like: "Abc12*]", "abc12[]", etc.
How it works:
\w+ this will greedily capture any sequence of word characters (keeps capturing until it runs out of characters that match), basically translates to: [a-zA-Z0-9_]+
\[?, \*?, \]? well, to start, the backslash here is used as an escape character to get Regex to literally look for the characters [, * and ]. They need to be escaped because they have a special meaning in Regex syntax otherwise. The ? at the end of each part tells the Regex pattern to match for the character between 0 and 1 times. It is necessary to be able to capture it 0 times, to allow matches that don't have the characters ([, * ,] ) at the end to be made.
A few examples of the kind of things it will match:
apples123_121231_2133414[*]
Ap1]
Orange_11[*
1ba222nnana*]
A few examples of the kinds of things it won't match:
(note, cases where part of the word is highlighted, only the highlighted part will be matched.)
Pares]*[
++++!!~+
111Grapes[]
111Grapes[]*
So, given the input you supplied, it should be fine... these are just a few things to be aware of.

delete extra text and punctuation marks from the string keeping just smileys?

I am running into some problems using a regular expression. Can you please help me out? The following is the problem I am trying to solve:
Input: :,... :D..:::))How are you today :P?..:(*
Output: :D :) :P :(
Basically I want to remove the punctuation (. , : ; etc.) and text from the input string by replacing them with the empty string, but I want to keep the smileys :) , :( and :P. I have written the following code, but it is not working.
Regex= "[A-Za-z]|:[D(P(]"
It also removes the :D and :P smileys.
The following regex string should work for you:
(((?<!:)[^:])|(:(?![PD\(\)])))[^:]*
It's made up of two parts:
( ((?<!:)[^:]) | (:(?![PD\(\)])) )
[^:]*
The first part is an OR (|) statement that uses Negative Lookahead and Lookbehind. It finds the first character in a block of text that doesn't contain a smiley by looking for either:
A character that is obviously not in a smiley:
Any character that is not preceded by a colon: (?<!:)
and is not a colon itself: [^:]
OR a colon that is not followed by a smiley character:
A colon :
That is not followed by a character that is the second half of a smiley: (?![PD\(\)])
The second part ([^:]*) continues looking until we find the beginning of a potential smiley (a colon).
This Regex currently only finds the following smileys:
:D
:P
:(
:)
You can update the second half of the OR statement to find other smileys.
To sum it up, this Regex should find everything that is not part of a smiley. You can simply declare it in a Regex variable and then call .Replace(string input, string replacement), passing in your input string and the string you want to replace the non-smiley characters with (String.Empty in this case).
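Here is a minimal sketch of that call, using the input from the question. The class and method names are hypothetical. Note that replacing with String.Empty also removes the spaces between smileys, so the second method shows a different technique (not the one described above) that collects the four smileys with Regex.Matches and joins them with a space.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class SmileyFilter
{
    // Removes everything that is not part of :D, :P, :( or :).
    public static string RemoveNonSmileys(string input)
    {
        return Regex.Replace(input, @"(((?<!:)[^:])|(:(?![PD\(\)])))[^:]*", string.Empty);
    }

    // Alternative technique: pick out the smileys themselves and join them.
    public static string JoinSmileys(string input)
    {
        return string.Join(" ", Regex.Matches(input, @":[DP()]")
                                     .Cast<Match>()
                                     .Select(m => m.Value));
    }

    public static void Main()
    {
        string text = ":,... :D..:::))How are you today :P?..:(*";
        Console.WriteLine(RemoveNonSmileys(text)); // :D:):P:(
        Console.WriteLine(JoinSmileys(text));      // :D :) :P :(
    }
}
```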
Not so perfect solution:
string text = ":,... :D..:::))How are you today :P?..:(*";
text = text.Replace(":)", "###)");
text = text.Replace(":D", "###D");
text = text.Replace(":P", "###P");
// clean up your punctuation marks here
//
text = text.Replace("###)", ":)");
text = text.Replace("###D", ":D");
text = text.Replace("###P", ":P");

Regex in a string

I need some help on a problem.
What I want to do is check an image's type by its hexadecimal signature.
string JpgHex = "FF-D8-FF-E0-xx-xx-4A-46-49-46-00";
Then I have a condition on
string.StartsWith(pngHex).
The problem is that the "x" characters in my "JpgHex" string can be anything.
I think I need a regex to check that but I don't know how!!
Thanks a lot!
I'm not quite clear what exactly you want to do, but the dot '.' character represents any character in Regex.
So the regex "^FF-D8-FF-E0-..-..-4A-46-49-46-00" will probably do the trick. '^' = Start of input.
If you want to allow only hex chars you can use "^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00".
Like I said, I'd need a better idea of what pattern you need to match.
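A rough sketch of the hex-only variant follows. It assumes the hyphen-separated string comes from BitConverter.ToString, which produces upper-case hex pairs joined by hyphens; that is a guess about the asker's setup, and the names JpegSignature and LooksLikeJfif are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class JpegSignature
{
    // Anchored at the start only, so it behaves like StartsWith with two
    // "any hex byte" positions after FF-D8-FF-E0.
    private static readonly Regex JfifHeader =
        new Regex("^FF-D8-FF-E0-[0-9A-F]{2}-[0-9A-F]{2}-4A-46-49-46-00");

    // Hypothetical helper; assumes the header bytes are already in memory.
    public static bool LooksLikeJfif(byte[] header)
    {
        return JfifHeader.IsMatch(BitConverter.ToString(header));
    }

    public static void Main()
    {
        byte[] jpeg = { 0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46, 0x49, 0x46, 0x00, 0x01 };
        byte[] png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

        Console.WriteLine(LooksLikeJfif(jpeg)); // True
        Console.WriteLine(LooksLikeJfif(png));  // False
    }
}
```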
Here are some examples:
Regex rgx =
    new Regex(@"^FF-D8-FF-E0-[a-zA-Z0-9]{2}-[a-zA-Z0-9]{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex); // IsMatch returns a bool.
I use [a-zA-Z0-9]{2} to denote two instances of a character, caps or small or a number. So the above regex would match :
FF-D8-FF-E0-aa-zZ-4A-46-49-46-00
FF-D8-FF-E0-11-22-4A-46-49-46-00
.. etc
Based on your need change the regex accordingly so for capitals and numbers only you change to [A-Z0-9]. The {2} denotes two occurrences.
The ^ denotes the string should start with FF and $ means the string should end with 00.
Lets say you wanted to only match two numbers, so you would use \d{2}, the whole thing would look like this:
Regex rgx = new Regex(@"^FF-D8-FF-E0-\d{2}-\d{2}-4A-46-49-46-00$");
rgx.IsMatch(pngHex);
How do I know of these magical characters? Simple, there are docs everywhere. See this MSDN page for some basic regex patterns. This page shows some quantifiers, those are things like match one or more or match only one.
Cheat-sheets also come in handy.
A regex would help you; you can use the following tool to help you test and learn: -
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
I recommend you have a play because then you'll learn!
To simply match any character in place of the x, the following should work: -
"^FF-D8-FF-E0-..-..-4A-46-49-46-00$"
In C#, it would be something like this: -
var test = "FF-D8-FF-E0-AB-CD-4A-46-49-46-00";
var foo = new Regex("^FF-D8-FF-E0-..-..-4A-46-49-46-00$");
if (foo.IsMatch(test))
{
// Do magic
}
You will need to read up on regular expressions to understand some of the characters that may not look familiar, i.e. ^ and $. See http://www.regular-expressions.info/

Regex matching with wildcard and where each character in expression can only be used once

I need some help to write a Regex for character matching. The scenario is that I have a text file with about 300 000 lines, with one word on each line. I need to find the words that match a certain set of characters.
Think of Scrabble as a very similar example, where a user has a set of characters, say for example P E S plus a wildcard character that can match any character (but only once).
If the text file contains the following words:
PIE
PIES
PEES
PASS
PLEASE
...only the words in bold should be matched, as each of the user's characters, including the wildcard, can only be used maximum once in matching.
Is there a way to write a regex expression for this?
I have started with...:
\b[P,E,S]\b
...but don't know how I should express that:
Each character (P, E, S) can only be used once
Any character (the wildcard) can also be used once
Thank you in advance! Please let me know if I need to clarify the problem.
// Peter
This is not very easy with regex (if at all possible).
Much simpler would be something like this:
List<char> set = new List<char>("PES");
string s = "PIES";
bool matches = s.Count(ch => !set.Remove(ch)) < 2;
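As a sketch, the three lines above can be wrapped in a method and run over the sample words from the question. The names RackMatcher and Matches are hypothetical; the rack is copied for every word because List.Remove mutates the list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RackMatcher
{
    // Hypothetical helper: true when the word can be built from the rack
    // letters (each used at most once) plus at most one wildcard.
    public static bool Matches(string word, string rack)
    {
        var remaining = new List<char>(rack); // fresh copy per word
        // Count the letters that could not be taken from the rack.
        return word.Count(ch => !remaining.Remove(ch)) < 2;
    }

    public static void Main()
    {
        foreach (string word in new[] { "PIE", "PIES", "PEES", "PASS", "PLEASE" })
        {
            Console.WriteLine(word + ": " + Matches(word, "PES"));
        }
        // PIE: True, PIES: True, PEES: True, PASS: False, PLEASE: False
    }
}
```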
Impossible is nothing:
You can do this with regexes using lookaheads:
(?=^.+$)(?=^[^P]*?P?[^P]*?$)(?=^[^E]*?E?[^E]*?$)(?=^[^S]*?S?[^S]*?$)
Basically, if you break it down, there are four components:
First lookahead :
(?=^.+$)
Checks if length is >= 1
Then the three parts :
(?=^[^P]*?P?[^P]*?$)
and likewise for E and S: each checks that at most one of its character exists.
Each lookahead scans the whole string and allows at most a single occurrence of its letter. If more than one P is found, the regex fails; the same applies to the E and S lookaheads.
For the wildcard, I still have to think of a smart way to do it :)
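A small sketch of how this pattern behaves on the sample words, without any wildcard handling. The class name AtMostOnce is hypothetical. Note that the pattern only limits how often P, E and S occur; it places no restriction on other letters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtMostOnce
{
    private static readonly Regex Pattern = new Regex(
        @"(?=^.+$)(?=^[^P]*?P?[^P]*?$)(?=^[^E]*?E?[^E]*?$)(?=^[^S]*?S?[^S]*?$)");

    // Hypothetical helper: true when P, E and S each occur at most once.
    public static bool IsMatch(string word)
    {
        return Pattern.IsMatch(word);
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch("PIE"));    // True
        Console.WriteLine(IsMatch("PIES"));   // True
        Console.WriteLine(IsMatch("PEES"));   // False (two E's, no wildcard support)
        Console.WriteLine(IsMatch("PASS"));   // False (two S's)
        Console.WriteLine(IsMatch("PLEASE")); // False (two E's)
        Console.WriteLine(IsMatch("XYZ"));    // True  (other letters are unrestricted)
    }
}
```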

Matching an (easy??) regular expression using C#'s regex

Ok sorry this might seem like a dumb question but I cannot figure this thing out :
I am trying to parse a string and simply want to check whether it only contains the following characters : '0123456789dD+ '
I have tried many things but just can't get to figure out the right regex to use!
Regex oReg = new Regex(@"[\d dD+]+");
oReg.IsMatch("e4");
will return true even though e is not allowed...
I've tried many strings, including Regex("[1234567890 dD+]+")...
It always works on Regex Pal but not in C#...
Please advise, and again I apologize if this seems like a very silly question.
Try this:
@"^[0-9dD+ ]+$"
The ^ and $ at the beginning and end signify the beginning and end of the input string respectively. Thus between the beginning and the end only the stated characters are allowed. In your example, the regex matches if the string contains one of the characters, even if it contains other characters as well.
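The difference between the unanchored and the anchored pattern can be sketched like this. The class name AllowedChars, the method name IsValid, and the sample input "2d6 + 4" are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowedChars
{
    // Hypothetical helper: the anchors force the character class
    // to cover the whole input, not just a part of it.
    public static bool IsValid(string input)
    {
        return Regex.IsMatch(input, @"^[0-9dD+ ]+$");
    }

    public static void Main()
    {
        // Unanchored: finds the "4" inside "e4", so it reports a match.
        Console.WriteLine(Regex.IsMatch("e4", @"[0-9dD+ ]+")); // True

        // Anchored: "e" is not in the class, so the whole string fails.
        Console.WriteLine(IsValid("e4"));      // False
        Console.WriteLine(IsValid("2d6 + 4")); // True
    }
}
```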
@comments: Thanks, I fixed the missing + and space.
Oops, you forgot the boundaries, try:
Regex oReg = new Regex(@"^[0-9dD +]+$");
oReg.IsMatch("e4");
^ matches the beginning of the text stream, $ matches the end.
It is matching the 4; you need ^ and $ to terminate the regex if you want a full match for the entire string - i.e.
Regex re = new Regex(@"^[\d dD+]+$");
Console.WriteLine(re.IsMatch("e4"));
Console.WriteLine(re.IsMatch("4"));
This is because regular expressions can also match parts of the input; in this case it just matches the "4" of "e4". If you want to match a whole line, you have to surround the regex with "^" (matches line start) and "$" (matches line end).
So to make your example work, you have to write it as follows:
Regex oReg = new Regex(@"^[\d dD+]+$");
oReg.IsMatch("e4");
I believe it's returning True because it's finding the 4. Nothing in the regex excludes the letter e from the results.
Another option is to invert everything, so it matches on characters you don't want to allow:
Regex oReg = new Regex(@"[^0-9dD+ ]");
bool isValid = !oReg.IsMatch("e4");
