Regex Not Matching Unicode

How would I go about using Regex to match Unicode strings? I'm loading a couple of keywords from a text file and using them with Regex on another file. Both keywords contain Unicode characters (such as á). I'm not sure where the problem is. Is there some option I have to set?
Code:
foreach (string currWord in _keywordList)
{
    MatchCollection mCount = Regex.Matches(
        nSearch.InnerHtml, "\\b" + @currWord + "\\b", RegexOptions.IgnoreCase);
    if (mCount.Count > 0)
    {
        wordFound.Add(currWord);
        MessageBox.Show(@currWord, mCount.ToString());
    }
}
And reading the keywords to a list:
var rdComp = new StreamReader(opnDiag.FileName);
string compSplit = rdComp.ReadToEnd()
    .Replace("\r\n", "\n")
    .Replace("\n\r", "\n");
rdComp.Dispose();
string[] compList = compSplit.Split(new[] {'\n'});
Then I change the array to a list.

When matching on a specific character, I believe it is safest to stick to literals from the ASCII character set. Beyond that, you can use \uxxxx (four hexadecimal digits) to match on the Unicode code point.
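As a minimal sketch of the \uxxxx approach, the snippet below checks whether a string contains á (U+00E1) without typing the character itself into the pattern. The class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnicodeEscapeDemo
{
    // Hypothetical helper: true when the text contains 'á' (U+00E1).
    // The pattern is a verbatim string, so the regex engine itself
    // interprets \u00E1 as the code point to match.
    public static bool ContainsAcuteA(string text)
    {
        return Regex.IsMatch(text, @"\u00E1");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsAcuteA("est\u00E1")); // text with á
        Console.WriteLine(ContainsAcuteA("esta"));      // plain ASCII text
    }
}
```

Writing the escape instead of the literal keeps the pattern independent of the encoding of the source file.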

You can use [\u0000-\uffff]+ to match at least the BMP (Basic Multilingual Plane).

Related

Find hashtags in string

I am working on a Xamarin.Forms PCL project in C# and would like to detect all the hashtags in a string.
I tried splitting at spaces and checking whether each word begins with a #, but if the post contains two consecutive spaces, as in "Hello  #World Test", the double space is lost.
string body = "Example string with a #hashtag in it";
string newbody = "";
foreach (var word in body.Split(' '))
{
    if (word.StartsWith("#"))
        newbody += "[" + word + "]";
    newbody += word;
}
Goal output:
Example string with a [#hashtag] in it
I also only want a hashtag to contain A-Z, a-z, 0-9 and _, stopping at any other character:
Test #H3ll0_W0rld$%Test => Test [#H3ll0_W0rld]$%Test
Other Stack Overflow questions try to detect the hashtag and extract it. I would like to work with it and put it back in the string without losing anything that methods such as splitting on certain characters would lose.

You can use Regex with the pattern #\w+ and the replacement $&.
Explanation
# matches the character # literally
\w matches any word character. In .NET this includes Unicode letters and digits as well as _; it is equal to [a-zA-Z0-9_] only when RegexOptions.ECMAScript is set
+ Quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$& includes a copy of the entire match in the replacement string
Example
var input = "asdads sdfdsf #burgers, #rabbits dsfsdfds #sdf #dfgdfg";
var regex = new Regex(@"#\w+");
var matches = regex.Matches(input);
foreach (var match in matches)
{
    Console.WriteLine(match);
}
or
var result = regex.Replace(input, "[$&]");
Console.WriteLine(result);
Output
#burgers
#rabbits
#sdf
#dfgdfg
asdads sdfdsf [#burgers], [#rabbits] dsfsdfds [#sdf] [#dfgdfg]
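Since the question asks for hashtags limited to A-Z, a-z, 0-9 and _, here is a minimal sketch that spells the character class out explicitly instead of relying on \w, and uses the same $& replacement. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HashtagMarker
{
    // '#' followed by one or more ASCII letters, digits or underscores.
    private static readonly Regex Hashtag = new Regex("#[A-Za-z0-9_]+");

    // Wraps every hashtag in square brackets; all other text,
    // including repeated spaces, is left untouched.
    public static string Mark(string body)
    {
        // $& stands for the whole match in the replacement string.
        return Hashtag.Replace(body, "[$&]");
    }

    public static void Main()
    {
        Console.WriteLine(Mark("Hello  #World Test"));
        Console.WriteLine(Mark("Test #H3ll0_W0rld$%Test"));
    }
}
```

Because the replacement works on the original string, nothing is lost the way it is when the text is split and joined again.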

Use a regular expression: \#\w*
string pattern = @"\#\w*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);

Removing special characters using Regex in C#

I have one problem in this code. I want to remove all special characters but the square brackets are not getting removed.
string regExp = "[\\\"]";
string tmp = Regex.Replace(str, regExp, " ");
string[] strArray = tmp.Split(',');
obj.amcid = db.Execute("select MAX(amcid)+1 from sca_amcmaster");
foreach (string i in strArray)
{
    // int myInts = int.Parse(i);
    db.Execute(";EXEC insertitems1 @0,@1", i, obj.invoiceno);
}

Square brackets are metacharacters in regular expressions; they delimit a character class, that is, a list of characters to match. So if you want to match them using Regex, you need to change your expression to:
string regExp = @"\[\\""\]";
That is, you need to put a backslash before each square bracket to match it literally. Note that this pattern matches the four characters [\"] only when they appear in that order.
If none of them are required in the expression, you can group them using parentheses and the character ? (zero or one match):
string regExp = @"(\[)?(\\)?("")?(\])?";
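If the goal is to replace each of these characters wherever it occurs, one option is to keep a character class and escape the brackets inside it. The sketch below assumes the characters to strip are [, ], the backslash and the double quote; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecialCharStripper
{
    // Character class containing '[', ']', '\' and '"'.
    // In a verbatim string a double quote is written as "".
    private static readonly Regex Unwanted = new Regex(@"[\[\]\\""]");

    // Replaces every unwanted character with a single space,
    // as the code in the question does.
    public static string Strip(string input)
    {
        return Unwanted.Replace(input, " ");
    }

    public static void Main()
    {
        Console.WriteLine(Strip("[\"12\",\"34\"]"));
    }
}
```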

Retrieve Alphabet with white space

I would like to retrieve the alphabet only but the code is not enough to make it.
What am I missing?
[A-Öa-ö]+$
16440 dallas
23941 cityO < You also have white space after "O"
931 00 Texas
10581 New Orleans

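It's because a range in a character class is defined by character code, and å, ä and ö do not come directly after z in the character table (they are not part of ASCII at all). The ranges A-Ö and a-ö therefore cover more than just letters.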
You can see it here: http://www.asciitable.com/
So what you need is a regex that specifies those separately:
[A-Za-zåäöÅÄÖ]+$
So the complete regex is:
var re = new Regex("([A-Za-zåäöÅÄÖ]+)$", RegexOptions.Multiline);
var matches = re.Matches(data);
Console.WriteLine(matches[0].Groups[1].Value);
However, since you want to allow white spaces within the name (as for "New Orleans") you need to allow it, simply include it in the regex:
var re = new Regex("([A-Za-zåäöÅÄÖ ]+)$", RegexOptions.Multiline);
Unfortunately that also includes white spaces in the beginning and the end:
" New Orleans "
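To fix that you start by making the quantifier lazy (non-greedy), i.e. telling it to use as few characters as possible:
new Regex("([A-Za-zåäöÅÄÖ ]+?)$", RegexOptions.Multiline)
The problem with that is that it does not match any line other than New Orleans. A likely cause, assuming the data uses Windows line endings (\r\n): with RegexOptions.Multiline, $ in .NET matches only before \n, so the \r left at the end of every other line blocks the match. To fix that I told the regex that there must be a whitespace character between the digits and the text, and that there may be whitespace after the text: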
var re = new Regex("\\s([A-Za-zåäöÅÄÖ ]+?)[\\s]*$", RegexOptions.Multiline);
which works with all lines.
Regex breakdown:
\\s A single whitespace character (not included in the capture, since it is outside the parentheses)
([A-Za-zåäöÅÄÖ ]+?)
    [A-Za-zåäöÅÄÖ ] Find a character which is either a letter or a space
    + There must be one or more
    ? Make the + lazy (match as few characters as possible)
[\\s]*
    [\\s] Find a whitespace character
    * There must be zero or more of it
Alternative
As an alternative to regex you can do something like this:
public IEnumerable<string> GetCodes(string data)
{
    var lines = data.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
    foreach (var line in lines)
    {
        for (var i = 0; i < line.Length; i++)
        {
            if (!char.IsLetter(line[i]))
                continue;
            var text = line.Substring(i).TrimEnd(' ');
            yield return text;
            break;
        }
    }
}
Which is invoked like:
var codes = GetCodes(yourData).ToList();

In C#, you can use the \p{L} Unicode category class to match any Unicode letter. You may match zero or more whitespace characters with \s*. The end of the string is $ (or \Z or \z). The word you need can be captured, and the capture can easily be retrieved from the match result via GroupCollection.
Thus, you can use
(\p{L}+)\s*$
or, if you plan to match only specific letters (Finnish ones, for example):
(?i)([A-ZÅÄÖ]+)\s*$
C# demo:
var strs = new string[] {"16440 dallas", "23941 cityO ", "931 00 Texas", "10581 New Orleans"};
foreach (var s in strs) {
    var match = Regex.Match(s, @"(\p{L}+)\s*$");
    if (match.Success)
    {
        Console.WriteLine(match.Groups[1].Value);
    }
}
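The pattern above captures only the last word, so "New Orleans" yields "Orleans". As a sketch of one way to keep multi-word names, the variant below allows further words separated by single spaces inside the capture. This is an assumption about the desired output, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingNameExtractor
{
    // One word of Unicode letters, optionally followed by more words
    // separated by single spaces, then optional trailing whitespace.
    private static readonly Regex TrailingName =
        new Regex(@"(\p{L}+(?: \p{L}+)*)\s*$");

    // Returns the trailing name, or an empty string when there is none.
    public static string Extract(string line)
    {
        Match match = TrailingName.Match(line);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        var lines = new[] { "16440 dallas", "23941 cityO ", "931 00 Texas", "10581 New Orleans" };
        foreach (var line in lines)
        {
            Console.WriteLine(Extract(line));
        }
    }
}
```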

Check for special characters are not allowed in C#

I have to validate a text box against a list of special characters that are not allowed.
These are the characters that are not allowed:
"&";"\";"/";"!";"%";"#";"^";"(";")";"?";"|";"~";"+";" ";
"{";"}";"*";",";"[";"]";"$";";";":";"=";"
The semicolons are only there to separate the characters. I tried to write a regex for some of the characters first; if it had worked, I would have extended it. It is not working.
What am I doing wrong here?
Regex.IsMatch(textBox1.Text, @"^[\%\/\\\&\?\,\'\;\:\!\-]+$")

^[\%\/\\\&\?\,\'\;\:\!\-]+$
matches strings that consist entirely of special characters. You need to invert the character class to match strings that do not contain a special character:
^[^\%\/\\\&\?\,\'\;\:\!\-]+$
 ^--- added
Alternatively, you can use this regex to match any string containing only alphanumeric characters, hyphens, underscores and apostrophes:
^[a-zA-Z0-9\-'_]+$
The regex you mention in the comments
[^a-zA-Z0-9-'_]
matches a string that contains any character other than the allowed ones (you might need to escape the hyphen, though). This works as well, provided you reverse the condition correctly (accept the strings that do not match).
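The two approaches described above, an allow-list anchored with ^ and $ and an inverted class whose result is negated, can be sketched side by side. The allow-list here uses a + quantifier so that strings of any non-zero length are accepted; the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowedCharsCheck
{
    // The whole string must consist of allowed characters.
    private static readonly Regex AllowedOnly = new Regex(@"^[a-zA-Z0-9\-'_]+$");

    // Finds any single character that is not allowed.
    private static readonly Regex AnyForbidden = new Regex(@"[^a-zA-Z0-9\-'_]");

    public static bool IsValidByAllowList(string input)
    {
        return AllowedOnly.IsMatch(input);
    }

    public static bool IsValidByInvertedClass(string input)
    {
        // Accept the strings that do NOT match.
        return !AnyForbidden.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsValidByAllowList("O'Neil_1"));
        Console.WriteLine(IsValidByInvertedClass("a/b"));
    }
}
```

The two checks differ for the empty string: the allow-list rejects it because of the + quantifier, while the inverted check accepts it.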

If you are just looking for any of a list of characters, then a regular expression is the more complicated option. String.IndexOfAny returns the first index of any of an array of characters, or -1 if none is found. So the check is:
if (input.IndexOfAny(theCharacters) != -1) {
    // Found one of them.
}
where theCharacters has previously been set up at class scope:
private readonly char[] theCharacters = new [] {'&','\\','/','!','%','#','^',... };
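A self-contained sketch of the IndexOfAny check might look like the following. The array holds only an illustrative subset of the characters listed in the question, and the class and method names are hypothetical.

```csharp
using System;

public static class InputValidator
{
    // Illustrative subset of the disallowed characters from the question.
    // Note that the backslash must be escaped in a char literal.
    private static readonly char[] Disallowed =
    {
        '&', '\\', '/', '!', '%', '#', '^', '(', ')', '?', '|', '~', '+', ' '
    };

    // True when the input contains none of the disallowed characters.
    public static bool IsValid(string input)
    {
        return input.IndexOfAny(Disallowed) == -1;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("abc"));
        Console.WriteLine(IsValid("a b"));
    }
}
```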

You need to remove ^ from the beginning and $ from the end of the pattern; otherwise, in order to match, the whole string must consist of special characters from start to end.
So, instead of
@"^[\%\/\\\&\?\,\'\;\:\!\-]+$"
it should be
@"[\%\/\\\&\?\,\'\;\:\!\-]+"

Your RegExp means "a string consisting only of special characters" (since you have the begin/end markers ^ and $).
You probably just want to check whether the string contains any of the characters; @"[\%\/\\\&\?\,\'\;\:\!\-]" would be enough.
Also, String.IndexOfAny may be a better fit if you just need to see whether any of the characters is present in the source string.

Please use this in the TextChanged event:
//Regex regex = new Regex("([a-zA-Z0-9 ._#]+)");
Regex regex = new Regex("^[a-zA-Z0-9_#(+).,-]+$");
string alltxt = txtOthers.Text; // txtOthers is the text box's name
int k = alltxt.Length;
for (int i = 0; i <= k - 1; i++)
{
    string lastch = alltxt.Substring(i, 1);
    MatchCollection matches = regex.Matches(lastch);
    if (matches.Count > 0)
    {
    }
    else
    {
        txtOthers.Text = alltxt.Remove(i, 1);
        i = i - 1;
        alltxt = txtOthers.Text;
        k = alltxt.Length;
    }
    txtOthers.Select(txtOthers.TextLength, 0);
}
BY Sharafu Hameed

C# Why i can not split the string?

string myNumber = "3.44";
Regex regex1 = new Regex(".");
string[] substrings = regex1.Split(myNumber);
foreach (var substring in substrings)
{
    Console.WriteLine("The string is : {0} and the length is {1}", substring, substring.Length);
}
Console.ReadLine();
I tried to split the string by ".", but the split returns 4 empty strings. Why?

. means "any character" in regular expressions. So don't split using a regex - split using String.Split:
string[] substrings = myNumber.Split('.');
If you really want to use a regex, you could use:
Regex regex1 = new Regex(@"\.");
The @ makes it a verbatim string literal, to stop you from having to escape the backslash. The backslash within the string itself is an escape for the dot within the regex parser.
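To make the two options concrete, here is a minimal sketch that splits the same input with an escaped regex and with String.Split. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DotSplitDemo
{
    // The backslash escapes the dot, so only literal periods are separators.
    public static string[] SplitWithRegex(string value)
    {
        return Regex.Split(value, @"\.");
    }

    // String.Split treats the character literally; no escaping is needed.
    public static string[] SplitWithString(string value)
    {
        return value.Split('.');
    }

    public static void Main()
    {
        foreach (var part in SplitWithRegex("3.44"))
        {
            Console.WriteLine("The string is : {0} and the length is {1}", part, part.Length);
        }
    }
}
```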

The easiest solution would be: string[] val = myNumber.Split('.');

. is a reserved character in regex. If you literally want to match a period, try:
Regex regex1 = new Regex(@"\.");
However, you're better off simply using myNumber.Split('.');

The dot matches a single character, without caring what that character
is. The only exception are newline characters.
Source: http://www.regular-expressions.info/dot.html
Therefore your code is asking to split the string at every character.
Use this instead.
string[] substr = num.Split('.');

Keep it simple and use the String.Split() method:
string[] substrings = myNumber.Split('.');
It has another overload which allows specifying split options:
public string[] Split(
    char[] separator,
    StringSplitOptions options
)

You don't need regex; you can do that by using the Split method of the string object:
string myNumber = "3.44";
string[] substrings = myNumber.Split('.');
foreach (var substring in substrings)
{
    Console.WriteLine("The string is : {0} and the length is {1}", substring, substring.Length);
}
Console.ReadLine();

The period "." is being interpreted as any single character instead of a literal period.
Instead of using regular expressions you could just do:
string[] substrings = myNumber.Split('.');

In Regex patterns, the period character matches any single character. If you want the Regex to match the actual period character, you must escape it in the pattern, like so:
@"\."
Now, this case is somewhat simple for Regex matching; you could instead use String.Split() which will split based on the occurrence of one or more static strings or characters:
string[] substrings = myNumber.Split('.');

Try
Regex regex1 = new Regex(@"\.");
EDIT: Er... I guess under a minute after Jon Skeet is not too bad, anyway...

You'll want to place an escape character before the ".", like this: "\\."
"." in a regex matches any character, so if you pass 4 characters to a regex with only ".", every character is treated as a separator and you get back only empty strings.

Try
Regex regex1 = new Regex("[.]");
