C# Regular Expressions with \Uxxxxxxxx characters in the pattern

C# Regular Expressions with \Uxxxxxxxx characters in the pattern - c#

Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )
Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.
Looking at the hex values for \U00010000 and \U0010FFF I get: 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.
So I guess I have really have one problem. Why are the Unicode characters formed with \U split into two chars in the string?

They're surrogate pairs. Look at the values - they're over 65535. A char is only a 16 bit value. How would you expression 65536 in only 16 bits?
Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)
Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?

To workaround such things with .Net regex engine, I'm using following trick:
"[\U010000-\U10FFFF]" is replaced with [\uD800-\uDBFF][\uDC00-\uDFFF]
The idea behind this is that as .Net regexes handle code units instead of code points, we're providing it with surrogate ranges as regular characters. It's also possible to specify more narrow ranges by operating with edges, e.g.: [\U011DEF-\U013E07] is same as (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-uDE07])
It's harder to read and operate with, and it's not that flexible, but still fits as workaround.

Related

How to escape variable name when using Roslyn C# Syntax Factory?

So I'm using Roslyn SyntaxFactory to generate C# code.
Is there a way for me to escape variable names when generating a variable name using IdentifierName(string)?
Requirements:
It would be nice if Unicode is supported but I suppose ASCII can suffice
It would be nice if it's reversible
Always same result for same input ("a" is always "a")
Unique result for each input ("a?"->"a_" cannot be same as "a!"->"a_")
Can convert from 1 special character to multiple single ones

The implication from the API docs seems to be that it expects a valid C# identifier here, so Roslyn's not going to provide an escaping mechanism for you. Therefore, it falls to you to define a string transformation such that it achieves what you want.
The way to do this would be to look at how other things already do it. Look at HTML entities, which are always introduced using &. They can always be distinguished easily, and there's a way to encode a literal & as well so that you don't restrict your renderable character set. Or consider how C# strings allow you to include string delimiters and other special characters in the string through the use of \.
You need to pick a character which is valid in C# identifiers to be your 'marker' for a sequence which represents one of the non-identifier characters you want to encode, and a way to allow that character to also be represented. Then make a mapping table for what comes after the marker for each of the encoded characters. If you want to do all of Unicode, the easiest way is probably to just use Unicode codepoint numbers. The resulting identifiers might not be very readable, but maybe that doesn't matter in your use case.
Once you have a suitable system worked out, it should be pretty straightforward to write a string transformation function which implements it.

Regular expression to match all numeric characters except 5

When I want to match all numeric characters except 5 I use:
[^\D|5]
or
[^\D5]
or
[0-46-9]
or
[012346789]
When I want to match no numeric characters I can use:
[^\d]
or
[\D]
All of them work well. But when I use [^^\d5] or [^^\d|5] to match all numeric characters except 5, it doesn't work.
I want to use it in a lot of cases. For example, I want to match all \p{P} but not \:. Is there any way to use ^\d to match all numeric character except 5?

You could match all digits except 5 using this:
[123467890]
There is no reason to use a shorthand version of everything.
It makes no difference to the regex engine.
In fact, adding in alternation| and zero-length assertions^ will only degrade your performance.
A shorter version would be:
[0-46-9]
Hyphen/Dash behavior inside character classes []
Hyphens will specify a range inside character classes. You can look up an ASCII table to see what range you are doing, for example: [ -Z] actually matches ASCII 33 to 127.
Edit:
Ok, now I have a better understanding of your requirements.
You need to be specific about what you need to match up front.
You can do this using negative/positive lookaheads:
(?!.*?5.*?)(?!.*?\p{Alpha}.*?)(\p{P}*?$|\p{L}*?$)
This will match under the following conditions:
There is no number 5
There is no character from the POSIX class: Alpha
Any character with the Unicode property "letter" or "punctuation"

\d is just [0-9]. See the Java regex reference for confirmation.
Just use [0-46-9]. You can try it in a regex fiddle.
UPDATE:
Based on the requirement to leverage De Morgan's laws and use a logical complement per the OP's comment, here is my interpretation of the logical complement of [^\D5].
[^\D5] essentially means "NOT (a non-digit character OR 5)". Compare this to "NOT (A OR B)" in the referenced Wikipedia article on De Morgan's laws.
What we need then is "(NOT a non-digit character) AND (NOT 5)". Compare this to "(NOT A) AND (NOT B)" in the referenced Wikipedia article.
Here then is my interpretation of logically complementing [^\D5] using a sequence of lookahead expressions for logical ANDing:
(?!\D)(?!5).
No, it does not use double negation by ^^; this does not work as you have found; but the above logical complement essentially means what we want in regexese - "(NOT a non-digit character) AND (NOT 5)" - applied to a single character (i.e. .).
You can see in a follow-on regex fiddle that the above logical complement yields the same results as [^\D5] like it should.

Counting special UTF-8 character

I'm finding a way to count special character that form by more than one character but found no solution online!
For e.g. I want to count the string "வாழைப்பழம". It actually consist of 6 tamil character but its 9 character in this case when we use the normal way to find the length. I am wondering is tamil the only kind of encoding that will cause this problem and if there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)

Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to the modern times and should be the way how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.

Simple lexical parser

I want to write a lexical parser for regular text.
So i need to detect following tokens:
1) Word
2) Number
3) dot and other punctuation
4) "..." "!?" "!!!" and so on
I think that is not trivial to write "if else" condition for each item.
So is there any finite state machine generators for c#?
I know ANTLR and other but while i will try to learn how to work with these tools i can write my own "ifelse" FSM.
i hope to found something like:
FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);

I suggest using Regular Expressions. Something like #"[a-zA-Z\-]+" will pick up words (a-z and dashes), while #"[0-9]*(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - #"[!\.\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a ).
Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words and numbers or whatever else you need to find out what expressions, exactly you need.
EDIT:
Or see Justin's answer for the particular regexes.

We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."
Under those definitions, words would be anything matching the following regex:
#"\b(?!\d)\w+\b"
Note that this would also match unicode. Numbers would match the following:
#"\b\d+(?:\.\d+)?\b"
Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.
After matching those, you could probably get away with this for punctuation:
#"[^\w\d\s]+"

Shall this Regex do what I expect from it, that is, matching against "A1:B10,C3,D4:E1000"?

I'm currently writing a library where I wish to allow the user to be able to specify spreadsheet cell(s) under four possible alternatives:
A single cell: "A1";
Multiple contiguous cells: "A1:B10"
Multiple separate cells: "A1,B6,I60,AA2"
A mix of 2 and 3: "B2:B12,C13:C18,D4,E11000"
Then, to validate whether the input respects these formats, I intended to use a regular expression to match against. I have consulted this article on Wikipedia:
Regular Expression (Wikipedia)
And I also found this related SO question:
regex matching alpha character followed by 4 alphanumerics.
Based on the information provided within the above-linked articles, I would try with this Regex:
Default Readonly Property Cells(ByVal cellsAddresses As String) As ReadOnlyDictionary(Of String, ICell)
Get
Dim validAddresses As Regex = New Regex("A-Za-z0-9:,A-Za-z0-9")
If (Not validAddresses.IsMatch(cellsAddresses)) then _
Throw New FormatException("cellsAddresses")
// Proceed with getting the cells from the Interop here...
End Get
End Property
Questions
1. Is my regular expression correct? If not, please help me understand what expression I could use.
2. What exception is more likely to be the more meaningful between a FormatException and an InvalidExpressionException? I hesitate here, since it is related to the format under which the property expect the cells to be input, aside, I'm using an (regular) expression to match against.
Thank you kindly for your help and support! =)

I would try this one:
[A-Za-z]+[0-9]+([:,][A-Za-z]+[0-9]+)*
Explanation:
Between [] is a possible group of characters for a single position
[A-Za-z] means characters (letters) from 'A' to 'Z' and from 'a' to 'z'
[0-9] means characters (digits) from 0 to 9
A "+" appended to a part of a regex means: repeat that one or more times
A "*" means: repeat the previous part zero or more times.
( ) can be used to define a group
So [A-Za-z]+[0-9]+ matches one or more letters followed by one or more digits for a single cell-address.
Then that same block is repeated zero or more times, with a ',' or ':' separating the addresses.

Assuming that the column for the spreadsheet is any 1- or 2-letter value and the row is any positive number, a more complex but tighter answer still would be:
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
"[A-Z]{1,2}[1-9]\d*" is the expression for a single cell reference. If you replace "[A-Z]{1,2}[1-9]\d*" in the above with then the complex expression becomes
^<cell>(:<cell>)?(,<cell>(:<cell>*)?)*$
which more clearly shows that it is a cell or a range followed by one or more "cell or range" entries with commas in between.
The row and column indicators could be further refined to give a tighter still, yet more complex expression. I suspect that the above could be simplified with look-ahead or look-behind assertions, but I admit those are not (yet) my strong suit.

I'd go with this one, I think:
(([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*,)*([A-Z]+[1-9]\d*:)?[A-Z]+[1-9]\d*
This only allows capital letters as the prefix. If you want case insensitivity, use RegexOptions.IgnoreCase.
You could simplify this by replacing [A-Z]+[1-9]\d* with plain old [A-Z]\d+, but that will only allow a one-letter prefix, and it also allows stuff like A0 and B01. Up to you.
EDIT:
Having thought hard about DocMax's mention of lookarounds, and using Hans Kesting's answer as inspiration, it occurs to me that this should work:
^[A-Z]+\d+((,|(?<!:\w*):)[A-Z]+\d+)*$
Or if you want something really twisted:
^([A-Z]+\d+(,|$|(?<!:\w*):))*(?<!,|:)
As in the previous example, replace \d+ with [1-9]\d* if you want to prevent leading zeros.
The idea behind the ,|(?<!\w*:): is that if a group is delimited by a comma, you want to let it through; but if it's a colon, it's only allowed if the previous delimiter wasn't a colon. The (,|$|...) version is madness, but it allows you to do it all with only one [A-Z]+\d+ block.
However! Even though this is shorter, and I'll admit I feel a teeny bit clever about it, I pity the poor fellow who has to come along and maintain it six months from now. It's fun from a code-golf standpoint, but I think it's best for practical purposes to go with the earlier version, which is a lot easier to read.

i think your regex is incorrect, try (([A-Za-z0-9]*)[:,]?)*
Edit : to correct the bug pointed out by Baud : (([A-Za-z0-9]*)[:,]?)*([A-Za-z0-9]+)
and finally - best version : (([A-Za-z]+[0-9]+)[:,]?)*([A-Za-z]+[0-9]+)
// ah ok this wont work probably... but to answer 1. - no i dont think your regex is correct
( ) form a group
[ ] form a charclass (you can use A-Z a-d 0-9 etc or just single characters)
? means 1 or 0
* means 0 or any
id suggest reading http://www.regular-expressions.info/reference.html .
thats where i learned regexes some time ago ;)
and for building expressions i use Rad Software Regular Expression Designer

Let's build this step by step.
If you are following an Excel addressing format, to match a single-cell entry in your CSL, you would use the regular expression:
[A-Z]{1,2}[1-9]\d*
This matches the following in sequence:
Any character in A to Z once or twice
Any digit in 1 to 9
Any digit zero or more times
The digit expression will prevent inputting a cell address with leading zeros.
To build the expression that allows for a cell address pair, repeat the expression preceded by a colon as optional.
[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?
Now allow for repeating the pattern preceded by a comma zero or more times and add start and end string delimiters.
^[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?(,[A-Z]{1,2}[1-9]\d*(:[A-Z]{1,2}[1-9]\d*)?)*$
Kind of long and obnoxious, I admit, but after trying enough variants, I can't find a way of shortening it.
Hope this is helpful.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.