Regular expression to catch letters beyond a-z


A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?

You can use \p{L} to match any 'letter', which will support all letters in all languages (note that .NET requires the braces). You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
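As a minimal sketch of the \p{L} suggestion above (the class and method names here are made up for illustration, they are not from the answer):

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: checks that a string consists only of Unicode letters.
public static class LetterCheck
{
    // \A and \z anchor the match to the whole string; \p{L} is any Unicode letter.
    private static readonly Regex LettersOnly = new Regex(@"\A\p{L}+\z");

    public static bool IsLettersOnly(string s) => LettersOnly.IsMatch(s);
}
```

This accepts "Åström" and rejects "abc1" without listing å, ä or ö explicitly. One caveat: if the input is in decomposed form (a base letter followed by a combining accent), the accent is a mark rather than a letter, so it may be worth normalizing the input first.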

What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about Unicode, but maybe your characters fit a Unicode class?
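A small sketch of how the names quoted above can be used in .NET; the helper names are invented for this example:

```csharp
using System.Text.RegularExpressions;

// Hypothetical helpers showing \p{name} with general categories and a named block.
public static class UnicodeClasses
{
    // Ll = lowercase letter (general category).
    public static bool IsLowercaseLetter(string s) => Regex.IsMatch(s, @"\A\p{Ll}\z");

    // Nd = decimal digit (general category).
    public static bool IsDecimalDigit(string s) => Regex.IsMatch(s, @"\A\p{Nd}\z");

    // IsGreek = the Greek named block.
    public static bool IsGreek(string s) => Regex.IsMatch(s, @"\A\p{IsGreek}\z");
}
```

Named blocks describe ranges of code points, not languages, so a block such as IsGreek is only an approximation of "letters used in Greek".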

See character categories selection with \p and \w Unicode semantics.

All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."

Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still use accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words are often included (this will often be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.

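This regex allows only basic Latin letters, Latin-1 accented letters, and spaces through (note that the range À-ÿ also includes the symbols × and ÷):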
[a-zA-ZÀ-ÿ ]

Related

How to escape variable name when using Roslyn C# Syntax Factory?

So I'm using Roslyn SyntaxFactory to generate C# code.
Is there a way for me to escape variable names when generating a variable name using IdentifierName(string)?
Requirements:
It would be nice if Unicode is supported but I suppose ASCII can suffice
It would be nice if it's reversible
Always same result for same input ("a" is always "a")
Unique result for each input ("a?"->"a_" cannot be same as "a!"->"a_")
Can convert from 1 special character to multiple single ones

The implication from the API docs seems to be that it expects a valid C# identifier here, so Roslyn's not going to provide an escaping mechanism for you. Therefore, it falls to you to define a string transformation such that it achieves what you want.
The way to do this would be to look at how other things already do it. Look at HTML entities, which are always introduced using &. They can always be distinguished easily, and there's a way to encode a literal & as well so that you don't restrict your renderable character set. Or consider how C# strings allow you to include string delimiters and other special characters in the string through the use of \.
You need to pick a character which is valid in C# identifiers to be your 'marker' for a sequence which represents one of the non-identifier characters you want to encode, and a way to allow that character to also be represented. Then make a mapping table for what comes after the marker for each of the encoded characters. If you want to do all of Unicode, the easiest way is probably to just use Unicode codepoint numbers. The resulting identifiers might not be very readable, but maybe that doesn't matter in your use case.
Once you have a suitable system worked out, it should be pretty straightforward to write a string transformation function which implements it.
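One possible shape for such a transformation, as a sketch only. The scheme is an assumption chosen for illustration: "_" is the marker, a literal "_" becomes "__", and any other character outside ASCII letters and digits (or a leading digit) becomes "_x" followed by four hex digits of its UTF-16 code unit. The class name is made up.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical reversible escaping scheme for identifier names.
public static class IdentifierEscaper
{
    public static string Escape(string input)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            bool asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            bool asciiDigit = c >= '0' && c <= '9';
            if (asciiLetter || (asciiDigit && i > 0))
                sb.Append(c);                       // passes through unchanged
            else if (c == '_')
                sb.Append("__");                    // literal marker character
            else
                sb.Append("_x").Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }

    // Assumes the input was produced by Escape; malformed input is not validated.
    public static string Unescape(string escaped)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < escaped.Length; i++)
        {
            char c = escaped[i];
            if (c != '_') { sb.Append(c); continue; }
            if (escaped[i + 1] == '_') { sb.Append('_'); i += 1; continue; }
            // "_x" followed by exactly four hex digits
            sb.Append((char)Convert.ToInt32(escaped.Substring(i + 2, 4), 16));
            i += 5;
        }
        return sb.ToString();
    }
}
```

With this scheme "a?" becomes "a_x003F" and "a!" becomes "a_x0021", so distinct inputs stay distinct and the mapping can be reversed. It does not handle the empty string or names that collide with C# keywords; those would need separate treatment before passing the result to IdentifierName(string).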

Counting special UTF-8 character

I'm looking for a way to count special characters that are formed by more than one character, but found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters long when we use the normal way to find the length. I am wondering whether Tamil is the only script that will cause this problem and whether there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)

Use StringInfo.LengthInTextElements:
using System;
using System.Globalization;
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.

A minor nitpick: strings in .NET use UTF-16, not UTF-8.
When you're talking about the length of a string, there are several different things you could mean:
1. Length in bytes. This is the old C way of looking at things, usually.
2. Length in Unicode code points. This gets you closer to the modern times and should be the way how string lengths are treated, except it isn't.
3. Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
4. Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4 and 3: 3 is what C# uses, 4 is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
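To make the four interpretations concrete, here is a rough sketch that computes each of them. The class and method names are invented for this example, and the code point count assumes a well-formed UTF-16 string (no unpaired surrogates).

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper comparing the different notions of "length".
public static class StringLengths
{
    // 1. Length in bytes, here for a UTF-8 encoding of the string.
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // 2. Length in Unicode code points: each surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        foreach (char c in s)
            if (!char.IsLowSurrogate(c)) count++;
        return count;
    }

    // 3. Length in UTF-16 code units, which is what String.Length reports.
    public static int Utf16CodeUnits(string s) => s.Length;

    // 4. Count of text elements (graphemes) as .NET defines them.
    public static int Graphemes(string s) => new StringInfo(s).LengthInTextElements;
}
```

For the Tamil string in the question, 2 and 3 agree because every character is in the Basic Multilingual Plane; a string containing a character outside it (such as an emoji) is where 2 and 3 start to differ.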

Simple lexical parser

I want to write a lexical parser for regular text.
So I need to detect the following tokens:
1) Word
2) Number
3) dot and other punctuation
4) "..." "!?" "!!!" and so on
I think it is not trivial to write an "if else" condition for each item.
So are there any finite state machine generators for C#?
I know of ANTLR and others, but in the time it would take me to learn how to work with these tools I could write my own "if else" FSM.
I hope to find something like:
FiniteStateMachine.AddTokenDefinition(":)","smile");
FiniteStateMachine.AddTokenDefinition(".","dot");
FiniteStateMachine.ParseText(text);

I suggest using Regular Expressions. Something like @"[a-zA-Z\-]+" will pick up words (a-z and dashes), while @"[0-9]*(\.[0-9]+)?" will pick up numbers (including decimal numbers). Dots and such are similar - @"[!\.\?]+" - and you can just add whatever punctuation you need inside the square brackets (escaping special Regex characters with a \).
Poor man's "lexer" for C# is very close to what you are looking for, in terms of being a lexer. I recommend googling regular expressions for words and numbers, or whatever else you need, to find out exactly which expressions you need.
EDIT:
Or see Justin's answer for the particular regexes.

We need to know specifics on what you consider a word or a number. That being said, I'll assume "word" means "a C#-style identifier," and "number" means "a string of base-10 numerals, possibly including (but not starting or ending with) a decimal point."
Under those definitions, words would be anything matching the following regex:
@"\b(?!\d)\w+\b"
Note that this would also match unicode. Numbers would match the following:
@"\b\d+(?:\.\d+)?\b"
Note again that this doesn't cover hexadecimal, octal, or scientific notation, although you could add that in without too much difficulty. It also doesn't cover numeric literal suffixes.
After matching those, you could probably get away with this for punctuation:
@"[^\w\d\s]+"
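The three patterns above can be combined into one alternation with named groups, which gives a very small tokenizer. This is only a sketch: the class name, the group names and the token kinds are made up for illustration, and the number alternative is tried first so that "3.14" is not split.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical regex-based tokenizer built from the word/number/punctuation patterns.
public static class SimpleLexer
{
    private static readonly Regex TokenPattern = new Regex(
        @"(?<number>\b\d+(?:\.\d+)?\b)|(?<word>\b(?!\d)\w+\b)|(?<punct>[^\w\s]+)");

    public static List<(string Kind, string Text)> Tokenize(string text)
    {
        var tokens = new List<(string Kind, string Text)>();
        foreach (Match m in TokenPattern.Matches(text))
        {
            // Exactly one of the named groups succeeded for this match.
            string kind = m.Groups["number"].Success ? "number"
                        : m.Groups["word"].Success ? "word"
                        : "punct";
            tokens.Add((kind, m.Value));
        }
        return tokens;
    }
}
```

Runs of punctuation such as "!!!" or "..." come out as a single token, which matches the fourth token type in the question. Whitespace is skipped because none of the alternatives match it.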

Regex ignore underscores

I have a regex ([-#.\/,':\w]*[\w])* and it matches all words within a text (including punctuated words like I.B.M), but I want to make it exclude underscores and I can't seem to figure out how to do it... I tried adding ^[_] (e.g. (^[_][-#.\/,':\w]*[\w])*) but it just breaks up all the words into letters. I want to preserve the word matching, but I don't want to have words with underscores in them, nor words that are entirely made up of underscores.
What's the proper way to do this?
P.S.
My app is written in C# (if that makes any difference).
I can't use A-Za-z0-9 because I have to match words regardless of the language (could be Chinese, Russian, Japanese, German, English).
Update
Here is an example:
"I.B.M should be parsed as one word w_o_r_d! Russian should work too: мплекс исторических событий."
The matches should be:
I.B.M
should
be
parsed
as
one
word
Russian
should
work
too
мплекс
исторических
событий
Note that w_o_r_d should not get matched.

Try this instead:
([-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])*
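The \w class is composed of [\p{L}\p{Mn}\p{Nd}\p{Pc}] when you're performing Unicode matching. (Or simply [a-zA-Z_0-9] if you're doing non-Unicode matching with RegexOptions.ECMAScript.) The pattern above also leaves out \p{Mn}, the nonspacing marks; add it if you need to match decomposed accented text.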
It's the \p{Pc} Unicode category -- punctuation/connector -- that causes the problem by matching underscores, so we explicitly match against the other categories without including that one.
(Further information here, "Character Classes: Word Character", and here, "Character Classes: Supported Unicode General Categories".)
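A sketch of how the category-based class might be applied to the example from the question. Two things here are additions for illustration and not part of the answer: the lookarounds (?<!\w) and (?!\w), which reject fragments that touch an underscore so that w_o_r_d is skipped entirely, and the class and method names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical word extractor using \p{L}\p{Nd} instead of \w.
public static class WordMatcher
{
    // A word starts and ends with a letter or digit, may contain the listed
    // punctuation in between, and must not be adjacent to any \w character
    // (which includes the underscore).
    private static readonly Regex Word = new Regex(
        @"(?<!\w)[\p{L}\p{Nd}](?:[-#.\/,':\p{L}\p{Nd}]*[\p{L}\p{Nd}])?(?!\w)");

    public static List<string> FindWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in Word.Matches(text))
            words.Add(m.Value);
        return words;
    }
}
```

On the sample sentence this yields I.B.M, should, be, parsed, and so on through the Russian words, while w_o_r_d produces no match at all.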

The underscore comes from \w.
Simply use A-Za-z0-9 instead.

For a more concise version of LukeH's regex, you can use simply:
([-#.\/,':\p{L}]*\p{L})*
I simply used \p{L} instead of Lu, Ll, Lt, Lo, Lm. See Supported Unicode General Categories.

Diacritics alphabetical ordering in C#

I want to know how to perform a reliable alphabetical ordering (for a listbox) of people's full names, respecting the diacritics of the language, in C#.
Thanks in advance.
Q: So you just want to treat diacritics as the "original" letter? (eg: João is the same as Joao)? – NullUserException
A: I want to treat them as they should be treated in the language I define, respecting the rules of alphabetical ordering that people apply everyday. I'm sure it's written in the grammars of each language. Thanks. – Queops

This MSDN article should give you what you need: Comparing and Sorting Data for a Specific Culture. It describes culture sensitive comparison, sorting, and normalization, with code samples.
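A minimal sketch of culture-sensitive sorting along those lines. The class and method names are invented for this example; the culture name (for instance "sv-SE") is whatever language's rules you want to apply.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: sorts names using the collation rules of a given culture.
public static class CultureSort
{
    public static List<string> SortNames(IEnumerable<string> names, string cultureName)
    {
        CultureInfo culture = CultureInfo.GetCultureInfo(cultureName);
        // A comparer that follows the culture's ordering rules, case-sensitive.
        StringComparer comparer = StringComparer.Create(culture, ignoreCase: false);

        var sorted = new List<string>(names);
        sorted.Sort(comparer);
        return sorted;
    }
}
```

The resulting order depends on the culture data available on the machine, so the same list can sort differently under different cultures; that is the point of the approach, and it is also why the list for a listbox should be sorted with the culture the users expect and not with an ordinal comparison.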

You can use s.Normalize(NormalizationForm.FormD) to normalize a string s, separating accented characters into unaccented characters followed by the accent symbol. You can then use CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark to identify accent characters c in the normalized s. Given that, you can implement your own string comparison operator to place whatever ordering you need on accents.
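The normalization step described above could look roughly like this. It is a sketch that only shows how the accent characters are identified (here they are dropped); the class and method names are made up, and a custom comparison would build on the same loop.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper: strips accents by decomposing and dropping nonspacing marks.
public static class Diacritics
{
    public static string RemoveDiacritics(string s)
    {
        var sb = new StringBuilder();
        // FormD splits an accented character into base character + accent mark(s).
        foreach (char c in s.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                sb.Append(c);
        }
        // Recompose whatever is left.
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this, "João" and "Joao" reduce to the same string, which corresponds to treating a diacritic as the "original" letter. For ordering that follows a specific language's rules, a culture-sensitive comparison is usually the better fit.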
