So I'm using Roslyn SyntaxFactory to generate C# code.
Is there a way for me to escape variable names when generating a variable name using IdentifierName(string)?
Requirements:
It would be nice if Unicode is supported but I suppose ASCII can suffice
It would be nice if it's reversible
Always the same result for the same input ("a" always escapes to "a")
A distinct result for each distinct input (if "a?" escapes to "a_", then "a!" must not also escape to "a_")
One special character may be converted into multiple ordinary ones
The implication from the API docs seems to be that it expects a valid C# identifier here, so Roslyn's not going to provide an escaping mechanism for you. Therefore, it falls to you to define a string transformation such that it achieves what you want.
The way to do this would be to look at how other things already do it. Look at HTML entities, which are always introduced using &. They can always be distinguished easily, and there's a way to encode a literal & as well so that you don't restrict your renderable character set. Or consider how C# strings allow you to include string delimiters and other special characters in the string through the use of \.
You need to pick a character which is valid in C# identifiers to be your 'marker' for a sequence which represents one of the non-identifier characters you want to encode, and a way to allow that character to also be represented. Then make a mapping table for what comes after the marker for each of the encoded characters. If you want to do all of Unicode, the easiest way is probably to just use Unicode codepoint numbers. The resulting identifiers might not be very readable, but maybe that doesn't matter in your use case.
Once you have a suitable system worked out, it should be pretty straightforward to write a string transformation function which implements it.
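As a sketch of one such scheme, here is a minimal, self-contained example that uses `_` as the marker character and hex code points for everything else. The class and method names are made up for illustration, and `char.IsLetterOrDigit` is a deliberate simplification — Roslyn's `SyntaxFacts.IsIdentifierPartCharacter` would be the more precise validity test.

```csharp
using System;
using System.Text;

static class IdentifierEscaper
{
    public static string Escape(string name)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < name.Length; i++)
        {
            char c = name[i];
            if (c == '_')
                sb.Append("__");                        // literal marker -> doubled
            else if (char.IsLetterOrDigit(c) && !(i == 0 && char.IsDigit(c)))
                sb.Append(c);                           // passes through unchanged
            else
                sb.Append("_u").Append(((int)c).ToString("X4")).Append('_');
        }
        return sb.ToString();
    }

    // Assumes well-formed input produced by Escape.
    public static string Unescape(string id)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < id.Length; i++)
        {
            if (id[i] != '_') { sb.Append(id[i]); continue; }
            if (id[i + 1] == '_') { sb.Append('_'); i++; }     // "__" -> '_'
            else
            {
                int end = id.IndexOf('_', i + 2);              // "_uXXXX_"
                sb.Append((char)Convert.ToInt32(id.Substring(i + 2, end - i - 2), 16));
                i = end;
            }
        }
        return sb.ToString();
    }
}
```

This satisfies the requirements above: `Escape("a?")` gives `"a_u003F_"` while `Escape("a!")` gives `"a_u0021_"`, so distinct inputs stay distinct, and `Unescape` round-trips any input.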
Related
In my app I compare strings. I have strings that look the same, but some contain a regular space while others contain a non-breaking space (nbsp), so when I compare them they come out as different even though they represent the same entity. That's why I want to decode the strings before comparing: that way the nbsp would be converted to a space in both strings and they would compare as equal. So here's what I do:
HttpUtility.HtmlDecode(string1)[0]
HttpUtility.HtmlDecode(string2)[0]
But I still get that string1[0] has a character code of 160, and string2[0] has a character code of 32.
Obviously I am not understanding the concept. What am I doing wrong?
You are trying to compare two different characters, no matter how resembling they might seem to you.
The fact that they have different character codes is enough to make the comparison fail. The easiest thing to do is replace the non-breaking space by a regular space and then compare them.
bool c = html.Replace('\u00A0', ' ').Equals(regular);
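To make the distinction concrete, here is a small self-contained demo. Note that `HtmlDecode` converts the entity `&nbsp;` into the character U+00A0; it does not convert U+00A0 into a plain space, which is why the decode step alone doesn't help.

```csharp
using System;

string html = "a\u00A0b";     // contains a non-breaking space (code 160)
string regular = "a b";       // contains a regular space (code 32)

Console.WriteLine((int)html[1]);    // 160
Console.WriteLine((int)regular[1]); // 32

// Replace the non-breaking space before comparing:
Console.WriteLine(html.Replace('\u00A0', ' ').Equals(regular)); // True
```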
I'm trying to find a way to count characters that are formed from more than one char, but I've found no solution online.
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but the normal way of finding the length reports 9 in this case. I'm wondering whether Tamil is the only script that causes this problem, and whether there is a solution. I'm currently looking for one in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
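If you also need to iterate over the perceived characters rather than just count them, `StringInfo` can enumerate the text elements too:

```csharp
using System;
using System.Globalization;

var text = "வாழைப்பழம";
var enumerator = StringInfo.GetTextElementEnumerator(text);
while (enumerator.MoveNext())
    Console.WriteLine((string)enumerator.Current); // one perceived character per line
```

The number of elements enumerated matches `LengthInTextElements`, so each iteration gives you one "character" as the user sees it, even when it spans several `Char` values.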
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or length of a string.
In your case your confusion stems from the difference between 4. and 3. 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை – the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
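The four notions above can all be observed directly in C# with the same Tamil string (the byte count assumes UTF-8 as the chosen encoding):

```csharp
using System;
using System.Globalization;
using System.Text;

var text = "வாழைப்பழம";

// 1. Length in bytes -- depends on the encoding you pick.
Console.WriteLine(Encoding.UTF8.GetByteCount(text)); // 27 (3 bytes per Tamil char)

// 3. Length in UTF-16 code units -- what String.Length returns.
Console.WriteLine(text.Length); // 9

// 2. Length in Unicode code points -- count surrogate pairs once.
int codePoints = 0;
for (int i = 0; i < text.Length; i++)
{
    codePoints++;
    if (char.IsHighSurrogate(text[i])) i++; // skip the low surrogate
}
Console.WriteLine(codePoints); // 9 here: Tamil is inside the BMP, so 2. == 3.

// 4. Count of graphemes (text elements).
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
```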
I'm working on a new feature for a C# application that will process a text given by the user. This text can contain any character, but everything that is between braces ({}) or between brackets ([]) will be treated on a special way (basically, the text inside brackets will be replaced for another text, and the braces will indicate a subsection in the given text and will be processed differently).
So, I want to give the user the choice to use braces and brackets on his text, so the first thing I thought was to use "{{" to represent "{", and the same for all other special characters, but this will give problems. If he wants to open a subsection and wants the first character in the subsection to be "{", then he would write "{{{", but that's the same thing he would write if he would like the character before the subsection to be "{". So this causes an ambiguity.
Now I'm thinking I could use "\" to escape braces and brackets, and use "\\" to represent "\". And I'm kinda figuring out how to process this, but I got a feeling I'm trying to reinvent the wheel here. Wonder if there is a known algorithm or library that does what I'm trying to do.
Why don't you use an existing markup convention? There are plenty of lightweight syntaxes to choose from; depending on your user population, some of them might already be familiar with MediaWiki markup and/or BBcode and/or reST and/or Markdown.
Why don't you use XML tags instead of special characters?
<section>
Blah blah blah blah <replace id="some identifier" />
</section>
This approach would let you parse your text using any XML parser in Microsoft .NET and any other platform. And you'll save time because there's nothing to escape.
I'd recommend using \ to escape {} chars in the text and un-escaped {} to surround a subsection. This is how C# handles " chars in a string. Using double braces introduces ambiguities and makes correctly processing the text difficult, if not impossible. Your choice also depends on your target users: developers are comfortable with escape chars, but they can be confusing to non-dev users. You might want to use tags like <sub> and </sub> to indicate a subsection. Either way, you can use a regular expression to parse the user's text into a Regex.Matches collection.
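A minimal sketch of the backslash-unescaping step, assuming `\` escapes `{`, `}`, `[`, `]` and itself (treating any other character after `\` as an error is a design choice here, not a fixed rule):

```csharp
using System;
using System.Text;

static string Unescape(string input)
{
    var sb = new StringBuilder();
    for (int i = 0; i < input.Length; i++)
    {
        char c = input[i];
        if (c != '\\') { sb.Append(c); continue; }

        if (i + 1 >= input.Length)
            throw new FormatException("Dangling escape character at end of input.");

        char next = input[++i];
        if (next is '\\' or '{' or '}' or '[' or ']')
            sb.Append(next);                 // escaped char taken literally
        else
            throw new FormatException($"Unknown escape: \\{next}");
    }
    return sb.ToString();
}

Console.WriteLine(Unescape(@"a\{b\\c\}")); // a{b\c}
```

A real parser would first split the text on *un-escaped* `{`, `}`, `[`, `]` to find subsections and replacements, then apply this unescaping to each piece.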
A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found on the Character Classes documentation on MSDN.
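For example (`\p{L}` is the general "any letter" category; `IsGreek` is one of the named blocks):

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(Regex.IsMatch("åäö", @"^\p{L}+$"));       // True: all letters
Console.WriteLine(Regex.IsMatch("abc1", @"^\p{L}+$"));      // False: contains a digit
Console.WriteLine(Regex.IsMatch("αβγ", @"^\p{IsGreek}+$")); // True: Greek block
```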
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
What about \p{name} ?
Matches any character in the named character class specified by {name}.
Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z,
IsGreek, IsBoxDrawing.
I don't know enough about unicode, but maybe your characters fit a unicode class?
See character categories selection with \p and \w unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still take their accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words may be included (this will often be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined you may be able to create a definition, but this requires linguistic experience in that language.
This regex allows only valid symbols through:
[a-zA-ZÀ-ÿ ]
Regex.IsMatch( "foo", "[\U00010000-\U0010FFFF]" )
Throws: System.ArgumentException: parsing "[-]" - [x-y] range in reverse order.
Looking at the hex values for \U00010000 and \U0010FFFF I get 0xd800 0xdc00 for the first character and 0xdbff 0xdfff for the second.
So I guess I really have just one problem: why are Unicode characters written with \U split into two chars in the string?
They're surrogate pairs. Look at the values: they're over 65535, and a char is only a 16-bit value. How would you express 65536 in only 16 bits?
Unfortunately it's not clear from the documentation how (or whether) the regular expression engine in .NET copes with characters which aren't in the basic multilingual plane. (The \uxxxx pattern in the regular expression documentation only covers 0-65535, just like \uxxxx as a C# escape sequence.)
Is your real regular expression bigger, or are you actually just trying to see if there are any non-BMP characters in there?
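The split you're seeing can be demonstrated directly:

```csharp
using System;

string s = "\U00010000"; // a single code point outside the BMP

Console.WriteLine(s.Length);                  // 2: stored as a surrogate pair
Console.WriteLine(((int)s[0]).ToString("X")); // D800 (high surrogate)
Console.WriteLine(((int)s[1]).ToString("X")); // DC00 (low surrogate)
Console.WriteLine(char.ConvertToUtf32(s, 0)); // 65536, the actual code point
```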
To work around such things with the .NET regex engine, I'm using the following trick:
"[\U00010000-\U0010FFFF]" is replaced with [\uD800-\uDBFF][\uDC00-\uDFFF]
The idea behind this is that since .NET regexes handle code units instead of code points, we provide the engine with the surrogate ranges as regular characters. It's also possible to specify narrower ranges by operating on the edges, e.g. [\U00011DEF-\U00013E07] is the same as (?:\uD807[\uDDEF-\uDFFF])|(?:[\uD808-\uD80E][\uDC00-\uDFFF])|(?:\uD80F[\uDC00-\uDE07])
It's harder to read and work with, and it's not as flexible, but it still serves as a workaround.
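For the full-range case, the workaround looks like this in practice:

```csharp
using System;
using System.Text.RegularExpressions;

// Match any character outside the BMP by matching its surrogate pair
// as two UTF-16 code units, since the .NET regex engine works on code
// units and does not support \U escapes in patterns.
var nonBmp = new Regex("[\uD800-\uDBFF][\uDC00-\uDFFF]");

Console.WriteLine(nonBmp.IsMatch("\U00010000")); // True: lowest non-BMP char
Console.WriteLine(nonBmp.IsMatch("\U0010FFFF")); // True: highest code point
Console.WriteLine(nonBmp.IsMatch("abc"));        // False: all in the BMP
```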