Is there an easy way to replicate the behavior of MySQL's utf8_general_ci collation in C#?
In particular, given a Unicode string, I want to generate a(n ASCII?) string that can then be trivially sorted or compared, as utf8_general_ci would.
I found this question, which shows how to strip accents from strings; that looks similar but not quite equivalent, e.g., it doesn't decompose ß into ss.
For my purposes, that may end up being good enough, but if there's a way to replicate its behavior completely I'd prefer that.
Take a look at the SortKey class and its KeyData property.
For a given collation (a CultureInfo in .NET terms) and a string, you can get a naturally sortable byte array using MyCultureInfo.CompareInfo.GetSortKey(mystring).KeyData.
I do not think the collation keys for .NET and for MySQL will necessarily match; however, both use the same technique, based on the Unicode Collation Algorithm.
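For illustration, a minimal sketch of building sort keys this way (the culture and sample strings here are just examples; pick whichever CultureInfo best approximates your MySQL collation):

using System;
using System.Globalization;

class SortKeyDemo
{
    static void Main()
    {
        // Case-insensitive sort keys under a chosen culture.
        CompareInfo ci = new CultureInfo("en-US").CompareInfo;

        byte[] k1 = ci.GetSortKey("Straße", CompareOptions.IgnoreCase).KeyData;
        byte[] k2 = ci.GetSortKey("strasse", CompareOptions.IgnoreCase).KeyData;

        // The byte arrays compare ordinally the way the source strings
        // compare culturally, so they can be stored and sorted as-is.
        Console.WriteLine(BitConverter.ToString(k1));
        Console.WriteLine(BitConverter.ToString(k2));
    }
}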
Is it possible to use regular expressions to define syntax highlighting in Scintilla? And if so, how to do it?
I have a custom language to process, which cannot be described in simple terms of keywords and delimiters. The meaning of particular structures in this language depends only on their position relative to keywords. I have a regular-expression-based parser for this format; all I need is to apply the regex-defined rules as text styles.
I mean if something matches regex1, it should have style1. Is it possible? How?
If not, can I set styles for manually selected ranges? I mean assigning a style number to a specified character range in the editor. How would I do that?
Is it possible to define Scintilla styles in code, not in an XML file?
EDIT:
OK, I've found a way.
foreach (Match m in Patterns.Keyword0.Matches(Encoding.ASCII.GetString(e.RawText)))
    e.GetRange(m.Index, m.Index + m.Length).SetStyle(1);
The problem is the RawText property. It's a byte buffer of UTF-8-encoded text. The Text property contains nice UTF-16 text, but the GetRange method accepts byte offsets, not character offsets. If I convert on each TextChanged event, I lose almost all the speed advantage of using Scintilla.
Of course, the easiest way would be to change the internal encoding to UTF-16, but when I do, I get an exception saying this encoding is not supported. The only one supported seems to be UTF-8, which is ridiculously hard (and slow) to process.
I'm hitting a wall here.
The key to this is to set the lexer to SCLEX_CONTAINER and then handle the SCN_STYLENEEDED notification. This means you only ever have to process the text that actually needs styling.
There are several guides linked at the top of the Scintilla Documentation that detail various aspects of implementing custom lexers, so I won't bother repeating any of that here.
As for performance: I've written custom Scintilla lexers in Python that decode the UTF-8 text when styling and have never noticed any significant issues, so I'd be amazed if you couldn't at least match that using C#.
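For reference, a rough C# sketch of the container approach, assuming a ScintillaNET-style API (Lexer.Container, the StyleNeeded event, StartStyling/SetStyling); adjust the calls to whatever binding you actually use:

// Requires: using System.Text.RegularExpressions;
// 'scintilla' is the editor control instance.
scintilla.Lexer = Lexer.Container;
scintilla.StyleNeeded += (sender, e) =>
{
    int start = scintilla.GetEndStyled();
    string text = scintilla.GetTextRange(start, e.Position - start);

    // Styling is sequential: position once, then emit runs of styles.
    scintilla.StartStyling(start);
    int pos = 0;
    foreach (Match m in Patterns.Keyword0.Matches(text))
    {
        scintilla.SetStyling(m.Index - pos, 0); // default style up to the match
        scintilla.SetStyling(m.Length, 1);      // style 1 for the keyword
        pos = m.Index + m.Length;
    }
    scintilla.SetStyling(text.Length - pos, 0); // remainder of the range
};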
In my answer to this question, I mentioned that we used UpperCamelCase parsing to get a description of an enum constant not decorated with a Description attribute, but it was naive, and it didn't work in all cases. I revisited it, and this is what I came up with:
var result = Regex.Replace(camelCasedString,
    @"(?<a>(?<!^)[A-Z][a-z])", @" ${a}");
result = Regex.Replace(result,
    @"(?<a>[a-z])(?<b>[A-Z0-9])", @"${a} ${b}");
The first Replace looks for an uppercase letter, followed by a lowercase letter, EXCEPT where the uppercase letter is the start of the string (to avoid having to go back and trim), and adds a preceding space. It handles your basic UpperCamelCase identifiers, and leading all-upper acronyms like FDICInsured.
The second Replace looks for a lowercase letter followed by an uppercase letter or a number, and inserts a space between the two. This is to handle special but common cases of middle or trailing acronyms, or numbers in an identifier (except leading numbers, which are usually prohibited in C-style languages anyway).
Running some basic unit tests, the combination of these two correctly separated all of the following identifiers: NoDescription, HasLotsOfWords, AAANoDescription, ThisHasTheAcronymABCInTheMiddle, MyTrailingAcronymID, TheNumber3, IDo3Things, IAmAValueWithSingleLetterWords, and Basic (which didn't have any spaces added).
So, I'm posting this first to share it with others who may find it useful, and second to ask two questions:
Anyone see a case that would follow common CamelCase-ish conventions, that WOULDN'T be correctly separated into a friendly string this way? I know it won't separate adjacent acronyms (FDICFCUAInsured), recapitalize "properly" camelCased acronyms like FdicInsured, or capitalize the first letter of a lowerCamelCased identifier (but that one's easy to add - result = Regex.Replace(result, "^[a-z]", m=>m.ToString().ToUpper());). Anything else?
Can anyone see a way to make this one statement, or more elegant? I was looking to combine the Replace calls, but as they do two different things to their matches it can't be done with these two strings. They could be combined into a method chain with a RegexReplace extension method on String, but can anyone think of better?
So while I agree with Hans Passant here, I have to say that I had to try my hand at making it one regex as an armchair regex user.
(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))
Is what I came up with. It seems to pass all the tests you put forward in the question.
So
var result = Regex.Replace(camelCasedString, @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))", @" ${a}");
Does it in one pass.
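Wrapped up as a runnable check against the identifiers from the question:

using System;
using System.Text.RegularExpressions;

class CamelCaseDemo
{
    static readonly Regex Splitter = new Regex(
        @"(?<a>(?<!^)((?:[A-Z][a-z])|(?:(?<!^[A-Z]+)[A-Z0-9]+(?:(?=[A-Z][a-z])|$))|(?:[0-9]+)))");

    static void Main()
    {
        string[] ids = { "NoDescription", "HasLotsOfWords", "AAANoDescription",
                         "ThisHasTheAcronymABCInTheMiddle", "MyTrailingAcronymID",
                         "TheNumber3", "IDo3Things", "IAmAValueWithSingleLetterWords",
                         "Basic" };
        foreach (var id in ids)
            Console.WriteLine(Splitter.Replace(id, " ${a}")); // e.g. "This Has The Acronym ABC In The Middle"
    }
}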
Not that this directly answers the question, but why not test by taking the standard C# API and converting each class name into a friendly name? It'd take some manual verification, but it'd give you a good list of standard names to test.
Let's say every case you come across works with this (you're asking us for examples that won't and then giving us some, so you don't even have a question left).
This still binds UI to programmatic identifiers in a way that will make both programming and UI changes brittle.
It still assumes your program will only be used in one language. Either your potential market is so small that just indexing an array of names would be scalable enough (e.g., a one-client bespoke or in-house project), or you are assuming you will never be successful enough to need to be available in other languages or other dialects of your first-chosen language.
Does "well, it'll work as long as we're a failure" sound like a passing grade in balancing designs?
Either code it to use resources, or else code it to pass the enum name blindly or use an array of names, as that at least will be modifiable afterwards.
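For instance, a rough sketch of the resource-based route (the key naming convention here is made up; wire it to your own .resx):

using System;
using System.Resources;

static class EnumDisplay
{
    // Looks up e.g. "AccountType_FDICInsured" in the given resources and
    // falls back to the raw enum name when no translation exists.
    public static string FriendlyName(Enum value, ResourceManager resources)
    {
        string key = value.GetType().Name + "_" + value;
        return resources.GetString(key) ?? value.ToString();
    }
}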
I need to check whether a string is a valid C++ identifier. E.g.:
isValidCppIdentifier("_foo") // returns true
isValidCppIdentifier("9bar") // returns false
isValidCppIdentifier("var'") // returns false
I wrote some quick code but it fails:
my regex is "[a-zA-Z_$][a-zA-Z0-9_$]*"
and I simply do regex.IsMatch(inputString).
Thanks.
It should work with some added anchoring:
"^[a-zA-Z_][a-zA-Z0-9_]*$"
If you really need to support ludicrous identifiers using Unicode, feel free to read one of the various versions of the standard and add all the ranges into your regexp (for example, pages 713 and 714 of http://www-d0.fnal.gov/~dladams/cxx_standard.pdf)
Matti's answer will work to sanitize identifiers before inserting into C++ code, but won't handle C++ code as input very well. It will be annoying to separate things like L"wchar_t string", where L is not an identifier. And there's Unicode.
Clang, Apple's compiler which is built on a philosophy of modularity, provides a set of tokenizer functions. It looks like you would want clang_createTranslationUnitFromSourceFile and clang_tokenize.
I didn't check to see if it handles \Uxxxx or anything. Can't make any kind of guarantees. The last time I used LLVM was five years ago and it wasn't the greatest experience… but not the worst either.
On the other hand, GCC certainly has it, although you have to figure out how to use cpp_lex_direct.
If I want to have a case-insensitive string-keyed dictionary, which version of StringComparer should I use given these constraints:
The keys in the dictionary come from either C# code or config files written in an English locale only (either US or UK)
The software is internationalized and will run in different locales
I normally use StringComparer.InvariantCultureIgnoreCase but wasn't sure if that is the right choice here. Here is example code:
Dictionary<string, object> stuff = new Dictionary<string, object>(StringComparer.InvariantCultureIgnoreCase);
There are three kinds of comparers:
Culture-aware
Culture invariant
Ordinal
Each comparer has a case-sensitive as well as a case-insensitive version.
An ordinal comparer uses the ordinal values of characters. This is the fastest comparer; it should be used for internal purposes.
A culture-aware comparer considers aspects that are specific to the culture of the current thread. It knows the "Turkish i", "Spanish LL", etc. problems. It should be used for UI strings.
The culture-invariant comparer is actually not defined and can produce unpredictable results, and thus should never be used at all.
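A quick illustration of why the choice matters, using the classic Turkish-i case:

using System;
using System.Globalization;

class ComparerDemo
{
    static void Main()
    {
        var turkish = StringComparer.Create(new CultureInfo("tr-TR"), ignoreCase: true);

        // In Turkish the uppercase of 'i' is 'İ' (dotted capital I), so a
        // culture-aware comparison considers these two strings different.
        Console.WriteLine(turkish.Equals("file", "FILE"));                          // False
        Console.WriteLine(StringComparer.OrdinalIgnoreCase.Equals("file", "FILE")); // True
    }
}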
References
New Recommendations for Using Strings in Microsoft .NET 2.0
This MSDN article covers everything you could possibly want to know in great depth, including the Turkish-I problem.
It's been a while since I read it, so I'm off to do so again. See you in an hour!
The concept of "case insensitive" is a linguistic one, and so it doesn't make sense without a culture.
See this blog for more information.
That said, if you are just talking about strings using the Latin alphabet, then you will probably get away with InvariantCulture.
It is probably best to create the dictionary with StringComparer.CurrentCulture, though. This will allow "ß" to match "ss" in your dictionary under a German culture, for example.
Since the keys are your known fixed values, then either InvariantCultureIgnoreCase or OrdinalIgnoreCase should work. Avoid the culture-specific one, or you could hit some of the more "fun" things like the "Turkish i" problem. Obviously, you'd use a cultured comparer if you were comparing cultured values... but it sounds like you aren't.
StringComparer.OrdinalIgnoreCase is slightly faster than InvariantCultureIgnoreCase, FWIW ("An ordinal comparison is fast, but culture-insensitive", according to MSDN).
You'd have to be doing a lot of comparisons to notice the difference of course.
The Invariant Culture exists specifically to deal with strings that are internal to the program and have nothing to do with user data or UI. It sounds like this is the case for this situation.
System.Collections.Specialized includes StringDictionary. The Remarks section of the MSDN documentation states: "A key cannot be null, but a value can.
The key is handled in a case-insensitive manner; it is translated to lowercase before it is used with the string dictionary.
In .NET Framework version 1.0, this class uses culture-sensitive string comparisons. However, in .NET Framework version 1.1 and later, this class uses CultureInfo.InvariantCulture when comparing strings. For more information about how culture affects comparisons and sorting, see Comparing and Sorting Data for a Specific Culture and Performing Culture-Insensitive String Operations."
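A quick sketch of that behavior:

using System;
using System.Collections.Specialized;

class StringDictionaryDemo
{
    static void Main()
    {
        var sd = new StringDictionary();
        sd.Add("Alpha", "first");

        // Lookups are case-insensitive because keys are lowercased internally.
        Console.WriteLine(sd["ALPHA"]);             // first
        Console.WriteLine(sd.ContainsKey("alpha")); // True
    }
}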
How can I insert ASCII special characters (e.g. with the ASCII value 0x01) into a string?
I ask because I am using the following:
str = str.Replace("<TAG1>", Convert.ToChar(0x01).ToString());
and I feel that there must be a better way than this. Any ideas?
Update:
Also, if I use this methodology, do I need to worry about Unicode and ASCII clashing?
I believe you can use \uXXXX to insert specified codes into your string.
ETA: I just tested it and it works. :-)
using System;

class Uxxxx {
    public static void Main() {
        Console.WriteLine("\u20AC");
    }
}
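Applied to your Replace call, that becomes simply (note that Replace returns a new string):

str = str.Replace("<TAG1>", "\u0001");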
Also, if I use this methodology, do I need to worry about Unicode and ASCII clashing?
Your first problem will be your tags clashing with ASCII. Once you get to TAG10, you will clash with 0x0A: line feed. If you ensure that you will never get more than nine tags, you should be safe. Unicode encoding (or rather, UTF-8) is identical to ASCII when the byte values are between 0 and 127; they only differ when the top bit is set.
and I feel that there must be a better way than this. Any ideas?
It looks as if you're trying to manipulate a binary chunk using textual tools. If you want to insert the byte 0x01, for example, you're not manipulating text anymore, since you don't care what that byte might represent, and since it looks like you don't even care which encoding you'll be outputting.
A better way would be to treat the thing you're manipulating as a binary chunk of data, which would let you insert bits and bytes easily, without using brittle workarounds and worrying about side effects.
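As a minimal sketch of that idea, assuming ASCII text around the tags:

using System;
using System.IO;
using System.Text;

class BinaryBuild
{
    // Replaces "<TAG1>" with the raw byte 0x01 while encoding the
    // surrounding text as ASCII, so no textual escaping is involved.
    static byte[] BuildMessage(string template)
    {
        string[] parts = template.Split(new[] { "<TAG1>" }, StringSplitOptions.None);
        using (var ms = new MemoryStream())
        {
            for (int i = 0; i < parts.Length; i++)
            {
                if (i > 0) ms.WriteByte(0x01);
                byte[] text = Encoding.ASCII.GetBytes(parts[i]);
                ms.Write(text, 0, text.Length);
            }
            return ms.ToArray();
        }
    }

    static void Main()
    {
        byte[] msg = BuildMessage("start<TAG1>end");
        Console.WriteLine(BitConverter.ToString(msg)); // 73-74-61-72-74-01-65-6E-64
    }
}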