String StartsWith() issue with Danish text

Can anyone explain this behaviour?
var culture = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = culture;
"daab".StartsWith("da"); //false
I know that it can be fixed by specifying StringComparison.InvariantCulture, but I'm confused by the behavior.
I also know that "aA" and "AA" are not considered the same in a Danish case-insensitive comparison (see http://msdn.microsoft.com/en-us/library/xk2wykcz.aspx), which explains this:
String.Compare("aA", "AA", new CultureInfo("da-DK"), CompareOptions.IgnoreCase) // -1 (not equal)
Is this linked to the behavior of the first code snippet?

Here is a test that illustrates the problem. daab and dåb (the same word in old and modern spelling respectively) mean baptism/christening.
public class can_handle_remnant_of_danish_language
{
    [Fact]
    public void daab_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("då")); // Fails
    }
    [Fact]
    public void daab_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("da")); // Fails
    }
    [Fact]
    public void daab_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("daab".StartsWith("daa")); // Succeeds
    }
    [Fact]
    public void dåb_start_with_daa()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("daa")); // Fails
    }
    [Fact]
    public void dåb_start_with_da()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("da")); // Fails
    }
    [Fact]
    public void dåb_start_with_då()
    {
        var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
        Assert.True("dåb".StartsWith("då")); // Succeeds
    }
}
All of the above tests should succeed according to my understanding of the language, and I'm Danish!
I don't have a degree in grammar, though. :-)
Seems like a bug to me.

Like Nappy said, it's a feature of the Danish language, where "aa" and "å" are still treated as the same. Danish has another two letters, æ and ø, but I am not sure if they can be written using two letters as well.
I think in the second example "aA" is not changed while "AA" is changed to "Å". Just to confuse things even more, "Aa" is considered equal to "AA" and "aa" only when using a case-insensitive comparison.

The modern spelling of "baptism" in Danish, namely dåb, is certainly not considered to start with da by a Danophone. If daab is supposed to be an old-fashioned spelling of dåb, it is a bit philosophical whether it starts with da or not. But for (modern) collation purposes, it does not (alphabetically, daab sorts after disk, not before).
However, if your string is not supposed to represent natural language, but is instead some kind of technical code, like hexadecimal digits, surely you do not want to use any culture-specific rules. The solution here is not to use the invariant culture. The invariant culture has (English) rules itself!
Instead, you want to use ordinal comparison.
Ordinal comparison simply compares the strings char by char, without any assumptions of what sequences are "equivalent" in some sense. (Technical remark: Each char is a UTF-16 code unit, not a "character". Ordinal comparison is ignorant of the rules of Unicode normalization.)
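As a minimal sketch of that point, the snippet below reruns some of the question's prefix checks with StringComparison.Ordinal. The class and method names (OrdinalPrefixDemo, StartsWithOrdinal) are made up for illustration, and the culture is set to da-DK only to show that the ordinal results do not depend on it.

```csharp
using System;
using System.Globalization;

static class OrdinalPrefixDemo
{
    // Ordinal prefix test: compares UTF-16 code units and ignores the current culture.
    public static bool StartsWithOrdinal(string text, string prefix) =>
        text.StartsWith(prefix, StringComparison.Ordinal);

    static void Main()
    {
        // Setting the culture has no effect on the ordinal checks below.
        CultureInfo.CurrentCulture = new CultureInfo("da-DK");

        Console.WriteLine(StartsWithOrdinal("daab", "da")); // True
        Console.WriteLine(StartsWithOrdinal("daab", "då")); // False
        Console.WriteLine(StartsWithOrdinal("dåb", "då"));  // True

        // The culture-aware default may print something else here,
        // depending on the culture and the .NET version.
        Console.WriteLine("daab".StartsWith("da"));
    }
}
```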
I think the confusion arises because, by default, some string methods use a culture-aware comparison, and other string methods use the ordinal comparison.
The following examples all use a culture-aware comparison:
"Straße".StartsWith("Strasse", StringComparison.CurrentCulture)
"Straße".Equals("Strasse", StringComparison.CurrentCulture)
"ne\u0301e".StartsWith("née", StringComparison.CurrentCulture)
"ne\u0301e".Equals("née", StringComparison.CurrentCulture)
"Straße".StartsWith("Strasse") // CurrentCulture is default for 'StartsWith'!
"ne\u0301e".StartsWith("née") // CurrentCulture is default for 'StartsWith'!
Each of the above may depend on the .NET version as well! (As an example, the first one gives true if the current culture is the invariant culture and you are under .NET Framework 4.8; but it gives false if the current culture is the invariant culture and you use .NET 6.)
But these examples use ordinal comparison:
"Straße".StartsWith("Strasse", StringComparison.Ordinal)
"Straße".Equals("Strasse", StringComparison.Ordinal)
"ne\u0301e".StartsWith("née", StringComparison.Ordinal)
"ne\u0301e".Equals("née", StringComparison.Ordinal)
"Straße".Equals("Strasse") // Ordinal is default for 'Equals'!
"ne\u0301e".Equals("née") // Ordinal is default for 'Equals'!
So remember to check what the default comparison is for the string method you use, and specify the opposite one if needed. (Or always specify the comparison, even when redundant, if you prefer.)

Related

odd results when comparing strings based on culture

Is there a reason why:
string s1 = "aéa";
string s2 = "aea";
bool result = s1.Equals(s2, StringComparison.CurrentCultureIgnoreCase);
result = s1.Equals(s2, StringComparison.InvariantCultureIgnoreCase);
result is false in both cases, although my current culture is French.
I would expect at least one of the two lines to return true.
On the other hand, I get
int a = string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);
a = 0 meaning an equality.
This seems paradoxical to me. Is there an explanation?
Thanks in advance.

In the first equality check, you are ignoring only case with StringComparison.CurrentCultureIgnoreCase in your current culture (fr). é and e still differ, so the first check should be false.
In the second one, you are ignoring case in the invariant culture with StringComparison.InvariantCultureIgnoreCase. é is not equal to e in the invariant culture. Those characters are in fact different (have different meanings) in most cultures. This check should be false.
In the last one, you are ignoring nonspacing characters, such as diacritics, with CompareOptions.IgnoreNonSpace. The last one should report equality.
Also, read here.
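To make the distinction concrete, here is a small sketch that compares the same strings with and without CompareOptions.IgnoreNonSpace. The helper names are hypothetical, and the invariant culture is used instead of the asker's French culture so that the result does not depend on the machine's settings.

```csharp
using System;
using System.Globalization;

static class DiacriticsDemo
{
    // Culture-aware equality that ignores case only: 'é' and 'e' stay distinct.
    public static bool EqualsIgnoringCase(string a, string b) =>
        string.Equals(a, b, StringComparison.InvariantCultureIgnoreCase);

    // Culture-aware equality that also ignores nonspacing marks such as accents.
    public static bool EqualsIgnoringCaseAndAccents(string a, string b) =>
        string.Compare(a, b, CultureInfo.InvariantCulture,
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace) == 0;

    static void Main()
    {
        Console.WriteLine(EqualsIgnoringCase("aéa", "aea"));           // False
        Console.WriteLine(EqualsIgnoringCase("aéa", "AÉA"));           // True
        Console.WriteLine(EqualsIgnoringCaseAndAccents("aéa", "AEA")); // True
    }
}
```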

Why this string ("ʿAbdul-Baha'"^^mso:text#de) doesn't start with "?

"\"ʿAbdul-Baha'\"^^mso:text#de".StartsWith("\"") // is false
"\"Abdul-Baha'\"^^mso:text#de".StartsWith("\"") // is true
(int)'ʿ' // is 703
Could anyone tell me why?

You need to use the second parameter of the StartsWith function: StringComparison.Ordinal (or StringComparison.OrdinalIgnoreCase). This instructs the function to compare by character value and to disregard culture-specific sorting rules. This quote is from the MSDN link below:
"An operation that uses word sort rules performs a culture-sensitive comparison wherein certain nonalphanumeric Unicode characters might have special weights assigned to them. Using word sort rules and the conventions of a specific culture, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list."
This seems to affect how StartsWith behaves depending on locale/culture (see the comments on OP's post): it works for some but not for others.
In my example (unit test) below I show that if you convert the strings to a char array and look at the first character, it is actually the same. When calling StartsWith you need to add the Ordinal comparison to get the same result.
For reference my locale is Swedish.
For further info: MSDN: StringComparison Enumeration
[Test]
public void BeginsWith_test()
{
    const string string1 = "\"ʿAbdul-Baha'\"^^mso:text#de";
    const string string2 = "\"Abdul-Baha'\"^^mso:text#de";
    var chars1 = string1.ToCharArray();
    var chars2 = string2.ToCharArray();
    Assert.That(chars1[0], Is.EqualTo('"'));
    Assert.That(chars2[0], Is.EqualTo('"'));
    Assert.That(string1.StartsWith("\"", StringComparison.InvariantCulture), Is.False);
    Assert.That(string1.StartsWith("\"", StringComparison.CurrentCulture), Is.False);
    Assert.That(string1.StartsWith("\"", StringComparison.Ordinal), Is.True); // Works
    Assert.That(string2.StartsWith("\""), Is.True);
}
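Outside a unit test, the same idea can be sketched as a small helper that inspects the first UTF-16 code unit directly, so no collation rules are involved. The names QuotePrefixDemo and StartsWithQuote are hypothetical; '\u02BF' is the character with code 703 from the question.

```csharp
using System;

static class QuotePrefixDemo
{
    // Checks the first UTF-16 code unit directly, so no collation rules apply.
    public static bool StartsWithQuote(string s) =>
        s.Length > 0 && s[0] == '"';

    static void Main()
    {
        // U+02BF (decimal 703) directly follows the opening quote.
        const string withModifier = "\"\u02BFAbdul-Baha'\"^^mso:text#de";
        const string plain = "\"Abdul-Baha'\"^^mso:text#de";

        Console.WriteLine(StartsWithQuote(withModifier)); // True
        Console.WriteLine(StartsWithQuote(plain));        // True
        Console.WriteLine(withModifier.StartsWith("\"", StringComparison.Ordinal)); // True
    }
}
```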

What's the use case for int32.Parse(String, IFormatProvider) over int32.Parse(String)?

When would it make sense to use int32.Parse(String, IFormatProvider)?
As far as I can tell, this and int32.Parse(String) both use NumberStyles.Integer anyway, which only allows a plus, a minus, or digits, optionally surrounded by whitespace, so why does the locale format enter into the equation?
I know about thousand separators, but they don't matter because NumberStyles.Integer disallows them no matter your region.

Suppose you have a culture where the negative sign is M (minus). I am pretty sure no such culture exists, but assume it does. Then you can do:
string str = "M123";
var culture = new CultureInfo("en-US");
culture.NumberFormat.NegativeSign = "M";
int number = Int32.Parse(str, culture);
This would result in -123 as value. This is where you can use int32.Parse(String, IFormatProvider) overload. If you don't specify the culture, then it would use the current culture and would fail for the value M123.
(Old Answer)
It is useful with string with thousand separator
Consider the following example,
string str = "1,234,567";
System.Threading.Thread.CurrentThread.CurrentCulture = new CultureInfo("de-DE");
int number = Int32.Parse(str, CultureInfo.CurrentCulture);
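This would result in an exception: the default NumberStyles.Integer does not allow thousand separators, and in German culture the separator would be . rather than , in any case.
With the German culture still set, for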
int number = Int32.Parse("1.234", NumberStyles.AllowThousands);
The above would parse successfully, since the German culture uses . as thousand separator.
But if you have current culture set as US then it would give an exception.
System.Threading.Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
int number = Int32.Parse("1.234", NumberStyles.AllowThousands);
See: Int32.Parse Method (String, IFormatProvider)
The provider parameter is an IFormatProvider implementation, such as
a NumberFormatInfo or CultureInfo object. The provider parameter
supplies culture-specific information about the format of s. If
provider is null, the NumberFormatInfo object for the current culture
is used.
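The following sketch shows one way the provider argument can matter once NumberStyles.AllowThousands is enabled. ParseGrouped is a hypothetical helper, and the dot-grouping format is built by hand (cloned from the invariant culture) as an assumption, so the example does not rely on the culture data installed on a particular machine.

```csharp
using System;
using System.Globalization;

static class ParseWithProviderDemo
{
    // Parses an integer that may contain group separators, using the provider's rules.
    public static int ParseGrouped(string text, IFormatProvider provider) =>
        int.Parse(text, NumberStyles.Integer | NumberStyles.AllowThousands, provider);

    static void Main()
    {
        // The invariant culture groups digits with ','.
        Console.WriteLine(ParseGrouped("1,234,567", CultureInfo.InvariantCulture)); // 1234567

        // A hand-built format that groups with '.', similar to German conventions.
        var dotGrouping = (NumberFormatInfo)CultureInfo.InvariantCulture.NumberFormat.Clone();
        dotGrouping.NumberGroupSeparator = ".";
        dotGrouping.NumberDecimalSeparator = ",";
        Console.WriteLine(ParseGrouped("1.234.567", dotGrouping)); // 1234567

        // Without AllowThousands, the default NumberStyles.Integer rejects separators.
        Console.WriteLine(int.TryParse("1,234,567", NumberStyles.Integer,
            CultureInfo.InvariantCulture, out _)); // False
    }
}
```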

Well how about the thousand separators?
I think in USA they use ',' and in Greece they use '.'
USA: 1,000,000
Greece: 1.000.000

In case somebody else is also wondering about this 6 years later, there's still no point in using Int32.ToString(IFormatProvider?) or Int32.Parse(String, IFormatProvider?) since changing the culture makes no difference with the default format and NumberStyles.
You can run this simple test to verify:
using System;
using System.Globalization;
using System.Linq;
class IntToStringTest
{
    static void Main()
    {
        var cultures = CultureInfo.GetCultures(CultureTypes.AllCultures);
        var input = -123456789;
        var defaultOutput = input.ToString();
        var outputCulturePairs = cultures.Select(c => (Output: input.ToString(c), Culture: c));
        var parsedOutputs = outputCulturePairs.Select(p => Int32.Parse(p.Output, p.Culture));
        Console.WriteLine(outputCulturePairs.All(p => p.Output == defaultOutput));
        Console.WriteLine(parsedOutputs.All(o => o == input));
    }
}
Edit 8/8/2020: This is only true for .NET Framework. On .NET Core some Arabic cultures use the minus sign AFTER the value.

When to use XmlConvert.ToString vs Object.ToString()

When should I use XmlConvert.ToString to convert a given value, versus the ToString method on the given type?
For example:
int inputVal = 1023;
I can convert this inputVal to string representation using either method:
string strVal = inputVal.ToString();
or
string strVal = XmlConvert.ToString(inputVal);
What is the rule for using XmlConvert.ToString versus plain Object.ToString?

The XmlConvert.ToString methods are locale independent so the string representation will be consistent across different locales. With Object.ToString you may get a different representation according to the current culture associated with the thread.
So choosing one over the other depends on the scenario. XmlConvert is a good fit if you're exchanging data with another system and want a consistent textual representation, for example of a double value.
You can see the differences in the following example:
double d = 1.5;
Thread.CurrentThread.CurrentCulture = new CultureInfo("pt-PT");
Console.WriteLine(d.ToString()); // 1,5
Console.WriteLine(XmlConvert.ToString(d)); // 1.5
Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.WriteLine(d.ToString()); // 1.5
Console.WriteLine(XmlConvert.ToString(d)); // 1.5
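Beyond the decimal separator, a few other differences can be sketched as follows. The class name XmlConvertDemo and the RoundTrip helper are made up for illustration; the last line shows passing the invariant culture to ToString as an alternative when XML Schema formatting is not specifically needed.

```csharp
using System;
using System.Globalization;
using System.Xml;

static class XmlConvertDemo
{
    // Round-trips a double through its XML Schema text form.
    public static double RoundTrip(double value) =>
        XmlConvert.ToDouble(XmlConvert.ToString(value));

    static void Main()
    {
        CultureInfo.CurrentCulture = new CultureInfo("pt-PT");

        Console.WriteLine(XmlConvert.ToString(1.5));                     // 1.5
        Console.WriteLine(XmlConvert.ToString(true));                    // true
        Console.WriteLine(true.ToString());                              // True
        Console.WriteLine(XmlConvert.ToString(double.PositiveInfinity)); // INF
        Console.WriteLine(RoundTrip(1.5) == 1.5);                        // True

        // Passing the invariant culture explicitly is also culture-independent.
        Console.WriteLine(1.5.ToString(CultureInfo.InvariantCulture));   // 1.5
    }
}
```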

Decimal to string with thousand's separators?

Consider a Decimal value:
Decimal value = -1234567890.1234789012M;
i want to convert this Decimal value to a string, and include "thousands separators".
Note: i don't want to include thousand's separators, i want to include digit grouping. The difference is important for cultures that don't group numbers into thousands, or don't use commas to separate groups
Some example output with different standard formatting strings, on my computer, with my current locale:
value.ToString() = -1234567890..1234789012 (Implicit General)
value.ToString("g") = -1234567890..1234789012 (General)
value.ToString("d") = FormatException (Decimal whole number)
value.ToString("e") = -1..234568e++009 (Scientific)
value.ToString("f") = -1234567890..123 (Fixed Point)
value.ToString("n") = -12,,3456,,7890..123 (Number with commas for thousands)
value.ToString("r") = FormatException (Round trippable)
value.ToString("c") = -$$12,,3456,,7890..123 (Currency)
value.ToString("#,0.#") = -12,,3456,,7890..1
What i want (depending on culture) is:
en-US -1,234,567,890.1234789012
ca-ES -1.234.567.890,1234789012
gsw-FR -1 234 567 890,1234789012 (12/1/2012: fixed gws-FR to gsw-FR)
fr-CH -1'234'567'890.1234789012
ar-DZ 1,234,567,890.1234789012-
prs-AF 1.234.567.890,1234789012-
ps-AF 1،234،567،890,1234789012-
as-IN -1,23,45,67,890.1234789012
lo-LA (1234567,890.1234789012) (some debate if numbers should be "1,234,567,890")
qps-PLOC 12,,3456,,7890..1234789012
How can i convert a Decimal to a string, with digit groupings?
Update: Some more desired output, using my current culture of :
-1234567890M --> -12,,3456,,7890
-1234567890.1M --> -12,,3456,,7890..1
-1234567890.12M --> -12,,3456,,7890..12
-1234567890.123M --> -12,,3456,,7890..123
-1234567890.1234M --> -12,,3456,,7890..1234
-1234567890.12347M --> -12,,3456,,7890..12347
-1234567890.123478M --> -12,,3456,,7890..123478
-1234567890.1234789M --> -12,,3456,,7890..1234789
-1234567890.12347890M --> -12,,3456,,7890..1234789
-1234567890.123478901M --> -12,,3456,,7890..123478901
-1234567890.1234789012M --> -12,,3456,,7890..1234789012
Update: i tried peeking at how Decimal.ToString() manages to use the General format to show all the digits that it needs to show:
public override string ToString()
{
return Number.FormatDecimal(this, null, NumberFormatInfo.CurrentInfo);
}
except that Number.FormatDecimal is hidden somewhere:
[MethodImpl(MethodImplOptions.InternalCall)]
public static extern string FormatDecimal(decimal value, string format, NumberFormatInfo info);
So that's a dead end.

The ToString method on decimals by default uses the CultureInfo.CurrentCulture for the user's session, and thus varies based on who is running the code.
The ToString method also accepts an IFormatProvider in various overloads. This is where you need to supply your culture-specific Formatters.
For instance, if you pass the NumberFormat for fr-CH, you can format things as that culture expects:
var culture = CultureInfo.CreateSpecificCulture("fr-CH");
Decimal value = -1234567890.1234789012M;
Console.WriteLine(value.ToString("##,#.###############", culture.NumberFormat));
Will output
-1'234'567'890.1234789012
Edit #3 - rewrote using custom formatters. This should do what you want based on the new updated question.
Edit #4 - Took all of your input, and ran this:
public void TestOutput()
{
PrintValue(-1234567890M);
PrintValue(-1234567890.1M);
PrintValue(-1234567890.12M);
PrintValue(-1234567890.123M);
PrintValue(-1234567890.1234M);
PrintValue(-1234567890.12347M);
PrintValue(-1234567890.123478M);
PrintValue(-1234567890.1234789M);
PrintValue(-1234567890.12347890M);
PrintValue(-1234567890.123478901M);
PrintValue(-1234567890.1234789012M);
}
private static void PrintValue(decimal value)
{
var culture = CultureInfo.CreateSpecificCulture("qps-PLOC");
Console.WriteLine(value.ToString("##,#.###############", culture.NumberFormat));
}
Gives output matching what you supplied:
--12,,3456,,7890
--12,,3456,,7890..1
--12,,3456,,7890..12
--12,,3456,,7890..123
--12,,3456,,7890..1234
--12,,3456,,7890..12347
--12,,3456,,7890..123478
--12,,3456,,7890..1234789
--12,,3456,,7890..1234789
--12,,3456,,7890..123478901
--12,,3456,,7890..1234789012
As pointed out by Joshua, this only works for some locales.
From the looks of it then, you need to pick the lesser of two evils: Knowing the precision of your numbers, or specifying formats for each culture. I'd wager knowing the precision of your numbers may be easier.
In which case, a previous version of my answer may be of use:
To explicitly control the number of decimal places to output, you can clone the number format provided by the culture and modify the NumberDecimalDigits property.
var culture = CultureInfo.CreateSpecificCulture("fr-CH");
Decimal value = -1234567890.1234789012M;
NumberFormatInfo format = (NumberFormatInfo)culture.NumberFormat.Clone();
format.NumberDecimalDigits = 30;
Console.WriteLine(value.ToString("n", format));
This outputs:
-1'234'567'890.123478901200000000000000000000

You can specify a custom pattern (the pattern will appropriately resolve to the culture specific method of grouping and the appropriate grouping and decimal separator characters). A pattern can have positive, negative and zero sections. The positive pattern is always the same but the negative pattern depends on the culture and can be retrieved from the NumberFormatInfo's NumberNegativePattern property. Since you want as much precision as possible, you need to fill out 28 digit placeholders after the decimal; the comma forces grouping.
public static class DecimalFormatters
{
    public static string ToStringNoTruncation(this Decimal n, IFormatProvider format)
    {
        NumberFormatInfo nfi = NumberFormatInfo.GetInstance(format);
        string[] numberNegativePatterns = {
            "(#,0.############################)", //0: (n)
            "-#,0.############################", //1: -n
            "- #,0.############################", //2: - n
            "#,0.############################-", //3: n-
            "#,0.############################ -"};//4: n -
        var pattern = "#,0.############################;" + numberNegativePatterns[nfi.NumberNegativePattern];
        return n.ToString(pattern, format);
    }
    public static string ToStringNoTruncation(this Decimal n)
    {
        return n.ToStringNoTruncation(CultureInfo.CurrentCulture);
    }
}
Sample output
Locale Output
======== ============================
en-US -1,234,567,890.1234789012
ca-ES -1.234.567.890,1234789012
hr-HR - 1.234.567.890,1234789012
gsw-FR -1 234 567 890,1234789012
fr-CH -1'234'567'890.1234789012
ar-DZ 1,234,567,890.1234789012-
prs-AF 1.234.567.890,1234789012-
ps-AF 1،234،567،890,1234789012-
as-IN -1,23,45,67,890.1234789012
lo-LA (1234567,890.1234789012)
qps-PLOC -12,,3456,,7890..1234789012
There is currently no locale that uses NumberNegativePattern 4 (n -), so that case cannot be tested. But there's no reason to think it would fail.

You need to include the culture when formatting your strings. You can either use String.Format and pass the culture as the first parameter, or use the object's ToString method with the overload that takes a culture.
The following code produces the expected output (except for gws-FR, since no culture with that name could be found).
namespace CultureFormatting {
    using System;
    using System.Globalization;
    class Program {
        public static void Main() {
            Decimal value = -1234567890.1234789012M;
            Print("en-US", value);
            Print("ca-ES", value);
            //print("gws-FR", value);
            Print("fr-CH", value);
            Print("ar-DZ", value);
            Print("prs-AF", value);
            Print("ps-AF", value);
            Print("as-IN", value);
            Print("lo-LA", value);
            Print("qps-PLOC", value);
        }
        static void Print(string cultureName, Decimal value) {
            CultureInfo cultureInfo = new CultureInfo(cultureName);
            cultureInfo.NumberFormat.NumberDecimalDigits = 10;
            // Or, you could replace the {1:N} with {1:N10} to do the same
            // for just this string format call.
            string result =
                String.Format(cultureInfo, "{0,-8} {1:N}", cultureName, value);
            Console.WriteLine(result);
        }
    }
}
The above code produces the following output:
en-US -1,234,567,890.1234789012
ca-ES -1.234.567.890,1234789012
fr-CH -1'234'567'890.1234789012
ar-DZ 1,234,567,890.1234789012-
prs-AF 1.234.567.890,1234789012-
ps-AF 1،234،567،890,1234789012-
as-IN -1,23,45,67,890.1234789012
lo-LA (1234567,890.1234789012)
qps-PLOC --12,,3456,,7890..1234789012
If you're working with a multithreaded system, such as ASP.Net, you can change the thread's CurrentCulture property. Changing the thread's culture will allow all of the associated ToString and String.Format calls to use that culture.
Update
Since you're wanting to display all of the precision you're going to have to do a bit of work. Using NumberFormat.NumberDecimalDigits will work, except that if the value has less precision, the number will output with trailing zeros. If you need to make sure you display every digit without any extras, you will need to calculate the precision beforehand and set that before you convert it to a string. The StackOverflow question Calculate System.Decimal Precision and Scale may be able to help you determine the precision of the decimal.
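One possible way to calculate that precision, sketched here under the assumption that the scale stored inside the decimal is what you want to display: decimal.GetBits exposes the scale in bits 16 to 23 of its fourth element, and that number can be appended to the "N" format specifier. The names DecimalScaleDemo, GetScale and ToGroupedString are hypothetical.

```csharp
using System;
using System.Globalization;

static class DecimalScaleDemo
{
    // Reads the scale (number of digits after the decimal point) stored in a decimal.
    public static int GetScale(decimal value) =>
        (decimal.GetBits(value)[3] >> 16) & 0xFF;

    // Formats with digit grouping and exactly as many decimals as the value carries.
    public static string ToGroupedString(decimal value, IFormatProvider provider) =>
        value.ToString("N" + GetScale(value), provider);

    static void Main()
    {
        decimal value = -1234567890.1234789012M;
        Console.WriteLine(GetScale(value)); // 10
        Console.WriteLine(ToGroupedString(value, CultureInfo.InvariantCulture));
        // -1,234,567,890.1234789012
    }
}
```

Note that the stored scale keeps trailing zeros: a literal such as -1234567890.12347890M has scale 8, so its final 0 would be printed with this approach.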
