Why doesn't this string ("ʿAbdul-Baha'"^^mso:text#de) start with "? - c#

"\"ʿAbdul-Baha'\"^^mso:text#de".StartsWith("\"") // is false
"\"Abdul-Baha'\"^^mso:text#de".StartsWith("\"") // is true
(int)'ʿ' // is 703
Could anyone tell me why?

You need to use the second parameter of the StartsWith function: StringComparison.Ordinal (or StringComparison.OrdinalIgnoreCase). This instructs the function to compare by character value and to disregard culture-specific sorting rules. This quote is from the MSDN link below:
"An operation that uses word sort rules performs a culture-sensitive comparison wherein certain nonalphanumeric Unicode characters might have special weights assigned to them. Using word sort rules and the conventions of a specific culture, the hyphen ("-") might have a very small weight assigned to it so that "coop" and "co-op" appear next to each other in a sorted list."
This seems to affect how StartsWith behaves depending on locale/culture (see the comments on OP's post): it works for some but not for others.
In my example (unit test) below I show that if you convert the strings to a char array and look at the first character, it is actually the same. When calling StartsWith you need to add the Ordinal comparison to get the same result.
For reference my locale is Swedish.
For further info: MSDN: StringComparison Enumeration
[Test]
public void BeginsWith_test()
{
const string string1 = "\"ʿAbdul-Baha'\"^^mso:text#de";
const string string2 = "\"Abdul-Baha'\"^^mso:text#de";
var chars1 = string1.ToCharArray();
var chars2 = string2.ToCharArray();
Assert.That(chars1[0], Is.EqualTo('"'));
Assert.That(chars2[0], Is.EqualTo('"'));
Assert.That(string1.StartsWith("\"", StringComparison.InvariantCulture), Is.False);
Assert.That(string1.StartsWith("\"", StringComparison.CurrentCulture), Is.False);
Assert.That(string1.StartsWith("\"", StringComparison.Ordinal), Is.True); // Works
Assert.That(string2.StartsWith("\""), Is.True);
}
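To see what the string really contains, it can help to dump the first few UTF-16 code units before comparing. The following is a minimal sketch, not part of the original answer; the class and method names are made up for illustration.

```csharp
using System;
using System.Linq;

public static class QuoteInspector
{
    // Hypothetical helper: lists the first 'count' UTF-16 code units as U+XXXX.
    public static string DumpCodeUnits(string text, int count)
    {
        return string.Join(" ", text.Take(count).Select(c => "U+" + ((int)c).ToString("X4")));
    }

    // Hypothetical helper: ordinal check, so no culture rules are involved.
    public static bool StartsWithQuote(string text)
    {
        return text.StartsWith("\"", StringComparison.Ordinal);
    }
}
```

For the string from the question, DumpCodeUnits(text, 2) should show the quote (U+0022) followed by the modifier letter (U+02BF, decimal 703), and StartsWithQuote returns true regardless of the current culture.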

Related

How do you find a delimited/isolated substring with string.contains?

I am trying to parse out and identify some values from strings that I have in a list.
I am using string.Contains to identify the value I'm looking for, but I am getting hits even if the value is surrounded by other text. How can I make sure I only get a hit if the value is isolated?
Example parse:
Looking for value = "302"
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
var result = sale.ToLower().Contains("302");
In this example I will get a hit for "serialNumber302" and "F18529302E", which in the context is incorrect since I only want a hit if it finds “302” isolated, like “dontfind302 shouldfind 302”.
Any ideas on how to do this?
If you try Regex, you can define a word boundary using \b:
string sale =
"199708. (30), italiano, delim fabricata modella, serialNumber302. tnr F18529302E.";
bool result = Regex.IsMatch(sale, @"\b302\b"); // false
sale = "A string with 302 isolated";
result = Regex.IsMatch(sale, @"\b302\b"); // true
So 302 will only be found if it is at the start of the string, at the end of the string, or if it is surrounded by non-word characters, i.e. anything other than a-z, A-Z, 0-9 or _.
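If the value to look for is not a fixed literal, the same idea can be wrapped in a small helper. This is only a sketch: the class and method names are hypothetical, and it assumes the value starts and ends with a word character, because \b only forms a boundary next to one.

```csharp
using System.Text.RegularExpressions;

public static class IsolatedMatch
{
    // Hypothetical helper: true when 'value' occurs as a whole word in 'text'.
    public static bool ContainsIsolated(string text, string value)
    {
        // Regex.Escape keeps characters such as '.' or '+' in 'value' literal.
        string pattern = @"\b" + Regex.Escape(value) + @"\b";
        return Regex.IsMatch(text, pattern);
    }
}
```

With the strings from the question, ContainsIsolated(sale, "302") is false, while ContainsIsolated("dontfind302 shouldfind 302", "302") is true.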
EDIT: From the comments I realised that it wasn't clear whether or not "serialNum302" should get a hit. I assumed so in this answer.
I see a few easy ways you could do this:
1) If the input is always a number as in the example, one option would be to only search for substrings not surrounded by more numbers, by examining all the results of an initial search and comparing their neighboring characters against the string "0123456789". I really don't think this is the best option though, because sooner or later it's going to break when it misinterprets one of the other bits of data.
2) If the string sale always has the serial number in the format "serialNumber[Num]", instead of just looking for Num, look for "serialNumber" + Num, as this is less likely to be messed up with the other data.
3) From your string, it looks like you have a standardized format that's being introduced to the system. In this case, parse it in a standardized way, e.g. by splitting it into substrings at the commas, then parsing each substring differently as it requires.
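Option 1 can be sketched without a manual neighbor check by using regex lookarounds. Note that this swaps the answer's "compare against 0123456789" loop for a lookbehind/lookahead pattern; the names below are hypothetical.

```csharp
using System.Text.RegularExpressions;

public static class NumberSearch
{
    // Hypothetical helper: true when 'digits' occurs with no ASCII digit
    // directly before or after it. Letters next to it are allowed.
    public static bool ContainsStandaloneNumber(string text, string digits)
    {
        string pattern = "(?<![0-9])" + Regex.Escape(digits) + "(?![0-9])";
        return Regex.IsMatch(text, pattern);
    }
}
```

Under this rule "serialNumber302." counts as a hit (the neighbors are a letter and a period), whereas "F18529302E" does not, because the 302 there is preceded by another digit.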

Convert string into three letter Abbreviation

I've recently been given a new project at work: convert any given string into a 1-3 letter abbreviation.
An example of something similar to what I must produce is below; however, the strings given could be anything:
switch (string.Name)
{
case "Emotional, Social & Personal": return "ESP";
case "Speech & Language": return "SL";
case "Physical Development": return "PD";
case "Understanding the World": return "UW";
case "English": return "E";
case "Expressive Art & Design": return "EAD";
case "Science": return "S";
case "Understanding The World And It's People": return "UTW";
}
I figured that I could use string.Split and count the number of words in the array, then add conditions for handling strings of particular lengths, as generally these sentences won't be longer than 4 words. However, the problems I will encounter are:
If a string is longer than I expected it wouldn't be handled
Symbols must be excluded from the abbreviation
Any suggestions as to the logic I could apply would be very appreciated.
Thanks
Something like the following should work with the examples you have given.
string abbreviation = new string(
input.Split()
.Where(s => s.Length > 0 && char.IsLetter(s[0]) && char.IsUpper(s[0]))
.Take(3)
.Select(s => s[0])
.ToArray());
You may need to adjust the filter based on your expected input. Possibly adding a list of words to ignore.
It seems that if it doesn't matter, you could just go for the simplest thing. If the string is shorter than 4 words, take the first letter of each word.
If the string is longer than 4 words, eliminate all "ands" and "ors", then do the same.
To be better, you could have a lookup dictionary of words that you wouldn't care about - like "the" or "so".
You could also keep a 3D char array, in alphabetical order, for quick lookup. That way, you wouldn't have any repeating abbreviations.
However, there are only a finite number of abbreviations. Therefore, it might be better to keep the 'useless' words stored in another string. That way, if the abbreviation your program does by default is already taken, you can use the useless words to make a new one.
If all of the above fail, you could start to linearly move through string to get a different 3 letter word abbreviation - sort of like codons on DNA.
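The ignore-list idea might look roughly like the sketch below. The word list, the class name and the three-letter cap are assumptions for illustration, not something the question specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Abbreviator
{
    // Assumed list of words to skip; adjust to the expected input.
    private static readonly HashSet<string> IgnoredWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "and", "or", "the", "so" };

    public static string Abbreviate(string input, int maxLength = 3)
    {
        var initials = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Strip symbols such as ',' and '&' so they never end up in the result.
            .Select(word => new string(word.Where(c => char.IsLetter(c)).ToArray()))
            .Where(word => word.Length > 0 && !IgnoredWords.Contains(word))
            .Take(maxLength)
            .Select(word => char.ToUpperInvariant(word[0]));
        return new string(initials.ToArray());
    }
}
```

This gives "ESP" for "Emotional, Social & Personal" and "UW" for "Understanding the World". Because "The" and "And" are skipped, "Understanding The World And It's People" comes out as "UWI" rather than the "UTW" from the question, so the ignore list would need tuning if that exact result matters.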
Perfect place to use a dictionary
Dictionary<string, string> dict = new Dictionary<string, string>() {
{"Emotional, Social & Personal", "ESP"},
{"Speech & Language","SL"},
{"Physical Development", "PD"},
{"Understanding the World","UW"},
{"English","E"},
{"Expressive Art & Design","EAD"},
{"Science","S"},
{"Understanding The World And It's People","UTW"}
};
string results = dict["English"];
Following snippet may help you:
string input = "Emotional, Social & Personal"; // an example from the question
string plainText = CultureInfo.CurrentCulture.TextInfo.ToTitleCase(Regex.Replace(input, @"[^0-9A-Za-z ,]", "").ToLower()); // will produce a text without special characters
string abbreviation = String.Join("",plainText.Split(" ".ToCharArray(),StringSplitOptions.RemoveEmptyEntries).Select(y =>y[0]).ToArray());// get first character from each word

Sanitizing a String for a Property Name

Problem
I need to sanitize a collection of Strings from user input to a valid property name.
Context
We have a DataGrid that works with runtime generated classes. These classes are generated based on some parameters. Parameter names are converted into Properties. Some of these parameter names are from user input. We implemented this and it all seemed to work great. Our logic to sanitizing strings was to only allow numbers and letters and convert the rest to an X.
const string regexPattern = @"[^a-zA-Z0-9]";
return ("X" + Regex.Replace(input, regexPattern, "X")); //prefix with X in case the name starts with a number
The property names were always correct and we stored the original string in a dictionary so we could still show a user friendly parameter name.
However, where the trouble starts is when a string only differs in illegal characters like this:
Parameter Name
Parameter_Name
These were both converted into:
ParameterXName
A solution would be to just generate some safe, unrelated names like A, B, C, etc. But I would prefer the name to still be recognizable in debug. Unless it's too complicated to implement this behavior of course.
I looked at other questions on StackOverflow, but they all seem to remove illegal characters, which has the same problem.
I feel like I'm reinventing the wheel. Is there some standard solution or trick for this?
I suggest changing the algorithm so that it generates names that are safe, unique and still recognizable.
In C#, _ is a valid symbol in member names. Replace each invalid symbol (chr) not with X but with "_" + the numeric code of chr + "_".
demo
public class Program
{
public static void Main()
{
string [] props = {"Parameter Name", "Parameter_Name"};
var validNames = props.Select(s=>Sanitize(s)).ToList();
Console.WriteLine(String.Join(Environment.NewLine, validNames));
}
private static string Sanitize(string s)
{
return String.Join("", s.AsEnumerable()
.Select(chr => Char.IsLetter(chr) || Char.IsDigit(chr)
? chr.ToString() // valid symbol
: "_"+(int)chr+"_") // numeric code for invalid symbol; int avoids the negative values (short) yields above U+7FFF
);
}
}
prints
Parameter_32_Name
Parameter_95_Name
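The demo does not cover names that begin with a digit, which the question handled with an X prefix. A possible variation is sketched below; the class name and the choice of a leading underscore are assumptions, not part of the original answer.

```csharp
using System;
using System.Linq;

public static class PropertyNameSanitizer
{
    public static string Sanitize(string s)
    {
        // Letters and digits are kept; everything else becomes _<code>_.
        string body = string.Concat(s.Select(chr =>
            char.IsLetter(chr) || char.IsDigit(chr)
                ? chr.ToString()
                : "_" + (int)chr + "_"));

        // An identifier may not start with a digit (and may not be empty),
        // so prefix an underscore in those cases.
        bool needsPrefix = body.Length == 0 || char.IsDigit(body[0]);
        return needsPrefix ? "_" + body : body;
    }
}
```

The two names from the question still map to Parameter_32_Name and Parameter_95_Name, and an input such as "1st" becomes "_1st".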

C# Regex.Match to decimal

I have a string "-4.00 %" which I need to convert to a decimal so that I can declare it as a variable and use it later. The string itself is found in string[] rows. My code is as follows:
foreach (string[] row in rows)
{
string row1 = row[0].ToString();
Match rownum = Regex.Match(row1.ToString(), @"\-?\d+\.+?\d+[^%]");
string act = Convert.ToString(rownum); //wouldn't convert match to decimal
decimal actual = Convert.ToDecimal(act);
textBox1.Text = (actual.ToString());
}
This results in "Input string was not in a correct format." Any ideas?
Thanks.
I see two things happening here that could contribute.
You are treating the result of Regex.Match as though it were a string, but what it returns is a Match object.
Rather than converting rownum to a string, you need to look at rownum.Groups[0].Value.
Secondly, you have no parenthesised match to capture. @"(\-?\d+\.+?\d+)%" will create a capture group from the whole lot. This may not matter, I don't know how C# behaves in this circumstance exactly, but if you start stretching your regexes you will want to use bracketed capture groups so you might as well start as you want to go on.
Here's a modified version of your code that changes the regex to use a capturing group and explicitly look for a %. As a consequence, this also simplifies the parsing to decimal (no longer need an intermediary string):
EDIT : check rownum.Success as per executor's suggestion in comments
string[] rows = new [] {"abc -4.01%", "def 6.45%", "monkey" };
foreach (string row in rows)
{
//regex captures number but not %
Match rownum = Regex.Match(row.ToString(), @"(\-?\d+\.+?\d+)%");
//check for match
if(!rownum.Success) continue;
//get value of first (and only) capture
string capture = rownum.Groups[1].Value;
//convert to decimal
decimal actual = decimal.Parse(capture);
//TODO: do something with actual
}
If you're going to use the Match class to handle this, then you have to access the Match.Groups property to get the collection of matched groups. This class assumes that more than one occurrence appears. If you can guarantee that you'll always get 1 and only 1 you could get it with:
string act = rownum.Groups[0].Value;
Otherwise you'll need to parse through it as in the MSDN documentation.
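Putting the pieces together, a culture-independent version could look like the sketch below. It is an illustration under assumptions: the helper name is made up, the pattern makes the fractional part optional and allows whitespace before the %, and the number is parsed with the invariant culture so that "." is always the decimal separator.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class PercentParser
{
    // Hypothetical helper: extracts the number in front of a '%' sign.
    public static bool TryParsePercent(string text, out decimal value)
    {
        value = 0m;
        Match m = Regex.Match(text, @"(-?[0-9]+(?:\.[0-9]+)?)\s*%");
        return m.Success
            && decimal.TryParse(m.Groups[1].Value, NumberStyles.Number,
                                CultureInfo.InvariantCulture, out value);
    }
}
```

For "-4.00 %" this returns true with a value of -4.00, and for a row without a percentage, such as "monkey", it returns false instead of throwing.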

String StartsWith() issue with Danish text

Can anyone explain this behaviour?
var culture = new CultureInfo("da-DK");
Thread.CurrentThread.CurrentCulture = culture;
"daab".StartsWith("da"); //false
I know that it can be fixed by specifying StringComparison.InvariantCulture. But I'm just confused by the behavior.
I also know that "aA" and "AA" are not considered the same in a Danish case-insensitive comparison, see http://msdn.microsoft.com/en-us/library/xk2wykcz.aspx. This explains the following:
String.Compare("aA", "AA", new CultureInfo("da-DK"), CompareOptions.IgnoreCase) // -1 (not equal)
Is this linked to the behavior of the first code snippet?
Here is a test that illustrates the problem; daab and dåb (the same word in old and modern spelling respectively) mean baptism/christening.
public class can_handle_remnant_of_danish_language
{
[Fact]
public void daab_start_with_då()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("då")); // Fails
}
[Fact]
public void daab_start_with_da()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("da")); // Fails
}
[Fact]
public void daab_start_with_daa()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("daab".StartsWith("daa")); // Succeeds
}
[Fact]
public void dåb_start_with_daa()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("daa")); // Fails
}
[Fact]
public void dåb_start_with_da()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("da")); // Fails
}
[Fact]
public void dåb_start_with_då()
{
var culture = new CultureInfo("da-DK"); Thread.CurrentThread.CurrentCulture = culture;
Assert.True("dåb".StartsWith("då")); // Succeeds
}
}
All the above tests should be successful with my understanding of the language, and I'm Danish!
I ain't got no degree in grammar though. :-)
Seems like a bug to me.
Like Nappy said, it's a feature of the Danish language, where "aa" and "å" are still the same. Danish has another two letters, æ and ø, but I am not sure if they can be written using two letters as well.
I think in the second example "aA" is not changed while "AA" is changed to "Å". Just to confuse things even more, "Aa" is considered equal to "AA" and "aa" only when using case-insensitive comparing.
The modern spelling of "baptism" in Danish, namely dåb, is certainly not considered to start with da, for a Danophone. If daab is supposed to be an old-fashioned spelling of dåb, it is a bit philosophical whether it starts with da or not. But for (modern) collation purposes, it does not (alphabetically, such daab goes after disk, not before).
However, if your string is not supposed to represent natural language, but is instead some kind of technical code, like hexadecimal digits, surely you do not want to use any culture-specific rules. The solution here is not to use the invariant culture. The invariant culture has (English) rules itself!
Instead, you want to use ordinal comparison.
Ordinal comparison simply compares the strings char by char, without any assumptions of what sequences are "equivalent" in some sense. (Technical remark: Each char is a UTF-16 code unit, not a "character". Ordinal comparison is ignorant of the rules of Unicode normalization.)
I think the confusion arises because, by default, some string methods use a culture-aware comparison, and other string methods use the ordinal comparison.
The following examples all use a culture-aware comparison:
"Straße".StartsWith("Strasse", StringComparison.CurrentCulture)
"Straße".Equals("Strasse", StringComparison.CurrentCulture)
"ne\u0301e".StartsWith("née", StringComparison.CurrentCulture)
"ne\u0301e".Equals("née", StringComparison.CurrentCulture)
"Straße".StartsWith("Strasse") // CurrentCulture is default for 'StartsWith'!
"ne\u0301e".StartsWith("née") // CurrentCulture is default for 'StartsWith'!
Each of the above may depend on the .NET version as well! (As an example, the first one gives true if the current culture is the invariant culture and you are under .NET Framework 4.8; but it gives false if the current culture is the invariant culture and you use .NET 6.)
But these examples use ordinal comparison:
"Straße".StartsWith("Strasse", StringComparison.Ordinal)
"Straße".Equals("Strasse", StringComparison.Ordinal)
"ne\u0301e".StartsWith("née", StringComparison.Ordinal)
"ne\u0301e".Equals("née", StringComparison.Ordinal)
"Straße".Equals("Strasse") // Ordinal is default for 'Equals'!
"ne\u0301e".Equals("née") // Ordinal is default for 'Equals'!
So remember to check what the default comparison is for the string method you use, and specify the opposite one if needed. (Or always specify the comparison, even when redundant, if you prefer.)
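For strings that are technical codes rather than natural language, one way to avoid relying on defaults is a small wrapper that always passes an ordinal comparison. This is a minimal sketch; the class and method names are hypothetical.

```csharp
using System;

public static class CodePrefix
{
    // Hypothetical helper for technical strings (IDs, hex digits, tokens):
    // compares UTF-16 code units only, independent of the current culture.
    public static bool HasPrefix(string code, string prefix, bool ignoreCase = false)
    {
        StringComparison comparison = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;
        return code.StartsWith(prefix, comparison);
    }
}
```

With this, HasPrefix("daab", "da") is true even when the current culture is da-DK, while HasPrefix("Straße", "Strasse") and HasPrefix("ne\u0301e", "née") are both false, because ordinal comparison applies neither linguistic equivalences nor Unicode normalization.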
