How .NET's regex engine treats RTL+LTR mixed strings? - c#

I have a mixed Hebrew/english string to parse.
The string is built like this:
[3 hebrew] [2 english 2] [1 hebrew],
So, it can be read as: 1 2 3, and it is stored as 3 2 1 (exact byte sequence in file, double-checked in hex editor, and anyway RTL is only the display attribute). .NET regex parser has RTL option, which (when given for plain LTR text) starts processing from right side of the string.
I am wondering, when this option should be applied to extract [3 hebrew] and [2 english] parts from the string,or to check if [1 hebrew] matches the end of the string? Are there any hidden specifics or there's nothing to worry about (like when processing any LTR string with special unicode characters)?
Also, can anyone recommend me a good RTL+LTR text editor? (afraid that VS Express displays the text wrong sometimes, and if it can even start messing the saved strings - I would like to re-check the files without using hex editors anymore)

The RightToLeft option refers to the order through the character sequence that the regular expression takes, and should really be called LastToFirst since in the case of Hebrew and Arabic it is actually left-to-right, and with mixed RLT and LTR text such as you describe the expression "right to left" is even less appropriate.
This has a minor effect on speed (will only matter if the searched text is massive) and on regular expressions that are done with a startAt index (searching those earlier in the string than startAt rather than later in the string).
Examples; let's hope the browers don't mess this up too much:
string saying = "למכות is in כתר"; //Just because it amuses me that this is a saying whatever way round the browser puts malkuth and kether.
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying));//True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying));//True, perhaps minutely faster but so little that noise would hide it.
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2));//False
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2));//True
//And to show that the ordering is codepoint rather than physical display ordering:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying));//True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying));//False

Related

c# - Replacing extended ascii characters

I'm parsing a number of text files that contain 99.9% ascii characters. Numbers, basic punctuation and letters A-Z (upper and lower case).
The files also contain names, which occasionally contain characters which are part of the extended ascii character set, for example umlauts Ü and cedillas ç.
I want to only work with standard ascii, so I handle these extended characters by processing any names through a series of simple replace() commands...
myString = myString.Replace("ç", "c");
myString = myString.Replace("Ü", "U");
This works with all the strange characters I want to replace except for Ø (capital O with a forward slash through it). I think this has the decimal equivalent of 157.
If I process the string character-by-character using ToInt32() on each character it claims the decimal equivalent is 65533 - well outside the normal range of extended ascii codes.
Questions
why doesn't myString.Replace("Ø", "O"); work on this character?
How can I replace "Ø" with "O"?
Other information - may be pertinent. Opening the file with Notepad shows the character as a "Ø". Comparison with other sources indicate that the data is correct (i.e. the full string is "Jørgensen" - a valid Danish name). Viewing the character in visual studio shows it as "�". I'm getting exactly the same problem (with this one character) in hundreds of different files. I can happily replace all the other extended characters I encounter without problems. I'm using System.IO.File.ReadAllLines() to read all the lines into an array of strings for processing.
Replace works fine for the 'Ø' when it 'knows' about it:
Console.WriteLine("Jørgensen".Replace("ø", "o"));
In your case the problem is that you are trying to read the data with the wrong encoding, that's why the string does not contain the character which you are trying to replace.
Ø is part of the extended ASCII set - iso-8859-1, but File.ReadAllLines tries to detect encoding using BOM chars and, I suspect, falls back to UTF-8 in your case (see Remarks in the documentation).
The same behavior you see in the VS code - it tries to open the file with UTF-8 encoding and shows you �:
If you switch the encoding to the correct one - it shows the text correctly:
If you know what encoding is used for your files, just use it explicitly, here is an example to illustrate the difference:
// prints J?rgensen
File.ReadAllLines("data.txt")
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
// prints Jorgensen
File.ReadAllLines("data.txt",Encoding.GetEncoding("iso-8859-1"))
.Select(l => l.Replace("Ø", "O"))
.ToList()
.ForEach(Console.WriteLine);
If you want to use chars from the default ASCII set, you may convert all special chars from the extended set to the base one (it will be ugly and non-trivial). Or you can search online how to deal with your concern, and you may find String.Normalize() or this thread with several other suggestions.
public static string RemoveDiacritics(string s)
{
var normalizedString = s.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder();
for(var i = 0; i < normalizedString.Length; i++)
{
var c = normalizedString[i];
if(CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
stringBuilder.Append(c);
}
return stringBuilder.ToString();
}
...
// prints Jorgensen
File.ReadAllLines("data.txt", Encoding.GetEncoding("iso-8859-1"))
.Select(RemoveDiacritics)
.ToList()
.ForEach(Console.WriteLine);
I'd strongly recommend reading C# in Depth: Unicode by Jon Skeet and Programming with Unicode by Victor Stinner books to have a much better understanding of what's going on :) Good luck.
PS. My code example is functional, compact but pretty inefficient, if you parse huge files consider using another solution.

Unable to match on first column name after parsing CSV document

I am working on CSV import where I take a file (with headings as the first row) and parse the document into DataTable structure.
When I try to organise the data into a collection for some reason (unknown to me), my state machine fails to match on the very first column heading. It should be fairly straight forward, no magic involved.
foreach (DataRow row in dt.Rows)
{
foreach (DataColumn col in row.Table.Columns)
{
switch(col.ColumnName)
{
default:
// debug: Exceptions.LogException(new Exception(" csv {ColumnName:'" + col.ColumnName + "',Length:" + col.ColumnName.Length + ",Test:" + string.Equals(col.ColumnName, "Name") + "}"));
break;
case "Name":
// doesn't get picked up
break:
My debug line(s) return the following: csv {ColumnName:'Name',Length:5,Test:False}
Interestingly enough, if I add a dummy column to the file in front of Name column then my case: "Name" works fine.
Any ideas what could be causing an issue like that?
Great comments and suggestions
Reproducible code example - was going to make one today but it looks like we have a different problem
Leading/trailing spaces - checked for those before posting
Name being reserved - tried a different column name didn't make a difference
Weird characters - checked CSV in Notepad, Sublime (fancy Notepad) before posting for strange characters. But after JAZ suggested to check the length s/he was right on the money (see above).
Pursuing the issue of weird characters
So far it doesn't seem to be any of the usual suspects: space, tab, newline, carriage return (or combination of both). But one thing is for sure it's at the begging of the string as suggested by debugging log.
sb.Append("{Col:'" + col.ColumnName.Substring(0, 3) + "',Len:" + col.ColumnName.Length + "}");
Returning {Col:'vN',Len:6} where first column is vName.
Culprit/Solution
Finally found the culprit U+FEFF aka BYTE ORDER MARK character which appears at the start of text stream (but can also appear in the middle ZERO WIDTH NO-BREAK SPACE) and indicates the type of encoding (UTF-8, UTF-16, UTF-32, etc).
Found by converting a string of characters into Unicode as follows:
col.ColumnName.Select(t => string.Format("U+{0:X4}", (ushort)t)).ToList()
Producing the following output for vName string:
U+FEFF = byte order mark
U+0076 = v
U+004E = N
U+0061 = a
U+006D = m
U+0065 = e
Handy to know
Just wanted to share that you can quickly check the type of encoding and line break used by opening the file in Notepad. Would have been handy to know this when I was posting my question. Below are three different CSV files which use a different encoding.
Probably a line feed or some weird character in the data - What is the length of the string when it fails? That would tell you if there are too many chars.
Start with the CSV data and check it in a real editor, not Excel, see if there is something in the data.
Use Notepad++ and change the encoding of the text file to see the extra characters. Don't think Windows Notepad will show them.

Can't replace single whitespace with string.Replace()

I have run into a problem I do not understand. I am reading data from a file and have run into a situation where string.Replace(" ", "<whatever>") on an entry from the file will not replace the occurence of a single whitespace. I cannot help but to feel there is something very basic that I have missed, since the same kind of string declared as a literal works fine.
A typical line from the file (each entry is separated by a tab):
"2016-feb-08 09:54:00" "2016-feb-08 17:28:00" "Short" "227" "5 170,00" "+3,90%" "0,00"
The data from the file is read into an array using File.ReadAllLines().Split(new[] {"\t" }, StringSplitOptions.None);.
I then want to clean up the fifth entry for further processing, and this is when I run into the problem:
entries[4].Replace(" ", string.Empty).Replace("\"", string.Empty); gives "5 170,00"
Regex.Replace(entries[4], #"\s+", string.Empty).Replace("\"", string.Empty); gives "5170,00", which is the result I am looking for.
Running the first Replace() on a literal with a single space works fine, so I am curious if the whitespace inside the strings from the file are different somehow? And while the Regex solution works, I really want to know what my "issue" is.
You can use code like below to check hex values of the character. A normal space is 0x20 which the value showing between the five and the one in the code you posted.
string input = "2016-feb-08 09:54:00 2016-feb-08 17:28:00 Short 227 5 170,00 +3,90% 0,00";
byte[] output = Encoding.UTF8.GetBytes(input);

How do I prevent a string from appearing in a result string when a set of child strings are concatenated to form the result string?

I have 5 strings, let's call them
EarthString
FireString
WindString
WaterString
HeartString
All of them can have varying length, any of them can be empty, or can be very long (but never null).
These 5 strings are very good friends, and every weekend they are concatenated to form a result string using this c# statement
ResultString = EarthString + FireString + WindString + WaterString + HeartString
Depending on the values of these strings, sometimes (only sometimes), ResultString will contain "Captain Planet" as a substring.
My question is, how do I manipulate each of the 5 strings before they are concatenated, so that when they are combined, "Captain Planet" will never appear as a substring in the resultant string?
The only way I can think of right now is to examine each character in each string, in sequential order, but that seems very tedious. Since each of the 5 good friends strings can be of any length, examining the characters individually will also require some kind of concatenation before we can determine whether any character need to be dropped.
Edit: The resultant string is a filtered version of the 5 strings concatenated together, all the other content remain the same except the "Captain Planet" string is dropped. Yes, i'm looking for a solution which allows the 5 strings to be manipulated before concatenation. (this is actually a simplification of a bigger programming problem i'm encountering). Thanks guys.
If you want to do it pre-concat you could
Assign the start and end of each string a numeric value based on the portion of "CaptainPlanet" they contein. Ex: if Air = "net the big captain" then it would get 3 for a start value and 7 for an end value. to determine if you could concat 2 values safely you would just check to see if the end of the left string + start of the right string were not equal to the total length of "CaptainPlanet". If you had very large strings this would allow you to inspect just the first x and last x characters of the string to compute the start/end value.
This solution doesn't account for short strings like ei air = "Cap" , earth ="tain" and fire="Planet". In that case you would need to have a special case for tokens that are shorter than the length of "CaptainPlanet" For those.
Is there a particular reason you can't just do this?
ResultString.Replace("CaptainPlanet", "x");
If it doesn't matter how many chars will be dropped, you can remove f.e. all 'C' in all strings.
The original answer cleared all of the strings, but as pointed out by J.Steen, there was already a formulation of the expected output. So there we go.
Run elementString.Replace("Captain Planet", "") on every substring.
Now you have to identify all the prefixes / suffixes of "Captain Planet" on each of the substrings, and keep that information so that it can be processed before contatenation. That is, e.g. if the substring ends with "Capt", then you should have an information that "substring contains at the end a prefix of the 4 first letters of 'Captain Planet'". You also have to consider the cases of complete substrings (e.g. one of the strings is "ptain Pla"). The problem also becomes more complex if any of the e.g. prefixes can be recursive or repeated (e.g. "CaptainCap" contains 2 kinds of valid prefixes for "CaptainCaptain", and "apt" can be found at two locations in the resulting string);
You process that information before concatenation so that the result string has the same thing as ResultString.Replace("Captain Planet", ""). Congratulations, you have made your program much more complex than necessary!
But in short, you cannot get both the result that you want (all of the substrings intact except for the combined result output) and do the processing wholly before the concatenation step.

C#: How do you go upon constructing a multi-lined string during design time?

How would I accomplish displaying a line as the one below in a console window by writing it into a variable during design time then just calling Console.WriteLine(sDescription) to display it?
Options:
-t Description of -t argument.
-b Description of -b argument.
If I understand your question right, what you need is the # sign in front of your string. This will make the compiler take in your string literally (including newlines etc)
In your case I would write the following:
String sDescription =
#"Options:
-t Description of -t argument.";
So far for your question (I hope), but I would suggest to just use several WriteLines.
The performance loss is next to nothing and it just is more adaptable.
You could work with a format string so you would go for this:
string formatString = "{0:10} {1}";
Console.WriteLine("Options:");
Console.WriteLine(formatString, "-t", "Description of -t argument.");
Console.WriteLine(formatString, "-b", "Description of -b argument.");
the formatstring makes sure your lines are formatted nicely without putting spaces manually and makes sure that if you ever want to make the format different you just need to do it in one place.
Console.Write("Options:\n\tSomething\t\tElse");
produces
Options:
Something Else
\n for next line, \t for tab, for more professional layouts try the field-width setting with format specifiers.
http://msdn.microsoft.com/en-us/library/txafckwd.aspx
If this is a /? screen, I tend to throw the text into a .txt file that I embed via a resx file. Then I just edit the txt file. This then gets exposed as a string property on the generated resx class.
If needed, I embed standard string.Format symbols into my txt for replacement.
Personally I'd normally just write three Console.WriteLine calls. I know that gives extra fluff, but it lines the text up appropriately and it guarantees that it'll use the right line terminator for whatever platform I'm running on. An alternative would be to use a verbatim string literal, but that will "fix" the line terminator at compile-time.
I know C# is mostly used on windows machines, but please, please, please try to write your code as platform neutral. Not all platforms have the same end of line character. To properly retrieve the end of line character for the currently executing platform you should use:
System.Environment.NewLine
Maybe I'm just anal because I am a former java programmer who ran apps on many platforms, but you never know what the platform of the future is.
The "best" answer depends on where the information you're displaying comes from.
If you want to hard code it, using an "#" string is very effective, though you'll find that getting it to display right plays merry hell with your code formatting.
For a more substantial piece of text (more than a couple of lines), embedding a text resources is good.
But, if you need to construct the string on the fly, say by looping over the commandline parameters supported by your application, then you should investigate both StringBuilder and Format Strings.
StringBuilder has methods like AppendFormat() that accept format strings, making it easy to build up lines of format.
Format Strings make it easy to combine multiple items together. Note that Format strings may be used to format things to a specific width.
To quote the MSDN page linked above:
Format Item Syntax
Each format item takes the following
form and consists of the following
components:
{index[,alignment][:formatString]}
The matching braces ("{" and "}") are
required.
Index Component
The mandatory index component, also
called a parameter specifier, is a
number starting from 0 that identifies
a corresponding item in the list of
objects ...
Alignment Component
The optional alignment component is a
signed integer indicating the
preferred formatted field width. If
the value of alignment is less than
the length of the formatted string,
alignment is ignored and the length of
the formatted string is used as the
field width. The formatted data in
the field is right-aligned if
alignment is positive and left-aligned
if alignment is negative. If padding
is necessary, white space is used. The
comma is required if alignment is
specified.
Format String Component
The optional formatString component is
a format string that is appropriate
for the type of object being formatted
...

Categories