Why does this happen with ToolStripMenuItems? - c#

When I add a ToolStripMenuItem to a form, set RightToLeft to Yes, and have a quote at the end of the text, why is the quote placed at the front of the text?
ToolStripMenuItem1.Text = "Name \"Text\"";
ToolStripMenuItem1.RightToLeft = System.Windows.Forms.RightToLeft.Yes;
Displays as: "Name "Text
Edit: This also happens with single quotes.

By setting RightToLeft to Yes, you are asking the Windows text rendering engine to apply the text layout rules used in languages that are written right-to-left, such as Arabic and Hebrew. Those rules are pretty subtle, especially because English phrases in those languages are not uncommon. It is not going to render "txeT emaN" the way it normally does with Arabic or Hebrew glyphs; that doesn't make sense to anybody. It needs to identify sentences or phrases and reverse those. Quotes are special: they delineate a phrase.
Long story short, you are abusing a feature to get alignment when it was really meant to do something far more involved. Don't use it for that.

EDIT:
If you just want to change the text alignment, there is always ToolStripItem.Alignment or ToolStripItem.Padding to try.
The feature you are using is meant for localization and can support mixed content.
The way you use that feature seems more like an abuse, since Windows needs to make sense of "mixed content" of which you provide an extreme case (no Arabic at all).
I find it always hard to be sure that such behaviour though unexpected and unintuitive is really a bug...
Rendering logic for BiDi text is rather complex - for example when you have an exclamation mark somewhere in text which should be rendered RightToLeft it can lead to reversing the direction depending on the implementation...
For some insight see http://www.unicode.org/reports/tr9/
It seems .NET is not fully compliant with the Unicode BiDi algorithm. There is even a library that tries to implement it, see http://sourceforge.net/projects/nbidi/

Related

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

In C# StringInfo and TextElementEnumerator classes provide methods and properties for text elements.
And here, we can find the definition of the Text Element.
The .NET Framework defines a text element as a unit of text that is
displayed as a single character, that is, a grapheme. A text element
can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some unicode characters myself, and it really seemed true until I tested one Korean letter '가'.
As we all know some Unicode characters consist of multiple code points. Also we may face code point sequences and that's the reason I'm using StringInfo and TextElementEnumerator instead of simple String.
StringInfo and TextElementEnumerator could tell if Chars were surrogate pairs correctly. And "\u0061\u0308", a Unicode character which consists of multiple code points, was recognized as one text element just as expected. But as for "\u1100\u1161", it failed to say that it was also one text element.
"\u1100" is a leading letter "ㄱ", and "\u1161" is a vowel letter "ㅏ". They can be individual characters and shown to the users just as I write here and you can see them now. But if they are used together, they are rendered as one character "가" instead of "ㄱㅏ".
There are two ways in order to represent a Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it's used at all.
Anyway, the first one is just one precomposed letter and the second is a sequence of a lead and a vowel which is treated as one character. When rendered they look exactly the same, and the two are in fact canonically equivalent.
Also the following line returns true in C# :
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they are one complete text element.
I thought it had something to do with my .NET's version, but it turns out it's not the case. This thing happens even in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
Here's my question :
Am I doing something wrong here?
or
Is a text element in .NET not a user-perceived character, unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
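One way to make the data "correctly formed" for the simple case in the question is to compose it to NFC before counting text elements, since the question already shows that Normalize() turns U+1100 U+1161 into U+AC00. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and jamo sequences that have no precomposed syllable would still need ICU or a similar library.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TextElements
{
    // Composes the string to NFC first, so a conjoining jamo pair such as
    // U+1100 U+1161 becomes the single syllable U+AC00, then counts the
    // text elements that StringInfo reports for the composed string.
    public static int CountAfterNfc(string s)
    {
        string composed = s.Normalize(NormalizationForm.FormC);
        return new StringInfo(composed).LengthInTextElements;
    }
}
```

For example, `TextElements.CountAfterNfc("\u1100\u1161")` counts the composed form rather than the two separate jamos. Note that normalizing changes the stored code points, which may not be acceptable if the original sequence has to be preserved.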

Scintilla.NET regular expression based syntax highlighting

Is it possible to use regular expressions to define syntax highlighting in Scintilla? And if so, how to do it?
I have a custom language to process, which cannot be described in simple terms of keywords and delimiters. The meaning of particular structures in this language is dependent only on their position relative to keywords. I have regular expression based parser for this format, all I need is to apply regular expression defined rules as text styles.
I mean if something matches regex1, it should have style1. Is it possible? How?
If not - can I set styles for manually selected ranges? I mean to assign style number to a specified character range in editor. How to do it?
Is it possible to define Scintilla styles in code, not in xml file?
EDIT:
OK, I've found a way.
foreach (Match m in Patterns.Keyword0.Matches(Encoding.ASCII.GetString(e.RawText)))
e.GetRange(m.Index, m.Index + m.Length).SetStyle(1);
The problem is the RawText property. It's a byte buffer of UTF-8 encoded text. The Text property contains nice UTF-16 text, but the GetRange method accepts a byte offset, not a character offset. If I convert on each TextChanged event, I lose almost all of the speed advantage of using Scintilla.
Of course the easiest way would be to change internal encoding to UTF-16, but when I do it, I get exception saying this encoding is not supported. The only one supported seems to be UTF-8 which is ridiculously hard (and slow) to process.
I'm hitting a wall here.
The key to this is to set the lexer to SCLEX_CONTAINER and then handle the SCN_STYLENEEDED notification. This means you only ever have to process the text that actually needs styling.
There are several guides linked at the top of the Scintilla Documentation that detail various aspects of implementing custom lexers, so I won't bother repeating any of that here.
As for performance: I've written custom Scintilla lexers in Python that decode the UTF-8 text when styling and have never noticed any significant issues, so I'd be amazed if you couldn't at least match that using C#.
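The character-offset versus byte-offset problem from the question can be handled in a single linear pass over the decoded text. The sketch below is illustrative only: `Utf8Ranges` and `ByteRanges` are hypothetical names, no Scintilla.NET API is used, and it assumes the regex runs left to right and never splits a surrogate pair. It maps each match in the UTF-16 string to the byte range it occupies in the UTF-8 buffer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class Utf8Ranges
{
    // Maps every regex match in `text` (UTF-16 indices) to a [Start, End)
    // byte range in the UTF-8 encoding of the same text.
    public static List<(int Start, int End)> ByteRanges(string text, Regex pattern)
    {
        var ranges = new List<(int Start, int End)>();
        char[] chars = text.ToCharArray();
        int charPos = 0; // UTF-16 index that has already been converted
        int bytePos = 0; // UTF-8 byte offset corresponding to charPos

        foreach (Match m in pattern.Matches(text))
        {
            // Only count the bytes between the previous match start and
            // this one, so the whole pass stays linear in the text length.
            bytePos += Encoding.UTF8.GetByteCount(chars, charPos, m.Index - charPos);
            charPos = m.Index;

            int byteLength = Encoding.UTF8.GetByteCount(chars, m.Index, m.Length);
            ranges.Add((bytePos, bytePos + byteLength));
        }
        return ranges;
    }
}
```

The resulting byte ranges could then be passed to whatever styling call takes byte offsets. Combined with styling only the region reported as needing it, the conversion cost is limited to that region rather than the whole document.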

StatusStrip Labels Text are mirrored [duplicate]

I am using a StringBuilder in C# to append some text, which can be English (left to right) or Arabic (right to left)
stringBuilder.Append("(");
stringBuilder.Append(text);
stringBuilder.Append(") ");
stringBuilder.Append(text);
If text = "A", then output is "(A) A"
But if text = "بتث", then output is "(بتث) بتث"
Any ideas?
This is a well-known flaw in the Windows text rendering engine when it is asked to render right-to-left text such as Arabic or Hebrew. It has a difficult problem to solve: people often fall back to Western words and punctuation when there is no good alternative word available in the language, brand and company names for example. The renderer tries to guess the proper render order by looking at the code points, with characters in the Latin character set clearly having to be rendered left-to-right.
But it fumbles at punctuation, with brackets being the most visible. You have to be explicit so that it knows what to do: use the Unicode right-to-left mark, U+200F, or \u200f in C# code. Conversely, use the left-to-right mark, U+200E, if you know you need LTR rendering.
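A minimal sketch of that advice applied to the StringBuilder code from the question follows. The `BidiMarks` class and `FormatRtl` method are hypothetical names, and placing the mark before the opening bracket is one possible choice, not verified against every renderer; where the mark belongs depends on the surrounding text.

```csharp
using System;
using System.Text;

public static class BidiMarks
{
    public const char LeftToRightMark = '\u200E';
    public const char RightToLeftMark = '\u200F';

    // Builds "(text) text" and prefixes it with a right-to-left mark, so
    // the opening bracket has a strong RTL character on both sides instead
    // of taking its direction from a left-to-right paragraph.
    public static string FormatRtl(string text)
    {
        var sb = new StringBuilder();
        sb.Append(RightToLeftMark);
        sb.Append('(').Append(text).Append(") ").Append(text);
        return sb.ToString();
    }
}
```

The mark is invisible but is still a character in the string, so it affects Length, comparisons, and searches. If the value is only needed for display, it may be cleaner to add the mark at the point where the text is handed to the control.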
Use AppendFormat instead of just Append:
stringBuilder.AppendFormat("({0}) {0}", text)
This may fix the issue, but it may not. You need to look at the text value: it probably has LTR/RTL marker characters embedded. These need to be either removed or corrected in the value.
I had a similar issue and managed to solve it by creating a function that checks the Unicode value of each char. If it is from page FE (U+FExx), I add U+202C after it, as shown below. Without this, RTL and LTR got mixed up for what I wanted.
string us = string.Format("\uFE9E\u202C\uFE98\u202C\uFEB8\u202C\uFEC6\u202C\uFEEB\u202C\u0020\u0660\u0662\u0664\u0668 Aa1");

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?
There are various ways to encode RTL languages. The most common way (and the Windows default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether the first letter visually appears on the left or the right side of the screen doesn't affect the order in which the letters are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, which became apparent when you tried to select Hebrew text: it reversed in an odd way (which was visually confusing). It appears this issue no longer exists in Visual Studio 2010 (at least with SP1, which I just tested).
Let's take a Hebrew word for which the direction is more clear to non-Hebrew speakers than the string specified in your text:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it is a correct test that should pass. If the image you uploaded has been rendered correctly, then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.
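The logical-ordering point can be checked directly in code. The sketch below uses a hypothetical helper, `LogicalOrder.ReverseChars`, that reverses UTF-16 code units; that is only safe for plain letters like the ones here, since it would break surrogate pairs and detach combining marks. The "But was" value in the question appears to be exactly the expected string with its characters reversed, which is what a switch between visual and logical order would produce for a purely Hebrew run.

```csharp
using System;

public static class LogicalOrder
{
    // Reverses the UTF-16 code units of a string. Assumed to be used only
    // on simple BMP letters without combining marks, for illustration.
    public static string ReverseChars(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }
}
```

With the word from above, `"\u05D9\u05D5\u05DF".Substring(0, 1)` yields "\u05D9", the short letter that is displayed on the right, and `LogicalOrder.ReverseChars` of the test's expected string gives the string the SDK now returns.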
I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!

Designing a translation API - How to handle spaces

My application consumes an external Translation API (no option to use other translation engines). I'm seeing the following unexpected behavior when I call the translation engine.
input
<b1> Hello World. </b1>
expected output
<b1> Hola a todos. </b1>
actual output
<b1>Hola a todos.</b1>
Is it proper for the API to be trimming the spaces? This feels wrong to me.
Note: it is documented to replace non-html tags with <b1></b1> tag pairs (numbers increment to keep tag pairs unique).
Update: The end result was that I had to hack around the issue and encode the spaces before I call the translation API. I don't like it, but I was not able to convince the API owner to change it to GIGO (Garbage In, Garbage Out).
Well, in general whitespace is not considered part of a word, so it is not really surprising that the API is doing that. Whether or not this behaviour is OK is probably debatable (at the least it should be documented), but you should follow the rule "be liberal in what you accept and strict in what you produce". As you produce the tokens, you should be more strict.
As far as I know, whitespace in HTML is not particularly significant, multiple spaces are collapsed to single space, newlines are ignored, etc. so it's not much of a surprise that the leading and trailing spaces in that string are being dropped. From the browser's point of view, they're equivalent.
So the question then becomes, is there an option in the API to preserve spaces or treat the incoming text as "plain text" and not html?
