Split Urdu words based on nonexistent space - c#

I have an Urdu word "لاعلم" and more similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".
English words can be easily separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.

There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" separated wouldn't mean anything.
A space is inserted in Urdu (and other Arabic-script languages) out of a practical need to demarcate words when the font would otherwise automatically ligature a character with adjoining ones. Often the only way to undo the ligature is to insert a superfluous space between the characters. Technically, the ZERO WIDTH NON-JOINER (U+200C) exists precisely for this purpose, but human beings are slow to learn and a space is easier to insert.
There are some characters that don't join with the following letter; for example, "ا" won't join with any following character, but it can join with a preceding character like "ل" to form the ligature "لا." You can use this list of characters (the same rules apply to Arabic) and write a custom tokenizer that ends a word after "right-joining" characters, a ZWNJ, or a space.
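A minimal sketch of such a tokenizer in C# follows; the set of right-joining letters below is only an illustrative subset, and you would want to complete it from the Unicode joining-type data for Arabic-script characters.

using System;
using System.Collections.Generic;
using System.Text;

static class UrduSplitter
{
    // Illustrative subset of letters that join only to the preceding letter
    // (right-joining). Extend this from the Unicode joining-type data.
    private static readonly HashSet<char> RightJoiningOnly = new HashSet<char>
    {
        '\u0627', // ا  alif
        '\u0622', // آ  alif madda
        '\u062F', // د  dal
        '\u0630', // ذ  dhal
        '\u0631', // ر  re
        '\u0632', // ز  ze
        '\u0648', // و  waw
        '\u0688', // ڈ  ddal
        '\u0691', // ڑ  rre
        '\u0698', // ژ  zhe
    };

    public static List<string> Split(string text)
    {
        var parts = new List<string>();
        var current = new StringBuilder();

        foreach (char c in text)
        {
            // A space or ZERO WIDTH NON-JOINER also ends the current piece.
            if (c == ' ' || c == '\u200C')
            {
                if (current.Length > 0) { parts.Add(current.ToString()); current.Clear(); }
                continue;
            }

            current.Append(c);

            // A right-joining letter cannot connect to what follows,
            // so it closes the current visual cluster.
            if (RightJoiningOnly.Contains(c))
            {
                parts.Add(current.ToString());
                current.Clear();
            }
        }

        if (current.Length > 0) parts.Add(current.ToString());
        return parts;
    }
}

With this sketch, UrduSplitter.Split("لاعلم") yields "لا" and "علم", because "ا" cannot join with the "ع" that follows it.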


Underscore in regex not validating

How do I add an underscore as part of my regex string?
Here is my string that checks for uppercase, lowercase, numbers and special characters. The rest of the special characters work. Validation isn't working for underscores.
#"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W]){1,})(?=(.*[!##$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"
Any ideas?
Thanks
This is the regex that AWS Cognito uses; it should apply to your situation:
#"^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])(?=.*[\^$*.\[\]{}\(\)?\-“!##%&\/,><’:;|_~`])\S{8,99}$"
You can check regexes at http://regexstorm.net; it's faster than rebuilding your application every time.
I've approached it like this: I took your requirements and made them into separate positive lookaheads:
Check for:
uppercase (?=.*[A-Z])
lowercase (?=.*[a-z]) (note that I broke A-Z and a-z up into separate groups)
numbers (?=.*\d)
special characters (?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])
You can then combine them in any order; I've combined them in the same order as I listed them above and anchored the pattern to the beginning of the line with ^. Don't add any extra matching before, in between, or after the lookaheads, since that could cause the regex to enforce a certain ordering of the groups:
The lookahead for any non-word character \W makes it impossible to match Underscore1_, since \W only matches "anything other than a letter, digit or underscore", and letters, digits and underscores are all that Underscore1_ contains.
The starting [^\s] (and ending [^\s]), each of which consumes one character, is likely destroying a lot of good matches. Whether the password is Underscore1_ or _1scoreUnder shouldn't matter, but if it starts with _ and you consume that character with [^\s] as you do, the later lookahead for a special character will fail (unless there is a second special character in the password).
#"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!##$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~])"
If you have a minimum length requirement of, say, 7 characters, you just have to add .{7,}$ to the end of the regex, making it:
#"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!##$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$"
Without a minimum length, a password of one character from each group will be enough, and since there are 4 groups, a password with only 4 characters will pass the filter.
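As a rough illustration (a minimal sketch, not production validation code; the test passwords are made-up examples), the combined pattern with the 7-character minimum could be used like this:

using System;
using System.Text.RegularExpressions;

class PasswordCheck
{
    // The four positive lookaheads plus a minimum length of 7 characters.
    private static readonly Regex Pattern = new Regex(
        @"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]).{7,}$");

    static void Main()
    {
        Console.WriteLine(Pattern.IsMatch("Underscore1_")); // True: upper, lower, digit and '_'
        Console.WriteLine(Pattern.IsMatch("_1scoreUnder")); // True: the order of the groups doesn't matter
        Console.WriteLine(Pattern.IsMatch("NoSpecial1"));   // False: no special character
        Console.WriteLine(Pattern.IsMatch("Ab1_"));         // False: shorter than 7 characters
    }
}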
I see no point in putting an upper length limit into the regex. If the user interface has accepted a string that is thousands of characters long, then why reject it for being too long later? The length of what you store is probably going to be much smaller anyway since you'll be storing the bcrypt/scrypt/argon2/... encoded password.
Suggestion: Also add space (or even whitespaces) to the list of special characters.
In your regexp, add the underscore to the 3rd capturing group (regex101):
#"^[^\s](?=(.*[A-Za-z]){1,})(?=(.*[\d]){1,})(?=(.*[\W_]){1,})(?=(.*[!##$%^&*()-+=\[{\]};:<>|_.\\/?,\-`'""~]{1,})).*[^\s]$"

Char.IsControl method does not recognize some characters as control

I noticed that the C# Char.IsControl method doesn't recognize some characters as control characters. For example, the following code outputs false for both values:
char pilcrow = '\u00B6';
char softHyphen = '\u00AD';
Console.Write("{0},{1}", char.IsControl(pilcrow), char.IsControl(softHyphen)); // -> 'False,False'
Is this expected behavior? I need to escape such characters in my code.
Those aren't control characters. One is the pilcrow sign ¶, which belongs to the Punctuation, Other [Po] category; the other is the soft hyphen, a non-visible formatting character that affects how text gets hyphenated.
There's nothing special about them; in fact, you have probably used the soft hyphen yourself when writing a paragraph in Word and wanting to control the hyphenation of some words. Word uses ¶ as the paragraph mark, a visualization of a paragraph's end. It doesn't affect formatting; it's just the common way to denote the end of a paragraph. In that respect it's no different from ², ³, §, ¶, ¤, ¦, °, ±, ½, ¬ (characters you get just by holding Right Alt and hitting keys).
.NET strings use Unicode so there's no need to escape these characters. You can just type them directly.
There's no problem with printing either - those characters are used in document processing after all. The soft hyphen controls how the UI or the print engine lays out text during rendering to the screen or paper.
If someone doesn't want those characters to be printed, a simple string.Remove will do the job. Re­moving the hyphen can affect how text is printed though, with long words moving to the next line. I added that hyphen to Removing in the previous sentence to force hyphenation. Without it, Removing would have moved to the next line
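If you do want to detect or strip such characters, one possible approach (a sketch, assuming you only care about invisible formatting characters like the soft hyphen, ZWJ and ZWNJ) is to test for the Unicode Format category instead of Char.IsControl:

using System;
using System.Globalization;
using System.Linq;

class FormatCharDemo
{
    static void Main()
    {
        string text = "Re\u00ADmoving"; // contains a soft hyphen (U+00AD)

        // Char.IsControl returns false for both, but their Unicode categories differ.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00AD')); // Format
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u00B6')); // OtherPunctuation

        // Strip only Format-category characters (soft hyphen, ZWJ, ZWNJ, ...).
        string stripped = new string(text
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());

        Console.WriteLine(stripped); // "Removing"
    }
}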

C# Regex boundary with special characters

I want to have a Regex that finds "Attributable".
I tried #"\bAttributable\b" but the \b boundary doesn't work with special characters.
For example, it wouldn't differentiate Attributable and Non-Attributable. Is there any way to regex for Attributable and not its negative?
Do a negative look-behind?
(?<!-)\bAttributable\b
Obviously this only checks for -s. If you want to check for other characters, put them in a character class in the negative look-behind:
(?<![-^])\bAttributable\b
Alternatively, if you just want to not match Non-Attributable but do match SomethingElse-Attributable, then put Non- in the look-behind:
(?<!Non-)\bAttributable\b
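For illustration (a quick sketch; the input strings are made up), the negative look-behind can be used from C# like this:

using System;
using System.Text.RegularExpressions;

class BoundaryDemo
{
    static void Main()
    {
        var attributable = new Regex(@"(?<!-)\bAttributable\b");

        Console.WriteLine(attributable.IsMatch("Attributable costs"));     // True
        Console.WriteLine(attributable.IsMatch("Non-Attributable costs")); // False: '-' right before the word
        Console.WriteLine(attributable.IsMatch("(Attributable)"));         // True: '(' is not excluded
    }
}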
There are several ways to fix an issue like yours, but it all depends on the real requirements. It is sometimes necessary to specify precisely which "word boundary" you need in each concrete case, since the \b word boundary is 1) context dependent, and 2) matches specific places in the string that you should be aware of:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Now, here are several approaches that you may follow:
When you only care about compound words usually joined with hyphens (similar to @Sweeper's answer): (?<!-)\bAttributable\b(?!-)
Only match between whitespaces or at the start/end of the string: (?<!\S)Attributable(?!\S). NOTE: if that is all you want, you can actually do without a regex by using s.Split().Contains("Attributable")
Only match if not preceded with punctuation and there is no letter/digit/underscore right after: (?<!\p{P})Attributable\b
Only match if not preceded with punctuation, apart from some specific characters (say, you want to allow a match right after a comma or a semicolon): (?<![^\P{P},;])Attributable\b.
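As a small demonstration of the whitespace-boundary approach (a sketch with a made-up input string):

using System;
using System.Linq;
using System.Text.RegularExpressions;

class WhitespaceBoundaryDemo
{
    static void Main()
    {
        string s = "Non-Attributable and Attributable items";

        // Only match between whitespace or at the start/end of the string.
        var rx = new Regex(@"(?<!\S)Attributable(?!\S)");
        Console.WriteLine(rx.Matches(s).Count); // 1: only the standalone word matches

        // The regex-free alternative mentioned above.
        Console.WriteLine(s.Split().Contains("Attributable")); // True
    }
}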

Word-Counter in some hieroglyphics languages?

Is there any available library for word-counting in some hieroglyphics languages (e.g. Chinese, Japanese, Korean...)?
I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?
Or is there any other solutions to achieve this purpose?
Is there any available library for word-counting in some hieroglyphics languages (e.g. Chinese, Japanese, Korean...)?
Hieroglyphics? No, they're not. They're logographic characters, and that's not such a subtle difference. I'm sure a native speaker could explain this much better than I can.
Japanese and Chinese text is made of characters exactly as in western languages, but one character may also be a word. Moreover, they don't need spaces to separate words, so our distinction between characters and words can't be made using blanks as delimiters.
What Word does is count words (assuming they'll be equal to characters), and you can do the same in your code by counting characters (just don't forget it's Unicode, so you can't count bytes). To count real words you need a dictionary (because you can't rely on spaces).
For example these strings:
这是一个示例文本
これは、サンプルのテキストです
They will be counted as 8 characters and 8 words (in Chinese) and 15 characters and 15 words (in Japanese). Actually that's not right (for example, the Japanese sentence is 5 words when transliterated into romaji). Moreover, don't forget that in Japanese they have more than one alphabet (and one family of them is phonetic).
What's the point? What will you count? Words transliterated into one of the phonetic representations (with Latin characters) that we use to represent them? Which one? The word counts would be pretty different, and only then would you actually be counting our concept of words (that's why, I suppose, Word counts characters).
That said, now try to write this code:
string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());
It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it's 24. There's no sense in counting spaces here. If you plan to count words in one transliteration, you need to use an external library (it's not an easy task to do by yourself), a different one for each language you want to support (it's somewhat easy to auto-detect the language, because Japanese makes heavy use of hiragana/katakana characters). Which one? There are a lot of them; I don't know about Chinese, but for Japanese a popular one for transliterating kanji is Kakasi.
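As a small illustration of the counting itself (using the same Chinese sample as above): string.Length counts UTF-16 code units, StringInfo counts user-perceived text elements, and Encoding.UTF8 gives the byte count.

using System;
using System.Globalization;
using System.Text;

class CountDemo
{
    static void Main()
    {
        string chinese = "这是一个示例文本";

        // UTF-16 code units (what Length returns); 8 for this sample.
        Console.WriteLine(chinese.Length);

        // User-perceived characters (text elements); also 8 here, but safer
        // when surrogate pairs or combining marks are present.
        Console.WriteLine(new StringInfo(chinese).LengthInTextElements);

        // Bytes under UTF-8: 24 (3 bytes per character for this sample).
        Console.WriteLine(Encoding.UTF8.GetByteCount(chinese));
    }
}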
Korean is a completely different story: it uses an alphabet just like the Latin one, but each character (which should really be called a syllable block) may be composed of several letters. Again, you can't rely on spaces alone for word counting. It's somewhat more complicated because here you may need a dictionary even for character counting (otherwise you'll just be counting syllables).

How to render a standalone Unicode character (Arabic) as it would look if it was being rendered within a word?

In written Arabic, characters look different depending on where they stand in a word. For example, the letter ta might look like this: ـثـ inside a word, but look like this: ﺙ if it stands by itself. I have some Arabic text, for example:
string word = "والتفويض";
When I render word as a whole word it renders correctly. Now, I want to parse the string and print out each letter in the word one at a time. However, if I do this:
foreach (char c in word.ToCharArray())
{
    Debug.Print(c.ToString());
}
The char c doesn't print out the original representation of the letter as it was rendered in the context of the word; instead it prints the Arabic letter as if it were rendered by itself. How can I parse my string of Arabic text so that the letters returned look the same as when they were displayed as part of the whole word?
I'm trying to do this in C#.
There are characters in the UCS that represent particular forms of Arabic characters. However, these do not work well when moving from one context to another.
In general, if you want to indicate that a letter is joined to another when there is no such letter to join it to, you should use U+200D ZERO WIDTH JOINER at the appropriate place (before the character to place the joiner to the right, after the character to place it to the left, or one on either side).
Conversely, placing U+200C ZERO WIDTH NON-JOINER between characters will break their joining.
Just how well that works in practice will depend on the rendering engine processing the characters.
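A minimal sketch of that idea in C# (how faithfully the contextual forms are preserved still depends on the font and rendering engine doing the display):

using System;

class ArabicFormsDemo
{
    const char Zwj = '\u200D'; // ZERO WIDTH JOINER

    static void Main()
    {
        string word = "والتفويض";

        for (int i = 0; i < word.Length; i++)
        {
            // Surround the letter with joiners so the renderer picks the same
            // contextual form (initial/medial/final) it had inside the word.
            string shown = word[i].ToString();
            if (i > 0) shown = Zwj + shown;               // there was a preceding letter
            if (i < word.Length - 1) shown = shown + Zwj; // there is a following letter
            Console.WriteLine(shown);
        }
    }
}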
