Word counter for some "hieroglyphic" languages? - c#

Is there any available library for word-counting in "hieroglyphic" languages (e.g. Chinese, Japanese, Korean)?
I found that MS Word counts text in these languages effectively. Can I add a reference to the MS Word libraries in my .NET application to implement this function?
Or is there any other solution to achieve this?

Is there any available library for word-counting in "hieroglyphic" languages (e.g. Chinese, Japanese, Korean)?
Hieroglyphics? No, they're not. They're logographic characters, and it's not such a subtle difference. I'm sure a native speaker could explain this much better than I can.
Japanese and Chinese text is made of characters exactly like Western languages, but a single character may also be a word. Moreover, these languages don't need spaces to separate words, so the usual character/word distinction can't be made using blanks as delimiters.
What Word does is count words on the assumption that they're equal to characters, and you can do the same in your code by counting characters (just don't forget the text is Unicode, so you can't count bytes). To count real words you need a dictionary (because you can't rely on spaces).
For example, these strings:
这是一个示例文本
これは、サンプルのテキストです
These will be counted as 8 characters and 8 words (Chinese) and 15 characters and 15 words (Japanese). Actually that's not right (for example, the Japanese sentence is 5 words when transliterated into romaji). Moreover, don't forget that Japanese has more than one writing system (and one family of them is phonetic).
So what's the point? What will you count? Words transliterated into one of the phonetic representations (with Latin characters) that we use to represent them? Which one? The word counts would differ considerably, and they would actually reflect our concept of a word (which is why, I suppose, Word counts characters).
That said, now try this code:
string text = "这是一个示例文本";
MessageBox.Show(text.Length.ToString());
It'll display 8, as Word does (we're counting characters); in bytes (assuming UTF-8 encoding) it's 24. There is no sense in counting spaces here. If you plan to count words in one transliteration, you need to use an external library (it's not an easy task to do it yourself), and a different one for each language you want to support (auto-detecting the language is somewhat easy, because Japanese uses hiragana/katakana characters very often). Which one? There are a lot of them; I don't know about Chinese, but for Japanese a popular one for transliterating kanji is Kakasi.
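To make the character/byte distinction concrete, here is a minimal sketch (it assumes using System.Text; for the Encoding class; .NET strings are UTF-16 internally, so the 24-byte figure applies to a UTF-8 encoding of the text):
string text = "这是一个示例文本";
int charCount = text.Length;                      // 8 UTF-16 code units, one per character here
int byteCount = Encoding.UTF8.GetByteCount(text); // 24 bytes: each of these characters takes 3 bytes in UTF-8
MessageBox.Show(charCount + " characters, " + byteCount + " bytes");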
Korean is a completely different story: it uses an alphabet, exactly like the Latin one, but each character (which should really be called a syllable block) may be composed of several letters. Again, spaces aren't required, so you can't rely on them for word counting. It's somewhat more complicated, because here you may need a dictionary even for character counting (otherwise you'll just count syllables).
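As a small illustration of the syllable/letter distinction, Unicode normalization form D decomposes each precomposed Hangul syllable block into its component jamo letters, so counting before and after shows the difference (a minimal sketch, assuming using System.Text; for NormalizationForm):
string korean = "한글"; // two syllable blocks
Console.WriteLine(korean.Length);                                    // 2 (syllable blocks)
Console.WriteLine(korean.Normalize(NormalizationForm.FormD).Length); // 6 (jamo letters, 3 per syllable here)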

Related

Combining (some) Unicode nonspacing marks with associated letters for uniform processing

I am writing a text-processing Windows app in C#. The app processes many plain text files to count characters, words, etc. To do this, the app iterates over the characters in each file. I am finding that some text files represent accented letters such as á by using the single Unicode character U+00E1 (small letter A with acute), while others use an unaccented a (U+0061, small letter A) followed by U+0301 (combining acute accent). There's no visual difference in how the text is rendered on screen by Notepad or the other editors I've used, but the underlying character stream is obviously different.
I would like to detect and treat these two situations in the same way. In other words, I would like my app to combine a letter followed by a combining codepoint into the equivalent self-contained character. For example, I'd like to combine the sequence U+0061 U+0301 into U+00E1. As far as I know, there is no simple algorithm to do this, other than a large and error-prone lookup table for all the possible combinations of plain letters and combining characters.
Is there a simpler and more direct algorithm to perform this combination?
You're referring to Unicode normalization forms. That page goes into some interesting detail, but the gist is that representing e.g. accented letters as a single codepoint (e.g. á as U+00E1) is Normalization Form C, or NFC, and as separate codepoints (e.g. á as U+0061 U+0301) is NFD.
Section 3.11 of the Unicode specification goes into the gory details of how to implement it, with some extra details here.
Luckily, you don't need to implement this yourself: string.Normalize() already exists.
"\u00E1".Normalize(NormalizationForm.FormD); // \u0061\u0301
"\u0061\u0301".Normalize(NormalizationForm.FormC); // \u00E1
That said, we've only just scratched the surface of what a "character" is. A good illustration of this uses emoji, but it applies to various scripts as well: there are modern scripts where normal characters are composed of two codepoints, with no single combined codepoint available. This pops up in e.g. Tamil and Thai, as well as some Eastern European languages (IIRC).
My favourite example is 👩🏽‍🚒, or "Woman Firefighter: Medium Skin Tone". Want to guess how that's encoded? That's right, 4 different code points: U+1F469 U+1F3FD U+200D U+1F692.
U+1F469 is 👩, the Woman emoji.
U+1F3FD is "Emoji Modifier Fitzpatrick Type-4", which modifies the previous emoji to give a brown skin tone 👩🏽, rendered as 🏽 when it appears on its own.
U+200D is a "zero-width joiner", which is used to glue codepoints together into the same character
U+1F692 is 🚒, the Fire Engine emoji.
So you take a woman, add a brown skin tone, glue her to a fire engine, and you get a woman brown-skinned firefighter.
(Just for fun, try pasting 👩🏽‍🚒 into various editors and then using backspace on it. If it's rendered properly, some editors turn it into 👩🏽 and then 👩 and then delete it, while others skip various parts. However, you select it as a single character. This mirrors how editing complex characters works in some scripts).
(Another fun nugget are the flag emoji. Unicode defines "Regional Indicator Symbol Letters A-Z" (U+1F1E6 through U+1F1FF), and a flag is encoded as the country's ISO 3166-1 alpha-2 letter country code using these indicator symbols. So 🇺🇸 is 🇺 followed by 🇸. Paste the 🇸 after the 🇺 and a flag appears!)
Of course, if you're iterating over this codepoint-by-codepoint you're going to visit U+1F469 U+1F3FD U+200D U+1F692 individually, which probably isn't what you want.
If you're iterating this char-by-char you're going to do even worse due to surrogate pairs: those codepoints such as U+1F469 are simply too large to represent using a single 16-bit char, so we need to use two of them. This means that if you try to iterate over U+1F469, you'll actually find you've got two chars: 0xD83D (the high surrogate) and 0xDC69 (the low surrogate).
Instead, we need to introduce extended grapheme clusters, which represent what you'd traditionally think of as a single character. Again there's a bunch of complexity if you want to do this yourself, and again someone's helpfully done it for you: StringInfo.GetTextElementEnumerator. Note that this was a bit buggy pre-.NET 5, and didn't properly handle all EGCs.
In .NET 5, however:
// Number of chars, as 3 of the codepoints need to use surrogate pairs when
// encoded with UTF-16
"👩🏽‍🚒".Length; // 7
// Number of Unicode codepoints
"👩🏽‍🚒".EnumerateRunes().Count(); // 4
// Number of extended grapheme clusters
GetTextElements("👩🏽‍🚒").Count(); // 1
public static IEnumerable<string> GetTextElements(string s)
{
    TextElementEnumerator charEnum = StringInfo.GetTextElementEnumerator(s);
    while (charEnum.MoveNext())
    {
        yield return charEnum.GetTextElement();
    }
}
I've used emoji as an understandable example here, but these issues also crop up in modern scripts, and people working with text need to be aware of them.

Split Urdu words based on nonexistent space

I have an Urdu word "لاعلم" and more similar words. How can I split the word so that I get "لا" and "علم" separately in an array? I have tried converting the words to Unicode characters, but I can't detect the break between "لا" and "علم".
English words can easily be separated based on spaces, but I am stuck on separating Urdu words, where there are no spaces.
There is no space because it's a single word meaning "ignorant." As a matter of fact, "لا" and "علم" separated wouldn't mean anything.
Space is inserted in Urdu (and Arabic script) out of a practical need to demarcate words where the font would otherwise automatically ligature a character with the adjoining characters. The only way to undo such a ligature is to insert a superfluous space between the characters. Technically, the ZERO WIDTH NON-JOINER (U+200C) exists precisely for this purpose, but human beings are slow to learn, and a space is easy to insert.
There are some characters that don't join with following letters; for example, "ا" won't join with any following character, but it can join with a preceding character like "ل" to form the ligature "لا". You can use this list of characters (the same rules apply to Arabic) and write a custom tokenizer that ends a word after "right-joining" characters, a ZWNJ, or a space.
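A minimal sketch of such a tokenizer, assuming a hypothetical (and incomplete) set of right-joining-only letters; a real implementation should take the full joining-type data from the Unicode Character Database (needs using System.Collections.Generic; and using System.Text;):
// Letters that join only to the preceding letter, never to the following one.
// This set is illustrative, not exhaustive.
static readonly HashSet<char> RightJoiningOnly =
    new HashSet<char> { 'ا', 'د', 'ذ', 'ر', 'ز', 'و', 'ے' };

static List<string> SplitLigatures(string word)
{
    var parts = new List<string>();
    var current = new StringBuilder();
    foreach (char c in word)
    {
        if (c == ' ' || c == '\u200C') // a space or ZWNJ ends a part
        {
            if (current.Length > 0) { parts.Add(current.ToString()); current.Clear(); }
            continue;
        }
        current.Append(c);
        if (RightJoiningOnly.Contains(c)) // the joining breaks after these letters
        {
            parts.Add(current.ToString());
            current.Clear();
        }
    }
    if (current.Length > 0) parts.Add(current.ToString());
    return parts;
}
// SplitLigatures("لاعلم") returns ["لا", "علم"]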

Counting special UTF-8 characters

I'm looking for a way to count special characters that are formed by more than one character, but I have found no solution online!
For example, I want to count the string "வாழைப்பழம". It actually consists of 6 Tamil characters, but it is 9 characters long when we use the normal way to find the length. I am wondering whether Tamil is the only script that causes this problem, and whether there is a solution to this. I'm currently trying to find a solution in C#.
Thank you in advance =)
Use StringInfo.LengthInTextElements:
var text = "வாழைப்பழம";
Console.WriteLine(text.Length); // 9
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 6
The explanation for this behaviour can be found in the documentation of String.Length:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
A minor nitpick: strings in .NET use UTF-16, not UTF-8
When you're talking about the length of a string, there are several different things you could mean:
Length in bytes. This is the old C way of looking at things, usually.
Length in Unicode code points. This gets you closer to modern times and should be how string lengths are treated, except it isn't.
Length in UTF-8/UTF-16 code units. This is the most common interpretation, deriving from 1. Certain characters take more than one code unit in those encodings, which complicates things if you don't expect it.
Count of visible “characters” (graphemes). This is usually what people mean when they say characters or the length of a string.
In your case, your confusion stems from the difference between 4. and 3.: 3. is what C# uses, 4. is what you expect. Complex scripts such as Tamil use ligatures and diacritics. Ligatures are contractions of two or more adjacent characters into a single glyph – in your case ழை is a ligature of ழ and ை, the latter of which changes the appearance of the former; வா is also such a ligature. Diacritics are ornaments around a letter, e.g. the accent in à or the dot above ப்.
The two cases I mentioned both result in a single grapheme (what you perceive as a single character), yet they both need two actual characters each. So you end up with three code points more in the string.
One thing to note: For your case the distinction between 2. and 3. is irrelevant, but generally you should keep it in mind.
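A sketch showing all four lengths side by side for the string in question (EnumerateRunes requires .NET Core 3.0 or later; the byte count assumes a UTF-8 encoding):
using System;
using System.Globalization;
using System.Linq;
using System.Text;

string text = "வாழைப்பழம";
Console.WriteLine(Encoding.UTF8.GetByteCount(text));          // 1. bytes in UTF-8: 27
Console.WriteLine(text.EnumerateRunes().Count());             // 2. Unicode code points: 9
Console.WriteLine(text.Length);                               // 3. UTF-16 code units: 9 (no surrogate pairs here)
Console.WriteLine(new StringInfo(text).LengthInTextElements); // 4. graphemes: 6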

Algorithm to find keywords and keyphrases in a string

I need advice or directions on how to write an algorithm which will find keywords or keyphrases in a string.
The string contains:
Technical information written in English (GB)
Words are mostly separated by spaces
A keyword does not contain a space but it may contain a hyphen, apostrophe, colon etc.
A keyphrase may contain a space, a comma or other punctuation
If two or more keywords appear together then it is likely a keyphrase e.g. "inverter drive"
The text also contains HTML but this can be removed beforehand if necessary
Non-keywords would be words like "and", "the", "we", "see", "look" etc.
Keywords are case-insensitive e.g. "Inverter" and "inverter" are the same keyword
The algorithm has the following requirements:
Operate in a batch-processing scenario e.g. run once or twice a day
Process strings varying in length from roughly 200 to 7000 characters
Process 1000 strings in less than 1 hour
Will execute on a server with moderately good power
Written in one of the following: C#, VB.NET, or T-SQL; maybe even F#, Python, or Lua, etc.
Does not rely on a list of predefined keywords or keyphrases
But can rely on a list of keyword exclusions e.g. "and", "the", "go" etc.
Ideally transferable to other languages, i.e. it doesn't rely on language-specific features such as metaprogramming
Output a list of keyphrases (descending order of frequency) followed by a list of keywords (descending order of frequency)
It would be extra cool if it could process up to 8000 characters in a matter of seconds, so that it could be run in real-time, but I'm already asking enough!
Just looking for advice and directions:
Should this be regarded as two separate algorithms?
Are there any established algorithms which I could follow?
Are my requirements feasible?
Many thanks.
P.S. The strings will be retrieved from a SQL Server 2008 R2 database, so ideally the language would have support for this, if not then it must be able to read/write to STDOUT, a pipe, a stream or a file etc.
The logic involved makes this complicated to program in T-SQL. Choose a language like C#. First try to make a simple desktop application. Later, if you find that loading all the records into this application is too slow, you could write a C# stored procedure that executes on the SQL Server. Depending on the security policy of the SQL Server, the assembly will need to be signed with a strong name.
To the algorithm now. A list of excluded words is commonly called a stop word list. If you do some googling for that term, you will find stop word lists you can start with. Add these stop words to a HashSet<T> (I'll be using C# here):
// Assuming that each line contains one stop word.
HashSet<string> stopWords =
    new HashSet<string>(File.ReadLines(@"C:\stopwords.txt"), StringComparer.OrdinalIgnoreCase);
Later you can check whether a keyword candidate is in the stop word list:
if (!stopWords.Contains(candidate)) {
    // We have a keyword.
}
HashSets are fast: they have an access time of O(1), meaning that the time required for a lookup does not depend on the number of items the set contains.
Looking for the keywords can easily be done with Regex.
string text = ...; // Load text from DB
MatchCollection matches = Regex.Matches(text, "[a-z]([:']?[a-z])*",
RegexOptions.IgnoreCase);
foreach (Match match in matches) {
    if (!stopWords.Contains(match.Value)) {
        ProcessKeyword(match.Value); // Do whatever you need to do here
    }
}
If you find that a-z is too restrictive for letters and you need accented letters, you can change the regex to @"\p{L}([:']?\p{L})*". The character class \p{L} contains all letters and letter modifiers.
The phrases are more complicated. You could try to split the text into phrases first and then apply the keyword search on these phrases instead of searching the keywords in the whole text. This would give you the number of keywords in a phrase at the same time.
Splitting the text into phrases involves searching for sentences ending with "." or "?" or "!" or ":". You should exclude dots and colons that appear within a word.
string[] phrases = Regex.Split(text, @"[\.\?!:](\s|$)");
This searches for punctuation followed either by whitespace or the end of the string. But I must agree this is not perfect: it might erroneously detect abbreviations as sentence ends. You will have to experiment in order to refine the splitting mechanism.
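Putting the pieces together, a minimal sketch of the whole batch step: count keyword frequencies per phrase and print them in descending order (stopWords and text come from the snippets above; a Dictionary stands in for the ProcessKeyword placeholder):
var keywordCounts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
foreach (string phrase in Regex.Split(text, @"[\.\?!:](\s|$)"))
{
    if (string.IsNullOrWhiteSpace(phrase)) continue; // Regex.Split also returns the captured separators
    foreach (Match m in Regex.Matches(phrase, @"\p{L}([:']?\p{L})*"))
    {
        if (stopWords.Contains(m.Value)) continue;
        keywordCounts[m.Value] = keywordCounts.TryGetValue(m.Value, out int n) ? n + 1 : 1;
    }
}
foreach (var pair in keywordCounts.OrderByDescending(p => p.Value)) // keywords by descending frequency
    Console.WriteLine(pair.Key + ": " + pair.Value);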

Regular expression to catch letters beyond a-z

A normal regexp to allow letters only would be "[a-zA-Z]", but I'm from Sweden, so I would have to change that into "[a-zåäöA-ZÅÄÖ]". But suppose I don't know what letters are used in the alphabet.
Is there a way to automatically know which chars are valid in a given locale/language, or should I just make a blacklist of chars that I (think I) know I don't want?
You can use \pL to match any 'letter', which will support all letters in all languages. You can narrow it down to specific languages using 'named blocks'. More information can be found in the Character Classes documentation on MSDN.
My recommendation would be to put the regular expression (or at least the "letter" part) into a localised resource, which you can then pull out based on the current locale and form into the larger pattern.
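For illustration, a minimal sketch using \p{L} (assumes using System.Text.RegularExpressions;; the Swedish test word is just an example):
var lettersOnly = new Regex(@"^\p{L}+$");             // one or more letters, in any script
Console.WriteLine(lettersOnly.IsMatch("räksmörgås")); // True
Console.WriteLine(lettersOnly.IsMatch("abc123"));     // False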
What about \p{name}?
"Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges. For example, Ll, Nd, Z, IsGreek, IsBoxDrawing."
I don't know enough about Unicode, but maybe your characters fit a Unicode class?
See character category selection with \p and \w Unicode semantics.
All chars are "valid," so I think you're really asking for chars that are "generally considered to be letters" in a locale.
The Unicode specification has some guidelines, but in general the answer is "no," you would need to list the characters you decide are "letters."
Is there a way to automatically know what chars are valid in a given locale/language or should I just make a blacklist of chars that I (think I) know I don't want?
This is not, in general, possible.
After all, English text does include some accented characters (e.g. in "fête" and "naïve", which in UK English, to be strictly correct, still take accents). In some languages some of the standard letters are rarely used (e.g. y-diaeresis in French).
Then consider that foreign words are often included (this will frequently be the case where technical terms are used). Quotations would be another source.
If your requirements are sufficiently narrowly defined, you may be able to create a definition, but this requires linguistic experience in that language.
This regex lets only Latin-1 letters (plus space) through; note that the range À-ÿ also includes the × and ÷ signs:
[a-zA-ZÀ-ÿ ]
