Numbered Musical Notes into RichTextBox - C#

Jianpu notes look something like this:
So I want to make an application where the user can specify the notes and the output is the sound of those notes.
My problem is that I don't know how to display the notes like the ones above in a RichTextBox.

There are some fonts out there, but you will have to test their quality.
Here is one that is made for jianpu notes (more details here), but it may not work without problems.
Here is a solution for Erhu Players & Jianpu Readers.
And creating a set of notes with a free font maker is also an option.
And finally you might do it all in .NET, including all the painting, but try the fonts first!

There is U+0307 (combining dot above) (looks like 1̇ ) or U+0358 (combining dot above right) (looks like 1͘ ), but they don't perform very well for your task in my opinion. I think U+0301 (combining acute accent) (looks like 1́ ) is better, although not very accurate.
For the bottom part, U+0316 (combining grave accent below) (looks like 1̖ ) is not very nice. You can try U+0323 (combining dot below) (looks like 1̣).
You add the combining characters after the normal letter, and you can combine several of them (like 1̣́). Note that the results may vary among different fonts. The fonts that, in my experience, support Unicode best are Arial and Times New Roman. I usually open Word, go to Insert/Symbol, and try what looks best.
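To give an idea of how this looks in code, here is a minimal WinForms sketch; richTextBox1 and the chosen font/size are just assumptions for illustration:

// Sketch: jianpu-style digits built from a base digit plus Unicode combining marks.
// richTextBox1 is assumed to be a RichTextBox already placed on the form.
richTextBox1.Font = new System.Drawing.Font("Arial", 24f); // a font with decent combining-mark support
richTextBox1.Text =
    "1\u0307 " +        // 1 with combining dot above
    "1\u0323 " +        // 1 with combining dot below
    "1\u0301\u0323";    // 1 with acute accent and dot below combined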
For the best results I recommend looking for a specialized font that has all the tones built in. Or create such a font by yourself. CorelDraw was able (in Version 6) to create fonts. I guess it still can in newer versions.

C#'s StringInfo and TextElementEnumerator can't recognize graphemes properly

In C#, the StringInfo and TextElementEnumerator classes provide methods and properties for working with text elements.
And here we can find the definition of a text element:
The .NET Framework defines a text element as a unit of text that is displayed as a single character, that is, a grapheme. A text element can be any of the following:
Yes, it says a text element is a grapheme in .NET. I also tested with some Unicode characters myself, and it really seemed to be true, until I tested the Korean letter '가'.
As we all know, some Unicode characters consist of multiple code points. We may also face code point sequences, and that's the reason I'm using StringInfo and TextElementEnumerator instead of a plain String.
StringInfo and TextElementEnumerator correctly identified surrogate pairs. And "\u0061\u0308", a Unicode character that consists of multiple code points, was recognized as one text element, just as expected. But for "\u1100\u1161", they failed to report that it is also one text element.
"\u1100" is a leading consonant "ㄱ", and "\u1161" is a vowel "ㅏ". They can be individual characters, shown to users just as I have written them here. But when they are used together, they are rendered as the single character "가" instead of "ㄱㅏ".
There are two ways to represent the Korean character "가":
Using a single code point U+AC00 from Hangul Syllable.
Using two code points U+1100 and U+1161 from Jamo.
Most of the time the former is used. The latter is rarely used; to be honest, I can't imagine when it's used at all.
Anyway, the first one is just a single precomposed letter, and the second is a sequence of a leading consonant and a vowel that is treated as one character. When rendered they look exactly the same, and they are in fact canonically equivalent.
Also, the following line returns true in C#:
"\u1100\u1161".Normalize() == "\uAC00"
I wonder why Normalize() works just fine here when C# doesn't think they are one complete text element.
I thought it had something to do with my .NET version, but it turns out that's not the case. It happens in Mono too.
I tested this with ICU as well, and it could treat "\u1100\u1161" as one grapheme correctly!
I initially thought StringInfo and TextElementEnumerator could eliminate the need for ICU4C in some simple cases, so I'm very disappointed now.
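For reference, here is a small test sketch that reproduces what I'm describing; the counts in the comments are what I observe on the .NET versions discussed here:

using System;
using System.Globalization;

class GraphemeTest
{
    static void Main()
    {
        PrintElementCount("a + combining diaeresis", "\u0061\u0308"); // 1 text element, as expected
        PrintElementCount("precomposed syllable",    "\uAC00");       // 1 text element
        PrintElementCount("jamo lead + vowel",       "\u1100\u1161"); // reported as 2 text elements
    }

    static void PrintElementCount(string label, string s)
    {
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        int count = 0;
        while (e.MoveNext()) count++;
        Console.WriteLine("{0}: {1} text element(s)", label, count);
    }
}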
Here's my question :
Am I doing something wrong here?
or
Is a text element in .NET not a user-perceived character, unlike in ICU?
The basic issue here is that per the Korean standard KS X 1026, the two jamos ㄱ and ㅏ are distinct from their combined form 가. In fact, this exact example is used in the official standard (see section 6.2).
Long story short, Microsoft attempted to follow the standard but other operating systems and applications don't necessarily do so. Hence you can get "malformed" content from other software / platforms that appears to be parsed incorrectly on Windows / in .NET, even though it is parsed "correctly" on those platforms.
You will either need to ensure your data is correctly formed in the first place (unlikely, given that the de-facto standard is to completely ignore the official standard) or you will need to use ICU (or a similar library) to deal with these cases.
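If normalizing the input is acceptable for your data, one pragmatic way to get it into the "correctly formed" precomposed shape before counting text elements is a plain NFC normalization. This is only a sketch of that idea, not a replacement for ICU, and it won't help with content that must stay in decomposed jamo form:

// Sketch: fold decomposed jamo into precomposed syllables before using StringInfo.
string input = "\u1100\u1161";                                      // ㄱ + ㅏ
string nfc = input.Normalize(System.Text.NormalizationForm.FormC);  // becomes "\uAC00" (가)

var e = System.Globalization.StringInfo.GetTextElementEnumerator(nfc);
int count = 0;
while (e.MoveNext()) count++;
System.Console.WriteLine(count);                                    // 1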

Font conversion: Kruti Dev to Mangal

I have a Microsoft Access file in which data is stored in the Kruti Dev font. I have to convert it to the Mangal font. Is there any free API available to do so? If not, please suggest ways to do it programmatically.
If you are talking about Kruti Dev, these fonts use Latin/English code points (the U+0000 range) to encode Devanagari glyph shapes. Mangal, on the other hand, correctly uses code points from the Devanagari range (U+0900). In other words, Devanagari 'Ka' (क), which is U+0915 in Mangal, is instead assigned to U+0064 in Kruti Dev, and U+0064 is supposed to be Latin 'd'.
To get your database contents to display correctly with Mangal, you'll need to transform the text from "Devanagari-as-Latin" code points into proper Devanagari code points. That will require some sort of translation table.
A search for 'kruti dev to unicode converter' turns up a number of tools, some free, some paid. There's even an online version. I was unable to find source code for these tools, which would likely contain the translation table you need, but you may be able to find something. Or you might be able to derive your own table by running a full complement of Kruti-encoded text through a converter and examining the output.
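To give an idea of the shape of such a converter, here is a minimal sketch. The क/'d' pair is the one mentioned above; everything else (a complete table, plus any reordering rules that legacy fonts typically need for pre-base matras) is left out and would need to be filled in:

using System.Collections.Generic;
using System.Text;

static class KrutiDevToMangal
{
    // Illustrative fragment of a translation table; a real one needs many more entries.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'd', "\u0915" },   // Kruti Dev 'd' -> Devanagari Ka (क)
        // ... remaining Kruti Dev code points go here ...
    };

    public static string Convert(string krutiText)
    {
        var sb = new StringBuilder(krutiText.Length);
        foreach (char c in krutiText)
        {
            string devanagari;
            sb.Append(Map.TryGetValue(c, out devanagari) ? devanagari : c.ToString());
        }
        return sb.ToString();
    }
}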

regex that can handle horribly misspelled words

Is there a way to create a regex that will ensure that five out of eight characters are present, in order, within a given character range (say, 20 characters)?
I am dealing with horrible OCR/scanning, and I can stand the false positives.
Is there a way to do this?
Update: I want to match, for example, "mshpeln" as a misspelling. I do not want to do OCR. The OCR job has been done, but it has been done poorly (i.e. the original said misspelling, but the OCR'd copy reads "mshpeln"). I do not know what the text that I will have to match against will be (i.e. I do not know that it is "mshpeln"; it could be "mispel" or any number of other combinations).
I am not trying to use this as a spell checker, but merely to find the end of a capture group.
I think you need not a regex, but a database with all valid words and creative use of functions like soundex() and/or levenshtein().
You can do this: create a table with all valid words (a dictionary), populate it with columns like word and snd (computed as soundex(word)), and create indexes for both the word and snd columns.
For example, for the word mispeling you would fill snd with M214. If you use SQLite, it has a soundex() function (available when compiled with SQLITE_SOUNDEX).
Now, when you get a new bad word, compute soundex() for it and look it up in your indexed table. For example, for the word mshpeln it would be soundex('mshpeln') = M214. There you go: this way you can get back the correct word.
But this would not look anything like regex - sorry.
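If your database doesn't ship with soundex(), it is easy enough to compute on the application side. Here is a rough C# sketch of the classic American Soundex, simplified (it skips the H/W separator rule):

using System.Text;

static class Soundex
{
    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        string upper = word.ToUpperInvariant();
        var code = new StringBuilder();
        code.Append(upper[0]);

        char previous = Digit(upper[0]);
        for (int i = 1; i < upper.Length && code.Length < 4; i++)
        {
            char d = Digit(upper[i]);
            if (d != '0' && d != previous)
                code.Append(d);
            previous = d;
        }
        return code.ToString().PadRight(4, '0');
    }

    private static char Digit(char c)
    {
        switch (c)
        {
            case 'B': case 'F': case 'P': case 'V': return '1';
            case 'C': case 'G': case 'J': case 'K':
            case 'Q': case 'S': case 'X': case 'Z': return '2';
            case 'D': case 'T': return '3';
            case 'L': return '4';
            case 'M': case 'N': return '5';
            case 'R': return '6';
            default: return '0'; // vowels, H, W, Y and anything else
        }
    }
}

// Soundex.Encode("mispeling") == "M214"
// Soundex.Encode("mshpeln")   == "M214"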
To be honest, I think that a project like this would be better for an actual human to do, not a computer. If the project is too large for one or two people to do easily, you might want to look into something like Amazon's Mechanical Turk, where you can outsource the work for pennies per solution.
This can't be done with a regex, but it can be done with a custom algorithm.
For example, to find words that are like 'misspelling' in your body of text:
1) Preprocess. Create a Set (in the mathematical sense: a collection of guaranteed-unique elements) with all of the unique letters that are in misspelling - {e, i, g, l, m, n, p, s}
2) Split the body of text into words.
3) For each word, create a Set with all of its unique letters. Then, perform the operation of set intersection on this set and the set of the word you are matching against - this will get you letters that are contained by both sets. If this set has 5 or more characters left in it, you have a possible match here.
If the OCR can add in erroneous spaces, then consider two words at a time instead of single words, and so on, based on what your requirements are.
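A quick sketch of that idea; the target word, threshold, and separators are just examples:

using System;
using System.Collections.Generic;

class FuzzyOcrMatch
{
    static void Main()
    {
        const string target = "misspelling";
        const int minSharedLetters = 5;

        var targetLetters = new HashSet<char>(target);   // {m, i, s, p, e, l, n, g}

        string body = "the text was mshpeln in several places";
        foreach (string word in body.Split(new[] { ' ', '.', ',', ';' },
                                           StringSplitOptions.RemoveEmptyEntries))
        {
            var shared = new HashSet<char>(word.ToLowerInvariant());
            shared.IntersectWith(targetLetters);          // set intersection from step 3

            if (shared.Count >= minSharedLetters)
                Console.WriteLine("Possible match: " + word);   // prints "mshpeln"
        }
    }
}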
I have no solution for this problem; in fact, here's exactly the opposite.
Correcting OCR errors is not programmatically possible, for two reasons:
You cannot quantify the error made by the OCR algorithm, as it can be anywhere between 0 and 100%.
To apply a correction, you need to know what the maximum error could be in order to set an acceptable level.
Say nello world is the first guess for "hello world", which is quite similar. Then, with another font, or text printed in "painful" yellow or something, a second guess for the same expression is noiio verio. How should a computer know that this word would have been similar if it had been recognized better?
Otherwise, given a predetermined error, mvp's solution seems to be the best in my opinion.
UPDATE:
After digging a little, I found a reference that may be relevant: String similarity measures

Why does this happen with ToolStripMenuItems?

When adding a ToolStripMenuItem to a form, setting RightToLeft to Yes, and having a quote at the end of the text, why does it place the quote at the front of the text?
ToolStripMenuItem1.Text = "Name \"Text\"";
ToolStripMenuItem1.RightToLeft = System.Windows.Forms.RightToLeft.Yes;
This displays as: "Name "Text
Edit: This also happens with single quotes.
By setting RightToLeft to Yes, you are asking the Windows text rendering engine to apply the text layout rules used in right-to-left languages such as Arabic and Hebrew. Those rules are pretty subtle, especially because English phrases are not uncommon in those languages. It is not going to render "txeT emaN" the way it normally would with Arabic or Hebrew glyphs; that wouldn't make sense to anybody. It needs to identify sentences or phrases and reverse those. Quotes are special: they delineate a phrase.
Long story short, you are abusing a feature to get alignment when it was really meant to do something far more involved. Don't use it for that.
EDIT:
If you just want to change text alignment, then there is always ToolStripItem.Alignment or ToolStripItem.Padding to try, as sketched below...
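For example (a sketch; whether Alignment, Padding, or the related TextAlign property gives the effect you want depends on your layout):

ToolStripMenuItem1.Text = "Name \"Text\"";
// Push the item itself to the far side of the ToolStrip:
ToolStripMenuItem1.Alignment = System.Windows.Forms.ToolStripItemAlignment.Right;
// Or align the text within the item:
ToolStripMenuItem1.TextAlign = System.Drawing.ContentAlignment.MiddleRight;
// Or nudge it with padding instead:
ToolStripMenuItem1.Padding = new System.Windows.Forms.Padding(20, 0, 0, 0);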
The feature you are using is meant for localization and can support mixed content...
The way you use that feature seems more like an abuse... since Windows needs to make sense of "mixed content" of which you provide an extreme (no Arabic at all).
I always find it hard to be sure that such behaviour, though unexpected and unintuitive, is really a bug...
Rendering logic for BiDi text is rather complex - for example, an exclamation mark somewhere in text that should be rendered right-to-left can reverse the direction, depending on the implementation...
For some insight see http://www.unicode.org/reports/tr9/
It seems .NET is not fully compliant with the Unicode BiDi algorithm... there is even a library that tries to implement it; see http://sourceforge.net/projects/nbidi/

Best way to extract only the bold text from a PDF

iTextSharp is a great tool; I can use
PdfTextExtractor.GetTextFromPage(reader, iPage) + " ";
and it works great, but is there a way to extract only the bold text (e.g. the headlines) from the PDF, and not everything?
Any solution is useful, regardless of the programming language. Thank you.
From within iText, you need to use the classes from the com.itextpdf.text.pdf.parser package (iTextSharp.text.pdf.parser in iTextSharp).
Specifically, you'll need to use a PdfTextExtractor with a custom TextExtractionStrategy that checks the font name. Bold fonts usually have the word "bold" in their name.
Potential Issues:
1) Not everything that looks like text is rendered with fonts and letters. It can be paths or a bitmap. The only way to extract such text is with OCR, and there's no way to get font info.
2) Font encoding. The bytes that map to the glyphs you're seeing in the PDF may not have a mapping back to actual character information.
3) Not all bold-looking text is made with a bold font. Some bold text is made by stroking the text outline with a fairly thin line as well as the usual filling. In this case, the text render mode will be set to "stroke & fill" instead of the usual "fill". This is pretty rare, but it does happen from time to time.
An easy way to test for problems 1 and 2 is to attempt to copy and paste the text within Reader/Acrobat. If you can't select it, it's almost certainly paths or an image. If you can select it but the characters come out as random junk when pasted, then iText will come up with the same junk.
Problem 3 isn't that hard to test for programmatically, though you have to handle it on a case by case basis. You need to call TextRenderInfo.getTextRenderMode(): 0 is fill (the standard way of doing things), and 2 is "stroke and fill".
So your TextExtractionStrategy can stub out beginTextBlock, endTextBlock, renderImage, and getResultantText. In your renderText implementation, you'll have to check the font name (for "bold", case insensitive) and the text render mode. If either of those is the case, it's part of one of your headings.
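Here's a rough iTextSharp (5.x) sketch of that strategy; the class name is mine, and a production version would also need to handle spacing between chunks:

using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class BoldTextExtractionStrategy : ITextExtractionStrategy
{
    private readonly StringBuilder boldText = new StringBuilder();

    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderImage(ImageRenderInfo renderInfo) { }

    public void RenderText(TextRenderInfo renderInfo)
    {
        string fontName = renderInfo.GetFont().PostscriptFontName ?? "";
        bool boldFont = fontName.ToLowerInvariant().Contains("bold");
        bool strokeAndFill = renderInfo.GetTextRenderMode() == 2;   // "stroke & fill"

        if (boldFont || strokeAndFill)
            boldText.Append(renderInfo.GetText());
    }

    public string GetResultantText()
    {
        return boldText.ToString();
    }
}

// Usage (file name is hypothetical):
// var reader = new PdfReader("input.pdf");
// string headings = PdfTextExtractor.GetTextFromPage(reader, 1, new BoldTextExtractionStrategy());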
All this is supposing that you are dealing with arbitrary PDF files. If all your PDFs come from the same source, you can start cutting corners. I'll leave that as an Exercise For The Reader.
One of your best bets for this job surely is TET by pdflib.com with its ability to extract to the TETML format. Available for Windows, Mac OS X, Linux, Solaris, AIX, HP-UX...
I'm not sure whether it actually recognizes "headlines" as such (because PDF does not know much about structural markup, only visual appearance) -- but it can certainly tell you the exact position and font used by each string of characters.
