Match.Value and international characters

UPDATE: May this post be helpful for coders using RichTextBoxes. The match is correct for a normal string. I did not see this, and I did not see that "ä" is turned into "\'e4" in richTextBox.Rtf! So Match.Value is correct; this was human error.
A Regex finds the correct text, but Match.Value is wrong because it replaces the German "ä" with "\'e4"!
Let example_text = "Primär-ABC" and let's use the following code:
string example_text = "<em>Primär-ABC</em>";
Regex em = new Regex(@"<em>[^<]*</em>");
Match emMatch = em.Match(example_text); // Works!
// Match emMatch = em.Match(richTextBox.Rtf); // Fails!
while (emMatch.Success)
{
    string matchValue = emMatch.Value;
    Foo(matchValue); // ...
    emMatch = emMatch.NextMatch(); // advance to the next match so the loop terminates
}
Then emMatch.Value returns "Prim\'e4r-ABC" instead of "Primär-ABC".
The German ä is turned into \'e4!
Because I want to work with the exact string, I need emMatch.Value to be Primär-ABC. How do I achieve that?

In what context are you doing this?
string example_text = "<em>Ich bin ein Bärliner</em>";
Regex em = new Regex(@"<em>[^<]*</em>");
Match emMatch = em.Match(example_text);
while (emMatch.Success)
{
    Console.WriteLine(emMatch.Value);
    emMatch = emMatch.NextMatch();
}
This outputs <em>Ich bin ein Bärliner</em> in my console.
The problem probably isn't that you're getting the wrong value back; it's that you're getting a representation of the value that isn't displayed correctly. This can depend on a lot of things. Try writing the value to a text file using UTF-8 encoding and see if it is still incorrect.
Edit: Right. The thing is that you are getting the text from a WinForms RichTextBox using the Rtf property. This will not return the text as is, but will return the RTF representation of the text. RTF is not plain text; it's a markup format for displaying rich text. If you open an RTF document in e.g. Notepad you will see that it has a lot of weird codes in it, including \'e4 for every 'ä' in your RTF document. If you had used some formatting (like bold text, color etc.) in the RTF box, the .Rtf property would return that code as well, looking something like {\rtlch\fcs1 \af31507 \ltrch\fcs0 \cf6\insrsid15946317\charrsid15946317 test}
So use the .Text property instead. It will return the actual plain text.
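A minimal sketch of the point above, runnable without WinForms. The class and method names (RtfVersusText, FirstEm) are made up for illustration, and the RTF string in the usage note is a hand-written approximation of what the .Rtf property might return, not actual RichTextBox output. It shows that the regex faithfully returns whatever characters are in its input, so the escape sequence comes from the input and not from Match.Value.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RtfVersusText
{
    // Returns the first <em>...</em> span found in the input, or null if there is none.
    public static string FirstEm(string input)
    {
        Match m = Regex.Match(input, @"<em>[^<]*</em>");
        return m.Success ? m.Value : null;
    }
}
```

Calling FirstEm("<em>Primär-ABC</em>") gives back the text unchanged, while FirstEm(@"{\rtf1\ansi <em>Prim\'e4r-ABC</em>\par}") gives <em>Prim\'e4r-ABC</em>, because that is literally what the RTF-style input contains.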

Related

IWebElement.Text property converts emojis

OpenQA.Selenium.IWebElement.Text
If you call the Text property and the HTML inner text is "❌", you will get ":x:".
If the text is "✅", the return value is ":white_check_mark:".
I looked up different emoji and Unicode references, but I could not find which code/charset this is.
A solution for me would be
a method that returns the original text,
or a library/package that converts the code back to Unicode.
Code
string cItemText = cItem.FindElement(By.ClassName("im_message_photo_caption")).Text;
Update
I found out that a code like :white_check_mark: is called a shortcode, but I have no clue how to convert it back to Unicode.

Did you try the code below to get the Unicode?
Add a reference to Microsoft.JScript.dll
string result = Microsoft.JScript.GlobalObject.escape(inputString);
https://social.msdn.microsoft.com/Forums/vstudio/en-US/9a09cb14-5eb3-4b74-9cf1-ac9e0ae641fc/convert-string-to-unicode?forum=csharpgeneral
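For the shortcode question above, one possible direction is a plain lookup table applied with Regex.Replace. The sketch below is only an illustration: the class name EmojiShortcodes and the two-entry table are hypothetical and contain just the two pairs mentioned in the question. It is not a Selenium feature or an existing library, and a real table would need many more entries.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmojiShortcodes
{
    // Hypothetical, hand-written table holding only the two pairs from the question.
    private static readonly Dictionary<string, string> Table = new Dictionary<string, string>
    {
        { "x", "\u274C" },                // cross mark
        { "white_check_mark", "\u2705" }, // white heavy check mark
    };

    // Replaces every known :shortcode: with its emoji; unknown shortcodes are left untouched.
    public static string Expand(string text)
    {
        return Regex.Replace(text, @":([a-z0-9_+\-]+):", m =>
        {
            string emoji;
            return Table.TryGetValue(m.Groups[1].Value, out emoji) ? emoji : m.Value;
        });
    }
}
```

With this table, EmojiShortcodes.Expand(cItemText) would turn ":x:" back into "❌" and leave any shortcode it does not know as it is.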

uGUI Text Field, How to Remove "Replacement Characters" (uFFFD aka �)?

Using the uGUI Text component, I'm getting "replacement characters" aka � and I can't find a way to remove them.
I'm getting a string from the Instagram API which contains Unicode characters for non-alphabetic languages (Japanese, for example), which I need.
However, the Unicode characters for the emojis come in as replacement characters aka �.
I don't require the emojis and they can be stripped out; however, I can't find a method to do this.
I'm unable to use TextMeshPro as I'm unable to generate a font asset with all the Unicode characters needed to display the various languages (this could be user error, but when I try, the process hangs).
I notice these � characters don't appear in the Inspector or console so there must be a way to ignore or remove them.
I'm setting the string like this
body.text = System.Uri.UnescapeDataString(postData.text);
I've tried a number of things that haven't worked including
body.text = body.text.Replace('\uFFFD','\'');//doesn't work
body.text = Regex.Replace(body.text, @"^[\ufffd]", string.Empty);//doesn't work
I've also tried breaking up the string as a char array. When I try to print to console I get this error when it hits a replacement character:
foreach (char item in postData.text.ToCharArray())
print(item); //Error: UTF-16 to UTF-8 conversion failed because the input string is invalid
Any help with this would be greatly appreciated!
Thank you.
Unity 2018.4.4, c#

Found the answer!
This post provided a solution: How do I remove emoji characters from a string?
body.text = Regex.Replace(body.text, @"\p{Cs}", "");
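Why this works: .NET strings are UTF-16, so any character outside the Basic Multilingual Plane (which includes most emoji) is stored as a pair of surrogate code units, and \p{Cs} matches exactly those code units. A small sketch under that assumption follows; the class name SurrogateStripper is made up for illustration. Note that this also removes non-emoji supplementary characters (for example rare CJK ideographs), and it does not remove emoji that live inside the BMP such as "✅".

```csharp
using System;
using System.Text.RegularExpressions;

public static class SurrogateStripper
{
    // Removes every UTF-16 surrogate code unit (Unicode category Cs) from the string.
    public static string Strip(string s)
    {
        return Regex.Replace(s, @"\p{Cs}", "");
    }
}
```

In the question's code this would be used as body.text = SurrogateStripper.Strip(System.Uri.UnescapeDataString(postData.text));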

Can not read turkish characters from text file to string array

I am trying to do some kind of sentence processing in Turkish, and I am using a text file as the database. But I cannot read Turkish characters from the text file, and because of that I cannot process the data correctly.
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt");
textBox1.Text = Tempdatabase[5];
Output:

It's probably an encoding issue. Try using one of the Turkish code page identifiers.
var Tempdatabase =
    File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("iso-8859-9"));

You can fiddle around with Encoding as much as you like. This might eventually yield the expected result, but bear in mind that it may not work with other files.
.NET strings are Unicode, and text files are read as UTF-8 by default. So unless you really need something else, you should try this instead:
Open your text file in Notepad (or any other editor) and save it as a UTF-8 file. Then you should get the expected results without any modifications to your code. This is because File.ReadAllLines, when called without an encoding, looks for a byte order mark and otherwise assumes UTF-8. This is the default behavior, which should be preferred.
When you save your text file as UTF-8, C# will interpret it as such.
This also applies to .html files inside Visual Studio, if you notice that they are displayed incorrectly (parsed with ASCII).

The file contains the text in a specific Turkish character set, not Unicode. If you don't specify any other behaviour, .NET will assume Unicode (UTF-8) text when reading from a text file. You have two possible solutions:
Either change the text file to use Unicode (for example UTF-8) using an external text editor.
Or specify a character set to read with, for example:
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.Default);
This will use the local character set of the Windows system (on .NET Framework; on .NET Core and later, Encoding.Default is UTF-8).
string[] Tempdatabase = File.ReadAllLines(@"C:\Users\dialogs.txt", Encoding.GetEncoding("Windows-1254"));
This will use the Turkish character set defined by Microsoft.
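The effect described in these answers can be reproduced with a small round trip: write bytes in one encoding, then read them back with the same or a different one. The sketch below uses a made-up helper (EncodingRoundTrip.WriteThenRead) and takes the encodings as parameters, so it makes no assumption about which code pages are installed. On .NET Core and .NET 5+, legacy code pages such as Windows-1254 or iso-8859-9 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance); on .NET Framework they are available out of the box.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingRoundTrip
{
    // Writes 'text' to a temporary file as raw bytes in the 'written' encoding (no byte order mark),
    // then reads the file back as text using the 'read' encoding.
    public static string WriteThenRead(string text, Encoding written, Encoding read)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, written.GetBytes(text));
            return File.ReadAllText(path, read);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

When both encodings are the same, the text comes back intact. When a file written in a single-byte code page is read as UTF-8, letters such as "ç", "ö" and "ü" come back damaged, which is the symptom in the question.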

RichTextBox inserting text

For Windows Forms.
I am trying to insert text into the .rtf field of a RichTextBox.
I have tried two methods. When I use .Rtf.insert, nothing happens at all.
When I edit the .rtf string based on the selected text positions myself, I either end up adding gibberish to the thing or getting an error that says that the file format is invalid. My best guess is that this is because the .rtf string is in .rtf format and the selection index that I am using is based on the plain text string and so I am inserting the text in the wrong location in the .rtf string and messing up the RTF code.
But knowing what the problem is (if I am correct) hasn't helped me solve it.
Is there a way to get .rtf.insert to work correctly, or is there a way to translate the selected text indexes to the actual .rtf text positions so that something like the code below would work? I am assuming that the RichTextBox itself must know how to translate the one index into another because it can insert characters when the user types just fine.
Here is my code snippet. The point of the code is to insert a marker into the text that will later be parsed and replaced with a student's first name. There will be other such codes. "codeLeader" and "codeEnder" are just the strings I use to surround the codes. In this case I am using "[*" and "*]" to indicate that there is a code I will need to parse, but I put them into separate strings so that I can easily change them if I wish. I have actually already written the parsing code, which works just fine on rich text. It is just inserting the text into the richTextBox itself that is the problem. In other words, if I were to type the codes by hand it would work just fine. But this would be troublesome for the user because some of the codes will use index numbers.
private void studentFirstNameCode_Click(object sender, EventArgs e)
{
    string ins = f1ref.codeLeader;
    ins += "SNFirst" + f1ref.codeEnder;
    int start = editorField.richTextBox1.SelectionStart;
    if (start == -1) { start = 0; }
    int end = start + editorField.richTextBox1.SelectionLength;
    if (end == -1) { end = 0; }
    string pre = editorField.richTextBox1.Rtf.Substring(0, start);
    string post = editorField.richTextBox1.Rtf.Substring(end);
    string newstring = pre + ins + post;
    editorField.richTextBox1.Rtf = newstring;
    // this also doesn't work. gives no result at all.
    // editorField.richTextBox1.Rtf.Insert(start, newstring);
}

I don't think that you need to use the Rtf property to simply insert text into the RichTextBox's actual text, in particular because you don't seem to be adding RTF-formatted text.
If you don't want to use RTF, then the simplest way to accomplish your goal is just one line of code:
editorField.richTextBox1.SelectedText = yourParameterText;
This works as if you had pasted the text from the clipboard at the selected position (replacing the selected text, if any), and the underlying work of correctly formatting your text inside the RTF is done by the control itself.
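As a rough illustration on a plain string (not the real RichTextBox, which also keeps the RTF markup consistent), assigning SelectedText behaves like replacing the selected range with the new text. The helper name ReplaceSelection below is made up for this sketch. It also shows why the commented-out Rtf.Insert call in the question had no visible effect: string.Insert never changes the original string; it returns a new one, and that return value was discarded.

```csharp
using System;

public static class SelectionSketch
{
    // Plain-string stand-in for "SelectedText = insert":
    // removes 'length' characters at 'start', then inserts the new text there.
    public static string ReplaceSelection(string text, int start, int length, string insert)
    {
        // Remove and Insert both return new strings; 'text' itself is never modified.
        return text.Remove(start, length).Insert(start, insert);
    }
}
```

For example, ReplaceSelection("Dear , hello", 5, 0, "[*SNFirst*]") produces "Dear [*SNFirst*], hello"; with a non-zero length the selected characters are replaced instead.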

I have found a work-around by using .SendKeys. This makes the text appear a bit slowly (as if typed very quickly) so isn't optimal, but it does work.
It is enough for a workable solution, but I am still troubled by the problem. It seems like this issue should have a more elegant solution than this.

Find and replace question regarding RegEx.Replace

I have a text file and I want to be able to change all instances of:
T1M6 to N1T1M6
The T will always have a different value depending on the text file loaded. For example, it could sometimes be
T2M6, and that would need to be turned into N2T2M6. The N(value) must match the T(value). The M6 will always be M6.
Another example:
T9M6 would translate to N9T9M6
Here is my code to do the loading of the text file:
StreamReader reader = new StreamReader(fDialog.FileName.ToString());
string content = reader.ReadToEnd();
reader.Close();
Here is the Regex.Replace statement that I came up with. I am not sure if it is right.
content = Regex.Replace(content, @"(T([-\d.]))M6", "N1$1M6");
It seems to work at searching for T5M6 and turning it into N1T5M6.
But I am unsure how to turn the N(value) into the value that T has, for example N5T5M6.
Can someone please show me how to modify my code to handle this?
Thanks.

Like this:
string content = File.ReadAllText(fDialog.FileName.ToString());
content = Regex.Replace(content, @"T([-\d.])M6", "N$1T$1M6");
Also, you should probably replace [-\d.] with \d or -?\d\.?
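A self-contained sketch of the replacement above, assuming the T value is always a non-negative integer (so the pattern uses \d+ rather than [-\d.]). The class and method names are made up for illustration. The capture group holds only the digits, and $1 is then used twice in the replacement string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ToolCodeRewriter
{
    // Turns every T<number>M6 into N<number>T<number>M6.
    public static string AddLineNumbers(string content)
    {
        return Regex.Replace(content, @"T(\d+)M6", "N$1T$1M6");
    }
}
```

For example, AddLineNumbers("T9M6 and T2M6") gives "N9T9M6 and N2T2M6". Running it a second time on already converted text would add another N prefix, so it should be applied once per file.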
