Character sequence messes up the order of characters? - c#

I had to handle an exception, by catching it and matching the message and if the message contains a certain error code, do something (not relevant).
The exceptions message is this (in English, but the code and the gibberish after it is the same in any language):
$-5002 - $make sure that the consumed quantity of the component item would not cause the item's stock to fall below zero [ige1.whscode][line: 1] , 'production order no: 20580033 line: 1' [الرسالة 3559-7]
I had to work with the code 7-3559 (as displayed). In my code, I just did a e.Message.Contains("7-3559") and it failed to catch the exception. Wondering what went wrong I copy pasted the error massage to regex101.com and after a bit of trial and error I realized that e.Message.Contains("3559-7") is the real code and it works. I just don't know why. What messes up the string to display it in such a way that 7- is actually -7 and also behind 3559?
I guess I should also mention I am working with Visual Studio 2019 and C#.
Check out the regex here.
HxD:

This is a common issue encountered when using bidirectional text, in other words, a text that contains both texts directionality: Right-to-Left (RTL) such as Arabic texts, and Left-to-Right (LTR).
Here we have the Arabic text mixed with English text so some rules will be applied to the text to determine the directionality. You may find details about this here.
In short, the text you see in the debugger is how the text will appear when you print it but not how it is represented in memory.
Here I use Linqpad to paste the text and the editor has immediately transformed it into the representation in memory. And once printed, the text is shown with a different directionality.

Related

How to determine which of several different errors might have caused XmlException?

The system I'm working on uses DataSet.ReadXml(XmlReader) to read an XML file and load its contents to a DataSet. The XML file is from a business partner and may not always be well-formed, but this system is expected to perform reasonable corrections to the input.
We've seen errors in the XML input files, such as:
Case 1: In the middle of a string value, use of characters such as
'<', '>', or my favorite, '&', which causes "An error occurred while parsing
EntityName. Line x, position y."
Case 2: In the middle of a string value, weird constructs such as
"<3" so that the text depicts a heart, which causes "Name cannot
begin with the '3' character. Line x, position y."
Case 3: Invalid characters for the given encoding, which causes
"Invalid character in the given encoding. Line x, position y."
If some simple rules are adopted, these errors can be addressed programmatically:
Case 1: Replace the offending character with its XML character entity
("&" becomes "&", etc.
Case 2: Replace the "<" in "<3" with a space, so that it becomes " 3"
Case 3: Replace the invalid character with a space
However, all of these errors raise the same exception: System.Xml.XmlException
I would like to take an appropriate action when any of these errors are encountered, but what's the best way to do that? These three different errors all have the same HRESULT value (-2146232000), and so far the only way I have been able to differentiate amongst them is by inspection of the XmlException.Message string property.
String comparison seems a lousy way to determine the exact cause of the error. Were I to follow that approach, the code would break should the exception message change in future versions of .NET. It would also not be portable to some languages.
Therefore, how does one programmatically differentiate between the various types of errors that could be represented in an XmlException?
EDIT
In the comments below I've received advice about the importance of ensuring that XML data is of high quality. I don't disagree, but as my question states, it's outside my control and I can do nothing about it. So, as well-intentioned as your remarks are, they miss the point. If you know a good way to differentiate amongst the very many errors that can be presented by the System.Xml.XmlException class, please, share your knowledge. Thank you.
Rather than trying to parse non-XML with an XML parser and catching the errors, if you really want to process non-XML then I would try preprocessing it with a parser for the particular non-XML grammar that you want to accept. Before you ever submit the data to an XML parser, run it through a Perl script or similar that recognizes the patterns that you want to convert to XML, then run the result through an XML parser.

Why do I get an CS1056 Unexpected character '' on this code

I'm getting this unexpected character '' error and I don't understand why.
var list = new List<MyModel>();
list.Add(new MyModel() {
variable1 = 942,
variable2 = 2001,
variable3 = "my text",
variable4 = 123
​}); // CS1056 Unexpected character '' on this line
From what the error says and the actual error code I got from an Online compiler after copy/pasting, Your code on this line contains a character that is not visible but that the compiler is trying to interpret. Simply try erase every character starting at your closing bracket towards your number 3 and press Enter again It should be working (it did work for me)
I just deleted the file Version=v4.0.AssemblyAttributes.cs(1,1,1,1) located in my temp folder C:\Users\MyUser\AppData\Local\Temp and then it works perfectly.
For .NET Core you have to delete .NETCoreApp,Version=v2.1.AssemblyAttributes.cs
As mentioned by Daneau in the accepted answer, the problem is by a character that is not visible in the IDE.
Here are several solutions to find the invisible character with Notepad++.
Solutions 1: Show Symbol
Copy the code to Notepad++,
Select View -> Show Symbol -> Show All Characters
This can show invisible control characters.
Solutions 2: Convert to ANSI
Copy the code to Notepad++,
Select Encoding- > Convert to ANSI
This will convert the invisible character to ? if it is a none ANSI character.
Solutions 3: Remove none ASCII characters
Copy the code to Notepad++,
Open the Find window (Ctrl+F)
Select the Replace tab
in "Find what" write: [^\x00-\x7F]
Leave "Replace with" empty
In "Search Mode" select "Regular expression"
Find and remove the none ASCII characters
This will remove none ASCII characters.
Note: This can remove valid non ASCII characters (in strings and comments) so try to skip those if you have any.
Tip: Use HEX-Editor plugin
Use Notepad++ HEX-Editor plugin to see the binary code of text. Any character out of the range of 0x00 - 0x7F (0 - 127) is a non ASCII character and a suspect of being the problem.
Just reporting my direct experience.
As Daneau wrote, I had a character (ASCII DLE, I copied while messing up a zebra printer) hiding in the text. I could not afford to rewrite everything, so I used notepad++ "View->Show Symbol->Show All Characters" feature.
I apologize for not commenting Daneau entry, but I don't have enough reputation.
Write the code again without copying it. That worked for me
go to C:\Users\UserName\AppData\Local\Temp\ and clear the data or remove the file specified in the error, that will solve the issue.
VS will add the required file on auto, no worries.
I got this error when I moved my application from one folder to another, I resolved this by deleting the Debug folder inside the obj folder.
It indeed has to do with copy pasting code and characters that you cannot see. The easiest way to fix it is by passing your copy pasted code into a note application or simple text program which will automatically remove these invisible characters. After that simply copy the code from the text editor and paste it into your IDE.
For some reason this happened to me on every project in my solution. My fix was to delete all bin and obj folders in my solution.

Correct Hebrew character sequence in C# and searchable PDFs

I'm testing an SDK that extracts text from a searchable PDF. One of the SDK's dependencies was recently updated, and it's causing an existing test on Hebrew text to fail. I don't know Hebrew nor enough about how the involved technologies represent right-to-left languages.
The NUnit test asserts that the extracted text matches the C# string "מנבוצץז ".
string hebrewText = reader.ReadToEnd();
Assert.AreEqual("מנבוצץז ", hebrewText);
The rasterized PDF has what I believe are the same characters, but in the opposite order.
The unit test fails with this message:
Expected: "מנבוצץז "
But was: " זץצובנמ"
Although the actual result more closely matches what I see in the rasterized PDF, I'm not completely sure the original test is wrong.
Are Hebrew characters in a C# string supposed to be read right-to-left like printed Hebrew text?
Does any part of the .NET stack tamper with the direction of Hebrew strings?
What about NUnit?
Are Hebrew characters embedded in a searchable PDF normally supposed to go in the same direction as the rasterized text?
Anything else I should know before deciding whether to "fix" this unit test?
There are various ways to encode RTL languages. The most common way (and Window's default) is to use logical ordering, which means the first letter is encoded as the first character in a string (or file). So whether visually the first letter appears on the left or right side of the screen doesn't affect the order in which they are stored.
Now as for the text appearing in Visual Studio, it depends on the version. As far as I remember, prior to Visual Studio 2010 the code editor displayed Hebrew backwards, and it was apparent as when you tried to select Hebrew text, it reversed in an odd way (which was visually confusing). It appears this issue no longer exists is Visual Studio 2010 (at least with SP1 which I just tested).
Let's take a Hebrew word for which the direction is more clear to non-Hebrew speakers than the string specified in your text:
יון
The word happens to be the Hebrew word for an ion, and on your screen, it should appear as three letters where the tallest letter is on the left and the shortest is on the right. In a .NET string, the expression "יון".Substring(0, 1) will produce the short letter, since it's the first letter in the string. The string can also be written as "\u05D9\u05D5\u05DF" where the leftmost Unicode character \u05D9 represents the short letter displayed on the right, which clearly demonstrates the order in which the letters are stored.
Since the string in your test case is nonsensical, I can't tell you whether it was a wrong test all along or if it a correct test that should pass. If the image you uploaded has been rendered correctly then it appears the actual result of your test is correct and the expected value is incorrect, and so you should fix the test.
I believe that all strings in C# will be stored internally as LTR; RTL strings will have a non-printable character (or something) denoting that they are indeed RTL.
More than likely. RTL GUIs and rendered text for example need certain properties (specifically RightToLeft and RightToLeftLayout) to be set in order to display correctly.
NUnit shouldn't. Nor should it care. IMHO a reversed string != the original string.
I couldn't comment. I'd assume that they should be whatever the test is expecting though, assuming it was passing at first.
Don't do half measures with RTL, it really doesn't like it. Either have full RTL support, or nothing. It can be pretty nasty, I wish you the best of luck!

C# Unknown Text Found

I'm creating a program to transfer text from a word document to a database. During some testing I came across some text inside a textbox after setting it's text to a table cell range as follows:
textBox1.Text = oDoc.Tables[1].Cell(1, 3).Range.Text;
What appeared in the form was:
What wasn't expected was the dot at the end of the text and I have no idea what it is supposed to represent. The dot can be highlighted but if you try and copy and paste it nothing appears. You can delete the dot manually. Can anyone help me identify what this is?
The identification bit shouldn't be too hard:
string text = oDoc.Tables[1].Cell(1, 3).Range.Text;
textBox1.Text = ((int) text[4]).ToString("x4");
That will give you the Unicode UTF-16 code unit for that character... you can then find out what it is on the Unicode web site. (I usually look at the Charts page or the directory of PDFs and guess which chart it will be in based on the numbering - it's not ideal, and there are probably better ways, but it's always worked well enough for me...)
Of course when you've identified it you'll still need to work out what the heck it's doing there... does the original Word document just have "HOLD"?

Is it possible to programmatically 'clean' emails?

Does anyone have any suggestions as to how I can clean the body of incoming emails? I want to strip out disclaimers, images and maybe any previous email text that may be also be present so that I am left with just the body text content. My guess is it isn't going to be possible in any reliable way, but has anyone tried it? Are there any libraries geared towards this sort of thing?
In email, there is couple of agreed markings that mean something you wish to strip. You can look for these lines using regular expressions. I doubt you can't really well "sanitize" your emails, but some things you can look for:
Line starting with "> " (greater than then whitespace) marks a quote
Line with "-- " (two hyphens then whitespace then linefeed) marks the beginning of a signature, see Signature block on Wikipedia
Multipart messages, boundaries start with --, beyond that you need to do some searching to separate the message body parts from unwanted parts (like base64 images)
As for an actual C# implementation, I leave that for you or other SOers.
A few obvious things to look at:
if the mail is anything but pure plain text, the message will be multi-part mime. Any part whose type is "image/*" (image/jpeg, etc), can probably be dropped. In all likelyhood any part whose type is not "text/*" can go.
A HTML message will probably have a part of type "multipart/alternative" (I think), and will have 2 parts, one "text/plain" and one "text/html". The two parts should be just about equivalent, so you can drop the HTML part. If the only part present is the HTML bit, you may have to do a HTML to plain text conversion.
The usual format for quoted text is to precede the text by a ">" character. You should be able to drop these lines, unless the line starts ">From", in which case the ">" has been inserted to prevent the mail reader from thinking that the "From " is the start of a new mail.
The signature should start with "-- \r\n", though there is a very good chance that the trailing space will be missing.
Version 3 of OSBF-Lua has a mail-parsing library that will handle the MIME and split a message into its MIME parts and so on. I currently have a mess of Lua scripts that do
stuff like ignore most non-text attachments, prefer plain text to HTML, and so on. (I also wrap long lines to 80 characters while trying to preserve quoting.)
As far as removing previously quoted mail, the suggestions above are all good (you must subscribe to some ill-mannered mailing lists).
Removing disclaimers reliably is probably going to be hard. My first cut would be simply to maintain a library of disclaimers that would be stripped off the end of each mail message; I would write a script to make it easy for me to add to the library. For something more sophisticated I would try some kind of machine learning.
I've been working on spam filtering since Feb 2007 and I've learned that anything to do with email is a mess. A good rule of thumb is that whatever you want to do is a lot harder than you think it is :-(
Given your question "Is it possible to programmatically ‘clean’ emails?", I'd answer "No, not reliably".
The danger you face isn't really a technological one, but a sociological one.
It's easy enough to spot, and filter out, some aspects of the messages - like images. Filtering out signatures and disclaimers is, likewise, possible to achieve (though more of a challenge).
The real problem is the cost of getting it wrong.
What happens if your filter happens to remove a critical piece of the message? Can you trace it back to find the missing piece, or is your filtering desctructive? Worse, would you even notice that the piece was missing?
There's a classic comedy sketch I saw years ago that illustrates the point. Two guys working together on a car. One is underneath doing the work, the other sitting nearby reading instructions from a service manual - it's clear that neither guy knows what he's doing, but they're doing their best.
Manual guy, reading aloud: "Undo the bold in the centre of the oil pan ..." [turns page]
Tool guy: "Ok, it's out."
Manual guy: "... under no circumstances."
If you creating your own application i'd look into Regex, to find text and replace it. To make the application a little nice, i'd create a class Called Email and in that class i have a property called RAW and a property called Stripped.
Just some hints, you'll gather the rest when you look into regex!
SigParser has an assembly you can use in .NET. It gives you the body back in both HTML and text forms with the rest of the stuff stripped out. If you give it an HTML email it will convert the email to text if you need that.
var parser = new SigParser.EmailParsing.EmailParser();
var result = await parser.GetCleanedBodyAsync(new SigParser.EmailParsing.Models.CleanedBodyInput {
FromEmailAddress = "john.smith#example.com",
FromName = "John Smith",
TextBody = #"Hi Mark,
This is my message.
Thanks
John Smith
888-333-4434"
});
// This would print "Hi Mark,\r\nThis is my message."
Console.WriteLine(result.CleanedBodyPlain);

Categories