Why are some unintended symbols added to my string? - c#

I wrote a console application which fetches strings from some fields in a SharePoint list. Then I simply write the strings to the console. This works fine for most fields. There is one MultiLineTextField with RichText enabled, where I had to remove all the HTML tags, that causes this issue.
Even after all the tags are removed the strings seem to contain question marks which were never added to the string. The most weird thing about this is when I set a breakpoint and look into the string's value there are no question marks, but they suddenly appear on the console output.
The only thing I could think of was to Trim the string, because sometimes they appear in front of the actual string and sometimes at the end of it, but never in between.
So this is what I tried:
myString = myString.Trim();
myString = myString.Replace("?",string.Empty);
But this does not solve the issue. Besides, this would not be a smart solution if one of the strings were supposed to contain question marks. For detailed code please see the link above.
Also Convert.ToBase64String(Encoding.UTF8.GetBytes(myString)) gives me the following output:
4oCLTWVobCwgRWllciwgV2Fzc2VyLCBIYWNrZmxlaXNjaCA=

There are probably some non-printing Unicode (or possibly low ASCII) characters at the end of the string. The console has a different encoding, and will often render such characters as ?. Basically: use the indexer (yourString[n]) or yourString.ToCharArray() to investigate what is actually in the string around the location of the ?.
With the edit, we can see that the string has a zero-width space (decimal 8203) at the start:
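A minimal sketch of that inspection, using the Base64 value posted in the question. The class and method names (InvisibleCharDemo, Describe) are made up for illustration. It decodes the bytes back to a string and prints the code point and Unicode category of the first few characters, so the zero-width space (U+200B, decimal 8203) becomes visible:

```csharp
using System;
using System.Text;

public static class InvisibleCharDemo
{
    // Returns one line per UTF-16 char: index, code point in hex, Unicode category.
    public static string Describe(string s)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < s.Length; i++)
        {
            sb.AppendFormat("[{0}] U+{1:X4} {2}", i, (int)s[i], char.GetUnicodeCategory(s[i]));
            sb.AppendLine();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Base64 value taken from the question above.
        string base64 = "4oCLTWVobCwgRWllciwgV2Fzc2VyLCBIYWNrZmxlaXNjaCA=";
        string decoded = Encoding.UTF8.GetString(Convert.FromBase64String(base64));

        // Only the first three chars are printed; the first one is the invisible one.
        Console.Write(Describe(decoded.Substring(0, 3)));
    }
}
```

This would also explain why Trim() did not help: Trim() only removes characters that .NET treats as white space, and as far as I know U+200B is not among them in current .NET versions.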

Sounds like you're maybe having a problem with unicode characters. Chances are you're outputting the string as ASCII instead of Unicode. Take a look at this question as it sounds like you may be experiencing the same problem.

Related

Converting "bad" characters to their equivalent without a direct string.Replace and a list

I have done my research and everything I've found either does nothing or is too Leeroy Jenkins and replaces everything else that it shouldn't. It's possible that I'm phrasing everything wrong in my search and so coming up with nothing.
I have to replace all the wrong characters that rich text programs (and older programs) autocorrect for the user because the user then copy/pastes directly into a web form.
For example, the "funky" apostrophe (’) converted to the regular apostrophe (') and the quotation marks and everything else.
I've tried UTF en/decoding, diacritic removal (not at all what I need), and a direct brute force string.Replace isn't reasonable, really.
Here's some example text that has all the bad stuff:
"They’re taking the hobbits to Isengaurd with bad apostrophe’s instead of good one's. It’s just how they roll."
Note that the only good apostrophe is in one's, and I already have one rendered result of this (It’s), so I need to convert it back (along with all the other baddies) without a string.Replace and a list of characters to watch for.
What ought I be doing here?
To clarify: I need to convert the bad characters to good equivalents before data is submitted AND I need to catch existing stuff that was rendered after it was saved. So I need to do two things here.

Double.Parse fails for "10.00" retrieved value

The screenshot sums up the problem:
I have no control over the retrieved value. It comes in with some funky format that I can't figure out and the parsing fails even though it looks totally normal. Typing the value in manually works just fine.
How can I "normalize" the retrieved value so Decimal.Parse does not fail?
For reference, here is the string that fails (copied and pasted):
"‎10.00"
First I would check your regional settings to eliminate anything as simple as a difference in expected decimal separator.
If that draws a blank then if the string 10.00 parses successfully then a string that looks like 10.00 but which fails to parse cannot actually be 10.00.
Inspect and determine the character code of each character of the string and confirm that it really is 10.00 and not some exotic Unicode that has the same appearance but which is actually different (which may also include characters which are not even visible when displayed).
You might have some kind of special character hidden in the string you are retrieving.
Try this:
Double.Parse(Regex.Replace(decimalValue, @"[^0-9.,]+", ""))
You might need to add a using directive for System.Text.RegularExpressions.
I would replace only the one problematic character, that is the safest option:
s = s.Replace("\u200E", "");
As Jeroen Mostert mentioned in a comment, there is a non-printed character in your decimalValue.
This is a similar question which should help you deal with that.
https://stackoverflow.com/a/15259355/7636764
Edit:
Use the string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray()); part of that solution, but also add || char.IsPunctuation(c) after IsDigit so the decimal point is kept; that will get your desired result.
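As an alternative to whitelisting letters, digits and punctuation, here is a hedged sketch that removes only characters in the Unicode "Format" category (which covers U+200E and U+200B) and then parses with an explicit culture. NumberCleaner and StripFormatChars are hypothetical names, not part of any library:

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class NumberCleaner
{
    // Drops every character in the Unicode "Format" category (Cf), which
    // includes U+200E (left-to-right mark) and U+200B (zero-width space).
    public static string StripFormatChars(string input)
    {
        return new string(input
            .Where(c => char.GetUnicodeCategory(c) != UnicodeCategory.Format)
            .ToArray());
    }

    public static void Main()
    {
        string raw = "\u200E10.00"; // looks like "10.00" when displayed

        decimal value;
        bool before = decimal.TryParse(raw, NumberStyles.Number,
            CultureInfo.InvariantCulture, out value);
        bool after = decimal.TryParse(StripFormatChars(raw), NumberStyles.Number,
            CultureInfo.InvariantCulture, out value);

        Console.WriteLine("before: {0}, after: {1}, value: {2}", before, after, value);
    }
}
```

Passing CultureInfo.InvariantCulture also takes the regional-settings question out of the picture, since the decimal separator is then always a period.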

Decimal.TryParse Fails to parse integer value

What am I missing:
decVal = Decimal.Parse(myAr[0]);
Or
Decimal.TryParse(myAr[0], out decVal);
Fails !
Input string was not in a correct format.
myAr[0] is "678016".
Tried to add NumberStyles.Any and CultureInfo.InvariantCulture but got the same results.
More info on the string:
It is concatenated with some letters in Hebrew and a "\u200e" space between them, and then I use Split(' ') to get the numbers back.
This is probably the source of this error, but when I check the myAr[0] in the watch it is pure string....
Guys I've found the answer, I'll rewrite the question for future generation.
The Original string was a concatenation of letters and numbers separated with a special sequence to preserve the order in a rtl situation: "\u200E".
The numbers were extracted later using string.Split(' '), which seems to work OK (in the watch) but caused the problem.
Once I used string.Split("\u200e".ToCharArray()) I got the same results in the watch, but now decimal.Parse is working.
It looks like the special char was still inside the string, invisible to the watch.
This is weird, on my machine (.NET 4) even this works:
Decimal.TryParse("asdf123&*", out someDecimal);
By works I mean that TryParse returns false, no exception is thrown.
Parse method may throw an exception - maybe you have some whitespace or string literally contains " (quotes)?
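To make the self-answer above concrete, here is a small sketch with a hypothetical input shaped like the one described (some text, a space, a left-to-right mark, then the digits). SplitDemo and SplitClean are illustrative names. Checking Length is a quick way to spot a character that the watch window does not show:

```csharp
using System;
using System.Globalization;

public static class SplitDemo
{
    // Splits on both the space and the left-to-right mark, so neither
    // ends up inside the returned pieces.
    public static string[] SplitClean(string s)
    {
        return s.Split(new[] { ' ', '\u200E' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Hypothetical input: text, a space, U+200E, then the number.
        string input = "abc \u200E678016";

        string naive = input.Split(' ')[1];
        Console.WriteLine(naive.Length);   // 7: the mark is still in front of the digits

        string clean = SplitClean(input)[1];
        Console.WriteLine(clean.Length);   // 6

        decimal value;
        Console.WriteLine(decimal.TryParse(clean, NumberStyles.Any,
            CultureInfo.InvariantCulture, out value));
    }
}
```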

formatting strings with backslash

I'm a newbie to c# so hopefully this one isn't too hard for a few of you.
I'm trying to build a string that has a \ in it and I am having difficulty getting just one backslash to show up even though I am adding additional escape chars or ignoring them all together. Can someone show me what I am doing wrong?
What I want my string to look like:
"10.20.14.103\sql08"
What I've tried so far:
I added an additional character to make the compiler happy but it did not escape it.
ip = string.Format("{0}\\{1}", ip, instancename); // output has 2 \'s
I told it to ignore escapes, it decided to ignore me instead
string temp = @"192.168.1.200\sql08"; // output has 2 \'s
Can someone help me make sense of this? (The richtext editor here seems to do a better job with it than VS2010 is doing, lol)
I'm guessing you're getting confused by the debugger.
If you hover your mouse over a local variable in VS, strings will be escaped so a single \ will display as \\.
To see what your string really is, output it somewhere for display (e.g., to the console) or hover your mouse on the variable, click on the arrow next to the little magnifying glass that appears, and select "Text Visualizer."
If you're looking at these strings in the debugger (i.e., by hovering the mouse over the variable or using a watch), the debugger adds escape characters to the display string so that it's a valid string expression. If you want to view the string verbatim in this fashion, click on the magnifying glass on the right side of the tooltip or watch entry with the string in it.
I'm guessing you're looking at the values in the debugger and seeing that they have two slashes.
That's normal. The debugger will show two slashes even though the actual string representation will only have one. Just another hump to get over when getting used to the debugger.
Be assured that when you actually use your strings, they will still only have a single slash (using either of your methods).
string requiredString = string.Format(@"{0}\{1}",str1,str2);
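One way to convince yourself that the debugger is only escaping the display is to compare the two literal styles in code. This is a minimal sketch; BackslashDemo, Escaped and Verbatim are illustrative names:

```csharp
using System;

public static class BackslashDemo
{
    public static string Escaped(string host, string instance)
    {
        // In a regular literal, "\\" is an escape sequence for ONE backslash.
        return string.Format("{0}\\{1}", host, instance);
    }

    public static string Verbatim(string host, string instance)
    {
        // In a verbatim (@) literal, a backslash is just a backslash.
        return string.Format(@"{0}\{1}", host, instance);
    }

    public static void Main()
    {
        string a = Escaped("10.20.14.103", "sql08");
        string b = Verbatim("10.20.14.103", "sql08");

        Console.WriteLine(a);          // 10.20.14.103\sql08
        Console.WriteLine(a == b);     // True
        Console.WriteLine(a.Length);   // 18
    }
}
```

The Length check is the useful part: 12 characters for the address, 1 backslash and 5 for the instance name, regardless of what the tooltip shows.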

C#: How do you go upon constructing a multi-lined string during design time?

How would I accomplish displaying a line as the one below in a console window by writing it into a variable during design time then just calling Console.WriteLine(sDescription) to display it?
Options:
-t Description of -t argument.
-b Description of -b argument.
If I understand your question right, what you need is the @ sign in front of your string. This will make the compiler take in your string literally (including newlines etc.).
In your case I would write the following:
String sDescription =
@"Options:
-t Description of -t argument.";
So much for your question (I hope), but I would suggest just using several WriteLines.
The performance loss is next to nothing and it just is more adaptable.
You could work with a format string so you would go for this:
string formatString = "{0,-10} {1}"; // {0,-10}: left-align item 0 in a 10-character field
Console.WriteLine("Options:");
Console.WriteLine(formatString, "-t", "Description of -t argument.");
Console.WriteLine(formatString, "-b", "Description of -b argument.");
The format string makes sure your lines are formatted nicely without putting spaces in manually, and if you ever want to change the format you only need to do it in one place.
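A short sketch of the format-string idea. Field width is set by the alignment component, which follows a comma inside the format item; a colon introduces a format string instead, which does not pad plain strings. OptionFormatter, Row and FormatRow are illustrative names:

```csharp
using System;

public static class OptionFormatter
{
    // {0,-10}: item 0, left-aligned in a field 10 characters wide.
    public const string Row = "{0,-10} {1}";

    public static string FormatRow(string flag, string description)
    {
        return string.Format(Row, flag, description);
    }

    public static void Main()
    {
        Console.WriteLine("Options:");
        Console.WriteLine(FormatRow("-t", "Description of -t argument."));
        Console.WriteLine(FormatRow("-b", "Description of -b argument."));
    }
}
```

Changing the column width then means editing only the Row constant.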
Console.Write("Options:\n\tSomething\t\tElse");
produces
Options:
Something Else
\n for next line, \t for tab, for more professional layouts try the field-width setting with format specifiers.
http://msdn.microsoft.com/en-us/library/txafckwd.aspx
If this is a /? screen, I tend to throw the text into a .txt file that I embed via a resx file. Then I just edit the txt file. This then gets exposed as a string property on the generated resx class.
If needed, I embed standard string.Format symbols into my txt for replacement.
Personally I'd normally just write three Console.WriteLine calls. I know that gives extra fluff, but it lines the text up appropriately and it guarantees that it'll use the right line terminator for whatever platform I'm running on. An alternative would be to use a verbatim string literal, but that will "fix" the line terminator at compile-time.
I know C# is mostly used on Windows machines, but please, please, please try to write your code to be platform neutral. Not all platforms use the same end-of-line sequence. To properly retrieve the end-of-line sequence for the currently executing platform you should use:
System.Environment.NewLine
Maybe I'm just anal because I am a former java programmer who ran apps on many platforms, but you never know what the platform of the future is.
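For instance, the help text from the question could be assembled with Environment.NewLine rather than a hard-coded "\n". This is only a sketch; HelpText and Build are made-up names:

```csharp
using System;

public static class HelpText
{
    public static string Build()
    {
        // string.Join puts the platform's line terminator between the lines.
        return string.Join(Environment.NewLine, new[]
        {
            "Options:",
            "-t Description of -t argument.",
            "-b Description of -b argument."
        });
    }

    public static void Main()
    {
        Console.WriteLine(Build());
    }
}
```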
The "best" answer depends on where the information you're displaying comes from.
If you want to hard code it, using an "@" string is very effective, though you'll find that getting it to display right plays merry hell with your code formatting.
For a more substantial piece of text (more than a couple of lines), embedding a text resource is good.
But, if you need to construct the string on the fly, say by looping over the commandline parameters supported by your application, then you should investigate both StringBuilder and Format Strings.
StringBuilder has methods like AppendFormat() that accept format strings, making it easy to build up lines of formatted text.
Format Strings make it easy to combine multiple items together. Note that Format strings may be used to format things to a specific width.
To quote the MSDN page linked above:
Format Item Syntax
Each format item takes the following form and consists of the following components:
{index[,alignment][:formatString]}
The matching braces ("{" and "}") are required.
Index Component
The mandatory index component, also called a parameter specifier, is a number starting from 0 that identifies a corresponding item in the list of objects ...
Alignment Component
The optional alignment component is a signed integer indicating the preferred formatted field width. If the value of alignment is less than the length of the formatted string, alignment is ignored and the length of the formatted string is used as the field width. The formatted data in the field is right-aligned if alignment is positive and left-aligned if alignment is negative. If padding is necessary, white space is used. The comma is required if alignment is specified.
Format String Component
The optional formatString component is a format string that is appropriate for the type of object being formatted ...
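Putting the StringBuilder and format-string suggestions together, a usage screen built on the fly might look roughly like this. UsageBuilder, Build and the option list are hypothetical; the alignment value -6 is an arbitrary choice for the sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class UsageBuilder
{
    public static string Build(IEnumerable<KeyValuePair<string, string>> options)
    {
        var sb = new StringBuilder();
        sb.AppendLine("Options:");
        foreach (var option in options)
        {
            // {0,-6}: the flag, left-aligned in a 6-character field.
            sb.AppendFormat("  {0,-6}{1}", option.Key, option.Value);
            sb.AppendLine(); // AppendLine uses Environment.NewLine
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var options = new[]
        {
            new KeyValuePair<string, string>("-t", "Description of -t argument."),
            new KeyValuePair<string, string>("-b", "Description of -b argument.")
        };
        Console.Write(Build(options));
    }
}
```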
