I've searched for a long time, and it seems that my problem is widely known. But none of the answers I've found work for me. Most of the time, people say 'there is no problem'.
The problem: I'm programming a POS solution, and I'm using an Epson POS printer. To print the bottom of the receipt, I'm storing a string in the database, so that users can adjust the text at the bottom of the receipt. But when I pull the string out of the database, C# adds slashes to the string, so my escape characters won't work. I know that usually this is not a problem, but in my case it is, because my ESC/POS commands won't work.
I've already tried some scripts which replace the double \ with a single \, but they don't work (e.g. String.Replace(@"\\", @"\")).
Problem:
I have a string: "foo \n bar"
Needs to print as:
foo
bar
C# adds slashes: "foo \\n bar"
Now it's printed as:
foo \n bar
Does anyone have an idea?
The problem is a misunderstanding of how C# handles strings. Take the following sample code:
string foo = "a\nb";
int fooLength = foo.Length; // 3 characters.
int bar = (int)(foo[1]); // 10 = linefeed character.
versus:
string foo = @"a\nb"; // NB: @ prefix!
int fooLength = foo.Length; // 4 characters.
int bar = (int)(foo[1]); // 92 = backslash character.
The first example uses a string literal ("a\nb") which is interpreted by the C# compiler to yield three characters. The second example uses a verbatim string literal, due to the prefix @, which suppresses the interpretation of the string.
Note that the debugger is designed to add to the confusion by displaying strings with escape codes added, e.g. string foo = "a\nb" + (Char)9; results in a string that the debugger shows as "a\nb\t". If you use the "text visualizer" in the debugger (by clicking on the magnifying glass when examining the variable's value) you can see the difference between literal and interpreted characters.
Databases are, as a rule, designed to accept and return string values without interpretation. That way you needn't worry about names like "Delete D'table". Neither the presence of a SQL keyword, nor punctuation used in SQL statements, should present a problem in a data column.
Now the OP's issue should be becoming clearer. The string retrieved from the database does not contain a linefeed, but instead contains the characters '\' and 'n'. .NET has no reason to change those values when the string is read from the database and written to a printer. Unfortunately, the debugger confounds the difference. (Use the text visualizer as described above.)
The solution involves adding code to reproduce the C# compiler's processing of escape sequences. (This should include escaping escape characters!) Alternatively, tokens can be added that are suitable for the application at hand, e.g. occurrences of «ESC» could be replaced with an ASCII escape character. This can be employed for longer sequences: for example, if a printer uses several characters to introduce a font change, then write the code to replace «SetFont» with the correct sequence. More generally, you can replace a snippet with a dynamic value, e.g. «Now» could be replaced with the current date/time when the receipt is being printed. (Register number, cashier name, store hours, ... .) This makes the values in the database more human-readable than embedded Unicode oddities and more flexible than fixed strings.
Left as an exercise for the reader: extend snippets to support formatting and null value substitution. «Now|DD.MM.YY hh:mm» to specify a format, «Discount|*|n/a» to specify a value ("n/a") to be displayed if the field is null.
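The token idea above can be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the class name ReceiptText, the method Expand, and the token names LF and Now in the table are assumptions made up for the example (only «ESC» and «Now» are mentioned in the answer), and unknown tokens are simply left untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ReceiptText
{
    // Hypothetical token table: token name -> function producing the replacement.
    private static readonly Dictionary<string, Func<string>> Tokens =
        new Dictionary<string, Func<string>>
        {
            { "ESC", () => "\u001B" },                  // ASCII escape character
            { "LF",  () => "\n" },                      // linefeed
            { "Now", () => DateTime.Now.ToString("g") } // current date/time
        };

    // Replaces every «Name» token found in the table; anything else is copied as-is.
    public static string Expand(string template)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < template.Length)
        {
            int open = template.IndexOf('«', i);
            int close = open < 0 ? -1 : template.IndexOf('»', open + 1);
            if (close < 0)
            {
                // No complete token left: copy the rest and stop.
                sb.Append(template, i, template.Length - i);
                break;
            }
            sb.Append(template, i, open - i);
            string name = template.Substring(open + 1, close - open - 1);
            if (Tokens.TryGetValue(name, out var produce))
                sb.Append(produce());
            else
                sb.Append(template, open, close - open + 1); // unknown token: keep it
            i = close + 1;
        }
        return sb.ToString();
    }
}
```

With this sketch, a database value such as foo«LF»bar would be expanded to two lines before being sent to the printer, and the text stored in the database stays readable for the users who edit it.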
Related
I am doing a bulk import to PostgreSQL from C# and one of the records gives me this error:
22021: invalid byte sequence for encoding "UTF8": 0x00
I googled it and the general advice is that this refers to a null field but in my instance this is not the case. I tracked down the string that causes the error and it is this:
Addresses the following: Let $A$ be a Banach algebra, and let $\sum:\0\rightarrow I\rightarrow\mathfrak A\overset\pi\to\longrightarrow A\rightarrow 0$ be an extension of $A$, where $\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\mathfrak A$.
I am reading this from an XML file and have UTF-8 defined on the file stream.
The escaped string on my deserialized C# class is:
"Addresses the following: Let $A$ be a Banach algebra, and let $\\sum\\:\\0\\rightarrow I\\rightarrow\\mathfrak A\\overset\\pi\\to\\longrightarrow A\\rightarrow 0$ be an extension of $A$, where $\\mathfrak A$ is a Banach algebra and $I$ is a closed ideal in $\\mathfrak A$."
Obviously something is not right with the string. I am guessing some sort of mathematical symbols should be there, but what exactly about this is breaking the import and making PostgreSQL report that it is a null field? What format should that be read in?
If I manually overwrite this field the import works, so it is 100% an issue with this string.
Since it's a bulk import, I'm assuming you're creating a file or some kind of big string to send to Postgres? In that case the strings probably have escape characters enabled, as opposed to executing this via, say, a prepared statement. So it's probably that \0 in your string that Postgres is escaping and interpreting as a 0x00.
from the docs: https://www.postgresql.org/docs/8.3/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS
PostgreSQL also accepts "escape" string constants, which are an extension to the SQL standard. An escape string constant is specified by writing the letter E (upper or lower case) just before the opening single quote, e.g. E'foo'. (When continuing an escape string constant across lines, write E only before the first opening quote.) Within an escape string, a backslash character (\) begins a C-like backslash escape sequence, in which the combination of backslash and following character(s) represents a special byte value. \b is a backspace, \f is a form feed, \n is a newline, \r is a carriage return, \t is a tab. Also supported are \digits, where digits represents an octal byte value, and \xhexdigits, where hexdigits represents a hexadecimal byte value. (It is your responsibility that the byte sequences you create are valid characters in the server character set encoding.) Any other character following a backslash is taken literally. Thus, to include a backslash character, write two backslashes (\\). Also, a single quote can be included in an escape string by writing \', in addition to the normal way of ''.
So if your bulk statement is prepending strings with E, like E'hello', don't do that.
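If the bulk import instead goes through COPY in text format (an assumption; the question does not say which mechanism is used), backslashes in the data are also treated as escapes there, so each field would need its backslashes doubled before being written. A minimal sketch, with the hypothetical helper name CopyText.EscapeField and the default tab delimiter assumed:

```csharp
using System;

public static class CopyText
{
    // Escapes one column value for PostgreSQL COPY text format
    // (assumes the default tab delimiter).
    public static string EscapeField(string value)
    {
        return value
            .Replace("\\", "\\\\") // backslash first, so later escapes are not doubled
            .Replace("\t", "\\t")  // column delimiter
            .Replace("\n", "\\n")  // row delimiter
            .Replace("\r", "\\r");
    }
}
```

After this, the \0 in the LaTeX text reaches the server as \\0, i.e. a literal backslash followed by the digit 0, rather than as an octal escape for the zero byte.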
Hey, I have an issue with Regex.Escape. I'm trying to feed it an email address from a TextBox control. The function receives "test@test.test". What I expect to get is "test@test\.test", since Regex.Escape escapes the dot character. However, what I get instead is "test@test\\.test", which is very confusing. I plan on handing that string down to an SQL query and I'm worried about users misbehaving.
holder.address = Regex.Escape(EmailAddressInput.Text);
This is how I assign resulting string to field in holder class.
I have been researching this problem on my own but most sources (including MSDN) suggest to prefix the dot ("the special character") with one backslash.
As it is right now backslash escapes backslash and result is a badly formatted email address.
var s = "test@test\\.test"; means that s holds the string test@test\.test. Your issue does not exist. There is a single backslash. Click the magnifier button on the right and you will see that in the Text Visualizer.
The debugger's display has to show \\ because it's escaping the \
the string itself actually only has one \ in it.
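One way to check this without relying on the debugger display is to look at the length of the result or print it. The following is a small sketch; the class name EscapeDemo and its methods are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // Returns the regex-escaped form of the address.
    public static string EscapeAddress(string address)
    {
        return Regex.Escape(address);
    }

    public static void Show()
    {
        string escaped = EscapeAddress("test@test.test");
        // Console output shows the real characters, not a C# literal.
        Console.WriteLine(escaped);        // test@test\.test
        Console.WriteLine(escaped.Length); // 15: the 14 original characters plus one backslash
    }
}
```

The input has 14 characters and the result has 15, so exactly one backslash was added.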
How do I escape with the @-sign when using variables?
File.Delete(@"c:\test"); // WORKS!
File.Delete(@path); // doesn't work :(
File.Delete(@"c:\test"+path); // WORKS
Anyone have any idea? It's the 2nd example I want to use!
Strings prefixed with the @ character are called verbatim string literals (whose contents do not need to be escaped).
Therefore, you can only use @ with string literals, not string variables.
So, just File.Delete(path); will do, after you assign the path in advance of course (from a verbatim string or some other string).
Verbatim strings are just a syntactic nicety to be able to type strings containing backslashes (paths, regexes) easier. The declarations
string path = "C:\\test";
string path = @"C:\test";
are completely identical in their result. Both result in a string containing C:\test. Note that either option is just needed because the C# language treats \ in strings as special.
The @ is not some magic pixie dust needed to make paths work properly; it has a defined meaning when prefixed to strings, in that the strings are interpreted without the usual \ escape sequences.
The reason your second example doesn't work like you expect is that @ prefixed to a variable name does something different: it allows you to use reserved keywords as identifiers, so that you could use @class as an identifier, for example. For identifiers that don't clash with keywords the result is the same as without.
If you have a string variable containing a path, then you can usually assume that there is no escaping needed at all. After all it already is in a string. The things I mentioned above are needed to get text from source code correctly through the compiler into a string at runtime, because the compiler has different ideas. The string itself is just data that's always represented the same.
This still means that you have to initialise the string in a way that backslashes survive. If you read it from somewhere no special treatment should be necessary, if you have it as a constant string somewhere else in the code, then again, one of the options at the top has to be used.
string path = @"c:\test";
File.Delete(path);
The @ prefix works only on a string literal. Written as a regular literal, the same string is "c:\\test".
There's a major problem with your understanding of the @ indicator.
@"whatever string" is a verbatim string literal. It tells the C# compiler not to look for escape sequences. Normally, "\" starts an escape sequence in a string, and you can do things like "\n" to indicate a new line or "\t" to indicate a tab. However, if you have @"\n", it tells the compiler "no, I really want to treat the backslash as a backslash character, not an escape sequence."
If you don't like verbatim mode, the way to do it is to use "\\" anywhere you want a single backslash, because the compiler knows to treat an escaped backslash as the single character.
In either case, @"\n" and "\\n" will produce a 2-character string in memory, with the characters '\' and 'n'. It doesn't matter which way you get there; both are ways of telling the compiler you want those two characters.
In light of this, @path makes no sense, because you don't have any literal characters - just a variable. By the time you have the variable, you already have the characters you want in memory. It does compile ok, as explained by Joey, but it's not logically what you're looking for.
If you're looking for a way to get rid of occurrences of \\ within a variable, you simply want String.Replace:
string ugly = @"C:\\foo";
ugly = ugly.Replace(@"\\", @"\");
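The points made in the answers above can be checked with a few lines of code. This is just a sketch; the class and method names are invented for the example.

```csharp
using System;

public static class VerbatimDemo
{
    // A regular literal with an escaped backslash and a verbatim literal
    // produce the same seven characters: C : \ t e s t
    public static bool LiteralsMatch()
    {
        string escaped = "C:\\test";
        string verbatim = @"C:\test";
        return escaped == verbatim && escaped.Length == 7;
    }

    // @ in front of an identifier only allows keywords as names;
    // @path and path refer to the very same variable.
    public static bool AtOnIdentifierIsNoOp()
    {
        string path = @"C:\test";
        return object.ReferenceEquals(@path, path);
    }

    // Replace on a string that really contains two backslashes.
    public static string CollapseDoubleBackslash()
    {
        string ugly = @"C:\\foo";            // 7 characters, two of them backslashes
        return ugly.Replace(@"\\", @"\");    // C:\foo
    }
}
```

Both checks return true, which is why File.Delete(path) needs no @ once the variable has been assigned.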
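The first and third use string literals containing actual paths, hence they would work.
In the second, @ is applied to an identifier, where it has no effect on the string's contents; it would work if path was assigned beforehand:
string path = @"c:\test";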
File.Delete(path);
I have kind of a weird problem that I am attempting to resolve with some elegant regular expressions.
The system I am working on was initially designed to accept an incoming string and through a pattern matching method, alter the string which it then returns. A very simplistic example is:
Incoming string:
The dog & I went to the park and had a great time...
Outgoing string:
The dog {&} I went to the park and had a great time {...}
The punctuation mapper wraps key characters or phrases in curly braces. The original implementation was a one-way street and was never meant for how it is currently being applied. As a result, if it is called incorrectly, it is very easy for the system to "double" wrap a string, as it is just doing a simple string replace.
I spun up Regex Hero this morning and started working on some pattern matches and having not written a regular expression in nearly a year, quickly hit a wall.
My first idea was to match a character (e.g. &) but only if it wasn't wrapped in braces, and I came up with [^\{]&[^\}]. That is great, but of course it catches any instance of the ampersand so long as it is not preceded by a curly brace, including the surrounding white space, and it would not work in a situation where there were two ampersands back to back (i.e. && would need to be {&}{&} in the outgoing string). To make matters more complicated, it is not always a single character, as ellipsis (...) is also one of the mapped values.
Every solution I noodle over either hits a barrier because there is an unknown number of occurrences of a particular value in the string or that the capture groups will either be too greedy or finally, cannot compensate for multiple values back to back (i.e. a single period . vs ellipsis ...) which the original dev handled by processing ellipsis first which covered the period in the string replace implementation.
Are there any regex gurus out there that have any ideas on how I can detect the undecorated (unwrapped) values in a string and then perform their replacements in an ungreedy fashion that can also handle multiple repeated characters?
My datasource that I am working against is a simple key value pair that contains the value to be searched for and the value to replace it with.
Updated with example strings:
Undecorated:
Show Details...
Default Server:
"Smart" 2-Way
Show Lender's Information
Black & White
Decorated:
Show Details{...}
Default Server{:}
{"}Smart{"} 2-Way
Show Lender{'}s Information
Black {&} White
Updated With More Concrete Examples and Datasource
Datasource (SQL table, can grow at any time):
TaggedValue UntaggedValue
{:} :
{&} &
{<} <
{$} $
{'} '
{} \
{>} >
{"} "
{%} %
{...} ...
{...} …
{:} :
{"} “
{"} ”
{'} `
{'} ’
Broken String: This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&
String that needs decoration: Show Details... Default Server: "Smart" 2-Way Show Lender's Information Black & White
String that would pass through the method untouched (because it was already decorated): The dog {&} I went to the park and had a great time {...}
The other "gotcha" in moving to regex is the need to handle escaping, especially of backslashes elegantly due to their function in regular expressions.
Updated with output from @Ethan Brown
@Ethan Brown,
I am starting to think that regex, while elegant, might not be the way to go here. The updated code you provided, while closer, still does not yield correct results, and the number of variables involved may exceed what regex logic is capable of.
Using my example above:
'This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it & follow by ! and ... _ & . &&&'
yields
This is a string that already has stuff {&} other stuff{!} and {...} with {_} and {#} as well{.} and here are the same characters without it {&} follow by {!} and {...} {_} {&} . {&&}&
Where the last group of ampersands which should come out as {&}{&}{&} actually comes out as {&&}&.
There is so much variability here (i.e. need to handle ellipsis and wide ellipsis from far east languages) and the need to utilize a database as the datasource is paramount.
I think I am just going to write a custom evaluator which I can easily enough write to perform this type of validation and shelve the regex route for now. I will grant you credit for your answer and work as soon as I get in front of a desktop browser.
This kind of problem can be really tough, but let me give you some ideas that might help out. One thing that's really going to give you headaches is handling the case where the punctuation appears at the beginning or end of the string. Certainly that's possible to handle in a regex with a construct like (^|[^{])&($|[^}]), but in addition to that being painfully hard to read, it also has efficiency issues. However, there's a simple way to "cheat" and get around this problem: just pad your input string with a space on either end:
var input = " " + originalInput + " ";
When you're done you can just trim. Of course if you care about preserving input at the beginning or end, you'll have to be more clever, but I'm going to assume for argument's sake that you don't.
So now on to the meat of the problem. Certainly, we can come up with some elaborate regular expressions to do what we're looking for, but often the answer is much much simpler if you use more than one regular expression.
Since you've updated your question with more characters, and more problem inputs, I've updated this answer to be more flexible: hopefully it will meet your needs better as more characters get added.
Looking over your input space, and the expressions you need quoted, there are really three cases:
Single-character replacements (! becomes {!}, for example).
Multi-character replacements (... becomes {...}).
Backslash replacement (\ becomes {})
Since the period is included in the single-character replacements, order matters: if you replace all the periods first, then you will miss ellipses.
Because I find the C# regex library a little clunky, I use the following extension method to make this more "fluent":
public static class StringExtensions {
public static string RegexReplace( this string s, string regex, string replacement ) {
return Regex.Replace( s, regex, replacement );
}
}
Now I can cover all of the cases:
// putting this into a const will make it easier to add new
// characters in the future
const string normalQuotedChars = @"\!_\\:&<\$'>""%:`";
var output = s
.RegexReplace( "(?<=[^{])\\.\\.\\.(?=[^}])", "{$&}" )
.RegexReplace( "(?<=[^{])[" + normalQuotedChars + "](?=[^}])", "{$&}" )
.RegexReplace( "\\\\", "{}" );
So let's break this solution down:
First we handle the ellipses (which will keep us from getting in trouble with periods later). Note that we use zero-width assertions at the beginning and end of the expression to exclude expressions that are already quoted. The zero-width assertions are necessary, because without them, we'd get into trouble with quoted characters right next to each other. For example, if you have the regex ([^{])!([^}]), and your input string is foo !! bar, the match would include the space before the first exclamation point and the second exclamation point. A naive replacement of $1{!}$2 would therefore yield foo {!}! bar because the second exclamation point would have been consumed as part of the match. You'd have to end up doing an exhaustive match, and it's much easier to just use zero-width assertions, which are not consumed.
Then we handle all of the normal quoted characters. Note that we use zero-width assertions here for the same reasons as above.
Finally, we can find lone backslashes (note we have to escape them twice: once for C# strings and again for regex metacharacters) and replace them with empty curly brackets.
I ran all of your test cases (and a few of my own invention) through this series of matches, and it all worked as expected.
I'm no regex god, so one simple way:
Get / construct the final replacement string(s) - ex. "{...}", "{&}"
Replace all occurrences of these in the input with a reserved char (unicode to the rescue)
Run your matching regex(es) and put "{" or whatever desired marker(s).
Replace reserved char(s) with the original string.
Ignoring the case where your original input string has a { or } character, a common way to avoid re-applying a regex to an already-escaped string is to look for the escape sequence and remove it from the string before applying your regex to the remainders. Here's an example regex to find things that are already escaped:
Regex escapedPattern = new Regex(@"\{[^{}]*\}"); // consider adding RegexOptions.Compiled
The basic idea of this negative-character class pattern comes from regular-expressions.info, a very helpful site for all things regex. The pattern works because for any inner-most pair of braces, there must be a { followed by non {}'s followed by a }
Run the escapedPattern on the input string; for each Match, get the start and end indices in the original string and cut that substring out. Then, with the final cleaned string, run your original pattern match again or use something like the following:
Regex punctPattern = new Regex(@"[^\w\d\s]+"); // this assumes all non-word,
// digit or space chars are punctuation, which may not be a correct
// assumption
And replace each match with "{" + Match.Value + "}". (This pattern has no parentheses, so use the whole match, which is also Groups[0]; groups are a 0-based array where 0 is the whole match, 1 is the first set of parentheses, 2 is the next, etc.)
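A variation on the ideas above, which none of the answers uses as such, is to do everything in one pass: build a single alternation that matches already-wrapped tokens first and then the untagged values (longest first), and let a match evaluator decide what to emit. The sketch below assumes the key/value table has been loaded into a dictionary; the class name PunctuationDecorator and the few hard-coded entries are placeholders for that data, and input that contains stray { or } characters is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class PunctuationDecorator
{
    // Hypothetical in-memory copy of the lookup table: untagged value -> tagged value.
    private static readonly Dictionary<string, string> Map =
        new Dictionary<string, string>
        {
            { "...", "{...}" },
            { "&",   "{&}"   },
            { ":",   "{:}"   },
            { "'",   "{'}"   },
            { "\"",  "{\"}"  },
            { "\\",  "{}"    }
        };

    // Already-wrapped tokens come first in the alternation so they are consumed
    // untouched; the untagged values follow, longest first, each regex-escaped.
    private static readonly Regex Pattern = new Regex(
        @"\{[^{}]*\}|" + string.Join("|", Map.Keys
            .OrderByDescending(k => k.Length)
            .Select(Regex.Escape)));

    public static string Decorate(string input)
    {
        // A wrapped token is not a key in Map, so it is returned unchanged.
        return Pattern.Replace(input,
            m => Map.TryGetValue(m.Value, out var tagged) ? tagged : m.Value);
    }
}
```

Because each match is consumed before the next one is searched for, back-to-back characters such as &&& come out as {&}{&}{&}, and Regex.Escape takes care of the backslash and the periods without hand-written escaping. New rows in the table only change the dictionary, not the code.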
I have to parse some files that contain some string that has characters in them that I need to escape. To make a short example you can imagine something like this:
var stringFromFile = "This is \\n a test \\u0085";
Console.WriteLine(stringFromFile);
The above results in the output:
This is \n a test \u0085
, but I want the escape sequences in the text interpreted. How do I do this in C#? The text contains unicode characters too.
To make clear; The above code is just an example. The text contains the \n and unicode \u00xx characters from the file.
Example of the file contents:
Fisika (vanaf Grieks, \u03C6\u03C5\u03C3\u03B9\u03BA\u03CC\u03C2,
\"Natuurlik\", en \u03C6\u03CD\u03C3\u03B9\u03C2, \"Natuur\") is die
wetenskap van die Natuur
Try it using: Regex.Unescape(string)
Should be the right way.
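A short sketch of how that call could be used on text read from a file (the wrapper class UnescapeDemo is made up for the example). Note that Regex.Unescape is meant for regular-expression escapes, so it is worth testing it against the escape sequences that actually occur in your files before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnescapeDemo
{
    // Converts backslash sequences such as \n and \uXXXX found in the
    // raw text into the characters they stand for.
    public static string Interpret(string raw)
    {
        return Regex.Unescape(raw);
    }

    public static void Show()
    {
        // The verbatim literal stands in for a line read from the file:
        // it contains a real backslash followed by 'n', 'u', and so on.
        string fromFile = @"Fisika \u03C6\u03C5\u03C3 \n Natuur";
        Console.WriteLine(Interpret(fromFile));
    }
}
```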
Don't use the @ symbol -- this interprets the string as 100% literal. Just take it off and all shall be well.
EDIT
I may have been a bit hasty with my reply. I think what you're asking is: how can I have C# turn the literal string '\n' into a newline, when read from a file (similar question for other escaped literals).
The answer is: you write it yourself. You need to search for "\\n" and convert it to "\n". Keep in mind that in C#, it's the compiler not the language that changes your strings into actual literals, so there's not some library call to do this (actually there could be -- someone look this up, quick).
EDIT
Aha! Eureka! Behold:
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.unescape.aspx
Since you are reading the string from a file, \n is not read as a unicode character but rather as two characters \ and n.
I would say you probably need a search and replace function to convert the string "\n" to its unicode character '\n' and so on.
I don't think there's any easy way to do this. Because it's the job of lexical analyzer to parse literals.
I would try generating and compiling a class via CodeDOM with the string inserted there as constant. It's not very fast but it will do all escaping.