Remove unicode/ascii characters from string - c#

Our end users still copy and paste things from Word and Excel into form fields and we end up with a lot of unwanted characters in our database tables. I've tried a bunch of things to remove unwanted characters from strings. The latest is a character like the following
I have tried the following to no avail:
summary = Regex.Replace(summary, @"[^\u0000-\u007F]+", string.Empty);
summary = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(summary));
Does saving it to the database somehow change its value?!?!
This does find the offending string in the DB:
select *
from Project
where CharIndex(CHAR(2), summary) > 0
The server error that gets thrown is this:
System.ArgumentException: '', hexadecimal value 0x02, is an invalid character
which is why I tried the Regex solution first (\u0002 seems to be the offending character as far as C# is concerned)
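One thing worth noting: U+0002 lies inside the range \u0000-\u007F, so a pattern that removes only non-ASCII characters leaves it in place, and a round trip through Encoding.ASCII keeps it too. A minimal sketch of a filter that targets control characters instead is below. The names TextCleaner and StripControlChars are made up for illustration, and keeping tab, LF and CR is an assumption about what the form fields should allow.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TextCleaner
{
    // C0 control characters except tab (0x09), LF (0x0A) and CR (0x0D),
    // plus DEL (0x7F). U+0002 falls in this set.
    private static readonly Regex ControlChars =
        new Regex(@"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]", RegexOptions.Compiled);

    // Hypothetical helper: returns the input with those control characters removed.
    public static string StripControlChars(string input)
    {
        if (string.IsNullOrEmpty(input)) return input;
        return ControlChars.Replace(input, string.Empty);
    }
}
```

It would be called as summary = TextCleaner.StripControlChars(summary); before the value is saved or serialized.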

Related

c#.net regex to remove certain non ascii chars does not work

I'm a newbie to .NET and I use a script task in SSIS. I am trying to load a file into a database, and the file has some characters like the ones below. It looks like data copied from Word, where - has turned into –
Sample text:
Correction – Spring Promo 2016
Notepad++ shows:
I used the regex [^\x00-\x7F] in the .NET script, but even though the character falls in the range it gets replaced. I do not want these characters to be altered. What am I missing here?
If I don't replace them I get a truncation error, as I believe these characters take up more than one byte.
Edit: I added sample rows. First two rows have problem and last two are okay.
123|NA|0|-.10000|Correction – Spring Promo 2016|.000000|gift|2013-06-29
345|NA|1|-.50000|Correction–Spring Promo 2011|.000000|makr|2012-06-29
117|ER|0|12.000000|EDR - (WR) US STATE|.000000|TEST MARGIN|2016-02-30
232|TV|0|.100000|UFT / MGT v8|.000000|test. second|2006-06-09
After a good long weekend :) I am beginning to think that this is due to a code page error. The exact error message when loading the flat file is below.
Error: Data conversion failed. The data conversion for column "NAME" returned status value 4 and status text "Text was truncated or one or more characters had no match in the target code page.".
This is what I do in my SSIS package.
A script task validates the flat files.
The only validation that affects the contents of the file checks that the number of delimited columns in each line matches what it should be for that file. I need to read each line and, if there is an extra pipe delimiter (user entry), remove that line from the file and log it to a custom table.
Using the StreamWriter class, I write all the valid lines to a temp file and rename/move the file at the end.
Apologies, but I have just noticed that this process changes all such lines above to something like this.
Notepad: Correction � Spring Promo 2016
How do I stop my script task doing this? (which should be the solution)
If that's not easy, option 2 being..
My connection managers are flat file source and OLEDB destination. The OLEDB uses the default code page which is 1252. If these characters are not a match in code page 1252, what should I be using? Are there any other workarounds without changing the code page?
Script task:
foreach (string file in files) // ... some other checks
{
    var tFile = Path.GetTempFileName();
    using (StreamReader rFile = new StreamReader(file))
    using (var swriter = new StreamWriter(tFile))
    {
        string line;
        while ((line = rFile.ReadLine()) != null)
        {
            NrDelimtrInLine = line.Count(x => x == '|') + 1;
            if (columnCount == NrDelimtrInLine)
            {
                swriter.WriteLine(line);
            }
        }
    }
}
Thank you so much.
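A likely cause of the � is the default encoding: StreamReader and StreamWriter assume UTF-8 when no encoding is passed, so a lone byte such as 0x96 (the en dash in code page 1252) is not valid UTF-8 and is decoded to U+FFFD. The sketch below passes one explicit encoding to both the reader and the writer. LineFilter and CopyValidLines are illustrative names, and the assumption that the files are in code page 1252 should be checked against the actual files.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class LineFilter
{
    // Copies only the lines with the expected number of pipe-delimited columns,
    // reading and writing with the same explicit encoding so that no
    // character is re-interpreted along the way.
    public static void CopyValidLines(string sourcePath, string targetPath,
                                      int columnCount, Encoding encoding)
    {
        using (var reader = new StreamReader(sourcePath, encoding))
        using (var writer = new StreamWriter(targetPath, false, encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int columnsInLine = line.Count(c => c == '|') + 1;
                if (columnsInLine == columnCount)
                {
                    writer.WriteLine(line);
                }
            }
        }
    }
}
```

In a .NET Framework script task the call would be something like LineFilter.CopyValidLines(file, tFile, columnCount, Encoding.GetEncoding(1252)). On .NET Core and later, code page 1252 is only available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) has been called.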
It's not clear to me what you intend since "I do not want these characters to be altered" seems mutually exclusive with "they must be replaced to avoid truncation". I would need to see the code to give you further advice.
In general I recommend always testing your regex patterns outside of code first. I usually use http://regexr.com
If you want to match your special characters:
If you want to match anything except your special characters:

Why do I get an CS1056 Unexpected character '' on this code

I'm getting this unexpected character '' error and I don't understand why.
var list = new List<MyModel>();
list.Add(new MyModel() {
variable1 = 942,
variable2 = 2001,
variable3 = "my text",
variable4 = 123
​}); // CS1056 Unexpected character '' on this line
From what the error says, and from the error code I got from an online compiler after copy/pasting, your code on this line contains a character that is not visible but that the compiler is trying to interpret. Simply erase every character from your closing bracket back to the number 3 and press Enter again. It should then work (it did for me).
I just deleted the file Version=v4.0.AssemblyAttributes.cs(1,1,1,1) located in my temp folder C:\Users\MyUser\AppData\Local\Temp and then it works perfectly.
For .NET Core you have to delete .NETCoreApp,Version=v2.1.AssemblyAttributes.cs
As mentioned by Daneau in the accepted answer, the problem is caused by a character that is not visible in the IDE.
Here are several solutions to find the invisible character with Notepad++.
Solution 1: Show Symbol
Copy the code to Notepad++,
Select View -> Show Symbol -> Show All Characters
This can show invisible control characters.
Solution 2: Convert to ANSI
Copy the code to Notepad++,
Select Encoding -> Convert to ANSI
This will convert the invisible character to ? if it is a non-ANSI character.
Solution 3: Remove non-ASCII characters
Copy the code to Notepad++,
Open the Find window (Ctrl+F)
Select the Replace tab
In "Find what" write: [^\x00-\x7F]
Leave "Replace with" empty
In "Search Mode" select "Regular expression"
Find and remove the non-ASCII characters
This will remove non-ASCII characters.
Note: This can also remove valid non-ASCII characters (in strings and comments), so skip those if you have any.
Tip: Use HEX-Editor plugin
Use the Notepad++ HEX-Editor plugin to see the binary code of the text. Any character outside the range 0x00 - 0x7F (0 - 127) is a non-ASCII character and a suspect.
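The same check can be scripted. The sketch below lists every character outside printable ASCII together with its position and code point. InvisibleCharFinder and FindSuspects are hypothetical names, and treating tab, CR and LF as harmless is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class InvisibleCharFinder
{
    // Returns one "index: U+XXXX" entry for each character that is neither
    // printable ASCII (0x20-0x7E) nor tab, CR or LF.
    public static List<string> FindSuspects(string text)
    {
        var suspects = new List<string>();
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            bool printableAscii = c >= 0x20 && c <= 0x7E;
            bool commonWhitespace = c == '\t' || c == '\r' || c == '\n';
            if (!printableAscii && !commonWhitespace)
            {
                suspects.Add(string.Format("{0}: U+{1:X4}", i, (int)c));
            }
        }
        return suspects;
    }
}
```

For a line that starts with a zero-width space, such as "\u200B});", the result is a single entry, 0: U+200B. Control characters such as DLE (U+0010) are reported as well, even though they are inside the ASCII range.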
Just reporting my direct experience.
As Daneau wrote, I had a character (ASCII DLE, which I copied while messing with a Zebra printer) hiding in the text. I could not afford to rewrite everything, so I used the Notepad++ "View->Show Symbol->Show All Characters" feature.
I apologize for not commenting on Daneau's entry, but I don't have enough reputation.
Write the code again without copying it. That worked for me.
Go to C:\Users\UserName\AppData\Local\Temp\ and clear the data, or remove the file specified in the error; that will solve the issue.
VS will recreate the required file automatically, no worries.
I got this error when I moved my application from one folder to another. I resolved it by deleting the Debug folder inside the obj folder.
It indeed has to do with copy pasting code and characters that you cannot see. The easiest way to fix it is by passing your copy pasted code into a note application or simple text program which will automatically remove these invisible characters. After that simply copy the code from the text editor and paste it into your IDE.
For some reason this happened to me on every project in my solution. My fix was to delete all bin and obj folders in my solution.

How to extract single quote from SQL into an .XSD in C#?

I'm trying to store Regex values in the DB, later to be used for custom validation.
I'm storing a Regex like this:
[a-zA-Z\''\"]+
(Two single quotes in order to get one in the DB):
[a-zA-Z\'\"]+
When I extract this regex, I get an error while filling the dataset:
Incorrect syntax near '\'. Unclosed quotation mark after the character string ']+''.
UPDATE #TEMP2 SET CandidateNameRegex='[a-zA-Z\'\"]+'
I've tried different variations:
'[a-zA-Z\''\"]+'
'[a-zA-Z\''\""]+'
'[a-zA-Z\''\\"]+'
'#[a-zA-Z\''\"]+'
'[a-zA-Z\'\"]+'
But none seem to do the trick.
So, how do we extract a single quote from the DB without breaking the string?
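In T-SQL the backslash is not an escape character inside a string literal, so \' still ends the string; a single quote is written by doubling it. If the extracted value is later pasted into another statement by string concatenation, the quote has to be doubled again at that point. A minimal sketch of that step follows. SqlText.ToLiteral is a made-up helper, not part of ADO.NET, and whether the failing UPDATE is built by concatenation is an assumption. Where the statement is built from C#, passing the value as a command parameter avoids the problem altogether.

```csharp
using System;

public static class SqlText
{
    // Wraps a value in single quotes for use as a T-SQL string literal,
    // doubling every embedded single quote.
    public static string ToLiteral(string value)
    {
        if (value == null) return "NULL";
        return "'" + value.Replace("'", "''") + "'";
    }
}
```

Called with the regex [a-zA-Z\'\"]+ as stored, it returns '[a-zA-Z\''\"]+', which can be embedded in a statement without closing the literal early.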

How to carriage return C# string without \r\n?

This is my problem.
A user can enter text into a text area in the browser. Which is then emailed out to users.
What I want to know is how to handle a carriage return. If I enter \r\n for a carriage return, the email (which is a plain text email) has the actual text \r\n in it.
In other words:
On the SQL server end
Case 1:
if I do this before the email gets sent
(notice the line break after line 1)
update emails
set
body='line 1
line 2'
where
id=100
the email goes out correctly
Case 2:
update emails
set
body='line 1'+char(13) + char(10) +'line 2'
where
id=100
This email also goes out correctly
Case 3:
However if I do this
update emails
set
body='line 1 \r\n line 2'
where
id=100
the email would have the actual text \r\n in it.
How do I simulate case 1/2 through C#?
SQL literals (at least those in SQL Server) do not support such escape sequences (although you can just hit enter within the string literal so that it spans multiple lines). See this answer for some alternatives if writing it as an SQL string is a requirement.
If running the SQL programmatically from C#, use parameters which will handle this just fine:
sqlCommand.CommandText = "update emails set body=@body where id=@id";
sqlCommand.Parameters.AddWithValue("@body", "line 1 \r\n line2");
sqlCommand.Parameters.AddWithValue("@id", 100);
Note that the handling of the string literal (and conversion of the \r and \n character escape sequences) happens in C# and the value (with CR and LF characters) is passed to SQL.
If the above didn't address the problem, keep reading.
4.10.13 The textarea element:
For historical reasons, the element's value is normalised in three different ways for three different purposes. The raw value is the value as it was originally set. It is not normalized. The API value is the value used in the value IDL attribute. It is normalized so that line breaks use "LF" (U+000A) characters. Finally, there is the form submission value. [Upon form submission the textarea] is normalized so that line breaks use U+000D CARRIAGE RETURN U+000A LINE FEED (CRLF) character pairs, and in addition, if necessary given the element's wrap attribute, additional line breaks are inserted to wrap the text at the given width.
Note that CR and LF represent characters and not the two-character sequence of \ followed by either the r or n characters - this form is often found in string literals. If it appears as such then something is doing the incorrect conversion and putting (or leaving) the \ there. Or, perhaps there is some misguided "add slashes" hack somewhere?
As pointed out, while URL decode is likely wrong, it won't directly do this conversion. However, if the conversion happened previously before being "URL Encoded", then it will (correctly) decode to (incorrect) values.
In either case, it's a bug. So find out where the incorrect data conversion is introduced and fix it (attach a debugger and/or monitor the network traffic for clues) - the required information to isolate where is simply not present in the post.
Use C#'s string.Replace method to replace the literal "\\r\\n" with "\r\n"; that should fix it.
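A minimal sketch of that replacement, assuming the stored text really contains a backslash followed by r and n rather than CR and LF characters (EscapeFixer.Unescape is an illustrative name). This only treats the symptom; the advice above to find where the wrong conversion happens still applies.

```csharp
using System;

public static class EscapeFixer
{
    // Turns the literal four-character sequence \r\n (backslash, r, backslash, n)
    // into a real carriage return + line feed pair.
    public static string Unescape(string body)
    {
        if (body == null) return null;
        return body.Replace("\\r\\n", "\r\n");
    }
}
```

Text that already contains real CR and LF characters passes through unchanged, because the search string starts with a backslash.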

Unexpected chars in Excel

I am using ADO.NET to fill a datatable from an Excel (xls) worksheet.
I got unexpected chars. At first I thought they came in somehow during the import, so I tried to eliminate them in the C# program, but nothing I tried worked.
Finally I traced the chars back to Excel and I was able to use the replace function in Excel to replace the char with ''. These chars show up as blanks in Excel and I only found them by working backwards from their location in the datatable which I had dumped to a text file.
In Excel I also tried the clear formatting function. But that didn't do the job.
How do I filter the input in the datatable for only ASCII chars (33 to 127)?
What kind of string do I get when I turn the datatable (typeof(System.String)) column into a string? I don't seem to be able to identify the chars when I convert the string to an array of chars.
Any suggestions? Since these chars were unexpected I want to be sure the spreadsheet input is filtered to keep only the visible printing chars and blank spaces. The text being imported should be just text, no numeric data...
The unexpected char that appears in the text file when I dump the table is ÿ.
Do your original fields contain carriage return ("\r"), newline ("\n"), tab ("\t") characters (Jon Skeet answering even outside stackoverflow) or NULL fields?
Try stripping all those characters before sending the information to the database.
Thanks, Voyager, for your reply.
Not that I could tell. There were some nulls from empty cells but I had gotten rid of them. I tried to filter away any \r, \n, \t and other non-printing chars. I've done that sort of thing in C many times, but I didn't seem to be able to do it in the C# program.
Finally I dropped down to the Excel worksheet itself and, with a VBA macro (module or whatever it is called), got rid of all the offending chars (less than 32 and greater than 126). There were a lot hanging around.
But all the data passes through the C# program, one program vs many spreadsheets, so of course I'd prefer to fix the issue in Excel.
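Filtering on the C# side could look like the sketch below, which keeps only the space character and the visible ASCII range (32 to 126). AsciiFilter.KeepPrintable is a hypothetical name. A System.String is a sequence of UTF-16 char values, so ÿ shows up as the single value 255 (U+00FF) and is dropped by the range check.

```csharp
using System;
using System.Linq;

public static class AsciiFilter
{
    // Keeps characters with code 32 (space) through 126 (~) and drops the rest,
    // including control characters and anything above the ASCII range.
    public static string KeepPrintable(string input)
    {
        if (input == null) return null;
        return new string(input.Where(c => c >= 32 && c <= 126).ToArray());
    }
}
```

It could be applied to each string cell after the datatable is filled, for example row[col] = AsciiFilter.KeepPrintable((string)row[col]) for non-null cells.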
