SQL Service Broker Queue Handling Foreign Characters in the message - c#

Setup: I have a form built in ASP.NET/C# that, on submit, XML-serializes its object model and calls a stored procedure with that XML-serialized data as the sole parameter. The stored procedure sends that data to a SQL Service Broker queue. The message sent to the queue must be valid XML that obeys the message contract set on the queue. That message is then picked up by BizTalk and processed accordingly.
Problem: Originally the data submitted to me was plain English text (essentially limited to the ASCII charset), but a requirement is on the horizon to support foreign characters as well. In my testing, I've noticed that if I submit something with foreign characters (Chinese, Arabic, etc.), I get an error in the queue, and the message that reaches BizTalk ends up with "?????" in place of the foreign characters. I've added a UTF-16 XML declaration to the top of the document, but that doesn't seem to help.
Question: Is there a way I can cast the incoming XML message as nvarchar and still have it be considered valid XML by the queue? I don't want to change the actual type on the queue or recreate it; I'd prefer to change the message in the stored procedure alone in some way that allows it to get onto the queue.
Thanks in advance for your help.

I ended up handling this by encoding the non-ASCII characters as HTML numeric entities and then security-escaping them. I ran into some issues using the HttpUtility library to handle this encoding on its own, so I've included the method I used below.
I wish I could give direct credit for this; I can't remember where I found it, but thank you to whoever it was:
private string EncodeToHTML(string text)
{
    // call the normal HtmlEncode first
    char[] chars = HttpUtility.HtmlEncode(text).ToCharArray();
    StringBuilder encodedValue = new StringBuilder();

    foreach (char c in chars)
    {
        if ((int)c > 127) // above normal ASCII: emit a numeric character reference
            encodedValue.Append("&#" + (int)c + ";");
        else
            encodedValue.Append(c);
    }

    return encodedValue.ToString();
}

Related

Send XML message to MSMQ without any formatting

I need to send an XmlDocument object to MSMQ. I don't have a class to deserialize it into (the XML may vary). The default formatter, XmlMessageFormatter, will "pretty print" the object. This causes a problem, since
<text></text>
Will be converted to
<text>
</text>
(i.e. CR + spaces). The message is being read by a process using the default XmlMessageFormatter and hasn't been an issue while nodes have data in them. It is, however, an issue further down the line, as a process (out of my control) will interpret these new characters as data and cause an error.
I know I could write some code to convert them using IsEmpty = true, giving <text />, but I'd like a solution that doesn't alter the object at all.
BinaryMessageFormatter will prefix the data with BOM data (the receiver is not expecting that), and ActiveXMessageFormatter will double-byte the string (again causing issues at the other end).
I would rather avoid having to write a custom message formatter. I've tried various options on the XmlMessageFormatter, but they've had little effect. Any ideas would be very much appreciated.
MSMQ operates on raw blobs. You do not have to use a formatter unless you want to.
To send a message and get it back byte-for-byte identical, use the BodyStream property.
Example:
var queue = new MessageQueue(@".\private$\queueName");
var msg = new Message();
msg.BodyStream = new MemoryStream(Encoding.UTF8.GetBytes("<root><test></test></root>"));
queue.Send(msg);
The resultant message body contains the XML exactly as written, byte-for-byte, with no added whitespace.
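For completeness, the receiving side can read the body back the same way; a minimal sketch, assuming the same queue path, a UTF-8 payload, and the System.Messaging / System.IO / System.Text namespaces:
var queue = new MessageQueue(@".\private$\queueName");
Message msg = queue.Receive();
using (var reader = new StreamReader(msg.BodyStream, Encoding.UTF8))
{
    // exactly the bytes that were sent; no formatter, no reformatting
    string xml = reader.ReadToEnd();
}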

IBM MQ server encodes messages

I'm sending XML to an IBM MQ queue, and the XML contains a CDATA section. That CDATA section contains these special characters: $§)#ÜÖ&!^. For some reason, they show up within the MQ queue as $�)#��&!^. The other side then takes the message off the queue with these characters and ends up with an invalid signature, because the messages no longer match.
We've verified that the message, when we do a .Put(), does contain an XML string with those special characters. I've also ensured that the message's .CharacterSet property is assigned to match what we will eventually pull off the queue.
What else could be auto-encoding the special characters when the message is put on the queue? Our application runs in a .NET Windows environment, but the MQ server is on a Linux box. Is this something to consider?
string xmlMsg = "<message><data><![CDATA[<value>$§)#ÜÖ&!^</value>]]></data></message>"; // the special characters are inside a CDATA section
mQMessage = new MQMessage
{
    CharacterSet = 1208,
};
mQMessage.WriteBytes(xmlMsg);
_queue.Put(mQMessage);
By default, MQ doesn't change the character set of your message, so it is the responsibility of the sending and receiving applications to agree on and maintain a character set that suits both.
You can request MQ to do character set conversion either in the receiving application, when it calls a get, or on the sender channels when the message is transmitted between queue managers. But even if you request character set conversion from MQ, it is still the sending application's responsibility to actually write the data into the message using the character set it sets on the MQ message header.
Based on your code it seems your sending application doesn't use the correct character set when it writes the bytes to the message. If you use WriteBytes, you need to manually convert the string into bytes using the desired character set.
I'd suggest using the WriteString method, which is designed to use the character set specified in the CharacterSet property:
The WriteString method converts from Unicode to the character set encoded in CharacterSet. If CharacterSet is set to its default value, MQC.MQCCSI_Q_MGR, which is 0, no conversion takes place and CharacterSet is set to 1200. If you set CharacterSet to some other value, WriteString converts from Unicode to the alternate value.
https://www.ibm.com/support/knowledgecenter/SSFKSJ_7.5.0/com.ibm.mq.ref.dev.doc/q111220_.htm
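Put together with the snippet above, the sending side would look something like this. A sketch, not a definitive fix, assuming the IBM MQ .NET classes already in use; the Format line is an addition on my part:
mQMessage = new MQMessage
{
    // 1208 = UTF-8; WriteString converts from Unicode using this CCSID
    CharacterSet = 1208,
    // MQFMT_STRING marks the body as character data, so receivers can request conversion
    Format = MQC.MQFMT_STRING,
};
mQMessage.WriteString(xmlMsg); // WriteString instead of WriteBytes, so conversion actually happens
_queue.Put(mQMessage);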
And by the way: for debugging character set issues, you have to be very careful about which tools you use to inspect the message, as your tool needs to interpret the character set of the message correctly. For example, MQ Explorer uses the character set of the workstation where you run it, so it shows every message with that one character set and is not suitable for debugging these issues. The best approach is to get the message off the queue without asking the queue manager for conversion (with rfhutil, for example), save it to a file, and look at it with a hex editor.

Making strings that may include utf-8 characters safe for MySQL in C#.Net

I have a program written in C#.net that downloads our Amazon.com orders and stores them in our local databases.
I ran into an issue where a customer who purchased a product entered a UTF-8 character (℅) - (\xe2\x84\x85) into an address. This seems like a pretty reasonable thing to do, but my program choked when it ran across this order until I put in the following fix.
//get the Address2 subnode
XmlNode Address2Node = singleOrder.SelectSingleNode("ShippingAddress/AddressLine2");
if (Address2Node != null)
{
    GlobalClass.Address2 = Address2Node.InnerXml;
    //** c/o Unicode character messed up program.
    if (GlobalClass.Address2.Contains("℅"))
    {
        GlobalClass.Address2 = GlobalClass.Address2.Replace("℅", "c/o");
        // Console.WriteLine(GlobalClass.Address2.Substring(0,1));
    }
    GlobalClass.Address2 = GlobalClass.Address2.Replace("'", "''"); // escape quotes for the SQL string
}
else
{
    GlobalClass.Address2 = "";
}
Obviously, this will only work for this one field and with this one UTF-8 character. Without the fix, when I tried to INSERT the order, I received an error message which basically amounted to there being an error in my MySQL statement; by the time it reached MySQL, it was trying to INSERT a record with a string like '\xE2\x84\x85...' plus the rest of the string.
Clearly I have no control over what Amazon will allow in the shipping address fields, so I need to account for any odd characters that may come through, but I have no idea how to do that. I had hoped that just allowing for UTF-8 in my connection string (charset=utf8;) would fix it, but that didn't do anything; I still got the same error. Perhaps my Google skills are lacking, but I can't seem to find a way to allow for any odd character that may come my way, and I don't want to have to wait until someone types one to fix the error.
UPDATE:
What about sending "SET NAMES utf8" as a query? This is sort of out of my MySQL knowledge and I don't want to mess anything up, but would this work? And if so, would all programs that I have that use this database need to send that same query?
UPDATE 2: For those who keep asking for the exception error message, it is:
'MySql.Data.MySqlClient.MySqlException' occurred in MySql.Data.dll
Additional information: Incorrect string value: '\xE2\x84\x85 Yo...' for column 'ShipAddressLine2' at row 1
UPDATE 3: From this discussion: SET NAMES utf8 in MySQL? I tried sending "SET NAMES 'cp1250'" and was surprised to see that this allowed the insert to go through with the ℅ character in there. I gather that if, before retrieving the info, I send "SET CHARSET 'utf8'" as a query ahead of the MySQL query that retrieves it, perhaps I will get the correct character back? I'm encouraged that the insert went through after sending the "SET NAMES 'cp1250'" query, but I want to know which encoding to use (cp1250 is Eastern European, and while we have customers from around the globe, most of our customers are in the United States) and make sure this is sound practice before I go changing all my programs to include it. Anybody?
In case someone else has this issue: I first managed to avoid the error by sending the MySQL command SET NAMES 'latin1' to the server before storing data. This allowed any of the UTF-8 characters to be stored without causing a MySQL error (I tested it with several odd characters). However, it stored the UTF-8 characters in a cryptic format, so I went with the better solution below.
In my current solution, I edited the MySQL table and changed the character set for the relevant column that might receive UTF-8 data: I changed the column's character set to utf8mb4 and its collation to utf8mb4_general_ci. This allows the data to be stored properly, so the UTF-8 characters come through correctly.
In addition, when setting the connection string, I added charset=utf8mb4;.
string MyConString = "SERVER=*****;" + "DATABASE=******;" + "UID=********;" + "PASSWORD=*********;" + "charset=utf8mb4;";
although, as far as I can tell, it saves the content to the field the same whether I include the charset= parameter or not.
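As a related hardening step, a parameterized command removes the need for manual quote escaping like Replace("'", "''") and passes arbitrary Unicode through untouched. A sketch using MySql.Data's MySqlClient; the table name is illustrative, while ShipAddressLine2 is the column named in the error above:
string myConString = "SERVER=*****;DATABASE=******;UID=********;PASSWORD=*********;charset=utf8mb4;";
using (var conn = new MySqlConnection(myConString))
using (var cmd = new MySqlCommand(
    "INSERT INTO orders (ShipAddressLine2) VALUES (@addr2);", conn))
{
    // the parameter carries the value as-is; no quote doubling, no string concatenation
    cmd.Parameters.AddWithValue("@addr2", GlobalClass.Address2);
    conn.Open();
    cmd.ExecuteNonQuery();
}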

Deserializing ServiceBus content in Azure Logic App

I'm trying to read the content body of a message in an Azure Logic App, but I'm not having much success. I have seen a lot of suggestions which say that the body is base64 encoded, and suggest using the following to decode:
@{json(base64ToString(triggerBody()?['ContentData']))}
The base64ToString(...) part is decoding the content into a string correctly, but the string appears to contain a prefix with some extra serialization information at the start:
@string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"Bar"}
There are also some extra characters in that string that are not being displayed in my browser. So the json(...) function doesn't accept the input, and gives an error instead.
InvalidTemplate. Unable to process template language expressions in
action 'HTTP' inputs at line '1' and column '2451': 'The template
language function 'json' parameter is not valid. The provided value
@string3http://schemas.microsoft.com/2003/10/Serialization/�3{"Foo":"bar" }
cannot be parsed: Unexpected character encountered while parsing value: #. Path '', line 0, position 0.. Please see https://aka.ms/logicexpressions#json for usage details.'.
For reference, the messages are added to the topic using the .NET service bus client (the client shouldn't matter, but this looks rather C#-ish):
await TopicClient.SendAsync(new BrokeredMessage(JsonConvert.SerializeObject(item)));
How can I read this correctly as a JSON object in my Logic App?
This is caused by how the message is placed on the ServiceBus, specifically in the C# code. I was using the following code to add a new message:
var json = JsonConvert.SerializeObject(item);
var message = new BrokeredMessage(json);
await TopicClient.SendAsync(message);
This code looks fine, and works between different C# services no problem. The problem is caused by the way the BrokeredMessage(Object) constructor serializes the payload given to it:
Initializes a new instance of the BrokeredMessage class from a given object by using DataContractSerializer with a binary XmlDictionaryWriter.
That means the content is serialized as binary XML, which explains the prefix and the unrecognizable characters. This is hidden by the C# implementation when deserializing, and it returns the object you were expecting, but it becomes apparent when using a different library (such as the one used by Azure Logic Apps).
There are two alternatives to handle this problem:
Make sure the receiver can handle messages in binary XML format
Make sure the sender actually uses the format we want, e.g. JSON.
Paco de la Cruz's answer handles the first case, using substring, indexOf and lastIndexOf:
@json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
As for the second case, fixing the problem at the source simply involves using the BrokeredMessage(Stream) constructor instead. That way, we have direct control over the content:
var json = JsonConvert.SerializeObject(item);
var bytes = Encoding.UTF8.GetBytes(json);
var stream = new MemoryStream(bytes);
var message = new BrokeredMessage(stream, true); // true: the message owns the stream and will dispose it
await TopicClient.SendAsync(message);
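On the receiving side (when the consumer is also C#), the stream-based body can be read back without the DataContract envelope. A minimal sketch, assuming the same Microsoft.ServiceBus.Messaging client; Item is a placeholder for your own payload type:
void HandleMessage(BrokeredMessage message)
{
    using (var reader = new StreamReader(message.GetBody<Stream>(), Encoding.UTF8))
    {
        // plain JSON, with no serialization prefix to strip
        var item = JsonConvert.DeserializeObject<Item>(reader.ReadToEnd());
    }
}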
You can use the substring function together with indexOf and lastIndexOf to get only the JSON substring.
Unfortunately, it's rather complex, but it should look something like this:
@json(substring(base64ToString(triggerBody()?['ContentData']), indexof(base64ToString(triggerBody()?['ContentData']), '{'), add(1, sub(lastindexof(base64ToString(triggerBody()?['ContentData']), '}'), indexof(base64ToString(triggerBody()?['ContentData']), '}')))))
More info on how to use these functions here.
HTH
Paco de la Cruz's solution worked for me, though I had to swap out the last '}' in the expression for a '{'; otherwise it finds the wrong end of the data segment.
I also split it into two steps to make it a little more manageable.
First I get the decoded string out of the message into a variable (that I've called MC) using:
@{base64ToString(triggerBody()?['ContentData'])}
then in another logic app action do the substring extraction:
@{substring(variables('MC'),indexof(variables('MC'),'{'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{'))))}
Note that the last string literal '{' is reversed from Paco's solution.
This is working for my test cases, but I'm not sure how robust this is.
Also, I've left it as a String, I do the conversion to JSON later in my logic app.
UPDATE
We have found that just occasionally (2 in several hundred runs) the text that we want to discard can contain the '{' character.
I have modified our expression to explicitly locate the start of the data segment, which for me is:
'{"IntegrationRequest"'
so the substring expression becomes:
@{substring(variables('MC'),indexof(variables('MC'),'{"IntegrationRequest"'),add(1,sub(lastindexof(variables('MC'),'}'),indexof(variables('MC'),'{"IntegrationRequest"'))))}

MySqlException incorrect string value [duplicate]

After noticing an application tended to discard random emails due to incorrect string value errors, I went through and switched many text columns to use the utf8 character set with the default collation (utf8_general_ci) so that it would accept them. This fixed most of the errors and stopped the application from getting SQL errors when it hit non-Latin emails, too.
Despite this, some of the emails still cause the program to hit incorrect string value errors: (Incorrect string value: '\xE4\xC5\xCC\xC9\xD3\xD8...' for column 'contents' at row 1)
The contents column is a MEDIUMTEXT data type using the utf8 character set and the utf8_general_ci collation. There are no flags that I can toggle on this column.
Keeping in mind that I don't want to touch or even look at the application source code unless absolutely necessary:
What is causing that error? (yes, I know the emails are full of random garbage, but I thought utf8 would be pretty permissive)
How can I fix it?
What are the likely effects of such a fix?
One thing I considered was switching to a utf8 varchar([some large number]) with the binary flag turned on, but I'm rather unfamiliar with MySQL, and have no idea if such a fix makes sense.
UPDATE to the below answer:
At the time the question was asked, "UTF8" in MySQL meant utf8mb3. In the meantime, utf8mb4 was added, but to my knowledge MySQL's "UTF8" was never switched to mean utf8mb4.
That means you'd need to specify "utf8mb4" explicitly if that is what you mean (and you should use utf8mb4).
I'll keep this here instead of just editing the answer, to make clear there is still a difference when saying "UTF8".
Original
I would not suggest Richie's answer, because it corrupts the data inside the database: you would not fix your problem, only "hide" it, and you would lose the ability to perform essential database operations on the mangled data.
If you encounter this error, either the data you are sending is not UTF-8 encoded or your connection is not UTF-8. First, verify that the data source (a file, ...) really is UTF-8.
Then, check your database connection, you should do this after connecting:
SET NAMES 'utf8mb4';
SET CHARACTER SET utf8mb4;
Next, verify that the tables where the data is stored have the utf8mb4 character set:
SELECT
`tables`.`TABLE_NAME`,
`collations`.`character_set_name`
FROM
`information_schema`.`TABLES` AS `tables`,
`information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` AS `collations`
WHERE
`tables`.`table_schema` = DATABASE()
AND `collations`.`collation_name` = `tables`.`table_collation`
;
Last, check your database settings:
mysql> show variables like '%colla%';
mysql> show variables like '%charac%';
If source, transport and destination are utf8mb4, your problem is gone;)
MySQL's utf8 type is not actually proper UTF-8 – it uses only up to three bytes per character and supports only the Basic Multilingual Plane (i.e. no emoji, no astral plane, etc.).
If you need to store values from higher Unicode planes, you need the utf8mb4 encodings.
The table and fields have the wrong encoding; however, you can convert them to UTF-8.
ALTER TABLE logtest CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE logtest DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE logtest CHANGE title title VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci;
"\xE4\xC5\xCC\xC9\xD3\xD8" isn't valid UTF-8. Tested using Python:
>>> "\xE4\xC5\xCC\xC9\xD3\xD8".decode("utf-8")
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
If you're looking for a way to avoid decoding errors within the database, the cp1252 encoding (aka "Windows-1252" aka "Windows Western European") is the most permissive encoding there is - every byte value is a valid code point.
Of course it's not going to understand genuine UTF-8 any more, nor any other non-cp1252 encoding, but it sounds like you're not too concerned about that?
I solved this problem today by altering the column to 'LONGBLOB' type which stores raw bytes instead of UTF-8 characters.
The only disadvantage of doing this is that you have to take care of the encoding yourself. If one client of your application uses UTF-8 encoding and another uses CP1252, you may have your emails sent with incorrect characters. To avoid this, always use the same encoding (e.g. UTF-8) across all your applications.
Refer to this page http://dev.mysql.com/doc/refman/5.0/en/blob.html for more details of the differences between TEXT/LONGTEXT and BLOB/LONGBLOB. There are also many other arguments on the web discussing these two.
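In C# terms, "taking care of the encoding yourself" with a BLOB column just means converting explicitly at both ends with one agreed-upon encoding; a sketch of the idea, where emailContents is a placeholder variable:
// writing: convert the text to bytes with the agreed-upon encoding (UTF-8 here)
byte[] raw = Encoding.UTF8.GetBytes(emailContents);
// ... store raw in the LONGBLOB column via a parameterized command ...

// reading: decode with the same encoding, or the text comes back garbled
string contents = Encoding.UTF8.GetString(raw);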
First check if your default_character_set_name is utf8.
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "DBNAME";
If the result is not utf8, you must convert your database. First, save a dump.
To change the character set encoding to UTF-8 for all of the tables in the specified database, type the following command at the command line. Replace DBNAME with the database name:
mysql --database=DBNAME -B -N -e "SHOW TABLES" | awk '{print "SET foreign_key_checks = 0; ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; SET foreign_key_checks = 1; "}' | mysql --database=DBNAME
To change the character set encoding to UTF-8 for the database itself, type the following command at the mysql> prompt. Replace DBNAME with the database name:
ALTER DATABASE DBNAME CHARACTER SET utf8 COLLATE utf8_general_ci;
You can now retry writing UTF-8 characters into your database. This solution helped me when I was uploading 200,000 rows from a CSV file into my database.
Although your collation is set to utf8_general_ci, I suspect that the character encoding of the database, table or even column may be different.
ALTER TABLE table_name MODIFY COLUMN column_name VARCHAR(255)
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
In general, this happens when you insert strings to columns with incompatible encoding/collation.
I got this error when I had TRIGGERs, which for some reason inherit the server's collation.
And MySQL's default is (at least on Ubuntu) latin1 with the Swedish collation.
Even though I had the database and all tables set to UTF-8, I had yet to set my.cnf:
/etc/mysql/my.cnf :
[mysqld]
character-set-server=utf8
default-character-set=utf8
With that in place, this query must list all triggers as utf8-*:
select TRIGGER_SCHEMA, TRIGGER_NAME, CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION from information_schema.TRIGGERS
And some of the variables listed by the following should also be utf8-* (no latin1 or other encoding):
show variables like 'char%';
I got a similar error (Incorrect string value: '\xD0\xBE\xD0\xB2. ...' for column 'content' at row 1). I tried changing the character set of the column to utf8mb4, after which the error changed to 'Data too long for column 'content' at row 1'.
It turned out that MySQL was showing me the wrong error. I changed the character set of the column back to utf8 and changed the type of the column to MEDIUMTEXT. After that the error disappeared.
I hope it helps someone.
By the way, MariaDB in the same case (I tested the same INSERT there) just truncates the text without an error.
That error means that either you have a string with an incorrect encoding (e.g. you're trying to insert an ISO-8859-1 encoded string into a UTF-8 encoded column), or the column does not support the data you're trying to enter.
In practice, the latter problem is caused by MySQL's UTF-8 implementation only supporting Unicode characters that need 1-3 bytes when represented in UTF-8. See "Incorrect string value" when trying to insert UTF-8 into MySQL via JDBC? for details. The trick is to use the column type utf8mb4 instead of utf8, which, despite its name, doesn't actually support all of UTF-8. The former is the correct type to use for all UTF-8 strings.
In my case (Incorrect string value: '\xCC\x88'...), the problem was that an o-umlaut was in its decomposed state. This question-and-answer helped me understand the difference between o¨ and ö. In PHP, the fix for me was to use PHP's Normalizer library, e.g. Normalizer::normalize('o¨', Normalizer::FORM_C).
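For what it's worth, the same normalization is built into .NET, which most of the code on this page uses; a sketch:
using System.Text; // for NormalizationForm

// "o" followed by U+0308 COMBINING DIAERESIS: the decomposed form
string decomposed = "o\u0308";
// Form C composes the pair into the single precomposed character "ö" (U+00F6)
string composed = decomposed.Normalize(NormalizationForm.FormC);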
When I ran into this Incorrect string value: '\xF8' for column error using ScriptCase, the solution was to make sure my database was set up for utf8_general_ci and that my field collations matched. Then, when importing a CSV file, I load the CSV into UE Studio and save it formatted as UTF-8, and voila, it works like a charm: 29,000 records in there, no errors. Previously I had been trying to import an Excel-created CSV.
I have tried all of the above solutions (which all make valid points), but nothing was working for me.
Until I found that my MySQL table field mappings in C# were using an incorrect type: MySqlDbType.Blob. I changed it to MySqlDbType.Text and now I can write all the UTF-8 symbols I want!
p.s. My MySQL table field is of the LONGTEXT type. However, when I autogenerated the field mappings using the MyGeneration software, it automatically set the field type to MySqlDbType.Blob in C#.
Interestingly, I had been using the MySqlDbType.Blob type with UTF-8 characters for many months with no trouble, until one day I tried to write a string with some specific characters in it.
Hope this helps someone who is struggling to find a reason for the error.
If you process the value with some string function before saving, make sure the function can properly handle multibyte characters. String functions that cannot, and that are, say, attempting to truncate, might split one of the multibyte characters down the middle, and that can cause this kind of string error.
In PHP for instance, you would need to switch from substr to mb_substr.
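The C# analogue of the same pitfall: truncating by UTF-16 code units can split a surrogate pair. A hedged sketch of a safe truncation helper (grapheme clusters, such as a letter plus a combining accent, would additionally need System.Globalization.StringInfo):
// Truncate without splitting a surrogate pair down the middle.
static string SafeTruncate(string s, int maxLength)
{
    if (s.Length <= maxLength) return s;
    int cut = maxLength;
    // if the cut would land between a high and a low surrogate, back up one unit
    if (char.IsHighSurrogate(s[cut - 1]) && char.IsLowSurrogate(s[cut]))
        cut--;
    return s.Substring(0, cut);
}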
I added binary before the column name, and that solved the charset error.
insert into tableA values(binary stringcolname1);
I also got this error when I used my online database from a GoDaddy server, which I think runs MySQL version 5.1 or so. When I did the same from my localhost server (version 5.7) it was fine, so I created the table on the local server and copied it to the online server using SQLyog. I think the problem is with the character set.
To fix this error I upgraded my MySQL database to utf8mb4 which supports the full Unicode character set by following this detailed tutorial. I suggest going through it carefully, because there are quite a few gotchas (e.g. the index keys can become too large due to the new encodings after which you have to modify field types).
There are good answers in here. I'm just adding mine since I ran into the same error, but it turned out to be a completely different problem. (Maybe the same on the surface, but a different root cause.)
For me the error happened for the following field:
@Column(nullable = false, columnDefinition = "VARCHAR(255)")
private URI consulUri;
This ends up being stored in the database as a binary serialization of the URI class. It didn't raise any flags with unit testing (using H2) or CI/integration testing (using MariaDB4j), but it blew up in our production-like setup. (Though, once the problem was understood, it was easy enough to see the wrong value in the MariaDB4j instance; it just didn't blow up the test.) The solution was to build a custom type mapper:
package redacted;

import javax.persistence.AttributeConverter;
import java.net.URI;
import java.net.URISyntaxException;

import static java.lang.String.format;

public class UriConverter implements AttributeConverter<URI, String> {
    @Override
    public String convertToDatabaseColumn(URI attribute) {
        return attribute.toString();
    }

    @Override
    public URI convertToEntityAttribute(String field) {
        try {
            return new URI(field);
        }
        catch (URISyntaxException e) {
            throw new RuntimeException(format("could not convert database field to URI: %s", field));
        }
    }
}
Used as follows:
@Column(nullable = false, columnDefinition = "VARCHAR(255)")
@Convert(converter = UriConverter.class)
private URI consulUri;
As far as Hibernate is concerned, it provides a bunch of type mappers, including one for java.net.URL, but not for java.net.URI (which is what we needed here).
In my case, the problem was solved by changing the MySQL column encoding to 'binary' (the data type is changed automatically to VARBINARY). I probably won't be able to filter or search on that column, but I have no need for that.
In my case, I first saw '???' on my website, so I checked MySQL's character set, which was latin1, changed it to UTF-8, and restarted my project; then I got the same error as you. I then found that I had forgotten to change the database's charset as well. I changed it to UTF-8 too and, boom, it worked.
I tried almost every step mentioned here. None worked. I downloaded MariaDB and it worked. I know this is not a solution, yet it might help somebody identify the problem quickly, or serve as a temporary workaround.
Server version: 10.2.10-MariaDB - MariaDB Server
Protocol version: 10
Server charset: UTF-8 Unicode (utf8)
I had a table with a varbinary column that I wanted to convert to utf8mb4 varchar. Unfortunately some of the existing data was invalid UTF-8 and the ALTER query returned Incorrect string value for various rows.
I tried every suggestion I could find regarding cast / convert / char_length = length, etc., but nothing in SQL detected the erroneous values other than the ALTER query returning the bad rows one by one. I would love a pure SQL solution to remove the bad values; sadly, the solution I found is not pretty.
I ended up SELECT *'ing the entire table into PHP, where the erroneous rows could be detected en masse by:
if (empty(htmlspecialchars($row['whatever'])))
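The same en-masse detection is possible in C# by decoding with an exception-throwing fallback; a sketch:
using System.Text;

static bool IsValidUtf8(byte[] bytes)
{
    var strictUtf8 = Encoding.GetEncoding(
        "utf-8",
        EncoderFallback.ExceptionFallback,
        DecoderFallback.ExceptionFallback);
    try
    {
        strictUtf8.GetString(bytes); // throws on any invalid byte sequence
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}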
The problem can also be caused by the client if its charset is not set to utf8mb4. So even if every database, table, and column is set to utf8mb4, you will still get an error, for instance in PyCharm.
For Python, set the charset of the connection in the MySQL Connector connect method:
mydb = mysql.connector.connect(
    host="IP or Host",
    user="<user>",
    passwd="<password>",
    database="<yourDB>",
    # set charset to utf8mb4 to support emojis
    charset='utf8mb4'
)
I know I'm late to the ball, but someone else might come across the problem I had with this and be happy to read my workaround.
I came across this problem with French characters. It turns out that in the text I was copying, the accents on some characters were encoded as two characters (a letter plus a combining accent) and on others as single characters...
I couldn't find how to set my table to accept those strings, so I ended up changing the diacritics in my text import.
Here is a list of them as double characters, so you can search for them in your texts:
ùòìàè
áéíóú
ûôêâî
ç
1 - You have to declare the UTF-8 encoding property on your connection: http://php.net/manual/en/mysqli.set-charset.php.
2 - If you are using the mysql command line to execute a script, you have to use the flag, like:
Cmd: C:\wamp64\bin\mysql\mysql5.7.14\bin\mysql.exe -h localhost -u root -P 3306 --default-character-set=utf8 omega_empresa_parametros_336 < C:\wamp64\www\PontoEletronico\PE10002Corporacao\BancoDeDadosModelo\omega_empresa_parametros.sql
