UTF-16 XML to HTML in .NET - C#

I have a string stored in a SQL Server database table column that is currently a VarChar(Max) but the content is UTF-16 XML. Here is a sample:
<?xml version="1.0" encoding="utf-16" standalone="yes"?><Content><control name="txtGeneral" value="Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
" /></Content>
The data is stored raw, not as the XML data type, but I can do the conversion in my SELECT (see below). I am pulling it out via .NET/ADO, so I have it locally in a string for display in HTML. I just need to convert it for a textbox or HTML element so that it displays on the screen.
I can parse out the attribute (@value) I want in T-SQL, but this does not do the encoding changes for me. Here is my sample query:
SELECT TOP 1 CONVERT(XML,CONVERT(NVARCHAR(MAX),m.Content)).value('(/Content/control/@value)[1]', 'varchar(max)')
FROM Messages m
WHERE MessageID = 85713;
I can use either .NET or T-SQL for the conversion. I will be selecting only a single message at a time, so performance should not be an issue.
This is what I would like it to look like:
Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
convert via: https://r12a.github.io/apps/conversion/
Thanks!

There are several serious flaws here:
Do not store XML as a string; use the native XML data type
Do not handle XML as a string; use the native XML methods
If, for any reason, you have to deal with it at the string level, use NVARCHAR(MAX)
Never use 1-byte-encoded VARCHAR(MAX). This will need extra conversions and can lead to silly errors.
Do not store the XML declaration <?xml blah ?>. It is only needed to specify a file's encoding; within SQL Server, XML is always Unicode / UCS-2
If you can change the above, you should really consider doing so. If not, here's an approach:
First cast the VARCHAR(MAX) to NVARCHAR(MAX), then to XML. With NVARCHAR(MAX), the utf-16 declaration no longer gets in the way. Then use .value() to retrieve the named attribute.
DECLARE @mockMessages TABLE(Content VARCHAR(MAX));
INSERT INTO @mockMessages VALUES
('<?xml version="1.0" encoding="utf-16" standalone="yes"?><Content><control name="txtGeneral" value="Hi Bryan,
This is a sample message stored in the database that I need to get out in HTML. I can&amp;#39;t seem to figure out how to get it out into HTML.
Thanks!
Robot.
-----Original Message-----
Date: 08-21-15 19:57
From: System Test, Microsoft Corp
To: Framework.NET
Subject: RE: RE: RE: RE:
" /></Content>');
SELECT CAST(CAST(m.Content AS NVARCHAR(MAX)) AS XML).value(N'(/Content/control/@value)[1]',N'nvarchar(max)')
FROM @mockMessages AS m;
The same applies, in principle, in .NET.
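For illustration, a minimal C# sketch (assuming the raw column value has already been read into a string via ADO.NET; the shortened sample data and names are illustrative, and the double-encoded &amp;#39; in the sample needs one extra HTML decode):

using System;
using System.Net;
using System.Xml.Linq;

class Demo
{
    static void Main()
    {
        // raw column value as read via ADO.NET (e.g. SqlDataReader.GetString)
        string raw = "<?xml version=\"1.0\" encoding=\"utf-16\" standalone=\"yes\"?>"
                   + "<Content><control name=\"txtGeneral\" value=\"Hi Bryan, I can&amp;#39;t seem to figure it out.\" /></Content>";

        // .NET strings are already UTF-16, so the utf-16 declaration causes no trouble here
        XDocument doc = XDocument.Parse(raw);
        string text = (string)doc.Root.Element("control").Attribute("value");

        // the sample data is double-encoded, so one extra HTML decode resolves &#39; to '
        text = WebUtility.HtmlDecode(text);

        // HTML-encode again (and turn newlines into <br/>) before writing it into the page
        Console.WriteLine(text);
    }
}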
UPDATE: Some words about encoding
SQL Server supports neither UTF-8 nor real UTF-16. There is a 1-byte encoding, which is extended ASCII (code page / character mapping), and a 2-byte encoding, which is Unicode / UCS-2 (almost the same as UTF-16, at least for 99% of the usually seen characters). If you need your output UTF-8 encoded, you must do this in your application. In almost all cases you can treat SQL Server's XML output (in UCS-2) as UTF-16. The communication between SQL Server and .NET code is Unicode by default.

Related

Remove control characters but keep ½ characters etc

I have a T-SQL stored procedure that returns a flattened list of results by using the FOR XML command to convert to XML.
I occasionally hit a problem where data from a third party containing control characters is streamed into one of the converted varchar fields.
I resolved this by base64-encoding the varchar before performing the conversion:
cast(InvalidText as varbinary) as FixedText
I then decode this from base 64 in my C# application.
This works great, except when the text includes a symbol such as ½. After decoding these characters, they are shown as �.
I need to display these characters. Is there a way I can solve both problems?
EDIT: I have tried specifying UTF-8 encoding when sending my XML into my C# application. This has not helped.
Here's a simplified example of what's happening:
SQL:
select cast('Take ½ of the total' as varbinary) for xml path ('result'), type;
Then I pass this encoded string to my C# application.
C#:
using System;
using System.Text;
public class Program
{
    public static void Main()
    {
        var encodedText = "VGFrZSC9IG9mIHRoZSB0b3RhbA=="; // From SQL encoding above
        var decodedText = Encoding.UTF8.GetString(
            Convert.FromBase64String(encodedText));
        Console.WriteLine(decodedText);
    }
}
Console output: Take � of the total
Manually adding an encoding declaration at the start of the XML document produces the same results.
I'm not quite sure about your issue, but I think you might be digging in the wrong spot.
SQL Server knows two kinds of strings to work with:
1-byte encoding: extended ASCII, where the collation defines all non-plain-latin characters
2-byte encoding: UCS-2, which is almost the same as UTF-16
Just to mention it: Starting with v2019 there are special collations supporting UTF8
As long as you don't mix 1- and 2-byte strings in binary approaches, this works pretty well.
Try this:
SELECT 'A½B' AS UsingASCII
,CAST('A½B' AS VARBINARY(MAX)) AS UsingASCIIasBinary
,N'A½B' AS UsingUCS2
,CAST(N'A½B' AS VARBINARY(MAX)) AS UsingUCS2asBinary
FOR XML PATH('')
returns:
Text  binary          base64
A½B   0x41BD42        Qb1C
A½B   0x4100BD004200  QQC9AEIA
You can see the hex codes 41, BD and 42 for the three characters, and the 00 bytes that make it 2-byte UCS-2.
Code points 41 and 42 are "A" and "B", while code point BD stands for your special character.
There is nothing miraculous about SQL Server's results...
In SQL Server, the FOR XML statement returns native XML, whose output format is NVARCHAR(MAX) by default. This will certainly not be UTF-8.
Reconvert the base64 from the example above
DECLARE @xml XML=
N'<binaryASCII>Qb1C</binaryASCII>
<binaryUCS2>QQC9AEIA</binaryUCS2>';
SELECT @xml.value('(/binaryASCII)[1]','varbinary(max)')
      ,CAST(@xml.value('(/binaryASCII)[1]','varbinary(max)') AS VARCHAR(MAX)) ReconvertedFromASCII
      ,@xml.value('(/binaryUCS2)[1]','varbinary(max)')
      ,CAST(@xml.value('(/binaryUCS2)[1]','varbinary(max)') AS NVARCHAR(MAX)) ReconvertedFromUCS2;
Reading base64 in T-SQL needs a little XML-hack:
Your base64 example:
SELECT CAST(CAST('VGFrZSC9IG9mIHRoZSB0b3RhbA==' AS XML)
.value('.','varbinary(max)') AS VARCHAR(MAX));
My system returns the "half" symbol correctly. This lets me assume that your default collation maps a different character (or none) to this code point.
Find out your default collation, check the involved columns' collations, and read about COLLATE.
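On the C# side of the question's example, a sketch of one way to decode (an assumption: the base64 was produced from a 1-byte VARCHAR whose code page maps 0xBD to ½, as ISO-8859-1 / Windows-1252 do):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        // base64 from the question; the bytes are 1-byte extended ASCII, not UTF-8
        var encodedText = "VGFrZSC9IG9mIHRoZSB0b3RhbA==";
        byte[] bytes = Convert.FromBase64String(encodedText);

        // UTF-8 chokes on the lone 0xBD byte; a single-byte code page decodes it fine
        string decoded = Encoding.GetEncoding("iso-8859-1").GetString(bytes);
        Console.WriteLine(decoded); // Take ½ of the total
    }
}

Alternatively, cast to NVARCHAR before VARBINARY on the SQL side and decode with Encoding.Unicode (UTF-16 LE) in C#.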

Arabic_CI_AS to utf8 in C#

I have a database in SQL Server with the collation Arabic_CI_AS, and I need to compare some string data with another Postgres database with the UTF-8 character set. I use C# to convert and compare. It works when the string contains just one word (in those cases I just replace 'ي' with 'ی'), but long strings, especially ones containing the '(' character, are a problem.
I can't get it to work. I have tried some suggested solutions, such as:
var enc = Encoding.GetEncoding(1256);
byte[] encBytes = enc.GetBytes(customer.name);
customer.name = Encoding.UTF8.GetString(encBytes, 0, encBytes.Length);
or:
SELECT cast (name as nvarchar) as NewName
from Customer
But they don't work. Can anyone help me?
Example of input and output: (screenshot omitted)
Maybe this can help you change your collation dynamically:
SELECT name collate SQL_Latin1_General_CP1_CI_AS
from Customer
or
SELECT name collate Persian_100_CI_AI
from Customer
or
you can try this on the C# side:
string _Value = string.Empty;
byte[] enBuff = Encoding.GetEncoding("windows-1256").GetBytes(customer.name);
customer.name = Encoding.GetEncoding("windows-1252").GetString(enBuff);
You can try other collations too; you may have to try several collations and encoding numbers to get the wanted result.
SQL Server does not support UTF-8 strings. If you have to deal with characters other than plain Latin, it is strongly recommended to use NVARCHAR instead of VARCHAR with an Arabic collation.
Many people think that NVARCHAR is UTF-16 while VARCHAR is UTF-8. This is not true! The latter is extended ASCII, using 1 byte in every case, while UTF-8 encodes some characters with more than one byte.
So - the most important question is: WHY?
SQL Server can take your string into an NVARCHAR variable, cast it to a chain of bytes, and re-cast it to the former string:
DECLARE @str NVARCHAR(MAX)=N'(نماینده اراک)';
SELECT @str
      ,CAST(@str AS VARBINARY(MAX))
      ,CAST(CAST(@str AS VARBINARY(MAX)) AS NVARCHAR(MAX));
The problem with the ) is, quite probably, that your Arabic letters are right-to-left while the ) is left-to-right. I wanted to paste the result of the query above into this answer but did not manage to get the closing ) into its original place... You try to edit, delete, replace, but you get something else... Somewhat funny, but it is not a question of bad encoding; it is one of buggy editors...
Anyway, SQL Server is not your issue. You must read the string out of SQL Server as NVARCHAR. C# works with Unicode strings, not with collated 1-byte strings. Every conversion carries the chance of destroying your text.
If your target (or the tooltip you showed us) is not capable of showing the string properly, the string itself might be perfectly okay while the viewer is not...
If you pass such a UTF-8 string back to SQL Server, you'll get a mess...
The only place where UTF-8 makes sense is text written to a file or transmitted over low bandwidth. If a text contains very many plain Latin characters and just a few special letters (as XML and HTML very often do), you can save quite some disk space or bandwidth. With a far-east text you'd even bloat your text: some of those characters need 3 or even 4 bytes to be encoded.
Within your database and application you should stick with Unicode.
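As a sketch of that last point, reading the value with no re-encoding at all (the connection string and table/column names are illustrative; the column is assumed to be NVARCHAR):

using System;
using System.Data.SqlClient; // or Microsoft.Data.SqlClient on newer stacks

class Demo
{
    static void Main()
    {
        using (var con = new SqlConnection("Server=.;Database=MyDb;Integrated Security=true"))
        using (var cmd = new SqlCommand("SELECT name FROM Customer;", con))
        {
            con.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // GetString returns the NVARCHAR value as a .NET (UTF-16) string;
                    // no Encoding.GetEncoding(...) round trips are needed or wanted
                    string name = reader.GetString(0);
                    Console.WriteLine(name);
                }
            }
        }
    }
}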

How to setup encoding for Bosnian (or Croatian or Slovenian) characters set using MySql and Umbraco 4.7.1

I am having a problem displaying the characters š and ž on the frontend when I insert them as the textstring data type inside Umbraco 4.7.1. Umbraco uses a MySQL database.
I noticed that this is not a problem when I save those characters with a rich text editor. I looked at the database, and all rich text editor values are stored in the XML inside CDATA, but the textstring data type isn't inside CDATA.
All other Bosnian-specific characters (čćđ) are HTML encoded as numeric entities, but š and ž are saved as plain s and z.
When I try to change the textstring database data type to ntext instead of varchar, it works (because it stores in CDATA), but I cannot do that because then I would lose all of my existing data.
My HTML encoding charset is iso-8859-1.
What to do here?
Funnily enough, I was discussing the Croatian alphabet (in a non-computing context) with someone recently and they gave me a gem-of-a-link, which states the following with regards to the Croatian alphabet:
The 8-bit ISO 8859-2 (Latin-2) standard was developed by ISO.[1]
ISO 8859-1[2] only has partial support for languages that use a similar character set, whilst ISO 8859-2[3] provides full support for Bosnian, Croatian and many other languages.
Changing your encoding should fix the problem.
[1] http://en.wikipedia.org/wiki/Gaj%27s_Latin_alphabet#Computing
[2] http://en.wikipedia.org/wiki/ISO_8859-1
[3] http://en.wikipedia.org/wiki/ISO_8859-2
Actually, windows-1251 helped; it encodes the wanted characters.
After a while I figured out that I can use the default encoding (UTF-8), but I have to change the database collations. So I changed the collations on every table column that had varchar or ntext, and now it works fully. That is the best solution I have found so far.

Insert Russian Language data into database from an array

My query looks like:
string str = string.Format("Insert into [MyDB].[dbo].[tlb1] ([file_path],[CONTENT1],[CONTENT2]) values ('{0}','{1}','{2}');", fullpath, _val[0], _val[1]);
When the array _val[] contains English-language data, it inserts correctly; but when the array contains Russian-language data, the database shows it like ???????????????????????
Is there a way to insert Russian-language data from an array?
According to this (Archived) Microsoft Support Issue:
You must precede all Unicode strings with a prefix N when you deal with Unicode string constants in SQL Server
First of all, you should use prepared statements and let the database driver handle the parameters correctly (i.e. SqlCommand with parameters). Then the issue should go away (along with any potential SQL injection problems).
As a quick fix in your case: Prefix the string literals you're inserting with N:
... values (N'{0}',N'{1}',N'{2}')
This causes the literals to be Unicode literals, not arbitrary-legacy-codepage ones and thus preventing the conversion from Unicode to the legacy codepage (which results in question marks for characters that cannot be represented).
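A sketch of the parameterized version (connectionString is illustrative; assumes the CONTENT columns are nvarchar):

using System.Data.SqlClient;

class Demo
{
    static void Insert(string connectionString, string fullpath, string[] _val)
    {
        using (var con = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "INSERT INTO [MyDB].[dbo].[tlb1] ([file_path],[CONTENT1],[CONTENT2]) " +
            "VALUES (@path, @c1, @c2);", con))
        {
            // string parameters are sent as NVARCHAR, so Russian text arrives intact
            cmd.Parameters.AddWithValue("@path", fullpath);
            cmd.Parameters.AddWithValue("@c1", _val[0]);
            cmd.Parameters.AddWithValue("@c2", _val[1]);
            con.Open();
            cmd.ExecuteNonQuery();
        }
    }
}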
It seems that the data type of columns [Content1] and [Content2] is nchar. You should convert the columns to nvarchar, which is used to store Unicode data.
First of all, check the database code page on the server. There may be a non-Unicode code page in the database while the data from your app comes in Unicode format.

MySqlException incorrect string value [duplicate]

After noticing that an application tended to discard random emails due to incorrect string value errors, I went through and switched many text columns to the utf8 column charset and the default column collation (utf8_general_ci) so that it would accept them. This fixed most of the errors and stopped the application from getting SQL errors when it hit non-Latin emails, too.
Despite this, some of the emails still cause the program to hit incorrect string value errors: (Incorrect string value: '\xE4\xC5\xCC\xC9\xD3\xD8...' for column 'contents' at row 1)
The contents column is a MEDIUMTEXT datatype with the utf8 column charset and the utf8_general_ci column collation. There are no flags that I can toggle on this column.
Keeping in mind that I don't want to touch or even look at the application source code unless absolutely necessary:
What is causing that error? (yes, I know the emails are full of random garbage, but I thought utf8 would be pretty permissive)
How can I fix it?
What are the likely effects of such a fix?
One thing I considered was switching to a utf8 varchar([some large number]) with the binary flag turned on, but I'm rather unfamiliar with MySQL, and have no idea if such a fix makes sense.
UPDATE to the below answer:
At the time the question was asked, "UTF8" in MySQL meant utf8mb3. In the meantime utf8mb4 was added, but to my knowledge MySQL's "UTF8" was not switched to mean utf8mb4.
That means you'd need to specifically write "utf8mb4" if you mean it (and you should use utf8mb4).
I'll keep this here instead of just editing the answer, to make clear there is still a difference when saying "UTF8"
Original
I would not suggest Richie's answer, because you would be mangling the data inside the database. You would not fix the problem, just try to "hide" it, and you would not be able to perform essential database operations on the corrupted data.
If you encounter this error, either the data you are sending is not UTF-8 encoded or your connection is not UTF-8. First, verify that the data source (a file, ...) really is UTF-8.
Then, check your database connection, you should do this after connecting:
SET NAMES 'utf8mb4';
SET CHARACTER SET utf8mb4;
Next, verify that the tables where the data is stored have the utf8mb4 character set:
SELECT
`tables`.`TABLE_NAME`,
`collations`.`character_set_name`
FROM
`information_schema`.`TABLES` AS `tables`,
`information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` AS `collations`
WHERE
`tables`.`table_schema` = DATABASE()
AND `collations`.`collation_name` = `tables`.`table_collation`
;
Last, check your database settings:
mysql> show variables like '%colla%';
mysql> show variables like '%charac%';
If source, transport and destination are utf8mb4, your problem is gone;)
MySQL’s utf-8 types are not actually proper utf-8 – it only uses up to three bytes per character and supports only the Basic Multilingual Plane (i.e. no Emoji, no astral plane, etc.).
If you need to store values from higher Unicode planes, you need the utf8mb4 encodings.
The table and fields have the wrong encoding; however, you can convert them to UTF-8.
ALTER TABLE logtest CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE logtest DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE logtest CHANGE title title VARCHAR(100) CHARACTER SET utf8 COLLATE utf8_general_ci;
"\xE4\xC5\xCC\xC9\xD3\xD8" isn't valid UTF-8. Tested using Python:
>>> "\xE4\xC5\xCC\xC9\xD3\xD8".decode("utf-8")
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: invalid data
If you're looking for a way to avoid decoding errors within the database, the cp1252 encoding (aka "Windows-1252" aka "Windows Western European") is the most permissive encoding there is - every byte value is a valid code point.
Of course it's not going to understand genuine UTF-8 any more, nor any other non-cp1252 encoding, but it sounds like you're not too concerned about that?
I solved this problem today by altering the column to 'LONGBLOB' type which stores raw bytes instead of UTF-8 characters.
The only disadvantage of doing this is that you have to take care of the encoding yourself. If one client of your application uses UTF-8 encoding and another uses CP1252, you may have your emails sent with incorrect characters. To avoid this, always use the same encoding (e.g. UTF-8) across all your applications.
Refer to this page http://dev.mysql.com/doc/refman/5.0/en/blob.html for more details of the differences between TEXT/LONGTEXT and BLOB/LONGBLOB. There are also many other arguments on the web discussing these two.
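To illustrate "taking care of the encoding yourself", a tiny C# sketch (names are illustrative): encode with one agreed-upon encoding before writing the BLOB, and decode with the same one when reading it back:

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        string emailBody = "Take ½ of the total";

        // store: encode explicitly before writing the bytes to the LONGBLOB column
        byte[] blob = Encoding.UTF8.GetBytes(emailBody);

        // load: decode with the very same encoding when reading the bytes back
        string roundTripped = Encoding.UTF8.GetString(blob);
        Console.WriteLine(roundTripped);
    }
}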
First check if your default_character_set_name is utf8.
SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "DBNAME";
If the result is not utf8, you must convert your database. First, save a dump.
To change the character set encoding to UTF-8 for all of the tables in the specified database, type the following command at the command line. Replace DBNAME with the database name:
mysql --database=DBNAME -B -N -e "SHOW TABLES" | awk '{print "SET foreign_key_checks = 0; ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci; SET foreign_key_checks = 1; "}' | mysql --database=DBNAME
To change the character set encoding to UTF-8 for the database itself, type the following command at the mysql> prompt. Replace DBNAME with the database name:
ALTER DATABASE DBNAME CHARACTER SET utf8 COLLATE utf8_general_ci;
You can now retry writing utf8 characters into your database. This solution helped me when I tried to upload 200,000 rows from a CSV file into my database.
Although your collation is set to utf8_general_ci, I suspect that the character encoding of the database, table or even column may be different.
ALTER TABLE table_name MODIFY COLUMN column_name VARCHAR(255)
CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
In general, this happens when you insert strings to columns with incompatible encoding/collation.
I got this error when I had TRIGGERs, which inherit the server's collation for some reason.
And MySQL's default is (at least on Ubuntu) latin1 with the Swedish collation.
Even though I had the database and all tables set to UTF-8, I had yet to set my.cnf:
/etc/mysql/my.cnf :
[mysqld]
character-set-server=utf8
default-character-set=utf8
After that, this should list all triggers with utf8-*:
select TRIGGER_SCHEMA, TRIGGER_NAME, CHARACTER_SET_CLIENT, COLLATION_CONNECTION, DATABASE_COLLATION from information_schema.TRIGGERS
And some of the variables listed by the following should also be utf8-* (not latin1 or another encoding):
show variables like 'char%';
I got a similar error (Incorrect string value: '\xD0\xBE\xD0\xB2. ...' for 'content' at row 1). I tried changing the character set of the column to utf8mb4, and after that the error changed to 'Data too long for column 'content' at row 1'.
It turned out that MySQL was showing me the wrong error. I changed the character set of the column back to utf8 and changed the type of the column to MEDIUMTEXT. After that the error disappeared.
I hope it helps someone.
By the way, MariaDB in the same case (I tested the same INSERT there) just truncates the text without an error.
That error means that either you have the string with incorrect encoding (e.g. you're trying to enter ISO-8859-1 encoded string into UTF-8 encoded column), or the column does not support the data you're trying to enter.
In practice, the latter problem is caused by MySQL UTF-8 implementation that only supports UNICODE characters that need 1-3 bytes when represented in UTF-8. See "Incorrect string value" when trying to insert UTF-8 into MySQL via JDBC? for details. The trick is to use column type utf8mb4 instead of type utf8 which doesn't actually support all of UTF-8 despite the name. The former type is the correct type to use for all UTF-8 strings.
In my case, Incorrect string value: '\xCC\x88'..., the problem was that an o-umlaut was in its decomposed state. This question-and-answer helped me understand the difference between o¨ and ö. In PHP, the fix for me was to use PHP's Normalizer library. E.g., Normalizer::normalize('o¨', Normalizer::FORM_C).
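The same normalization exists in .NET, for comparison (a sketch; string.Normalize composes the decomposed pair into a single code point):

using System;
using System.Text;

class Demo
{
    static void Main()
    {
        string decomposed = "o\u0308";                                   // 'o' + combining diaeresis, two code points
        string composed = decomposed.Normalize(NormalizationForm.FormC); // "ö", one code point
        Console.WriteLine($"{decomposed.Length} -> {composed.Length}");  // 2 -> 1
    }
}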
The solution for me, when running into this Incorrect string value: '\xF8' for column error using ScriptCase, was to make sure my database was set up for utf8_general_ci and that my field collations matched. Then, when importing a CSV file, I loaded the CSV into UE Studio and saved it encoded as UTF-8, and voilà! It worked like a charm: 29,000 records, no errors. Previously I had been trying to import an Excel-created CSV.
I tried all of the above solutions (which all bring up valid points), but nothing worked for me.
Until I found that my MySQL table field mappings in C# were using an incorrect type: MySqlDbType.Blob. I changed it to MySqlDbType.Text and now I can write all the UTF-8 symbols I want!
P.S. My MySQL table field is of the LONGTEXT type. However, when I autogenerated the field mappings using the MyGeneration software, it automatically set the field type to MySqlDbType.Blob in C#.
Interestingly, I have been using the MySqlDbType.Blob type with UTF8 characters for many months with no trouble, until one day I tried writing a string with some specific characters in it.
Hope this helps someone who is struggling to find a reason for the error.
If you happen to process the value with some string function before saving, make sure that function can properly handle multibyte characters. String functions that cannot, and are, say, truncating, might split one of the multibyte characters in the middle, and that can cause such string errors.
In PHP, for instance, you would need to switch from substr to mb_substr.
I added binary before the column name, and that solved the charset error:
insert into tableA values(binary stringcolname1);
Hi, I also got this error when I used my online database on a GoDaddy server, which I think runs MySQL version 5.1 or so. But when I did it from my localhost server (version 5.7) it was fine. So I created the table on the local server and copied it to the online server using SQLyog. I think the problem is with the character set.
To fix this error I upgraded my MySQL database to utf8mb4 which supports the full Unicode character set by following this detailed tutorial. I suggest going through it carefully, because there are quite a few gotchas (e.g. the index keys can become too large due to the new encodings after which you have to modify field types).
There are good answers here. I'm just adding mine since I ran into the same error, but it turned out to be a completely different problem. (Maybe the same on the surface, but a different root cause.)
For me the error happened for the following field:
@Column(nullable = false, columnDefinition = "VARCHAR(255)")
private URI consulUri;
This ends up being stored in the database as a binary serialization of the URI class. It didn't raise any flags in unit testing (using H2) or CI/integration testing (using MariaDB4j); it blew up in our production-like setup. (Once the problem was understood, it was easy enough to see the wrong value in the MariaDB4j instance; it just didn't blow up the test.) The solution was to build a custom type mapper:
package redacted;

import javax.persistence.AttributeConverter;
import java.net.URI;
import java.net.URISyntaxException;

import static java.lang.String.format;

public class UriConverter implements AttributeConverter<URI, String> {
    @Override
    public String convertToDatabaseColumn(URI attribute) {
        return attribute.toString();
    }

    @Override
    public URI convertToEntityAttribute(String field) {
        try {
            return new URI(field);
        }
        catch (URISyntaxException e) {
            throw new RuntimeException(format("could not convert database field to URI: %s", field));
        }
    }
}
Used as follows:
@Column(nullable = false, columnDefinition = "VARCHAR(255)")
@Convert(converter = UriConverter.class)
private URI consulUri;
As far as Hibernate is concerned, it comes with a bunch of provided type mappers, including one for java.net.URL, but not for java.net.URI (which is what we needed here).
In my case the problem was solved by changing the MySQL column encoding to 'binary' (the data type is changed automatically to VARBINARY). I probably won't be able to filter or search on that column, but I don't need to.
In my case, I first got '???' on my website, so I checked MySQL's character set, which was latin1, changed it to utf-8, and restarted my project. I then got the same error as you, and found that I had forgotten to change the database's charset as well. I changed it to utf-8 too, and boom, it worked.
I tried almost every step mentioned here. None worked. I downloaded MariaDB, and it worked. I know this is not a real solution, yet it might help somebody identify the problem quickly or serve as a temporary workaround.
Server version: 10.2.10-MariaDB - MariaDB Server
Protocol version: 10
Server charset: UTF-8 Unicode (utf8)
I had a table with a varbinary column that I wanted to convert to utf8mb4 varchar. Unfortunately some of the existing data was invalid UTF-8, and the ALTER query returned Incorrect string value for various rows.
I tried every suggestion I could find regarding cast / convert / char_length = length etc., but nothing in SQL detected the erroneous values, other than the ALTER query returning the bad rows one by one. I would love a pure SQL solution to remove the bad values. Sadly this solution is not pretty:
I ended up SELECT *'ing the entire table into PHP, where the erroneous rows could be detected en masse by:
if (empty(htmlspecialchars($row['whatever'])))
The problem can also be caused by the client if its charset is not set to utf8mb4. Even if every database, table, and column is set to utf8mb4, you will still get an error, for instance in PyCharm.
For Python, set the charset of the connection in the MySQL Connector connect method:
import mysql.connector

mydb = mysql.connector.connect(
    host="IP or Host",
    user="<user>",
    passwd="<password>",
    database="<yourDB>",
    # set charset to utf8mb4 to support emojis
    charset='utf8mb4'
)
I know I'm late to the party, but someone else might come across the problem I had with this and be happy to read my workaround.
I came across this problem with French characters. It turned out that the text I was copying had the accents on some characters encoded as two characters (a base letter plus a combining diacritic) and on others as single characters...
I couldn't find out how to set my table to accept those strings, so I ended up changing the diacritics in my text import.
Here is a list of them as decomposed double characters, so you can search for them in your texts:
ùòìàè
áéíóú
ûôêâî
ç
1 - You have to declare the UTF-8 encoding property in your connection: http://php.net/manual/en/mysqli.set-charset.php.
2 - If you are using the mysql command line to execute a script, you have to use the flag, like:
Cmd: C:\wamp64\bin\mysql\mysql5.7.14\bin\mysql.exe -h localhost -u root -P 3306 --default-character-set=utf8 omega_empresa_parametros_336 < C:\wamp64\www\PontoEletronico\PE10002Corporacao\BancoDeDadosModelo\omega_empresa_parametros.sql
