How to remove BOM from an encoded base64 UTF string?

I have a file that was Base64-encoded on macOS with the command openssl base64 -in en -out en1, and I am reading it with the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The resulting string has a ? before the actual file content. I am not sure what is causing this; any help would be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json

Normally, when you read UTF-8 text (with a BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF-8 text correctly regardless of whether the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
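The problem is that you're dealing with UTF-8 text that has been encoded as a Base64 string, so ReadAllText() can no longer handle the BOM for you: after Convert.FromBase64String(), the BOM bytes are still at the start of the array, and Encoding.UTF8.GetString() decodes them as a character (U+FEFF) instead of stripping them. You can either remove the BOM yourself by checking for and removing the first 3 bytes of the byte array, or delegate that job to a StreamReader, which is exactly what ReadAllText() does: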
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
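The other option mentioned above, checking for the BOM and skipping its 3 bytes by hand, could look roughly like the sketch below. DecodeUtf8WithoutBom is a hypothetical helper name, not part of .NET or of the original answer, and the Base64 literal is the sample from the question.

```csharp
using System;
using System.Text;

// Hypothetical helper (an assumption for illustration): decodes UTF-8 bytes,
// skipping a leading UTF-8 BOM (EF BB BF) when one is present.
static string DecodeUtf8WithoutBom(byte[] bytes)
{
    byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
    bool hasBom = bytes.Length >= bom.Length
        && bytes[0] == bom[0]
        && bytes[1] == bom[1]
        && bytes[2] == bom[2];
    int offset = hasBom ? bom.Length : 0;
    return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
}

// Base64 sample taken from the question.
byte[] decoded = Convert.FromBase64String(
    "77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K");
string text = DecodeUtf8WithoutBom(decoded);

// The first character is now the carriage return of the file's first line,
// not the BOM character U+FEFF.
Console.WriteLine(text[0] == '\r');
```

Input without a BOM passes through unchanged, so the helper is safe to call on either kind of file.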

Related

Encoding issue with spanish file in C#

I have a Spanish-language file stored online in Azure Blob Storage. Some words contain special characters (for example: Almacén).
When I open the file in Notepad++, the encoding is shown as ANSI.
I then try to read the file with this code:
using StreamReader reader = new StreamReader(Stream, Encoding.UTF8);
blobStream.Seek(0, SeekOrigin.Begin);
var allLines = await reader.ReadToEndAsync();
The problem is that "allLines" is not decoded correctly; I get output such as: Almac�n
I have tried some solutions, such as this one:
C# Convert string from UTF-8 to ISO-8859-1 (Latin1) H
but it is still not working.
(The final goal is to merge two CSV files: I read the streams of both, remove the header, and concatenate the strings to upload the result again. If there is a better way to merge CSV files in C# that avoids this encoding issue, I am open to it as well.)

You are trying to read a file that is not UTF-8 encoded as if it were UTF-8 encoded. I can replicate this issue with:
var s = "Almacén";
using var memStream = new MemoryStream(Encoding.GetEncoding(28591).GetBytes(s));
using var reader = new StreamReader(memStream, Encoding.UTF8);
var allLines = await reader.ReadToEndAsync();
Console.WriteLine(allLines); // writes "Almac�n" to console
You should read the file with the ISO-8859-1 encoding, "Western European (ISO)", which is code page 28591:
using var reader = new StreamReader(Stream, Encoding.GetEncoding(28591));
var allLines = await reader.ReadToEndAsync();
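For the stated goal of merging two CSV files, a minimal sketch along these lines may be enough, assuming both files use ISO-8859-1, share the same single-line header, and contain no quoted fields with embedded line breaks. MergeCsv and the sample rows are hypothetical names and data, and the MemoryStream instances stand in for the blob streams.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper (an assumption for illustration): concatenates two CSV
// payloads and keeps the header row of the first one only.
static string MergeCsv(Stream first, Stream second, Encoding encoding)
{
    var merged = new StringBuilder();
    using (var reader = new StreamReader(first, encoding))
    {
        merged.Append(reader.ReadToEnd());
    }
    // Make sure the first file ends with a line break before appending.
    if (merged.Length > 0 && merged[merged.Length - 1] != '\n')
    {
        merged.Append('\n');
    }
    using (var reader = new StreamReader(second, encoding))
    {
        reader.ReadLine();                  // skip the duplicate header row
        merged.Append(reader.ReadToEnd());
    }
    return merged.ToString();
}

Encoding latin1 = Encoding.GetEncoding(28591);

// In-memory stand-ins for the two blob streams.
using var a = new MemoryStream(latin1.GetBytes("id;name\n1;Almacén\n"));
using var b = new MemoryStream(latin1.GetBytes("id;name\n2;Señal\n"));

string result = MergeCsv(a, b, latin1);
Console.WriteLine(result);
```

When the merged text is written back, the same encoding (or a deliberately chosen one such as UTF-8) should be passed to the writer so that the accented characters survive the round trip.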

Which character encoding should I use for Tab-delimited flat file?

We are calling the Report Type ‘_GET_MERCHANT_LISTINGS_DATA_’ of the MWS API from a C# web application.
Sometimes we get the � character instead of a single quote, a space, or other special characters when encoding the data.
We used the Encoding.GetEncoding(1252) method to re-encode the text read by the StreamReader.
We are using the code below.
Stream s = reportRequest.Report;
StreamReader stream_reader = new StreamReader(s);
string reportResponseText = stream_reader.ReadToEnd();
byte[] byteArray = Encoding.GetEncoding(1252).GetBytes(reportResponseText);
MemoryStream stream = new MemoryStream(byteArray);
StreamReader filestream = new StreamReader(stream);
We have also tried ‘Encoding.UTF8.GetBytes(reportResponseText)’, but it did not help.
Could anyone please suggest the correct way to encode the data?

Dynamic loaded XML and XmlTextReader character encoding issue

I load an XML file like this:
var url = Application.dataPath + @"/config.xml";
var www = new WWW(url);
while (!www.isDone)
{
    yield return new WaitForSeconds(0.2f);
}
After that I create an XmlTextReader in order to parse that XML:
GameSettings.ParseXML(new XmlTextReader(new StringReader(www.text)));
But I'm having problems with character encoding (é, ç, ã, ê, etc.). What can I do to make it work?

If you use WWW.text, the function expects the web page contents to be encoded in UTF-8 or ASCII, but your customer uses Windows-1252.
Like Bart already suggested, the best way would be to request that the customer just uses UTF-8. If that is not possible and you are sure that the customer always uses Windows-1252, you can convert the encoding inside your application.
Encoding windows1252 = Encoding.GetEncoding("Windows-1252");
Encoding utf8 = Encoding.UTF8;
byte[] windowsBytes = www.bytes;
byte[] utf8Bytes = Encoding.Convert(windows1252, utf8, windowsBytes);
string converted_xml = utf8.GetString(utf8Bytes);

Bytes read as UTF8 string and converted to Base64

Forgive the lengthy setup here but I thought it may help to have the context...
I am implementing a custom digital signature validation method as part of a WCF service. We're using a custom method because of various differing interpretations of some industry standards, but the details there aren't all that relevant.
In this particular scenario, I am receiving an MTOM/XOP encoded request where the root MIME part contains a digital signature and the signature DigestValue and SignatureValue pieces are split up into separate MIME parts.
The MIME parts that contain the signature DigestValue and SignatureValue data are binary encoded, so each one is literally a bunch of raw bytes in the web request, like this:
Content-Id: <c18605af-18ec-4fcb-bec7-e3767ef6fe53@example.jaxws.sun.com>
Content-Type: application/octet-stream
Content-Transfer-Encoding: binary
[non-printable-binary-data-goes-here]
--uuid:eda4d7f2-4647-4632-8ecb-5ba44f1a076d
I am reading the contents of the message in as a string (using the default UTF8 encoding) like this (see the requestAsString parameter below):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
try
{
    using (MemoryStream mstream = new MemoryStream())
    {
        buffer.WriteMessage(mstream);
        mstream.Position = 0;
        using (StreamReader sr = new StreamReader(mstream))
        {
            requestAsString = sr.ReadToEnd();
        }
        request = buffer.CreateMessage();
    }
}
After I read the MTOM/XOP message in, I am attempting to re-organize the multiple MIME parts into one SOAP message where the signature DigestValue and SignatureValue elements are restored to the original SOAP envelope (and not as attachments). So basically I am decoding the MTOM/XOP request.
Unfortunately, I am having trouble reading the DigestValue and SignatureValue pieces correctly. I need to read the bytes out of the message and get the base64 string representation of that data.
Despite all the context above, it seems the core problem is reading the binary data in as a string (UTF8 encoded) and then converting it to a proper base64 representation.
Here is what I am seeing in my test code:
This is my example base64 string:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
I can then get the byte representation of that string. This yields an array of 20 bytes:
byte[] base64Bytes = Convert.FromBase64String(base64String);
I then get the UTF8 encoded version of those bytes:
string decodedString = UTF8Encoding.UTF8.GetString(base64Bytes);
Now the strange part... if I convert the string back to bytes as follows, I get an array of bytes that is 39 bytes long:
byte[] base64BytesBack = UTF8Encoding.UTF8.GetBytes(decodedString);
So obviously at this point, when I convert back into a base64 string, it doesn't match the original value:
string base64StringBack = Convert.ToBase64String(base64BytesBack);
base64StringBack is set to "77+977+9FzMO77+9eO+/ve+/ve+/vVbvv73vv703Mu+/vWAK77+9"
What am I doing wrong here? If I switch to using UTF8Encoding.Unicode.GetString() and UTF8Encoding.Unicode.GetBytes(), it works as expected:
string base64String = "mowXMw68eLSv9J1W7f43MvNgCrc=";
// First get an array of bytes from the base64 string
byte[] base64Bytes = Convert.FromBase64String(base64String);
// Get the Unicode representation of the base64 bytes.
string decodedString = UTF8Encoding.Unicode.GetString(base64Bytes);
byte[] base64BytesBack = UTF8Encoding.Unicode.GetBytes(decodedString);
string base64StringBack = Convert.ToBase64String(base64BytesBack);
Now base64StringBack is set to "mowXMw68eLSv9J1W7f43MvNgCrc=" so it seems I am mis-using the UTF8 encoding somehow or it is behaving differently than I would expect.

Arbitrary binary data cannot be decoded into a UTF-8 string and then encoded back to the same binary data. The paragraph "Invalid byte sequences" in http://en.wikipedia.org/wiki/UTF-8 points that out.
I am a bit confused as to why you want the data encoded/decoded as UTF8.
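A small sketch of the difference, using the digest value from the question: the lossy path goes through a UTF-8 string, while the lossless path keeps the data as bytes and Base64-encodes those directly. The variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text;

// Digest value from the question; it decodes to 20 arbitrary bytes.
byte[] original = Convert.FromBase64String("mowXMw68eLSv9J1W7f43MvNgCrc=");

// Lossy path: bytes that do not form valid UTF-8 become U+FFFD on decoding,
// and every U+FFFD is written back as the three bytes EF BF BD.
string asText = Encoding.UTF8.GetString(original);
byte[] viaUtf8 = Encoding.UTF8.GetBytes(asText);
Console.WriteLine(viaUtf8.SequenceEqual(original)); // False

// Lossless path: never treat the binary data as text.
string base64 = Convert.ToBase64String(original);
Console.WriteLine(base64); // mowXMw68eLSv9J1W7f43MvNgCrc=
```

For the MIME parts in the question this would mean reading the attachment bodies from the stream as bytes rather than through a StreamReader.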

Ok, I took a different approach to reading the MTOM/XOP message:
Instead of relying on my own code to parse the MIME parts by hand, I just used XmlDictionaryReader.CreateMtomReader() to get an XmlDictionaryReader and read the message into an XmlDocument (being careful to preserve whitespace on the XmlDocument so digital signatures aren't broken):
MessageBuffer buffer = request.CreateBufferedCopy(int.MaxValue);
messageContentType = WebOperationContext.Current.IncomingRequest.ContentType;
try
{
    using (MemoryStream mstream = new MemoryStream())
    {
        buffer.WriteMessage(mstream);
        mstream.Position = 0;
        if (messageContentType.Contains("multipart/related;"))
        {
            Encoding[] encodings = new Encoding[1];
            encodings[0] = Encoding.UTF8;
            // MTOM
            using (XmlDictionaryReader reader = XmlDictionaryReader.CreateMtomReader(mstream, encodings, messageContentType, XmlDictionaryReaderQuotas.Max))
            {
                XmlDocument msgDoc = new XmlDocument();
                msgDoc.PreserveWhitespace = true;
                msgDoc.Load(reader);
                requestAsString = msgDoc.OuterXml;
                reader.Close();
            }
        }
        else
        {
            // Text
            using (StreamReader sr = new StreamReader(mstream))
            {
                requestAsString = sr.ReadToEnd();
            }
        }
        request = buffer.CreateMessage();
    }
}
finally
{
    buffer.Close();
}

Cannot write to rtf file after replacing inside string with utf8 characters

I have an RTF file in which I have to replace some text with language-specific characters (UTF-8). After the replacements I try to save to a new RTF file, but either the characters come out wrong (strange characters) or the file is saved showing all the raw RTF code and formatting.
Here is my code:
var fs = new FileStream(@"F:\projects\projects\RtfEditor\Test.rtf", FileMode.Open, FileAccess.Read);
// reads the file into a byte[]
var sb = FileWorker.ReadToEnd(fs);
var enc = Encoding.GetEncoding(1250);
//var enc = Encoding.UTF8;
var sbs = enc.GetString(sb);
var sbsNew = sbs.Replace("#test/#", "ă î â șșțț");
// first writing approach
var fsw = new FileStream(@"F:\projects\projects\RtfEditor\diac.rtf", FileMode.Create, FileAccess.Write);
fsw.Write(enc.GetBytes(sbsNew), 0, enc.GetBytes(sbsNew).Length);
fsw.Flush();
fsw.Close();
With this approach, the resulting file is a correct RTF file, but the characters "șșțț" are shown as "????".
// second writing approach
using (StreamWriter sw = new StreamWriter(fsw, Encoding.UTF8))
{
    sw.Write(sbsNew);
    sw.Flush();
}
With this approach, the resulting file is an RTF file that shows all the raw RTF code and formatting, but the special characters are saved correctly (șșțț appear correctly, no more ????).

An RTF file can directly contain 7-bit characters only. Everything else needs to be encoded into escape sequences. More detailed information can be found in e.g. this Wikipedia article.
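One way to produce such escape sequences is RTF's \uN form, where N is the UTF-16 code unit written as a signed 16-bit decimal number and is followed by a fallback character for readers that do not understand \u. The sketch below assumes the default of one fallback character; EscapeForRtf is a hypothetical helper, not something from the original answer.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper (an assumption for illustration): escapes plain text so
// that it can be inserted into an RTF document using 7-bit characters only.
static string EscapeForRtf(string text)
{
    var sb = new StringBuilder(text.Length);
    foreach (char c in text)
    {
        if (c == '\\' || c == '{' || c == '}')
        {
            sb.Append('\\').Append(c);      // characters with special meaning in RTF
        }
        else if (c < 0x80)
        {
            sb.Append(c);                   // plain 7-bit ASCII passes through
        }
        else
        {
            // \uN takes a signed 16-bit value; '?' is the fallback character.
            short codeUnit = unchecked((short)c);
            sb.Append("\\u")
              .Append(codeUnit.ToString(CultureInfo.InvariantCulture))
              .Append('?');
        }
    }
    return sb.ToString();
}

Console.WriteLine(EscapeForRtf("ă î â șșțț"));
```

The helper is meant for the replacement text only, for example sbs.Replace("#test/#", EscapeForRtf("ă î â șșțț")), not for the whole document, because it would also escape the document's own RTF control characters. The result contains only ASCII, so it can be written back with the encoding that was used for reading.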
