We are calling the report type '_GET_MERCHANT_LISTINGS_DATA_' of the MWS API from a C# web application.
Sometimes we get a � character instead of a single quote, space, or other special character when decoding the data.
We have used the Encoding.GetEncoding(1252) method to encode the StreamReader output.
We are using the code below:
Stream s = reportRequest.Report;
StreamReader stream_reader = new StreamReader(s); // decodes as UTF-8 by default
string reportResponseText = stream_reader.ReadToEnd();
byte[] byteArray = Encoding.GetEncoding(1252).GetBytes(reportResponseText); // re-encodes to Windows-1252
MemoryStream stream = new MemoryStream(byteArray);
StreamReader filestream = new StreamReader(stream); // decodes the 1252 bytes as UTF-8 again
We have also tried 'Encoding.UTF8.GetBytes(reportResponseText)', but that did not help.
Could anyone please suggest the correct way to decode the data?
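A likely cause, assuming the report bytes are actually in a single-byte encoding such as Windows-1252 (common for MWS flat-file reports): the first StreamReader decodes the stream as UTF-8 by default, replacing any byte sequence that is invalid UTF-8 with � before the re-encoding step ever runs. A minimal sketch that decodes the raw stream once, with the encoding the report is actually written in:
Stream s = reportRequest.Report;
// Decode directly with the report's real encoding (assumed Windows-1252 here).
using (var reader = new StreamReader(s, Encoding.GetEncoding(1252)))
{
    string reportResponseText = reader.ReadToEnd();
    // reportResponseText now contains the correct characters;
    // no byte[]/MemoryStream round-trip is needed.
}
If the report is actually UTF-8, new StreamReader(s) alone is enough; the point is to pick the one encoding that matches the bytes rather than converting twice.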
I have a file encoded in Base64 using openssl base64 -in en -out en1 on the macOS command line, and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this; any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF-8 (with BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF-8 text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF-8 text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself, by checking for and removing the first 3 bytes of the byte array, or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
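For completeness, here's a sketch of the do-it-yourself alternative mentioned above: check for the 3-byte UTF-8 BOM (EF BB BF) and skip it before decoding:
var bytes = Convert.FromBase64String(fileContent);
// Skip the UTF-8 BOM if the decoded bytes start with it.
int offset = (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) ? 3 : 0;
string finalString = System.Text.Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);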
I have a file stored online in Azure Blob Storage, in Spanish. Some words have special characters (for example: Almacén).
When I open the file in Notepad++, the encoding shows as ANSI.
So now I try to read the file with this code:
using StreamReader reader = new StreamReader(blobStream, Encoding.UTF8);
blobStream.Seek(0, SeekOrigin.Begin);
var allLines = await reader.ReadToEndAsync();
The issue is that "allLines" is not decoded properly; I get things like: Almac�n
I have tried solutions like this one:
C# Convert string from UTF-8 to ISO-8859-1 (Latin1)
but it is still not working.
(The final goal is to "merge" two CSVs, so I read both streams, remove the header, and concatenate the strings to push the result back. If there is a better way to merge CSVs in C# that avoids this encoding issue, I am open to that too.)
You are trying to read a non-UTF-8-encoded file as if it were UTF-8 encoded. I can replicate this issue with:
var s = "Almacén";
using var memStream = new MemoryStream(Encoding.GetEncoding(28591).GetBytes(s));
using var reader = new StreamReader(memStream, Encoding.UTF8);
var allLines = await reader.ReadToEndAsync();
Console.WriteLine(allLines); // writes "Almac�n" to console
You should instead read the file with the encoding iso-8859-1 "Western European (ISO)", which is code page 28591:
using var reader = new StreamReader(Stream, Encoding.GetEncoding(28591));
var allLines = await reader.ReadToEndAsync();
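As for the merge goal mentioned in the question, here is a minimal sketch under the same encoding assumption. streamA and streamB are placeholders for the two blob streams, and both files are assumed to share a single header line:
var latin1 = Encoding.GetEncoding(28591);
using var readerA = new StreamReader(streamA, latin1);
using var readerB = new StreamReader(streamB, latin1);

string csvA = await readerA.ReadToEndAsync();
string csvB = await readerB.ReadToEndAsync();

// Drop the header (first line) of the second file, then concatenate.
string bodyB = csvB.Substring(csvB.IndexOf('\n') + 1);
string merged = csvA.EndsWith("\n") ? csvA + bodyB : csvA + "\n" + bodyB;

// Re-encode with the same code page before uploading the result.
byte[] mergedBytes = latin1.GetBytes(merged);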
I'm trying to read a stream with ISO-8859-1 encoding in C#:
using (var reader = new StreamReader(stream, System.Text.Encoding.GetEncoding("iso-8859-1")))
{
    var current_enc = reader.CurrentEncoding; // value is UTF8
}
I set the encoding to iso-8859-1, but it is not actually applied afterwards.
Has anyone seen this behaviour?
I found the StreamReader parameter detectEncodingFromByteOrderMarks.
If it is set to false, the reader does not try to detect the encoding and uses the one you supply:
using (StreamReader reader = new StreamReader(stream,System.Text.Encoding.GetEncoding("iso-8859-1"), false))
I am trying to understand Unicode encoding behaviour and came across the following.
I am writing a string to a file using Encoding.Unicode:
new StreamWriter(fileName, false, Encoding.Unicode);
I am reading from the same file but intentionally use ASCII:
new StreamReader(fileName, Encoding.ASCII);
When I read the string using ReadLine, to my surprise it gives back the same Unicode string.
I expected the string to contain ? or other characters, with double the length of the original string.
What is happening here?
Code Snippet
string test = "سشصضطظع"; // some random Arabic text
StreamWriter writer = new StreamWriter(fileName, false, Encoding.Unicode);
writer.Write(test);
writer.Flush();
writer.Close();
StreamReader reader = new StreamReader(fileName, Encoding.ASCII);
string ss = reader.ReadLine();
reader.Close();
// In ss I expect ASCII with double the length of test.
If I call StreamReader reader = new StreamReader(fileName, Encoding.ASCII, false);
then it gives the expected result.
Thanks
The parameter detectEncodingFromByteOrderMarks should be set to false when creating the StreamReader. It defaults to true, so the reader sees the byte order mark the StreamWriter wrote, silently switches to the detected encoding, and ignores the one you passed in.
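A small sketch of the difference, assuming the file written in the snippet above:
// BOM detection on (the default): the reader spots the UTF-16 BOM,
// switches to Encoding.Unicode, and returns the original Arabic text.
using (var r = new StreamReader(fileName, Encoding.ASCII, true))
    Console.WriteLine(r.ReadLine());

// BOM detection off: every byte is decoded as ASCII, so the BOM bytes
// (0xFF 0xFE) become '?' fallbacks and each UTF-16 code unit is split
// into a pair of unrelated single-byte characters.
using (var r = new StreamReader(fileName, Encoding.ASCII, false))
    Console.WriteLine(r.ReadLine());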
I have a scenario with a class like this:
class Document
{
    public string Name { get; set; }
    public byte[] Contents { get; set; }
}
Now I am trying to implement import/export functionality where I keep the document contents in binary, so the document is stored in a JSON file with the other fields, and the contents look something like this:
UEsDBBQABgAIAAAAIQCitGbRsgEAALEHAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAACAAAAAAA==
Now when I upload this file back, I read it as a string and get the same data, but when I try to convert it to a binary byte[], the file becomes corrupt.
How can I achieve this?
I use something like this to convert it:
var ss = sr.ReadToEnd();
// Write the Base64 text into a MemoryStream and grab its bytes.
MemoryStream stream = new MemoryStream();
StreamWriter writer = new StreamWriter(stream);
writer.Write(ss);
writer.Flush();
stream.Position = 0;
var bytes = stream.ToArray();
This looks like Base64. Use:
System.Convert.ToBase64String(b)
https://msdn.microsoft.com/en-us/library/dhx0d524%28v=vs.110%29.aspx
And
System.Convert.FromBase64String(s)
https://msdn.microsoft.com/en-us/library/system.convert.frombase64string%28v=vs.110%29.aspx
You need to decode it from Base64, like this (assuming you've read the file into ss as a string):
var bytes = Convert.FromBase64String(ss);
There are several things going on here. You need to know the encoding the StreamWriter uses; if none is specified, it defaults to UTF-8. However, .NET strings are always UTF-16 in memory.
MemoryStream from string - confusion about Encoding to use
I would suggest using System.Convert.ToBase64String(someByteArray) and its counterpart System.Convert.FromBase64String(someString) to handle this for you.
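A minimal round-trip sketch of that suggestion, where document is an instance of the Document class from the question:
// Export: turn the binary contents into a Base64 string for the JSON file.
string exported = Convert.ToBase64String(document.Contents);

// Import: decode the Base64 string straight back to the original bytes.
// No StreamWriter/MemoryStream round-trip is needed; that detour encodes
// the text of the Base64 string instead of recovering the binary data.
byte[] restored = Convert.FromBase64String(exported);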