Encoding issue with Spanish file in C#

I have a file stored online in Azure Blob Storage, in Spanish. Some words have special characters (for example: Almacén).
When I open the file in Notepad++, the encoding is reported as ANSI.
So I try to read the file with this code:
blobStream.Seek(0, SeekOrigin.Begin);
using StreamReader reader = new StreamReader(blobStream, Encoding.UTF8);
var allLines = await reader.ReadToEndAsync();
The issue is that "allLines" is not decoded properly; I get text like: Almac�n
I have tried solutions like this one:
C# Convert string from UTF-8 to ISO-8859-1 (Latin1)
but it still doesn't work.
(The final goal is to merge two CSVs, so I read both streams, remove the header, and concatenate the strings to push the result back up. If there is a better way to merge CSVs in C# that sidesteps this encoding issue, I am open to that as well.)

You are trying to read a non-UTF-8 encoded file as if it were UTF-8 encoded. I can replicate this issue with:
var s = "Almacén";
using var memStream = new MemoryStream(Encoding.GetEncoding(28591).GetBytes(s));
using var reader = new StreamReader(memStream, Encoding.UTF8);
var allLines = await reader.ReadToEndAsync();
Console.WriteLine(allLines); // writes "Almac�n" to console
You should read the file with encoding iso-8859-1 ("Western European (ISO)"), which is code page 28591:
blobStream.Seek(0, SeekOrigin.Begin);
using var reader = new StreamReader(blobStream, Encoding.GetEncoding(28591));
var allLines = await reader.ReadToEndAsync();
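Note that the "ANSI" label in Notepad++ usually means a legacy Windows code page; for Spanish text that is typically Windows-1252, which agrees with ISO-8859-1 for accented letters like é. ISO-8859-1 (28591) is available out of the box, but on .NET Core / .NET 5+ Windows-1252 only becomes available after registering the provider from the System.Text.Encoding.CodePages package. Below is a minimal sketch of the merge described in the question; the helper name MergeCsvAsync is hypothetical, and it assumes both blobs use the same single-byte encoding and that the first file ends with a newline.
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;

// Needed once at startup only if you use Windows-1252 (code page 1252) on .NET Core / .NET 5+:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

static async Task MergeCsvAsync(Stream first, Stream second, Stream output)
{
    var latin1 = Encoding.GetEncoding(28591);

    using var reader1 = new StreamReader(first, latin1);
    using var reader2 = new StreamReader(second, latin1);
    using var writer = new StreamWriter(output, new UTF8Encoding(false), 1024, leaveOpen: true);

    // Copy the first file verbatim, header included (assumes it ends with a newline).
    await writer.WriteAsync(await reader1.ReadToEndAsync());

    // Skip the second file's header row, then append the rest.
    await reader2.ReadLineAsync();
    await writer.WriteAsync(await reader2.ReadToEndAsync());
    await writer.FlushAsync();
}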

Related

Is there a way to not generate a file via CSV Helper?

Is there any way to avoid generating a CSV file on disk?
I don't want to store it in my application; can I generate it on the fly instead?
As the return value, I want the CSV converted to Base64.
var path = Path.Combine(Directory.GetCurrentDirectory(), "test.csv");
await using var writer = new StreamWriter(path);
await using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
{
    await csv.WriteRecordsAsync(list);
}
var bytes = await File.ReadAllBytesAsync(path);
return Convert.ToBase64String(bytes);
A StreamWriter can write to any stream, including a MemoryStream:
using var ms = new MemoryStream();
using var writer = new StreamWriter(ms);
...
writer.Flush(); // push any buffered text into the MemoryStream before reading it
return Convert.ToBase64String(ms.ToArray()); // ToArray, not GetBuffer: GetBuffer can include unused buffer capacity
CSV files are text files though, so converting them to Base64 isn't very useful. StreamWriter uses UTF-8 encoding by default, so it already handles any language.
It would be better to keep the text as text, especially if it's going to be stored in a text field in a database. This can be done by reading the bytes back with a StreamReader:
ms.Position = 0;
using var reader = new StreamReader(ms);
var csvText = reader.ReadToEnd();
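Putting the pieces together, here is a minimal end-to-end sketch of the in-memory approach; the helper name ToBase64CsvAsync is hypothetical, and it assumes CsvHelper with the same CsvWriter usage as in the question.
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using CsvHelper;

static async Task<string> ToBase64CsvAsync<T>(IEnumerable<T> records)
{
    using var ms = new MemoryStream();
    using (var writer = new StreamWriter(ms, new UTF8Encoding(false), 1024, leaveOpen: true))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        await csv.WriteRecordsAsync(records);
    } // disposing the writer flushes everything into the MemoryStream
    return Convert.ToBase64String(ms.ToArray());
}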

How to remove the BOM from a Base64-encoded UTF-8 string?

I have a file encoded in Base64 using openssl base64 -in en -out en1 on the command line in macOS, and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this; any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF text (with a BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF-8 text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF-8 text that has been encoded to a Base64 string, so ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array, or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
    finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
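For completeness, a sketch of the manual variant mentioned above: check for the 3-byte UTF-8 BOM (EF BB BF) and skip it before decoding.
using System;
using System.Text;

var bytes = Convert.FromBase64String(fileContent);

// The UTF-8 BOM is the byte sequence EF BB BF.
bool hasBom = bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
int offset = hasBom ? 3 : 0;

// Decode everything after the BOM, if one was present.
string finalString = Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);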

How to read uploaded CSV UTF-8 for processing with CsvHelper?

My WebAPI allows a user to upload a CSV file and then parses the file. I use CsvHelper to do the heavy lifting of reading the CSV and mapping it to domain objects.
However, I have one customer whose files are in "CSV UTF-8" format. The code that works for "vanilla" (ASCII) CSV files fails when it tries to deal with CSV UTF-8.
Is there a way to import the CSV UTF-8 data and convert it to ASCII CSV so that my code will continue to work?
My current code looks like this:
//In my WebAPI Controller
//fileToProcess is IFormFile
byte[] fileBytes = new byte[fileToProcess.Length];
using(var stream = fileToProcess.OpenReadStream())
{
    await stream.ReadAsync(fileBytes);
    stream.Close();
}
var result = await ProcessFileAsync(fileBytes);
return Ok(result);
...
//In a Parsing Class
public async Task<List<Client>> ProcessFileAsync(byte[] fileBytes)
{
    List<Client> result = null;
    var fileText = Encoding.Default.GetString(fileBytes);
    using(var reader = new StringReader(fileText))
    {
        using(var csv = new CsvReader(reader))
        {
            csv.RegisterClassMap<ClientMap>();
            result = csv.GetRecords<Client>().ToList();
            await PostProcess(result);
        }
    }
    return result;
}
The problem is that CSV UTF-8 includes the BOM, so when CsvHelper tries to process a mapping that references the first column header
Map(c => c.ClientId).Name("CLIENT ID");
it fails because the column name includes the BOM.
So, my questions are:
How can I tell if the file coming in is UTF-8 or ASCII?
How do I convert the UTF-8 to ASCII so it can be processed normally?
NOTE
I did try the following:
fileBytes = Encoding.Convert(Encoding.UTF8, Encoding.ASCII, fileBytes);
However, this replaced the BOM with a ?, which still causes CsvHelper to fail.
By doing this:
var fileText = Encoding.Default.GetString(fileBytes);
using(var reader = new StringReader(fileText))
... you're locking yourself into a specific encoding at the point of converting it to a string. Encoding.Default can vary by platform and CLR implementation.
The StreamReader class is designed to read text from a stream (which you can wrap around the raw bytes with a MemoryStream) and is capable of detecting the encoding for you if you let it. Try this instead:
using (var stream = new MemoryStream(fileBytes))
using (var reader = new StreamReader(stream))
In your case, you could use the incoming stream directly by changing ProcessFileAsync to accept the stream.
using (var stream = fileToProcess.OpenReadStream())
{
    var result = await ProcessFileAsync(stream);
    return Ok(result);
}
public async Task<List<Client>> ProcessFileAsync(Stream stream)
{
    using (var reader = new StreamReader(stream))
    {
        using (var csv = new CsvReader(reader))
        {
            csv.RegisterClassMap<ClientMap>();
            List<Client> result = csv.GetRecords<Client>().ToList();
            await PostProcess(result);
            return result;
        }
    }
}
As long as the BOM is present, this will also support UTF-16 and UTF-32 encoded files (and pretty much anything else that can be detected) because the reader will see the U+FEFF code point in whichever encoding the file uses.
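As for telling which encoding came in (the first question): StreamReader exposes what it detected through its CurrentEncoding property, which is only meaningful after the first read, since that is when the BOM is inspected. A small sketch:
using System;
using System.IO;
using System.Text;

using var stream = new MemoryStream(fileBytes);
using var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true);

reader.Peek(); // force the reader to examine the BOM
Console.WriteLine(reader.CurrentEncoding.EncodingName); // e.g. "Unicode (UTF-8)"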

Which character encoding should I use for a tab-delimited flat file?

We are calling report type '_GET_MERCHANT_LISTINGS_DATA_' of the MWS API from a C# web application.
Sometimes we get a � character instead of a single quote, a space, or other special characters when decoding the data.
We have used the Encoding.GetEncoding(1252) method to re-encode what the StreamReader read.
We are using below code.
Stream s = reportRequest.Report;
StreamReader stream_reader = new StreamReader(s);
string reportResponseText = stream_reader.ReadToEnd();
byte[] byteArray = Encoding.GetEncoding(1252).GetBytes(reportResponseText);
MemoryStream stream = new MemoryStream(byteArray);
StreamReader filestream = new StreamReader(stream);
We have also tried Encoding.UTF8.GetBytes(reportResponseText), but it did not help.
Could anyone please suggest the correct way to decode the data?
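Note what the code above does: new StreamReader(s) decodes the stream as UTF-8 by default, and any byte sequence that is invalid UTF-8 is replaced with U+FFFD (�) at that point, so re-encoding the resulting string to 1252 afterwards cannot bring the original character back. Assuming the report really is Windows-1252, a sketch that decodes with that encoding up front:
using System.IO;
using System.Text;

Stream s = reportRequest.Report;
// Decode once, with the encoding the report was actually written in.
using var streamReader = new StreamReader(s, Encoding.GetEncoding(1252));
string reportResponseText = streamReader.ReadToEnd();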

Cannot write to RTF file after replacing inside string with UTF-8 characters

I have an RTF file in which I have to make some text replacements with some language-specific (UTF-8) characters. After the replacements I try to save to a new RTF file, but either the characters are not set right (strange characters) or the file is saved with all the raw RTF code and formatting.
Here is my code:
var fs = new FileStream(@"F:\projects\projects\RtfEditor\Test.rtf", FileMode.Open, FileAccess.Read);
//reads the file into a byte[]
var sb = FileWorker.ReadToEnd(fs);
var enc = Encoding.GetEncoding(1250);
//var enc = Encoding.UTF8;
var sbs = enc.GetString(sb);
var sbsNew = sbs.Replace("#test/#", "ă î â șșțț");
//first writing approach
var fsw = new FileStream(@"F:\projects\projects\RtfEditor\diac.rtf", FileMode.Create, FileAccess.Write);
fsw.Write(enc.GetBytes(sbsNew), 0, enc.GetBytes(sbsNew).Length);
fsw.Flush();
fsw.Close();
With this approach, the resulting file is the right one, but the characters "șșțț" are shown as "????".
//second writing approach
using (StreamWriter sw = new StreamWriter(fsw, Encoding.UTF8))
{
    sw.Write(sbsNew);
    sw.Flush();
}
With this approach, the result is an RTF file, but it shows all the raw RTF code and formatting, and the special characters are saved right (șșțț appear correctly, no more ????).
An RTF file can directly contain 7-bit characters only. Everything else needs to be encoded into escape sequences. More detailed information can be found in, e.g., the Wikipedia article on the Rich Text Format.
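To make that concrete: characters outside 7-bit ASCII are typically written as \uN? control words, where N is the code point as a signed 16-bit decimal number and ? is the fallback character shown by readers that cannot handle Unicode. A sketch of a hypothetical EscapeRtf helper that the replacement text could be run through before being inserted into the RTF:
using System.Text;

// Escapes plain text for insertion into RTF: RTF syntax characters are
// backslash-escaped, and anything outside 7-bit ASCII becomes \uN?.
static string EscapeRtf(string text)
{
    var sb = new StringBuilder();
    foreach (char c in text)
    {
        if (c == '\\' || c == '{' || c == '}')
            sb.Append('\\').Append(c);            // escape RTF syntax characters
        else if (c < 128)
            sb.Append(c);                         // 7-bit ASCII passes through unchanged
        else
            sb.Append(@"\u").Append((short)c).Append('?'); // signed 16-bit code point
    }
    return sb.ToString();
}

// e.g. sbs.Replace("#test/#", EscapeRtf("ă î â șșțț"))
Because the escaped text is pure ASCII, it also survives the round trip through Encoding.GetEncoding(1250) in the code above.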
