Cannot write to an RTF file after replacing UTF-8 characters inside a string - C#

I have an RTF file in which I have to make some text replacements with language-specific (UTF-8) characters. After the replacements I try to save to a new RTF file, but either the characters are not saved right (strange characters) or the file is saved with all the raw RTF code and formatting.
Here is my code:
var fs = new FileStream(@"F:\projects\projects\RtfEditor\Test.rtf", FileMode.Open, FileAccess.Read);
// reads the file into a byte[]
var sb = FileWorker.ReadToEnd(fs);
var enc = Encoding.GetEncoding(1250);
//var enc = Encoding.UTF8;
var sbs = enc.GetString(sb);
var sbsNew = sbs.Replace("#test/#", "ă î â șșțț");
// first writing approach
var fsw = new FileStream(@"F:\projects\projects\RtfEditor\diac.rtf", FileMode.Create, FileAccess.Write);
var bytes = enc.GetBytes(sbsNew);
fsw.Write(bytes, 0, bytes.Length);
fsw.Flush();
fsw.Close();
With this approach, the resulting file is correct RTF, but the characters "șșțț" are shown as "????".
// second writing approach
using (StreamWriter sw = new StreamWriter(fsw, Encoding.UTF8))
{
    sw.Write(sbsNew);
    sw.Flush();
}
With this approach, the result is an RTF file, but it shows all the raw RTF code and formatting, and the special characters are saved right ("șșțț" appears correctly, no more "????").

An RTF file can directly contain only 7-bit (ASCII) characters. Everything else needs to be encoded as escape sequences, such as \'hh (a byte in the document's code page) or \uN (a decimal Unicode value). More detailed information can be found in e.g. this Wikipedia article.
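Building on that, the replacement text can be converted to \uN escapes before it is spliced into the RTF source, so the saved file stays pure 7-bit ASCII. A minimal sketch (the ToRtfEscapes helper is an illustration, not part of the asker's FileWorker code):

```csharp
using System.Text;

// Convert non-ASCII characters to RTF \uN? escapes. The \uN control word
// takes the signed 16-bit code unit in decimal; the '?' that follows is
// the fallback character for readers that cannot display Unicode.
static string ToRtfEscapes(string text)
{
    var sb = new StringBuilder();
    foreach (char c in text)
    {
        if (c < 128)
            sb.Append(c);
        else
            sb.Append(@"\u").Append((int)(short)c).Append('?');
    }
    return sb.ToString();
}

// Usage: escape the replacement before splicing it into the RTF source.
var sbsNew = sbs.Replace("#test/#", ToRtfEscapes("ă î â șșțț"));
```

With this, the output can be written with a plain ASCII-safe encoding and any RTF reader will reconstruct the diacritics from the escapes.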

Related

How to remove BOM from an encoded base64 UTF string?

I have a file encoded in base64 using openssl base64 -in en -out en1 at the command line on macOS, and I am reading this file using the following code:
string fileContent = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, MConst.BASE_DIR, "en1"));
var b1 = Convert.FromBase64String(fileContent);
var str1 = System.Text.Encoding.UTF8.GetString(b1);
The string I am getting has a ? before the actual file content. I am not sure what's causing this; any help will be appreciated.
Example Input:
import pandas
import json
Encoded file example:
77u/DQppbXBvcnQgY29ubmVjdG9yX2FwaQ0KaW1wb3J0IGpzb24NCg0K
Output based on the C# code:
?import pandas
import json
Normally, when you read UTF (with BOM) from a text file, the decoding is handled for you behind the scenes. For example, both of the following lines will read UTF text correctly regardless of whether or not the text file has a BOM:
File.ReadAllText(path, Encoding.UTF8);
File.ReadAllText(path); // UTF8 is the default.
The problem is that you're dealing with UTF text that has been encoded to a Base64 string. So, ReadAllText() can no longer handle the BOM for you. You can either do it yourself by (checking and) removing the first 3 bytes from the byte array or delegate that job to a StreamReader, which is exactly what ReadAllText() does:
var bytes = Convert.FromBase64String(fileContent);
string finalString = null;
using (var ms = new MemoryStream(bytes))
using (var reader = new StreamReader(ms)) // Or:
// using (var reader = new StreamReader(ms, Encoding.UTF8))
{
finalString = reader.ReadToEnd();
}
// Proceed to using finalString.
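Doing it yourself can be sketched like this (assuming the BOM, when present, is the standard three-byte UTF-8 preamble EF BB BF):

```csharp
using System;
using System.Text;

var bytes = Convert.FromBase64String(fileContent);
var bom = Encoding.UTF8.GetPreamble(); // EF BB BF

// Skip the preamble if the decoded bytes start with it.
int offset = bytes.Length >= bom.Length
             && bytes[0] == bom[0] && bytes[1] == bom[1] && bytes[2] == bom[2]
             ? bom.Length : 0;

string finalString = Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
```

The StreamReader route is usually preferable, since it also copes with files that have no BOM at all.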

Encoding issue with spanish file in C#

I have a file, in Spanish, stored online in Azure Blob Storage. Some words have special characters (for example: Almacén).
When I open the file in Notepad++, the encoding shown is ANSI.
So now I try to read the file with this code:
using StreamReader reader = new StreamReader(Stream, Encoding.UTF8);
blobStream.Seek(0, SeekOrigin.Begin);
var allLines = await reader.ReadToEndAsync();
The issue is that "allLines" is not decoded properly; I get text like: Almac�n
I have tried solutions like this one:
C# Convert string from UTF-8 to ISO-8859-1 (Latin1)
but it is still not working.
(The final goal is to "merge" two CSVs, so I read both streams, remove the header, and concatenate the strings before pushing the result back. If there is a better way to merge CSVs in C# that avoids this encoding issue, I am open to that too.)
You are trying to read a non-UTF8 encoded file as if it was UTF8 encoded. I can replicate this issue with
var s = "Almacén";
using var memStream = new MemoryStream(Encoding.GetEncoding(28591).GetBytes(s));
using var reader = new StreamReader(memStream, Encoding.UTF8);
var allLines = await reader.ReadToEndAsync();
Console.WriteLine(allLines); // writes "Almac�n" to console
You should read the file with the ISO-8859-1 "Western European (ISO)" encoding, which is code page 28591.
using var reader = new StreamReader(Stream, Encoding.GetEncoding(28591));
var allLines = await reader.ReadToEndAsync();
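As for the CSV merge mentioned at the end of the question, here is a sketch under the same assumption (both blobs are ISO-8859-1; the stream names are hypothetical) that skips the second file's header and writes the merged result out as UTF-8:

```csharp
using System.Text;

var latin1 = Encoding.GetEncoding(28591);

using var reader1 = new StreamReader(blobStream1, latin1);
using var reader2 = new StreamReader(blobStream2, latin1);
using var writer = new StreamWriter(outputStream, Encoding.UTF8);

// First file goes through whole, header included.
writer.Write(await reader1.ReadToEndAsync());

// Skip the second file's duplicate header line, then append the rest.
await reader2.ReadLineAsync();
writer.Write(await reader2.ReadToEndAsync());
```

Decoding with the right encoding and re-encoding on the way out is the whole fix; the concatenation itself is then plain string handling.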

Which character encoding should I use for Tab-delimited flat file?

We are calling report type '_GET_MERCHANT_LISTINGS_DATA_' of the MWS API from a C# web application.
Sometimes we get the � character instead of a single quote, a space, or other special characters when decoding the data.
We have used the Encoding.GetEncoding(1252) method to re-encode the StreamReader output.
We are using the code below.
Stream s = reportRequest.Report;
StreamReader stream_reader = new StreamReader(s);
string reportResponseText = stream_reader.ReadToEnd();
byte[] byteArray = Encoding.GetEncoding(1252).GetBytes(reportResponseText);
MemoryStream stream = new MemoryStream(byteArray);
StreamReader filestream = new StreamReader(stream);
We have also tried 'Encoding.UTF8.GetBytes(reportResponseText)', but it did not help.
Could anyone please suggest the correct way to decode the data?
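One thing worth noting about the snippet above: a StreamReader constructed without an encoding argument decodes the stream as UTF-8 first, and once that mis-decode has produced � (U+FFFD), re-encoding the string with GetEncoding(1252) cannot bring the original characters back. A sketch of decoding the stream with an explicit encoding in a single step (whether the report really is Windows-1252 is an assumption to verify against the API's documentation):

```csharp
using System.IO;
using System.Text;

// Decode the report stream directly with the assumed encoding; there is
// no intermediate re-encode, so nothing is lost to U+FFFD replacements.
Stream s = reportRequest.Report;
using var streamReader = new StreamReader(s, Encoding.GetEncoding(1252));
string reportResponseText = streamReader.ReadToEnd();
```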

C# Streamreader: Handling of special characters \" \' etc

I'm reading and then writing a text file. Before and after the data of interest, the file contains many lines that should remain unaltered. But StreamReader seems to convert the special characters ( " ' — ) into other characters that appear as funky diamonds both in C# textboxes and in Notepad. How can text pass through file read/write operations completely unaltered? Thanks.
StreamWriter sw = new StreamWriter(sOutputFileName);
using (StreamReader sr = new StreamReader(sTempFileName))
{
    while (sr.Peek() >= 0)
    {
        rdBuffer = sr.ReadLine();
        txtProgressDisplay.Text += rdBuffer + "\r\n";
        // parse and process some lines here
        wrBuffer = rdBuffer;
        sw.WriteLine(wrBuffer);
        txtProgressDisplay.Text += wrBuffer + "\r\n";
    }
    sr.Close();
}
sw.Close();
I am almost certain the issue is related to character encoding, i.e. UTF-8, ASCII, Windows-1252, etc. Try creating your StreamReader passing in the file's actual encoding. Note that curly quotes and em dashes do not exist in ASCII; if the file came from a Windows editor, it is most likely Windows-1252:
StreamReader sr = new StreamReader(sTempFileName, Encoding.GetEncoding(1252));
You can also pass Encoding.UTF8, Encoding.Unicode, etc., and you should create the StreamWriter with the same encoding so the characters survive the round trip.
Your problem seems to be encoding-related.
1) Check that your text viewer uses the same encoding as your .NET application (maybe UTF-8?).
2) Check whether the file itself was created with the same encoding your .NET application uses (are you mixing characters in different encodings?).
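Putting those suggestions together, here is a minimal round-trip sketch in which the reader and writer share one explicit encoding (Windows-1252 is an assumption here; substitute the file's real encoding):

```csharp
using System.IO;
using System.Text;

var enc = Encoding.GetEncoding(1252);

using (var sr = new StreamReader(sTempFileName, enc))
using (var sw = new StreamWriter(sOutputFileName, false, enc))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // parse and process some lines here
        sw.WriteLine(line);
    }
}
```

Because every byte is decoded and re-encoded with the same code page, the unaltered lines pass through byte-for-byte.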

C# letters like å not correctly shown in output file

I work in C# and this is my code:
Encoding encoding;
StringBuilder output = new StringBuilder();
// somePath is a string
using (StreamReader sr = new StreamReader(somePath))
{
    string line;
    encoding = sr.CurrentEncoding;
    while ((line = sr.ReadLine()) != null)
    {
        // make some changes to line
        output.AppendLine(line);
    }
}
using (StreamWriter writer = new StreamWriter(someOtherPath, false)) // encoding
{
    writer.Write(output);
}
In the file at somePath I have Norwegian characters like å, but in the file at someOtherPath I get question marks instead. I think it's an encoding problem, so I tried reading the input file's encoding and giving it to the output file, with no result. I also tried opening the file in Google Chrome and applying every possible encoding, but the letters never matched the input file.
StreamReader can only make guesses about certain encodings. Ideally, you should find out what the encoding of the file really is, then use that to read it. What created the file, and what allows you to read it correctly? Does the latter program expose which encoding it's using? (For example, it may be something like Windows-1252.)
I would personally recommend using UTF-8 as your output encoding if you can, but it depends on whether you're in control over whatever's then reading the output.
EDIT: Okay, now I've seen the file, I can confirm it's not UTF-8. The word "direktør" is represented as these bytes:
64 69 72 65 6b 74 f8 72
So the non-ASCII character is a single byte (F8) which is not a valid UTF-8 representation of a character.
It could be ISO-8859-1 (Latin-1) - it's not clear (there are multiple encodings that would match). If it is, you can use:
Encoding encoding = Encoding.GetEncoding(28591);
using (TextReader reader = new StreamReader(filename, encoding))
{
    ...
}
(Alternatively, use File.ReadAllLines to make life simpler.)
You'll need to separately work out what output encoding you want.
EDIT: Here's a short but complete program which I've run against the file you provided, and which has correctly converted the character to UTF-8:
using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        Encoding encoding = Encoding.GetEncoding(28591);
        StringBuilder output = new StringBuilder();
        using (TextReader reader = new StreamReader("file.html", encoding))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                output.AppendLine("Read line: " + line);
            }
        }
        using (StreamWriter writer = new StreamWriter("output.html", false))
        {
            writer.Write(output);
        }
    }
}
Try this to save your text (note the bool append argument; there is no StreamWriter(string, Encoding) overload):
using (StreamWriter writer = new StreamWriter(someOtherPath, false, Encoding.UTF8)) { ... }
