File.ReadAllText vs Encoding.UTF8: some string (apparently), but not equal [duplicate]

In .NET, I'm trying to use Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.
It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.
I know I can use a TextReader to consume the BOM as needed, but I thought that the GetString method was meant to be a kind of shortcut that makes our code shorter.
Am I missing something? Is this intentional?
Here's code that reproduces it:
using System;
using System.IO;
using System.Linq;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string s1 = "abc";
        byte[] abcWithBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
        }
        byte[] abcWithoutBom;
        using (var ms = new MemoryStream())
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(s1);
            sw.Flush();
            abcWithoutBom = ms.ToArray();
            Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
        }
        var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
        Console.WriteLine(restore1.Length); // 3
        Console.WriteLine(restore1); // abc
        var restore2 = Encoding.UTF8.GetString(abcWithBom);
        Console.WriteLine(restore2.Length); // 4 (!)
        Console.WriteLine(restore2); // ?abc
    }

    // Formats each byte as lowercase hex, separated by commas.
    private static string FormatArray(byte[] bytes1)
    {
        return string.Join(", ", from b in bytes1 select b.ToString("x"));
    }
}

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.
It doesn't look like it "ignores" it at all - it faithfully converts it to the BOM character. That's what it is, after all.
If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.
Note that if you either use Encoding.GetBytes followed by Encoding.GetString or use StreamWriter followed by StreamReader, both forms will either produce then swallow or not produce the BOM. It's only when you mix using a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.

Based on the answer by Jon Skeet (thanks!), this is how I just did it:
var memoryStream = new MemoryStream(byteArray);
var s = new StreamReader(memoryStream).ReadToEnd();
Note that this will probably only work reliably if there is a BOM in the byte array you are reading from. If not, you might want to look into another StreamReader constructor overload which takes an Encoding parameter so you can tell it what the byte array contains.
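The constructor overload mentioned above can be sketched roughly as follows. This is only an illustration, not the original poster's code: the class name BomAwareDecoding and the helper DecodeUtf8 are made up for the example, and it assumes the bytes are UTF-8 unless a BOM says otherwise.

```csharp
using System;
using System.IO;
using System.Text;

static class BomAwareDecoding
{
    // Hypothetical helper: decodes the bytes as UTF-8 and lets StreamReader
    // consume a leading BOM if one is present.
    public static string DecodeUtf8(byte[] bytes)
    {
        using (var stream = new MemoryStream(bytes))
        using (var reader = new StreamReader(stream, Encoding.UTF8, detectEncodingFromByteOrderMarks: true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x61, 0x62, 0x63 };
        byte[] withoutBom = { 0x61, 0x62, 0x63 };

        Console.WriteLine(DecodeUtf8(withBom).Length);              // 3
        Console.WriteLine(DecodeUtf8(withoutBom).Length);           // 3
        Console.WriteLine(Encoding.UTF8.GetString(withBom).Length); // 4, the BOM is kept as a character
    }
}
```

Keep in mind that with BOM detection switched on, a byte array that starts with a different BOM (for example UTF-16) would be decoded with that encoding instead of the one passed in.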

For those who do not want to use streams, I found a fairly simple solution using LINQ:
using System.Linq;
using System.Text;

public static class EncodingExtensions
{
    // Decodes the bytes, skipping the encoding's preamble (BOM) when the array starts with it.
    public static string GetStringExcludeBOMPreamble(this Encoding encoding, byte[] bytes)
    {
        var preamble = encoding.GetPreamble();
        if (preamble?.Length > 0 && bytes.Length >= preamble.Length && bytes.Take(preamble.Length).SequenceEqual(preamble))
        {
            return encoding.GetString(bytes, preamble.Length, bytes.Length - preamble.Length);
        }
        else
        {
            return encoding.GetString(bytes);
        }
    }
}

I know I am kind of late to the party, but here's the code I am using, in case you need it (feel free to adapt it to C#):
Imports System.IO
Imports System.Text
Imports System.Xml
Imports System.Xml.Serialization

Public Function Serialize(Of YourXMLClass)(ByVal obj As YourXMLClass,
        Optional ByVal omitXMLDeclaration As Boolean = True,
        Optional ByVal omitXMLNamespace As Boolean = True) As String
    Dim serializer As New XmlSerializer(obj.GetType)
    Using memStream As New MemoryStream()
        Dim settings As New XmlWriterSettings() With {
            .Encoding = Encoding.UTF8,
            .Indent = True,
            .OmitXmlDeclaration = omitXMLDeclaration}
        Using writer As XmlWriter = XmlWriter.Create(memStream, settings)
            Dim xns As New XmlSerializerNamespaces
            If (omitXMLNamespace) Then xns.Add("", "")
            serializer.Serialize(writer, obj, xns)
        End Using
        Return Encoding.UTF8.GetString(memStream.ToArray())
    End Using
End Function

Public Function Deserialize(Of YourXMLClass)(ByVal obj As YourXMLClass, ByVal xml As String) As YourXMLClass
    Dim result As YourXMLClass
    Dim serializer As New XmlSerializer(GetType(YourXMLClass))
    Using memStream As New MemoryStream()
        ' Encode the XML text as UTF-8 and rewind the stream before reading it back.
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(xml)
        memStream.Write(bytes, 0, bytes.Length)
        memStream.Seek(0, SeekOrigin.Begin)
        Using reader As XmlReader = XmlReader.Create(memStream)
            result = DirectCast(serializer.Deserialize(reader), YourXMLClass)
        End Using
    End Using
    Return result
End Function

Related

Get binary representation of ASCII symbol (C#)

Sorry for asking a question like this, but I'm really stuck.
I have this method for reading data from a file:
public void ReadFromFile()
{
    string fileName = @"my .txt file path";
    List<char> encoded = new List<char>();
    List<byte> converted = new List<byte>();
    using (StreamReader sr = new StreamReader(fileName))
    {
        string line = sr.ReadToEnd();
        string[] lines = line.Split('\n');
        foreach (var v in lines[2])
        {
            encoded.Add(v); // just get the data I need
        }
    }
}
Now encoded holds the symbols F and @.
I want to get 01000110 (the representation of F) and 01000000 (the representation of @).
I tried converting every item in List<char> encoded to a byte and then calling Convert.ToString(value, 2).
That doesn't work: it throws "Value was either too large or too small for an unsigned byte."
The output file contains something like this:
s,01;w,000;e,1;t,001; // dictionary of characters and their codes
6 // number of zeros
F@ // encoded string
So what I want to do is DECODE this back into the input string (that is, 'sweet'). For this, I need to decode F@ into 0100011001000000.
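One possible way to get the bit string the question asks for is sketched below. The class BitStrings and the helper ToBits are hypothetical names, and the sketch assumes every character fits in a single byte; a character above 255 (for instance one produced by reading the file with the wrong encoding) still throws, which may be what triggered the error quoted in the question.

```csharp
using System;
using System.Linq;

static class BitStrings
{
    // Hypothetical helper: renders each char as 8 binary digits.
    // Assumes every char has a code of 255 or less; otherwise the checked cast throws OverflowException.
    public static string ToBits(string text)
    {
        return string.Concat(text.Select(c =>
            Convert.ToString(checked((byte)c), 2).PadLeft(8, '0')));
    }

    static void Main()
    {
        // 'F' is 70 (01000110) and '@' is 64 (01000000).
        Console.WriteLine(ToBits("F@")); // 0100011001000000
    }
}
```

PadLeft matters here because Convert.ToString(value, 2) drops leading zeros, so 'F' alone would come out as 1000110.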

vb.net dataset.load xml file containing string "&"

I'm using a Dataset.Load statement to load an XML file. Some tags in the file contain the "&" character, and this causes an exception. Is there any way to load the XML into the dataset, or to replace the & with another string?
I tried a Replace, but when I use StringVar.Replace("&","e"), for example, any "ç" or "ã" characters in the file are replaced with a wrong sequence of characters.
I was trying this:
My.Computer.FileSystem.WriteAllText(MyFilePath, My.Computer.FileSystem.ReadAllText(MyFilePath, System.Text.Encoding.UTF8).Replace(" & ", "&"), False, System.Text.Encoding.UTF8)
but some files contain "A&B" or some other combination of letters directly before and after the "&".
I'd be glad if anyone could help me.
Thanks

Hello guys, I solved my problem. As @Blorgbeard said, the real problem was that the XML file coming in was not valid.
Imports System.IO

Public Shared Function Decompress(text As String) As String
    Dim bytes As Byte() = Convert.FromBase64String(text)
    Using msi = New MemoryStream(bytes)
        Using mso = New MemoryStream()
            Using gs = New System.IO.Compression.GZipStream(msi, System.IO.Compression.CompressionMode.Decompress)
                ' Copy the decompressed data to mso in 4 KB chunks.
                Dim bytesAux As Byte() = New Byte(4095) {}
                Dim cnt As Integer = gs.Read(bytesAux, 0, bytesAux.Length)
                While cnt <> 0
                    mso.Write(bytesAux, 0, cnt)
                    cnt = gs.Read(bytesAux, 0, bytesAux.Length)
                End While
            End Using
            Dim streamReader As StreamReader = New StreamReader(mso, System.Text.Encoding.UTF8, True)
            Dim XmlDoc As String
            mso.Seek(0, SeekOrigin.Begin)
            XmlDoc = streamReader.ReadToEnd
            Return XmlDoc
        End Using
    End Using
End Function
This is what I did to get and return the string containing the correct XML data to be written to the file.

Stream reader.Read number of character

Is there any stream reader class that reads only a given number of chars from a string, or bytes from a byte[]?
For example, reading a string:
string chunk = streamReader.ReadChars(5); // Read next 5 chars
or reading bytes:
byte[] bytes = streamReader.ReadBytes(5); // Read next 5 bytes
Note that the return type of the method and the name of the class do not matter. I just want to know whether something similar exists so that I can use it.
I have a byte[] from a MIDI file, and I want to read this MIDI file in C#. I need to be able to read a given number of bytes (or chars, if I convert it to hex), so that I can validate the MIDI data and read it more easily.

Thanks for the comments. I didn't know there was an overload of the Read method. I was able to achieve this with FileStream:
using (FileStream fileStream = new FileStream(path, FileMode.Open))
{
    byte[] chunk = new byte[4];
    fileStream.Read(chunk, 0, 4);
    string hexLetters = BitConverter.ToString(chunk); // hex representation of the 4 bytes that I need
}
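For a byte[] that is already in memory, BinaryReader over a MemoryStream offers ReadChars and ReadBytes methods close to what the question describes. The following is a rough sketch rather than a MIDI parser: the class ChunkReading and the helper DescribeHeader are invented for the example, and the sample bytes stand in for the start of a MIDI header chunk ("MThd" followed by a 4-byte length of 6).

```csharp
using System;
using System.IO;
using System.Text;

static class ChunkReading
{
    // Hypothetical helper: reads a 4-character tag and then 4 raw bytes from the data.
    public static string DescribeHeader(byte[] data)
    {
        using (var reader = new BinaryReader(new MemoryStream(data), Encoding.ASCII))
        {
            char[] tag = reader.ReadChars(4);    // next 4 chars
            byte[] length = reader.ReadBytes(4); // next 4 bytes
            return new string(tag) + " " + BitConverter.ToString(length);
        }
    }

    static void Main()
    {
        byte[] data = { 0x4D, 0x54, 0x68, 0x64, 0x00, 0x00, 0x00, 0x06 };
        Console.WriteLine(DescribeHeader(data)); // MThd 00-00-00-06
    }
}
```

ReadBytes returns a shorter array when fewer bytes remain, so checking the returned length is a cheap way to detect a truncated file.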

You can achieve this with something like the code below, but I am not sure whether it applies to your problem.
StreamReader sr = new StreamReader(stream);
StringBuilder S = new StringBuilder();
while (true)
{
    S = S.Append(sr.ReadLine());
    if (sr.EndOfStream)
    {
        break;
    }
}
Once S holds the text, you can take substrings from it.

How convert xml string UTF8 to UTF16?

I have a string of XML (UTF-8). I need to store the string in a database (MS SQL), and the string's encoding must be UTF-16.
This code does not work; utf16Xml is empty:
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
StringWriter writer = new StringWriter();
XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings()
    { Encoding = writer.Encoding, Indent = true });
xDoc.WriteTo(xml);
string utf16Xml = writer.ToString();
utf8Xml is a string containing a serialized object (UTF-8 encoded).
How can I convert a UTF-8 XML string to UTF-16?

This might help you:
MemoryStream ms = new MemoryStream();
XmlWriterSettings xws = new XmlWriterSettings();
xws.OmitXmlDeclaration = true;
xws.Indent = true;
XDocument xDoc = XDocument.Parse(utf8Xml);
xDoc.Declaration.Encoding = "utf-16";
using (XmlWriter xw = XmlWriter.Create(ms, xws))
{
    xDoc.WriteTo(xw);
}
// Convert the UTF-8 bytes written by the XmlWriter to UTF-16 and decode them.
Encoding utf8 = Encoding.UTF8;
Encoding utf16 = Encoding.Unicode;
byte[] utf16XmlArray = Encoding.Convert(utf8, utf16, ms.ToArray());
var utf16Xml = Encoding.Unicode.GetString(utf16XmlArray);

Given that XDocument.Parse only accepts a string, and that string in .NET is always UTF-16 Little Endian, it looks like you are going through a lot of steps to effectively do nothing. Either:
1. The string (utf8Xml) is already UTF-16 LE and can be inserted into SQL Server as is (i.e. do nothing) as SqlDbType.Xml or SqlDbType.NVarChar,
or
2. utf8Xml somehow contains UTF-8 byte sequences, which would be invalid UTF-16 LE (i.e. "Unicode" in Microsoft-land) byte sequences. If this is the case, then you might be able to simply:
   - add the XML declaration, stating that the encoding is UTF-8:
     xDoc.Declaration.Encoding = "utf-8";
   - do not omit the XML declaration:
     OmitXmlDeclaration = false;
   - pass utf8Xml into SQL Server as DbType.VarChar
For further explanation, please see my answer to the related question (here on S.O.):
How to solve “unable to switch the encoding” error when inserting XML into SQL Server
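One likely reason the question's utf16Xml comes back empty is that the XmlWriter is never flushed or disposed, so nothing reaches the StringWriter before ToString is called. The sketch below shows that idea only; the class Utf16XmlSketch and the helper ToUtf16Xml are hypothetical names, and it assumes the input is already a .NET string containing well-formed XML.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class Utf16XmlSketch
{
    // Hypothetical helper: re-serializes XML text through a StringWriter,
    // whose encoding is UTF-16, so the emitted declaration says utf-16.
    public static string ToUtf16Xml(string xmlText)
    {
        XDocument doc = XDocument.Parse(xmlText);
        var writer = new StringWriter();
        using (XmlWriter xml = XmlWriter.Create(writer, new XmlWriterSettings { Indent = true }))
        {
            doc.WriteTo(xml);
        } // disposing the XmlWriter flushes its buffered output into the StringWriter
        return writer.ToString();
    }

    static void Main()
    {
        string input = "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><item>abc</item></root>";
        Console.WriteLine(ToUtf16Xml(input));
    }
}
```

Calling xml.Flush() before writer.ToString() should have the same effect as the using block when disposing the writer early is not convenient.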

How to read a file starting at a specific cursor point in C#?

I want to read a file starting not at the beginning but at a specific point in the file. For example, I want to skip the first 977 characters of the file and then read the next 200 characters at once. Thanks.

If you want to read the file as text, skipping characters (not bytes):
using (var textReader = System.IO.File.OpenText(path))
{
    // read and disregard the first 977 chars
    var buffer = new char[977];
    textReader.Read(buffer, 0, buffer.Length);
    // read 200 chars
    buffer = new char[200];
    textReader.Read(buffer, 0, buffer.Length);
}
If you merely want to skip a certain number of bytes (not characters):
using (var fileStream = System.IO.File.OpenRead(path))
{
    // seek to starting point
    fileStream.Seek(977, SeekOrigin.Begin);
    // read 200 bytes
    var buffer = new byte[200];
    fileStream.Read(buffer, 0, buffer.Length);
}
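A single Read call may return fewer characters than requested, so a variation on the text-based approach is to use TextReader.ReadBlock, which keeps reading until the requested count is reached or the input ends. This is a minimal sketch: the class TextWindow and the helper ReadWindow are hypothetical names, and a StringReader stands in for the file so the example runs on its own (File.OpenText(path) would supply a reader for a real file).

```csharp
using System;
using System.IO;

static class TextWindow
{
    // Hypothetical helper: skips `skip` characters, then returns up to `count` characters.
    public static string ReadWindow(TextReader reader, int skip, int count)
    {
        var discard = new char[skip];
        reader.ReadBlock(discard, 0, skip);            // read and disregard the leading chars
        var buffer = new char[count];
        int read = reader.ReadBlock(buffer, 0, count); // may be less than count at end of input
        return new string(buffer, 0, read);
    }

    static void Main()
    {
        using (var reader = new StringReader("hello how are you ?"))
        {
            Console.WriteLine(ReadWindow(reader, 6, 3)); // how
        }
    }
}
```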

You can use LINQ and convert the resulting array of chars to a string.
Add these namespaces:
using System.Linq;
using System.IO;
Then you can use this to get an array of b characters, starting at index a, from your text file:
char[] c = File.ReadAllText(FilePath).ToCharArray().Skip(a).Take(b).ToArray();
Then you can build a string from the consecutive chars in c:
string r = new string(c);
For example, I have this text in a file:
hello how are you ?
I use this code:
char[] c = File.ReadAllText(FilePath).ToCharArray().Skip(6).Take(3).ToArray();
string r = new string(c);
MessageBox.Show(r);
and it shows: how
Way 2
Very simple, using the Substring method:
string s = File.ReadAllText(FilePath);
string r = s.Substring(6,3);
MessageBox.Show(r);
Good luck!

using (var fileStream = System.IO.File.OpenRead(path))
{
    // seek to starting point
    fileStream.Position = 977;
    // read
}

If you want to read specific data types from files, System.IO.BinaryReader is the best choice.
If you are not sure about the file encoding, use:
using (var binaryreader = new BinaryReader(File.OpenRead(path)))
{
    // skip to the starting point by reading and discarding 977 chars
    binaryreader.ReadChars(977);
    // read
    char[] data = binaryreader.ReadChars(200);
    // do what you want with data
}
Otherwise, if you know that every character in the source file takes 1 or 2 bytes, use:
using (var binaryreader = new BinaryReader(File.OpenRead(path)))
{
    // seek to starting point
    binaryreader.BaseStream.Position = 977 * X; // X is 1 or 2, based on the character size in the source file
    // read
    char[] data = binaryreader.ReadChars(200);
    // do what you want with data
}
