BinaryReader in C# reads '\0' between all characters of a string

I am trying to write and read a binary file using the C# BinaryWriter and BinaryReader classes.
Writing a string to the file works, but when I read it back I get a string with a '\0' character in every alternate position.
Here is the code:
public void writeBinary(BinaryWriter bw)
{
    bw.Write("Hello");
}
public void readBinary(BinaryReader br)
{
    String s;
    s = br.ReadString();
}
Here s ends up with the value "H\0e\0l\0l\0o\0".

You are using different encodings when reading and writing the file.
You are using UTF-16 when writing the file, so each character ends up as a 16-bit character code, i.e. two bytes.
You are using UTF-8 or one of the 8-bit encodings when reading the file, so each byte ends up as one character.
Pick one encoding and use it for both reading and writing the file.
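For example, a minimal sketch (the file name is illustrative) that passes the same encoding to both sides:
using System.IO;
using System.Text;

// Use one encoding for both the writer and the reader.
var encoding = new UTF8Encoding(false); // UTF-8 without BOM

using (var fs = File.Create("data.bin"))
using (var bw = new BinaryWriter(fs, encoding))
{
    bw.Write("Hello"); // writes a length prefix followed by the UTF-8 bytes
}

using (var fs = File.OpenRead("data.bin"))
using (var br = new BinaryReader(fs, encoding))
{
    string s = br.ReadString(); // "Hello", with no interleaved '\0'
}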

Related

Visual Basic to C#: load a binary file into a string

I have to convert a project from old VB6 to C#; the aim is to preserve the old code as much as possible, as a matter of time.
A function of the old project loads a binary file into a string variable, and this variable is then analyzed character by character with the Asc function:
OLD VB Code:
Public Function LoadText(ByVal DirIn As String) As String
    Dim FileBuffer As String
    Dim LengthFile As Long
    Dim ContIN As Long
    ContIN = FreeFile
    Open DirIn For Binary Access Read As #ContIN
    LengthFile = LOF(ContIN)
    FileBuffer = Space(LengthFile)
    Get #ContIN, , FileBuffer
    Close #ContIN
    LoadText = FileBuffer
    'following lines for test purposes
    Debug.Print Asc(Mid(FileBuffer, 1, 1))
    Debug.Print Asc(Mid(FileBuffer, 2, 1))
    Debug.Print Asc(Mid(FileBuffer, 3, 1))
End Function

Sub Main()
    Dim testString As String
    testString = LoadText("e:\testme.bin")
End Sub
Result in immediate window:
1
10
133
C# code:
public static string LoadText(string dirIn)
{
    string myString;
    FileStream fs = new FileStream(dirIn, FileMode.Open);
    BinaryReader br = new BinaryReader(fs);
    byte[] bin = br.ReadBytes(Convert.ToInt32(fs.Length));
    //myString = Convert.ToBase64String(bin);
    myString = Encoding.Default.GetString(bin);
    string m1 = Encoding.Default.GetString(bin);
    //string m1 = Encoding.ASCII.GetString(bin);
    //string m1 = Encoding.BigEndianUnicode.GetString(bin);
    //string m1 = Encoding.UTF32.GetString(bin);
    //string m1 = Encoding.UTF7.GetString(bin);
    //string m1 = Encoding.UTF8.GetString(bin);
    //string m1 = Encoding.Unicode.GetString(bin);
    Console.WriteLine(General.Asc(m1.Substring(0, 1)));
    Console.WriteLine(General.Asc(m1.Substring(1, 1)));
    Console.WriteLine(General.Asc(m1.Substring(2, 1)));
    br.Close();
    fs.Close();
    return myString;
}
General class:
public static int Asc(string stringToEvaluate)
{
    return (int)stringToEvaluate[0];
}
Result in output window:
1
10
8230 <--fail!
The string in VB6 has length 174848, identical to the size of the test file.
In C# the size is the same for the Default and ASCII encodings, while all the others produce different sizes, and I cannot use them unless I change everything in the whole project.
The problem is that I can't find an encoding that yields a string for which the Asc function returns the same numbers as the VB6 one.
That is the whole problem: if the string is not identical I have to change a lot of lines of code, because the whole program is based on ASCII values and their positions in the string.
Maybe loading a binary file into a string is the wrong way, or maybe it's the Asc function...
If you want to try the example file you can download it from here:
http://www.snokie.org/testme.bin
8230 is correct: it is the UTF-16 code unit for the Unicode codepoint U+2026 (horizontal ellipsis), which needs only one UTF-16 code unit. You expected 133 because 133 is the single-byte encoding of that same character in at least one other character set: Windows-1252.
There is no text but encoded text.
When you read a text file you have to know the encoding that was used to write it. Once you read it into a .NET String or Char, you have it in Unicode's UTF-16 encoding. Because Unicode is a superset of any character set you would be using, this is not incorrect.
If you don't want to compare characters as characters, read the data as binary to keep it in the same encoding as the file. You can then compare the byte sequences.
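For instance (a sketch; the file names are illustrative):
using System.IO;
using System.Linq;

byte[] expected = File.ReadAllBytes("expected.bin");
byte[] actual = File.ReadAllBytes("testme.bin");
bool same = expected.SequenceEqual(actual); // byte-for-byte comparison, no decoding involved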
The problem is that the VB6 code, rather than using Unicode for character codes like it should have, used the "default ANSI" character set, which changes meaning from system to system and user to user.
The root of it is this: "old project loads a binary file into a string variable". Yes, this was a common (but bad) VB6 practice. String datatypes are for text. Strings in VB6 are UTF-16 code unit sequences, just like in .NET (and Java, JavaScript, HTML, XML, ...).
Get #ContIN, , FileBuffer converts from the system's default ANSI code page to UTF-16, and Asc converts it back again. So you just have to do the same in your .NET code.
Note: just like in VB6, Encoding.Default is hazardous because it can vary from system to system and user to user.
Reference Microsoft.VisualBasic.dll and add
using static Microsoft.VisualBasic.Strings;
Then:
var fileBuffer = File.ReadAllText(path, Encoding.Default);
Debug.WriteLine(Asc(Mid(fileBuffer, 3, 1)));
If you'd rather not bring Microsoft.VisualBasic.dll into a C# project, you can write your own versions:
using System.Linq;
using System.Text;

static class VB6StringReplacements
{
    static public Byte Asc(String source) =>
        Encoding.Default.GetBytes(source.Substring(0, 1)).FirstOrDefault();
    static public String Mid(String source, Int32 offset, Int32 length) =>
        source.Substring(offset - 1, length); // VB6's Mid is 1-based
}
and change your using directive to
using static VB6StringReplacements;
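Putting it together, a minimal sketch of the replacements in use (the file path is illustrative, taken from the question):
using System;
using System.IO;
using System.Text;
using static VB6StringReplacements;

class Program
{
    static void Main()
    {
        // Decode the raw bytes with the system ANSI code page, mirroring VB6's Get #.
        // Caveat: on .NET Core / .NET 5+, Encoding.Default is UTF-8, not the ANSI code page.
        var fileBuffer = File.ReadAllText(@"e:\testme.bin", Encoding.Default);

        // Mid is 1-based, like in VB6, so these print the first three byte values.
        Console.WriteLine(Asc(Mid(fileBuffer, 1, 1)));
        Console.WriteLine(Asc(Mid(fileBuffer, 2, 1)));
        Console.WriteLine(Asc(Mid(fileBuffer, 3, 1)));
    }
}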

Encoding detection for string data in a byte[] succeeds, but afterwards all string comparisons fail

How it is all set up:
I receive a byte[] which contains CSV data
I don't know the encoding (it should be Unicode / UTF-8)
I need to detect the encoding or fall back to a default (the text may contain umlauts, so the encoding is important)
I need to read the header line and compare it with defined strings
After a short search on how to get a string out of the byte[], I found How to convert byte[] to string?, which suggested using something like
string result = System.Text.Encoding.UTF8.GetString(byteArray);
I (now) use this helper to detect the encoding and afterwards the Encoding.GetString method to read the string, like so:
string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);
But when I now try to compare values from this result string with static strings in my code, all comparisons fail!
// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++)
{
    switch (headers[i].Trim().ToLower())
    {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"
While this seems to be a problem with the encoding of both strings, my question is:
How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?
Edit
The code supplied above was working as long as the string data came from a file that was saved this way:
string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;
while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();
and afterwards read out with
File.ReadAllText()
This
A. forces the file to be Unicode (the ANSI format kills all umlauts)
B. requires the written file to be accessible
Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before, and the strings look identical. But they are not.
Note: if I use an ANSI-encoded file, which uses Encoding.Default, everything works fine.
Edit 2
While ANSI-encoded data works, UTF-8 encoded data (Notepad++ shows just "UTF-8", not "UTF-8 without BOM") starts with char[0] = 65279.
So where is my error? I guess System.Text.Encoding.UTF8.GetString(byteArray) is working the right way.
Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130). You could do:
string result;
using (var memoryStream = new MemoryStream(byteArray))
using (var reader = new StreamReader(memoryStream))
{
    result = reader.ReadToEnd();
}
The StreamReader will autodetect the encoding and skip the BOM (your encoding detector is a copy of StreamReader.DetectEncoding()).
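Alternatively, if you want to keep using GetString, here is a minimal sketch that strips the BOM by hand (assuming your detected Encoding is in a variable named encoding and the raw data is in byteArray):
// Skip the encoding's preamble (BOM) if the data starts with it.
byte[] preamble = encoding.GetPreamble();
int offset = preamble.Length;
if (offset == 0 || byteArray.Length < offset)
{
    offset = 0;
}
else
{
    for (int i = 0; i < preamble.Length; i++)
    {
        if (byteArray[i] != preamble[i]) { offset = 0; break; }
    }
}
string result = encoding.GetString(byteArray, offset, byteArray.Length - offset);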

C# Read and replace binary data in text file

I have a file that contains text data and binary data. This may not be a good idea, but there's nothing I can do about it.
I know the start and end positions of the binary data.
What would be the best way to read the binary data between those positions, make a Base64 string out of it, and then write it back to the position it came from?
EDIT: The Base64-encoded string won't be the same length as the binary data, so I might have to pad the Base64 string to the binary data's length.
int binaryStart = 100;
int binaryEnd = 150;

//buffer for the remaining data, so it can be re-appended after inserting the base64 string
byte[] dataTailBuffer = null;
string base64String = null;

//get the binary data and convert it to a base64 string
using (System.IO.Stream fileStream = new FileStream(@"c:\Test Soap", FileMode.Open, FileAccess.Read))
{
    using (System.IO.BinaryReader reader = new BinaryReader(fileStream))
    {
        reader.BaseStream.Seek(binaryStart, SeekOrigin.Begin);
        var buffer = new byte[binaryEnd - binaryStart];
        reader.Read(buffer, 0, buffer.Length);
        base64String = Convert.ToBase64String(buffer);
        if (reader.BaseStream.Position < reader.BaseStream.Length - 1)
        {
            dataTailBuffer = new byte[reader.BaseStream.Length - reader.BaseStream.Position];
            reader.Read(dataTailBuffer, 0, dataTailBuffer.Length);
        }
    }
}

//write the new base64 string at the specified location
using (System.IO.Stream fileStream = new FileStream(@"c:\Test Soap", FileMode.Open, FileAccess.Write))
{
    using (System.IO.BinaryWriter writer = new BinaryWriter(fileStream))
    {
        writer.Seek(binaryStart, SeekOrigin.Begin);
        writer.Write(base64String); //writer.Write(Convert.FromBase64String(base64String));
        if (dataTailBuffer != null)
        {
            writer.Write(dataTailBuffer, 0, dataTailBuffer.Length);
        }
    }
}
You'll want to use a FileStream object, and the Read(byte[], int, int) and Write(byte[], int, int) methods.
Although the point about base64 being bigger than binary is valid - you'll actually need to grab the data beyond the end point of what you want to replace, store it, write to the file with your new data, then write out the stored data after you finish.
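A sketch of that grab-store-write sequence on a single FileStream (the positions and file name are illustrative, taken from the question; note the file grows, since Base64 is longer than the binary it replaces):
using System;
using System.IO;
using System.Text;

int binaryStart = 100;
int binaryEnd = 150;

using (var fs = new FileStream(@"c:\Test Soap", FileMode.Open, FileAccess.ReadWrite))
{
    // Grab the binary section.
    var binary = new byte[binaryEnd - binaryStart];
    fs.Position = binaryStart;
    int read = 0;
    while (read < binary.Length)
        read += fs.Read(binary, read, binary.Length - read);

    // Store everything after it.
    var tail = new byte[fs.Length - fs.Position];
    int tailRead = 0;
    while (tailRead < tail.Length)
        tailRead += fs.Read(tail, tailRead, tail.Length - tailRead);

    // Overwrite the binary section with its Base64 form (raw ASCII bytes,
    // not BinaryWriter.Write(string), which would add a length prefix),
    // then re-append the stored tail.
    byte[] base64 = Encoding.ASCII.GetBytes(Convert.ToBase64String(binary));
    fs.Position = binaryStart;
    fs.Write(base64, 0, base64.Length);
    fs.Write(tail, 0, tail.Length);
}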
I trust you're not trying to mod exe files to write viruses here... ;)
Clearly, writing out base-64 in place of the binary data cannot work, since the base-64 will be longer. So the question is, what do you need to do this for?
I will speculate that you have inherited this terrible binary file format, and you would like to use a text-editor to edit the textual portions of this binary file. If that is the case, then perhaps a more robust round-tripping binary-to-text-to-binary conversion is what you need.
I recommend using base-64 for the binary portions, but the rest of the file should be wrapped up in XML, or some other format that would be easy to parse and interpret. XML is good, because the parsers for it are already available in the system.
<mydoc>
  <t>Original text</t>
  <b fieldId="1">base-64 binary</b>
  <t>Hello, world!</t>
  <b fieldId="2">928h982hr98h2984hf</b>
</mydoc>
This file can be easily created from your specification, and it can be easily edited in any text editor. Then the file can be converted back into the original format. If any text intrudes into the binary fields, then it can be truncated. Likewise, text that is too short could be padded with spaces.
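A minimal sketch of building such a wrapper with LINQ to XML (the boundaries, field layout, and file names are made up for illustration):
using System;
using System.IO;
using System.Text;
using System.Xml.Linq;

// Assumed known layout: [text][binary][text].
int binaryStart = 100, binaryEnd = 150;
byte[] raw = File.ReadAllBytes("original.dat");

string head = Encoding.ASCII.GetString(raw, 0, binaryStart);
string tail = Encoding.ASCII.GetString(raw, binaryEnd, raw.Length - binaryEnd);
byte[] middle = new byte[binaryEnd - binaryStart];
Array.Copy(raw, binaryStart, middle, 0, middle.Length);

var doc = new XElement("mydoc",
    new XElement("t", head),
    new XElement("b", new XAttribute("fieldId", 1), Convert.ToBase64String(middle)),
    new XElement("t", tail));
doc.Save("editable.xml"); // edit the <t> elements in any text editor, then convert back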

FileStream prepends junk characters while reading

I am reading a simple text file which contains a single line, using the FileStream class. But it seems FileStream.Read prepends some junk characters at the beginning.
Below is the code:
using (var _fs = File.Open(_idFilePath, FileMode.Open, FileAccess.ReadWrite, FileShare.Read))
{
    byte[] b = new byte[_fs.Length];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (_fs.Read(b, 0, b.Length) > 0)
    {
        Console.WriteLine(temp.GetString(b));
        Console.WriteLine(Encoding.ASCII.GetString(b));
    }
}
For example: my data in the text file is just "sample", but the above code returns
"?sample" and
"???sample"
What's the reason? Is it a start-of-file indicator? Is there a way to read only my actual content?
The byte order mark (BOM) consists of the Unicode character U+FEFF and is used to mark a file with the encoding used for it.
So if you correctly decode the file as UTF-8, you get that character as the first char of your string. If you incorrectly decode it as ANSI, you get 3 chars, since the UTF-8 encoding of U+FEFF is the byte sequence EF BB BF, which is 3 bytes.
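A quick demo of both facts:
using System;
using System.Text;

// The UTF-8 preamble is the BOM encoded as UTF-8: EF BB BF.
byte[] bom = Encoding.UTF8.GetPreamble();
Console.WriteLine(BitConverter.ToString(bom)); // "EF-BB-BF"

// Decoded as UTF-8, those 3 bytes are a single char, U+FEFF (65279).
Console.WriteLine((int)Encoding.UTF8.GetString(bom)[0]); // 65279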
But your whole code can be replaced with
File.ReadAllText(fileName, Encoding.UTF8)
and that should remove the BOM too. Or you can leave out the encoding parameter and let the function autodetect the encoding (for which it uses the BOM).
Could be the BOM, a.k.a. byte order mark.
You are reading the BOM from the stream. If you are reading text, try using a StreamReader, which will handle this automatically.
Try instead
using (StreamReader sr = new StreamReader(File.Open(path), Encoding.UTF8))
It will definitely strip the BOM for you.

Using .NET, how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?
I have tried using a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and then using Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.
What step am I missing?
You need to get the proper Encoding object. ASCII is just as it's named: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then this is likely easier than dealing with the byte arrays directly.
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
    Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
        outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}
However, if you want to have the byte arrays yourself, it's easy enough to do with Encoding.Convert.
byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
Encoding.UTF8, data);
It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.
In the interest of fully exploring the issue, something like this would work:
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
    System.IO.FileMode.Open,
    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];
    int readLength = 0;
    while (readLength < buffer.Length)
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
        Encoding.UTF8, buffer);
    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
        System.IO.FileMode.Create,
        System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}
In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.
Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
If the files are relatively small (say, ~10 megabytes), you'll only need two lines of code:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
System.IO.File.WriteAllText(outPath, txt);
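If you happen to target .NET 5 or later (an assumption about your project), Encoding.Latin1 is a built-in alias for ISO 8859-1, so the lookup by name can be dropped:
string txt = System.IO.File.ReadAllText(inpPath, System.Text.Encoding.Latin1); // .NET 5+ only
System.IO.File.WriteAllText(outPath, txt); // WriteAllText defaults to UTF-8 without BOM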
