FileStream prepends junk characters while reading - C#

I am reading a simple text file which contains a single line, using the FileStream class. But it seems FileStream.Read prepends some junk characters at the beginning.
Below is the code.
using (var _fs = File.Open(_idFilePath, FileMode.Open, FileAccess.ReadWrite, FileShare.Read))
{
    byte[] b = new byte[_fs.Length];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (_fs.Read(b, 0, b.Length) > 0)
    {
        Console.WriteLine(temp.GetString(b));
        Console.WriteLine(ASCIIEncoding.ASCII.GetString(b));
    }
}
For example: my data in the text file is just "sample", but the above code returns
"?sample" and
"???sample"
What's the reason? Is it some start-of-file indicator? Is there a way to read only my actual content?

The byte order mark (BOM) consists of the Unicode character 0xFEFF and is used to mark a file with the encoding used for it.
So if you correctly decode the file as UTF-8, you get that character as the first char of your string. If you incorrectly decode it as ANSI, you get 3 chars, since the UTF-8 encoding of 0xFEFF is the byte sequence EF BB BF, which is 3 bytes.
But your whole code can be replaced with
File.ReadAllText(fileName, Encoding.UTF8)
and that should remove the BOM too. Or you can leave out the encoding parameter and let the function autodetect the encoding (for which it uses the BOM).
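Here is a minimal sketch (file name and contents are hypothetical, not from the question) contrasting the two behaviours: an encoding-aware reader like File.ReadAllText strips the BOM, while decoding the raw bytes with Encoding.GetString keeps it as the first character.
using System;
using System.IO;
using System.Text;

class BomDemo
{
    static void Main()
    {
        string path = "sample.txt";
        // Write "sample" with a UTF-8 BOM (EF BB BF) in front.
        File.WriteAllText(path, "sample", new UTF8Encoding(true));

        string viaReadAllText = File.ReadAllText(path, Encoding.UTF8);
        Console.WriteLine(viaReadAllText.Length);   // 6 - the BOM was stripped

        byte[] raw = File.ReadAllBytes(path);
        string viaGetString = Encoding.UTF8.GetString(raw);
        Console.WriteLine(viaGetString.Length);     // 7 - U+FEFF kept as the first char
    }
}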

Could be the BOM - a.k.a. the byte order mark.

You are reading the BOM from the stream. If you are reading text, try using a StreamReader, which will handle this automatically.

Try instead:
using (StreamReader sr = new StreamReader(File.OpenRead(path), Encoding.UTF8))
It will definitely strip the BOM for you.

Related

Encoding detection for string data in a byte[] succeeds, and afterwards all string comparisons fail

How it is all set up:
I receive a byte[] which contains CSV data
I don't know the encoding (it should be Unicode / UTF-8)
I need to detect the encoding or fallback to a default (the text may contain umlauts, so the encoding is important)
I need to read the header line and compare it with defined strings
After a short search on how to get a string out of a byte[], I found How to convert byte[] to string?, which suggested using something like
string result = System.Text.Encoding.UTF8.GetString(byteArray);
I now use this helper to detect the encoding, and afterwards the Encoding.GetString method to read the string, like so:
string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);
But when I now try to compare values from this result string with static strings in my code, all comparisons fail!
// header is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
    switch (headers[i].Trim().ToLower()) {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"
While this seems to be a problem with the encoding of both strings, my question is:
How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?
Edit
The code supplied above was working as long as the string data came from a file that was saved this way:
string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;
while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();
and afterwards read out with
File.ReadAllText()
This:
A. forces the file to be Unicode (the ANSI format kills all umlauts), and
B. requires the written file to be accessible.
Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before, and the strings look identical. But they are not.
Note: if I use an ANSI-encoded file, which uses Encoding.Default, everything works fine.
Edit 2
While ANSI-encoded data works, the UTF-8 encoded data (Notepad++ shows only "UTF-8", not "UTF-8 without BOM") starts with char[0]: 65279.
So where is my error? Because I guess System.Text.Encoding.UTF8.GetString(byteArray) is working the right way.
Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130). You could:
string result;
using (var memoryStream = new MemoryStream(byteArray))
{
    result = new StreamReader(memoryStream).ReadToEnd();
}
The StreamReader will autodetect the encoding (your encoding detector is a copy of StreamReader.DetectEncoding()).
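As a minimal sketch (the header bytes here are made up to mimic the question), the BOM survives Encoding.UTF8.GetString but is consumed by the StreamReader, so the header comparison then succeeds:
using System;
using System.IO;
using System.Linq;
using System.Text;

class HeaderCompareDemo
{
    static void Main()
    {
        // UTF-8 bytes with a BOM (EF BB BF) in front of a fake CSV header line.
        byte[] data = new UTF8Encoding(true).GetPreamble()
            .Concat(Encoding.UTF8.GetBytes("Number;Name"))
            .ToArray();

        string withBom = Encoding.UTF8.GetString(data);
        Console.WriteLine((int)withBom[0]);                       // 65279 (U+FEFF)

        string viaReader;
        using (var memoryStream = new MemoryStream(data))
        using (var reader = new StreamReader(memoryStream))
        {
            viaReader = reader.ReadToEnd();                       // BOM stripped
        }

        string header = viaReader.Split(';')[0];
        Console.WriteLine(header.Trim().ToLower() == "number");   // True
    }
}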

BinaryReader in C# reads '\0' between all characters of a string

I am trying to write and read a binary file using the C# BinaryWriter and BinaryReader classes.
When I store a string in the file, it is stored properly, but when I try to read it back, the returned string has a '\0' character at every alternate position within the string.
Here is the code:
public void writeBinary(BinaryWriter bw)
{
    bw.Write("Hello");
}

public void readBinary(BinaryReader br)
{
    String s;
    s = br.ReadString();
}
Here s gets the value "H\0e\0l\0l\0o\0".
You are using different encodings when reading and writing the file.
You are using UTF-16 when writing the file, so each character ends up as a 16-bit character code, i.e. two bytes.
You are using UTF-8 or one of the 8-bit encodings when reading the file, so each byte ends up as one character.
Pick one encoding and use it for both reading and writing the file.
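A minimal sketch (the file name is illustrative) of passing the same Encoding to both BinaryWriter and BinaryReader so the length-prefixed string round-trips cleanly:
using System;
using System.IO;
using System.Text;

class BinaryStringDemo
{
    static void Main()
    {
        using (var fs = new FileStream("data.bin", FileMode.Create))
        using (var bw = new BinaryWriter(fs, Encoding.UTF8))
        {
            bw.Write("Hello");               // length-prefixed UTF-8 string
        }

        using (var fs = new FileStream("data.bin", FileMode.Open))
        using (var br = new BinaryReader(fs, Encoding.UTF8))
        {
            string s = br.ReadString();      // "Hello" - no interleaved '\0'
            Console.WriteLine(s);
        }
    }
}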

Unicode surrogate character encoding in C#

I've got a problem with Unicode characters. When I want to encode a surrogate character (between D800 and DFFF), it gets encoded as FFFD. I used the Encoding.Unicode.GetString() method and it doesn't work, and the Decoder.GetChars() method doesn't work with every surrogate character either.
I use the following code.
Encoding code:
string unicodeChars = "a\uD800\uDA65";
FileStream stream = new FileStream(@"unicode_encoding.txt", FileMode.Create, FileAccess.Write);
byte[] buffer = Encoding.Unicode.GetBytes(unicodeChars);
stream.Write(buffer, 0, buffer.Length);
stream.Close();
Decoding code:
string decodedUnicodeChars;
FileStream stream2 = new FileStream(@"unicode_encoding.txt", FileMode.Open, FileAccess.Read);
StreamReader reader = new StreamReader(stream2, Encoding.Unicode);
decodedUnicodeChars = reader.ReadToEnd();
foreach (char c in decodedUnicodeChars)
{
    Console.Write("{0} ", Convert.ToInt32(c).ToString("X4"));
}
Output is:
0061 FFFD FFFD
string unicodeChars = "a\uD800\uDA65";
This is a case of GIGO: Garbage In, Garbage Out. The surrogate pair is not valid; the second character must be a low surrogate in the range \uDC00..\uDFFF.
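For comparison, a minimal sketch with a valid pair (U+10425 is encoded as \uD801\uDC25) round-trips without any U+FFFD replacement characters:
using System;
using System.IO;
using System.Text;

class SurrogateDemo
{
    static void Main()
    {
        // 'a' followed by the valid surrogate pair for U+10425.
        string unicodeChars = "a\uD801\uDC25";
        byte[] buffer = Encoding.Unicode.GetBytes(unicodeChars);
        File.WriteAllBytes(@"unicode_encoding.txt", buffer);

        string decoded;
        using (var reader = new StreamReader(@"unicode_encoding.txt", Encoding.Unicode))
        {
            decoded = reader.ReadToEnd();
        }

        foreach (char c in decoded)
        {
            Console.Write("{0} ", Convert.ToInt32(c).ToString("X4"));
        }
        // Prints: 0061 D801 DC25 - no FFFD
    }
}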

How do I read any n characters from a file?

How do I read any n characters from a file with C#, regardless of the encoding?
I mean not reading bytes! I just want to read one character (in any encoding) at a time.
Use TextReader.Read(), probably via StreamReader, which inherits from TextReader.
Try this:
char[] c = new char[5];
using (StreamReader streamReader = File.OpenText(@"c:\test.txt"))
{
    streamReader.Read(c, 0, c.Length);
}
Update:
I just realized that this uses UTF-8 encoding only, and no further parameters are available to specify the encoding.
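If the encoding matters, one of the StreamReader constructors that takes an Encoding can be used instead; a minimal sketch (path and encoding are illustrative):
using System;
using System.IO;
using System.Text;

class ReadNCharsDemo
{
    static void Main()
    {
        char[] c = new char[5];
        using (var streamReader = new StreamReader(@"c:\test.txt", Encoding.GetEncoding("iso-8859-1")))
        {
            // Read may return fewer characters near the end of the file.
            int read = streamReader.Read(c, 0, c.Length);
            Console.WriteLine(new string(c, 0, read));
        }
    }
}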
StreamReader reader = new StreamReader("date.txt");
string txt = reader.ReadToEnd();
char c = txt[num];   // the n-th character
It's not the most efficient, but neither is C#, so :/

Using .NET how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?
I have tried using a StreamReader with ASCIIEncoding, and then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and then using Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are being rendered as question marks.
What step am I missing?
You need to get the proper Encoding object. ASCII is just as it's named: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then this is likely easier than dealing with the byte arrays directly.
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
    Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
        outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}
However, if you want to have the byte arrays yourself, it's easy enough to do with Encoding.Convert.
byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
Encoding.UTF8, data);
It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.
In the interest of fully exploring the issue, something like this would work:
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
    System.IO.FileMode.Open,
    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];
    int readLength = 0;
    while (readLength < buffer.Length)
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);

    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
        Encoding.UTF8, buffer);

    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
        System.IO.FileMode.Create,
        System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}
In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.
Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
If the files are relatively small (say, ~10 megabytes), you'll only need two lines of code:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
System.IO.File.WriteAllText(outPath, txt);
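Note that File.WriteAllText without an encoding argument writes UTF-8 without a BOM; if you want to be explicit, or to emit a BOM, pass an encoding. A small variation of the same two-liner (paths are illustrative):
using System.IO;
using System.Text;

class ConvertLatin1ToUtf8
{
    static void Main()
    {
        string inpPath = "latin1.txt";   // hypothetical ISO 8859-1 input
        string outPath = "utf8.txt";     // hypothetical UTF-8 output

        string txt = File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
        File.WriteAllText(outPath, txt, new UTF8Encoding(true));   // true => write a BOM
    }
}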
