Why does text from Assembly.GetManifestResourceStream() start with three junk characters? - c#

I have a SQL file added to my VS.NET 2008 project as an embedded resource. Whenever I use the following code to read the file's content, the string returned always starts with three junk characters and then the text I expect. I assume this has something to do with the Encoding.Default I am using, but that is just a guess. Why does this text keep showing up? Should I just trim off the first three characters or is there a more informed approach?
public string GetUpdateRestoreSchemaScript()
{
var type = GetType();
var a = Assembly.GetAssembly(type);
var script = "UpdateRestoreSchema.sql";
var resourceName = String.Concat(type.Namespace, ".", script);
using(Stream stream = a.GetManifestResourceStream(resourceName))
{
byte[] buffer = new byte[stream.Length];
stream.Read(buffer, 0, buffer.Length);
// UPDATE: Should be Encoding.UTF8
return Encoding.Default.GetString(buffer);
}
}
Update:
I now know that my code works as expected if I simply change the last line to return a UTF-8 encoded string. It will always be true for this embedded file, but will it always be true? Is there a way to test any buffer to determine its encoding?

Probably the file is in utf-8 encoding and Encoding.Default is ASCII. Why don't you use a specific encoding?
Edit to answer a comment:
In order to guess the file encoding you could look for BOM at the start of the stream. If it exists, it helps, if not then you can only guess or ask user.

if you try to load xml from assembly you actually need to inspect and skip the byte order mark bytes (drove me nuts):
....
byte[] data;
using (var stream = assembly.GetManifestResourceStream(filename))
{
var length = stream.Length;
data = new byte[length];
stream.Read(data, 0, (int) length);
}
if (!HasUtf8ByteOrderMark(data))
{
throw new InvalidOperationException("Expected UTF8 byte order mark EF BB BF");
}
return Encoding.UTF8.GetChars(data.Skip(3).ToArray());
And
static bool HasUtf8ByteOrderMark(byte[] data)
{
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
return data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2];
}
More information here

I had the same problem in net.core
You can let streamreader do the encoding
using (var stream = = a.GetManifestResourceStream(resourceName))
using (var reader = new StreamReader(stream))
return reader.ReadToEnd();

Related

How can i read the end of a file in c#?

i searched in stackoverflow and got one way but this method only let me to write word by word in the console. My goal is to get the end of my file but get the complete result not char by char.
This code only show me char by char the end of my file:
using (var reader = new StreamReader("file.dll")
{
if (reader.BaseStream.Length > 1024)
{
reader.BaseStream.Seek(-1024, SeekOrigin.End);
}
string line;
while ((line = reader.ReadLine()) != null)
{
Console.WriteLine(line);
Console.ReadKey();
}
}
I was trying to get something like this, it's c++ but i was trying to get the same result in c#.
QFile *archivo;
archivo = new QFile();
archivo->setFileName("file.dll");
archivo->open(QFile::ReadOnly);
archivo->seek(archivo->size() - 1024);
trama = archivo->read(1024);
It's possible to get the complete result of the end of my file in c#?
If the file is line-delimited text file, you can use ReadAllLines.
string[] lines = System.IO.File.ReadAllLines("file.txt");
If it's a binary file, you can use ReadAllBytes. Shocker, I know.
byte[] data = System.IO.File.ReadAllBytes("file.dll");
if you want to be able to seek first (e.g. if you want only the last 1024 bytes of the file) you can use the stream's Read method. Again, crazy.
reader.BaseStream.Seek(-1024, SeekOrigin.End);
var chars = new char[1024];
reader.Read(chars, 0, 1024);
And before you ask, you can convert the characters to a string by passing them to the constructor:
char[] chars = new char[1024];
string s = new string(chars);
Console.WriteLine(s);
Not sure what it'll look like, since you're reading characters from a binary file, but good luck. My guess is you should be reading bytes instead though:
reader.BaseStream.Seek(-1024, SeekOrigin.End);
var bytes = new byte[1024];
reader.BaseStream.Read(bytes, 0, 1024);
(Notice you don't even need the StreamReader, since the FileStream (your base stream) exposes the Read method you need).

How do I print the content of a Stream when the content type is application/octet-stream?

I'm trying to intercept .NET Remoting Request/Responses by implementing a ServerChannelSink.
All is well, apart from the fact that I can't seem to decode the stream into a string. How do I do this?
Basically, in the watch window I can see that a value has been assigned to my variable after running the code: -
But if I open the Text Visualizer it is empty.
Similarly if I try to write the string to the Output window I don't get any lines written.
Here is the code that I'm using:
private static void PrintStream(TextWriter output, ref Stream stream)
{
// If we can't reset the stream's position after printing its content,
// we must make a copy.
if (!stream.CanSeek)
stream = CopyStream(stream);
long startPosition = stream.Position;
byte[] buffer = new byte[stream.Length];
stream.Read(buffer, 0, (int)stream.Length);
System.Text.ASCIIEncoding enc = new System.Text.ASCIIEncoding();
string request = enc.GetString(buffer, 0, buffer.Length);
output.WriteLine(request);
output.WriteLine();
// set stream to previous position for message processing
stream.Seek(startPosition, SeekOrigin.Begin);
}
I've also tried using a StreamReader with the same result:
private static void PrintStream(TextWriter output, ref Stream stream)
{
// If we can't reset the stream's position after printing its content,
// we must make a copy.
if (!stream.CanSeek)
stream = CopyStream(stream);
long startPosition = stream.Position;
StreamReader sr = new StreamReader(stream);
String line;
while ((line = sr.ReadLine()) != null)
{
output.WriteLine(line);
}
stream.Position = startPosition;
}
application/octet-stream means binary. Your request variable contains binary data only some of which converts to human readable text so you cannot convert it to a string.
The best you could do is use Convert.ToBase64String to convert it to base 64 but it won't be human readable. Converting it to ASCII will corrupt the data.
Thanks to paqogomez answer about \0's being interpreted as the end of the string, I just added the following: -
request = request.Replace("\0", "");
I now get this in the output window which is perfect for my purposes, thanks.
----------Request Headers-----------
__ConnectionId: 16
__IPAddress: 127.0.0.1
__RequestUri: /VisaOM.Server.ClientServices.Services
Content-Type: application/octet-stream
__CustomErrorsEnabled: False
----------Request Message-----------
get_SecurityServiceszVisaOM.Client.Services.IServices, VisaOM.Client.Services.Interfaces,
Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
------End of Request Message--------

Zip file with utf-8 file names

In my website i have option to download all images uploaded by users. The problem is in images with hebrew names (i need original name of file). I tried to decode file names but this is not helping. Here is a code :
using ICSharpCode.SharpZipLib.Zip;
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(file.Name);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string name = iso.GetString(isoBytes);
var entry = new ZipEntry(name + ".jpg");
zipStream.PutNextEntry(entry);
using (var reader = new System.IO.FileStream(file.Name, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
byte[] buffer = new byte[ChunkSize];
int bytesRead;
while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
{
byte[] actual = new byte[bytesRead];
Buffer.BlockCopy(buffer, 0, actual, 0, bytesRead);
zipStream.Write(actual, 0, actual.Length);
}
}
After utf-8 encoding i get hebrew file names like this : ??????.jpg
Where is my mistake?
Unicode (UTF-8 is one of the binary encoding) can represent more characters than the other 8-bit encoding. Moreover, you are not doing a proper conversion but a re-interpretation, which means that you get garbage for your filenames. You should really read the article from Joel on Unicode.
...
Now that you've read the article, you should know that in C# string can store unicode data, so you probably don't need to do any conversion of file.Name and can pass this directly to ZipEntry constructor if the library does not contains encoding handling bugs (this is always possible).
Try using
ZipStrings.UseUnicode = true;
It should be a part of the ICSharpCode.SharpZipLib.Zip namespace.
After that you can use something like
var newZipEntry = new ZipEntry($"My ünicödë string.pdf");
and add the entry as normal to the stream. You shouldn't need to do any conversion of the string before that in C#.
You are doing wrong conversion, since strings in C# are already unicode.
What tools do you use to check file names in archive?
By default Windows ZIP implementations use system DOS encoding for file names, while other implementations can use other encoding.

C#: how to read a line from a stream and then start reading it from beginning?

I need to read the first line from a stream to determine file's encoding, and then recreate the stream with that Encoding
The following code does not work correctly:
var r = response.GetResponseStream();
var sr = new StreamReader(r);
string firstLine = sr.ReadLine();
string encoding = GetEncodingFromFirstLine(firstLine);
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();
The text variable doesn't contain the whole text. For some reason the first line and several lines after it are skipped.
I tried everything: closing the StreamReader, resetting it, calling a separate GetResponseStream... but nothing worked.
I can't get the response stream again as I'm getting this file from the internet, and redownloading it again would be bad performance wise.
Update
Here's what GetEncodingFromFirstLine() looks like:
public static string GetEncodingFromFirstLine(string line)
{
int encodingIndex = line.IndexOf("encoding=");
if (encodingIndex == -1)
{
return "utf-8";
}
return line.Substring(encodingIndex + "encoding=".Length).Replace("\"", "").Replace("'", "").Replace("?", "").Replace(">", "");
}
...
// true
Assert.AreEqual("windows-1251", GetEncodingFromFirstLine(#"<?xml version=""1.0"" encoding=""windows-1251""?>"));
** Update 2 **
I'm working with XML files, and the text variable is parsed as XML:
var feedItems = XElement.Parse(text);
Well you're asking it to detect the encoding... and that requires it to read data. That's reading it from the underlying stream, and you're then creating another StreamReader around the same stream.
I suggest you:
Get the response stream
Retrieve all the data into a byte array (or MemoryStream)
Detect the encoding (which should be performed on bytes, not text - currently you're already assuming UTF-8 by creating a StreamReader)
Create a MemoryStream around the byte array, and a StreamReader around that
It's not clear what your GetEncodingFromFirstLine method does... or what this file really is. More information may make it easier to help you.
EDIT: If this is to load some XML, don't reinvent the wheel. Just give the stream to one of the existing XML-parsing classes, which will perform the appropriate detection for you.
You need to change the current position in the stream to the beginning.
r.Position = 0;
string text = new StreamReader(r, Encoding.GetEncoding(encoding)).ReadToEnd();
I found the answer to my question here:
How can I read an Http response stream twice in C#?
Stream responseStream = CopyAndClose(resp.GetResponseStream());
// Do something with the stream
responseStream.Position = 0;
// Do something with the stream again
private static Stream CopyAndClose(Stream inputStream)
{
const int readSize = 256;
byte[] buffer = new byte[readSize];
MemoryStream ms = new MemoryStream();
int count = inputStream.Read(buffer, 0, readSize);
while (count > 0)
{
ms.Write(buffer, 0, count);
count = inputStream.Read(buffer, 0, readSize);
}
ms.Position = 0;
inputStream.Close();
return ms;
}

How to find out the Encoding of a File? C#

Well i need to find out which of the files i found in some directory is UTF8 Encoded either ANSI encoded to change the Encoding in something else i decide later. My problem is.. how can i find out if a file is UTF8 or ANSI Encoded? Both of the encodings are actually posible in my files.
There is no reliable way to do it (since the file might be just random binary), however the process done by Windows Notepad software is detailed in Micheal S Kaplan's blog:
http://www.siao2.com/2007/04/22/2239345.aspx
Check the first two bytes;
1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
Check with IsTextUnicode to see if that function think it is BOM-less UTF-16 LE, if so, then treat it (and load it) as a "Unicode" file;
Check to see if it UTF-8 using the original RFC 2279 definition from 1998 and if it then treat it (and load it) as a "UTF-8" file;
Assume an ANSI file using the default system code page of the machine.
Now note that there are some holes
here, like the fact that step 2 does
not do quite as good with BOM-less
UTF-16 BE (there may even be a bug
here, I'm not sure -- if so it's a bug
in Notepad beyond any bug in
IsTextUnicode).
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
There is no great way to detect an
arbitrary ANSI code page, though there
have been some attempts to do this
based on the probability of certain
byte sequences in the middle of text.
We don't try that in StreamReader. A
few file formats like XML or HTML have
a way of specifying the character set
on the first line in the file, so Web
browsers, databases, and classes like
XmlTextReader can read these files
correctly. But many text files don't
have this type of information built
in.
Unicode/UTF8/UnicodeBigEndian are considered to be different types. ANSI is considered the same as UTF8.
public class EncodingType
{
public static System.Text.Encoding GetType(string FILE_NAME)
{
FileStream fs = new FileStream(FILE_NAME, FileMode.Open, FileAccess.Read);
Encoding r = GetType(fs);
fs.Close();
return r;
}
public static System.Text.Encoding GetType(FileStream fs)
{
byte[] Unicode = new byte[] { 0xFF, 0xFE, 0x41 };
byte[] UnicodeBIG = new byte[] { 0xFE, 0xFF, 0x00 };
byte[] UTF8 = new byte[] { 0xEF, 0xBB, 0xBF }; //with BOM
Encoding reVal = Encoding.Default;
BinaryReader r = new BinaryReader(fs, System.Text.Encoding.Default);
int i;
int.TryParse(fs.Length.ToString(), out i);
byte[] ss = r.ReadBytes(i);
if (IsUTF8Bytes(ss) || (ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF))
{
reVal = Encoding.UTF8;
}
else if (ss[0] == 0xFE && ss[1] == 0xFF && ss[2] == 0x00)
{
reVal = Encoding.BigEndianUnicode;
}
else if (ss[0] == 0xFF && ss[1] == 0xFE && ss[2] == 0x41)
{
reVal = Encoding.Unicode;
}
r.Close();
return reVal;
}
private static bool IsUTF8Bytes(byte[] data)
{
int charByteCounter = 1; 
byte curByte;
for (int i = 0; i < data.Length; i++)
{
curByte = data[i];
if (charByteCounter == 1)
{
if (curByte >= 0x80)
{
while (((curByte <<= 1) & 0x80) != 0)
{
charByteCounter++;
}
 
if (charByteCounter == 1 || charByteCounter > 6)
{
return false;
}
}
}
else
{
if ((curByte & 0xC0) != 0x80)
{
return false;
}
charByteCounter--;
}
}
if (charByteCounter > 1)
{
throw new Exception("Error byte format");
}
return true;
}
}
See these two codeproject articles - it is not trivial to find out file encoding simply from the file content:
Detect encoding from ByteOrderMarks (BOM)
Detect Encoding for In- and Outgoing Text
public static System.Text.Encoding GetEncoding(string filepath, Encoding defaultEncoding)
{
// will fall to defaultEncoding if file does not have BOM
using (var reader = new StreamReader(filepath, defaultEncoding, true))
{
reader.Peek(); //need it
return reader.CurrentEncoding;
}
}
Check Byte Order Mark (BOM).
To see the BOM you need to see file in a hexadecimal view.
Notepad show the file encoding at status bar, but it can be just estimated, if the file hasn't the BOM set.

Categories