How to find out the Encoding of a File in C#?

I need to find out which of the files found in some directory are UTF-8 encoded and which are ANSI encoded, so that I can convert them to another encoding I'll decide on later. My problem is: how can I find out whether a file is UTF-8 or ANSI encoded? Both encodings are actually possible in my files.

There is no reliable way to do it (since the file might be just random binary), but the process used by Windows Notepad is detailed in Michael S. Kaplan's blog:
http://www.siao2.com/2007/04/22/2239345.aspx
1. Check the first two bytes:
   - If there is a UTF-16 LE BOM, treat it (and load it) as a "Unicode" file;
   - If there is a UTF-16 BE BOM, treat it (and load it) as a "Unicode (Big Endian)" file;
   - If the first two bytes look like the start of a UTF-8 BOM, check the next byte; if we have a full UTF-8 BOM, treat it (and load it) as a "UTF-8" file;
2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE; if so, treat it (and load it) as a "Unicode" file;
3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998; if it is, treat it (and load it) as a "UTF-8" file;
4. Assume an ANSI file using the default system code page of the machine.
Now note that there are some holes here, like the fact that step 2 does not do quite as good with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).
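Step 2's IsTextUnicode check can be reproduced from C# via P/Invoke. A minimal sketch, assuming Windows (the 256-byte sample size and the helper name are my own; passing IntPtr.Zero asks the function to run all of its tests):
using System;
using System.IO;
using System.Runtime.InteropServices;

static class UnicodeSniffer
{
    [DllImport("Advapi32.dll")]
    private static extern bool IsTextUnicode(byte[] buf, int len, IntPtr lpiResult);

    public static bool LooksLikeBomlessUtf16Le(string path)
    {
        byte[] head = new byte[256];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(head, 0, head.Length);
        // True if the Win32 heuristics think the sample is UTF-16 LE text.
        return IsTextUnicode(head, read, IntPtr.Zero);
    }
}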

http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.

Unicode (UTF-16 LE), UTF-8 and Unicode Big Endian (UTF-16 BE) are detected as distinct types below. ANSI is treated the same as UTF-8, since a pure-ASCII ANSI file is also valid UTF-8.
public class EncodingType
{
    public static System.Text.Encoding GetType(string FILE_NAME)
    {
        using (FileStream fs = new FileStream(FILE_NAME, FileMode.Open, FileAccess.Read))
        {
            return GetType(fs);
        }
    }

    public static System.Text.Encoding GetType(FileStream fs)
    {
        // Read the whole stream (fs.Length fits in an int for files under 2 GB).
        byte[] ss = new byte[(int)fs.Length];
        fs.Read(ss, 0, ss.Length);

        if (ss.Length >= 3 && ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF)
        {
            return Encoding.UTF8;             // UTF-8 BOM
        }
        if (ss.Length >= 2 && ss[0] == 0xFE && ss[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode; // UTF-16 BE BOM
        }
        if (ss.Length >= 2 && ss[0] == 0xFF && ss[1] == 0xFE)
        {
            return Encoding.Unicode;          // UTF-16 LE BOM
        }
        if (IsUTF8Bytes(ss))
        {
            return Encoding.UTF8;             // no BOM, but the content is valid UTF-8
        }
        return Encoding.Default;              // fall back to the system ANSI code page
    }

    private static bool IsUTF8Bytes(byte[] data)
    {
        int charByteCounter = 1; // bytes remaining in the current UTF-8 sequence
        foreach (byte b in data)
        {
            byte curByte = b;
            if (charByteCounter == 1)
            {
                if (curByte >= 0x80)
                {
                    // Count the leading 1-bits to get the length of the sequence.
                    while (((curByte <<= 1) & 0x80) != 0)
                    {
                        charByteCounter++;
                    }
                    // A lead byte must announce a sequence of 2 to 6 bytes.
                    if (charByteCounter == 1 || charByteCounter > 6)
                    {
                        return false;
                    }
                }
            }
            else
            {
                // Continuation bytes must match the pattern 10xxxxxx.
                if ((curByte & 0xC0) != 0x80)
                {
                    return false;
                }
                charByteCounter--;
            }
        }
        // A multi-byte sequence truncated at the end of the data is not valid UTF-8.
        return charByteCounter == 1;
    }
}
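A quick usage example (the path is a placeholder):
Encoding enc = EncodingType.GetType(@"C:\temp\sample.txt");
Console.WriteLine(enc.EncodingName);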

See these two CodeProject articles; it is not trivial to find out a file's encoding simply from the file content:
Detect encoding from ByteOrderMarks (BOM)
Detect Encoding for In- and Outgoing Text

public static System.Text.Encoding GetEncoding(string filepath, Encoding defaultEncoding)
{
    // Falls back to defaultEncoding if the file does not have a BOM.
    using (var reader = new StreamReader(filepath, defaultEncoding, detectEncodingFromByteOrderMarks: true))
    {
        reader.Peek(); // forces the reader to inspect the stream so CurrentEncoding is resolved
        return reader.CurrentEncoding;
    }
}
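For example (the path and the Windows-1252 fallback are illustrative; on .NET Core, code page 1252 requires the System.Text.Encoding.CodePages package):
Encoding enc = GetEncoding(@"C:\temp\sample.txt", Encoding.GetEncoding(1252));
Console.WriteLine(enc.EncodingName); // the BOM-detected encoding, or the fallback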
Check the Byte Order Mark (BOM).
To see the BOM you need to view the file in a hexadecimal view.
Notepad shows the file's encoding in the status bar, but when the file has no BOM the value is only an estimate.

Related

Detecting utf-8 without BOM or with BOM

I'm building a compression program. I want to use LZW for UTF-8 files (any UTF-8 files) and BZip2 for the others (usually random binary files). I can't find a method that determines whether a file is UTF-8 or not.
I tried this and many other methods from all over Stack Overflow, but they don't work for me.
I can share examples of files that should be recognized as UTF-8 and files that should be recognized as "others".
else if (args[0] != null && args[1] != null)
{
    if (/* random binary detected */)
    {
        Console.WriteLine("Started Bzip");
        byte[] res = new Bzip2Compressor(65).Compress(File.ReadAllBytes(args[0]));
        File.WriteAllBytes(args[1], res);
        Console.WriteLine("Done!");
        return;
    }
    else // for UTF-8 cases (both with BOM and without)
    {
        Console.WriteLine("Started LZW");
        byte[] res = LZWCompressor.Compress(File.ReadAllBytes(args[0]));
        File.WriteAllBytes(args[1], res);
        Console.WriteLine("Done");
        return;
    }
}
Note: I only need to separate UTF-8 from all others.
EDIT: so I would like to check whether the first n symbols are invalid UTF-8;
var bytes = new byte[1024 * 1024];
new Random().NextBytes(bytes);
File.WriteAllBytes(@"PATH", bytes); // @"PATH" is a placeholder
The general goal is to detect files created like in the code above as NOT UTF-8 files.
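A minimal sketch of that check, assuming a strict decode of the first 4 KB is acceptable (the sample size is arbitrary, and a multi-byte character split at the chunk boundary can produce a false negative):
using System.IO;
using System.Text;

static bool LooksLikeUtf8(string path)
{
    // Strict decoder: throws DecoderFallbackException on any invalid byte sequence.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    byte[] head = new byte[4096];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(head, 0, head.Length);
    try
    {
        strictUtf8.GetCharCount(head, 0, read);
        return true;  // the sample decodes cleanly as UTF-8
    }
    catch (DecoderFallbackException)
    {
        return false; // invalid byte sequence: treat as "other"
    }
}
Random bytes, as in the generator above, will almost certainly hit an invalid sequence within the first few bytes and be routed to BZip2.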

Read stream from XmlReader, base64 decode it and write result to file

Basically, I want to extract the stream from the XmlReader and directly base64 decode it to a file.
The structure of the XML file can be seen here. To get the value I have to use ReadInnerXml(). Is it possible to use ReadValueChunk instead?
Here is my current code:
using (XmlReader reader = XmlReader.Create("/your/path/47311.xml"))
{
while(reader.Read())
{
if (reader.IsStartElement () && reader.NodeType == XmlNodeType.Element) {
switch (reader.Name) {
case "ttOutputRow":
reader.ReadToDescendant ("cKey");
switch (reader.ReadInnerXml ()) {
case "findMe":
reader.ReadToNextSibling ("cValue");
// here begins the interesting part
char[] buffer = new char[4096];
int charRead;
using (var destStream = File.OpenWrite ("/your/path/47311.jpg")) {
while ((charRead = reader.ReadValueChunk (buffer, 0, 4096)) != 0) {
byte[] decodedStream = System.Convert.FromBase64String (new string (buffer));
await destStream.WriteAsync(decodedStream, 0, decodedStream.Length);
Console.WriteLine ("in");
}
}
break;
default:
break;
}
break;
default:
break;
}
}
}
}
Currently, it doesn't read the value in.
Can't I use ReadValueChunk for this? How can I directly use the stream from the XmlReader without sacrificing too much memory?
Edit:
According to dbc I modified my code. This is what I currently use:
using (XmlReader reader = XmlReader.Create("test.xml"))
{
while(reader.Read())
{
if (reader.IsStartElement () && reader.NodeType == XmlNodeType.Element) {
switch (reader.Name) {
case "ttOutputRow":
reader.ReadToDescendant ("cKey");
switch (reader.ReadInnerXml ()) {
case "findMe":
reader.ReadToNextSibling ("cValue");
byte[] buffer = new byte[40960];
int readBytes = 0;
using (FileStream outputFile = File.OpenWrite ("test.jpg"))
using (BinaryWriter bw = new BinaryWriter(outputFile))
{
while ((readBytes = reader.ReadElementContentAsBase64(buffer, 0, 40960)) > 0) {
bw.Write (buffer, 0, readBytes);
Console.WriteLine ("in");
}
}
break;
default:
break;
}
break;
default:
break;
}
}
}
}
Here you can find a test file. The real file is a little bit bigger and therefore takes much more time.
The above code doesn't work as expected. It is very slow and the extracted image is mostly black (destroyed).
In order to give a definitive answer to your question I would need to see the XML you are trying to read. However, two points:
According to the documentation for Convert.FromBase64String:
The FromBase64String method is designed to process a single string that contains all the data to be decoded. To decode base-64 character data from a stream, use the System.Security.Cryptography.FromBase64Transform class.
Thus your problem may be with decoding the content in chunks rather than with reading it in chunks.
You can use XmlReader.ReadElementContentAsBase64 or XmlReader.ReadElementContentAsBase64Async for exactly this purpose. From the docs:
This method reads the element content, decodes it using Base64 encoding, and returns the decoded binary bytes (for example, an inline Base64-encoded GIF image) into the buffer.
In fact, the example in the documentation demonstrates how to extract a base64-encoded image from an XML file and write it to a binary file in chunks.
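For illustration, a minimal sketch of that chunked approach (the "cValue" element name and file names come from the question; the cKey matching from the original code is omitted, and error handling is skipped):
using System.IO;
using System.Xml;

using (XmlReader reader = XmlReader.Create("test.xml"))
using (FileStream output = File.OpenWrite("test.jpg"))
{
    byte[] buffer = new byte[4096];
    int bytesRead;
    if (reader.ReadToFollowing("cValue"))
    {
        // Decode the Base64 element content chunk by chunk and stream it to disk.
        while ((bytesRead = reader.ReadElementContentAsBase64(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, bytesRead);
        }
    }
}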

Zip file with utf-8 file names

On my website there is an option to download all images uploaded by users. The problem is with images that have Hebrew names (I need the original file name). I tried to decode the file names, but that did not help. Here is the code:
using ICSharpCode.SharpZipLib.Zip;
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
byte[] utfBytes = utf8.GetBytes(file.Name);
byte[] isoBytes = Encoding.Convert(utf8, iso, utfBytes);
string name = iso.GetString(isoBytes);

var entry = new ZipEntry(name + ".jpg");
zipStream.PutNextEntry(entry);
using (var reader = new System.IO.FileStream(file.Name, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
{
    byte[] buffer = new byte[ChunkSize];
    int bytesRead;
    while ((bytesRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        byte[] actual = new byte[bytesRead];
        Buffer.BlockCopy(buffer, 0, actual, 0, bytesRead);
        zipStream.Write(actual, 0, actual.Length);
    }
}
After the UTF-8 conversion I get Hebrew file names like this: ??????.jpg
Where is my mistake?
Unicode (UTF-8 is one of its binary encodings) can represent more characters than any single 8-bit encoding. Moreover, you are not doing a proper conversion but a re-interpretation, which means you get garbage for your filenames. You should really read the article from Joel on Unicode.
...
Now that you've read the article, you should know that in C# a string already stores Unicode data, so you probably don't need to do any conversion of file.Name and can pass it directly to the ZipEntry constructor, provided the library does not contain encoding-handling bugs (always possible).
Try using
ZipStrings.UseUnicode = true;
It should be a part of the ICSharpCode.SharpZipLib.Zip namespace.
After that you can use something like
var newZipEntry = new ZipEntry($"My ünicödë string.pdf");
and add the entry as normal to the stream. You shouldn't need to do any conversion of the string before that in C#.
You are doing the wrong conversion, since strings in C# are already Unicode.
What tool do you use to check the file names in the archive?
By default, Windows ZIP implementations use the system DOS (OEM) code page for file names, while other implementations may use other encodings.

C# - Check if File is Text Based

How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.
But not open such things as .doc or .pdf or .exe, etc.
In general: there is no way to tell.
A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally, someone could save a text file with a .doc extension (it is a document).
While you could open the file and look at some of the content, all such heuristics will sometimes fail (e.g. Notepad tries to do this; by careful selection of a few characters Notepad will guess wrong and display completely different content).
If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.
I guess you could just check the first 1000 (arbitrary number) characters and see whether there are unprintable characters, or whether they are all ASCII in a certain range. If the latter, assume it is text.
Whatever you do is going to be a guess.
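A rough sketch of that guess, assuming "printable" means 7-bit ASCII plus common whitespace (the 1000-byte sample size is the arbitrary number above):
using System.IO;

static bool LooksLikeText(string path)
{
    byte[] sample = new byte[1000];
    int read;
    using (var fs = File.OpenRead(path))
        read = fs.Read(sample, 0, sample.Length);
    for (int i = 0; i < read; i++)
    {
        byte b = sample[i];
        // Accept printable ASCII (0x20-0x7E) plus tab, LF and CR.
        bool printable = (b >= 0x20 && b < 0x7F) || b == 0x09 || b == 0x0A || b == 0x0D;
        if (!printable)
            return false; // unprintable byte: probably not plain ASCII text
    }
    return true;
}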
As others have pointed out, there is no absolute way to be sure. However, to determine whether a file is binary (which can be said to be easier than determining whether it is text), some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 chars for a NUL and, if it finds one, treats the file as binary. See here for more details.
Here is a similar C# solution I wrote that looks for a given number of consecutive NULs. If IsBinary returns false, then it is very likely your file is text based.
public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
    const int charsToCheck = 8000;
    const char nulChar = '\0';
    int nulCount = 0;
    using (var streamReader = new StreamReader(filePath))
    {
        for (var i = 0; i < charsToCheck; i++)
        {
            if (streamReader.EndOfStream)
                return false;
            if ((char)streamReader.Read() == nulChar)
            {
                nulCount++;
                if (nulCount >= requiredConsecutiveNul)
                    return true;
            }
            else
            {
                nulCount = 0;
            }
        }
    }
    return false;
}
To get the real type of a file, you must check its header (magic number), which doesn't change even when the extension is modified. You can get the header list here, and use something like this in your code:
using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    using (var reader = new BinaryReader(stream))
    {
        // Read the first two bytes of the file.
        // In this example I want to check if the file is a BMP,
        // whose header is the magic bytes 0x42 0x4D ("BM").
        byte b1 = reader.ReadByte();
        byte b2 = reader.ReadByte();
        if (b1 == 0x42 && b2 == 0x4D)
        {
            // it's a BMP file
        }
    }
}
I have the solution below, which works for me. It is a general solution that checks any file for binary content.
/// <summary>
/// This method checks whether the file is binary by testing that every byte is 7-bit ASCII.
/// </summary>
public bool CheckForBinary()
{
    using (Stream objStream = new FileStream("your file path", FileMode.Open, FileAccess.Read))
    {
        int a;
        // Iterate through the stream and check the value of each byte.
        while ((a = objStream.ReadByte()) != -1)
        {
            if (a > 127)
            {
                return true; // non-ASCII byte found: binary file
            }
        }
    }
    return false; // every byte was ASCII: text file
}
public bool IsTextFile(string FilePath)
{
    using (StreamReader reader = new StreamReader(FilePath))
    {
        int character;
        while ((character = reader.Read()) != -1)
        {
            // Control characters 1-7 and 14-25 almost never occur in text files.
            if ((character > 0 && character < 8) || (character > 13 && character < 26))
            {
                return false;
            }
        }
    }
    return true;
}

Why does text from Assembly.GetManifestResourceStream() start with three junk characters?

I have a SQL file added to my VS.NET 2008 project as an embedded resource. Whenever I use the following code to read the file's content, the string returned always starts with three junk characters and then the text I expect. I assume this has something to do with the Encoding.Default I am using, but that is just a guess. Why does this text keep showing up? Should I just trim off the first three characters or is there a more informed approach?
public string GetUpdateRestoreSchemaScript()
{
    var type = GetType();
    var a = Assembly.GetAssembly(type);
    var script = "UpdateRestoreSchema.sql";
    var resourceName = String.Concat(type.Namespace, ".", script);
    using (Stream stream = a.GetManifestResourceStream(resourceName))
    {
        byte[] buffer = new byte[stream.Length];
        stream.Read(buffer, 0, buffer.Length);
        // UPDATE: Should be Encoding.UTF8
        return Encoding.Default.GetString(buffer);
    }
}
Update:
I now know that my code works as expected if I simply change the last line to decode the buffer as UTF-8. That will always be correct for this embedded file, but will it be correct in general? Is there a way to test any buffer to determine its encoding?
Probably the file is in UTF-8 encoding, while Encoding.Default is not (it is the system's ANSI code page). Why don't you use a specific encoding?
Edit to answer a comment:
In order to guess the file encoding you could look for BOM at the start of the stream. If it exists, it helps, if not then you can only guess or ask user.
If you try to load XML from an assembly, you actually need to inspect and skip the byte order mark bytes (this drove me nuts):
....
byte[] data;
using (var stream = assembly.GetManifestResourceStream(filename))
{
    var length = stream.Length;
    data = new byte[length];
    stream.Read(data, 0, (int)length);
}
if (!HasUtf8ByteOrderMark(data))
{
    throw new InvalidOperationException("Expected UTF8 byte order mark EF BB BF");
}
return Encoding.UTF8.GetChars(data.Skip(3).ToArray()); // Skip(3) drops the BOM (needs System.Linq)
And
static bool HasUtf8ByteOrderMark(byte[] data)
{
    var bom = new byte[] { 0xEF, 0xBB, 0xBF };
    return data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2];
}
More information here
I had the same problem in .NET Core.
You can let StreamReader handle the encoding (it detects and skips the BOM by default):
using (var stream = a.GetManifestResourceStream(resourceName))
using (var reader = new StreamReader(stream))
    return reader.ReadToEnd();
