Detecting utf-8 without BOM or with BOM

Detecting utf-8 without BOM or with BOM - c#

I'm building a compression program. I want to use LWZ for utf-8 files (any urf-8 files) and BZip for others (usually random binary files). I can't find method to define is file utf8 or not.
I tried this and many other methods all over stackoverflow but they can't do it for me.
I can share examples of files that should be recognized as utf 8 and files that should be recognized as "others"
else if (args[0] != null && args[1] != null)
{
if (random binary detected)
{
Console.WriteLine("Started Bzip");
byte[] res = new Bzip2Compressor(65).Compress(File.ReadAllBytes(args[0]));
File.WriteAllBytes(args[1], res);
Console.WriteLine("Done!");
return;
}
else //for utf 8 cases (both with bom and without)
{
Console.WriteLine("Started LZW");
byte[] res = LZWCompressor.Compress(File.ReadAllBytes(args[0]));
File.WriteAllBytes(args[1], res);
Console.WriteLine("Done");
return;
}
}
Note: i only need to separate utf-8 and all others
EDIT: so i would like to check first n symbols to be invalid utf 8;
var bytes = new byte[1024 * 1024];
new Random().NextBytes(bytes);
File.WriteAllBytes(#"PATH", bytes);
General goal is to detected files cerated like in code above as NOT utf-8 files

Related

How can i read the end of a file in c#?

i searched in stackoverflow and got one way but this method only let me to write word by word in the console. My goal is to get the end of my file but get the complete result not char by char.
This code only show me char by char the end of my file:
using (var reader = new StreamReader("file.dll")
{
if (reader.BaseStream.Length > 1024)
{
reader.BaseStream.Seek(-1024, SeekOrigin.End);
}
string line;
while ((line = reader.ReadLine()) != null)
{
Console.WriteLine(line);
Console.ReadKey();
}
}
I was trying to get something like this, it's c++ but i was trying to get the same result in c#.
QFile *archivo;
archivo = new QFile();
archivo->setFileName("file.dll");
archivo->open(QFile::ReadOnly);
archivo->seek(archivo->size() - 1024);
trama = archivo->read(1024);
It's possible to get the complete result of the end of my file in c#?

If the file is line-delimited text file, you can use ReadAllLines.
string[] lines = System.IO.File.ReadAllLines("file.txt");
If it's a binary file, you can use ReadAllBytes. Shocker, I know.
byte[] data = System.IO.File.ReadAllBytes("file.dll");
if you want to be able to seek first (e.g. if you want only the last 1024 bytes of the file) you can use the stream's Read method. Again, crazy.
reader.BaseStream.Seek(-1024, SeekOrigin.End);
var chars = new char[1024];
reader.Read(chars, 0, 1024);
And before you ask, you can convert the characters to a string by passing them to the constructor:
char[] chars = new char[1024];
string s = new string(chars);
Console.WriteLine(s);
Not sure what it'll look like, since you're reading characters from a binary file, but good luck. My guess is you should be reading bytes instead though:
reader.BaseStream.Seek(-1024, SeekOrigin.End);
var bytes = new byte[1024];
reader.BaseStream.Read(bytes, 0, 1024);
(Notice you don't even need the StreamReader, since the FileStream (your base stream) exposes the Read method you need).

Decoding base64 file contents between PHP and C#

I need to serve an AES encrypted, base64 encoded file from PHP to a C# client (Mono, on various platforms). I've successfully got the AES encryption/decryption working fine but as soon as I attempt the base64 encoding/decoding I run into trouble. Both the examples below have the AES disabled, so that shouldn't be a factor.
My simplest test case, a Hello World string, works fine:
PHP serving output-
// Save encoded data to file
$data = base64_encode("Hello encryption world!!");
$file = fopen($targetPath, 'w');
fwrite($file, $data);
fclose($file);
// Later on, serve the file
header("Pragma: public");
header("Expires: 0");
header("Cache-Control: must-revalidate, post-check=0, pre-check=0");
header("Cache-Control: private",false);
header("Content-Type: application/octet-stream");
header("Content-Disposition: attachment; filename=".basename($product->PackageFilename($packageId)));
header("Content-Transfer-Encoding: binary");
header("Content-Length: ".filesize($targetPath));
ob_clean();
flush();
$handle = fopen($targetPath, "r");
fpassthru($handle);
fclose($handle);
C# decoding and using-
StreamReader reader = new StreamReader(stream);
char[] buffer = DecodeBuffer;
string decoded = "";
int read = 0;
while (0 < (read = reader.Read(buffer, 0, DecodeBufferSize)))
{
byte[] decodedBytes = Convert.FromBase64CharArray(buffer, 0, read);
decoded += System.Text.Encoding.UTF8.GetString(decodedBytes);
}
Log(decoded); // Correctly logs "Hello encryption world!!"
However once I start trying to do the same thing with the contents of a file, a FormatException: Invalid character found is thrown by Convert.FromBase64CharArray:
PHP serving output-
// Save encoded data to file
$data = base64_encode(file_get_contents($targetPath));
$file = fopen($targetPath, 'w');
fwrite($file, $data);
fclose($file);
// Later on, serve the file
// Same as above
C# decoding and using-
using (Stream file = File.Open(zipPath, FileMode.Create))
{
using (StreamReader reader = new StreamReader(stream))
{
char[] buffer = DecodeBuffer;
byte[] decodedBytes;
int read = 0;
while (0 < (read = reader.Read(buffer, 0, DecodeBufferSize)))
{
// Throws FormatException: Invalid character found
decodedBytes = Convert.FromBase64CharArray(buffer, 0, read);
file.Write(decodedBytes, 0, decodedBytes.Length);
}
}
}
Is there some kind of additional processing that should be done on larger data for base64 to be valid? Is it perhaps just not appropriate to be doing this with large binary data-and if so how else would you prevent potential problems with characters unsafe for transmission?

Your reading Base64 text code is not correct.
Base64 is text, so consider using text reader instead
Base64 may contain new lines/white spaces. I.e. it is custom to have split whole Base64 encoded value into 70-80 character long strings.
To verify if data in the file is correct read whole file as string (StreamReader.ReadToEnd) and convert to byte array (Convert.FromBase64String).
If file contains valid Base64 data and you can't read it as single string you should implement your own Base64 decoding or manually read correct number of non white space characters (multiple of 4) and decode such chunks.

Base64 encoding converts 3 octets into 4 encoded characters. Thus, the length of the data you provide for decoding needs to be a multiple of 4.
First, ensure that DecodeBufferSize is such a multiple of 4. Next, since StreamReader.Read does not guarantee that all the requested bytes will be read, you should continue reading into buffer until either it has been filled, or the end of the stream been reached.

C# - Check if File is Text Based

How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.
But not open such things as .doc or .pdf or .exe, etc.

In general: there is no way to tell.
A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally someone could save a text file as a .doc (it is a document).
While you could open the file and look at some of the content all such heuristics will sometimes fail (eg. notepad tries to do this, by careful selection of a few characters notepad will guess wrong and display completely different content).
If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.

I guess you could just check through the first 1000 (arbitrary number) characters and see if there are unprintable characters, or if they are all ascii in a certain range. If the latter, assume that it is text?
Whatever you do is going to be a guess.

As others have pointed out there is no absolute way to be sure. However, to determine if a file is binary (which can be said to be easier than determining if it is text) some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 chars for a NUL and if it finds one treats the file as binary. See here for more details.
Here is a similar C# solution I wrote that looks for a given number of required consecutive NUL. If IsBinary returns false then it is very likely your file is text based.
public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
const int charsToCheck = 8000;
const char nulChar = '\0';
int nulCount = 0;
using (var streamReader = new StreamReader(filePath))
{
for (var i = 0; i < charsToCheck; i++)
{
if (streamReader.EndOfStream)
return false;
if ((char) streamReader.Read() == nulChar)
{
nulCount++;
if (nulCount >= requiredConsecutiveNul)
return true;
}
else
{
nulCount = 0;
}
}
}
return false;
}

To get the real type of a file, you must check its header, which won't be changed even the extension is modified. You can get the header list here, and use something like this in your code:
using(var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
using(var reader = new BinaryReader(stream))
{
// read the first X bytes of the file
// In this example I want to check if the file is a BMP
// whose header is 424D in hex(2 bytes 6677)
string code = reader.ReadByte().ToString() + reader.ReadByte().ToString();
if (code.Equals("6677"))
{
//it's a BMP file
}
}
}

I have a below solution which works for me.This is general solution which check all types of Binary file.
/// <summary>
/// This method checks whether selected file is Binary file or not.
/// </summary>
public bool CheckForBinary()
{
Stream objStream = new FileStream("your file path", FileMode.Open, FileAccess.Read);
bool bFlag = true;
// Iterate through stream & check ASCII value of each byte.
for (int nPosition = 0; nPosition < objStream.Length; nPosition++)
{
int a = objStream.ReadByte();
if (!(a >= 0 && a <= 127))
{
break; // Binary File
}
else if (objStream.Position == (objStream.Length))
{
bFlag = false; // Text File
}
}
objStream.Dispose();
return bFlag;
}

public bool IsTextFile(string FilePath)
using (StreamReader reader = new StreamReader(FilePath))
{
int Character;
while ((Character = reader.Read()) != -1)
{
if ((Character > 0 && Character < 8) || (Character > 13 && Character < 26))
{
return false;
}
}
}
return true;
}

How to find out the Encoding of a File? C#

Well i need to find out which of the files i found in some directory is UTF8 Encoded either ANSI encoded to change the Encoding in something else i decide later. My problem is.. how can i find out if a file is UTF8 or ANSI Encoded? Both of the encodings are actually posible in my files.

There is no reliable way to do it (since the file might be just random binary), however the process done by Windows Notepad software is detailed in Micheal S Kaplan's blog:
http://www.siao2.com/2007/04/22/2239345.aspx
Check the first two bytes;
1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
Check with IsTextUnicode to see if that function think it is BOM-less UTF-16 LE, if so, then treat it (and load it) as a "Unicode" file;
Check to see if it UTF-8 using the original RFC 2279 definition from 1998 and if it then treat it (and load it) as a "UTF-8" file;
Assume an ANSI file using the default system code page of the machine.
Now note that there are some holes
here, like the fact that step 2 does
not do quite as good with BOM-less
UTF-16 BE (there may even be a bug
here, I'm not sure -- if so it's a bug
in Notepad beyond any bug in
IsTextUnicode).

http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
There is no great way to detect an
arbitrary ANSI code page, though there
have been some attempts to do this
based on the probability of certain
byte sequences in the middle of text.
We don't try that in StreamReader. A
few file formats like XML or HTML have
a way of specifying the character set
on the first line in the file, so Web
browsers, databases, and classes like
XmlTextReader can read these files
correctly. But many text files don't
have this type of information built
in.

Unicode/UTF8/UnicodeBigEndian are considered to be different types. ANSI is considered the same as UTF8.
public class EncodingType
{
public static System.Text.Encoding GetType(string FILE_NAME)
{
FileStream fs = new FileStream(FILE_NAME, FileMode.Open, FileAccess.Read);
Encoding r = GetType(fs);
fs.Close();
return r;
}
public static System.Text.Encoding GetType(FileStream fs)
{
byte[] Unicode = new byte[] { 0xFF, 0xFE, 0x41 };
byte[] UnicodeBIG = new byte[] { 0xFE, 0xFF, 0x00 };
byte[] UTF8 = new byte[] { 0xEF, 0xBB, 0xBF }; //with BOM
Encoding reVal = Encoding.Default;
BinaryReader r = new BinaryReader(fs, System.Text.Encoding.Default);
int i;
int.TryParse(fs.Length.ToString(), out i);
byte[] ss = r.ReadBytes(i);
if (IsUTF8Bytes(ss) || (ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF))
{
reVal = Encoding.UTF8;
}
else if (ss[0] == 0xFE && ss[1] == 0xFF && ss[2] == 0x00)
{
reVal = Encoding.BigEndianUnicode;
}
else if (ss[0] == 0xFF && ss[1] == 0xFE && ss[2] == 0x41)
{
reVal = Encoding.Unicode;
}
r.Close();
return reVal;
}
private static bool IsUTF8Bytes(byte[] data)
{
int charByteCounter = 1;　
byte curByte;
for (int i = 0; i < data.Length; i++)
{
curByte = data[i];
if (charByteCounter == 1)
{
if (curByte >= 0x80)
{
while (((curByte <<= 1) & 0x80) != 0)
{
charByteCounter++;
}
　
if (charByteCounter == 1 || charByteCounter > 6)
{
return false;
}
}
}
else
{
if ((curByte & 0xC0) != 0x80)
{
return false;
}
charByteCounter--;
}
}
if (charByteCounter > 1)
{
throw new Exception("Error byte format");
}
return true;
}
}

See these two codeproject articles - it is not trivial to find out file encoding simply from the file content:
Detect encoding from ByteOrderMarks (BOM)
Detect Encoding for In- and Outgoing Text

public static System.Text.Encoding GetEncoding(string filepath, Encoding defaultEncoding)
{
// will fall to defaultEncoding if file does not have BOM
using (var reader = new StreamReader(filepath, defaultEncoding, true))
{
reader.Peek(); //need it
return reader.CurrentEncoding;
}
}
Check Byte Order Mark (BOM).
To see the BOM you need to see file in a hexadecimal view.
Notepad show the file encoding at status bar, but it can be just estimated, if the file hasn't the BOM set.

Why does text from Assembly.GetManifestResourceStream() start with three junk characters?

I have a SQL file added to my VS.NET 2008 project as an embedded resource. Whenever I use the following code to read the file's content, the string returned always starts with three junk characters and then the text I expect. I assume this has something to do with the Encoding.Default I am using, but that is just a guess. Why does this text keep showing up? Should I just trim off the first three characters or is there a more informed approach?
public string GetUpdateRestoreSchemaScript()
{
var type = GetType();
var a = Assembly.GetAssembly(type);
var script = "UpdateRestoreSchema.sql";
var resourceName = String.Concat(type.Namespace, ".", script);
using(Stream stream = a.GetManifestResourceStream(resourceName))
{
byte[] buffer = new byte[stream.Length];
stream.Read(buffer, 0, buffer.Length);
// UPDATE: Should be Encoding.UTF8
return Encoding.Default.GetString(buffer);
}
}
Update:
I now know that my code works as expected if I simply change the last line to return a UTF-8 encoded string. It will always be true for this embedded file, but will it always be true? Is there a way to test any buffer to determine its encoding?

Probably the file is in utf-8 encoding and Encoding.Default is ASCII. Why don't you use a specific encoding?
Edit to answer a comment:
In order to guess the file encoding you could look for BOM at the start of the stream. If it exists, it helps, if not then you can only guess or ask user.

if you try to load xml from assembly you actually need to inspect and skip the byte order mark bytes (drove me nuts):
....
byte[] data;
using (var stream = assembly.GetManifestResourceStream(filename))
{
var length = stream.Length;
data = new byte[length];
stream.Read(data, 0, (int) length);
}
if (!HasUtf8ByteOrderMark(data))
{
throw new InvalidOperationException("Expected UTF8 byte order mark EF BB BF");
}
return Encoding.UTF8.GetChars(data.Skip(3).ToArray());
And
static bool HasUtf8ByteOrderMark(byte[] data)
{
var bom = new byte[] { 0xEF, 0xBB, 0xBF };
return data[0] == bom[0] && data[1] == bom[1] && data[2] == bom[2];
}
More information here

I had the same problem in net.core
You can let streamreader do the encoding
using (var stream = = a.GetManifestResourceStream(resourceName))
using (var reader = new StreamReader(stream))
return reader.ReadToEnd();

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Detecting utf-8 without BOM or with BOM - c#

Related

How can i read the end of a file in c#?

Decoding base64 file contents between PHP and C#

C# - Check if File is Text Based

How to find out the Encoding of a File? C#

Why does text from Assembly.GetManifestResourceStream() start with three junk characters?

Categories

Resources