How can I test whether a file that I'm opening in C# using FileStream is a "text type" file? I would like my program to open any file that is text based, for example, .txt, .html, etc.
But not open such things as .doc or .pdf or .exe, etc.
In general: there is no way to tell.
A text file stored in UTF-16 will likely look like binary if you open it with an 8-bit encoding. Equally someone could save a text file as a .doc (it is a document).
While you could open the file and inspect some of its content, all such heuristics will sometimes fail (e.g. Notepad tries to do this; with a careful selection of a few characters, Notepad will guess wrong and display completely different content).
If you have a specific scenario, rather than being able to open and process anything, you should be able to do much better.
I guess you could just check through the first 1000 (arbitrary number) characters and see if there are unprintable characters, or if they are all ASCII in a certain range. If the latter, assume that it is text?
Whatever you do is going to be a guess.
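A minimal sketch of that kind of check, assuming "printable" means the ASCII range 0x20-0x7E plus tab, line feed and carriage return. The class and method names are made up for illustration, and the 1000-byte limit is the same arbitrary number as above. It is still only a guess: valid UTF-8 or UTF-16 text containing non-ASCII characters will be rejected.

```csharp
using System;
using System.IO;

public static class TextHeuristic
{
    // Hypothetical helper: true when every byte in the sample is printable
    // ASCII (0x20-0x7E) or one of tab, line feed, carriage return.
    public static bool LooksLikeAsciiText(byte[] sample)
    {
        foreach (byte b in sample)
        {
            bool isAllowedWhitespace = b == 0x09 || b == 0x0A || b == 0x0D;
            bool isPrintableAscii = b >= 0x20 && b <= 0x7E;
            if (!isAllowedWhitespace && !isPrintableAscii)
            {
                return false;
            }
        }
        return true;
    }

    // Reads at most maxBytes from the start of the file and applies the check.
    public static bool LooksLikeAsciiText(string path, int maxBytes = 1000)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            var buffer = new byte[maxBytes];
            int total = 0;
            int read;
            // Stream.Read may return fewer bytes than requested, so loop.
            while (total < maxBytes &&
                   (read = stream.Read(buffer, total, maxBytes - total)) > 0)
            {
                total += read;
            }
            var sample = new byte[total];
            Array.Copy(buffer, sample, total);
            return LooksLikeAsciiText(sample);
        }
    }
}
```

Note that an empty file passes this check, since it contains no offending bytes.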
As others have pointed out, there is no absolute way to be sure. However, to determine if a file is binary (which can be said to be easier than determining if it is text), some implementations check for consecutive NUL characters. Git apparently just checks the first 8000 bytes for a NUL and, if it finds one, treats the file as binary.
Here is a similar C# solution I wrote that looks for a given number of required consecutive NUL. If IsBinary returns false then it is very likely your file is text based.
public bool IsBinary(string filePath, int requiredConsecutiveNul = 1)
{
const int charsToCheck = 8000;
const char nulChar = '\0';
int nulCount = 0;
using (var streamReader = new StreamReader(filePath))
{
for (var i = 0; i < charsToCheck; i++)
{
if (streamReader.EndOfStream)
return false;
if ((char) streamReader.Read() == nulChar)
{
nulCount++;
if (nulCount >= requiredConsecutiveNul)
return true;
}
else
{
nulCount = 0;
}
}
}
return false;
}
To get the real type of a file, you must check its header, which does not change even if the extension is modified. Look up the header (signature) of the format you care about, and use something like this in your code:
using(var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
using(var reader = new BinaryReader(stream))
{
// read the first X bytes of the file
// In this example I want to check if the file is a BMP
// whose header is 42 4D in hex (the two bytes 66 and 77 in decimal)
string code = reader.ReadByte().ToString() + reader.ReadByte().ToString();
if (code.Equals("6677"))
{
//it's a BMP file
}
}
}
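The same idea can be generalised by comparing the leading bytes against a byte array instead of a concatenated decimal string. This is only a sketch: the class and member names are invented for illustration, and the two signatures shown are the well-known "BM" (BMP) and "%PDF" (PDF) prefixes.

```csharp
using System;
using System.IO;

public static class FileSignature
{
    public static readonly byte[] Bmp = { 0x42, 0x4D };             // "BM"
    public static readonly byte[] Pdf = { 0x25, 0x50, 0x44, 0x46 }; // "%PDF"

    // True when 'header' begins with every byte of 'signature'.
    public static bool StartsWith(byte[] header, byte[] signature)
    {
        if (header.Length < signature.Length)
        {
            return false;
        }
        for (int i = 0; i < signature.Length; i++)
        {
            if (header[i] != signature[i])
            {
                return false;
            }
        }
        return true;
    }

    // Reads only as many bytes as the signature needs and compares them.
    public static bool FileStartsWith(string path, byte[] signature)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            // ReadBytes returns fewer bytes if the file is shorter.
            byte[] header = reader.ReadBytes(signature.Length);
            return StartsWith(header, signature);
        }
    }
}
```

A matching signature only tells you what the file claims to be; plain text files have no signature at all, so this can rule out known binary formats but cannot confirm that something is text.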
The solution below works for me. It is a general approach that treats any file containing bytes outside the ASCII range as binary.
/// <summary>
/// This method checks whether selected file is Binary file or not.
/// </summary>
public bool CheckForBinary(string filePath)
{
// Any byte outside the 7-bit ASCII range (0-127) marks the file as binary.
// Note that this also classifies non-ASCII text (e.g. accented UTF-8) as binary.
using (Stream objStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
{
int a;
// ReadByte returns -1 at the end of the stream.
while ((a = objStream.ReadByte()) != -1)
{
if (a > 127)
{
return true; // Binary File
}
}
}
return false; // Text File (an empty file also ends up here)
}
public bool IsTextFile(string FilePath)
{
using (StreamReader reader = new StreamReader(FilePath))
{
int Character;
while ((Character = reader.Read()) != -1)
{
if ((Character > 0 && Character < 8) || (Character > 13 && Character < 26))
{
return false;
}
}
}
return true;
}
Code:
public void mergeFiles(string dir)
{
for (int i = 0; i < parts; i++)
{
if (!File.Exists(dir))
{
File.Create(dir).Close();
}
var output = File.Open(dir, FileMode.Open);
var input = File.Open(dir + ".part" + (i + 1), FileMode.Open);
input.CopyTo(output);
output.Close();
input.Close();
File.Delete(dir + ".part" + (i + 1));
}
}
dir variable is for example /path/file.txt.gz
I have a file packed into a .gz archive. This archive is divided into e.g. 8 parts and I want to get this file.
The problem is that I don't know how to combine these files "file.gz.part1..." to extract them later.
When I use the above function, the archive is corrupted.
I have been struggling with it for a week, looking on the Internet, but this is the best solution I have found and it does not work.
Anyone have any advice on how to combine archive parts into one file?
Your code has a few problems. If you look at the documentation for System.IO.Stream.Close you will see the following remark (emphasis mine):
Closes the current stream and releases any resources (such as sockets and file handles) associated with the current stream. Instead of calling this method, ensure that the stream is properly disposed.
So, per the docs, you want to dispose your streams rather than calling close directly (I'll come back to that in a second). Ignoring that, your main problem lies here:
var output = File.Open(dir, FileMode.Open);
You're using FileMode.Open for your output file. Again from the docs:
Specifies that the operating system should open an existing file. The ability to open the file is dependent on the value specified by the FileAccess enumeration. A FileNotFoundException exception is thrown if the file does not exist.
That's opening a stream at the beginning of the file. So, you're writing each partial file over the beginning of your output file repeatedly. I'm sure you noticed that your combined file size was only as large as the largest partial file. Take a look at FileMode.Append on the other hand:
Opens the file if it exists and seeks to the end of the file, or creates a new file. This requires Append permission. FileMode.Append can be used only in conjunction with FileAccess.Write. Trying to seek to a position before the end of the file throws an IOException exception, and any attempt to read fails and throws a NotSupportedException exception.
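To make the difference concrete, here is a small sketch (the class and method names are made up for illustration) that writes the same 4-byte chunk twice to a temporary file with a given FileMode and reports the resulting file size:

```csharp
using System;
using System.IO;

public static class AppendDemo
{
    // Writes a 4-byte chunk twice, reopening the file each time with the
    // given mode, and returns the final file length.
    public static long LengthAfterTwoWrites(FileMode mode)
    {
        string path = Path.GetTempFileName(); // creates an empty temp file
        try
        {
            byte[] chunk = { 1, 2, 3, 4 };
            for (int i = 0; i < 2; i++)
            {
                using (var output = new FileStream(path, mode, FileAccess.Write))
                {
                    output.Write(chunk, 0, chunk.Length);
                }
            }
            return new FileInfo(path).Length;
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

LengthAfterTwoWrites(FileMode.Open) returns 4, because the second write lands on top of the first, while LengthAfterTwoWrites(FileMode.Append) returns 8.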
OK - but backing up even a step further, this:
if (!File.Exists(dir))
{
File.Create(dir).Close();
}
var output = File.Open(dir, FileMode.Open);
... is inefficient. Why would we check for the file existing n times, then open and close it n times? We can just create the file as the first step, and leave that output stream open until we have appended all of our data to it.
So, how would we refactor your code to use IDisposable while fixing your bug? Check out the using statement. Putting all of this together, your code might look like this:
public void mergeFiles(string dir)
{
using (FileStream combinedFile = File.Create(dir))
{
for (int i = 0; i < parts; i++)
{
// Since this string is referenced more than once, capture as a
// variable to lower risk of copy/paste errors.
var splitFileName = dir + ".part" + (i + 1);
using (FileStream filePart = File.Open(splitFileName, FileMode.Open))
{
filePart.CopyTo(combinedFile);
}
// Note that it's safe to delete the file now, because our filePart
// stream has been disposed as it is out of scope.
File.Delete(splitFileName);
}
}
}
Give that a try. And here's an entire working program with a contrived example that you can paste into a new console app and run:
using System.IO;
using System.Text;
namespace temp_test
{
class Program
{
static int parts = 10;
static void Main(string[] args)
{
// First we will generate some dummy files.
generateFiles();
// Next, open files and combine.
combineFiles();
}
/// <summary>
/// A contrived example to generate some files.
/// </summary>
static void generateFiles()
{
for (int i = 0; i < parts; i++)
{
using (FileStream newFile = File.Create("splitfile.part" + i))
{
byte[] info = new UTF8Encoding(true).GetBytes($"This is File # {i}");
newFile.Write(info);
}
}
}
/// <summary>
/// A contrived example to combine our files.
/// </summary>
static void combineFiles()
{
using (FileStream combinedFile = File.Create("combined"))
{
for (int i = 0; i < parts; i++)
{
var splitFileName = "splitfile.part" + i;
using (FileStream filePart = File.Open(splitFileName, FileMode.Open))
{
filePart.CopyTo(combinedFile);
}
// Note that it's safe to delete the file now, because our filePart
// stream has been disposed as it is out of scope.
File.Delete(splitFileName);
}
}
}
}
}
Good luck and welcome to StackOverflow!
I just want to use the ASCII Unit Separator character (decimal 31 and hex 1F) instead of a tab for a delimited file. I assume the problem is encoding but I sure can't find how to change it. In the following, I get the desired output on the console in the first line of output in my StreamWriter file but the second line is missing the '\x1f'.
static StreamWriter sw = null;
static void Main(string[] args)
{
try
{
sw = new StreamWriter(OutFilename, false, Encoding.UTF8);
}
catch (Exception ex)
{
Console.WriteLine("File open error: " + ex.Message);
return;
}
// This works
Output("From▼To"); // Has a '\x1f' in it
// This does not work
StringBuilder sb = new StringBuilder();
sb.Append("From");
sb.Append('\x1f');
sb.Append("To");
Output(sb.ToString());
//
sw.Close();
}
static void Output(string s)
{
Console.WriteLine(s);
sw.WriteLine(s);
}
The output file has:
From▼To
FromTo
I want to build a string using StringBuilder except with the '\x1f' in the output.
Seems like there is a lot of confusion here. Let me see if I can clear things up somewhat.
First of all, let's agree on the following points that are easily verifiable:
'\x1f' == '\u001F'
'\x1f' == (char)31
'\x1f' != '▼' // <-- here appears to be your mistaken assumption.
'▼' == (char)9660
'▼' == '\u25BC'
So this...
// This works
Output("From▼To"); // Has a '\x1f' in it
... ironically is the exact line that does not work. There is no '\x1f' in this string. The triangle character is not '\x1f'. Not sure where you got that impression.
Which leads us to the last point: '\x1f' is not a visible character. So when you try to display it in the console, you will not see it, and that is 100% normal.
However, be assured that when you have a string with '\x1f' and write that out to a file, the character is still there. But you will never be able to "see" it, unless you read the bytes directly.
So whether or not you can use '\x1f' as a delimiter depends on whether you need the delimiter to be visible. If yes, then you need to pick another character. But if you only need it as a delimiter for when you programmatically parse the file, then using '\x1f' is appropriate.
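If you want to convince yourself that the character really is in the file, you can read the bytes back. A minimal sketch follows; the class and method names are invented for illustration, and it assumes a BOM-less UTF-8 writer so that the byte dump starts with the text itself.

```csharp
using System;
using System.IO;
using System.Text;

public static class UnitSeparatorCheck
{
    // Writes "From", the unit separator (0x1F) and "To" to the given path,
    // then returns the raw bytes that ended up in the file.
    public static byte[] WriteAndReadBack(string path)
    {
        var sb = new StringBuilder();
        sb.Append("From");
        sb.Append('\x1f');
        sb.Append("To");

        // UTF8Encoding(false) omits the byte order mark.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false)))
        {
            writer.Write(sb.ToString());
        }
        return File.ReadAllBytes(path);
    }
}
```

Passing the result to BitConverter.ToString gives 46-72-6F-6D-1F-54-6F, where the 1F in the middle is the delimiter.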
Just in case you want to try your luck with such a trick, you could write exactly the bytes you expect in the following way:
Output(Encoding.UTF8.GetBytes(sb.ToString()));
if you have another Output method like this:
static void Output(string s)
{
Console.WriteLine(s);
sw.WriteLine(s);
}
static void Output(byte[] bytes)
{
int dataLength = bytes.Length;
List<byte> modified = new List<byte>();
for (int i = 0; i < dataLength; i++)
{
// '▼' (U+25BC) is encoded as E2 96 BC in UTF-8; replace it with 0x1F.
if (i + 2 < dataLength && bytes[i] == 0xE2 && bytes[i + 1] == 0x96 && bytes[i + 2] == 0xBC)
{
modified.Add(0x1F);
i += 2;
}
else
{
modified.Add(bytes[i]);
}
}
byte[] data = modified.ToArray();
Console.WriteLine(Encoding.UTF8.GetString(bytes)); // Use this or the next line
// Console.WriteLine(Encoding.UTF8.GetString(data));
sw.Flush(); // flush buffered text first so the raw bytes land in the right order
sw.BaseStream.Write(data, 0, data.Length);
sw.WriteLine();
}
I need to be able to take a text file with unknown encoding (e.g., UTF-8, UTF-16, ...) and copy it line by line, making specific changes as I go. In this example, I am changing the encoding, however there are other uses for this kind of processing.
What I can't figure out is how to determine if the last line has a newline! Some programs care about the difference between a file with these records:
Rec1<newline>
Rec2<newline>
And a file with these:
Rec1<newline>
Rec2
How can I tell the difference in my code so that I can take appropriate action?
using (StreamReader reader = new StreamReader(sourcePath))
using (StreamWriter writer = new StreamWriter(destinationPath, false, outputEncoding))
{
bool isFirstLine = true;
while (!reader.EndOfStream)
{
string line = reader.ReadLine();
if (isFirstLine)
{
writer.Write(line);
isFirstLine = false;
}
else
{
writer.Write("\r\n" + line);
}
}
//if (LastLineHasNewline)
//{
// writer.Write("\n");
//}
writer.Flush();
}
The commented-out code is what I want to be able to do, but I can't figure out how to set the condition LastLineHasNewline! Remember, I have no a priori knowledge of the input file encoding.
Remember, I have no a priori knowledge of the input file encoding.
That's the fundamental problem to solve.
If the file could be using any encoding, then there is no concept of reading "line by line" as you can't possibly tell what the line ending is.
I suggest you first address this part, and the rest will be easy. Now, without knowing the context it's hard to say whether that means you should be asking the user for the encoding, or detecting it heuristically, or something else - but I wouldn't start trying to use the data before you can fully understand it.
As often happens, the moment you go to ask for help, the answer comes to the surface. The commented out code becomes:
if (LastLineHasNewline(reader))
{
writer.Write("\n");
}
And the function looks like this:
private static bool LastLineHasNewline(StreamReader reader)
{
byte[] newlineBytes = reader.CurrentEncoding.GetBytes("\n");
int newlineByteCount = newlineBytes.Length;
reader.BaseStream.Seek(-newlineByteCount, SeekOrigin.End);
byte[] inputBytes = new byte[newlineByteCount];
reader.BaseStream.Read(inputBytes, 0, newlineByteCount);
for (int i = 0; i < newlineByteCount; i++)
{
if (newlineBytes[i] != inputBytes[i])
return false;
}
return true;
}
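A slightly more defensive variant of the same idea, as a sketch only: it takes the stream and encoding explicitly, returns false for files shorter than one newline, and loops on Read because a single call is not guaranteed to fill the buffer. The class and method names are hypothetical, and like the function above it only looks for "\n", so a file ending in a bare "\r" is reported as having no trailing newline.

```csharp
using System;
using System.IO;
using System.Text;

public static class TrailingNewline
{
    // True when the last bytes of the stream are "\n" in the given encoding.
    public static bool EndsWithNewline(Stream stream, Encoding encoding)
    {
        byte[] newlineBytes = encoding.GetBytes("\n");
        if (!stream.CanSeek || stream.Length < newlineBytes.Length)
        {
            return false;
        }

        stream.Seek(-newlineBytes.Length, SeekOrigin.End);
        var tail = new byte[newlineBytes.Length];
        int total = 0;
        int read;
        while (total < tail.Length &&
               (read = stream.Read(tail, total, tail.Length - total)) > 0)
        {
            total += read;
        }
        if (total != tail.Length)
        {
            return false;
        }

        for (int i = 0; i < tail.Length; i++)
        {
            if (tail[i] != newlineBytes[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

With a StreamReader that has already read to the end of the file, this could be called as TrailingNewline.EndsWithNewline(reader.BaseStream, reader.CurrentEncoding).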
I have a big file with some text, and I want to split it into smaller files.
In this example, what I do:
1. I open a text file with, let's say, 10 000 lines in it.
2. I set package=300 here, which is the small-file limit: once a small file has 300 lines in it, close it and open a new file for writing (for example, package2).
3. Same as step 2.
4. You already know.
Here is the code from my function that should do that. The idea (what I don't know) is how to close the current file and open a new one once it has reached the 300-line limit (in our case here).
Let me show you what I'm talking about:
int nr = 1;
package=textBox1.Text;//how many lines/file (small file)
string packnr = nr.ToString();
string filer=package+"Pack-"+packnr+"+_"+date2+".txt";//name of small file/s
int packtester = 0;
int package= 300;
StreamReader freader = new StreamReader("bigfile.txt");
StreamWriter pak = new StreamWriter(filer);
while ((line = freader.ReadLine()) != null)
{
if (packtester < package)
{
pak.WriteLine(line);//writing line to small file
packtester++;//increasing the lines of small file
}
else if (packtester == package)//in this example, checking if the lines
//written, got to 300
{
packtester = 0;
pak.Close();//closing the file
nr++;//nr++ -> just for file name to be Pack-2;
packnr = nr.ToString();
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
}
}
I get these errors:
Cannot use local variable 'pak' before it is declared
A local variable named 'pak' cannot be declared in this scope because it would give a different meaning to 'pak', which is already used in a 'parent or current' scope to denote something else
Try this:
public void SplitFile()
{
int nr = 1;
int package = 300;
DateTime date2 = DateTime.Now;
int packtester = 0;
using (var freader = new StreamReader("bigfile.txt"))
{
StreamWriter pak = null;
try
{
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
string line;
while ((line = freader.ReadLine()) != null)
{
if (packtester == package)
{
// The current small file is full: close it and start the next one
// before writing, so the current line is not lost.
pak.Dispose(); //flushes and closes the file
packtester = 0;
nr++; //nr++ -> just for file name to be Pack-2;
pak = new StreamWriter(GetPackFilename(package, nr, date2), false);
}
pak.WriteLine(line); //writing line to small file
packtester++; //increasing the lines of small file
}
}
finally
{
if(pak != null)
{
pak.Dispose();
}
}
}
}
private string GetPackFilename(int package, int nr, DateTime date2)
{
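// An explicit date format avoids characters such as '/' and ':' that are not valid in Windows file names.
return string.Format("{0}Pack-{1}+_{2:yyyyMMdd_HHmmss}.txt", package, nr, date2);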
}
Logrotate can do this automatically for you. Years have been put into it and it's what people trust to handle their sometimes very large webserver logs.
Note that the code, as written, will not compile because you define the variable pak more than once. It should otherwise function, though it has some room for improvement.
When working with files, my suggestion and the general norm is to wrap your code in a using block, which is basically syntactic sugar built on top of a finally clause:
using (var stream = File.Open(@"C:\hi.txt", FileMode.Open))
{
//write your code here. When this block is exited, stream will be disposed.
}
Is equivalent to:
var stream = File.Open(@"C:\hi.txt", FileMode.Open);
try
{
//write your code here.
}
finally
{
if (stream != null)
stream.Dispose();
}
In addition, when working with files, always prefer opening file streams using very specific permissions and modes as opposed to using the more sparse constructors that assume some default options. For example:
var stream = new StreamWriter(File.Open(@"c:\hi.txt", FileMode.CreateNew, FileAccess.ReadWrite, FileShare.Read));
This will guarantee, for example, that files should not be overwritten -- instead, we assume that the file we want to open doesn't exist yet.
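As a rough sketch of what that looks like in practice (the class and method names are made up for illustration): FileMode.CreateNew throws an IOException when the file already exists, so the caller can decide what to do instead of silently overwriting.

```csharp
using System;
using System.IO;

public static class CreateNewDemo
{
    // Returns true if the file was created and written, false if opening
    // it failed with an IOException (for example because it already exists).
    public static bool TryCreate(string path, string text)
    {
        try
        {
            using (var stream = File.Open(path, FileMode.CreateNew, FileAccess.Write, FileShare.Read))
            using (var writer = new StreamWriter(stream))
            {
                writer.Write(text);
            }
            return true;
        }
        catch (IOException)
        {
            // Note: other I/O failures also derive from IOException,
            // so this is deliberately a coarse check.
            return false;
        }
    }
}
```

Calling TryCreate twice with the same path creates the file the first time and leaves it untouched the second time.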
Oh, and instead of using the check you perform, I suggest using the EndOfStream property of the StreamReader object.
This code looks like it closes the stream and re-opens a new stream when you hit 300 lines. What exactly doesn't work in this code?
One thing you'll want to add is a final close (probably with a check so it doesn't try to close an already closed stream) in case you don't have an even multiple of 300 lines.
EDIT:
Due to your edit I see your problem. You don't need to redeclare pak in the last line of code, simply reinitialize it to another streamwriter.
(I don't remember if that is disposable but if it is you probably should do that before making a new one).
StreamWriter pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
becomes
pak = new StreamWriter(package + "Pack-" + packnr + "+_" + date2 + ".txt");
I need to find out which of the files I found in some directory are UTF-8 encoded and which are ANSI encoded, so that I can later convert them to some other encoding. My problem is: how can I find out whether a file is UTF-8 or ANSI encoded? Both encodings are actually possible in my files.
There is no reliable way to do it (since the file might be just random binary); however, the process used by Windows Notepad is detailed in Michael S. Kaplan's blog:
http://www.siao2.com/2007/04/22/2239345.aspx
1. Check the first two bytes;
   1. If there is a UTF-16 LE BOM, then treat it (and load it) as a "Unicode" file;
   2. If there is a UTF-16 BE BOM, then treat it (and load it) as a "Unicode (Big Endian)" file;
   3. If the first two bytes look like the start of a UTF-8 BOM, then check the next byte and if we have a UTF-8 BOM, then treat it (and load it) as a "UTF-8" file;
2. Check with IsTextUnicode to see if that function thinks it is BOM-less UTF-16 LE, if so, then treat it (and load it) as a "Unicode" file;
3. Check to see if it is UTF-8 using the original RFC 2279 definition from 1998 and if it is, then treat it (and load it) as a "UTF-8" file;
4. Assume an ANSI file using the default system code page of the machine.
Now note that there are some holes here, like the fact that step 2 does not do quite as good with BOM-less UTF-16 BE (there may even be a bug here, I'm not sure -- if so it's a bug in Notepad beyond any bug in IsTextUnicode).
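The BOM part of that procedure (the first step only) is easy to sketch in C#. The class and method names here are hypothetical, and the sketch covers just the three BOMs mentioned in the quote; it does not attempt the IsTextUnicode or BOM-less UTF-8 checks, and it does not distinguish UTF-32.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // Returns the encoding indicated by a leading byte order mark,
    // or null when none of the three BOMs below is present.
    public static Encoding DetectBom(byte[] head)
    {
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            return Encoding.Unicode;            // UTF-16 LE
        }
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            return Encoding.BigEndianUnicode;   // UTF-16 BE
        }
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            return Encoding.UTF8;               // UTF-8 with BOM
        }
        return null;
    }
}
```

A null result means "no BOM", not "ANSI": the file could still be BOM-less UTF-8 or UTF-16, which is where the later, less reliable steps come in.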
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
Unicode/UTF8/UnicodeBigEndian are considered to be different types. ANSI is considered the same as UTF8.
public class EncodingType
{
public static System.Text.Encoding GetType(string FILE_NAME)
{
FileStream fs = new FileStream(FILE_NAME, FileMode.Open, FileAccess.Read);
Encoding r = GetType(fs);
fs.Close();
return r;
}
public static System.Text.Encoding GetType(FileStream fs)
{
// BOMs: UTF-16 LE = FF FE, UTF-16 BE = FE FF, UTF-8 = EF BB BF
Encoding reVal = Encoding.Default;
BinaryReader r = new BinaryReader(fs, System.Text.Encoding.Default);
// Reads the whole file; assumes it is small enough to fit in memory.
byte[] ss = r.ReadBytes((int)fs.Length);
if (IsUTF8Bytes(ss) || (ss.Length >= 3 && ss[0] == 0xEF && ss[1] == 0xBB && ss[2] == 0xBF))
{
reVal = Encoding.UTF8;
}
else if (ss.Length >= 2 && ss[0] == 0xFE && ss[1] == 0xFF)
{
reVal = Encoding.BigEndianUnicode;
}
else if (ss.Length >= 2 && ss[0] == 0xFF && ss[1] == 0xFE)
{
reVal = Encoding.Unicode;
}
r.Close();
return reVal;
}
private static bool IsUTF8Bytes(byte[] data)
{
int charByteCounter = 1;
byte curByte;
for (int i = 0; i < data.Length; i++)
{
curByte = data[i];
if (charByteCounter == 1)
{
if (curByte >= 0x80)
{
while (((curByte <<= 1) & 0x80) != 0)
{
charByteCounter++;
}
if (charByteCounter == 1 || charByteCounter > 6)
{
return false;
}
}
}
else
{
if ((curByte & 0xC0) != 0x80)
{
return false;
}
charByteCounter--;
}
}
if (charByteCounter > 1)
{
return false; // data ends in the middle of a multi-byte sequence
}
return true;
}
}
See these two codeproject articles - it is not trivial to find out file encoding simply from the file content:
Detect encoding from ByteOrderMarks (BOM)
Detect Encoding for In- and Outgoing Text
public static System.Text.Encoding GetEncoding(string filepath, Encoding defaultEncoding)
{
// will fall to defaultEncoding if file does not have BOM
using (var reader = new StreamReader(filepath, defaultEncoding, true))
{
reader.Peek(); //need it
return reader.CurrentEncoding;
}
}
Check Byte Order Mark (BOM).
To see the BOM, you need to view the file in a hexadecimal viewer.
Notepad shows the file encoding in the status bar, but if the file has no BOM, that encoding is only an estimate.
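If no hex viewer is at hand, a few lines of C# can print the leading bytes. This is just a sketch with made-up names; it returns the first few bytes of the file as a hex string.

```csharp
using System;
using System.IO;

public static class HexPeek
{
    // Returns up to 'count' leading bytes of the file, formatted like "EF-BB-BF-41".
    public static string FirstBytesAsHex(string path, int count = 4)
    {
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
        using (var reader = new BinaryReader(stream))
        {
            // ReadBytes returns fewer bytes when the file is shorter than 'count'.
            return BitConverter.ToString(reader.ReadBytes(count));
        }
    }
}
```

A result starting with EF-BB-BF indicates a UTF-8 BOM, FF-FE a UTF-16 little-endian BOM, and FE-FF a UTF-16 big-endian BOM.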