StreamReader fails to detect BOM - C#

I have the following piece of code:
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true)) {
    mCertainFileIsUTFFormat = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    mCodingFromBOM = sr.CurrentEncoding;
    String line = sr.ReadToEnd();
    return line.Split('\n');
}
Basically I'm reading a file and assuming Shift-JIS if there is no BOM. Alas, this method always, no matter what, returns the Shift-JIS encoding, even if the file in question has a BOM in it. Am I doing something wrong here, or is there perhaps a known issue? I could always open the file in binary mode and do the work myself, but this is supposed to do what I want :)

You need to perform a Read of some kind first - StreamReader will not detect the encoding before reading. In other words, get the encoding after your ReadToEnd call:
String line = sr.ReadToEnd();
mCodingFromBOM = sr.CurrentEncoding;
Info: StreamReader.CurrentEncoding
The value can be different after the first call to any Read method of StreamReader, since encoding autodetection is not done until the first call to a Read method.
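Putting that together with the snippet from the question (same fields and the Shift-JIS fallback; this is just the original code reordered, as a sketch):
using (StreamReader sr = new StreamReader(path, Encoding.GetEncoding("shift-jis"), true)) {
    // Read first - the BOM (if any) is only examined during the first read.
    String line = sr.ReadToEnd();
    // Now CurrentEncoding reflects what was detected, or the Shift-JIS fallback.
    mCodingFromBOM = sr.CurrentEncoding;
    mCertainFileIsUTFFormat = !sr.CurrentEncoding.Equals(Encoding.GetEncoding("shift-jis"));
    return line.Split('\n');
}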

Related

StreamReader initialization with a text file that has an empty line after every other line

I'm facing a rather curious problem. I'm trying to initialize a StreamReader with a file name and an encoding parameter, but my code fails because the file has an empty line after every other line.
What I'm trying to do is read the lines into a list. If the file does not contain empty lines, the code executes successfully.
I'm initializing the reader like this:
using (StreamReader reader = new StreamReader(filename, encoding))
{
    // do stuff...
}
Any thoughts on how I could perform the operation mentioned above? This is for an automated process, so no manual tampering with the file is possible.
Thank you in advance
I hope this helps:
using (StreamReader sr = new StreamReader(FileName))
{
    string line;
    while (!sr.EndOfStream)
    {
        // read every line
        line = sr.ReadLine();
        // if the line is empty (Trim(' ') also treats a line of spaces as empty)
        if (string.IsNullOrEmpty(line.Trim(' ')))
        {
            // ... do something
        }
    }
}
It seems that the encoding of the file was messed up when I saved it using Notepad++ (without altering anything in the file, simply clicking save) and the StreamReader did not like it. I can parse it correctly now (as long as I don't click "save" accidentally) and I can handle the empty lines using
string.IsNullOrWhiteSpace
Thank you all for your time.
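For reference, a minimal sketch of that approach (filename and encoding are the placeholders from the question):
var lines = new List<string>();
using (StreamReader reader = new StreamReader(filename, encoding))
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        // skip lines that are empty or contain only whitespace
        if (string.IsNullOrWhiteSpace(line))
            continue;
        lines.Add(line);
    }
}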

Problems with strings in the CSV file

I have an application that reads information from a CSV file and writes it to the database. But some characters (for example º and ç) come out garbled when they are saved to the database. Does anyone know how to fix this problem?
Thank you.
I'm using these lines of code to read the information from the CSV file:
string directory = @"C:\test.csv";
StreamReader stream = new StreamReader(directory);
string line = "";
line = stream.ReadLine();
string[] column = line.Split(';');
StreamReader defaults to UTF-8 encoding and your file is in a different encoding. Try specifying it like this...
var encoding = Encoding.Unicode; // UTF-16
StreamReader stream = new StreamReader(directory, encoding);
Note that you need to know what encoding the file is in to read it properly... I'm just guessing that it might be UTF-16, but obviously I can't know what it is.
You should specify the right encoding when reading the file. The default is UTF-8. Your file is probably encoded with a different encoding.
This is most likely related to the Encoding that is used when reading the file. By default, UTF8 is assumed as the Encoding. In order to read the file correctly, you need to specify the right encoding, e.g.:
string directory = @"C:\test.csv";
using (StreamReader stream = new StreamReader(directory, Encoding.ASCII))
{
    string line = "";
    line = stream.ReadLine();
    string[] column = line.Split(';');
}
You can try the following encodings (see this link for a complete list):
Encoding.Default for ANSI encoding based on the current Windows code page.
Encoding.ASCII for ASCII encoding.
Encoding.UTF* for different Unicode encodings.
Please note that I enclosed the StreamReader in a using block so that it is disposed when it is not needed anymore.
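Given the characters mentioned in the question (º, ç), the file may well be in a Windows ANSI code page rather than Unicode. A sketch assuming Windows-1252 (the code page is a guess, not something the question confirms):
string directory = @"C:\test.csv";
// Windows-1252 covers Western European characters such as º and ç; adjust if the file uses another code page.
using (StreamReader stream = new StreamReader(directory, Encoding.GetEncoding(1252)))
{
    string line;
    while ((line = stream.ReadLine()) != null)
    {
        string[] column = line.Split(';');
        // ... write the columns to the database
    }
}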

C#: Compare text file contents to a string variable

I have an application that dumps text to a text file. I think there might be an issue with the text not containing the proper carriage returns, so I'm in the process of writing a test that will compare the contents of this file to a string variable that I declare in the code.
Ex:
1) Code creates a text file that contains the text:
This is line 1
This is line 2
This is line 3
2) I have the following string that I want to compare it to:
string testString = "This is line 1\nThis is line 2\nThis is line3";
I understand that I could open a file stream reader, read the text file line by line, and store that in a mutable string variable while appending "\n" after each line, but I'm wondering if this is re-inventing the wheel (in other words, does .NET have a built-in class for something like this?). Thanks in advance.
You can either use StreamReader's ReadToEnd() method to read the contents into a single string, like this:
using System.IO;
using (StreamReader streamReader = new StreamReader(filePath))
{
    string text = streamReader.ReadToEnd();
}
Note: you have to make sure that you release the resources (the code above uses "using" to do that). Also, ReadToEnd() assumes that the stream knows when it has reached an end. For interactive protocols in which the server sends data only when you ask for it and does not close the connection, ReadToEnd might block indefinitely because it never reaches an end, so it should be avoided there. You should also make sure that the current position in the stream is at the start.
You can also use ReadAllText like
// Open the file to read from.
string readText = File.ReadAllText(path);
which is simpler: it opens the file, reads all of the text, and takes care of closing it as well.
No, there is nothing built in for this. The easiest way, assuming that your file is small, is to just read the whole thing and compare them:
var fileContents = File.ReadAllText(fileName);
return testString == fileContents;
If the file is fairly long, you may want to compare the file line by line, since finding a difference early on would allow you to reduce IO.
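A minimal sketch of that line-by-line idea (fileName and testString as in the question; requires System.Linq):
string[] expectedLines = testString.Split('\n');
// File.ReadLines streams the file lazily, so SequenceEqual stops reading at the first mismatch.
bool matches = File.ReadLines(fileName).SequenceEqual(expectedLines);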
A more concise way to read all the text in a file is
System.IO.File.ReadAllText()
but there's no shorter way to do the string-level comparison:
if (System.IO.File.ReadAllText(filename) == "This is line 1\nThis is line 2\nThis is line3") {
    // it matches
}
This should work:
StreamReader streamReader = new StreamReader(filePath);
string originalString = streamReader.ReadToEnd();
streamReader.Close();
I don't think there is a quicker way of doing it in C#.
You can read the entire file into a string variable this way:
using (FileStream stream = new FileStream(yourFileName, FileMode.Open, FileAccess.Read, FileShare.Read))
using (StreamReader reader = new StreamReader(stream))
{
    string stringContainingFilesContent = reader.ReadToEnd();
    // and check for your condition
    if (testString.Equals(stringContainingFilesContent, StringComparison.OrdinalIgnoreCase))
    {
        // the contents match
    }
}

C# - Detecting the encoding of a file, writing changes to the file using the found encoding

I wrote a small program that iterates through a lot of files and applies some changes where a certain string match is found. The problem I have is that different files have different encodings. So what I would like to do is check the encoding, then overwrite the file in its original encoding.
What would be the prettiest way of doing that in C# / .NET 2.0?
My code is very simple as of now:
String f1 = File.ReadAllText(fileList[i]).ToLower();
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, Encoding.Unicode);
}
I took a look at Auto encoding detect in C# which made me realize how I could detect encoding, but I am not sure how I could use that information to write in the same encoding.
Would greatly appreciate any help here.
Unfortunately, encoding is one of those subjects where there is not always a definitive answer. In many cases it's much closer to guessing the encoding than detecting it. Raymond Chen did an excellent blog post on this subject that is worth reading:
http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx
The gist of the article is:
If the BOM (byte order mark) exists, then you're golden.
Otherwise it's guesswork and heuristics.
However, I still think the best approach is the one Darin mentioned in the question you linked: let StreamReader guess for you instead of re-inventing the wheel. It only requires a very slight modification to your sample.
String f1;
Encoding encoding;
using (var reader = new StreamReader(fileList[i])) {
    f1 = reader.ReadToEnd().ToLower();
    encoding = reader.CurrentEncoding;
}
if (f1.Contains(oPath))
{
    f1 = f1.Replace(oPath, nPath);
    File.WriteAllText(fileList[i], f1, encoding);
}
By default, .NET uses UTF-8. It is hard to detect the character encoding because most of the time .NET will read the file as UTF-8; I always have problems with ANSI.
My trick is to read the file as a stream, force it to be decoded as UTF-8, and check for the usual characters that should be in the text. If they are found, it's UTF-8, otherwise ANSI... and I tell the user they can only use two encodings, either ANSI or UTF-8. Auto-detection doesn't quite work for my language :p
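A rough sketch of that kind of check (one possible heuristic, not the poster's exact code): decode the raw bytes with a strict UTF-8 decoder and fall back to ANSI if they are not valid UTF-8.
static Encoding GuessUtf8OrAnsi(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    try
    {
        // Strict UTF-8 decoder: throws if the bytes are not valid UTF-8.
        new UTF8Encoding(false, true).GetString(bytes);
        return Encoding.UTF8;
    }
    catch (DecoderFallbackException)
    {
        // Not valid UTF-8, so treat the file as the system ANSI code page.
        return Encoding.Default;
    }
}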
I am afraid you will have to know the encoding. For UTF-based encodings, though, you can use StreamReader's built-in functionality.
Taken from here:
With regard to encodings - you will need to have identified the encoding in order to use the StreamReader. However, the StreamReader itself can help if you create it with one of the constructor overloads that allows you to supply the flag detectEncodingFromByteOrderMarks as true (or you can use Encoding.GetPreamble and look at the byte preamble yourself). Both these methods will only help auto-detect UTF-based encodings though - so any ANSI encodings with a specified codepage will probably not be parsed correctly.
Probably a bit late, but I encountered the same problem myself. Using the previous answers, I found a solution that works for me. It reads in the text using StreamReader's default encoding, extracts the encoding used for that file, and uses StreamWriter to write it back with the changes using the found encoding. It also removes and re-adds the ReadOnly flag.
string file = "File to open";
string text;
Encoding encoding;
string oldValue = "string to be replaced";
string replacementValue = "New string";
var attributes = File.GetAttributes(file);
File.SetAttributes(file, attributes & ~FileAttributes.ReadOnly);
using (StreamReader reader = new StreamReader(file, Encoding.Default))
{
    text = reader.ReadToEnd();
    encoding = reader.CurrentEncoding;
    reader.Close();
}
bool changedValue = false;
if (text.Contains(oldValue))
{
    text = text.Replace(oldValue, replacementValue);
    changedValue = true;
}
if (changedValue)
{
    using (StreamWriter write = new StreamWriter(file, false, encoding))
    {
        write.Write(text);
        write.Close();
    }
    File.SetAttributes(file, attributes | FileAttributes.ReadOnly);
}
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
https://stackoverflow.com/a/69312696/9134997
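A rough sketch of the BOM check described there (an illustration of the idea, not the code behind the link); when no BOM is found, a strict UTF-8 decode like the one sketched above can decide between UTF-8 and ANSI:
static Encoding DetectByBom(string path)
{
    byte[] bom = new byte[4];
    using (FileStream fs = File.OpenRead(path))
    {
        fs.Read(bom, 0, 4);
    }
    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) return Encoding.UTF8;   // UTF-8 BOM
    if (bom[0] == 0xFF && bom[1] == 0xFE) return Encoding.Unicode;                  // UTF-16 LE BOM
    if (bom[0] == 0xFE && bom[1] == 0xFF) return Encoding.BigEndianUnicode;         // UTF-16 BE BOM
    return null; // no BOM detected; fall back to a heuristic or Encoding.Default
}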

Is there a way to read large text file in parts?

I have a large file (60 MB) and I am reading the file into a string and returning that string to another method.
Now when I read the file into a string it throws a System.OutOfMemoryException.
Is there a way to read the file in parts and append it to the string?
If not, is there a way around this?
public static string Serialize()
{
    string returnValue;
    System.IO.FileInfo file1 = new FileInfo(@"c:\file.txt");
    returnValue = System.IO.File.ReadAllText(file1.ToString());
    return returnValue;
}
How do you read the file right now?
You could use the StreamReader class, and read the file line by line (ReadLine method).
You could also read a specified amount of bytes from the file on each read operation (Read method)
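A minimal sketch of the second suggestion, reading a fixed number of characters per Read call (the buffer size is arbitrary):
using (StreamReader reader = new StreamReader(@"c:\file.txt"))
{
    char[] buffer = new char[4096];
    int charsRead;
    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // process buffer[0..charsRead] here instead of appending everything to one big string
    }
}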
Yes - it's called streaming. Have a look at the StreamReader class, though I'm not sure why you want 60 MB in one string. It's probably best to deal with it a little at a time if possible (in your scenario, possibly on a line-by-line basis?).
Instead of ReadAllText, look at OpenRead and pass the returned FileStream into the constructor of a StreamReader. Have a look at doing something along these lines if possible:
using (FileStream fs = File.OpenRead(@"c:\theFile.text"))
using (StreamReader sr = new StreamReader(fs))
{
    string oneLine = sr.ReadLine();
}
Even if you read it line by line (or in parts by streaming), you will still run out of memory as long as you append it all to a single string. Is compressing it along the way an option? If not, I'd look at giving the process more memory or, better, processing the file a piece at a time instead of building one huge string.
