I am creating a file with ASCII encoding, but when I test the Encoding type of that file, it returns UTF8Encoding.
Can anyone explain the reason, or point out my mistake?
CODE:
Creating File:
FileStream _textStream = File.Open("CreateAsciiFile.txt", FileMode.Create, FileAccess.Write);
StreamWriter _streamWriter = new StreamWriter(_textStream, System.Text.Encoding.ASCII);
Byte[] byteContent = BytesToWrite(); // This returns the array of bytes to write
foreach(var myByte in byteContent)
_streamWriter.Write(System.Convert.ToChar(myByte));
Reading a file:
StreamReader sr = new StreamReader(@"C:\CreateAsciiFile.txt", true);
string LineText = sr.ReadLine();
System.Text.Encoding enc = sr.CurrentEncoding;
Here enc gives UTF8Encoding, but I am expecting ASCII. Why?
You need to read from the reader before querying the encoding, so before calling sr.CurrentEncoding try reading something. The StreamReader looks at the first bytes to try to guess the encoding, and because ASCII has no BOM it might not be recognizable as such, so you may get unexpected results. For example, there is no difference (at the binary level) between an ASCII-encoded file and an ISO-8859-1-encoded file.
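A minimal sketch of that order of operations (the file path is just a placeholder); note that for a BOM-less ASCII file the detection described above will still report UTF-8:

using (StreamReader sr = new StreamReader(@"C:\CreateAsciiFile.txt", true))
{
    // Reading forces the reader to inspect the first bytes of the file.
    string firstLine = sr.ReadLine();
    // Only now does CurrentEncoding reflect what was actually detected;
    // with no BOM present you will typically see UTF8Encoding here.
    System.Text.Encoding enc = sr.CurrentEncoding;
    Console.WriteLine(enc.EncodingName);
}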
The answer is probably here:
Every valid ASCII character is also a valid UTF‑8 encoded Unicode character with the same binary value.
In other words, your ASCII file is both valid UTF-8 and ASCII. It is detected as UTF-8.
Related
I want to write a String to a Stream (a MemoryStream in this case) and read the bytes one by one.
stringAsStream = new MemoryStream();
UnicodeEncoding uniEncoding = new UnicodeEncoding();
String message = "Message";
stringAsStream.Write(uniEncoding.GetBytes(message), 0, message.Length);
Console.WriteLine("This:\t\t" + (char)uniEncoding.GetBytes(message)[0]);
Console.WriteLine("Differs from:\t" + (char)stringAsStream.ReadByte());
The (undesired) result I get is:
This: M
Differs from: ?
It looks like it's not being read correctly, as the first char of "Message" is 'M', which works when getting the bytes from the UnicodeEncoding instance but not when reading them back from the stream.
What am I doing wrong?
The bigger picture: I have an algorithm which will work on the bytes of a Stream. I'd like to be as general as possible and work with any Stream. I'd like to convert an ASCII string into a MemoryStream, or maybe use another method to be able to work on the string as a Stream. The algorithm in question will work on the bytes of the Stream.
After you write to the MemoryStream and before you read it back, you need to Seek back to the beginning of the MemoryStream so you're not reading from the end.
UPDATE
After seeing your update, I think there's a more reliable way to build the stream:
UnicodeEncoding uniEncoding = new UnicodeEncoding();
String message = "Message";
// You might not want to use the outer using statement that I have
// I wasn't sure how long you would need the MemoryStream object
using (MemoryStream ms = new MemoryStream())
{
    var sw = new StreamWriter(ms, uniEncoding);
    try
    {
        sw.Write(message);
        sw.Flush(); // otherwise you risk reading back an empty stream
        ms.Seek(0, SeekOrigin.Begin);

        // Test and work with the stream here.
        // If you need to start back at the beginning, be sure to Seek again.
    }
    finally
    {
        sw.Dispose();
    }
}
As you can see, this code uses a StreamWriter to write the entire string (with proper encoding) out to the MemoryStream. This takes the hassle out of ensuring the entire byte array for the string is written.
Update: I ran into the empty-stream issue several times. It's enough to call Flush right after you've finished writing.
Try this "one-liner" from Delta's Blog, String To MemoryStream (C#).
MemoryStream stringInMemoryStream =
    new MemoryStream(ASCIIEncoding.Default.GetBytes("Your string here"));
The string will be loaded into the MemoryStream, and you can read from it. See Encoding.GetBytes(...), which the other encodings implement as well.
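As a hedged round-trip sketch of that one-liner (using Encoding.ASCII explicitly, since the goal above was an ASCII string; the text is just an example), you can read the string straight back out with a StreamReader:

MemoryStream stringInMemoryStream =
    new MemoryStream(Encoding.ASCII.GetBytes("Your string here"));

using (StreamReader reader = new StreamReader(stringInMemoryStream, Encoding.ASCII))
{
    // A MemoryStream built from a byte array starts at position 0,
    // so the whole text comes back out without any Seek.
    Console.WriteLine(reader.ReadToEnd()); // Your string here
}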
You're using message.Length, which returns the number of characters in the string, but you should be using the number of bytes to write. Use something like:
byte[] messageBytes = uniEncoding.GetBytes(message);
stringAsStream.Write(messageBytes, 0, messageBytes.Length);
You're then reading a single byte and expecting to get a character from it just by casting to char. UnicodeEncoding will use two bytes per character.
As Justin says, you're also not seeking back to the beginning of the stream.
Basically I'm afraid pretty much everything is wrong here. Please give us the bigger picture and we can help you work out what you should really be doing. Using a StreamWriter to write and then a StreamReader to read is quite possibly what you want, but we can't really tell from just the brief bit of code you've shown.
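For what it's worth, putting those fixes together (write the byte count, seek back, and decode two bytes per character), a corrected byte-level sketch might look like this:

UnicodeEncoding uniEncoding = new UnicodeEncoding();
string message = "Message";
byte[] messageBytes = uniEncoding.GetBytes(message);

using (MemoryStream stringAsStream = new MemoryStream())
{
    stringAsStream.Write(messageBytes, 0, messageBytes.Length); // byte count, not char count
    stringAsStream.Seek(0, SeekOrigin.Begin);                   // rewind before reading

    // UTF-16 (little-endian) stores 'M' as the two bytes 0x4D 0x00.
    int low = stringAsStream.ReadByte();
    int high = stringAsStream.ReadByte();
    char c = (char)(low | (high << 8));
    Console.WriteLine(c); // M
}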
I think it would be a lot more productive to use a TextWriter, in this case a StreamWriter, to write to the MemoryStream. After that, as others have said, you need to "rewind" the MemoryStream using something like stringAsStream.Position = 0L;.
stringAsStream = new MemoryStream();
// create a stream writer with UTF-16 (Unicode) encoding to write to the memory stream;
// leaveOpen: true keeps the MemoryStream usable after the writer is disposed
using (StreamWriter sWriter = new StreamWriter(stringAsStream, UnicodeEncoding.Unicode, 1024, leaveOpen: true))
{
    sWriter.Write("Lorem ipsum.");
}
stringAsStream.Position = 0L; // rewind
Note that:
StreamWriter defaults to using an instance of UTF8Encoding unless specified otherwise. This instance of UTF8Encoding is constructed without a byte order mark (BOM)
Also, you usually don't have to create a new UnicodeEncoding(), since the Encoding class already exposes static instances in convenient UTF-8 (Encoding.UTF8), UTF-16 (Encoding.Unicode), and UTF-32 (Encoding.UTF32) flavors.
And then, finally (as others have said) you're trying to convert the bytes directly to chars, which they are not. If I had a memory stream and knew it was a string, I'd use a TextReader to get the string back from the bytes. It seems "dangerous" to me to mess around with the raw bytes.
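For example, a small sketch of that read-back, continuing from the rewound stringAsStream above (the StreamReader will also skip the UTF-16 BOM the writer emitted):

using (TextReader tReader = new StreamReader(stringAsStream, UnicodeEncoding.Unicode))
{
    // Decodes the UTF-16 bytes back into a string instead of poking at raw bytes.
    string roundTripped = tReader.ReadToEnd();
    Console.WriteLine(roundTripped); // Lorem ipsum.
}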
You need to reset the stream to the beginning:
stringAsStream.Seek(0, SeekOrigin.Begin);
Console.WriteLine("Differs from:\t" + (char)stringAsStream.ReadByte());
This can also be done by setting the Position property to 0:
stringAsStream.Position = 0;
I have an application that reads information from a CSV file and writes it to the database. But some characters (for example: º and ç) come out wrong when they are saved to the database. Does anyone know how to fix this problem?
Thank you.
I'm using these lines of code to read the information from the CSV file:
string directory = @"C:\test.csv";
StreamReader stream = new StreamReader(directory);
string line = "";
line = stream.ReadLine();
string[] column = line.Split(';');
StreamReader defaults to UTF-8 encoding and your file is in a different encoding. Try specifying it like this...
var encoding = Encoding.Unicode; // UTF-16; note that Encoding has no "UTF16" member
StreamReader stream = new StreamReader(directory, encoding);
Note that you need to know what encoding the file is in to properly read it... I'm just guessing that it might be UTF-16, but obviously I can't know what it is.
You should specify the right encoding when reading the file. The default is UTF-8, and your file probably uses a different encoding.
This is most likely related to the Encoding that is used when reading the file. By default, UTF8 is assumed as the Encoding. In order to read the file correctly, you need to specify the right encoding, e.g.:
string directory = @"C:\test.csv";
using (StreamReader stream = new StreamReader(directory, Encoding.ASCII))
{
    string line = stream.ReadLine();
    string[] column = line.Split(';');
}
You can try the following encodings (see this link for a complete list):
Encoding.Default for ANSI encoding based on the current Windows code page.
Encoding.ASCII for ASCII encoding.
Encoding.UTF* for different Unicode encodings.
Please note that I enclosed the StreamReader in a using block so that it is disposed when it is not needed anymore.
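Since characters like º and ç live in the upper half of the Windows-1252 / Latin-1 range, the file quite possibly uses that code page; here is a hedged variation of the snippet above (you still need to confirm the real encoding of your file):

string directory = @"C:\test.csv";

// Windows-1252 is a common encoding for CSV files produced on Western-European
// Windows systems; swap in whatever encoding the file was actually written with.
using (StreamReader stream = new StreamReader(directory, Encoding.GetEncoding(1252)))
{
    string line = stream.ReadLine();
    string[] column = line.Split(';');
}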
I'm reading a PDF file with C#, but the characters come back in a different encoding and are not the characters I expect when I view the file in a PDF viewer.
I thought a UTF-8 encoding would be correct.
What am I doing wrong?
string file = @"c:\document.pdf";
Stream stream = File.Open(file, FileMode.Open);
BinaryReader binaryReader = new BinaryReader(stream);
byte[] buffer = binaryReader.ReadBytes(Convert.ToInt32(stream.Length));
var text = UTF8Encoding.UTF8.GetString(buffer);
PDF is a very complex, multi-part file format; it is not just UTF-8 text.
If you want to read a PDF file, you must read over the full PDF File Format Documentation and fully implement the large and complex details of how the file format works.
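To see why decoding the raw bytes as UTF-8 can't work, here is a tiny sketch (reusing the path from the question) that looks only at the header; the header is plain ASCII, but most of what follows is compressed binary object streams, not text:

byte[] buffer = File.ReadAllBytes(@"c:\document.pdf");

// Every PDF starts with an ASCII header such as "%PDF-1.7".
string header = Encoding.ASCII.GetString(buffer, 0, Math.Min(8, buffer.Length));
Console.WriteLine(header);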
I am trying to encode a string in Windows-1252 with a StreamWriter. The input string (dataString) is encoded in UTF-8.
StreamWriter sw = new StreamWriter(#"C:\Temp\data.txt", true, Encoding.GetEncoding(1252));
sw.Write(dataString);
sw.Close();
When I open the file in Notepad++ I get an ANSI file, but I need a Windows-1252 encoded file.
Does anyone have an idea?
Your file is Windows-1252 encoded. A non-Unicode file contains no data that indicates how it is encoded; in this case ANSI just means "not Unicode". If you were to encode the file as Russian/Windows-1251 and open it in Notepad++, Notepad++ would display it as ANSI as well.
See Unicode, UTF, ASCII, ANSI format differences for more info.
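One hedged way to convince yourself the bytes really are Windows-1252 (the sample text is just an illustration) is to inspect the raw bytes rather than trusting an editor's label:

Encoding win1252 = Encoding.GetEncoding(1252);
File.WriteAllText(@"C:\Temp\data.txt", "déjà vu ç", win1252);

// In Windows-1252, 'é' is the single byte 0xE9; UTF-8 would have used
// the two bytes 0xC3 0xA9 instead.
byte[] raw = File.ReadAllBytes(@"C:\Temp\data.txt");
Console.WriteLine(BitConverter.ToString(raw));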
Okay, I'm trying to work with UTF-8 text files. I'm constantly fighting the BOM chars that the writer drops in for UTF-8, which blow up pretty much anything I use to read the file, including serializers and other text readers.
I'm getting six leading bytes of data:
0xEF
0xBB
0xBF
0xEF
0xBB
0xBF
(Now that I'm looking at it, I realize there are two of those three-byte sequences there. Is that the UTF-8 BOM marker? Am I double-encoding it?)
Notice the serializer encodes to UTF8, then the memory stream gets a string as UTF8, then I write the string to the file with UTF8... seems like a lot of redundancy. Thoughts?
// I'm storing this xml result to a database field. (this one includes the BOM chars)
using (MemoryStream ms = new MemoryStream())
{
    Utility.SerializeXml(ms, root);
    xml = Encoding.UTF8.GetString(ms.ToArray());
}

// later on, I would take that xml and then write it out to a file like this:
File.WriteAllText(path, xml, Encoding.UTF8);
public static void SerializeXml(Stream output, object data)
{
    XmlSerializer xs = new XmlSerializer(data.GetType());

    XmlWriterSettings settings = new XmlWriterSettings();
    settings.Indent = true;
    settings.IndentChars = "\t";
    settings.Encoding = Encoding.UTF8;

    XmlWriter writer = XmlWriter.Create(output, settings);
    xs.Serialize(writer, data);
    writer.Flush();
    writer.Close();
}
Yeah, that's two BOMs. You're encoding to UTF-8 twice and each time adds a pseudo-BOM, due to the extremely unfortunate fact that:
Encoding.UTF8
means “UTF-8 with a pointless, meaningless U+FEFF stuck to the front to screw up your applications”. Try instead using
new UTF8Encoding(false)
which should give you a less sucky version.
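Putting that together with the code from the question (reusing root and path from the snippets above, and inlining the SerializeXml helper for brevity), a BOM-free sketch might look like this:

// A UTF8Encoding constructed with false emits no BOM, neither from the
// XmlWriter nor from File.WriteAllText.
var utf8NoBom = new UTF8Encoding(false);

string xml;
using (MemoryStream ms = new MemoryStream())
{
    XmlWriterSettings settings = new XmlWriterSettings
    {
        Indent = true,
        IndentChars = "\t",
        Encoding = utf8NoBom
    };
    using (XmlWriter writer = XmlWriter.Create(ms, settings))
    {
        new XmlSerializer(root.GetType()).Serialize(writer, root);
    }
    xml = utf8NoBom.GetString(ms.ToArray());
}

File.WriteAllText(path, xml, utf8NoBom); // the second place a BOM used to sneak in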
Yes, that is a BOM.
Yes, some older JDKs had a bug that blew up on UTF-8 BOM data, and two of them will confuse even a modern version of Java.
The solution I used was to stick a pushback stream on the front and filter it out.
Or use a more modern version of Java.
The byte sequence 0xEF 0xBB 0xBF is the UTF-8 encoding of U+FEFF, which is the Unicode BOM (byte order mark). It is unnecessary in UTF-8, but crucial in UTF-16 or UTF-32.
You've got the same sequence twice.
The only good thing to do with them is to ignore and/or delete them.
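If you can't stop the BOMs from being written in the first place, here is a small hedged sketch for stripping them after the fact (ms stands for the MemoryStream from the question):

// After decoding, each BOM shows up as a leading '\uFEFF' character.
string xml = Encoding.UTF8.GetString(ms.ToArray()).TrimStart('\uFEFF');

// Or skip repeated EF BB BF prefixes at the byte level before decoding.
byte[] data = ms.ToArray();
int offset = 0;
while (data.Length - offset >= 3 &&
       data[offset] == 0xEF && data[offset + 1] == 0xBB && data[offset + 2] == 0xBF)
{
    offset += 3;
}
string cleaned = Encoding.UTF8.GetString(data, offset, data.Length - offset);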