Handle — while importing a CSV file using C# - c#

I have written code (in C#) to import a CSV file using FileHelpers.
I am facing an issue: if the file contains an em dash (—), it gets replaced by a ? character (not exactly ?, but a special replacement character, as shown in the image below).
How can I handle this in code?
Thanks.

How is your StreamReader object created? Have you provided a specific encoding to it? If not, I think you should try that, since the default encoding cannot be detected when there is no BOM in the file.
From MSDN
The character encoding is set by the encoding parameter, and the buffer size is set to 1024 bytes. The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
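If the CSV was saved as ANSI (as Excel and older Windows tools often do), a minimal sketch of creating the reader with an explicit encoding might look like this; the file name input.csv and the Windows-1252 code page are assumptions, so swap in your own:

using System;
using System.IO;
using System.Text;

class CsvEncodingExample
{
    static void Main()
    {
        // On .NET Core / .NET 5+, code-page encodings need the
        // System.Text.Encoding.CodePages package plus this one-time registration:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Assumption: the CSV is Windows-1252 encoded, where the em dash (—) is byte 0x97.
        var encoding = Encoding.GetEncoding(1252);

        using var reader = new StreamReader("input.csv", encoding,
            detectEncodingFromByteOrderMarks: true);

        // Hand this reader to your CSV parsing code; FileHelpers' engine
        // exposes a ReadStream overload that accepts a TextReader (check your version).
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            Console.WriteLine(line);
        }
    }
}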

Related

Why does byte 150 show up as a dash in Notepad but not when I read it programmatically?

I've got a file that looks OK in Notepad (and Notepad++) but when I try to read it with a C# program, the dash shows up as a replacement character (�) instead. After some trial and error, I can reproduce the error as follows:
File.WriteAllBytes("C:\\Misc\\CharTest\\wtf.txt", new byte[] { 150 });
var readFile = File.ReadAllText("C:\\Misc\\CharTest\\wtf.txt");
Console.WriteLine(readFile);
Now, if you go and look in the wtf.txt file using Notepad, you'll see a dash... but I don't get it. I know that's not a "real" Unicode value so that's probably the root of the issue, but I don't get why it looks fine in Notepad and not when I read in the file. And how do I get the file to read it as a dash?
As an aside, a VB6 program I'm trying to rewrite in C# also reads it as a dash.
The File.ReadAllText(string) overload defaults to UTF8 encoding, in which a standalone byte with value 150 is invalid.
Specify the actual encoding of the file, for example:
var encoding = Encoding.GetEncoding(1252);
string content = File.ReadAllText(fileName, encoding);
I used the Windows-1252 encoding, which has a dash at codepoint 150.
Edit: Notepad displays the file correctly because, for non-Unicode files, the Windows-1252 code page is the default under western regional settings. So you can likely also use Encoding.Default to get the correct result, but keep in mind that Encoding.Default can return different code pages under different regional settings.
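Putting the reproduction and the fix together, a minimal sketch (assuming the file really is Windows-1252, where byte 150 is the en dash, U+2013):

using System;
using System.IO;
using System.Text;

class DashRoundTrip
{
    static void Main()
    {
        string path = "wtf.txt";

        // Write the single byte 150 (0x96), exactly as in the question.
        File.WriteAllBytes(path, new byte[] { 150 });

        // On .NET Core / .NET 5+, register the code-page encodings first
        // (requires the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Read with Windows-1252, where byte 150 maps to the en dash (U+2013).
        string content = File.ReadAllText(path, Encoding.GetEncoding(1252));

        Console.WriteLine(content);          // –
        Console.WriteLine((int)content[0]);  // 8211, i.e. U+2013
    }
}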
You are writing raw bytes to a text file, and then you are reading those bytes back and interpreting them as characters.
When you write bytes, you don't need to care about encoding, but you do have to in order to read those very same bytes as characters.
Notepad++ seems to interpret the byte using the ANSI code page and therefore prints the dash.
File.ReadAllText reads the bytes using the specified encoding. You did not specify one, so the default is used, which is UTF-8, and a standalone byte of 150 is not valid there.
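To see the mismatch in isolation, here is a small sketch (not from the original answer) showing how the same dash becomes different bytes depending on the encoding, and why decoding the ANSI byte as UTF-8 produces the replacement character:

using System;
using System.Text;

class EncodingMismatch
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register code-page encodings first
        // (requires the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        var windows1252 = Encoding.GetEncoding(1252);

        // The en dash is a single byte (0x96) in Windows-1252 ...
        byte[] ansiBytes = windows1252.GetBytes("\u2013");
        Console.WriteLine(BitConverter.ToString(ansiBytes));   // 96

        // ... but three bytes in UTF-8.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u2013");
        Console.WriteLine(BitConverter.ToString(utf8Bytes));   // E2-80-93

        // Decoding the Windows-1252 byte as UTF-8 yields the replacement character.
        Console.WriteLine(Encoding.UTF8.GetString(ansiBytes)); // �
    }
}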

Some questions about encoding

I have some questions about encoding. I have files of various types and encodings. Can only text files (.txt, .csv, .xml) have a byte order mark, or am I wrong? In the first method I want to prepare the file: change the encoding and remove the preamble (or not); in the second I only want to use:
File.WriteAllBytes("testFile.csv", fileBytes);
I don't want to pass an encoding to the second method, so should there be a BOM in every file? Or will it be saved with Unicode encoding?
To convert the encoding I want to use the Encoding.Convert method, but after converting there is no BOM in the file, so is it better to use StreamReader and StreamWriter with source and destination encodings?
The XML format declares its encoding at the top of the file (the XML declaration), but that is just a notation; the real encoding can be different.
In my opinion, use the same encoding for all of your output and input files. Unicode is a good choice, as you mentioned.
If you cannot control the input files' encoding because they were written by someone else, try UTF-8 for them. Most text editors use UTF-8 as their default encoding, and File.WriteAllText defaults to UTF-8 as well (File.WriteAllBytes writes raw bytes and does not apply any encoding).
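If you do decide to convert everything up front, here is a sketch of the StreamReader/StreamWriter approach the question mentions, assuming a Windows-1252 source and a UTF-8 (with BOM) destination; the file names are placeholders:

using System.IO;
using System.Text;

class ConvertEncoding
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register code-page encodings first
        // (requires the System.Text.Encoding.CodePages package).
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var source = Encoding.GetEncoding(1252);                                    // assumed input encoding
        var destination = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);  // UTF-8 with BOM

        using var reader = new StreamReader("input.csv", source);
        using var writer = new StreamWriter("output.csv", append: false, destination);

        // StreamWriter emits the encoding's preamble (the BOM) automatically,
        // so the converted file can later be read back without passing an encoding.
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, read);
        }
    }
}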

Invalid characters in File.ReadAllText

I'm calling File.ReadAllText() in a program designed to format some files that I have.
Some of these files contain the ® (174) symbol. However, when the text is being read, the returned string contains � (65533) symbols where the ® (174) should be.
What would cause this and how can I fix it?
Most likely the file is in a different encoding than the default. If you know it, you can specify it using the File.ReadAllText(String, Encoding) overload.
Code sample:
string readText = File.ReadAllText(path, Encoding.Default); // <-- change the encoding to whatever the encoding really is
If you DON'T know the encoding, see this previous SO question: How to use ReadAllText when file encoding unknown
This is likely due to a mismatch in the Encoding. Use the ReadAllText overload which allows you to specify the proper Encoding to use when reading the file.
The default overload will assume UTF-8 unless it can detect UTF-32. Any other encoding will come through incorrectly.
You need to specify the encoding when you call File.ReadAllText, unless the file is actually in UTF-8, which it sounds like it's not. (Basically the one-parameter overload is equivalent to passing in UTF-8 as the second argument. It will also detect UTF-32 with an appropriate byte-order mark, I believe.)
The first thing is to work out which encoding it is in (e.g. ISO-8859-1 - but you need to check this) and then pass that as a second argument.
For example:
Encoding isoLatin1 = Encoding.GetEncoding(28591);
string text = File.ReadAllText(path, isoLatin1);
It's always important that you know what encoding binary data is using before you try to read it as text. That's true for files, network streams, anything.
The character you are reading is the replacement character (U+FFFD), "used to replace an incoming character whose value is unknown or unrepresentable in Unicode (compare the use of U+001A as a control character to indicate the substitute function)".
http://www.fileformat.info/info/unicode/char/fffd/index.htm
You are getting this because the actual encoding of the file does not match the encoding your program expects.
By default, ReadAllText expects UTF-8. It is encountering a byte sequence that is not valid UTF-8, so it replaces it with the replacement character.
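If you are not sure whether a given file is UTF-8, one common approach (a sketch, not part of the answers above) is to try strict UTF-8 first and fall back to a single-byte code page such as Windows-1252:

using System.IO;
using System.Text;

static class RobustRead
{
    // On .NET Core / .NET 5+, register code-page encodings once at startup
    // (requires the System.Text.Encoding.CodePages package):
    // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    public static string ReadAllTextWithFallback(string path)
    {
        // Strict UTF-8: throw instead of silently inserting U+FFFD.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);

        try
        {
            return File.ReadAllText(path, strictUtf8);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; assume a Windows-1252 (ANSI) file instead.
            return File.ReadAllText(path, Encoding.GetEncoding(1252));
        }
    }
}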

StreamReader weird error with �

StreamReader reads '–' (Alt+0150) as � even though I have UTF-8 encoding and detectEncodingFromByteOrderMarks (BOM) set to true. Can anyone guide me on this?
That byte won't appear in UTF-8 encoded text. The character is '\u2013', which is encoded as 0xE2 0x80 0x93 in UTF-8. If you get this character when you type Alt+0150 on the numeric keypad, then your default system code page is probably 1252. Simply pass Encoding.Default to the StreamReader constructor.
You need to know the encoding that was used to encode the text. There's no way around that. Try different encodings until you get the desired results.
From MSDN:
The detectEncodingFromByteOrderMarks parameter detects the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used. See the Encoding.GetPreamble method for more information.
Which means that BOM detection is just an extra step that may or may not work, and can easily be overridden.
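In other words, you can still give the StreamReader a fallback encoding and let the BOM, if present, override it. A sketch assuming an ANSI (Windows-1252) fallback and a placeholder file name:

using System;
using System.IO;
using System.Text;

class ReaderWithFallback
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register code-page encodings once at startup
        // (requires the System.Text.Encoding.CodePages package):
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // If the file starts with a UTF-8 or UTF-16 BOM, that encoding wins;
        // otherwise the reader falls back to Windows-1252.
        using var reader = new StreamReader(
            "input.txt",
            Encoding.GetEncoding(1252),
            detectEncodingFromByteOrderMarks: true);

        Console.WriteLine(reader.ReadToEnd());
    }
}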
As the other users wrote, the probable reason for this issue is that the file is ANSI-encoded. I recreated the issue you described when I saved the file in ANSI encoding.
Try to use this code:
var stream = new StreamReader(fileName, Encoding.Default);
The Encoding.Default parameter is important here. This code should read the character you mentioned correctly.

FileInfo.OpenRead() - what type of encoding does it use?

I'm using this method to write to a MemoryStream object, which is subsequently stored as binary in SQL. It is being used to read in .HTML files from the file system on Windows.
How do I know which type of encoding this data is being read in as? Thanks.
None, because it opens a binary stream. When you, for example, wrap the stream in a StreamReader, that is the moment you choose the encoding. The FileStream returned by the OpenRead method is not text-based and thus does not have an encoding.
FileInfo.OpenRead returns a raw stream that does not use any encoding (since it returns bytes, not characters).
Encodings are used to convert raw bytes into Unicode characters.
In .NET, encodings are used by the StreamReader and StreamWriter classes, which work with strings instead of bytes.
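A small sketch of what that looks like in practice, assuming the .HTML files are UTF-8 (swap in the real encoding if you know it):

using System;
using System.IO;
using System.Text;

class OpenReadExample
{
    static void Main()
    {
        var fileInfo = new FileInfo("page.html");

        // OpenRead returns a raw FileStream: bytes only, no encoding involved.
        using FileStream stream = fileInfo.OpenRead();

        // The encoding is chosen only here, where the bytes are decoded to text.
        using var reader = new StreamReader(stream, Encoding.UTF8);
        string html = reader.ReadToEnd();

        Console.WriteLine(html.Length);
    }
}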
