How do I read any n characters from a file? - C#

How do I read any n characters from a file in C#, respecting the file's encoding?
I mean not reading bytes! I just want to read one character (in any encoding) at a time.

Use TextReader.Read(), probably via StreamReader, which inherits from TextReader.

Try this:
char[] c = new char[5];
using (StreamReader streamReader = File.OpenText(@"c:\test.txt")) // verbatim string: in a plain literal, "\t" would be a tab
{
    int charsRead = streamReader.Read(c, 0, c.Length); // returns the number of chars actually read
}
Update:
I just realized that File.OpenText always uses UTF-8 and takes no parameter to specify another encoding. If you need a different encoding, construct the StreamReader directly, as sketched below.
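A minimal sketch of reading n characters with an explicit encoding via the StreamReader constructor (the path and encoding name here are placeholders):

using System;
using System.IO;
using System.Text;

class ReadNChars
{
    static void Main()
    {
        char[] buffer = new char[5];
        // Unlike File.OpenText, this constructor lets you pick the encoding
        using (var reader = new StreamReader(@"c:\test.txt", Encoding.GetEncoding("iso-8859-1")))
        {
            int read = reader.Read(buffer, 0, buffer.Length); // number of chars actually read
            Console.WriteLine(new string(buffer, 0, read));
        }
    }
}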

StreamReader reader = new StreamReader("date.txt");
string txt = reader.ReadToEnd();
char c = txt[num]; // index into the string for the num-th character
It's not the most efficient approach, since it reads the whole file, but it works.
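If you only need the character at a given index, here is a minimal sketch that reads one decoded character at a time instead of the whole file (num and "date.txt" are taken from the snippet above):

using System;
using System.IO;

int num = 5; // target index, 0-based
int ch = -1;
using (var reader = new StreamReader("date.txt"))
{
    // StreamReader.Read() decodes and returns one character at a time, or -1 at end of stream
    for (int i = 0; i <= num && (ch = reader.Read()) != -1; i++) { }
}
if (ch != -1)
    Console.WriteLine((char)ch); // the character at index num, if the file is long enough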

Related

Encoding detection for string data in a byte[] succeeds, but afterwards all string comparisons fail

How it is all set up:
I receive a byte[] which contains CSV data.
I don't know the encoding (it should be Unicode / UTF-8).
I need to detect the encoding or fall back to a default (the text may contain umlauts, so the encoding is important).
I need to read the header line and compare it with defined strings.
After a short search on how to get a string out of the byte[], I found How to convert byte[] to string?, which suggested using something like
string result = System.Text.Encoding.UTF8.GetString(byteArray);
I now use this helper to detect the encoding, and afterwards the Encoding.GetString method to read the string, like so:
string csvFile = TextFileEncodingDetector.DetectTextByteArrayEncoding(data).GetString(data);
But when I now try to compare values from this result string with static strings in my code, all comparisons fail!
// headers is the first line from the string that I receive from EncodingHelper.ReadData(data)
for (int i = 0; i < headers.Count; i++) {
    switch (headers[i].Trim().ToLower()) {
        case "number":
            // do
            break;
        default:
            throw new Exception();
    }
}
// where (headers[i].Trim().ToLower()) => "number"
While this seems to be a problem with the encoding of both strings, my question is:
How can I detect the encoding of a string from a byte[] and convert it into the default encoding so that I am able to work with that string data?
Edit
The code supplied above was working as long as the string data came from a file that was saved this way:
string tempFile = Path.GetTempFileName();
StreamReader reader = new StreamReader(inputStream);
string line = null;
TextWriter tw = new StreamWriter(tempFile);
fileCount++;
while ((line = reader.ReadLine()) != null)
{
    if (line.Length > 1)
    {
        tw.WriteLine(line);
    }
}
tw.Close();
and afterwards read out with
File.ReadAllText()
This
A. forces the file to be Unicode (the ANSI format kills all umlauts), and
B. requires the written file to be accessible.
Now I only have the inputStream and tried what I posted above. As I mentioned, this worked before, and the strings look identical. But they are not.
Note: If I use an ANSI-encoded file, which uses Encoding.Default, everything works fine.
Edit 2
While the ANSI-encoded data works, the UTF-8 encoded data (Notepad++ shows only "UTF-8", not "UTF-8 without BOM") starts with char[0]: 65279.
So where is my error? I guess System.Text.Encoding.UTF8.GetString(byteArray) is working the right way.
Yes, Encoding.GetString doesn't strip the BOM (see https://stackoverflow.com/a/11701560/613130). You could:
string result;
using (var memoryStream = new MemoryStream(byteArray))
{
    result = new StreamReader(memoryStream).ReadToEnd();
}
The StreamReader will autodetect the encoding (your encoding detector is a copy of StreamReader.DetectEncoding()).
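Alternatively, if you prefer to keep Encoding.GetString, a minimal sketch that trims a leading BOM character (which decodes to U+FEFF, the 65279 you saw) from the result before comparing:

string csvFile = System.Text.Encoding.UTF8.GetString(data);
// The UTF-8 BOM decodes to the single char U+FEFF (65279); remove it if present
csvFile = csvFile.TrimStart('\uFEFF');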

How to properly decode accented characters for display

My raw input text file contains the string:
Caf&eacute (Should be Café)
The text file is a UTF-8 file.
The output, let's say, is to another text file, so it's not necessarily for a web page.
What C# method(s) can I use to output the correct form, Café?
Apparently a common problem?
Have you tried System.Web.HttpUtility.HtmlDecode("Caf&eacute;")? Searching for this problem returns 538M results.
This is HTML encoded text. You need to decode it:
string decoded = HttpUtility.HtmlDecode(text);
UPDATE: the French character "é" has the HTML code "&eacute;" (note the trailing semicolon), so you need to fix your input string.
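A minimal sketch of repairing entities that lack the trailing semicolon before decoding. The regex here is an illustrative assumption, not a general-purpose entity fixer (it would also "fix" a bare & followed by letters that isn't an entity):

using System;
using System.Text.RegularExpressions;
using System.Web;

class FixEntities
{
    static void Main()
    {
        string input = "Caf&eacute"; // entity missing its trailing ';'
        // Append the missing ';' after a bare named entity, then decode
        string repaired = Regex.Replace(input, @"&([a-zA-Z]+)\b(?!;)", "&$1;");
        Console.WriteLine(HttpUtility.HtmlDecode(repaired)); // Café
    }
}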
You should use SecurityElement.Escape when working with XML files.
HtmlEncode will encode a lot of extra entities that are not required. XML only requires that you escape >, <, &, ", and ', which SecurityElement.Escape does.
When reading the file back through an XML parser, this conversion is done for you by the parser; you shouldn't need to "decode" it.
EDIT: Of course this is only helpful when writing XML files.
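A minimal sketch of the difference (SecurityElement lives in System.Security):

using System;
using System.Security;

class EscapeDemo
{
    static void Main()
    {
        // Escapes only the five characters XML requires; the accented é passes through untouched
        Console.WriteLine(SecurityElement.Escape("Café & \"friends\" <here>"));
        // Prints: Café &amp; &quot;friends&quot; &lt;here&gt;
    }
}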
I think this works:
string utf8String = "Your string";
Encoding utf8 = Encoding.UTF8;
Encoding unicode = Encoding.Unicode;
byte[] utf8Bytes = utf8.GetBytes(utf8String);
byte[] unicodeBytes = Encoding.Convert(utf8, unicode, utf8Bytes);
char[] uniChars = new char[unicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length)];
unicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, uniChars, 0);
string unicodeString = new string(uniChars);
Use HttpUtility.HtmlDecode. Example:
using System;
using System.Web;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        XDocument doc = new XDocument(new XElement("test",
            HttpUtility.HtmlDecode("caf&eacute;")));
        Console.WriteLine(doc);
        Console.ReadKey();
    }
}

FileStream prepends junk characters when reading

I am reading a simple text file, which contains a single line, using the FileStream class. But it seems FileStream.Read prepends some junk characters at the beginning.
Below is the code.
using (var _fs = File.Open(_idFilePath, FileMode.Open, FileAccess.ReadWrite, FileShare.Read))
{
    byte[] b = new byte[_fs.Length];
    UTF8Encoding temp = new UTF8Encoding(true);
    while (_fs.Read(b, 0, b.Length) > 0)
    {
        Console.WriteLine(temp.GetString(b));
        Console.WriteLine(ASCIIEncoding.ASCII.GetString(b));
    }
}
For example: my data in the text file is just "sample". But the above code returns
"?sample" and
"???sample"
What's the reason? Is it a start-of-file indicator? Is there a way to read only my actual content?
The byte order mark (BOM) consists of the Unicode char 0xFEFF and is used to mark a file with the encoding used for it.
So if you correctly decode the file as UTF-8, you get that character as the first char of your string. If you incorrectly decode it as ANSI, you get 3 chars, since the UTF-8 encoding of 0xFEFF is the byte sequence "EF BB BF", which is 3 bytes.
But your whole code can be replaced with
File.ReadAllText(fileName, Encoding.UTF8)
and that should remove the BOM too. Or you can leave out the encoding parameter and let the function autodetect the encoding (for which it uses the BOM).
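If you do want to stay at the byte level, here is a minimal sketch that skips the UTF-8 BOM by hand (the file name is a placeholder):

using System;
using System.IO;
using System.Text;

byte[] data = File.ReadAllBytes("sample.txt");
// Skip the 3-byte UTF-8 BOM (EF BB BF) if the file starts with one
int offset = (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) ? 3 : 0;
Console.WriteLine(Encoding.UTF8.GetString(data, offset, data.Length - offset));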
Could be the BOM - a.k.a. the byte order mark.
You are reading the BOM from the stream. If you are reading text, try using a StreamReader, which will handle this automatically.
Try instead
using (StreamReader sr = new StreamReader(File.OpenRead(path), Encoding.UTF8))
It will definitely strip the BOM for you.

Using .NET, how to convert ISO 8859-1 encoded text files that contain Latin-1 accented characters to UTF-8

I am being sent text files saved in ISO 8859-1 format that contain accented characters from the Latin-1 range (as well as normal ASCII a-z, etc.). How do I convert these files to UTF-8 using C# so that the single-byte accented characters in ISO 8859-1 become valid UTF-8 characters?
I have tried using a StreamReader with ASCIIEncoding, then converting the ASCII string to UTF-8 by instantiating an ASCII encoding and a UTF-8 encoding and using Encoding.Convert(ascii, utf8, ascii.GetBytes(asciiString)), but the accented characters are rendered as question marks.
What step am I missing?
You need to get the proper Encoding object. ASCII is just as it's named: ASCII, meaning that it only supports 7-bit ASCII characters. If what you want to do is convert files, then this is likely easier than dealing with the byte arrays directly.
using (System.IO.StreamReader reader = new System.IO.StreamReader(fileName,
    Encoding.GetEncoding("iso-8859-1")))
{
    using (System.IO.StreamWriter writer = new System.IO.StreamWriter(
        outFileName, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }
}
However, if you want to have the byte arrays yourself, it's easy enough to do with Encoding.Convert.
byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
Encoding.UTF8, data);
It's important to note here, however, that if you want to go down this road then you should not use an encoding-based string reader like StreamReader for your file IO. FileStream would be better suited, as it will read the actual bytes of the files.
In the interest of fully exploring the issue, something like this would work:
using (System.IO.FileStream input = new System.IO.FileStream(fileName,
    System.IO.FileMode.Open,
    System.IO.FileAccess.Read))
{
    byte[] buffer = new byte[input.Length];
    int readLength = 0;
    // Loop until the whole file is read; Read may return fewer bytes than requested
    while (readLength < buffer.Length)
        readLength += input.Read(buffer, readLength, buffer.Length - readLength);
    byte[] converted = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"),
        Encoding.UTF8, buffer);
    using (System.IO.FileStream output = new System.IO.FileStream(outFileName,
        System.IO.FileMode.Create,
        System.IO.FileAccess.Write))
    {
        output.Write(converted, 0, converted.Length);
    }
}
In this example, the buffer variable gets filled with the actual data in the file as a byte[], so no conversion is done. Encoding.Convert specifies a source and destination encoding, then stores the converted bytes in the variable named...converted. This is then written to the output file directly.
Like I said, the first option using StreamReader and StreamWriter will be much simpler if this is all you're doing, but the latter example should give you more of a hint as to what's actually going on.
If the files are relatively small (say, ~10 megabytes), you'll only need two lines of code:
string txt = System.IO.File.ReadAllText(inpPath, Encoding.GetEncoding("iso-8859-1"));
System.IO.File.WriteAllText(outPath, txt);

How do I convert a byte array to a string

I have a byte array which I got back from FileStream.Read, and I would like to turn it into a string. I'm not 100% sure of the encoding - it's just a file I saved to disk - how do I do the conversion? Is there a .NET class that reads the byte order mark and can figure out the encoding for me?
See how-to-guess-the-encoding-of-a-file-with-no-bom-in-net.
Since strings are Unicode, you must specify an encoding when converting. Text streams (even File.ReadAllText()) have an active encoding inside, usually some sensible default.
Try something like this:
buffer = Encoding.Convert(Encoding.GetEncoding("iso-8859-1"), Encoding.UTF8, buffer);
string newString = Encoding.UTF8.GetString(buffer, 0, buffer.Length); // use the converted buffer's full length
If File.ReadAllText will read the file correctly, then you have a couple of options.
Instead of calling BeginRead, you could just call File.ReadAllText asynchronously:
delegate string AsyncMethodCaller(string fname);

static void Main(string[] args)
{
    string InputFilename = "testo.txt";
    AsyncMethodCaller caller = File.ReadAllText;
    IAsyncResult rslt = caller.BeginInvoke(InputFilename, null, null);
    // do other work ...
    string fileContents = caller.EndInvoke(rslt);
}
Or you can create a MemoryStream from the byte array, and then use a StreamReader on that.
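A minimal sketch of that second option (bytes being the array you got from FileStream.Read):

using System.IO;

string text;
using (var reader = new StreamReader(new MemoryStream(bytes), detectEncodingFromByteOrderMarks: true))
{
    // The encoding is detected from the BOM, if one is present; otherwise UTF-8 is assumed
    text = reader.ReadToEnd();
}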
How much do you know about the file? Could it really be any encoding? If so, you'd need to use heuristics to guess the encoding. If it's going to be UTF-8, UTF-16 or UTF-32 then
new StreamReader(new MemoryStream(bytes), true)
will detect the encoding for you automatically. Text is pretty nasty if you really don't know the encoding though. There are plenty of cases where you really would just be guessing.
There is no simple way to get the encoding, but, as mentioned above, use
string str = System.Text.Encoding.Default.GetString(mybytearray);
if you have no clue what the encoding is. If you are in Europe, ISO-8859-1 is probably the encoding you have.
string str = System.Text.Encoding.GetEncoding("ISO-8859-1").GetString(mybytearray);
System.IO.File.ReadAllText does what you want.
