I am trying to read an email from POP3 and change to the correct encoding when I find the charset in the headers.
I use a TCP Client to connect to the POP3 server.
Below is my code:
public string ReadToEnd(POP3Client pop3client, out System.Text.Encoding messageEncoding)
{
    messageEncoding = TCPStream.CurrentEncoding;
    if (EOF)
        return ("");

    System.Text.StringBuilder sb = new System.Text.StringBuilder(m_bytetotal * 2);
    string st = "";
    string tmp;
    do
    {
        tmp = TCPStream.ReadLine();
        if (tmp == ".")
            EOF = true;
        else
            sb.Append(tmp + "\r\n");
        //st += tmp + "\r\n";

        m_byteread += tmp.Length + 2; // CRLF discarded by read
        FireReceived();

        if (tmp.ToLower().Contains("content-type:") && tmp.ToLower().Contains("charset="))
        {
            try
            {
                string charSetFound = tmp.Substring(tmp.IndexOf("charset=") + "charset=".Length).Replace("\"", "").Replace(";", "");
                var realEnc = System.Text.Encoding.GetEncoding(charSetFound);
                if (realEnc != TCPStream.CurrentEncoding)
                {
                    TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
                }
            }
            catch { }
        }
    } while (!EOF);

    messageEncoding = TCPStream.CurrentEncoding;
    return (sb.ToString());
}
If I remove this line:
TCPStream = new StreamReader(pop3client.m_tcpClient.GetStream(), realEnc);
Everything works fine, except that when the e-mail contains characters from a different charset I get question marks, because the initial encoding is ASCII.
Any suggestions on how to change the encoding while reading data from the Network Stream?
You're doing it wrong (tm).
Seriously, though, you are going about trying to solve this problem in completely the wrong way. Don't use a StreamReader for this. And especially don't read 1 byte at a time (as you said you needed to do in a comment on an earlier "solution").
For an explanation of why not to use a StreamReader, besides the obvious "because it isn't designed to switch between encodings during the process of reading", feel free to read over another answer I gave about the inefficiencies of using a StreamReader here: Reading an mbox file in C#
What you need to do is buffer your reads (such as a 4k buffer should be fine). Then, as you are already having to do anyway, scan for the '\n' byte to extract content on a line-by-line basis, combining header lines that were folded.
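To make that concrete, here is a minimal sketch of buffered, line-oriented reading over a raw Stream (the 4k buffer size comes from the paragraph above; the handleRawLine callback is a made-up name for whatever you do with each undecoded line, and combining folded header lines is left to the caller):

// Sketch: read raw bytes in 4k chunks and split on '\n' without decoding yet.
// assumes: using System; using System.IO;
static void ReadRawLines(Stream stream, Action<byte[]> handleRawLine)
{
    var buffer = new byte[4096];
    var line = new MemoryStream();
    int read;
    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < read; i++)
        {
            if (buffer[i] == (byte)'\n')
            {
                handleRawLine(line.ToArray()); // decode later, once the charset is known
                line.SetLength(0);
            }
            else
            {
                line.WriteByte(buffer[i]);
            }
        }
    }
    if (line.Length > 0)
        handleRawLine(line.ToArray()); // trailing data without a final newline
}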
Each header may have multiple encoded-word tokens, each of which may be in a separate charset, assuming they are properly encoded; otherwise you'll have to deal with undeclared 8-bit data and try to massage it into Unicode somehow (probably by having a set of fallback charsets). I'd recommend trying UTF-8 first, followed by a selection of charsets that the user of your library has provided, before finally trying iso-8859-1 (make sure not to try iso-8859-1 until you've tried everything else, because any sequence of 8-bit text will convert properly to Unicode using the iso-8859-1 character encoding).
When you get to text content of the message, you'll want to check the Content-Type header for a charset parameter. If no charset parameter is defined, it should be US-ASCII, but in practice it could be anything. Even if the charset is defined, it might not match the actual character encoding used in the text body of the message, so once again you'll probably want to have a set of fallbacks.
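A rough sketch of that kind of fallback decoding (the list of user-supplied charsets is hypothetical; only the UTF-8-first, iso-8859-1-last ordering comes from the answer above):

// Sketch: try each candidate encoding in turn; iso-8859-1 goes last because it never fails.
// assumes: using System; using System.Collections.Generic; using System.Text;
static string DecodeWithFallbacks(byte[] bytes, IEnumerable<string> userCharsets)
{
    var candidates = new List<Encoding> { new UTF8Encoding(false, true) }; // strict UTF-8 first
    foreach (var name in userCharsets)
    {
        try
        {
            candidates.Add(Encoding.GetEncoding(name, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback));
        }
        catch (ArgumentException) { /* unknown charset name, skip it */ }
    }
    foreach (var enc in candidates)
    {
        try { return enc.GetString(bytes); }
        catch (DecoderFallbackException) { /* not valid in this charset, try the next one */ }
    }
    return Encoding.GetEncoding("iso-8859-1").GetString(bytes); // always succeeds
}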
As you've probably guessed by this point, this is very clearly not a trivial task as it requires the parser to do on-the-fly character conversion as it goes (and the character conversion requires internal parser state about what the expected charset is at any given time).
Since I've already done the work, you should really consider using MimeKit which will parse the email and properly do charset conversion on the headers and the content using the appropriate charset encoding.
I've also written a Pop3Client class that is included in my MailKit library.
If your goal is to learn and write your own library, I'd still highly recommend reading over my code because it is highly efficient and does things in a proper way.
There are some ways you can detect the encoding by looking at the Byte Order Mark (BOM), the first few bytes of the stream, which tell you the encoding. However, the stream might not have a BOM, and in that case it could be ASCII, a UTF encoding without a BOM, or something else.
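A minimal sketch of such a BOM check (it only covers the common Unicode BOMs and returns null when none is present):

// Sketch: inspect the first bytes of a buffer for a known BOM.
// assumes: using System.Text;
static Encoding DetectBom(byte[] b)
{
    if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return Encoding.UTF8;
    if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return Encoding.UTF32;               // UTF-32 LE
    if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return new UTF32Encoding(true, true); // UTF-32 BE
    if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return Encoding.Unicode;            // UTF-16 LE
    if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return Encoding.BigEndianUnicode;   // UTF-16 BE
    return null; // no BOM: could be ASCII, UTF-8 without BOM, or something else
}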
You can convert your stream from one encoding to another with the Encoding Class:
Encoding textEncoding = Encoding.[your detected encoding here];
byte[] converted = Encoding.UTF8.GetBytes(textEncoding.GetString(TCPStream.GetBuffer()));
You may select your preferred encoding when converting.
Hope it answers your question.
edit
You may use this code to read your stream in blocks.
MemoryStream st = new MemoryStream();
int numOfBytes = 1024;
int reads = 1;
while (reads > 0)
{
    byte[] bytes = new byte[numOfBytes];
    reads = yourStream.Read(bytes, 0, numOfBytes);
    if (reads > 0)
    {
        int writes = (reads < numOfBytes ? reads : numOfBytes);
        st.Write(bytes, 0, writes);
    }
}
First, a depressing fact: https://www.base64decode.org/ can do what I want to do.
I'm trying to encode and decode (to and from Base64) a model file (.shm) generated by the image processing tool MVTec Halcon, because I want to store it in an XML file.
If I open it, it has this strange form:
HSTF ÿÿÿÿ¿€ Q¿ÙG®záH?Üä4©±w?Eè}‰#?ð ................
I'm using these methods to encode and decode it:
public static string Base64Encode(string text)
{
    Byte[] textBytes = Encoding.Default.GetBytes(text);
    return Convert.ToBase64String(textBytes);
}

public static string Base64Decode(string base64EncodedData)
{
    Byte[] base64EncodedBytes = Convert.FromBase64String(base64EncodedData);
    return Encoding.Default.GetString(base64EncodedBytes);
}
and calling the methods from a GUI like this:
var model = File.ReadAllText(@"C:\Users\\Desktop\model_region_nut.txt");
var base64 = ImageConverter.Base64Encode(model);
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", base64);
var modelneu = ImageConverter.Base64Decode(File.ReadAllText(@"C:\Users\\Desktop\base64.txt"));
File.WriteAllText(@"C:\Users\\Desktop\modelneu.txt", modelneu);
My result for modelneu is:
HSTF ?????? Q??G?z?H???4??w??E?}??#??
so you can see that there are lots of missing characters. I guess the problem is caused by using Encoding.Default.
Thanks for your help,
Michel
If you're working with binary data, there is no reason at all to go through text decoding and encoding. Doing so only risks corrupting the data in various ways, even if you're using a consistent character encoding.
Just use File.ReadAllBytes() instead of File.ReadAllText() and skip the unnecessary Encoding step.
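A minimal sketch of that byte-based round trip (the paths are the ones from the question):

// Sketch: Base64 round trip on raw bytes; no Encoding involved, so nothing gets corrupted.
// assumes: using System; using System.IO;
byte[] model = File.ReadAllBytes(@"C:\Users\\Desktop\model_region_nut.txt");
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", Convert.ToBase64String(model)); // Base64 is plain ASCII

byte[] modelneu = Convert.FromBase64String(File.ReadAllText(@"C:\Users\\Desktop\base64.txt"));
File.WriteAllBytes(@"C:\Users\\Desktop\modelneu.txt", modelneu);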
The problem is with reading the file with an unspecified encoding; check this question.
As mentioned there, you can use the overload of ReadAllText that specifies an encoding, and for writing you must also specify the encoding for WriteAllText. I suggest using UTF-8, so:
var model = File.ReadAllText(@"C:\Users\pichlerm\Desktop\model_region_nut.txt", Encoding.UTF8);
var base64 = ImageConverter.Base64Encode(model);
File.WriteAllText(@"C:\Users\\Desktop\base64.txt", base64, Encoding.UTF8);
var modelneu = ImageConverter.Base64Decode(File.ReadAllText(@"C:\Users\\Desktop\base64.txt"));
File.WriteAllText(@"C:\Users\pichlerm\Desktop\modelneu.txt", modelneu);
I'm using HttpClient to fetch some files. I put the content into a byte array (bytes). Now I need to detect the encoding. The content type will be either HTML, CSS, JavaScript or XML.
Currently I check the charset from headers, then check for a BOM (byte order mark) before I finally check the first part of the file for a charset meta tag.
Normally this works fine, because there are no conflicts.
But: Is that order correct (in case of conflict)?
The code I currently use:
Encoding encoding;
try
{
    encoding = Encoding.GetEncoding(responseMessage.Content.Headers.ContentType.CharSet);
}
catch
{
    using (MemoryStream ms = new MemoryStream(bytes))
    {
        using (StreamReader sr = new StreamReader(ms, Encoding.Default, true))
        {
            char[] chars = new char[1024];
            sr.Read(chars, 0, 1024);
            string textDefault = new string(chars);
            if (sr.CurrentEncoding == Encoding.Default)
            {
                encoding = Global.EncodingFraContentType(textDefault);
            }
            else
            {
                encoding = sr.CurrentEncoding;
            }
        }
    }
}
responseInfo.Text = encoding.GetString(bytes);
Global.EncodingFraContentType is a regular expression that finds the charset defined either in XML declaration, or in a meta tag.
What order is the correct to detect charset/encoding?
The correct answer depends not on order, but on which actually gives the correct result, and there's no perfect answer here.
If there is a conflict, then the server has given you something incorrect. Since it's incorrect there can't be a "correct" order because there isn't a correct way of being incorrect. And, maybe the header and the embedded metadata are both wrong!
No even slightly commonly-used encoding can have something at the beginning that looks like a UTF-8 or UTF-16 BOM and still be a valid example of the content types you mention, so if there's a BOM then that wins.
(The one exception to that is if the document is so badly edited as to switch encoding part-way through, which is not unheard of, but then the buggy content is so very buggy as to have no real meaning.)
If the content contains no octet greater than 0x7F, and the header and the metadata claim it as different examples of US-ASCII, UTF-8, any of the ISO-8859 family of encodings, or any of the other encodings for which those octets all map to the same code points, then it doesn't really matter which you consider it to be, as the net result is the same. Consider it to be whatever the metadata says, as then you don't need to rewrite it to match correctly.
If it's in UTF-16 without a BOM, that is likely to become clear very quickly, because all of those formats have a lot of characters with special meaning in the range U+0000 to U+00FF (indeed, generally U+0020 to U+007F), so you'll have lots of ranges with a zero byte every other character.
If it has octets above 0x7F and is valid UTF-8, then it's almost certainly UTF-8. (By the same token if it's not UTF-8 and has octets above 0x7F then it almost certainly can't be mistaken for UTF-8).
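A quick way to apply that heuristic in C# is to decode with a strict UTF-8 decoder and see whether it throws; a minimal sketch:

// Sketch: returns true if the buffer is valid UTF-8 (the strict decoder throws on invalid sequences).
// assumes: using System.Text;
static bool LooksLikeUtf8(byte[] bytes)
{
    try
    {
        new UTF8Encoding(false, true).GetString(bytes); // throwOnInvalidBytes: true
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}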
The trickiest reasonably common case is when you have conflicting claims about it being in two different encodings that are both single-octet-per-character encodings, and an octet in the range 0x80-0xFF is present. This is the case you can't be sure about. If one encoding is a subset of the other (especially when C1 controls are excluded) then you could go for the superset, but that requires storing knowledge about those encodings and a considerable amount of work. Most of the time I'd be inclined to just throw an exception, and when it's found in the logs see if I can get the source to fix their bug, or special-case that source; but that doesn't work if you are dealing with a very large number of disparate sources that you may not have a relationship with. Alas, there is no perfect answer here.
It's worth noting also that sometimes both the header and the embedded metadata will agree with each other incorrectly. A common case is content in CP-1252 that is claimed to be in ISO-8859-1.
According to the W3C FAQ:
If you have a UTF-8 byte-order mark (BOM) at the start of your file then recent browser versions other than Internet Explorer 10 or 11 will use that to determine that the encoding of your page is UTF-8. It has a higher precedence than any other declaration, including the HTTP header.
When it comes to the HTTP header vs. the meta tag, the BOM takes precedence; the meta tag can take precedence over the header as long as it appears within the first 1024 bytes, though there is no strict rule on that.
Conclusion - in order of importance:
1. Byte Order Mark (BOM): If present, this is AUTHORITATIVE, since it was added by the editor that actually saved the file (this can only be present in Unicode encodings).
2. Content-Type charset (in the header set by the server): For dynamically created/processed files it should be present (since the server knows), but it might not be for static files (the server just sends those).
3. Inline charset: For XML, HTML and CSS the encoding can be specified inside the document, in either the XML prolog, the HTML meta tag or @charset in CSS. To read that, you need to decode the first part of the document using, for instance, the 'Windows-1252' encoding.
4. Assume UTF-8. This is the standard of the web and is by far the most used today.
If the found encoding equals 'ISO-8859-1', use 'Windows-1252' instead (required in HTML5 - read more at Wikipedia).
Now try to decode the document using the found encoding. If error handling is turned on, that might fail! In that case:
5. Use 'Windows-1252'. This was the standard in old Windows files and works fine as a last try (there are still a lot of old files out there). This will never throw errors. However, it might of course be wrong.
I have made a method that implements this. The regex I use is able to find encodings specified as:
Xml: <?xml version="1.0" encoding="utf-8"?> OR <?xml encoding="utf-8"?>
html: <meta charset="utf-8" /> OR <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
css: @charset "utf-8";
(It works with both single and double quotes.)
You will need:
using System;
using System.IO;
using System.Net.Http;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
Here's the method that returns the decoded string (parameters are the HttpClient and the Uri):
public static async Task<string> GetString(HttpClient httpClient, Uri url)
{
    byte[] bytes;
    Encoding encoding = null;
    Regex charsetRegex = new Regex(@"(?<=(<meta.*?charset=|^\<\?xml.*?encoding=|^@charset[ ]?)[""']?)[\w-]+?(?=[""';\r\n])",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.ExplicitCapture);

    using (HttpResponseMessage responseMessage = await httpClient.GetAsync(url).ConfigureAwait(false))
    {
        responseMessage.EnsureSuccessStatusCode();
        bytes = await responseMessage.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        string headerCharset = responseMessage?.Content?.Headers?.ContentType?.CharSet;

        byte[] buffer = new byte[0x1000];
        Array.Copy(bytes, buffer, Math.Min(bytes.Length, buffer.Length));
        using (MemoryStream ms = new MemoryStream(buffer))
        {
            using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), true, buffer.Length, true))
            {
                string testString = await sr.ReadToEndAsync().ConfigureAwait(false);
                if (!sr.CurrentEncoding.Equals(Encoding.GetEncoding("Windows-1252")))
                {
                    encoding = sr.CurrentEncoding;
                }
                else if (headerCharset != null)
                {
                    encoding = Encoding.GetEncoding(headerCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
                else
                {
                    string inlineCharset = charsetRegex.Match(testString).Value;
                    if (!string.IsNullOrEmpty(inlineCharset))
                    {
                        encoding = Encoding.GetEncoding(inlineCharset, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                    }
                    else
                    {
                        encoding = new UTF8Encoding(false, true);
                    }
                }
                if (encoding.Equals(Encoding.GetEncoding("iso-8859-1")))
                {
                    encoding = Encoding.GetEncoding("Windows-1252", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
                }
            }
        }

        using (MemoryStream ms = new MemoryStream(bytes))
        {
            try
            {
                using (StreamReader sr = new StreamReader(ms, encoding, false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
            catch (DecoderFallbackException)
            {
                ms.Position = 0;
                using (StreamReader sr = new StreamReader(ms, Encoding.GetEncoding("Windows-1252"), false, 0x8000, true))
                {
                    return await sr.ReadToEndAsync().ConfigureAwait(false);
                }
            }
        }
    }
}
You should wrap the method call in a try/catch, since HttpClient can throw errors if the request fails.
Update:
In .NET Core, you don't have the 'Windows-1252' encoding (a big mistake IMHO), so there you must settle for 'ISO-8859-1'.
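If you do need Windows-1252 on .NET Core, one option (not part of the original answer) is to reference the System.Text.Encoding.CodePages package and register its provider once at startup; a minimal sketch:

// Sketch: makes Windows-1252 and the other code-page encodings available on .NET Core.
// assumes: using System.Text; and a reference to the System.Text.Encoding.CodePages NuGet package.
// Call once at startup:
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding win1252 = Encoding.GetEncoding("Windows-1252"); // now resolves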
There are two programs involved. The first one has a string like "##########". The second one is a config tool to find "##########" and replace this string with user input from a textbox.
Now I have trouble in the replacing part. Here is the code.
//This is code from the first program:
string myIP = "####################";
string myPort = "%%%%%%%%";
int port = Int32.Parse(myPort);
tcpClient.Connect(myIP, port);
//This is code from the second program:
//Get bytes from textbox:
byte[] byte_IP = new byte[60];
byte_IP = System.Text.Encoding.ASCII.GetBytes(textBox1_ip.Text);
//Get all bytes in the first program:
byte[] buffer = File.ReadAllBytes(@"before.exe");
//Replace string with textbox input, 0x1c00 is where the "#" starts:
Buffer.BlockCopy(byte_IP, 0, buffer, 0x1c00, byte_IP.Length);
//Build a new exe:
File.WriteAllBytes(@"after.exe", buffer);
However, I get "127.0.0.1#.#.#.#.#.#." in the new exe. But I need "1.2.7...0...0...1........." to process as a valid host.
First I'd like to reiterate what has already been said in the comments: there are simpler ways to handle this stuff. That's what config files are for, or registry settings.
But if you absolutely must...
First, you have to match the encoding that the framework expects. Is the string stored as UTF-8? UTF-16? ASCII? Writing data in the wrong encoding will turn it into pure garbage, almost every time. Generally, for strings in code like the one you're looking for, you'll want to use Encoding.Unicode.
Next, you need some way to deal with strings of different lengths. The buffer you define needs to be large enough to contain the widest string you want to be able to set - 15 bytes for dotted numeric IPv4 addresses - but you have to allow for the minimum of 7 characters. Padding the remainder and removing that padding before using the value will probably suffice.
The minimum program I could think to use for testing this was:
class Program
{
    static void Main(string[] args)
    {
        var addr = "###.###.###.###".TrimEnd();
        Console.WriteLine("Address: [{0}]", addr);
    }
}
Now in your patcher you will need to locate the starting position in the file and overwrite the bytes with the new string's bytes. Here's a Patch method, which calls a FindString method that you will have to write yourself:
static void PatchFile(string filename, string searchString, string replaceString)
{
    // Open the file
    using (var file = File.Open(filename, FileMode.Open, FileAccess.ReadWrite, FileShare.ReadWrite))
    {
        // Locate the search string in the file (needs to be implemented)
        long pos = FindString(file, searchString);
        if (pos < 0)
            return;

        // Pad and limit replacement string, then convert to bytes
        string rep = string.Format("{0,-" + searchString.Length + "}", replaceString).Substring(0, searchString.Length);
        byte[] replaceBytes = Encoding.Unicode.GetBytes(rep);

        // Overwrite the located bytes with the replacement
        file.Position = pos;
        file.Write(replaceBytes, 0, replaceBytes.Length);
    }
}
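The FindString part is left to you in the answer above; here is one possible sketch, assuming the placeholder is stored as UTF-16 (the same assumption Encoding.Unicode makes above) and that the file is small enough to scan in memory:

// Sketch: returns the byte offset of the first occurrence of searchString (encoded as UTF-16),
// or -1 if it is not found.
// assumes: using System.IO; using System.Text;
static long FindString(FileStream file, string searchString)
{
    byte[] pattern = Encoding.Unicode.GetBytes(searchString);
    byte[] data = new byte[file.Length];
    file.Position = 0;
    int total = 0;
    while (total < data.Length)
    {
        int read = file.Read(data, total, data.Length - total);
        if (read == 0) break;
        total += read;
    }
    for (int i = 0; i + pattern.Length <= total; i++)
    {
        bool match = true;
        for (int j = 0; j < pattern.Length; j++)
        {
            if (data[i + j] != pattern[j]) { match = false; break; }
        }
        if (match) return i;
    }
    return -1;
}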
Hopefully it makes sense.
I have an input like: DisplaygröÃe
And I want output like: Displaygröÿe
With Notepad++ the problem was solved by converting to ANSI, encoding to UTF-8 and converting back to ANSI.
I need to do this programmatically in C#.
I've tried converting to/from ANSI, UTF-8 and Latin-1, and none work properly; it shows ? characters. I use a function that calls Encoding.Default.GetBytes, then
res = Encoding.Convert(src1, dest1, bytes) and
EncodingDest.GetChars(res);
where EncodingDest represents the output encoding.
The code is running in a console application, but the results are the same in WPF.
It doesn't matter which encoding is used for the output, as long as it works; these problems also occur for languages such as Spanish, Italian or Swedish.
use System.Text.Encoding
var ascii = Encoding.ASCII.GetBytes("DisplaygröÃe");
var utf8 = Encoding.Convert(Encoding.ASCII, Encoding.UTF8, ascii);
var output = Encoding.UTF8.GetString(utf8);
When you output a string somewhere (like a TextWriter, or a Stream, or a byte[]), you should always specify the encoding, unless you want the UTF-8 output (the default one):
using (StreamWriter sw = new StreamWriter("file.txt", false, Encoding.GetEncoding("windows-1252")))
    sw.WriteLine("Displaygröÿe");
@DanM: You need to know what character set your input is in.
"DisplaygröÃe" is what you will see if you take the string "Displaygröße" (suggested by Vlad), encode it to bytes as UTF-8, and then incorrectly decode it as Latin-1.
If you do the same with Displaygröÿe, you would see "Displaygröÿe" (the inverted question mark is literally there, it is not a placeholder for something that can't be displayed.) Technically, "DisplaygröÃe" probably has another character between the à and e, but it is a control code, and is thus invisible to you.
If you have a character set foo, this is true: my_string = foo_decode(foo_encode(my_string)). If you have another character set bar, this is true: barf = bar_decode(foo_encode(my_string)), where barf is garbage like you're seeing.
If you don't know what character set your input is in, you will only decode it correctly by chance.
It appears that your input files are in UTF-8, and you will need to decode the bytes from the file as such. (I don't speak enough C# to help you here... I only speak character encodings.)
using (var rdr = new StreamReader(fs, Encoding.GetEncoding(1252))) {
    result = rdr.ReadToEnd();
}
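If the underlying problem is the usual one described above (UTF-8 bytes that were decoded as Windows-1252/Latin-1 somewhere upstream), re-encoding the garbled string with the wrong code page and decoding the bytes again as UTF-8 usually repairs it. A small sketch with made-up variable names:

// Sketch: reverse a "UTF-8 read as Windows-1252" mojibake.
// assumes: using System.Text;
string bad = "DisplaygröÃe";                            // mojibake input
byte[] raw = Encoding.GetEncoding(1252).GetBytes(bad);  // recover the original UTF-8 bytes
string repaired = Encoding.UTF8.GetString(raw);         // decode them correctly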
We had a similar problem when sending data to a text printer, and the only thing I got working is this (written as an extension method):
public static byte[] ToAnsiMemBytes(this string input)
{
    int length = input.Length;
    byte[] result = new byte[length];
    try
    {
        IntPtr bytes = Marshal.StringToCoTaskMemAnsi(input);
        Marshal.Copy(bytes, result, 0, length);
        Marshal.FreeCoTaskMem(bytes); // release the unmanaged copy
    }
    catch (Exception)
    {
        result = null;
    }
    return result;
}
Is there a way to know how many bytes of a stream have been used by StreamReader?
I have a project where we need to read a file that has a text header followed by the start of the binary data. My initial attempt to read this file was something like this:
private long _dataOffset;

void ReadHeader(string path)
{
    using (FileStream stream = File.OpenRead(path))
    {
        StreamReader textReader = new StreamReader(stream);
        string line;
        do
        {
            line = textReader.ReadLine();
            handleHeaderLine(line);
        } while (line != "DATA"); // Yes, they used "DATA" to mark the end of the header
        _dataOffset = stream.Position;
    }
}

private byte[] ReadDataFrame(string path, int frameNum)
{
    using (FileStream stream = File.OpenRead(path))
    {
        stream.Seek(_dataOffset + frameNum * cbFrame, SeekOrigin.Begin);
        byte[] data = new byte[cbFrame];
        stream.Read(data, 0, cbFrame);
        return data;
    }
}
The problem is that when I set _dataOffset to stream.Position, I get the position that the StreamReader has read to, not the end of the header. As soon as I thought about it this made sense, but I still need to be able to know where the end of the header is and I'm not sure if there's a way to do it and still take advantage of StreamReader.
You can find out how many bytes the StreamReader has actually returned (as opposed to read from the stream) in a number of ways, none of them too straightforward I'm afraid.
1. Get the result of textReader.CurrentEncoding.GetByteCount(totalLengthOfAllTextRead) and then seek to this position in the stream.
2. Use some reflection hackery to retrieve the value of the private variable of the StreamReader object that corresponds to the current byte position within the internal buffer (different from that of the stream - usually behind, but no more than equal to, of course). Judging by .NET Reflector, this variable seems to be named bytePos.
3. Don't bother using a StreamReader at all, but instead implement your own custom ReadLine function built on top of the Stream or even BinaryReader (BinaryReader is guaranteed never to read further ahead than what you request). This custom function must read from the stream char by char, so you'd actually have to use the low-level Decoder object (unless the encoding is ASCII/ANSI, in which case things are a bit simpler due to the single-byte encoding).
Option 1 is going to be the least efficient, I would imagine (since you're effectively re-encoding text you just decoded), and option 3 the hardest to implement, though perhaps the most elegant. I'd probably recommend against using the ugly reflection hack (option 2), even though it looks tempting, being the most direct solution and only taking a couple of lines. (To be quite honest, the StreamReader class really ought to expose this variable via a public property, but alas it does not.) So in the end, it's up to you, but either method 1 or 3 should do the job nicely enough...
Hope that helps.
So the data is utf8 (the default encoding for StreamReader). This is a multibyte encoding, so IndexOf would be inadvisable. You could:
Encoding.UTF8.GetByteCount(string)
on your data so far, adding 1 or 2 bytes for the missing line ending.
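A sketch of that byte-counting approach (it assumes CRLF line endings and no BOM, so the +2 per line matches the question's framing; path and handleHeaderLine come from the question):

// Sketch: track the byte offset of the end of the header while reading lines with a StreamReader.
// assumes: using System.IO; using System.Text;
long byteOffset = 0;
using (FileStream stream = File.OpenRead(path))
using (StreamReader textReader = new StreamReader(stream, Encoding.UTF8))
{
    string line;
    while ((line = textReader.ReadLine()) != null)
    {
        byteOffset += Encoding.UTF8.GetByteCount(line) + 2; // +2 for the CRLF the reader discarded
        if (line == "DATA")
            break;               // byteOffset now points just past the "DATA" line
        handleHeaderLine(line);
    }
}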
If you need to count bytes, I'd go with the BinaryReader. You can take the results and cast them about as needed, but I find its idea of its current position to be more reliable (since it reads in binary, it's immune to character-set problems).
So your last line contains 'DATA' plus an unknown number of data bytes. You could extract the position by using IndexOf() on your last read line, then readjust the stream.Position.
But I am not sure if you should use ReadLine() at all in this case. Maybe it would be better to read byte by byte until you reach the 'DATA' mark.
The line breaks are easily identifiable without needing to decode the stream first (except for some encodings rarely used for text files like EBCDIC, UTF-16, UTF-32), so you can just read each line as bytes and then decode the entire line:
using (FileStream stream = File.OpenRead(path)) {
    List<byte> buffer = new List<byte>();
    bool hasCr = false;
    bool done = false;
    while (!done) {
        int b = stream.ReadByte();
        if (b == -1) throw new IOException("End of file reached in header.");
        if (b == 13) {
            hasCr = true;
        } else if (b == 10 && hasCr) {
            string line = Encoding.UTF8.GetString(buffer.ToArray(), 0, buffer.Count);
            if (line == "DATA") {
                done = true;
            } else {
                HandleHeaderLine(line);
            }
            buffer.Clear();
            hasCr = false;
        } else {
            if (hasCr) buffer.Add(13);
            hasCr = false;
            buffer.Add((byte)b);
        }
    }
    _dataOffset = stream.Position;
}
Instead of closing the stream and opening it again, you could of course just keep on reading the data.