Using character encoding with StreamReader - C#

My program connects to an FTP server and lists all the files in a specific folder, c:\ClientFiles... The issue I'm having is that some file names contain unusual characters such as the en dash "–", e.g. Billing–File.csv, but my code replaces these characters with a regular hyphen "-". When I then try to download the files, they are not found.
I've tried all the encoding types in the Encoding class, but none of them accommodates these characters.
Please see my code for listing the files:
UriBuilder ub;
if (rootnode.Path != String.Empty) ub = new UriBuilder("ftp", rootnode.Server, rootnode.Port, rootnode.Path);
else ub = new UriBuilder("ftp", rootnode.Server, rootnode.Port);
String uristring = ub.Uri.OriginalString;
req = (FtpWebRequest)FtpWebRequest.Create(ub.Uri);
req.Credentials = ftpcred;
req.UsePassive = pasv;
req.Method = WebRequestMethods.Ftp.ListDirectoryDetails;
try
{
rsp = (FtpWebResponse)req.GetResponse();
StreamReader rsprdr = new StreamReader(rsp.GetResponseStream(), Encoding.UTF8); //this is where the problem is.
Your help or advice will be highly appreciated.

Not every encoding has a dedicated class in the System.Text namespace. You can get a list of all encodings known to your system by using:
Encoding.GetEncodings()
(MSDN info for GetEncodings.)
If you know what the name of the file should be, you can iterate through the list and see which encodings result in the correct filename.
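That search can be sketched as follows. The byte values and the name "Dépôt.csv" are hypothetical stand-ins (the asker's actual server bytes aren't known); any filename you already know to be correct works the same way:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class EncodingProbe
{
    // Decode the same raw bytes with every encoding the system knows and
    // collect the names of those that reproduce the expected filename.
    public static List<string> FindMatches(byte[] raw, string expected)
    {
        var matches = new List<string>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            if (info.GetEncoding().GetString(raw) == expected)
                matches.Add(info.Name);
        }
        return matches;
    }

    static void Main()
    {
        // Hypothetical directory-listing bytes: "Dépôt.csv" as a Latin-1
        // server would send it (é = 0xE9, ô = 0xF4).
        byte[] raw = { 0x44, 0xE9, 0x70, 0xF4, 0x74, 0x2E, 0x63, 0x73, 0x76 };
        foreach (string name in FindMatches(raw, "Dépôt.csv"))
            Console.WriteLine(name);
    }
}
```

Any encoding whose name is printed is a candidate to pass to the StreamReader constructor in place of Encoding.UTF8.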

Try:
StreamReader rsprdr = new StreamReader(rsp.GetResponseStream(), Encoding.GetEncoding(1251));
You may also try Encoding.GetEncoding("iso-8859-1") instead of code page 1251.

Related

WebClient.DownloadString(url) doesn't work for URLs that contain Unicode characters, such as Persian

I am trying to get the html content from a url that has Persian characters in it such as:
http://example.com/%D8%B7%D8%B1%D8%A7%D8%AD%DB%8C-%D9%88%D8%A8-%D8%B3%D8%A7%DB%8C%D8%AA-%D8%A2%D8%AA%D9%84%DB%8C%D9%87/website/Atelier
I am using this code:
using (WebClient client = new WebClient())
{
client.Encoding = Encoding.UTF8;
string data = client.DownloadString(urlTextWithPersianCharacters);
}
When the URL is something like this, I get unreadable characters and symbols. This code works fine with other websites that have English URLs and Persian content.
Edit: both answers work fine now that I am testing other websites. The problem is with one specific website whose content I am trying to get. Can a website block these kinds of requests? Or use other encodings, maybe?
What do you suggest I do?
Try converting your URL string to a Uri:
Uri uri = new Uri("http://example.com/%D8%B7%D8%B1%D8%A7%D8%AD%DB%8C-%D9%88%D8%A8-%D8%B3%D8%A7%DB%8C%D8%AA-%D8%A2%D8%AA%D9%84%DB%8C%D9%87/website/Atelier");
using (WebClient client = new WebClient())
{
client.Encoding = Encoding.UTF8;
string data = client.DownloadString(uri);
}
The default System.Text.UTF8Encoding class only performs direct binary decoding of the UTF-8 format. In your example, you are attempting to decode a URL that uses "URL encoding".
URL encoding represents special characters in a URL as hex digits with % signs as markers.
To solve this issue, you will need to decode the URL into a UTF-8 string.
The System.Net.Uri.UnescapeDataString() method should be able to do this for you.
string url = "http://example.com/%D8%B7%D8%B1%D8%A7%D8%AD%DB%8C-%D9%88%D8%A8-%D8%B3%D8%A7%DB%8C%D8%AA-%D8%A2%D8%AA%D9%84%DB%8C%D9%87/website/Atelier";
string result = Uri.UnescapeDataString(url);
In this example, result contains: http://example.com/طراحی-وب-سایت-آتلیه/website/Atelier
Edit: I did some research and saw that there are variances in how WebClient and WebRequest handle character encoding. Link to relevant article.
Try switching from WebClient to WebRequest and see if that resolves your encoding problem.
There are many methods and solutions; try whichever fits your need:
string testString = "http://test# space 123/text?var=val&another=two";
Console.WriteLine("UrlEncode: " + System.Web.HttpUtility.UrlEncode(testString));
Console.WriteLine("EscapeUriString: " + Uri.EscapeUriString(testString));
Console.WriteLine("EscapeDataString: " + Uri.EscapeDataString(testString));
Console.WriteLine("EscapeDataReplace: " + Uri.EscapeDataString(testString).Replace("%20", "+"));
Console.WriteLine("HtmlEncode: " + System.Web.HttpUtility.HtmlEncode(testString));
Console.WriteLine("UrlPathEncode: " + System.Web.HttpUtility.UrlPathEncode(testString));
//.Net 4.0+
Console.WriteLine("WebUtility.HtmlEncode: " + WebUtility.HtmlEncode(testString));
Console.WriteLine("WebUtility.UrlEncode: " + WebUtility.UrlEncode(testString));

How do I get ZipArchive.CreateEntry name encoding right?

I use the following code to create a zip archive with C#.
using (var zipArchive = new ZipArchive(compressedFileStream, ZipArchiveMode.Create, false))
{
var zipEntry = zipArchive.CreateEntry(name + ".pdf");
...
}
The name often consists of Swedish characters such as ÅÄÖ åäö. When I open the zip and look at the names, the Swedish chars are garbled, like "Fl+Âdesm+ñtare.pdf".
I tried fixing the name encoding with this code, but it didn't work:
var iso = Encoding.GetEncoding("ISO-8859-1");
var utf8 = Encoding.UTF8;
var utfBytes = utf8.GetBytes(name);
var isoBytes = Encoding.Convert(utf8, iso, utfBytes);
var isoName = iso.GetString(isoBytes);
Any ideas?
Since DotNetZip is a dead project and this article is relevant to Google searches, here is an alternative solution (in VB.NET) using the System.IO.Compression library:
Archive = New IO.Compression.ZipArchive(Stream, ZipArchiveMode, LeaveOpen, Text.Encoding.GetEncoding(Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage))
This might not cover all conversions. From what I gathered from sources on the subject, the underlying code uses the local machine's (the server's) regional culture encoding page for entry names. Mapping them with that encoding explicitly fixed the issue for my client domain; no guarantees that it's a silver bullet, however.
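For C# readers, here is a rough equivalent of the VB line above, round-tripped through a MemoryStream. I pass Encoding.UTF8 so the sketch runs without registering a code-pages provider; substitute Encoding.GetEncoding(CultureInfo.CurrentCulture.TextInfo.OEMCodePage) to mirror the answer exactly:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class ZipNaming
{
    // Write a zip whose entry names use an explicit encoding.
    public static byte[] CreateZip(string entryName, Encoding nameEncoding)
    {
        using var ms = new MemoryStream();
        using (var zip = new ZipArchive(ms, ZipArchiveMode.Create,
                                        leaveOpen: true, entryNameEncoding: nameEncoding))
        {
            ZipArchiveEntry entry = zip.CreateEntry(entryName);
            using var w = new StreamWriter(entry.Open());
            w.Write("placeholder content");
        } // disposing the archive flushes the central directory
        return ms.ToArray();
    }

    // Read the first entry name back, decoding legacy names with the same encoding.
    public static string ReadFirstEntryName(byte[] zipBytes, Encoding nameEncoding)
    {
        using var ms = new MemoryStream(zipBytes);
        using var zip = new ZipArchive(ms, ZipArchiveMode.Read,
                                       leaveOpen: false, entryNameEncoding: nameEncoding);
        return zip.Entries[0].FullName;
    }

    static void Main()
    {
        byte[] data = CreateZip("Flödesmätare.pdf", Encoding.UTF8);
        Console.WriteLine(ReadFirstEntryName(data, Encoding.UTF8));
    }
}
```

Note that the entryNameEncoding parameter only affects how names are written (and how flag-less legacy names are read); whether a given archive tool displays them correctly still depends on that tool's own decoding.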
You can try the DotNetZip library (get it via NuGet). Here is a code sample where I use the cp866 encoding:
private string GenerateZipFile(string filename, BetPool betPool)
{
using (var zip = new ZipFile(Encoding.GetEncoding("cp866")))
{
//zip.Password = AppConfigHelper.Key + DateTime.Now.Date.ToString("ddMMyy");
zip.AlternateEncoding = Encoding.GetEncoding("cp866");
zip.AlternateEncodingUsage = ZipOption.AsNecessary;
zip.AddFile(filename, "");
var zipFilename = FormZipFileName(betPool);
zip.Save(zipFilename);
return zipFilename;
}
}
And to read such an archive back with the built-in ZipArchive, pass the same encoding:
using (var zip = new ZipArchive(File.OpenRead(ZipFilePath), ZipArchiveMode.Read, false, Encoding.GetEncoding("cp866")))

C# Encoding: Getting special characters from their codes

I am using a C# WinForms app to scrape some data from a webpage that uses charset ISO-8859-1. It works well for many special characters, but not all.
(* Below I use colons instead of semi-colons so that you will see the code that I see, and not the value of it)
I looked at the Page Source and I noticed that for the ones that won't display correctly, the actual code (e.g. &#363:) is in the Page Source, instead of the value. For example, in the Page Source I see Ry&#363: Murakami, but I expect to see Ryū Murakami. Also, there are many other codes that appear as codes, such as &#350: &#333: &#353: &#269: &#259: &#537: and many more.
I have tried using WebClient.DownloadString and WebClient.DownloadData.
Try #1 Code:
using (WebClient wc = new WebClient())
{
wc.Encoding = Encoding.GetEncoding("ISO-8859-1");
string WebPageText = wc.DownloadString("http://www.[removed].htm");
// Scrape WebPageText here
}
Try #2 Code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
using (WebClient wc = new WebClient())
{
wc.Encoding = iso;
byte[] AllData = wc.DownloadData("http://www.[removed].htm");
byte[] utfBytes = Encoding.Convert(iso, utf8, AllData);
string WebPageText = utf8.GetString(utfBytes);
// Scrape WebPageText here
}
I want to keep the special characters, so please don't suggest any RemoveDiacritics examples. Am I missing something?
Consider HTML-decoding your input.
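Concretely: the page ships numeric character references such as &#363; as literal text, so no choice of Encoding on the download side will fix them; they have to be HTML-decoded after the download, for example with WebUtility.HtmlDecode:

```csharp
using System;
using System.Net;

class EntityDecode
{
    // HTML-decode scraped text so numeric character references
    // become the characters they stand for.
    public static string Decode(string scraped) => WebUtility.HtmlDecode(scraped);

    static void Main()
    {
        // "&#363;" is the numeric reference for U+016B (ū)
        Console.WriteLine(Decode("Ry&#363; Murakami")); // Ryū Murakami
    }
}
```

System.Web.HttpUtility.HtmlDecode does the same job in Framework projects that already reference System.Web.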

How to disable base64-encoded filenames in HttpClient/MultipartFormDataContent

I'm using HttpClient to POST MultipartFormDataContent to a Java web application. I'm uploading several StringContents and one file which I add as a StreamContent using MultipartFormDataContent.Add(HttpContent content, String name, String fileName) using the method HttpClient.PostAsync(String, HttpContent).
This works fine, except when I provide a fileName that contains German umlauts (I haven't tested other non-ASCII characters yet). In this case, fileName is base64-encoded. The result for a file named 99 2 LD 353 Temp Äüöß-1.txt
looks like this:
__utf-8_B_VGVtcCDvv73vv73vv73vv71cOTkgMiBMRCAzNTMgVGVtcCDvv73vv73vv73vv70tMS50eHQ___
The Java server shows this encoded file name in its UI, which confuses the users. I cannot make any server-side changes.
How do I disable this behavior? Any help would be highly appreciated.
Thanks in advance!
I just found the same limitation as StrezzOr, as the server I was consuming didn't respect the filename* standard.
I converted the filename to a byte array of its UTF-8 representation, and then re-emitted the bytes as chars of a "simple" (non-UTF-8) string.
This code creates a content stream and adds it to a multipart content:
FileStream fs = File.OpenRead(_fullPath);
StreamContent streamContent = new StreamContent(fs);
streamContent.Headers.Add("Content-Type", "application/octet-stream");
String headerValue = "form-data; name=\"Filedata\"; filename=\"" + _Filename + "\"";
byte[] bytes = Encoding.UTF8.GetBytes(headerValue);
headerValue="";
foreach (byte b in bytes)
{
headerValue += (Char)b;
}
streamContent.Headers.Add("Content-Disposition", headerValue);
multipart.Add(streamContent, "Filedata", _Filename);
This works with Spanish accents. Hope this helps.
I recently found this issue and I use a workaround here:
At server side:
private static readonly Regex _regexEncodedFileName = new Regex(@"^=\?utf-8\?B\?([a-zA-Z0-9/+]+={0,2})\?=$");
private static string TryToGetOriginalFileName(string fileNameInput) {
Match match = _regexEncodedFileName.Match(fileNameInput);
if (match.Success && match.Groups.Count > 1) {
string base64 = match.Groups[1].Value;
try {
byte[] data = Convert.FromBase64String(base64);
return Encoding.UTF8.GetString(data);
}
catch (Exception) {
//ignored
return fileNameInput;
}
}
return fileNameInput;
}
And then use this function like this:
string correctedFileName = TryToGetOriginalFileName(fileRequest.FileName);
It works.
In order to pass non-ascii characters in the Content-Disposition header filename attribute it is necessary to use the filename* attribute instead of the regular filename. See spec here.
To do this with HttpClient you can do the following,
var streamcontent = new StreamContent(stream);
streamcontent.Headers.ContentDisposition = new ContentDispositionHeaderValue("attachment") {
FileNameStar = "99 2 LD 353 Temp Äüöß-1.txt"
};
multipartContent.Add(streamcontent);
The header will then end up looking like this,
Content-Disposition: attachment; filename*=utf-8''99%202%20LD%20353%20Temp%20%C3%84%C3%BC%C3%B6%C3%9F-1.txt
I finally gave up and solved the task using HttpWebRequest instead of HttpClient. I had to build headers and content manually, but this allowed me to ignore the standards for sending non-ASCII filenames. I ended up cramming unencoded UTF-8 filenames into the filename header, which was the only way the server would accept my request.

FTP: create a filename with UTF-8 chars such as Greek, German, etc.

I am trying to create a file on an FTP server with the following code (I also tried the UseBinary option both true and false):
string username = "name";
string password = "password";
string remotefolder = "ftp://ftp.myhost.gr/public_html/test/";
string remoteFileName = "δοκιμαστικό αρχείοüß-äCopy.txt";
string localFile = @"C:\test\δοκιμαστικό αρχείο - Copy.txt";
String ftpname = "ftp://ftp.myhost.gr/public_html/test" + @"/" + Uri.EscapeUriString(Program.remoteFileName);
FtpWebRequest request = (FtpWebRequest)WebRequest.Create(ftpname);
request.Proxy = null;
request.Credentials = new NetworkCredential(username, password);
request.UsePassive = true;
request.KeepAlive = true;
request.Method = WebRequestMethods.Ftp.UploadFile;
request.UseBinary = true;
//request.UseBinary = false;
byte[] content = System.IO.File.ReadAllBytes(localFile);
byte[] fileContents = new Byte[content.Length];
Array.Copy(content, 0, fileContents, 0, content.Length);
using (Stream uploadStream = request.GetRequestStream())
{
int contentLength = fileContents.Length;
uploadStream.Write(fileContents, 0, contentLength);
}
FtpWebResponse response = (FtpWebResponse)request.GetResponse();
Console.WriteLine(response.ExitMessage);
The problem is that the file on my FTP server does not get the name I requested, which contains English, Greek, and German characters --> "δοκιμαστικό αρχείοüß-äCopy.txt".
1) What can I do about that?
There is some improvement once I change my regional settings (Current language for non-Unicode programs) to Greek, but I still lose the German chars.
2) Why does a C# program depend on this setting? Is there a special methodology I should follow in order to avoid depending on it?
Encoding nightmares arise again :(
It is not enough just to encode your string as UTF-8 and send it as a filename to the FTP server. In the past, all FTP servers understood ASCII only, and nowadays, to maintain backward compatibility, even Unicode-aware servers treat all filenames as ASCII when they start.
To make it all work, your program must first check what the server is capable of. Servers send their features after the client connects; in your case, you must look for UTF8 in the FEAT response. If your server sends that, it understands UTF-8. Nevertheless, even then you must tell it explicitly that from now on you will send filenames UTF-8 encoded, and that is the step your program lacks (since your server supports UTF-8, as you've stated).
Your client must send the FTP server OPTS UTF8 ON. After sending that, you may speak UTF-8-ish (so to speak) to your server.
Read RFC 2640, Internationalization of the File Transfer Protocol, for details.
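FtpWebRequest exposes no way to issue FEAT or OPTS UTF8 ON yourself, so that exchange needs a raw control connection (e.g. TcpClient) or a third-party FTP library. The decision itself is just parsing the multi-line FEAT reply; a minimal sketch, using a hypothetical reply text:

```csharp
using System;

class FtpFeatures
{
    // Given the multi-line reply a server sends to FEAT, decide whether it
    // advertises UTF8 support (RFC 2640). Feature lines are indented by one
    // space, e.g. " UTF8", between "211-Features:" and "211 End".
    public static bool SupportsUtf8(string featReply)
    {
        foreach (string line in featReply.Split('\n'))
        {
            if (line.Trim().Equals("UTF8", StringComparison.OrdinalIgnoreCase))
                return true;
        }
        return false;
    }

    static void Main()
    {
        // Hypothetical FEAT reply; real servers list their own feature set.
        string reply = "211-Features:\n MDTM\n SIZE\n UTF8\n211 End";
        Console.WriteLine(SupportsUtf8(reply)); // True
    }
}
```

Only when this check succeeds is it worth sending OPTS UTF8 ON and UTF-8-encoded filenames.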
In your code change:
string localFile = @"C:\test\δοκιμαστικό αρχείο - Copy.txt";
String ftpname = "ftp://ftp.myhost.gr/public_html/test" + @"/" + Uri.EscapeUriString(Program.remoteFileName);
FtpWebRequest request = (FtpWebRequest)WebRequest.Create(ftpname);
To:
string remoteFileName = "δοκιμαστικό αρχείο - Copy.txt";
String ftpname = "ftp://ftp.myhost.gr/public_html/test" + @"/" + remoteFileName;
var escapedUriString = Uri.EscapeUriString(Encoding.UTF8.GetString(Encoding.ASCII.GetBytes(ftpname)));
var request = (FtpWebRequest)WebRequest.Create(escapedUriString);
This needs to be done because EscapeUriString escapes its input according to the RFC 2396 specification.
The RFC 2396 standard states:
When a new URI scheme defines a component that represents textual data
consisting of characters from the Universal Character Set [UCS], the
data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded.
Hence the code change shown above will force this string to be inputted in the UTF-8 format.
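The UTF-8-then-percent-encode rule quoted above is easy to observe with Uri.EscapeDataString; EscapeUriString applies the same rule to non-ASCII characters but leaves reserved URI characters such as '/' alone:

```csharp
using System;

class PercentEncode
{
    static void Main()
    {
        // 'ä' (U+00E4) is the UTF-8 octet pair 0xC3 0xA4, and 'δ' (U+03B4)
        // is 0xCE 0xB4: each octet becomes one %XX escape.
        Console.WriteLine(Uri.EscapeDataString("ä")); // %C3%A4
        Console.WriteLine(Uri.EscapeDataString("δ")); // %CE%B4
    }
}
```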
With regards to:
Why does a c# program depend on this setting? Is there a special
methodology i should follow in order to avoid dependency from this
setting?
Uri.EscapeUriString needs input that follows the RFC 2396 specification, hence the need to pass it data in a format it will understand.
