Unable to properly download Cyrillic-encoded HTML page in C#

I am trying to download an HTML web page to my computer. The download itself works fine, but the page is a Bulgarian article and its text does not display properly afterwards.
I have tried many encodings (Code Page Identifiers such as WINDOWS-1251, UTF-8, etc.) from MSDN https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx but I cannot get the file to open as intended.
For example:
Стара планина - величествената кръстница на Балканския полуостров
Displays as:
??N�?�N�?� ???�?�???????� - ???�?�??N�?�N?N�???�???�N�?� ??N�NSN?N�????N�?� ???� ?�?�?�???�??N?????N? ?????�N???N?N�N�????
Below I am posting my simple code. Your help will be much appreciated! :)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
namespace pageDownloader
{
    class Program
    {
        public static void DownloadPage()
        {
            WebClient client = new WebClient();
            string webpage = client.DownloadString("http://www.nasamnatam.com/statia/Stara_planina_velichestvenata_krystnica_na_Balkanskiia_poluostrov-2525.html");
            System.IO.File.WriteAllText(@"C:\test\downloadedpage.html", webpage, Encoding.GetEncoding("windows-1251"));
        }
        static void Main()
        {
            DownloadPage();
        }
    }
}

// Requires: using System; using System.Net; using System.Text;
Console.OutputEncoding = Encoding.UTF8;
// Needed on .NET Core / .NET 5+ so that windows-1251 can be resolved.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
Encoding encoding1251 = Encoding.GetEncoding("windows-1251");
string htmlCode;
using (WebClient client = new WebClient())
{
    // Download the raw bytes; DownloadData applies no text encoding.
    byte[] reply = client.DownloadData("http://www.nasamnatam.com/statia/Stara_planina_velichestvenata_krystnica_na_Balkanskiia_poluostrov-2525.html");
    // Convert from windows-1251 to UTF-8, then decode the result to a string.
    byte[] convertedBytes = Encoding.Convert(encoding1251, Encoding.UTF8, reply);
    htmlCode = Encoding.UTF8.GetString(convertedBytes);
}
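The snippet above hard-codes windows-1251. If the server declares its charset in the Content-Type header, the encoding could be picked from that value instead. The following is only a sketch: `CharsetHelper` and `ResolveEncoding` are hypothetical names, and a page may declare its charset only in a `<meta>` tag, which this does not inspect.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

public static class CharsetHelper
{
    // Hypothetical helper: returns the encoding named by the charset parameter
    // of a Content-Type value, or the fallback when none is usable.
    public static Encoding ResolveEncoding(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType))
            return fallback;
        if (!MediaTypeHeaderValue.TryParse(contentType, out MediaTypeHeaderValue parsed))
            return fallback;

        // Some servers quote the charset value, e.g. charset="utf-8".
        string charset = parsed.CharSet?.Trim('"');
        if (string.IsNullOrEmpty(charset))
            return fallback;

        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Unknown or unregistered charset name.
            return fallback;
        }
    }
}
```

With WebClient, the header value should be available after the download as `client.ResponseHeaders[HttpResponseHeader.ContentType]`. On .NET Core, windows-1251 only resolves once `CodePagesEncodingProvider.Instance` has been registered, as in the snippet above.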

Related

Multipart/form-data request -- Uploading pdf is resulting in a blank file

I have an objective to send a pdf file from one server to a REST API which handles some archiving. I am using .NET Core 3.1 and the RestEase API library to help with some of the abstraction.
I have a simple console app that runs at a certain time every day. The relevant code is as follows:
using System;
using System.Linq;
using System.Threading.Tasks;
using RestEase;
namespace CandidateUploadFile
{
    class Program
    {
        static async Task Main(string[] args)
        {
            try
            {
                var apiClientBuilder = new ApiClientBuilder<ITestApi>();
                var api = apiClientBuilder.GetApi("https://my.api.com");
                var candidate = await api.GetCandidateByEmailAddress("tester@aol.com");
                var fileName = "tester.pdf";
                var fileBytesToUpload = await FileHelper.GetBytesFromFile($@"./{fileName}");
                var result = await api.UploadCandidateFileAsync(fileBytesToUpload, candidate.Data.First().Id, fileName);
            }
            catch (Exception e)
            {
                System.Console.WriteLine(e);
            }
        }
    }
}
apiClientBuilder does some auth-header adding, and that's really it. I'm certain that bit isn't relevant.
ITestApi looks like this:
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Models;
using RestEase;
namespace CandidateUploadFile
{
    public interface ITestApi : IApi
    {
        [Get("v1/candidates/{candidateId}")]
        Task<Models.Response<Candidate>> GetCandidate([Path] string candidateId);
        [Get("v1/candidates")]
        Task<Models.Response<IEnumerable<Candidate>>> GetCandidateByEmailAddress([Query] string email);
        [Get("v1/candidates")]
        Task<Models.Response<IEnumerable<Candidate>>> GetCandidates();
        [Post("v1/candidates/{candidateId}/files?perform_as=327d4d21-5cb0-4bc7-95f5-ae43aabc2db7")]
        Task<string> UploadFileAsync([Path] string candidateId, [Body] HttpContent content);
        [Get("v1/users")]
        Task<Models.Response<IEnumerable<User>>> GetUsers();
    }
}
It's UploadFileAsync that is really relevant here.
You'll note from Program.Main that I don't explicitly invoke UploadFileAsync. Instead, I invoke an extension method that wraps UploadFileAsync in order to upload the pdf using a multipart/form-data request. This is the approach recommended in the RestEase library docs. That extension method looks like this:
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;
namespace CandidateUploadFile
{
    public static class ApiExtension
    {
        public static async Task<string> UploadCandidateFileAsync(this ITestApi api, byte[] data, string candidateId, string fileName)
        {
            var content = new MultipartFormDataContent();
            var fileContent = new ByteArrayContent(data);
            fileContent.Headers.ContentType = new MediaTypeHeaderValue("application/pdf");
            fileContent.Headers.ContentDisposition = new ContentDispositionHeaderValue("form-data")
            {
                Name = "file",
                FileName = fileName
            };
            content.Add(fileContent);
            return await api.UploadFileAsync(candidateId, content);
        }
    }
}
So what will happen when my console app executes is: I will get a successful response from the upload endpoint, and the file on the archive server gets created, but it's blank.
It may be important to know that this does not happen when I send, say, a .txt file. The .txt file will save with the expected content.
Any insight would be helpful. I'm not sure where to start on this one.
Thank you!
The issue was due to what I was doing in my GetBytesFromFile static helper method.
My static helper was using UTF-8 encoding to encode the binary content of the .pdf files I was uploading. That happened to work for the .txt files I was uploading, as can be expected, because their content is text to begin with.
Lesson learned: there is no need (and it makes no sense) to encode binary content before assigning it to the multipart/form-data content. I just had to pass the binary content through as-is.
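The body of GetBytesFromFile is not shown in the question, so the following is only a sketch of what a pass-through version might look like, assuming the helper takes a path and returns the file's bytes. `EncodingCheck.SurvivesUtf8RoundTrip` is a hypothetical helper added purely to illustrate why a text round trip damages binary data.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

public static class FileHelper
{
    // Assumed shape of the helper: read the file as raw bytes,
    // with no text decoding or encoding in between.
    public static Task<byte[]> GetBytesFromFile(string path)
    {
        return File.ReadAllBytesAsync(path);
    }
}

public static class EncodingCheck
{
    // Hypothetical illustration: decode the bytes as UTF-8 text, encode them
    // again, and report whether the result is identical to the input.
    public static bool SurvivesUtf8RoundTrip(byte[] data)
    {
        byte[] roundTripped = Encoding.UTF8.GetBytes(Encoding.UTF8.GetString(data));
        return roundTripped.SequenceEqual(data);
    }
}
```

Plain ASCII text survives the round trip unchanged, which matches the observation that .txt uploads worked. Bytes that are not valid UTF-8, which are common in PDF streams, are replaced during decoding, so the re-encoded array no longer matches the file.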

Capture Screenshots at Defined Time Intervals Automatically using asp.net c# web forms

I have an online test website, and I want to capture a screenshot automatically at defined time intervals, using ASP.NET C# Web Forms. Any related sample code would help.
I have tried one approach so far: I take a URL and capture the HTML response from an HTTP request. I want to store an image of the rendered HTML in a database table, but I don't know how to go about it.
Below is my C# code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Net;
using System.Text;
public partial class _Default : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        getimage();
    }
    public void getimage()
    {
        WebClient myClient = new WebClient();
        string myPageHTML = null;
        byte[] requestHTML;
        // Gets the url of the page
        string currentPageUrl = Request.Url.ToString();
        UTF8Encoding utf8 = new UTF8Encoding();
        // by setting currentPageUrl to mypage.aspx it will fetch the source (html)
        // of the mypage.aspx and put it in the myPageHTML variable.
        currentPageUrl = "http://localhost:49161/login.aspx";
        requestHTML = myClient.DownloadData(currentPageUrl);
        myPageHTML = utf8.GetString(requestHTML);
        Response.Write(myPageHTML);
    }
}

Get MP3 from Google Translate special letters

I am using this code:
using System.Net;
static void DownloadEnglish()
{
    using (WebClient Client = new WebClient())
    {
        Client.DownloadFile("http://translate.google.com/translate_tts?tl=en&q=hello", "a.mp3");
    }
}
It works fine. Note that I am downloading English here. The main problem comes when I try to do the same for a language that uses non-Latin letters, for example Thai:
using System.Net;
static void DownloadThai()
{
    using (WebClient Client = new WebClient())
    {
        Client.DownloadFile("http://translate.google.com/translate_tts?tl=th&q=สวัสดี", "a.mp3");
    }
}
But this gives me a nonsense MP3 that does not contain the spoken word. How can I fix it?
Notice the main structure of this website:
...translate.google.com/translate_tts?tl=**en**&q=**hello**"
...translate.google.com/translate_tts?tl=**th**&q=**สวัสดี**"
Use HttpUtility.UrlPathEncode("สวัสดี") to encode the Unicode characters.
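As a rough sketch of the same idea, the query value can be percent-encoded as UTF-8 before the URL is assembled. This version uses Uri.EscapeDataString rather than HttpUtility.UrlPathEncode, since it is aimed at query-string values and needs no System.Web reference. `TtsUrlBuilder` and `BuildTtsUrl` are hypothetical names, and the endpoint is simply the one from the question, which may not behave the same way today.

```csharp
using System;

public static class TtsUrlBuilder
{
    // Hypothetical helper: builds the request URL from the question,
    // percent-encoding both parameters as UTF-8.
    public static string BuildTtsUrl(string language, string text)
    {
        return "http://translate.google.com/translate_tts?tl="
            + Uri.EscapeDataString(language)
            + "&q="
            + Uri.EscapeDataString(text);
    }
}
```

The result can then be passed to Client.DownloadFile in place of the literal URL, for example `Client.DownloadFile(TtsUrlBuilder.BuildTtsUrl("th", "สวัสดี"), "a.mp3")`.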

Download HTML Page in C#

I am writing an app in C#.
Is there a way to download an HTML page by giving my program only its URL?
For example, my program would get the URL www.google.com and download the HTML page.
Use WebClient.DownloadString().
Use the WebClient class.
This is extracted from a sample on the msdn doc page:
using System;
using System.Net;
using System.IO;
public static class PageDownloader
{
    public static string Download(string uri)
    {
        // The using statements dispose the client, stream, and reader
        // even if an exception is thrown while reading.
        using (WebClient client = new WebClient())
        using (Stream data = client.OpenRead(uri))
        using (StreamReader reader = new StreamReader(data))
        {
            return reader.ReadToEnd();
        }
    }
}

Issue in writing special characters to Excel

I have a few reports that are exported to Excel. The problem is that wherever there are special characters, they are replaced by strange symbols.
For example, '-' (hyphen) was replaced by –...
How can I solve this problem?
The most straightforward way is to encode the text file as UTF-8. I ran the following code, opened the resulting hyphen.txt file in Excel 2007, and it worked as expected:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            var hyphen = "\u2010\r\n";
            var encoding = Encoding.UTF8;
            var bytes = encoding.GetBytes(hyphen);
            using (var stream = new System.IO.FileStream(@"c:\tmp\hyphen.txt", System.IO.FileMode.Create, System.IO.FileAccess.ReadWrite))
            {
                stream.Write(bytes, 0, bytes.Length);
            }
        }
    }
}
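A variation on the code above, offered only as a sketch: writing the file with a UTF-8 byte order mark, which Excel in many versions appears to use as the hint that a text file is UTF-8. `ReportWriter` and `WriteUtf8WithBom` are hypothetical names, and whether the BOM is needed depends on the Excel version and on how the file is opened.

```csharp
using System.IO;
using System.Text;

public static class ReportWriter
{
    // Hypothetical helper: writes text as UTF-8 preceded by the
    // byte order mark EF BB BF.
    public static void WriteUtf8WithBom(string path, string text)
    {
        var utf8WithBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: true);
        File.WriteAllText(path, text, utf8WithBom);
    }
}
```

Unlike Encoding.GetBytes, which returns only the encoded characters, File.WriteAllText emits the encoding's preamble at the start of the file.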
This is the code -- view at PasteBin.
