Wikipedia user agent problem when downloading images - c#

I am trying to download about 250 images from Wikipedia with a C# .NET console application.
After downloading 3 I get this error:
System.Net.WebException: 'The remote server returned an error: (403) Forbidden. Please comply with the User-Agent policy: https://meta.wikimedia.org/wiki/User-Agent_policy. '
I have read their User-Agent policy page and added a user agent that complies with what they say (to the best of my ability; I'm not a web dev).
They say to make it descriptive, include the word "bot" if it's a bot, and include contact details in parentheses, all of which I have done.
I am also waiting 5 seconds between images. I just really, really don't want to download them by hand in my browser.
static void DownloadImages()
{
var files = Directory.GetFiles(@"C:\projects\CarnivoraData", "*", SearchOption.AllDirectories);
var client = new WebClient();
client.Headers.Add("User-Agent", "bot by <My Name> (<My email address>) I am downloading an image of each carnivoran once (less than 300 images) for educational purposes");
foreach (var path in files)
{
//Console.WriteLine(path);
//Console.WriteLine(File.ReadAllText(path));
AnimalData data = JsonSerializer.Deserialize<AnimalData>(File.ReadAllText(path));
client.DownloadFile("https:" + data.Imageurl, @"C:\projects\CarnivoraImages\" + data.Name + Path.GetExtension(data.Imageurl));
System.Threading.Thread.Sleep(5000);
}
}
Any suggestions?

OK, I got this to work. I think the key was using HttpClient to download the files instead of WebClient, and using DefaultRequestHeaders.UserAgent.ParseAdd:
var httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.UserAgent.ParseAdd("<My Name>/1.0 (<My Email>) bot");
I didn't even bother waiting between images; they all downloaded in about a minute.
Also, as a bonus, here's how to download a file using HttpClient (it's a lot messier than WebClient!):
static async Task GetFile(HttpClient httpClient,string filepath, string url)
{
using (var stream = await httpClient.GetStreamAsync(new Uri(url)))
{
using (var fileStream = new FileStream(filepath, FileMode.CreateNew))
{
await stream.CopyToAsync(fileStream);
}
}
}
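Putting the two snippets above together, here is a minimal sketch of the whole download loop. The class name, the method names, and the User-Agent string (`ExampleImageBot/1.0`, `contact@example.com`) are placeholders I made up for illustration; substitute your own details. It uses FileMode.Create rather than FileMode.CreateNew so that re-running it overwrites earlier files instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class ImageDownloader
{
    // Builds one client whose User-Agent is sent with every request made through it.
    public static HttpClient CreateClient()
    {
        var client = new HttpClient();
        // Placeholder product token and contact details (assumption, not from the original post).
        client.DefaultRequestHeaders.UserAgent.ParseAdd("ExampleImageBot/1.0 (contact@example.com) bot");
        return client;
    }

    // Downloads each URL to its target path, one after another.
    public static async Task DownloadAllAsync(HttpClient client, IEnumerable<(string Url, string FilePath)> items)
    {
        foreach (var (url, filePath) in items)
        {
            using (var response = await client.GetAsync(url, HttpCompletionOption.ResponseHeadersRead))
            {
                // Throws HttpRequestException on 403 and other non-success codes.
                response.EnsureSuccessStatusCode();
                using (var fileStream = new FileStream(filePath, FileMode.Create))
                {
                    await response.Content.CopyToAsync(fileStream);
                }
            }
        }
    }
}
```

Reusing a single HttpClient for all requests is the usual recommendation; creating a new one per file works but wastes connections.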

Related

Can you download a file from an HttpContent stream from within a Windows Forms Application?

I recently developed a .NET Web App that downloaded zip files from a certain, set location on our network. I did this by retrieving the content stream and then passing it back to the View by returning the File().
Code from the .NET Web App whose behavior I want to emulate:
public async Task<ActionResult> Download()
{
try
{
HttpContent content = plan.response.Content;
var contentStream = await content.ReadAsStreamAsync(); // get the actual content stream
if (plan.contentType.StartsWith("image") || plan.contentType.Contains("pdf"))
return File(contentStream, plan.contentType);
return File(contentStream, plan.contentType, plan.PlanFileName);
}
catch (Exception e)
{
return Json(new { success = false });
}
}
plan.response is constructed in a separate method then stored as a Session variable so that it is specific to the user then accessed here for download.
I am now working on a Windows Forms Application that needs to be able to access and download these files from the same location. I am able to retrieve the response content, but I do not know how to proceed in order to download the zip within a Windows Forms Application.
Is there a way, once I have the content stream, to download this file using a similar method within a Windows Forms App? It would be convenient, as accessing the files initially requires logging in and authenticating the user, so they cannot be accessed normally with just a file path.
Well, depending on what you're trying to accomplish, here's a pretty simplistic example of downloading a file from a URL and saving it locally:
string href = "https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-zip-file.zip";
WebRequest request = WebRequest.Create(href);
using (WebResponse response = request.GetResponse())
{
using (Stream dataStream = response.GetResponseStream())
{
Uri uri = new Uri(href);
string fileName = Path.GetTempPath() + Path.GetFileName(uri.LocalPath);
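// FileMode.Create truncates an existing file; OpenOrCreate would leave stale trailing bytes from a longer previous file
using (FileStream fs = new FileStream(fileName, FileMode.Create))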
{
dataStream.CopyTo(fs);
}
}
}
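If you already hold the HttpContent from the authenticated response (as in the question), you may not need a second request at all. A minimal sketch, assuming the content has not been read yet; `ContentSaver` and `SaveToFileAsync` are hypothetical names for illustration:

```csharp
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static class ContentSaver
{
    // Copies the response body to a local file without buffering it all in memory first.
    public static async Task SaveToFileAsync(HttpContent content, string destinationPath)
    {
        using (Stream contentStream = await content.ReadAsStreamAsync())
        using (var fileStream = new FileStream(destinationPath, FileMode.Create, FileAccess.Write))
        {
            await contentStream.CopyToAsync(fileStream);
        }
    }
}
```

In a Windows Forms app you could take the destination path from a SaveFileDialog and call `await ContentSaver.SaveToFileAsync(plan.response.Content, dialog.FileName);` from an async event handler.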

C# Server downloads the file and transfers it to the user at the same time. Download fails [Google Drive]

I'm writing a program with ASP.NET Core.
The program will download the file from Google Drive without storing it in memory or on disk and transfer it to the user.
The download function in the official Google Drive libraries does not return until the download has finished. For this reason, I send a plain GET request to the API, read the file as a stream, and return it to the user.
But past a certain size, the download fails with an error in download managers like IDM and in browsers.
In short, I want to make a program that uses the server as a bridge, and it should not be interrupted.
[HttpGet]
[DisableRequestSizeLimit]
public async Task<FileStreamResult> Download([FromQuery(Name = "file")] string fileid)
{
if (!string.IsNullOrEmpty(fileid))
{
var decoded = Base64.Base64Decode(fileid);
var file = DriveAPI.service.Files.Get(decoded);
file.SupportsAllDrives = true;
file.SupportsTeamDrives = true;
var fileinf = file.Execute();
var filesize = fileinf.FileSize;
var cli = new HttpClient(DriveAPI.service.HttpClient.MessageHandler);
//var req = await cli.SendAsync(file.CreateRequest());
var req = await cli.GetAsync($"https://www.googleapis.com/drive/v2/files/{decoded}?alt=media", HttpCompletionOption.ResponseHeadersRead);
//var req = await DriveAPI.service.HttpClient.GetAsync($"https://www.googleapis.com/drive/v2/files/{decoded}?alt=media", HttpCompletionOption.ResponseHeadersRead);
var contenttype = req.Content.Headers.ContentType.MediaType;
if (contenttype == "application/json")
{
var message = JObject.Parse(req.Content.ReadAsStringAsync().Result).SelectToken("error.message");
if (message.ToString() == "The download quota for this file has been exceeded")
{
throw new Exception("Google Drive daily download quota exceeded. Please try again in 24-48 hours.");
}
else
{
throw new Exception(message.ToString());
}
}
else
{
return File(req.Content.ReadAsStream(), contenttype, fileinf.OriginalFilename, false);
}
}
else
{
return null;
}
}
Some errors are written to the log file when downloading:
Received an unexpected EOF or 0 bytes from the transport stream.
Unable to read data from the transport connection.
etc.
If user is using IDM, the error is:
Server sent wrong answer on restart command
If user is downloading from browser, the error is:
Network Error
I started a 1.5 GB file download on an 8 Mbps connection; at roughly 900 MB downloaded, IDM stopped the download with the error above.
I have no idea other than returning FileStreamResult in ASP.NET Core - how to download concurrent files?

SSH.NET Upload Multiple files asynchronously throws an exception

I am creating an application that will:
1. process a CSV file,
2. create a JObject for each record in the CSV file and save the JSON as a txt file, and finally
3. upload all these JSON files to an SFTP server.
After looking around for a free library for the 3rd point, I decided to use SSH.NET.
I have created the following class to perform the upload operation asynchronously.
public class JsonFtpClient : IJsonFtpClient
{
private string _sfptServerIp, _sfptUsername, _sfptPassword;
public JsonFtpClient(string sfptServerIp, string sfptUsername, string sfptPassword)
{
_sfptServerIp = sfptServerIp;
_sfptUsername = sfptUsername;
_sfptPassword = sfptPassword;
}
public Task<string> UploadDocumentAsync(string sourceFilePath, string destinationFilePath)
{
return Task.Run(() =>
{
using (var client = new SftpClient(_sfptServerIp, _sfptUsername, _sfptPassword))
{
client.Connect();
using (Stream stream = File.OpenRead(sourceFilePath))
{
client.UploadFile(stream, destinationFilePath);
}
client.Disconnect();
}
return (destinationFilePath);
});
}
}
The UploadDocumentAsync method returns a TPL Task so that I can call it to upload multiple files asynchronously.
I call this UploadDocumentAsync method from the following method which is in a different class:
private async Task<int> ProcessJsonObjects(List<JObject> jsons)
{
var uploadTasks = new List<Task>();
foreach (JObject jsonObj in jsons)
{
var fileName = string.Format("{0}{1}", Guid.NewGuid(), ".txt");
//save the file to a temp location
FileHelper.SaveTextIntoFile(AppSettings.ProcessedJsonMainFolder, fileName, jsonObj.ToString());
//call the FTP client class and store the Task in a collection
var uploadTask = _ftpClient.UploadDocumentAsync(
Path.Combine(AppSettings.ProcessedJsonMainFolder, fileName),
string.Format("/Files/{0}", fileName));
uploadTasks.Add(uploadTask);
}
//wait for all files to be uploaded
await Task.WhenAll(uploadTasks);
return jsons.Count();
}
The CSV file results in thousands of JSON records, but I want to upload these in batches of at least 50. ProcessJsonObjects always receives a list of 50 JObjects at a time, which I want to upload asynchronously to the SFTP server. But I receive the following error on the client.Connect(); line of the UploadDocumentAsync method:
Session operation has timed out
Decreasing the batch size to 2 works fine but sometimes results in the following error:
Client not connected.
I need to be able to upload many files at the same time. Alternatively, tell me if IIS or the SFTP server needs configuration for this type of operation, and what that configuration is.
What am I doing wrong? Your help is much appreciated.
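One thing that might be worth trying, on the unconfirmed assumption that the server is refusing or stalling on 50 simultaneous connections, is to cap how many uploads run at once. The sketch below is not SSH.NET-specific: `Throttle` and `RunThrottledAsync` are hypothetical names, and `work` stands in for a call such as `_ftpClient.UploadDocumentAsync(...)`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class Throttle
{
    // Runs work(item) for every item, with at most maxConcurrency running at the same time.
    public static async Task RunThrottledAsync<T>(IEnumerable<T> items, int maxConcurrency, Func<T, Task> work)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = items.Select(async item =>
            {
                await gate.WaitAsync();   // wait for a free slot
                try
                {
                    await work(item);
                }
                finally
                {
                    gate.Release();       // free the slot even if the upload throws
                }
            }).ToList();

            await Task.WhenAll(tasks);
        }
    }
}
```

Whether a limit of, say, 5 or 10 is appropriate depends on the server's session limits, which I can't tell from the question.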

When I use the .NET WebClient DownloadFileAsync I randomly get zero length files returned

I'm trying to download files from my FTP server, several at the same time. When I use DownloadFileAsync, random files are returned with a byte[] Length of 0. I can 100% confirm the file exists on the server and has content, and the FTP server (running FileZilla Server) isn't erroring and says the file has been transferred.
private async Task<IList<FtpDataResult>> DownloadFileAsync(FtpFileName ftpFileName)
{
var address = new Uri(string.Format("ftp://{0}{1}", _server, ftpFileName.FullName));
var webClient = new WebClient
{
Credentials = new NetworkCredential(_username, _password)
};
var bytes = await webClient.DownloadDataTaskAsync(address);
using (var stream = new MemoryStream(bytes))
{
// extract the stream data (either files in a zip OR a file);
return result;
}
}
When I try this code, it's slower (of course) but all the files have content.
private async Task<IList<FtpDataResult>> DownloadFileAsync(FtpFileName ftpFileName)
{
var address = new Uri(string.Format("ftp://{0}{1}", _server, ftpFileName.FullName));
var webClient = new WebClient
{
Credentials = new NetworkCredential(_username, _password)
};
// NOTICE: I've removed the AWAIT and a different method.
var bytes = webClient.DownloadData(address);
using (var stream = new MemoryStream(bytes))
{
// extract the stream data (either files in a zip OR a file);
return result;
}
}
Can anyone see what I'm doing wrong, please? Why would the DownloadFileAsync be randomly returning zero bytes?
Try out FtpWebRequest/FtpWebResponse classes. You have more available to you for debugging purposes.
FtpWebRequest - http://msdn.microsoft.com/en-us/library/system.net.ftpwebrequest(v=vs.110).aspx
FtpWebResponse - http://msdn.microsoft.com/en-us/library/system.net.ftpwebresponse(v=vs.110).aspx
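A rough sketch of what that could look like, so the byte count and the server's status text can be logged per file. `FtpSketch` and its method names are made up for illustration; note that WebRequest.Create is flagged as obsolete on newer .NET versions, so this is aimed at the .NET Framework setup in the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

static class FtpSketch
{
    // Builds a RETR (download) request for the given ftp:// address.
    public static FtpWebRequest CreateDownloadRequest(Uri address, string username, string password)
    {
        var request = (FtpWebRequest)WebRequest.Create(address);
        request.Method = WebRequestMethods.Ftp.DownloadFile;
        request.Credentials = new NetworkCredential(username, password);
        return request;
    }

    // Downloads the file into memory and logs what came back.
    public static async Task<byte[]> DownloadAsync(Uri address, string username, string password)
    {
        var request = CreateDownloadRequest(address, username, password);
        using (var response = (FtpWebResponse)await request.GetResponseAsync())
        using (var stream = response.GetResponseStream())
        using (var memory = new MemoryStream())
        {
            await stream.CopyToAsync(memory);
            // Log the size and the server's status text to help spot the empty downloads.
            Console.WriteLine($"{address}: {memory.Length} bytes, status: {response.StatusDescription}");
            return memory.ToArray();
        }
    }
}
```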
Take a look at http://netftp.codeplex.com/. It appears as though almost all methods implement IAsyncResult. There isn't much documentation on how to get started, but I would assume that it is similar to the synchronous FTP classes from the .NET framework. You can install the nuget package here: https://www.nuget.org/packages/System.Net.FtpClient/

How to get response stream after uploading a file

I am working on a Metro app. I used BackgroundUploader to upload a file, but my question is how to get the response value after uploading it. I coded it like this:
BackgroundUploader uploader = new BackgroundUploader();
uploader.SetRequestHeader("Content-Disposition", "form-data");
uploader.SetRequestHeader("name", "userfile");
uploader.SetRequestHeader("filename", App.ViewModel.DeviceId + ".png");
uploader.SetRequestHeader("Content-Type", "multipart/form-data");
UploadOperation upload = uploader.CreateUpload(uri, file);
await upload.StartAsync();
I came up with the following after noticing there were BytesReceived in my upload progress object.
async private Task<string> GetUploadResponseBody(UploadOperation operation)
{
string responseBody = string.Empty;
using (var response = operation.GetResultStreamAt(0))
{
uint size = (uint)operation.Progress.BytesReceived;
IBuffer buffer = new Windows.Storage.Streams.Buffer(size);
var f = await response.ReadAsync(buffer, size, InputStreamOptions.None);
using (var dr = DataReader.FromBuffer(f))
{
responseBody = dr.ReadString(dr.UnconsumedBufferLength);
}
}
return responseBody;
}
upload.StartAsync().Completed = UploadCompletedHandler;
...
void UploadCompletedHandler(IAsyncOperationWithProgress<TResult, TProgress> asyncInfo,
AsyncStatus asyncStatus)
{
// get a response body from an asyncInfo using the asyncInfo.GetResults() method
}
See these resources:
UploadOperation.StartAsync | startAsync Method (Windows)
IAsyncOperationWithProgress Interface (Windows)
AsyncOperationWithProgressCompletedHandler Delegate (Windows)
I've been looking for the same thing for the last few days with no luck. I finally discovered that you cannot do this. You can get the headers of the response, but there is no way of getting the body of the response from the BackgroundTransfer getResponseInformation() method.
Currently it's a limitation of the Windows API. Hopefully they'll add it soon.
http://msdn.microsoft.com/en-us/library/windows/apps/windows.networking.backgroundtransfer.responseinformation.aspx#properties
The workaround is to add a custom header to the response. For this you need to modify your server-side script. If you don't have any control over the server-side script, use a proxy script that handles the communication between your app and the remote server. In my case I created a proxy script in PHP that communicates with the remote server and, after getting the response, adds it to a custom header key.
Then, in the app, use this in your complete method:
function complete(e){
var mykey = e.getResponseInformation().headers.lookup("mykey");
}
Hope that'll help.
