Using WebClient with a foreach loop to download about 100,000 web pages - C#

I am trying to build a small application that takes a list of around 100,000 to 200,000 URLs, downloads the HTML of each one, and saves it in a relative folder.
I have two solutions, but each has some problems, and I am trying to figure out the best approach.
First Solution: Synchronous Method
Below is the code I am using:
currentline = 0;
var lines = txtUrls.Lines.Where(line => !String.IsNullOrWhiteSpace(line)).Count();
string urltext = txtUrls.Text;
List<string> list = new List<string>(
    txtUrls.Text.Split(new string[] { "\r\n" },
        StringSplitOptions.RemoveEmptyEntries));
lblStatus.Text = "Working";
btnStart.Enabled = false;
foreach (string url in list)
{
    using (WebClient client = new WebClient())
    {
        client.DownloadFile(url, @".\pages\page" + currentline + ".html");
        currentline++;
    }
}
lblStatus.Text = "Finished";
btnStart.Enabled = true;
The code works, but it is slow, and at random points after about 5,000 URLs it stops working while the process reports that it has completed. (Note that I run this code in a BackgroundWorker; to keep it simple I am showing only the relevant part.)
Second Solution: Asynchronous Method
int currentline = 0;
string urltext = txtUrls.Text;
List<string> list = new List<string>(
    txtUrls.Text.Split(new string[] { "\r\n" },
        StringSplitOptions.RemoveEmptyEntries));
foreach (var url in list)
{
    using (WebClient webClient = new WebClient())
    {
        webClient.DownloadFileCompleted += new AsyncCompletedEventHandler(Completed);
        webClient.DownloadProgressChanged += new DownloadProgressChangedEventHandler(ProgressChanged);
        webClient.DownloadFileAsync(new Uri(url), @".\pages\page" + currentline + ".html");
    }
    currentline++;
    label1.Text = "No.of Lines Completed: " + currentline;
}
This code runs very fast, but most of the downloaded files are 0 KB. I am sure the network is fast, since I am testing on an OVH dedicated server.
Can anyone point out what I am doing wrong, give tips on improving it, or suggest an entirely different solution to this problem?

Instead of DownloadFile(), try DownloadDataTaskAsync():
public async Task GetData()
{
    using (WebClient client = new WebClient())
    {
        // The await completes only after the whole response has been received.
        byte[] data = await client.DownloadDataTaskAsync("http://xxxxxxxxxxxxxxxxxxxxx");
    }
}
You get the data as a byte[]. Then you just call
File.WriteAllBytes() to save it to disk.
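A minimal sketch of how this suggestion could be applied to the whole URL list, assuming .NET Framework 4.5 or later. The class name ThrottledDownloader, the maxParallel parameter, and the injectable fetch delegate are illustrative assumptions, not part of the original answer. Limiting the number of requests in flight is also an assumption about what may help here; the thread does not confirm the cause of the 0 KB files.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledDownloader
{
    // Downloads every URL with at most maxParallel requests in flight and
    // writes result number i to outputDir/page{i}.html.
    // Returns the number of pages that were saved successfully.
    // fetch is injectable (hypothetical design choice) so the sketch can be
    // exercised without a network.
    public static async Task<int> DownloadAllAsync(
        IReadOnlyList<string> urls,
        string outputDir,
        int maxParallel,
        Func<string, Task<byte[]>> fetch)
    {
        Directory.CreateDirectory(outputDir);
        using (var gate = new SemaphoreSlim(maxParallel))
        {
            var tasks = urls.Select(async (url, index) =>
            {
                await gate.WaitAsync();
                try
                {
                    byte[] data = await fetch(url);
                    File.WriteAllBytes(Path.Combine(outputDir, "page" + index + ".html"), data);
                    return true;
                }
                catch (Exception ex)
                {
                    // One failed URL should not abort the remaining downloads.
                    Console.Error.WriteLine(url + ": " + ex.Message);
                    return false;
                }
                finally
                {
                    gate.Release();
                }
            }).ToList();

            bool[] results = await Task.WhenAll(tasks);
            return results.Count(ok => ok);
        }
    }

    // Real fetcher based on the method suggested in the answer.
    public static async Task<byte[]> FetchWithWebClient(string url)
    {
        using (var client = new WebClient())
        {
            return await client.DownloadDataTaskAsync(new Uri(url));
        }
    }
}
```

A call might look like `await ThrottledDownloader.DownloadAllAsync(list, @".\pages", 20, ThrottledDownloader.FetchWithWebClient);`, where 20 is an arbitrary starting value to tune. Each WebClient is disposed only after its own await has finished, so no download is cut short by the using block.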

Related

How to set a prefix and delete everything after a character in each line of a txt file with C#

Hello, I'm getting some strings from a web API,
like this:
mail-lf0-f100.google.com,209.85.215.100
mail-vk0-f100.google.com,209.85.213.100
mail-ua1-f100.google.com,209.85.222.100
mail-ed1-f100.google.com,209.85.208.100
mail-lf1-f100.google.com,209.85.167.100
mail-ej1-f100.google.com,209.85.218.100
mail-pj1-f100.google.com,209.85.216.100
mail-wm1-f100.google.com,209.85.128.100
mail-io1-f100.google.com,209.85.166.100
mail-wr1-f100.google.com,209.85.221.100
mail-vs1-f100.google.com,209.85.217.100
mail-ot1-f100.google.com,209.85.210.100
mail-qv1-f100.google.com,209.85.219.100
mail-yw1-f100.google.com,209.85.161.100
It gives me some string records, and I want to perform these operations:
I want to write it line by line, as shown in the original.
I want to remove everything after the comma in each line.
I want to add a prefix before each line, for example:
Example: Google.com becomes >> This is Dns: Google.com
This is my code. What should I edit and what should I add?
string filePath = "D:\\google.txt";
WebClient wc = new WebClient();
string html = wc.DownloadString("https://api.hackertarget.com/hostsearch/?q=Google.com");
File.CreateText(filePath).Close();
string number = html.Split(',')[0].Trim();
File.WriteAllText(filePath, number);
MessageBox.Show("Finish");
You are actually close. Check the solution below:
var filePath = @"C:\Users\Mirro\Documents\Visual Studio 2010\Projects\Assessment2\Assessment2\act\actors.txt";
WebClient client = new WebClient();
string html = client.DownloadString("https://api.hackertarget.com/hostsearch/?q=google.com");
string[] lines = html.Split(
    new[] { "\r\n", "\r", "\n" },
    StringSplitOptions.None
);
// Keep only the part before the first comma of each line.
var res = lines.Select(x => (x.Split(',')[0]).Trim()).ToList();
//res.Dump();
// Write the trimmed host names (res), not the raw lines.
System.IO.File.WriteAllLines(filePath, res);
Hi,
thanks to Derviş Kayımbaşıoğlu.
I also found a solution another way:
string path = @"D:\Google.txt";
var url = "https://api.hackertarget.com/hostsearch/?q=google.com";
var client = new WebClient();
//StreamWriter myhost = new StreamWriter(path);
using (var stream = client.OpenRead(url))
using (var reader = new StreamReader(stream))
{
    string line;
    string realString;
    using (StreamWriter myhost = new StreamWriter(path))
    {
        while ((line = reader.ReadLine()) != null)
        {
            realString = "This is the dns: " + line.Split(',')[0].Trim();
            myhost.WriteLine(realString);
        }
    }
}
MessageBox.Show("Finish");
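The three requested steps (split into lines, drop everything after the comma, add a prefix) can also be kept in one small function that is independent of the download. This is a sketch only; the class name HostLineFormatter and the method name Format are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostLineFormatter
{
    // Splits the API response into lines, keeps the text before the first
    // comma of each non-blank line, and puts prefix in front of it.
    public static List<string> Format(string response, string prefix)
    {
        return response
            .Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries)
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => prefix + line.Split(',')[0].Trim())
            .ToList();
    }
}
```

With the variables from the question, the result could then be saved with `File.WriteAllLines(filePath, HostLineFormatter.Format(html, "This is Dns: "));`.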

WebClient DownloadString method stops downloading string after few hours of running

I have an EXE that downloads a string from an API hosted in the cloud. It works well for one to two hours and downloads every string from the requested URI, but after a certain number of requests (or one to two hours) it no longer downloads anything. I also tried the DownloadStringAsync method, with the same behavior. The code follows.
static void Main(string[] args)
{
    for (int i = 0; i < 100000; i++)
    {
        using (var webClient = new WebClient())
        {
            webClient.Headers.Clear();
            webClient.Headers.Add("MyId", "037a1289-a1c6-e611-80d9-000d3a213f57");
            webClient.Headers.Add("Content-Type", "application/json; charset=utf-8");
            string downloadedString = webClient.DownloadString(new Uri("https://mycloudapiurl.com/api/MySet/GetMySets?id=D6364A82-9A3C-E711-80E0-000D3A213F57"));
            if (!string.IsNullOrWhiteSpace(downloadedString) && downloadedString != "null")
            {
                Console.WriteLine("Downloaded " + i + "times");
            }
            else
            {
                Console.WriteLine("Downloaded string is null for API URL");
            }
        }
        Thread.Sleep(10000);
    }
}
This execution stops after around 100 to 120 iterations. The same thing occurs in my real application. I am not able to figure out what stops the downloads after a certain number of iterations.

CSV Reading as empty

I've got a page where, at intervals of around 10 minutes, a CSV file (received via an HTTP link) is saved to a folder. This CSV file then has to be loaded into SQL. I've managed to get the CSV files and save them in the folder, but the problem is that when I try to read the data, the file appears to be empty (although it is not). It doesn't throw any errors, but when I run it in the debugger it shows "Object reference not set to an instance of an object".
This is my code.
The method that drives the entire process:
private void RunCSVGetToSqlOrder()
{
    DAL d = new DAL();
    GetGeoLocations();
    string geoLocPath = Server.MapPath("~\\GeoLocations\\GeoLocation.csv");
    string assetListPath = Server.MapPath("~\\GeoLocations\\AssetList.csv");
    d.InsertAssetGeoLocation(ImportGeoLocations(geoLocPath));
    d.InsertAssetList(ImportAssetList(assetListPath));
    DeleteFileFromFolder();
}
Getting the csv and saving into a folder (working):
private void GetGeoLocations()
{
    string linkGeoLoc = "http://app03.gpsts.co.za/io/api/asset_group/APIAssetGroupLocation.class?zqcEK60SxfoP4fVppcLoCXFWUfVRVkKS#auth_token#auth_token";
    string filepathGeoLoc = Server.MapPath("~\\GeoLocations\\GeoLocation.csv");
    using (WebClient wc = new WebClient())
    {
        wc.DownloadFileAsync(new System.Uri(linkGeoLoc), filepathGeoLoc);
    }
}
Read csv file and import to sql:
private static DataTable ImportGeoLocations(string csvFilePath)
{
    DataTable csvData = new DataTable();
    try
    {
        using (TextFieldParser csvReader = new TextFieldParser(csvFilePath))
        {
            // csvReader.TextFieldType = FieldType.Delimited;
            csvReader.SetDelimiters(new string[] { "," });
            csvReader.HasFieldsEnclosedInQuotes = true;
            csvReader.TrimWhiteSpace = true;
            string[] colFields = csvReader.ReadFields();
            foreach (string column in colFields)
            {
                DataColumn datecolumn = new DataColumn(column);
                datecolumn.AllowDBNull = true;
                csvData.Columns.Add(datecolumn);
            }
            while (!csvReader.EndOfData)
            {
                string[] fieldData = csvReader.ReadFields();
                //Making empty value as null
                for (int i = 0; i < fieldData.Length; i++)
                {
                    if (fieldData[i] == "")
                    {
                        fieldData[i] = null;
                    }
                }
                csvData.Rows.Add(fieldData);
            }
        }
    }
    catch (Exception ex)
    {
    }
    return csvData;
}
The above code fails with "Object reference not set to an instance of an object" on the following line, most probably because it reads the CSV as empty (null):
string[] colFields = csvReader.ReadFields();
I'm not sure what I'm doing wrong. Any advice would be greatly appreciated.
------------ EDIT --------------
The csv file after the download looks as follows:
-------- Solution ------------
Below is the solution:
private void RunCSVGetToSqlOrder()
{
    GetGeoLocations();
    DeleteFileFromFolder();
}
private void GetGeoLocations()
{
    string linkGeoLoc = "http://app03.gpsts.co.za/io/api/asset_group/APIAssetGroupLocation.class?zqcEK60SxfoP4fVppcLoCXFWUfVRVkKS#auth_token#auth_token";
    string filepathGeoLoc = Server.MapPath("~\\GeoLocations\\GeoLocation.csv");
    using (WebClient wc = new WebClient())
    {
        // Subscribe before starting the download so the completion event cannot be missed.
        wc.DownloadFileCompleted += new AsyncCompletedEventHandler(wc_DownloadFileCompletedGeoLoc);
        wc.DownloadFileAsync(new System.Uri(linkGeoLoc), filepathGeoLoc);
    }
    string linkAssetList = "http://app03.gpsts.co.za/io/api/asset_group/APIAssetGroupLocation.class?zqcEK60SxfoP4fVppcLoCXFWUfVRVkKS#auth_token#auth_token";
    string filepathAssetList = Server.MapPath("~\\GeoLocations\\AssetList.csv");
    using (WebClient wc = new WebClient())
    {
        wc.DownloadFileCompleted += new AsyncCompletedEventHandler(wc_DownloadFileCompletedAssetList);
        wc.DownloadFileAsync(new System.Uri(linkAssetList), filepathAssetList);
    }
}
// Runs once GeoLocation.csv has been written completely.
void wc_DownloadFileCompletedGeoLoc(object sender, System.ComponentModel.AsyncCompletedEventArgs e)
{
    DAL d = new DAL();
    string geoLocPath = Server.MapPath("~\\GeoLocations\\GeoLocation.csv");
    d.InsertAssetGeoLocation(ImportGeoLocations(geoLocPath));
}
// Runs once AssetList.csv has been written completely.
void wc_DownloadFileCompletedAssetList(object sender, System.ComponentModel.AsyncCompletedEventArgs e)
{
    DAL d = new DAL();
    string assetListPath = Server.MapPath("~\\GeoLocations\\AssetList.csv");
    d.InsertAssetList(ImportAssetList(assetListPath));
}
wc.DownloadFileAsync only starts the download and returns immediately. It returns void rather than a Task (the Task-returning variant is DownloadFileTaskAsync), and the file is complete only once the DownloadFileCompleted event has fired.
This issue is hard to catch with the debugger, because the download has enough time to finish by the time a breakpoint is reached or the file is inspected manually later. When the code runs without a breakpoint, however, the call d.InsertAssetGeoLocation(ImportGeoLocations(geoLocPath)) is reached before the download is complete, and the CSV file is therefore still empty.
Possible solutions:
Redesign the code with async / await
Use the wc.DownloadFileCompleted event for continuation
Use the synchronous wc.DownloadFile method
... just to name a few.
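The first option could be sketched as follows, assuming .NET Framework 4.5 or later, where WebClient.DownloadFileTaskAsync is available. The class name CsvDownload and the method name DownloadThenReadAsync are hypothetical, and reading the file with File.ReadAllLines stands in for the ImportGeoLocations call from the question.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading.Tasks;

public static class CsvDownload
{
    // Downloads source to targetPath and reads the file only after the
    // download has finished, so the reader never sees an empty file.
    public static async Task<string[]> DownloadThenReadAsync(Uri source, string targetPath)
    {
        using (var wc = new WebClient())
        {
            // The await completes only when the whole file is on disk.
            await wc.DownloadFileTaskAsync(source, targetPath);
        }
        return File.ReadAllLines(targetPath);
    }
}
```

The caller has to await the returned task as well (or otherwise wait for it) before deleting the file or inserting into the database; how that is wired into the page depends on the hosting framework and is not shown here.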
From the screenshot, it looks like the CSV file is not read properly on your computer. Depending on the language and region settings, a .CSV file sometimes uses ";" as the separator instead of ",".
Could you try manually replacing all "," with ";" and see if that solves the issue?

Adding multiple files to attachments using WebClient

I'm trying to send an email with multiple attachments from my website dashboard. However, my code only attaches the last file. I'm using WebClient to download the files from a cloud system and then adding them as attachments.
This is the code I have written:
foreach (var link in attachments)
{
    var uri = new Uri(link);
    var s = uri.Segments[1];
    var tempDirectory = @"c:\tempFolder\" + @"\" + s;
    WebClient webClient = new WebClient();
    webClient.DownloadFile(link, tempDirectory);
    Attachment attachment = new Attachment(tempDirectory);
    attachment.ContentDisposition.Inline = true;
    attachment.ContentDisposition.DispositionType = DispositionTypeNames.Inline;
    attachment.ContentId = model.Id.ToString();
    attachment.ContentType.Name = s;
    email.Attachments.Add(attachment);
}
Any help would be appreciated
EDITED:
I attached two files, so getting these links in each loop:
Loop 1:
abc.com/TestEmail1.txt
Loop 2:
abc.com/TestEmail2.txt
Removing these lines solved my problem:
attachment.ContentDisposition.Inline = true;
attachment.ContentDisposition.DispositionType = DispositionTypeNames.Inline;
attachment.ContentId = model.Id.ToString();
attachment.ContentType.Name = s;
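The thread does not say why removing those lines helped. One possible explanation, stated here purely as an assumption, is that every attachment received the same ContentId (model.Id) while being marked inline, so a mail client might treat them as the same part. If a ContentId is needed at all, a sketch like the following gives each attachment its own value. The class name AttachmentBuilder and the use of a GUID are illustrative choices, not something the original poster used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Mail;

public static class AttachmentBuilder
{
    // Creates one Attachment per file and gives each a distinct ContentId
    // (assumption: a shared ContentId was the cause of the original problem).
    public static List<Attachment> FromFiles(IEnumerable<string> paths)
    {
        var result = new List<Attachment>();
        foreach (string path in paths)
        {
            var attachment = new Attachment(path);
            attachment.ContentId = Guid.NewGuid().ToString("N");
            attachment.Name = Path.GetFileName(path);
            result.Add(attachment);
        }
        return result;
    }
}
```

Each returned attachment can then be added with `email.Attachments.Add(...)`; disposing the MailMessage afterwards also releases the file handles held by its attachments.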

How can I download several files from the Internet and wait for all of them to finish downloading?

I need to download several HTML files in my ASP.NET application. The average file size is about 100 KB.
Right now I'm using the following code.
foreach (var item in items)
{
    string url = (string)item.Element("link");
    string title = (string)item.Element("title");
    string fileName = Server.MapPath(title + ".html");
    // Add the attachment
    WebClient client = new WebClient();
    client.Encoding = System.Text.Encoding.UTF8;
    client.DownloadFileCompleted += new AsyncCompletedEventHandler((a, b) =>
    {
        System.Threading.Thread.Sleep(2000);
        message.Attachments.Add(new Attachment(fileName));
        counter++;
        // If we've downloaded all the items, send the message with the items attached to it
        if (counter == totalItems)
        {
            SendMessage(message);
        }
    });
    client.DownloadFileAsync(new Uri(url), fileName);
}
As you can see I'm downloading the files asynchronously, but the foreach loop doesn't care that the file hasn't been downloaded yet, it goes to the next iterated item.
As a result of this, some of the files are not downloaded.
Use the CountdownEvent Class to count down the number of remaining files.
var cde = new CountdownEvent(items.Count);
foreach (var item in items)
{
    ...
    client.DownloadFileCompleted += (a, b) =>
    {
        lock (message)
        {
            message.Attachments.Add(new Attachment(fileName));
            // Signal() decrements the remaining count; AddCount() would increase it.
            cde.Signal();
        }
    };
    ...
}
// Block until every download has signaled, then
// send the message with the items attached to it
cde.Wait();
lock (message)
{
    SendMessage(message);
}
If you're using .NET Framework 4, you could use the Task class and its WaitAll method.
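A task-based variant might look like the sketch below. It assumes .NET Framework 4.5 or later, because it uses Task.WhenAll and WebClient.DownloadFileTaskAsync rather than the .NET 4 Task.WaitAll mentioned above. The names BatchDownload, DownloadAllAsync and the injectable download delegate are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading.Tasks;

public static class BatchDownload
{
    // Starts one download per (url, fileName) pair and completes only when
    // all of them have finished. Returns the file names in input order.
    public static async Task<List<string>> DownloadAllAsync(
        IEnumerable<KeyValuePair<Uri, string>> jobs,
        Func<Uri, string, Task> download)
    {
        var files = new List<string>();
        var pending = new List<Task>();
        foreach (var job in jobs)
        {
            pending.Add(download(job.Key, job.Value));
            files.Add(job.Value);
        }
        // Throws if any single download failed.
        await Task.WhenAll(pending);
        return files;
    }

    // Real downloader; each WebClient lives until its own download is done.
    public static async Task DownloadWithWebClient(Uri url, string fileName)
    {
        using (var client = new WebClient())
        {
            await client.DownloadFileTaskAsync(url, fileName);
        }
    }
}
```

Once the returned task has completed, the attachments can be added in a plain loop and the message sent exactly once, with no shared counter or lock.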
It could be as simple as a bug in the code that populates totalItems.
Try if (counter == items.Count())
instead of
if (counter == totalItems)
