C# Multithreading Loop Datatable - c#

I have a DataTable with 1000 records. Each row has a column containing a link. I loop over the DataTable and fetch a record from the website using the link in each row. The code works fine, but it takes too long to retrieve the records. So I need to fetch the records on multiple threads and add them all to a single DataTable. I am using C# and Visual Studio 2015.
How can this be done with threading in C#? Any help is appreciated.
The existing code is below.
for (int i = 0; i < dt.Rows.Count; i++)
{
    String years = String.Empty;
    dt.Rows[i]["Details"] = GetWebText(dt.Rows[i]["link"].ToString());
}

private String GetWebText(String url)
{
    var html = new HtmlAgilityPack.HtmlDocument();
    // LoadHtml returns void, so load the page first and then read the text from the document.
    html.LoadHtml(new WebClient().DownloadString(url));
    return html.DocumentNode.InnerText;
}

You are going to run into thread-safety issues here, because write operations on a DataTable are not thread-safe. So you need to make sure the operations you perform are cleanly separated.
The good thing is that you are actually doing three distinct steps and you can easily break them apart and parallelize the slow part while keeping it thread-safe.
Here's what your code is doing:
var url = dt.Rows[i]["link"].ToString();
var webText = GetWebText(url);
dt.Rows[i]["Details"] = webText;
Let's process the data in these three steps, but only parallelize the GetWebText part.
This is how:
var data =
    dt
        .AsEnumerable()
        .Select(r => new { Row = r, Url = r["link"].ToString() })
        .AsParallel()
        // This `Select` is the only part run in parallel
        .Select(x => new { x.Row, WebText = GetWebText(x.Url) })
        .ToArray();

foreach (var datum in data)
{
    datum.Row["Details"] = datum.WebText;
}
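PLINQ typically picks its degree of parallelism from the number of cores, which can be low for work that mostly waits on the network. Below is a minimal sketch of the same three-step pattern with an explicit degree of parallelism. The `FetchText` stand-in, the sample rows, and the value 16 are assumptions for illustration and are not part of the original code.

```csharp
using System;
using System.Data;
using System.Linq;

class Program
{
    // Hypothetical stand-in for GetWebText so the sketch runs without network access.
    static string FetchText(string url)
    {
        return "text for " + url;
    }

    static void Main()
    {
        // Sample table; on .NET Framework, AsEnumerable needs a reference to
        // System.Data.DataSetExtensions.
        var dt = new DataTable();
        dt.Columns.Add("link", typeof(string));
        dt.Columns.Add("Details", typeof(string));
        for (int i = 0; i < 5; i++)
        {
            dt.Rows.Add("http://example.com/" + i, "");
        }

        // Step 1 (sequential): read the URLs out of the table.
        var inputs = dt.AsEnumerable()
            .Select(r => new { Row = r, Url = r["link"].ToString() })
            .ToArray();

        // Step 2 (parallel): only the slow fetch runs on multiple threads.
        var data = inputs
            .AsParallel()
            .WithDegreeOfParallelism(16) // assumed value; tune for the target site
            .Select(x => new { x.Row, WebText = FetchText(x.Url) })
            .ToArray();

        // Step 3 (sequential): write back to the table on a single thread.
        foreach (var datum in data)
        {
            datum.Row["Details"] = datum.WebText;
        }

        foreach (DataRow row in dt.Rows)
        {
            Console.WriteLine(row["link"] + " -> " + row["Details"]);
        }
    }
}
```

Materializing the inputs with `ToArray()` before `AsParallel()` keeps every read of the table on the calling thread, at the cost of one extra array.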

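Blocking collections can solve the problem. Each URL is queued together with its row index, because results come back in completion order rather than row order:
// requires: using System.Collections.Concurrent; using System.Collections.Generic; using System.Net; using System.Threading;
static BlockingCollection<KeyValuePair<int, string>> links = new BlockingCollection<KeyValuePair<int, string>>();
static BlockingCollection<KeyValuePair<int, string>> results = new BlockingCollection<KeyValuePair<int, string>>();

public static void Main()
{
    // get your DataTable (dt)
    const int workerCount = 8; // a handful of workers rather than one thread per row
    for (int i = 0; i < workerCount; i++)
    {
        Thread th = new Thread(Worker);
        th.IsBackground = true;
        th.Start();
    }
    for (int i = 0; i < dt.Rows.Count; i++)
    {
        links.Add(new KeyValuePair<int, string>(i, dt.Rows[i]["link"].ToString()));
    }
    links.CompleteAdding(); // lets the workers exit once the queue is drained
    for (int i = 0; i < dt.Rows.Count; i++)
    {
        // only this thread writes to the DataTable
        KeyValuePair<int, string> result = results.Take();
        dt.Rows[result.Key]["Details"] = result.Value;
    }
}

public static void Worker()
{
    // blocks while links is empty, ends after CompleteAdding
    foreach (KeyValuePair<int, string> link in links.GetConsumingEnumerable())
    {
        var html = new HtmlAgilityPack.HtmlDocument();
        html.LoadHtml(new WebClient().DownloadString(link.Value));
        // add the result to the other queue
        results.Add(new KeyValuePair<int, string>(link.Key, html.DocumentNode.InnerText));
    }
}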

Related

How can I access multi-element List data stored in a public class?

My first question on SO:
I created this public class, so that I can store three elements in a list:
public class myMultiElementList
{
public string Role {get;set;}
public string Country {get;set;}
public int Commonality {get;set;}
}
In my main class, I then created a new list using this process:
var EmployeeRolesCountry = new List<myMultiElementList>();
var rc1 = new myMultiElementList();
rc1.Role = token.Trim();
rc1.Country = country.Trim();
rc1.Commonality = 1;
EmployeeRolesCountry.Add(rc1);
I've added data to EmployeeRolesCountry and have validated that has 472 lines. However, when I try to retrieve it as below, my ForEach loop only retrieves the final line added to the list, 472 times...
foreach (myMultiElementList tmpClass in EmployeeRolesCountry)
{
string d1Value = tmpClass.Role;
Console.WriteLine(d1Value);
string d2Value = tmpClass.Country;
Console.WriteLine(d2Value);
int d3Value = tmpClass.Commonality;
Console.WriteLine(d3Value);
}
This was the most promising of the potential solutions I found on here, so any pointers greatly appreciated.
EDIT: adding data to EmployeeRolesCountry
/*
Before this starts, data is taken in via a csvReader function and parsed
All of the process below is concerned with two fields in the csv
One is simply the Country. No processing necessary
The other is bio, and this itself needs to be parsed and cleansed several times to take roles out
To keep things making sense, I've taken much of the cleansing out
*/
private void File_upload_Click(object sender, EventArgs e)
{
int pos = 0;
var EmployeeRolesCountry = new List<myMultiElementList>();
var rc1 = new myMultiElementList();
int a = 0;
delimiter = ".";
string token;
foreach (var line in records.Take(100))
{
var fields = line.ToList();
string bio = fields[5];
string country = fields[4];
int role_count = Regex.Matches(bio, delimiter).Count;
a = bio.Length;
for (var i = 0; i < role_count; i++)
{
//here I take first role, by parsing on delimiter, then push back EmployeeRolesCountry with result
pos = bio.IndexOf('.');
if (pos != -1)
{
token = bio.Substring(0, pos);
string original_token = token;
rc1.Role = token.Trim();
rc1.Country = country.Trim();
rc1.Commonality = 1;
EmployeeRolesCountry.Add(rc1);
a = original_token.Length;
bio = bio.Remove(0, a + 1);
}
}
}
}
EDIT:
When grouped by multiple properties, this is how we iterate through the grouped items:
var employeesGroupedByRoleAndCountry = EmployeeRolesCountry.GroupBy(x => new { x.Role, x.Country });
employeesGroupedByRoleAndCountry.ToList().ForEach
(
    (countryAndRole) =>
    {
        Console.WriteLine("Group {0}/{1}", countryAndRole.Key.Country, countryAndRole.Key.Role);
        countryAndRole.ToList().ForEach
        (
            (multiElement) => Console.WriteLine(" : {0}", multiElement.Commonality)
        );
    }
);
__ ORIGINAL POST __
You are instantiating rc1 only once (outside the loop) and adding that same instance to the list every time.
Please make sure that you do
var rc1 = new myMultiElementList();
inside the loop where you are adding the elements, and not outside.
All references are the same in your case:
var obj = new myObj();
for (int i = 0; i < 5; i++)
{
    obj.Prop1 = "Prop" + i;
    list.Add(obj);
}
Now the list has 5 elements, all pointing to obj (the same instance, the same object in memory), and when you do
obj.Prop1 = "Prop" + 5
you update that one object, and every reference in the list points to it. So you are not getting 472 copies of the last item; you are getting the same instance 472 times.
The solution is simple. Create a new instance every time you add to your list:
for (int i = 0; i < 5; i++)
{
    var obj = new myObj();
    obj.Prop1 = "Prop" + i;
    list.Add(obj);
}
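The difference can be checked directly with a small self-contained sketch. The `Item` class here is a hypothetical stand-in for `myObj`, and the loop count of 3 is arbitrary.

```csharp
using System;
using System.Collections.Generic;

class Item
{
    public string Prop1 { get; set; }
}

class Program
{
    static void Main()
    {
        // One instance created outside the loop: every list entry is the same object.
        var shared = new List<Item>();
        var obj = new Item();
        for (int i = 0; i < 3; i++)
        {
            obj.Prop1 = "Prop" + i;
            shared.Add(obj);
        }
        Console.WriteLine(shared[0].Prop1);                        // Prop2
        Console.WriteLine(ReferenceEquals(shared[0], shared[2]));  // True

        // A new instance per iteration: every entry keeps its own value.
        var separate = new List<Item>();
        for (int i = 0; i < 3; i++)
        {
            separate.Add(new Item { Prop1 = "Prop" + i });
        }
        Console.WriteLine(separate[0].Prop1);                          // Prop0
        Console.WriteLine(ReferenceEquals(separate[0], separate[2]));  // False
    }
}
```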
Hope this helps.

Optimizing HTTP request and multiple Split on CSV file

I'm trying to read a CSV file from a website, then split the initial string by \n, then split again by ,.
When I try to print out the content of one of the arrays, it is very slow: it takes almost one second between each Console.WriteLine() call that prints an element.
I'm not entirely sure why it takes so long to print.
Any pointers will help.
public List<string[]> list = new List<string[]>();
public List<string[]> Content
{
    get
    {
        using (var url = new WebClient())
        {
            _content = url.DownloadString("https://docs.google.com/spreadsheets/d/1DDhAd98p5RwXqvV53P2YvaujIQEg28HjeXasrCge9Qo/pub?output=csv");
        }
        var urlArr = _content.Split('\n');
        foreach (var i in urlArr)
        {
            var contentArr = i.Split(',');
            list.Add(contentArr);
        }
        return list;
    }
}
Main
var data = new ReadCSV();
for(var i = 0; i < data.Content[2].Length; i++)
Console.WriteLine(data.Content[2][i]);
You should cache the result in a variable, either inside the Content property or before the loop. Currently your code downloads and splits the string on every access to Content, which happens twice per loop iteration; that is why each line takes about a second.
So, your code should look like this:
var data = new ReadCSV();
var content = data.Content[2];
for (var i = 0; i < content.Length; i++)
    Console.WriteLine(content[i]);
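The caching can also live inside the property, so callers cannot trigger repeated downloads by accident. This is a minimal sketch, assuming a `download` delegate in place of the `WebClient` call so that it runs offline; `CsvSource` and its members are hypothetical names, not part of the original class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class CsvSource
{
    private readonly Func<string> _download;
    private List<string[]> _rows; // filled on first access, reused afterwards

    public CsvSource(Func<string> download)
    {
        _download = download;
    }

    public List<string[]> Content
    {
        get
        {
            if (_rows == null)
            {
                // Download and split once; later reads return the cached rows.
                _rows = _download()
                    .Split('\n')
                    .Select(line => line.Split(','))
                    .ToList();
            }
            return _rows;
        }
    }
}

class Program
{
    static void Main()
    {
        int downloads = 0;
        var data = new CsvSource(() => { downloads++; return "a,b\nc,d\ne,f"; });

        // The property is read twice per iteration, as in the original loop.
        for (var i = 0; i < data.Content[2].Length; i++)
        {
            Console.WriteLine(data.Content[2][i]);
        }
        Console.WriteLine("downloads: " + downloads); // downloads: 1
    }
}
```

This version is not synchronized, so it assumes the property is only read from one thread.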

How do I make List of Lists? And then add to each List values?

class ExtractLinks
{
WebClient contents = new WebClient();
string cont;
List<string> links = new List<string>();
List<string> FilteredLinks = new List<string>();
List<string> Respones = new List<string>();
List<List<string>> Threads = new List<List<string>>();
public void Links(string FileName)
{
HtmlDocument doc = new HtmlDocument();
doc.Load(FileName);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
if (att.Value.StartsWith("http://rotter.net/forum/scoops1"))
{
links.Add(att.Value);
}
}
for (int i = 0; i < links.Count; i++)
{
int f = links[i].IndexOf("#");
string test = links[i].Substring(0, f);
FilteredLinks.Add(test);
}
for (int i = 0; i < FilteredLinks.Count; i++)
{
contents.Encoding = System.Text.Encoding.GetEncoding(1255);
cont = contents.DownloadString(FilteredLinks[i]);
GetResponsers(cont);
}
}
private void GetResponsers(string contents)
{
int f = 0;
int startPos = 0;
while (true)
{
string firstTag = "<FONT CLASS='text16b'>";
string lastTag = "&n";
f = contents.IndexOf(firstTag, startPos);
if (f == -1)
{
break;
}
int g = contents.IndexOf(lastTag, f);
startPos = g + lastTag.Length;
string responser = contents.Substring(f + firstTag.Length, g - f - firstTag.Length);
foreach (List<string> subList in Threads)
{
}
}
}
}
I created this variable :
List<List<string>> Threads = new List<List<string>>();
The first thing I don't know how to do is create a number of lists inside Threads matching FilteredLinks.Count, inside the Links method.
The second thing is that in the GetResponsers method I did:
foreach (List<string> subList in Threads)
{
}
What I want is this: the first time, all the values of the variable responser are added to the first list in Threads. When it reaches the break; it stops, and the Links method calls GetResponsers(cont); again. This time I want all the values of responser to be added to the second list in Threads, and so on.
I know that each time it reaches the break; it moves on to the next link in FilteredLinks.
How do I create a number of lists in Threads according to FilteredLinks.Count?
How do I make the code in GetResponsers add each responser to the right list?
You don't need to specify the number of lists in Threads up front; since it is a list, you can simply keep adding lists to it. So the first part, where you declare it, is already correct.
The second part: your calling code will change. See below.
The third part: change private void GetResponsers(string contents) to private void GetResponsers(List<string> threadList, string contents). See below for the implementation change.
The loop will then look like this:
// ...other code you have
List<List<string>> Threads = new List<List<string>>();

public void Links(string FileName)
{
    // ...other code you have
    for (int i = 0; i < FilteredLinks.Count; i++)
    {
        // one new list per link; GetResponsers fills the list just added
        Threads.Add(new List<string>());
        contents.Encoding = System.Text.Encoding.GetEncoding(1255);
        cont = contents.DownloadString(FilteredLinks[i]);
        GetResponsers(Threads[Threads.Count - 1], cont);
    }
}
private void GetResponsers(List<string> threadList, string contents)
{
int f = 0;
int startPos = 0;
while (true)
{
string firstTag = "<FONT CLASS='text16b'>";
string lastTag = "&n";
f = contents.IndexOf(firstTag, startPos);
if (f == -1)
{
break;
}
int g = contents.IndexOf(lastTag, f);
startPos = g + lastTag.Length;
string responser = contents.Substring(f + firstTag.Length, g - f - firstTag.Length);
threadList.Add(responser);
}
}
PS: Please excuse the formatting.
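An alternative shape, sketched under assumptions: have the parsing method return its own list, so the caller only has to add that list to `Threads`. The `ExtractBetween` name, the sample HTML strings, and the guard for a missing closing tag are illustrative additions, not part of the original code.

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // Collects every substring found between firstTag and lastTag.
    static List<string> ExtractBetween(string contents, string firstTag, string lastTag)
    {
        var found = new List<string>();
        int startPos = 0;
        while (true)
        {
            int f = contents.IndexOf(firstTag, startPos);
            if (f == -1) break;
            int valueStart = f + firstTag.Length;
            int g = contents.IndexOf(lastTag, valueStart);
            if (g == -1) break; // no closing tag: stop instead of throwing in Substring
            found.Add(contents.Substring(valueStart, g - valueStart));
            startPos = g + lastTag.Length;
        }
        return found;
    }

    static void Main()
    {
        // Two hypothetical pages standing in for the downloaded forum threads.
        string[] pages =
        {
            "<FONT CLASS='text16b'>alice&nbsp;<FONT CLASS='text16b'>bob&nbsp;",
            "<FONT CLASS='text16b'>carol&nbsp;"
        };

        var threads = new List<List<string>>();
        foreach (string page in pages)
        {
            threads.Add(ExtractBetween(page, "<FONT CLASS='text16b'>", "&n"));
        }

        for (int i = 0; i < threads.Count; i++)
        {
            Console.WriteLine("Thread " + i + ": " + string.Join(", ", threads[i]));
        }
        // Thread 0: alice, bob
        // Thread 1: carol
    }
}
```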
How do I make a List of Lists? And then add values to each List?
The following code snippet demonstrates how to work with List<List<string>>.
List<List<string>> threads = new List<List<string>>();
List<string> list1 = new List<string>();
list1.Add("List1_1");
list1.Add("List1_2");
threads.Add(list1);
List<string> list2 = new List<string>();
list2.Add("List2_1");
list2.Add("List2_2");
list2.Add("List2_3");
threads.Add(list2);
How do I create a number of Lists in Threads according to FilteredLinks.Count?
for(int i = 0; i < FilteredLinks.Count; i++)
{
var newList = new List<string>();
newList.Add("item1"); //add whatever you wish, here.
newList.Add("item2");
Threads.Add(newList);
}
I'm afraid I can't help you with Question #2, since I don't understand exactly what you are trying to achieve there.

Add Multiple record using Linq-to-SQL

I want to add multiple rows to a table using LINQ to SQL.
public static FeedbackDatabaseDataContext context = new FeedbackDatabaseDataContext();
public static bool Insert_Question_Answer(List<QuestionClass.Tabelfields> AllList)
{
Feedback f = new Feedback();
List<Feedback> fadd = new List<Feedback>();
for (int i = 0; i < AllList.Count; i++)
{
f.Email = AllList[i].Email;
f.QuestionID = AllList[i].QuestionID;
f.Answer = AllList[i].SelectedOption;
fadd.Add(f);
}
context.Feedbacks.InsertAllOnSubmit(fadd);
context.SubmitChanges();
return true;
}
When I add records to the list object, i.e. fadd, every record is overwritten with the last value of AllList.
I'm late to the party, but I thought you might want to know that the for-loop is unnecessary. Better use foreach (you don't need the index).
It gets even more interesting when you use LINQ (renamed method for clarity):
public static void InsertFeedbacks(IEnumerable<QuestionClass.Tabelfields> allList)
{
var fadd = from field in allList
select new Feedback
{
Email = field.Email,
QuestionID = field.QuestionID,
Answer = field.SelectedOption
};
context.Feedbacks.InsertAllOnSubmit(fadd);
context.SubmitChanges();
}
By the way, you shouldn't keep one data context that you access all the time; it's better to create one locally, inside a using statement, that will properly handle the database disconnection.
You should create the Feedback object inside the for loop, so that each iteration adds a new instance. Change your method to:
public static bool Insert_Question_Answer(List<QuestionClass.Tabelfields> AllList)
{
List<Feedback> fadd = new List<Feedback>();
for (int i = 0; i < AllList.Count; i++)
{
Feedback f = new Feedback();
f.Email = AllList[i].Email;
f.QuestionID = AllList[i].QuestionID;
f.Answer = AllList[i].SelectedOption;
fadd.Add(f);
}
context.Feedbacks.InsertAllOnSubmit(fadd);
context.SubmitChanges();
return true;
}

When searching for an item in a generic list should I use LINQ or Contains?

I have a generic List and I have to find a particular string in this list. Could you please let me know which is the best approach in the below?
if (strlist.Contains("Test"))
{
// String found
}
or
string res = (from d in strlist where d == "Test" select d).SingleOrDefault();
if (res == "Test")
{
//found
}
Please consider that the list may be very large, as it is populated from a database. Your thoughts on this are highly appreciated.
If you have List<string> (or even IEnumerable<string>) and Contains meets your needs, then use Contains.
If you need some extra handling that Contains doesn't provide, I would suggest using Any():
if (strList.Any(s => s.StartsWith("Tes")))
{
// Found
}
The two methods will behave differently if there is more than one match; the first one will return true and the second one will throw an exception.
To correct that, change SingleOrDefault to FirstOrDefault.
To answer the question, you should call Contains if you're searching for an exact match and Any if you aren't.
For example:
if (strings.Contains("SomeString", StringComparer.OrdinalIgnoreCase))
if (strings.Any(s => s.StartsWith("b")))
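The behaviours described above can be compared side by side in a small sketch. The sample list is made up for illustration and deliberately contains a duplicate "Test" entry.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main()
    {
        var strings = new List<string> { "alpha", "Test", "beta", "Test" };

        // Exact match, optionally with a comparer.
        Console.WriteLine(strings.Contains("Test"));                                   // True
        Console.WriteLine(strings.Contains("test", StringComparer.OrdinalIgnoreCase)); // True

        // Predicate match.
        Console.WriteLine(strings.Any(s => s.StartsWith("b")));                        // True

        // FirstOrDefault tolerates duplicates; SingleOrDefault does not.
        Console.WriteLine(strings.FirstOrDefault(s => s == "Test"));                   // Test
        try
        {
            strings.SingleOrDefault(s => s == "Test");
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("SingleOrDefault threw: more than one match");
        }
    }
}
```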
You really should use HashSet<string>, as the performance of Contains is dramatically better. If you need a list for other operations, you can simply keep both available.
var list = BuildListOfStrings();
var set = new HashSet<string>(list);
if (set.Contains("Test"))
{
// ...
}
Now you have the ability to find items in the set as an O(1) operation.
Test
static void Main(string[] args)
{
var lst = GenerateStrings().Take(5000000).ToList();
var hsh = new HashSet<string>(lst);
var found = false;
var count = 100;
var sw = Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
hsh = new HashSet<string>(lst);
}
Console.WriteLine(TimeSpan.FromTicks(sw.ElapsedTicks / count));
sw = Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
found = lst.Contains("12345678");
}
Console.WriteLine(TimeSpan.FromTicks(sw.ElapsedTicks / count));
sw = Stopwatch.StartNew();
for (int i = 0; i < count; i++)
{
found = hsh.Contains("12345678");
}
Console.WriteLine(TimeSpan.FromTicks(sw.ElapsedTicks / count));
Console.WriteLine(found);
Console.ReadLine();
}
private static IEnumerable<string> GenerateStrings()
{
var rnd = new Random();
while (true)
{
yield return rnd.Next().ToString();
}
}
Result
0.308438 s
0.0197868 s
0.0 s
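So what does this tell us? Building the HashSet<string> costs about 0.31 s here, while a single List<string>.Contains call costs about 0.02 s, so the set only pays for itself after roughly 16 lookups. If you are making a small number of calls to Contains, use a List<string>; otherwise use a HashSet<string>.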
