Why this difference between foreach vs Parallel.ForEach?

Can anyone explain to me in simple language why I get a file of about 65 k when using foreach and more than 3 GB when using Parallel.ForEach?
The code for the foreach:
// start node xml document
var logItems = new XElement("log", new XAttribute("start", DateTime.Now.ToString("yyyy-MM-ddTHH:mm:ss")));
var products = new ProductLogic().SelectProducts();
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
// loop through all products
foreach (var product in products)
{
    // is in a specific group
    var id = Convert.ToInt32(product["ProductID"]);
    var isInGroup = productGroupLogic.GetProductGroups(new int[] { id }.ToList(), groupId).Count > 0;
    // get product stock per option
    var productSizes = productOptionLogic.GetProductStockByProductId(id).ToList();
    // any stock available
    var stock = productSizes.Sum(ps => ps.Stock);
    var hasStock = stock > 0;
    // get webpage for this product
    var productUrl = string.Format(url, id);
    var htmlPage = Html.Page.GetWebPage(productUrl);
    // check if there is anything to log
    var addToLog = false;
    XElement sizeElements = null;
    // if has no stock or in group
    if (!hasStock || isInGroup)
    {
        // page shows => not ok => LOG!
        if (!htmlPage.NotFound) addToLog = true;
    }
    // if page is ok
    if (htmlPage.IsOk)
    {
        sizeElements = GetSizeElements(htmlPage.Html, productSizes);
        addToLog = sizeElements != null;
    }
    if (addToLog) logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
}
// save
var xDocument = new XDocument(new XDeclaration("1.0", "utf-8", "yes"), new XElement("log", logItems));
xDocument.Save(fileName);
Switching to the parallel version is a minor change; I just replaced the foreach with Parallel.ForEach:
// loop through all products
Parallel.ForEach(products, product =>
{
    ... code ...
});
The methods GetSizeElements and CreateElement are both static.
update1
I made the methods GetSizeElements and CreateElement thread-safe with a lock, but that doesn't help either.
update2
I got answers that solve the problem. That's nice and fine. But I would like more insight into why this code creates a file that is so much bigger than the foreach solution. I am trying to get a better sense of how the code works when using threads. That way I gain more insight and can learn to avoid the pitfalls.

One thing stands out:
if (addToLog)
    logItems.Add(CreateElement(productUrl, htmlPage, stock, isInGroup, sizeElements));
logItems is not thread-safe. That could be your core problem, but there are lots of other possibilities.
You have the output files; look for the differences between them.
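One way to rule out the unsynchronized logItems.Add is to keep the worker threads away from the XElement entirely: collect results in a thread-safe collection and build the XML on a single thread afterwards. The following is only a minimal sketch under stated assumptions. CreateItem is a hypothetical stand-in for the question's CreateElement, and plain integer ids stand in for the product rows.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Xml.Linq;

public static class ParallelLogSketch
{
    // Hypothetical stand-in for the per-product work done in the question.
    private static XElement CreateItem(int productId) =>
        new XElement("item", new XAttribute("id", productId));

    public static XElement BuildLog(IEnumerable<int> productIds)
    {
        // Thread-safe holding area: worker threads never touch the XElement tree.
        var results = new ConcurrentBag<XElement>();

        Parallel.ForEach(productIds, id =>
        {
            results.Add(CreateItem(id));
        });

        // Back on a single thread: add the items in a deterministic order.
        var log = new XElement("log");
        foreach (var item in results.OrderBy(e => (int)e.Attribute("id")))
        {
            log.Add(item);
        }
        return log;
    }
}
```

Calling ParallelLogSketch.BuildLog(Enumerable.Range(1, 1000)) should yield exactly one item element per id. Comparing such a result against the output of the plain foreach version is a quick way to check whether the shared XElement was the culprit.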

Try declaring the following variables inside the loop body instead of outside it:
var productGroupLogic = new ProductGroupLogic();
var productOptionLogic = new ProductOptionLogic();
I think these two instances are shared by all the threads of the Parallel.ForEach loop, so the results get multiplied unnecessarily.

Related

Doc2Vec (Or Word2Vec) In Catalyst C#: How Do I get it to give results? (FastText)

I'm trying to replicate results from Gensim in C# to compare results and see if we need to bother trying to get Python to work within our broader C# context. I have been programming in C# for about a week, am usually a Python coder. I managed to get LDA to function and assign topics with C#, but there is no Catalyst model (that I could find) that does Doc2Vec explicitly, but rather I need to do something with FastText as they have in their sample code:
// Training a new FastText word2vec embedding model is as simple as this:
var nlp = await Pipeline.ForAsync(Language.English);
var ft = new FastText(Language.English, 0, "wiki-word2vec");
ft.Data.Type = FastText.ModelType.CBow;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(nlp.Process(GetDocs()));
ft.StoreAsync();
The claim is that it is simple, and fair enough... but what do I do with this? I am using my own data, a list of IDocuments, each with a label attached:
using (var csv = CsvDataReader.Create("Jira_Export_Combined.csv", new CsvDataReaderOptions
{
    BufferSize = 0x20000
}))
{
    while (await csv.ReadAsync())
    {
        var a = csv.GetString(1); // issue key
        var b = csv.GetString(14); // the actual bug
        // if (jira_base.Keys.ToList().Contains(a) == false)
        if (jira.Keys.ToList().Contains(a) == false)
        { // not already in our dictionary... too many repeats
            if (b.Contains("{panel"))
            {
                // get just the details/desc/etc
                b = b.Substring(b.IndexOf("}") + 1, b.Length - b.IndexOf("}") - 1);
                try { b = b.Substring(0, b.IndexOf("{panel}")); }
                catch { }
            }
            b = b.Replace("\r\n", "");
            jira.Add(a, nlp.ProcessSingle(new Document(b,Language.English)));
        } // end if
    } // end while loop
} // end using
That builds the documents from a set of Jira tasks. Then I add labels:
foreach (KeyValuePair<string, IDocument> item in jira) { jira[item.Key].Labels.Add(item.Key); }
Then I add them to a list, based on a breakdown from a topic model: every doc that is at or above a threshold for a topic is assigned to that topic in jira_topics[n], where n is the topic number:
var training_lst = new List<IDocument>();
foreach (var doc in jira_topics[topic_num]) { training_lst.Add(jira[doc]); }
When I run the following code:
// FastText....
var ft = new FastText(Language.English, 0, $"vector-model-topic_{topic_num}");
ft.Data.Type = FastText.ModelType.Skipgram;
ft.Data.Loss = FastText.LossType.NegativeSampling;
ft.Train(training_lst);
var wtf = ft.PredictMax(training_lst[0]);
wtf is (null,NaN). [hence the name]
What am I missing? What else do I need to do to get Catalyst to vectorize my data? I want to grab the cosine similarities between the jira tasks and some other data I have, but I can't even get the Jira data into anything resembling a vectorization I can apply to something. Help!
Update:
So, Predict methods apparently only work for supervised learning in FastText (see comments below). And the following:
var wtf = ft.CompareDocuments(training_lst[0], training_lst[0]);
Throws an Implementation error (and only doesn't work with PVDM). How do I use PVDM, PVDCbow in Catalyst?

DbSet.Where() Returning No Records on Query Even When They Exist in Dataset

Okay, so I'm going crazy here. I've used DbSet.Where a thousand times, and for whatever reason it's not working in this particular xUnit test. The issue seems to be rooted in my Where statement, which tries to get the list of ingredients with RecipeId == 1 so I can delete them. When I stop the code and look at my locals, the params are set to 1 where designated, but the Where won't pick them up.
I've consolidated the code a bit to make it more readable here, but it still doesn't work as is. What the heck am I missing?
[Fact]
public void DeleteIngredientListWithId_ReturnsProperCount()
{
    //Arrange
    var dbOptions = new DbContextOptionsBuilder<IngredientDbContext>()
        .UseInMemoryDatabase(databaseName: $"IngredientDb{Guid.NewGuid()}")
        .Options;
    var sieveOptions = Options.Create(new SieveOptions());
    var fakeIngredientOne = new Ingredient { RecipeId = 1 };
    var fakeIngredientTwo = new Ingredient { RecipeId = 1 };
    var fakeIngredientThree = new Ingredient { RecipeId = 2 };
    //Act
    using (var context = new IngredientDbContext(dbOptions))
    {
        context.Ingredients.AddRange(fakeIngredientOne, fakeIngredientTwo, fakeIngredientThree);
        var service = new IngredientRepository(context, new SieveProcessor(sieveOptions));
        var ingredients = context.Ingredients.Where(i => i.RecipeId == 1).ToList();
        context.Ingredients.RemoveRange(ingredients);
        context.SaveChanges();
        //Assert
        var ingredientList = context.Ingredients.ToList();
        ingredientList.Should().ContainEquivalentOf(fakeIngredientThree);
        ingredientList.Should().HaveCount(1);
        context.Database.EnsureDeleted();
    }
}

It looks like you are not persisting the records you added before you query for them (and then try to remove them). The Where method queries the database, which is empty until you call SaveChanges().
Until you save the changes, the pending additions are probably waiting for you in context.Ingredients.Local.
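The effect can be reproduced in isolation. The sketch below assumes the Microsoft.EntityFrameworkCore.InMemory package is referenced; the Ingredient and IngredientDbContext classes are simplified stand-ins written for this example, not the asker's real types.

```csharp
using System;
using System.Linq;
using Microsoft.EntityFrameworkCore;

// Simplified stand-ins for the types used in the question.
public class Ingredient
{
    public int Id { get; set; }
    public int RecipeId { get; set; }
}

public class IngredientDbContext : DbContext
{
    public IngredientDbContext(DbContextOptions<IngredientDbContext> options)
        : base(options) { }

    public DbSet<Ingredient> Ingredients => Set<Ingredient>();
}

public static class SaveBeforeQuerySketch
{
    // Returns how many RecipeId == 1 rows a query sees before and after SaveChanges.
    public static (int BeforeSave, int AfterSave) CountMatches()
    {
        var options = new DbContextOptionsBuilder<IngredientDbContext>()
            .UseInMemoryDatabase($"IngredientDb{Guid.NewGuid()}")
            .Options;

        using var context = new IngredientDbContext(options);
        context.Ingredients.AddRange(
            new Ingredient { RecipeId = 1 },
            new Ingredient { RecipeId = 1 },
            new Ingredient { RecipeId = 2 });

        // The entities are only tracked so far; the query runs against the store.
        int beforeSave = context.Ingredients.Count(i => i.RecipeId == 1);

        // Persist the pending additions, then run the same query again.
        context.SaveChanges();
        int afterSave = context.Ingredients.Count(i => i.RecipeId == 1);

        return (beforeSave, afterSave);
    }
}
```

In the test from the question, the equivalent fix would be a context.SaveChanges() call directly after AddRange, before the Where query runs.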

HTMLAgilityPack selects nodes from first iteration through divs

I'm trying to use HTMLAgilityPack to parse a website for the first time. Everything works as expected, but only for the first iteration. On each iteration I get a unique div with its own data, but SelectNodes() always returns the data from the first iteration.
The code listed below shows the problem.
All the properties of every station get their values from the first iteration.
static void Main(string[] args)
{
    List<Station> stations = new List<Station>();
    wClient = new WebClient();
    wClient.Proxy = null;
    wClient.Encoding = encode;
    for (int i = 1; i <= 1; i++)
    {
        HtmlDocument html = new HtmlDocument();
        string link = string.Format("http://energybase.ru/powerPlant/index?PowerPlant_page={0}&pageSize=20&q=/powerPlant", i);
        html.LoadHtml(wClient.DownloadString(link));
        var stationList = html.DocumentNode.SelectNodes("//div[@class='items']").First().ChildNodes.Where(x=>x.Name=="div").ToList();//get list of nodes with PowerStation Data
        foreach (var item in stationList) //each iteration returns Item with unique InnerHTML
        {
            Station st = new Station();
            st.Name = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].InnerText;//gets name from first iteration
            st.Url = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["a"].Attributes["href"].Value;//gets url from first iteration and so on
            st.Company = item.SelectNodes("//div[@class='col-md-20']").First().SelectNodes("//div[@class='name']").First().ChildNodes["small"].ChildNodes["em"].ChildNodes["a"].InnerText;
            stations.Add(st);
        }
    }
}
Maybe I am missing some of the essentials of OOP?

Your code can be greatly simplified by using the full power of XPath.
var stationList = html.DocumentNode.SelectNodes("//div[@class='items']/div");
// The XPath expression could also be: "//div[@class='items'][1]/div"
// where [1] means the first node
foreach (var item in stationList)
{
    Station st = new Station();
    st.Name = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").InnerText;
    st.Url = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/a").Attributes["href"].Value;
    string rawText = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']/small/em").InnerText;
    st.Company = HttpUtility.HtmlDecode(rawText.Trim());
    stations.Add(st);
}
Your mistake was starting the expressions with //div. A leading // searches from the root of the document rather than from the current node, so every iteration matched the first station's elements.
Even better, rewrite the code like this:
var divName = item.SelectSingleNode("div[@class='col-md-20']/div[@class='name']");
var nodeA = divName.SelectSingleNode("a");
st.Name = nodeA.InnerText;
st.Url = nodeA.Attributes["href"].Value;
string rawText = divName.SelectSingleNode("small/em").InnerText;
st.Company = HttpUtility.HtmlDecode(rawText.Trim());
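The difference between an absolute and a relative expression can be seen without downloading anything. This is a rough sketch that uses the XPath support in System.Xml.Linq as a stand-in for Html Agility Pack, on the assumption that both follow the same XPath rule for a leading //. The markup and the class and method names are made up for the example.

```csharp
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathAxisSketch
{
    // Hypothetical markup: two "stations", each with its own name div.
    private const string Markup =
        "<root><div class='items'>" +
        "<div><div class='name'>First</div></div>" +
        "<div><div class='name'>Second</div></div>" +
        "</div></root>";

    // "//div[...]" ignores the context node and searches the whole document.
    public static string[] NamesWithAbsolutePath()
    {
        var doc = XDocument.Parse(Markup);
        return doc.XPathSelectElements("//div[@class='items']/div")
            .Select(item => item.XPathSelectElement("//div[@class='name']").Value)
            .ToArray();
    }

    // "div[...]" is relative: it only looks at children of the context node.
    public static string[] NamesWithRelativePath()
    {
        var doc = XDocument.Parse(Markup);
        return doc.XPathSelectElements("//div[@class='items']/div")
            .Select(item => item.XPathSelectElement("div[@class='name']").Value)
            .ToArray();
    }
}
```

With the absolute form both lookups land on the first name div, while the relative form returns one name per item. When the target is nested deeper than a direct child, ".//div[@class='name']" keeps the search inside the current node.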

This article contains some good examples of various aspects of Html Agility Pack.
Have a look at this article; it should give you a quick start.

How does paging work on the C# Facebook sdk

Cannot find any documentation for this...
Currently using the following code to get a list of my photos:
FacebookApp fb = new FacebookApp(accessToken);
dynamic test = fb.Get("me/photos");
I'm cycling through the first 25 photos that it returns. Simple.
Now how do I get it to return the next 25?
So far I've tried this:
FacebookApp fb = new FacebookApp(accessToken);
string query = "me/photos";
while (true)
{
    dynamic test = fb.Get(query);
    foreach (dynamic each in test.data)
    {
        // do something here
    }
    query = test.paging.next;
}
but it fails, throwing:
Could not parse '2010-08-30T17%3A58%3A56%2B0000' into a date or time.
Do I have to use a fresh dynamic variable for every request, or am I going about this the wrong way completely?

Ended up finding this:
// first set (1-25)
dynamic parameters = new ExpandoObject();
parameters.limit = 25;
parameters.offset = 0;
app.Api("me/friends", parameters);
// next set (26-50)
parameters = new ExpandoObject();
parameters.limit = 25;
parameters.offset = 25;
app.Api("me/friends", parameters);

I also found you can use this:
// for the first 25 albums (in this case) 1-25
dynamic albums = client.Get("me/albums", new { limit = "25", offset = "0"});
// for the next 25 albums, 26-50
albums = client.Get("me/albums", new { limit = "25", offset = "25"});
It worked the same as the approach above.
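The limit/offset pattern generalizes to a loop that keeps requesting pages until one comes back short. The sketch below is an illustration only: fetchPage is a hypothetical delegate standing in for a call such as client.Get("me/albums", new { limit, offset }), and treating a short page as the end of the data is an assumption about the endpoint, not documented SDK behavior.

```csharp
using System;
using System.Collections.Generic;

public static class OffsetPagingSketch
{
    // fetchPage(limit, offset) is a hypothetical stand-in for the real API call.
    public static List<T> FetchAll<T>(
        Func<int, int, IReadOnlyList<T>> fetchPage,
        int pageSize = 25)
    {
        var all = new List<T>();
        int offset = 0;

        while (true)
        {
            IReadOnlyList<T> page = fetchPage(pageSize, offset);
            all.AddRange(page);

            // Assumption: a short or empty page means nothing is left to request.
            if (page.Count < pageSize) break;

            offset += pageSize;
        }

        return all;
    }
}
```

Because the loop only ever sends a numeric limit and offset, it never has to parse the paging.next URL that caused the date-parsing error in the question.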

Errors when creating a custom Querable object with MVC and Subsonic pagedlist

Hiya, I have the following code, but when I try to create a new IQueryable I get an error that the interface cannot be implemented, and if I take away the new I get a not implemented exception. I have had to jump back and work on some old classic ASP sites for the past month, and for the life of me I cannot wake my brain up into C# mode.
Could you please have a look at the code below and give me some clues about where I'm going wrong?
The code is meant to create a list of price items, but instead of a category ID (int) I am going to show the category name as a string.
public ActionResult ViewPriceItems(int? page)
{
    var crm = 0;
    page = GetPage(page);
    // try and create items2
    IQueryable<ViewPriceItemsModel> items2 = new IQueryable<ViewPriceItemsModel>();
    // the data to be paged,but unmodified
    var olditems = PriceItem.All().OrderBy(x => x.PriceItemID);
    foreach (var item in olditems)
    {
        // set category as the name not the ID for easier reading
        items2.Concat(new [] {new ViewPriceItemsModel {ID = item.PriceItemID,
            Name = item.PriceItem_Name,
            Category = PriceCategory.SingleOrDefault(
                x => x.PriceCategoryID == item.PriceItem_PriceCategory_ID).PriceCategory_Name,
            Display = item.PriceItems_DisplayMethod}});
    }
    crm = olditems.Count() / MaxResultsPerPage;
    ViewData["numtpages"] = crm;
    ViewData["curtpage"] = page + 1;
    // return a paged result set
    return View(new PagedList<ViewPriceItemsModel>(items2, page ?? 0, MaxResultsPerPage));
}
Many thanks.

You do not need to create items2 up front. Remove the line under the comment "try and create items2" and use the following code instead. I have not tested it, but I hope it works.
var items2 = (from item in olditems
              select new ViewPriceItemsModel
              {
                  ID = item.PriceItemID,
                  Name = item.PriceItem_Name,
                  Category = PriceCategory.SingleOrDefault(
                      x => x.PriceCategoryID == item.PriceItem_PriceCategory_ID).PriceCategory_Name,
                  Display = item.PriceItems_DisplayMethod
              }).AsQueryable();
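Two things are at play here: an interface such as IQueryable cannot be instantiated with new, and Concat returns a new sequence rather than changing the one it is called on, so calling it in a loop without using the result has no effect. The sketch below shows both points on in-memory data. SourceItem and ItemView are hypothetical stand-ins for PriceItem and ViewPriceItemsModel.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for PriceItem and ViewPriceItemsModel.
public class SourceItem
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class ItemView
{
    public int Id { get; set; }
    public string Label { get; set; } = "";
}

public static class ProjectionSketch
{
    // Concat does not mutate its receiver; the result here is thrown away.
    public static int CountAfterDiscardedConcat()
    {
        IEnumerable<ItemView> views = Enumerable.Empty<ItemView>();
        views.Concat(new[] { new ItemView { Id = 1, Label = "a" } });
        return views.Count();
    }

    // Projecting with Select builds the view models in one expression.
    public static IQueryable<ItemView> Project(IEnumerable<SourceItem> items)
    {
        return items
            .OrderBy(item => item.Id)
            .Select(item => new ItemView
            {
                Id = item.Id,
                Label = item.Name.ToUpperInvariant()
            })
            .AsQueryable();
    }
}
```

CountAfterDiscardedConcat returns 0 because the concatenated sequence is never assigned, which is why the Select-based projection is the more natural way to build the list.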
