I've created an algorithm which weighs the relevance of a list of articles against two lists of keywords that correlate to attributes of the article.
It works great and is super efficient... but it's a mess. It's not terribly readable, so it's difficult to discern what's going on.
The operation in pseudo code goes something like this:
Loop through every article in a list called articles (List<Article>)
For every article, loop through every role in a list of roles (List<string>)
Check to see if the current article has any roles (Article.Roles is a List<string>)
If yes, then loop through each role in the article and try to match a role on the article to the role in the current loop
If a match is found, add weight to the article. If the index of the role on the article and the role in the roles list are both index 0 (in primary position), add extra weight for two matching primaries
Repeat for topics, but with no bonus for primary matches
What would be a better way to write the following code? I can't use foreach except in one or two places, because I need to match indexes to know what value to add on a match.
private static List<Article> WeighArticles(List<Article> articles, List<string> roles, List<string> topics, List<string> industries)
{
    var returnList = new List<Article>();
    for (int currentArticle = 0; currentArticle < articles.Count; currentArticle++)
    {
        for (int currentRole = 0; currentRole < roles.Count; currentRole++)
        {
            if (articles[currentArticle].Roles != null && articles[currentArticle].Roles.Count > 0)
            {
                for (int currentArticleRole = 0; currentArticleRole < articles[currentArticle].Roles.Count; currentArticleRole++)
                {
                    if (articles[currentArticle].Roles[currentArticleRole].ToLower() == roles[currentRole].ToLower())
                    {
                        if (currentArticleRole == 0 && currentRole == 0)
                            articles[currentArticle].Weight += 3;
                        else
                            articles[currentArticle].Weight += 1;
                    }
                }
            }
        }
        for (int currentTopic = 0; currentTopic < topics.Count; currentTopic++)
        {
            if (articles[currentArticle].Topics != null && articles[currentArticle].Topics.Count > 0)
            {
                for (int currentArticleTopic = 0; currentArticleTopic < articles[currentArticle].Topics.Count; currentArticleTopic++)
                {
                    if (articles[currentArticle].Topics[currentArticleTopic].ToLower() == topics[currentTopic].ToLower())
                    {
                        articles[currentArticle].Weight += 0.8;
                    }
                }
            }
        }
        returnList.Add(articles[currentArticle]);
    }
    return returnList;
}
// Article class stub (unused properties left out)
public class Article
{
    public List<string> Roles { get; set; }
    public List<string> Topics { get; set; }
    public double Weight { get; set; }
}
If you examine your code, you'll find that you are asking the Article class for data too many times. Use the Tell, Don't Ask principle and move the weight-adding logic into the Article class, where it belongs. That will increase the cohesion of Article and make your original code much more readable. Here is how your original code will look:
foreach (var article in articles)
{
    article.AddRoleWeights(roles);
    article.AddTopicWeights(topics);
}
And Article will look like:
public double Weight { get; private set; } // you probably don't need a public setter

public void AddRoleWeights(IEnumerable<string> roles)
{
    const double RoleWeight = 1;
    const double PrimaryRoleWeight = 3;
    if (!roles.Any())
        return;
    if (Roles == null || !Roles.Any())
        return;
    var primaryRole = roles.First();
    var comparison = StringComparison.CurrentCultureIgnoreCase;
    if (String.Equals(Roles[0], primaryRole, comparison))
    {
        Weight += PrimaryRoleWeight;
        return;
    }
    foreach (var role in roles)
        if (Roles.Contains(role, StringComparer.CurrentCultureIgnoreCase))
            Weight += RoleWeight;
}
Adding topic weights:
public void AddTopicWeights(IEnumerable<string> topics)
{
    const double TopicWeight = 0.8;
    if (Topics == null || !Topics.Any() || !topics.Any())
        return;
    foreach (var topic in topics)
        if (Topics.Contains(topic, StringComparer.CurrentCultureIgnoreCase))
            Weight += TopicWeight;
}
Okay, you have several design flaws in your code:
1 - It's too procedural. You need to learn to write code that tells the machine "what you want" as opposed to "how to do it", similar to the analogy of going to a bar and just asking for a drink instead of instructing the bartender about the exact proportions of everything.
2 - Collections should NEVER be null, which means that checking for articles[x].Roles != null makes no sense at all (a minimal initializer sketch follows this list).
3 - Iterating over a List<string> and comparing each element with someOtherString makes no sense either. Use List<T>.Contains() instead.
4 - You're grabbing each and every one of the items in the input list and outputting them in a new list. Also nonsense. Either return the input list directly or create a new list by using inputList.ToList()
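For point 2, a minimal sketch of one way to guarantee the collections are never null, initializing them in the constructor of the Article stub from the question:

public class Article
{
    public Article()
    {
        // Empty, never null: callers can always iterate safely.
        Roles = new List<string>();
        Topics = new List<string>();
    }

    public List<string> Roles { get; set; }
    public List<string> Topics { get; set; }
    public double Weight { get; set; }
}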
All in all, here's a more idiomatic C# way of writing that code:
private static List<Article> WeighArticles(List<Article> articles, List<string> roles, List<string> topics, List<string> industries)
{
    var firstRole = roles.FirstOrDefault();
    var firstArticle = articles.FirstOrDefault();
    var firstArticleRole = firstArticle.Roles.FirstOrDefault();
    if (firstArticleRole != null && firstRole != null &&
        firstRole.ToLower() == firstArticleRole.ToLower())
        firstArticle.Weight += 3;

    var remaining = from a in articles.Skip(1)
                    from r in roles.Skip(1)
                    from ar in a.Roles.Skip(1)
                    where ar.ToLower() == r.ToLower()
                    select a;
    foreach (var article in remaining)
        article.Weight += 1;

    var hastopics = from a in articles
                    from t in topics
                    from at in a.Topics
                    where at.ToLower() == t.ToLower()
                    select a;
    foreach (var article in hastopics)
        article.Weight += .8;

    return articles;
}
There are even better ways to write this, such as using .Take(1) instead of .FirstOrDefault()
Use the Extract Method refactoring on each for loop and give each one a semantic name: WeightArticlesForRole, WeightArticlesForTopic, etc. This will eliminate the nested loops (they are still there, but hidden behind a function call that takes a list).
It will also make your code self-documenting and more readable: you have now boiled a loop down to a named method that reflects what it accomplishes. Those reading your code will be most interested in understanding what it accomplishes before trying to understand how it accomplishes it; semantic/conceptual function names facilitate this, and readers can use Go To Definition to learn the how after they understand the what. Provide a summary tag comment for each method with an elaborated explanation (similar to your pseudocode), and now others can wrap their heads around what your code is doing without having to tediously read implementation details they aren't concerned with. A rough sketch of the extraction follows.
The refactored methods will likely have some dirty-looking parameters, but they will be private methods, so I generally don't worry about this. However, sometimes it helps me see which dependencies should probably be removed, and how to restructure the code at the call site so it can be reused from multiple places. I suspect that with a parameter for the weighting and some delegate functions, you might be able to combine WeightArticlesForRole and WeightArticlesForTopic into a single function reused in both places.
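A rough sketch of the role extraction (reusing the weights from the original code; method and variable names are only suggestions, and a case-insensitive comparison replaces the ToLower() calls):

private static void WeightArticlesForRole(List<Article> articles, List<string> roles)
{
    foreach (var article in articles)
    {
        if (article.Roles == null || article.Roles.Count == 0)
            continue;
        for (int roleIndex = 0; roleIndex < roles.Count; roleIndex++)
        {
            for (int articleRoleIndex = 0; articleRoleIndex < article.Roles.Count; articleRoleIndex++)
            {
                if (string.Equals(article.Roles[articleRoleIndex], roles[roleIndex], StringComparison.OrdinalIgnoreCase))
                {
                    // Two matching primaries (both at index 0) earn the bonus weight.
                    article.Weight += (articleRoleIndex == 0 && roleIndex == 0) ? 3 : 1;
                }
            }
        }
    }
}

With a matching WeightArticlesForTopic, the original method reduces to two named calls and a return, which reads almost like the pseudocode in the question.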
I have two variants of code.
First:
struct pair_fiodat { public string fio; public string dat; }
List<pair_fiodat> list_fiodat = new List<pair_fiodat>();
// list filled with 200,000 records, omitted.
foreach (string fname in XML_files)
{
    // get FullName and BirthDate from file. Omitted.
    var usersLookUp = list_fiodat.ToLookup(u => u.fio, u => u.dat); // create map
    var dates = usersLookUp[FullName];
    if (dates.Count() > 0)
    {
        foreach (var dt in dates)
        {
            if (dt == BirthDate) return true;
        }
    }
}
and second:
struct pair_fiodat { public string fio; public string dat; }
List<pair_fiodat> list_fiodat = new List<pair_fiodat>();
// list filled with 200,000 records, omitted.
foreach (string fname in XML_files)
{
    // get FullName and BirthDate from file. Omitted.
    var members = from s in list_fiodat
                  where s.fio == FullName && s.dat == BirthDate
                  select s;
    if (members.Count() > 0) return true;
}
They do the same job - searching for a user by name and birthday.
The first one works very quickly.
The second is very slow (10x-50x slower).
Please tell me: is it possible to accelerate the second one?
I mean, maybe the list needs some special preparation?
I tried sorting: list_fiodat_sorted = list_fiodat.OrderBy(x => x.fio).ToList();, but...
I skipped your first test and changed Count() to Any() (Count() iterates the whole list, while Any() stops as soon as there is an element):
public bool Test1(List<pair_fiodat> list_fiodat)
{
    foreach (string fname in XML_files)
    {
        var members = from s in list_fiodat
                      where s.fio == fname && s.dat == BirthDate
                      select s;
        if (members.Any())
            return true;
    }
    return false;
}
If you want to optimize something, you must give up some of the comfortable things the language offers you, because those things usually aren't free - they have a cost.
For example, for is faster than foreach. It's a bit uglier and you need an extra statement to get the variable, but it's faster. If you iterate a very big collection, every iteration adds up.
LINQ is very powerful and it's wonderful to work with, but it has a cost. If you replace it with another for, you save time.
public bool Test2(List<pair_fiodat> list_fiodat)
{
    for (int i = 0; i < XML_files.Count; i++)
    {
        string fname = XML_files[i];
        for (int j = 0; j < list_fiodat.Count; j++)
        {
            var s = list_fiodat[j];
            if (s.fio == fname && s.dat == BirthDate)
            {
                return true;
            }
        }
    }
    return false;
}
With normal collections there's no difference and you'd usually use foreach, LINQ... but in extreme cases, you must go low-level.
In your first test, ToLookup is the key. It takes a long time. Think about it: you iterate your entire list, creating and filling the map. That's costly in any case, but consider the case where the item you are looking for is at the start of the list: you would only need a few iterations to find it, yet you spend time on every item in your list creating the map. Only in the worst case is the time similar, and it's always worse with the map because of the cost of building it.
The map is interesting if you need it repeatedly - for example, to get all the items matching some condition rather than to find a single item. You spend time creating the map once, but you use it many times, and each use saves time (the map is "direct access" versus the "sequential" scan of the for).
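To make that concrete, here is a sketch reusing the names from the question (FullName and BirthDate come from the omitted file-reading code): build the lookup once, before the file loop, so its one-time cost is amortized across every file:

// Build the map once: fio -> all dates recorded for that name.
var usersLookUp = list_fiodat.ToLookup(u => u.fio, u => u.dat);
foreach (string fname in XML_files)
{
    // get FullName and BirthDate from file. Omitted.
    // "Direct access" per file instead of a full sequential scan.
    if (usersLookUp[FullName].Contains(BirthDate))
        return true;
}
return false;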
I currently have a list of objects that I am trying to sort for a custom-made grid view. I am hoping I can achieve it without creating several customized algorithms. Currently I have a method, called on page load, that sorts the list by customer name, then status. I have a customized status order (new, in progress, has issues, completed, archived), and no matter which sort is used (customer, dates, and so on) it should sort the status in that order. For example:
I have two customers with two orders each; the first customer is Betty White, the second is Mickey Mouse. Currently, Betty has a new order and a completed order, and Mickey has an order in progress and another one that has issues. So the display order should be:
Betty, New :: Betty, Completed
Mickey, In Progress :: Mickey, Has Issues
I am currently using Packages.OrderBy(o => o.Customer).ThenBy(o => o.Status). This works effectively to get the customers sorted; however, it doesn't respect the custom ordering of the status property.
What would be the most efficient and standards-acceptable method to achieve this result?
case PackageSortType.Customer:
    Packages = Packages.OrderBy(o => o.Customer).ThenBy(o => o.Status).ToList<Package>();
    break;
I previously created a method that sorted by status only; however, I believe that throwing the OrderBy into that algorithm would just jumble the statuses back up in the end.
private void SortByStatus() {
    // Default sort order is: New, In Progress, Has Issues, Completed, Archived
    List<Package> tempPackages = new List<Package>();
    string[] statusNames = new string[5] { "new", "inProgress", "hasIssue", "completed", "archived" };
    string currentStatus = string.Empty;
    for (int x = 0; x < 5; x++) {
        currentStatus = statusNames[x];
        for (int y = 0; y < Packages.Count; y++) {
            if (tempPackages.Contains(Packages[y])) continue;
            else {
                if (Packages[y].Status == currentStatus)
                    tempPackages.Add(Packages[y]);
            }
        }
    }
    Packages.Clear();
    Packages = tempPackages;
}
Also, I'm not sure if it is relevant or not; however, the Packages list is stored in Session.
EDIT
Thanks to Alex Paven I have resolved the issue of custom-sorting my status. I ended up creating a new class for the status that implements IComparable<PackageStatus>, with a CompareTo method that forces the proper ordering of the statuses.
For those who are curious about the solution I came up with (it still needs to be cleaned up), it's located below:
public class PackageStatus : IComparable<PackageStatus> {
    public string Value { get; set; }
    int id = 0;
    static string[] statusNames = new string[5] { "new", "inProgress", "hasIssue", "completed", "archived" };

    public int CompareTo(PackageStatus b) {
        if (b == null) {
            return 1; // by convention, any instance compares greater than null
        }
        if (this == b) {
            return 0;
        }
        for (int i = 0; i < 5; i++) {
            if (this.Value == statusNames[i]) { id = i; }
            if (b.Value == statusNames[i]) { b.id = i; }
        }
        return Comparer<int>.Default.Compare(id, b.id);
    }
}
Use:
Packages.OrderBy(o => o.Customer).ThenBy(o => o.Status).ToList<Package>();
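A hypothetical alternative that avoids the comparer class entirely, if Status stays a plain string, is to use the status's position in the custom order as the secondary sort key:

string[] statusOrder = { "new", "inProgress", "hasIssue", "completed", "archived" };
// Array.IndexOf returns -1 for unknown statuses, so those would sort first.
Packages = Packages
    .OrderBy(o => o.Customer)
    .ThenBy(o => Array.IndexOf(statusOrder, o.Status))
    .ToList();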
I'm not sure what exactly you're asking; why can't you use the LINQ expressions in your first code sample? There's OrderByDescending in addition to OrderBy, so you can mix and match the sort order as you desire.
I have an Asp.Net MVC 5 website and I'm using Entity Framework code first to access its database. I have a Restaurants table and I want to let users search these with a lot of parameters. Here's what I have so far:
public void FilterModel(ref IQueryable<Restaurant> model)
{
    if (!string.IsNullOrWhiteSpace(RestaurantName))
    {
        model = model.Where(r => r.Name.ToUpper().Contains(RestaurantName.ToUpper()));
    }
    if (Recommended)
    {
        model = model.Where(r => r.SearchSponsor);
    }
    //...
}
Basically I look for each property and add another Where to the chain if it's not empty.
After that, I want to group the result based on some criteria. I'm doing this right now:
private static IQueryable<Restaurant> GroupResults(IQueryable<Restaurant> model)
{
    var groups = model.GroupBy(r => r.Active);
    var list = new List<IGrouping<bool, Restaurant>>();
    foreach (var group in groups)
    {
        list.Add(group);
    }
    if (list.Count < 1)
    {
        SortModel(ref model);
        return model;
    }
    IQueryable<Restaurant> joined, actives, inactives;
    if (list[0].FirstOrDefault().Active)
    {
        actives = list[0].AsQueryable();
        inactives = list.Count == 2 ? list[1].AsQueryable() : null;
    }
    else
    {
        actives = list.Count == 2 ? list[1].AsQueryable() : null;
        inactives = list[0].AsQueryable();
    }
    if (actives != null)
    {
        //....
    }
    if (inactives != null)
    {
        SortModel(ref inactives);
    }
    if (actives == null || inactives == null)
    {
        return actives ?? inactives;
    }
    joined = actives.Union(inactives).AsQueryable();
    return joined;
}
This works, but it has a lot of complications which I'd rather not go into for the sake of keeping this question small.
I was wondering if this is the right and efficient way to do it. It seems kind of "dirty"! Lots of ifs and Wheres. I've also wondered about stored procedures, inverted indices, etc. This is my first "big" project, and I want to learn from your experience how to do this the "right" way.
Looking at the GroupResults method I get a little confused about what you are doing. It seems the intention is to receive an arbitrary list of restaurants and return a list of restaurants ordered by Active and some other criteria.
If that's true, you may just do something like this and your job's done:
model.OrderBy(x => x.Active).ThenBy(x => x.Name);
If SortModel is somehow more sophisticated you may either add a comparer to the statement or stick with your current solution but change it to this:
if (model == null || !model.Any())
{
    return model;
}
var actives = model.Where(x => x.Active);
var inactives = model.Where(x => !x.Active);
// No null check needed: Where always returns a sequence, possibly empty. Maybe check inactives.Any().
SortModel(ref inactives); // You may also remove the ref, as it's a reference anyway
var joined = actives.Union(inactives).AsQueryable();
return joined;
Regarding the way you are handling your searching, I think it is simple, easy to read and understand, and it works. New team members will be able to look at that code and know immediately what it is doing and how it works. I think that is a pretty good indication that your approach is sound.
I have a problem with the slow "building" of a list, and I have no idea how to speed it up.
Here is my code:
private static ConcurrentBag<Classe<PojedynczeSlowa>> categoryClasses = new ConcurrentBag<Classe<PojedynczeSlowa>>();
private const int howManyStudents = 20;
private static int howManyClasses;
private static EventWaitHandle[] ewhClass;
private static List<Classe<Words>> deserializeClasses;
//...
public static void CreateCategoryClasses()
{
    deserializeClasses = Deserialize();
    howManyClasses = deserializeClasses.Count;
    ewhClass = new EventWaitHandle[howManyClasses];
    for (var i = 4; i >= 0; --i)
    {
        categoryClasses.Add(new Classe<PojedynczeSlowa>(((Categories)i).ToString()));
    }
    WaitCallback threadMethod = ParseCategories;
    ThreadPool.SetMaxThreads(howManyStudents, howManyClasses);
    for (var i = 0; i < howManyClasses; ++i)
    {
        ewhClass[i] = new EventWaitHandle(false, EventResetMode.AutoReset);
        ThreadPool.QueueUserWorkItem(threadMethod, i);
    }
    for (var i = 0; i < howManyClasses; ++i)
    {
        ewhClass[i].WaitOne();
    }
    var xmls = new XmlSerializer(typeof(List<Classe<PojedynczeSlowa>>)); // TODO: fix!
    using (var sw = new StreamWriter(@"categoryClasses.xml"))
    {
        xmls.Serialize(sw, categoryClasses.ToList());
    }
}
private static void ParseCategories(object index)
{
    int sum;
    var i = index as int?;
    if (deserializeClasses[i.Value].Category == Categories.PEOPLE.ToString())
    {
        foreach (var word in deserializeClasses[i.Value].Bag)
        {
            sum = deserializeClasses.Count(
                clas => clas.Bag.Where(x => clas.Category == deserializeClasses[i.Value].Category)
                                .Contains(word));
            if (!categoryClasses.ElementAt(0).Bag.Contains(new PojedynczeSlowa(word.Word, sum)))
            {
                categoryClasses.ElementAt(0).Bag.Add(new PojedynczeSlowa(word.Word,
                    Convert.ToDouble(sum) /
                    Convert.ToDouble(deserializeClasses.Count(x => x.Category == deserializeClasses[i.Value].Category))));
            }
        }
    }
    // rest of the code, which adds elements to the list at other indexes.
    ewhClass[i.Value].Set();
}
I might add that deserializeClasses contains about 18,550 elements of class "Word", and each of these elements contains a list of string and int; the average size of this list is about 200-250 elements. I use .NET 4.5.1.
Thanks for the help!
A couple of things (I don't have enough rep to comment, so my comments are coming in here too)...
1) Class definitions would be very helpful. For example, you have
if (!categoryClasses.ElementAt(0).Bag.Contains(new PojedynczeSlowa(word.Word, sum)))
and that Contains will never return true if you haven't overridden object.Equals (did you?). Also, it's much harder to know what's going on with an incomplete sample.
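For example, value equality for the stub might look like this (a hypothetical sketch, since the real PojedynczeSlowa definition isn't shown):

public class PojedynczeSlowa : IEquatable<PojedynczeSlowa>
{
    public string Word { get; private set; }
    public double Count { get; private set; }

    public PojedynczeSlowa(string word, double count)
    {
        Word = word;
        Count = count;
    }

    public bool Equals(PojedynczeSlowa other)
    {
        // Two entries are equal when both the word and its count match.
        return other != null && Word == other.Word && Count == other.Count;
    }

    public override bool Equals(object obj)
    {
        return Equals(obj as PojedynczeSlowa);
    }

    public override int GetHashCode()
    {
        return (Word ?? string.Empty).GetHashCode() ^ Count.GetHashCode();
    }
}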
2) Your code
sum = deserializeClasses.Count(clas => clas.Bag.Where(x => clas.Category == deserializeClasses[i.Value].Category).Contains(word));
doesn't make use of x at all. Consider
sum = deserializeClasses.Count(clas => clas.Category == deserializeClasses[i.Value].Category && clas.Bag.Contains(word));
This avoids much potential enumeration and could speed up the average cost even though the worst case cost remains the same.
3) Dictionaries are your friend. Consider making some temporary dictionaries indexed by whatever you're checking against. I'm having a hard time figuring out exactly what you're trying to do (see comment 1), but I'm guessing you could save quite a bit of performance, particularly on that Contains() call, by using a Dictionary - see the sketch below.
4) I'm not sure that multithreading is going to save you anything here. I'm guessing it will make things slower since this looks to be CPU bound and you are adding CPU overhead with thread switching.
I would help out with some code but I'm in a bit of a hurry and don't have time to guess at the rest of the missing code to get everything to compile.
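Still, as a rough illustration of point 3 (hypothetical, since the full class definitions aren't shown, and assuming a word is identified by its Word string): the per-word Count() could be precomputed once per category into a dictionary and then probed in O(1):

// Precompute once per category: word -> number of classes whose Bag contains it.
var category = deserializeClasses[i.Value].Category;
var wordCounts = new Dictionary<string, int>();
foreach (var cls in deserializeClasses.Where(c => c.Category == category))
{
    foreach (var w in cls.Bag.Select(x => x.Word).Distinct())
    {
        int n;
        wordCounts.TryGetValue(w, out n);
        wordCounts[w] = n + 1;
    }
}
// Later, inside the word loop: a dictionary probe instead of re-scanning the whole list.
int sum;
wordCounts.TryGetValue(word.Word, out sum);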
I've produced a function to get back a random set of submissions depending on the amount passed to it, but I worry that even though it works now with a small amount of data, it will become inefficient and cause problems when a large amount is passed through.
Is there a more efficient way of doing the following?
public List<Submission> GetRandomWinners(int id)
{
    List<Submission> submissions = new List<Submission>();
    int amount = (DbContext().Competitions
        .Where(s => s.CompetitionId == id).FirstOrDefault()).NumberWinners;
    for (int i = 1; i <= amount; i++)
    {
        bool added = false;
        while (!added)
        {
            bool found = false;
            var randSubmissions = DbContext().Submissions
                .Where(s => s.CompetitionId == id && s.CorrectAnswer).ToList();
            int count = randSubmissions.Count();
            int index = new Random().Next(count);
            foreach (var sub in submissions)
            {
                if (sub == randSubmissions.Skip(index).FirstOrDefault())
                    found = true;
            }
            if (!found)
            {
                submissions.Add(randSubmissions.Skip(index).FirstOrDefault());
                added = true;
            }
        }
    }
    return submissions;
}
As I say, I have this fully working and bringing back the desired result. It's just that I don't like the foreach and while checks in there, and my head has turned to mush from coming up with the above solution.
(Please read all the way through, as there are different aspects of efficiency to consider.)
There are definitely simpler ways of doing this - and in particular, you really don't need to perform the query for correct answers repeatedly. Why are you fetching randSubmissions inside the loop? You should also look at ElementAt to avoid the Skip and FirstOrDefault - and bear in mind that as randSubmissions is a list, you can use normal list operations, like the Count property and the indexer!
The option which comes to mind first is to perform a partial shuffle. There are loads of examples on Stack Overflow of a modified Fisher-Yates shuffle. You can modify that code very easily to avoid shuffling the whole list - just shuffle it until you've got as many random elements as you need. In fact, these days I'd probably implement that shuffle slightly differently, so that you could just call:
return correctSubmissions.Shuffle(random).Take(amount).ToList();
For example:
public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Random rng)
{
    T[] elements = source.ToArray();
    for (int i = 0; i < elements.Length; i++)
    {
        // Find an item we haven't returned yet
        int swapIndex = i + rng.Next(elements.Length - i);
        T tmp = elements[i];
        yield return elements[swapIndex];
        elements[swapIndex] = tmp;
        // Note that we don't need to copy the value into elements[i],
        // as we'll never use that value again.
    }
}
Given the above method, your GetRandomWinners method would look like this:
public List<Submission> GetRandomWinners(int competitionId, Random rng)
{
    int winnerCount = DbContext().Competitions
                                 .Single(s => s.CompetitionId == competitionId)
                                 .NumberWinners;
    var correctEntries = DbContext().Submissions
                                    .Where(s => s.CompetitionId == competitionId &&
                                                s.CorrectAnswer)
                                    .ToList();
    return correctEntries.Shuffle(rng).Take(winnerCount).ToList();
}
I would advise against creating a new instance of Random in your method. I have an article on preferred ways of using Random which you may find useful.
One alternative you may want to consider is working out the count of the correct entries without fetching them all, then working out the winning entries by computing a random selection of "row IDs" and using ElementAt repeatedly (with a consistent order). Alternatively, instead of pulling the complete submissions, pull just their IDs. Shuffle the IDs to pick n random ones (which you put into a List<T>), then use something like:
return DbContext().Submissions
                  .Where(s => winningIds.Contains(s.Id))
                  .ToList();
I believe this will use an "IN" clause in the SQL, although there are limits as to how many entries can be retrieved like this.
That way even if you have 100,000 correct entries and 3 winners, you'll only fetch 100,000 IDs, but 3 complete records. Hope that makes sense!
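A sketch of that last approach, reusing the Shuffle extension above (Id is assumed to be the submission's key):

public List<Submission> GetRandomWinnersById(int competitionId, Random rng)
{
    int winnerCount = DbContext().Competitions
                                 .Single(c => c.CompetitionId == competitionId)
                                 .NumberWinners;
    // Pull only the IDs of correct entries, not the full records.
    List<int> correctIds = DbContext().Submissions
                                      .Where(s => s.CompetitionId == competitionId && s.CorrectAnswer)
                                      .Select(s => s.Id)
                                      .ToList();
    // Shuffle the IDs and keep just as many as we need winners.
    var winningIds = correctIds.Shuffle(rng).Take(winnerCount).ToList();
    // Fetch the complete records for the winners only (translates to an "IN" clause).
    return DbContext().Submissions
                      .Where(s => winningIds.Contains(s.Id))
                      .ToList();
}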