I'm doing some machine learning stuff and I want to take some random samples and determine if a human agrees with the computer. To do this a user just votes up or down on a given item. Then I want to be able to sort by the items with the highest rating. I want to use something more complicated than simply up-down to get good results.
I've looked into the Wilson Interval Score and it seems like a decent solution, but I'm wondering if there are other alternatives.
I'm going to be using C# 4.0 if that matters.
Edit: added the example below.
Let's suppose I have 3 items and multiple people have voted on them according to the table:
Item  Up   Down
1     6    1
2     60   11
3     100  40
In this example I would like Item 3 to be listed first, Item 2 second and Item 1 third. This is a rough approximation of my expectations.
Item 3 has the most responses and highest relative approval. Item 2 has more responses than Item 1 despite having a lower percentage approval.
I'm trying to list the items in terms of some sort of relative metric and algorithm without using something like percent approval or net score; something more complicated.
You can implement the IComparable<T> interface for your class by implementing the CompareTo(T other) method. When this object is less than the other object, return -1. If they are the same, return 0. If this object is greater than the other object, return 1.
When you sort a collection using the .Sort() method, it will use your rules.
Is this what you are looking for?
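As a rough illustration of how the two ideas fit together, here is a minimal sketch that scores each item by the lower bound of the Wilson score interval and exposes that score through IComparable<T>. The RatedItem type, its members, and the default z = 1.96 (roughly 95% confidence) are assumptions made for the example, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item type; the name and members are illustrative only.
public sealed class RatedItem : IComparable<RatedItem>
{
    public int Id { get; private set; }
    public int Up { get; private set; }
    public int Down { get; private set; }

    public RatedItem(int id, int up, int down)
    {
        Id = id;
        Up = up;
        Down = down;
    }

    // Lower bound of the Wilson score interval for the proportion of up votes.
    // z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    public double WilsonLowerBound(double z = 1.96)
    {
        int n = Up + Down;
        if (n == 0) return 0.0; // no votes yet: rank last

        double p = (double)Up / n;
        double z2 = z * z;
        double centre = p + z2 / (2.0 * n);
        double margin = z * Math.Sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n));
        return (centre - margin) / (1.0 + z2 / n);
    }

    // Negative when this item scores lower than other, zero when equal, positive when higher.
    public int CompareTo(RatedItem other)
    {
        if (other == null) return 1;
        return WilsonLowerBound().CompareTo(other.WilsonLowerBound());
    }
}
```

Calling Sort() followed by Reverse() on a List<RatedItem> then lists the highest-scoring item first. Note that with the vote counts from the table above, this score puts Item 2 ahead of Item 3 and Item 1 last, which is not quite the order the question asks for, so treat it as one candidate metric rather than a drop-in answer.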
Related
I'm trying to write a program to optimize equipment configurations for a game. The problem is as follows:
There are six different equipment slots for a character to place an item. This is represented by 6 lists of individual items for each slot in the code containing all of the equipment owned by the player altogether.
The program will calculate the total stats of the character for each possible combination of equipment (1 from each list). These calculated stats can be filtered by specific stat min/max values and then also sorted by a specific stat to pinpoint a certain target set of stats for their character.
The program should be able to perform these queries without running out of memory or taking hours, and of course, the main problem is sifting through several billion possible combinations.
I don't know the names of the data structures or search algorithms that would help with this, which makes it hard to research a solution. I have come up with the following idea, but I'm not sure if I'm on the right track or if someone can point me in a more effective direction.
The idea I'm pursuing is to use recursion, where each list (one per equipment slot) is set into a tree structure, with each successive list acting as a child of the last. E.g.:
Weapons List
|
-----Armor List
     |
     ------Helm List... etc
Each layer of the tree would keep a dictionary of every child path it can take containing the IDs of 1 item from each list and progressively calculating the stats given to the character (simple addition of stats from weapon + armor + helm as it traverses the tree and so on...)
When any stat with a min/max filter hits its boundary (for example, the stat goes over the maximum before reaching the bottom layer of the tree), that "path" is eliminated from the dictionary, which removes that entire leg of possible results from being traversed.
The main goal here is to reduce the possible tree paths to be traversed by the search algorithm and remove as many invalid results as possible before the tree needs to calculate them, to make the search as fast as possible and avoid any wasteful cycles. This seems pretty straightforward when removing items based on a "maximum" filter, since when adding each item's stats progressively we can quickly tell when a stat has crossed its expected maximum -- however, when it comes to stopping paths based on a minimum total stat, I can't wrap my head around how to predict and remove the paths that won't end up above the minimum by the sixth item.
To simplify the idea, think of it like this:
I have 3 arrays of numbers
[X]  [0]  [1]  [2]
[0]   5    3    2
[1]   1    0    8
[2]   3    2    7
[3]   2    1    0
I want to find all combinations from the 3 arrays whose sum is at least 9 and at most 11.
Exactly 1 item must be selected from each array, and the sum of those selected values is what is being searched. This would need to scale up to searching 6+ arrays of 40+ values each. Is the above approach on the right track, or what is the best way to go about this (mainly using C#)?
You should be able to filter out a lot of items by using a lower and upper bound for each slot:
    var minimum = slots.Sum(slot => slot.Minimum);
    var maximum = slots.Sum(slot => slot.Maximum);
    foreach (var slot in slots)
    {
        // Best and worst totals obtainable from all the other slots
        var maxAvailable = maximum - slot.Maximum;
        var minAvailable = minimum - slot.Minimum;
        var filtered = slot.Items
            // Drop the item if choosing the strongest item in all the other slots still leaves it below the minimum
            .Where(item => item.Value + maxAvailable >= request.MinimumValue)
            // Drop the item if choosing the weakest item in all the other slots still leaves it above the maximum
            .Where(item => item.Value + minAvailable <= request.MaximumValue);
    }
This removes items that cannot appear in any valid combination. It does not guarantee that every combination of the remaining items is valid: a combination can still fall below the requested minimum or above the requested maximum, so combine this with the logic you have so far and I think you should get pretty good performance.
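The same bound can also be applied while walking the tree, which addresses the "minimum" half of the question. The following is a minimal sketch, assuming each slot is just an array of integer values for a single stat; the CombinationSearch class and its method names are made up for the example. It precomputes the smallest and largest totals still obtainable from the remaining slots and abandons a path as soon as the range can no longer be hit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CombinationSearch
{
    // Counts the ways to pick one value per slot so that the total lies in [min, max].
    // Assumes every slot contains at least one value.
    public static long CountInRange(IList<int[]> slots, int min, int max)
    {
        int n = slots.Count;
        // restMin[i] / restMax[i]: smallest / largest total obtainable from slots i..n-1.
        var restMin = new int[n + 1];
        var restMax = new int[n + 1];
        for (int i = n - 1; i >= 0; i--)
        {
            restMin[i] = restMin[i + 1] + slots[i].Min();
            restMax[i] = restMax[i + 1] + slots[i].Max();
        }
        return Count(slots, 0, 0, min, max, restMin, restMax);
    }

    private static long Count(IList<int[]> slots, int depth, int sum,
                              int min, int max, int[] restMin, int[] restMax)
    {
        // Even the weakest remaining picks overshoot the maximum: prune.
        if (sum + restMin[depth] > max) return 0;
        // Even the strongest remaining picks fall short of the minimum: prune.
        if (sum + restMax[depth] < min) return 0;
        // All slots chosen; the two checks above already confirmed the total is in range.
        if (depth == slots.Count) return 1;

        long total = 0;
        foreach (int value in slots[depth])
            total += Count(slots, depth + 1, sum + value, min, max, restMin, restMax);
        return total;
    }
}
```

The sketch only counts matches to keep it short; collecting the chosen indices along the way would work the same. With several filtered stats, the two checks would be repeated per stat.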
One of my clients wants to use a unique code for his items (long story..) and he asked me for a solution. The code will consist of 4 parts: the first is the zip code the item is sent from, the second is the supplier registration number, the third is the year the item is sent, and the last part is a unique three-character alphanumeric code.
As you can see, the first three parts are static fields which will never change for the same sender in the same year. So we can say that the last part is the identifier part for that year. This part is three alphanumeric characters, which means starting from 000 and ending with ZZZ.
The problem is that my client, for some reasonable reasons, wants this part to be not sequential. For example this is not what he wants:
06450-05-2012-000
06450-05-2012-001
06450-05-2012-002
...
06450-05-2012-ZZY
06450-05-2012-ZZZ
The last part should be produced randomly, like:
06450-05-2012-A17
06450-05-2012-0BF
06450-05-2012-002
...
06450-05-2012-T7W
06450-05-2012-22C
But it should also be non-repetitive. So once a possible id is generated, it should be discarded from the selection pool.
I am looking for an effective way to do this.
If I only record selected possibilities and check a newly created one against them there is always a worst case possibility that it keeps producing already selected ones, especially near the end.
If I create all possibilities at once and record them in a table or a file, every item creation may take a while because it has to look up a non-selected record. By the way, 26 letters + 10 digits means 46,656 possible combinations, and there is a chance that a 4th character may be added, which means 1,679,616 possible combinations.
Is there a more effective way you can suggest? I will use C# for coding and MS SQL for the database.
If it doesn't have to be random, you could maybe simply choose a fixed but "unpredictable" addend which is relatively prime to 26 + 10 == 36 == 2²·3². This means, just choose a fixed addend divisible by neither 2 nor 3.
Then keep adding this fixed number to your previous serial number every time you need a new serial number. This is to be done modulo 46656 (or 1679616) of course.
Mathematics guarantees you won't get the same number twice (before no more "free" numbers are left).
As the addend, you could use const int addend = 26075 since it's 5 modulo 6.
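A minimal sketch of that stepping scheme for the three-character case follows. The SerialCodes class and its method names are assumptions for illustration; in practice the last serial number would be stored per zip/supplier/year, for example in the database.

```csharp
using System;

public static class SerialCodes
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public const int Modulus = 36 * 36 * 36; // 46656 three-character codes
    public const int Addend = 26075;         // shares no factor with 36, as suggested above

    // Steps to the next serial number; every value is visited once before any repeats.
    public static int Next(int previous)
    {
        return (previous + Addend) % Modulus;
    }

    // Renders a serial number in [0, Modulus) as three base-36 characters, e.g. 0 -> "000".
    public static string Format(int serial)
    {
        var chars = new char[3];
        for (int i = 2; i >= 0; i--)
        {
            chars[i] = Digits[serial % 36];
            serial /= 36;
        }
        return new string(chars);
    }
}
```

The sequence looks scrambled but is fully predictable to anyone who knows the addend, so this only fits if "not sequential" is the requirement rather than "unguessable".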
If you expect to create far fewer than 36^3 entries for each zip-supplier-year tuple, you should probably just pick a random value for the last field and then check to see if it exists, repeating if it does.
Even if you create half of the maximum number of possible entries, each new entry still needs only one failed attempt on average. Assuming your database is indexed on the overall identifier, this isn't too great a price to pay.
That said, if you expect to use all but a few possible identifiers, then you should probably create all the possible records in advance. It may sound like a high cost, but each space in memory storing an unused record will eventually store a real record.
I'd expect the first situation is more likely, but if not, or if there's some other combination of the two, please add a comment with some more information and I'll revise my answer.
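The pick-and-retry approach could be sketched as below. The in-memory set stands in for the indexed database lookup, and the RandomCodes class is a made-up name; both are assumptions of the example rather than part of the answer.

```csharp
using System;
using System.Collections.Generic;

public static class RandomCodes
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Draws random three-character codes until one is not in 'used', then records it.
    // 'used' is a stand-in for the existence check against the database.
    public static string NextUnused(ISet<string> used, Random random)
    {
        if (used.Count >= 36 * 36 * 36)
            throw new InvalidOperationException("All codes are taken.");

        while (true)
        {
            var chars = new char[3];
            for (int i = 0; i < chars.Length; i++)
                chars[i] = Digits[random.Next(Digits.Length)];

            string code = new string(chars);
            if (used.Add(code)) return code; // Add returns false for a duplicate
        }
    }
}
```

Against a real table, a unique index on the full identifier would play the role of the set, with a failed insert triggering the retry.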
I think options depend on the amount of the codes that are going to be used:
If you expect to use most of them within a year, then it is better to pre-generate. If done right, lookup should be really fast. And you are going to have 1,679,616 items per year in your DB anyway, so you will have to do such things right.
On the other hand, is it good that you are expecting to use most of them? It may leave you without codes if there are suddenly more items than expected.
If you expect to use only a small amount, then random+existence check might be a way to go, however it is unclear what amount it should be for that to be best (I am pretty sure it is possible to calculate that though).
I am trying to create a tool for a game called Monster Hunter (for personal use). I have worked with permutations before, but nothing this complex, so I am totally stuck.
In the game you wear 5 pieces of armor. Each piece has skill points for one of many different skills. If you have 10+ skill points in a particular skill after calculating the whole set, you earn that skill.
Example:
Foo Head: Attack +2, Guard +2
Foo Chest: Defense +5
Foo Body: Guard +2, Attack +5, Defense +2
Foo Arm: Attack +3, Speed +4
Foo Legs: Attack +5, Guard +6, Defense +3
The above set would result in 10+ in Attack, Defense, and Guard (not speed).
I would like to figure out how to find all combinations of armor pieces given 2-3 user-specified skills. So if you selected "Attack" and "Speed", it would give you all possible combinations of 5 pieces of armor that will result in +10 in both "Attack" and "Speed". There are about 60 different items for each of the 5 categories.
I know I can use LINQ to filter each of the 5 categories of armor parts so that I only get back a list of all the items that include one of the 2 specified skills, but I am lost on how to do the permutations since I am juggling 2-3 user-specified skills...
I wish I had working code to show, but I am so lost at this point I don't know where to start. I am not looking for the answer, per se, but advice on how to get there. Thanks.
1) I would first solve the problem for just one skill, then filter that result set for the second and third skills.
2) To avoid using too much time, memory, or recursion, I would sort the 5 * 60 items by that one skill. Then I would build combinations by looking for the ones that add up to 10 or more, starting from the items with the most points, and stopping either when 10 is reached or when it can no longer be reached.
The function that builds all combinations would look like this:
1: If the total skill count so far is 10 or more, every combination with the other items is OK. Stop.
2: If the current skill count is below 10, look in the array for the next biggest item for a piece type that is not yet worn.
If we reach 0 in the array, or we reach a value such that (current count + value * number of piece types left) < 10, then it's time to stop :-)
Otherwise add its skill count, mark that armor piece type as used, then call the function again for all items that might match.
I may not be precise enough, but you see the idea: put a condition on the recursive call so the recursion doesn't explode, because 60*60*60*60*60 is a lot, and (quick)sorting 5*60 = 300 items is nothing.
When storing your combinations, you might want to add an 'anything goes' case to avoid storing or computing too many combinations for nothing. (E.g. if you have Carmack's Magical Hat, you have +100 in Coding, and you can dress any way you want, the bugs will die! :-) )
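One way to turn the stopping rule into code is sketched below. It is a variation on the idea above rather than a literal transcription: instead of sorting, it precomputes the most points each remaining slot can still contribute and stops when a wanted skill can no longer reach the target, handling all wanted skills in one pass. The ArmorPiece and ArmorSearch types are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical armor piece: a name plus skill points keyed by skill name.
public sealed class ArmorPiece
{
    public string Name;
    public Dictionary<string, int> Skills = new Dictionary<string, int>();

    public int Points(string skill)
    {
        int value;
        return Skills.TryGetValue(skill, out value) ? value : 0;
    }
}

public static class ArmorSearch
{
    // Returns every one-piece-per-slot set reaching 'target' points in all wanted skills.
    public static List<ArmorPiece[]> FindSets(IList<List<ArmorPiece>> slots,
                                              IList<string> wanted, int target = 10)
    {
        int n = slots.Count, k = wanted.Count;
        // best[i][s]: most points of skill s still obtainable from slots i..n-1.
        var best = new int[n + 1][];
        best[n] = new int[k];
        for (int i = n - 1; i >= 0; i--)
        {
            best[i] = new int[k];
            for (int s = 0; s < k; s++)
                best[i][s] = best[i + 1][s] + slots[i].Max(p => p.Points(wanted[s]));
        }

        var results = new List<ArmorPiece[]>();
        Search(slots, wanted, target, best, 0, new int[k], new ArmorPiece[n], results);
        return results;
    }

    private static void Search(IList<List<ArmorPiece>> slots, IList<string> wanted, int target,
                               int[][] best, int depth, int[] totals, ArmorPiece[] chosen,
                               List<ArmorPiece[]> results)
    {
        // Stop as soon as some wanted skill can no longer reach the target.
        for (int s = 0; s < totals.Length; s++)
            if (totals[s] + best[depth][s] < target) return;

        if (depth == slots.Count)
        {
            results.Add((ArmorPiece[])chosen.Clone());
            return;
        }

        foreach (ArmorPiece piece in slots[depth])
        {
            for (int s = 0; s < totals.Length; s++) totals[s] += piece.Points(wanted[s]);
            chosen[depth] = piece;
            Search(slots, wanted, target, best, depth + 1, totals, chosen, results);
            for (int s = 0; s < totals.Length; s++) totals[s] -= piece.Points(wanted[s]);
        }
    }
}
```

This sketch still enumerates every valid set one by one; the 'anything goes' shortcut described above would replace that with a single stored entry once all wanted skills are already satisfied.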
I'm trying to see if a specific algorithm can be translated to the kind of map-reduce index RavenDB/CouchDB uses, i.e., "pre-computed" map-reduce (which means the indexes are refreshed on insertion and updates, not when performing the actual query).
Let's say we have a typical online store with 50,000 products, grouped in categories. Every product has a collection of "Attribute Values", i.e., something like "[Red, Round, Metal]".
Since we have so many products on our website, and there are probably a lot of items in each of the categories, we want to give the user another way to "filter" the products he's currently seeing.
For example, if a category is "Less than $20", there's a whole bunch of products in this category. But our user only needs to see products which are less than $20 and Red. Unfortunately, there's no sub-category "Red" in the "Less than $20" category.
Our algorithm would take the current list of products and generate a list of "interesting" Attributes and Attribute Values, i.e., given a list of products, it would output something like:
Color
    Red (40)
    Blue (32)
    Yellow (17)
Material
    Metal (37)
    Plastic (36)
    Wood (23)
Shape
    Square (56)
    Round (17)
    Cylinder (12)
Could this sort of algorithm be somehow pre-computed à la RavenDB/CouchDB map-reduce index? If not, why exactly (so I can identify that kind of algorithm in the future) and if yes, how?
A C# 4.0 Visual Studio Test Solution is available that demonstrates the potential data structures and sample data, as well as a try at a map-reduce implementation (which doesn't seem to be pre-computable).
General case: It's always possible to use a CouchDB-style map-reduce view, but it's not necessarily practical.
In the end, it's mostly a counting-based argument: if you need to ask the question for any subset of your 500,000 products, then your database must be able to provide a distinct answer to each of 2^500,000 different possible questions, which uses a prohibitive amount of memory if you have to emit a B-tree leaf for every one of them (and you need to emit data unless the answer to most of these queries is zero, false, an empty set or a similar null value).
CouchDB provides a first small optimization through the existence of range queries (meaning that in an ideal case, it can use as little as N B-tree leaves to answer N^2 questions). However, in your example, this would only reduce the number of leaves down to 2^250,000 (and that's a theoretical lower bound).
CouchDB provides a second small optimization through key prefix queries, meaning that you can compress [A], [A,B] and [A,B,C] queries into a single [A,B,C] key. So, instead of your 2^250,000 possibilities, you're down to a "mere" 2^249,999 ...
So, while you could think up an emitting strategy for answering the question for any subset, it would take more storage space than is actually available on our planet. In the general case, to answer N different questions you need to emit at least sqrt(N/2) B-tree leaves, so count your questions and determine if that lower bound on the number of leaves is acceptable.
Only for categories and subcategories: if you give up on arbitrary lists of products and only ask questions of the form "give me the significant attributes in category A filtered by attributes B and C", then your number of emits drops to:
AvgCategories * AvgAttr * 2 ^ (AvgAttr - 1) * 500,000
You're basically emitting for each product the keys [Category,Attr,Attr,...] for all categories of the product and all combinations of attributes of the product, which lets you query by category + attributes. If you have on average 1 category and 3 attributes per product, this works out to about 6 million entries, which is fairly acceptable.
This should be quite straightforward to implement in something like CouchDB. Have the map phase of your index output one key, value pair for each attribute the object has, with the value simply being '1'. Then, have the reduce phase sum up all input values and output the sum. The end result will be an index of the form you describe.
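To make the map and reduce steps concrete, here is an in-memory LINQ sketch of the same counting. It is only an illustration of the shape of the computation: in CouchDB the map and reduce functions would be written in JavaScript inside a design document, and the Product type, the "Name:Value" attribute strings and the "Category|Attribute" key format are assumptions of this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical product shape; attribute values are stored as "Name:Value" strings.
public sealed class Product
{
    public string Category;
    public List<string> Attributes = new List<string>();
}

public static class AttributeCounts
{
    // Map: emit (category + attribute, 1) for each attribute of each product.
    // Reduce: sum the emitted values per key.
    public static Dictionary<string, int> Build(IEnumerable<Product> products)
    {
        return products
            .SelectMany(p => p.Attributes.Select(a => new { Key = p.Category + "|" + a, Count = 1 }))
            .GroupBy(e => e.Key)
            .ToDictionary(g => g.Key, g => g.Sum(e => e.Count));
    }
}
```

Because the key includes the category, the result answers "how many products in this category have this attribute value", which matches the facet list in the question. It does not cover counts for an arbitrary, already-filtered product list.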
What kind of algorithm is this? I know pretty much nothing, but this is what I'm trying to do in code. I have a class 'Item' with properties int A and int B. I have multiple List<Item> lists with a random number of Items in each, independent of the other lists. I must choose 1 item from each list to get the highest possible sum of Item.A, while the sum of Item.B must be at least a certain number. In the future there might also be another property Item.C, whose sum must be equal to a certain number. I have no idea how to write this :(
So to put it this way;
class Item
{
    public int A;
    public int B;
    public int C;
}
I have about 10 different List<Item> instances, each with a random number of Items inside.
We must find exactly the best combination, with:
a) Highest sum of Item.A
b) Constraint that the sum of Item.B must be higher than X
c) Constraint that the sum of Item.C must be equal to X
I have no idea how to code this to be fast and efficient. :(
As mentioned in my comment, this is a Binary Programming problem, which can be cast as a multi-dimensional Knapsack problem. I would first try to solve it with an off-the-shelf Mixed Integer Programming (MIP) solver like the one suggested by Lieven in one of his comments (lpSolve), given that you "only" have some 100-200 binary variables. You might have to play around a little with the parameters. Some MIP solvers allow you to add search heuristics, which might be helpful. Given your constraints, I must admit I don't have a feeling for how long a standard MIP solver will take, but I wouldn't hold my breath.
If a mixed-integer programming solver is not fast enough for you, you want to look at some more specialised algorithms. For your problem, the ones presented in Knapsack Problems, chapter 11.10 on the multiple-choice Knapsack problem (almost exactly your problem) and chapter 9 are relevant.
Edit: based on your comments, the good news is that your data ranges are pretty good and the problem seems solvable in a reasonable time. This paper (DOI in case the link vanishes) presents an algorithm that according to the authors solves problems of your size within seconds (see section 4.4 and 5.1). The bad news is that it contains a lot of math...
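For small integer ranges, a plain dynamic program over the lists is another exact option. The sketch below is not the algorithm from the paper or the book mentioned above; it is a straightforward table-based approach that assumes B and C are non-negative integers and that minB and targetC are small enough for a table of that size. The Item fields follow the question; the MultipleChoiceKnapsack class and Solve method are made-up names.

```csharp
using System;
using System.Collections.Generic;

public struct Item
{
    public int A, B, C;
    public Item(int a, int b, int c) { A = a; B = b; C = c; }
}

public static class MultipleChoiceKnapsack
{
    // Best total A when picking exactly one item per list with sum(B) >= minB
    // and sum(C) == targetC; returns null when no combination qualifies.
    // Assumes B and C are non-negative integers (an assumption of this sketch).
    public static int? Solve(IList<List<Item>> lists, int minB, int targetC)
    {
        const int None = int.MinValue;
        // best[b, c]: highest sum of A seen so far with sum(C) == c and
        // sum(B), capped at minB, equal to b.
        var best = NewTable(minB, targetC, None);
        best[0, 0] = 0;

        foreach (List<Item> list in lists)
        {
            var next = NewTable(minB, targetC, None);
            for (int b = 0; b <= minB; b++)
                for (int c = 0; c <= targetC; c++)
                {
                    if (best[b, c] == None) continue;
                    foreach (Item item in list)
                    {
                        int nc = c + item.C;
                        if (nc > targetC) continue;          // C only grows, so this path is dead
                        int nb = Math.Min(minB, b + item.B); // B beyond minB makes no difference
                        int value = best[b, c] + item.A;
                        if (value > next[nb, nc]) next[nb, nc] = value;
                    }
                }
            best = next;
        }
        return best[minB, targetC] == None ? (int?)null : best[minB, targetC];
    }

    private static int[,] NewTable(int minB, int targetC, int fill)
    {
        var table = new int[minB + 1, targetC + 1];
        for (int b = 0; b <= minB; b++)
            for (int c = 0; c <= targetC; c++)
                table[b, c] = fill;
        return table;
    }
}
```

The work is roughly (total number of items) * (minB + 1) * (targetC + 1) steps. If the B constraint is strict ("higher than X"), pass X + 1 as minB. Recovering which items were chosen would need an extra back-pointer table, which is left out here.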
I posted this question as an unregistered user and after clicking register, it didn't associate my unregistered user with my registered user, nice =/
In regards to the comment by van:
Typically there will be about 14 lists or so
Within each list there will be usually around 5-15 'Items'
Each item has those 3 properties.
We must choose exactly 1 item from each list.
We are looking for the maximum value of PropertyA when we calculate the sum of all PropertyA after choosing one item from each list
The constraints are PropertyB and PropertyC, which the chosen combination must conform to, once again using the sum of the values across the combination.
It must also be the optimal solution, not an approximation.