I have an application whose visitors come and go.
I'm working with a data provider that gives me information about users, such as their gender, age, location, and personality.
Now I'd like to target these users with appropriate content.
In short, I have content and users with their personality information, and I need to display the content that best matches each user's character, personality, etc.
I am aware that, given a list of content and a user, I will be searching for the best possible content for that user, i.e. A* search.
How would you design and implement such an application?
Which algorithm(s) and data structures would you use? Graphs? An adjacency list? A matrix?
I would suggest solving this problem using Bayesian inference.
Bayesian Classifiers
As the problem is currently stated, the only classification of the content that is available is the distribution of the users which have visited it and the characteristics of those users. The joint probability distribution across all user-characteristic dimensions for all users is the classifier for that content.
So how does one use the above information? Given content A with user access distribution B for all users and a target user characteristic profile C, one can compute the probability that the latter user would be interested in content A. If one performs this computation against all content relative to user profile C, one gets a list of interest probability values for all of the content. Sort that list by the probability values to identify the best possible content for the target user.
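The ranking step described above can be sketched in a few lines. This is a minimal sketch, not a full Bayesian model: it assumes a naive Bayes simplification (each characteristic is treated as independent given the content, instead of using the full joint distribution) and Laplace smoothing. The class name `ContentScorer`, the feature names, and the sample visits are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class ContentScorer:
    """Naive Bayes style scorer: one characteristic distribution per content item."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing          # Laplace smoothing constant (assumption)
        self.visits = Counter()             # content -> number of recorded visits
        self.counts = defaultdict(Counter)  # (content, feature) -> value counts
        self.values = defaultdict(set)      # feature -> every value seen so far

    def record_visit(self, content, profile):
        """Store one visit to `content` by a user with the given characteristics."""
        self.visits[content] += 1
        for feature, value in profile.items():
            self.counts[(content, feature)][value] += 1
            self.values[feature].add(value)

    def log_score(self, content, profile):
        """log P(content) + sum of log P(value | content) over the profile."""
        total = sum(self.visits.values())
        score = math.log(self.visits[content] / total)
        for feature, value in profile.items():
            seen = self.counts[(content, feature)][value]
            n_values = len(self.values[feature] | {value})
            score += math.log(
                (seen + self.smoothing)
                / (self.visits[content] + self.smoothing * n_values)
            )
        return score

    def rank(self, profile):
        """Return all known content, most probable first."""
        return sorted(self.visits,
                      key=lambda c: self.log_score(c, profile),
                      reverse=True)


scorer = ContentScorer()
scorer.record_visit("sports", {"gender": "m", "age": "18-25"})
scorer.record_visit("sports", {"gender": "m", "age": "26-35"})
scorer.record_visit("cooking", {"gender": "f", "age": "26-35"})
scorer.record_visit("cooking", {"gender": "f", "age": "36-45"})
print(scorer.rank({"gender": "f", "age": "26-35"}))  # ['cooking', 'sports']
```

The independence assumption is exactly what a Bayesian network relaxes, so treat this as a baseline rather than the final model.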
In many cases, only a subset of user characteristic parameters may be predictive of the value of a given content item to users. This is a common situation for Bayesian classifiers in general and has led to the development of Bayesian networks, which are structured graphs of key variables and their conditional dependencies. Such networks can be modeled via Bayesian inference methods as well.
Bayesian Network Software
The WEKA Data Mining software is an open-source Java library which implements many common classification methods including Bayesian network classifiers, and it is well worth trying out. I can't recommend any specific C# equivalent packages, but a quick web search identified at least one commercial Bayesian package for .NET, Bayes Server.
Recommended Reading
There is a pretty large body of literature on Bayesian classifiers, and it is a very sound technique that is used in spam filtering, drug discovery, etc. Two books that I can recommend are listed below. Bolstad's book is for beginners, while Pearl's book is more advanced.
Bolstad, William M. (2007). Introduction to Bayesian Statistics, Second Edition, John Wiley.
Judea Pearl (2000). Causality: Models, Reasoning, and Inference, Cambridge University Press.
Very interesting question!
You're talking about the best possible content, but you didn't mention a measurement. I guess that by content you mean some form of advertisement, and that “best” means most effective, i.e. having the highest CTR (click-through rate).
So you have a function:
f(gender, age, location, personality, ..., advertisement) -> CTR
On each visit you get a fixed gender, age, location, etc. By fixed I mean that you already have this visitor and you can't vary his age. You also have one parameter that you can change: the advertisement. Your goal is to maximize CTR.
By varying the advertisement you can gather CTR statistics for different combinations. Once you have some minimal initial knowledge, you can use optimization methods, particularly nonlinear programming, to find the optimal advertisement for a given gender, age, location, etc. Keep gathering CTR statistics so that subsequent decisions become more and more precise.
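The gather-then-decide loop can be illustrated with a much simpler stand-in than nonlinear programming. The sketch below uses an epsilon-greedy rule (show the best-known advertisement most of the time, a random one occasionally) and keeps CTR statistics per visitor segment. The `CtrSelector` class, the segment tuple, and the advertisement names are hypothetical.

```python
import random
from collections import defaultdict


class CtrSelector:
    """Epsilon-greedy advertisement choice backed by per-segment CTR statistics."""

    def __init__(self, ads, epsilon=0.1, rng=None):
        self.ads = list(ads)
        self.epsilon = epsilon            # fraction of visits used for exploration
        self.rng = rng or random.Random()
        self.shows = defaultdict(int)     # (segment, ad) -> impressions
        self.clicks = defaultdict(int)    # (segment, ad) -> clicks

    def ctr(self, segment, ad):
        """Observed click-through rate, 0.0 when the ad was never shown."""
        shows = self.shows[(segment, ad)]
        return self.clicks[(segment, ad)] / shows if shows else 0.0

    def choose(self, segment):
        """Pick an ad for a visitor whose fixed attributes form `segment`."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.ads)  # explore: keep gathering statistics
        return max(self.ads, key=lambda ad: self.ctr(segment, ad))  # exploit

    def record(self, segment, ad, clicked):
        """Feed the outcome of one impression back into the statistics."""
        self.shows[(segment, ad)] += 1
        if clicked:
            self.clicks[(segment, ad)] += 1


selector = CtrSelector(["ad_a", "ad_b", "ad_c"], epsilon=0.1)
segment = ("f", "26-35", "NL")   # gender, age band, location
ad = selector.choose(segment)
selector.record(segment, ad, clicked=True)
```

Keying the statistics on the whole segment only works while the number of combinations is small; with many attributes a model of the function f above would be needed instead.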
P.S. There was a startup showcase on TechCrunch. They did a similar thing and had fantastic results. So if you succeed, think about starting your own business ;)
Despite Googling around a fair amount, the only things that surfaced were on neural networks and using existing APIs to find tags about an image, and on webcam tracking.
What I would like to do is create my own data set for some objects (a database containing the images of a product (or a fingerprint of each image), and manufacturer information about the product), and then use some combination of machine learning and object detection to find if a given image contains any product from the data I've collected.
For example, I would like to take a picture of a chair and compare that to some data to find which chair is most likely in the picture from the chairs in my database.
What would be an approach to tackling this problem? I have already considered using OpenCV, and feel that this is a starting point and probably how I'll detect the object, but I've not found how to use this to solve my problem.
I think in the end it doesn't matter which tool you use to tackle your problem. You will probably need some kind of machine learning. It's hard to say which method would give the best detection, so I'd recommend using a tool like Weka. It's a collection of machine learning algorithms and lets you easily try out what works best for you.
Before you can start trying out machine learning, you first need to extract some features from your dataset. You can hardly compare the images pixel by pixel: that would take a huge computational effort and would not necessarily give the results you need. Try to extract features that make your images unique, like average colour or brightness, or maybe some shapes or sizes. In the end you feed your algorithm only the features you extracted from your images, not the images themselves.
Which features are good is hard to say; it depends on your particular case. Generally it helps to have not just one but several features that cover completely different aspects of the image. To extract the features you could use OpenCV, or any other image processing tool you like. Get the features of all images in your dataset and get started with the machine learning.
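To make the feature idea concrete, here is a minimal sketch that reduces each image to average colour and brightness and then picks the closest catalogue entry. It assumes images are already decoded into lists of (r, g, b) tuples (in practice OpenCV or another library would do the loading), and the function names and sample catalogue are hypothetical. A nearest-neighbour lookup stands in for whichever learning algorithm turns out to work best.

```python
import math


def extract_features(pixels):
    """Return (mean_r, mean_g, mean_b, mean_brightness) for a list of RGB tuples."""
    n = len(pixels)
    mean_r = sum(p[0] for p in pixels) / n
    mean_g = sum(p[1] for p in pixels) / n
    mean_b = sum(p[2] for p in pixels) / n
    brightness = (mean_r + mean_g + mean_b) / 3
    return (mean_r, mean_g, mean_b, brightness)


def closest_product(query_pixels, catalogue):
    """Name of the catalogue entry whose feature vector is nearest to the query."""
    query = extract_features(query_pixels)
    return min(catalogue, key=lambda name: math.dist(query, catalogue[name]))


# Hypothetical catalogue: product name -> precomputed feature vector.
catalogue = {
    "red chair": extract_features([(200, 30, 30)] * 4),
    "blue chair": extract_features([(20, 40, 210)] * 4),
}
photo = [(180, 50, 40)] * 4   # stand-in for a decoded photograph
print(closest_product(photo, catalogue))  # red chair
```

Colour alone will not separate two chairs of the same colour, which is why several features covering different aspects of the image are worth combining.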
From what I understood, you want to build a Content Based Image Retrieval system.
There are plenty of methods to do this. What defines the best method to solve your problem has to do with:
the type of objects you want to recognize,
the type of images that will be introduced to search the objects,
the priorities of your system (efficiency, robustness, etc.).
You gave the example of recognizing chairs. In your system, what would be the determining factor for selecting the most similar chair? The color of the chair? Its shape? These are typical questions that you have to answer before choosing a method.
Either way, one of the most widely used methods for such problems is the Bag-of-Words model (also referred to as Bag of Features). I wish I could help more, but for that I need you to explain the final goals of your work or project in more detail.
I'm writing a desktop UI (.NET WinForms) to help a photographer clean up his image metadata. There is a list of 66k+ phrases. Can anyone suggest a good open source/free .NET component that uses some sort of algorithm to identify potential candidates for consolidation? For example, there may be two or more entries that are actually the same word or phrase and differ only by whitespace, punctuation, or a slight misspelling. The application will ultimately rely on the user to approve the consolidation of phrases, but an effective way to find potential candidates automatically will prove invaluable.
Let me introduce you to the Levenshtein distance formula. It is awesome:
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein process, we gave them a confidence rating and allowed them to determine if it was a true duplicate or something unique.
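For readers who want to see the mechanics, here is a minimal sketch of the Levenshtein distance together with a candidate finder for near-duplicate phrases. The normalisation step (lower-casing, collapsing whitespace and punctuation) and the `max_distance` threshold are assumptions added for illustration, as are the function names.

```python
import re
from itertools import combinations


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalise(phrase):
    """Lower-case and collapse punctuation/whitespace runs to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", phrase.lower()).strip()


def candidate_pairs(phrases, max_distance=2):
    """Pairs of phrases whose normalised forms are within max_distance edits."""
    pairs = []
    for left, right in combinations(phrases, 2):
        distance = levenshtein(normalise(left), normalise(right))
        if distance <= max_distance:
            pairs.append((left, right, distance))
    return pairs


print(levenshtein("kitten", "sitting"))  # 3
print(candidate_pairs(["New York", "new-york ", "Boston"]))
# [('New York', 'new-york ', 0)]
```

Comparing every pair is quadratic: 66k phrases give roughly two billion comparisons, so in practice the phrases would first be grouped (for example by normalised prefix or length) and compared only within each group.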
I know this is an old question, but I feel this answer can help people who are dealing with the same issue today.
Please have a look at https://github.com/JakeBayer/FuzzySharp
It is a C# NuGet package with several methods, each implementing a particular kind of fuzzy search. I'm not sure, but perhaps the approach from Fosco's answer is also used by one of them.
I just noticed a comment about this package, but I think it deserves a more prominent place in this question.
I'm building a system which will have a few channels feeding different clients (MonoDroid, MonoTouch, Asp.Net Mvc, REST API)
I'm trying to adopt an SOA architecture and also the persistence by reachability pattern (http://www.udidahan.com/2009/06/29/dont-create-aggregate-roots/)
My question relates to the design of the architecture: how best to split the system into discrete chunks to benefit from SOA.
In my model I have a SystemImplementation, which represents an installation of the system itself, and also an Account entity.
The way I initially thought about designing this was to create the services as:
SystemImplementationService - responsible for managing things related to the actual installation itself such as branding, traffic logging etc
AccountService - responsible for managing the users assets (media, network of contacts etc)
Logically the registration of a new user account would happen in AccountService.RegisterAccount where the service can take care of validating the new account (duped username check etc), hashing the pw etc
However, in order to achieve persistence by reachability I'd need to add the new Account to the SystemImplementation.Accounts collection for it to be saved in the SystemImplementation service automatically (using NHibernate I can use lazy=extra to ensure that adding the new account to the collection doesn't automatically load all accounts)
For this to happen I'd probably need to create the Account in AccountService, pass the unsaved entity back to the client and then have the client call SystemImplementation.AssociateAccountWithSystemImplementation
That way I don't need to call the SystemImplementation service from the AccountService (as this, correct me if I'm wrong, is bad practice)
My question is then: am I splitting the system incorrectly? If so, how should I be splitting a system? Is there any methodology for defining the way a system should be split for SOA? Is it OK to call a WCF service from within a service:
AccountService.RegisterAccount --> SystemImplementation.AssociateAccountWithSystemImplementation
I'm worried I'm going to start building the system on antipatterns that will come back to catch me later :)
You have a partitioning issue, but you are not alone, everyone who adopts SOA comes up against this problem. How best to organize or partition my system into relevant pieces?
For me, Roger Sessions is talking the most sense around this topic, and guys like Microsoft are listening in a big way.
The paper that changed my thinking on this can be found at http://www.objectwatch.com/whitepapers/ABetterPath-Final.pdf, but I really recommend his book Simple Architectures for Complex Enterprises.
In that book he introduces equivalence relations from set theory and how they relate to the partitioning of service contracts.
In a nutshell,
The rules for formulating partitions can be summarized in five laws:
1. Partitions must be true partitions.
   a. Items live in one partition only, ever.
2. Partitions must be appropriate to the problem at hand.
   a. Partitions only minimize complexity when they are appropriate to the problem at hand, e.g. a clothing store organized by color would have little value to customers looking for what they want.
3. The number of subsets must be appropriate.
   a. Studies show that there seems to be an optimum number of items in a subset. Adding more subsets, thus reducing the number of items in each subset, has very little effect on complexity, but reducing the number of subsets, thus increasing the number of elements in each subset, seems to add to complexity. The number seems to sit in the range 3 – 12, with 3 – 5 being optimal.
4. The size of the subsets must be roughly equal.
   a. The size of the subsets and their importance in the overall partition must be roughly equivalent.
5. The interaction between the subsets must be minimal and well defined.
   a. A reduction in complexity is dependent on minimizing both the number and nature of interactions between subsets of the partition.
Do not stress too much if you get it wrong at first; the SOA Manifesto tells us we should value evolutionary refinement over pursuit of initial perfection.
Good luck
With SOA, the hardest part is deciding on your vertical slices of functionality.
The general principles are...
1) You shouldn't have multiple services talking to the same table. You need to create one service that encompasses an area of functionality and then be strict by preventing any other service from touching those same tables.
2) In contrast to this, you also want to keep each vertical slice as narrow as it can be (but no narrower!). If you can avoid complex, deep object graphs, all the better.
How you slice your functionality depends very much on your own comfort level. For example, if you have a relationship between your "Article" and your "Author", you will be tempted to create an object graph that represents an "Author", which contains a list of "Articles" written by the author. You would actually be better off having an "Author" object, delivered by "AuthorService" and the ability to get "Article" object from the "ArticleService" based simply on the AuthorId. This means you don't have to construct a complete author object graph with lists of articles, comments, messages, permissions and loads more every time you want to deal with an Author. Even though NHibernate would lazy-load the relevant parts of this for you, it is still a complicated object graph.
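The Author/Article split above might look like the following. This is only a sketch: the in-memory dictionaries stand in for the tables each service owns, and the method names (`add`, `get`, `get_by_author`) are hypothetical, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    author_id: int
    name: str            # flat object: no embedded list of articles


@dataclass(frozen=True)
class Article:
    article_id: int
    author_id: int       # plain id reference instead of an Author object
    title: str


class AuthorService:
    """Sole owner of author storage; no other service touches it."""

    def __init__(self):
        self._authors = {}

    def add(self, author):
        self._authors[author.author_id] = author

    def get(self, author_id):
        return self._authors[author_id]


class ArticleService:
    """Sole owner of article storage; callers look articles up by AuthorId."""

    def __init__(self):
        self._articles = []

    def add(self, article):
        self._articles.append(article)

    def get_by_author(self, author_id):
        return [a for a in self._articles if a.author_id == author_id]


authors, articles = AuthorService(), ArticleService()
authors.add(Author(1, "Ann"))
articles.add(Article(10, 1, "Slicing services"))
articles.add(Article(11, 2, "Someone else's post"))

author = authors.get(1)                              # cheap: no graph to build
titles = [a.title for a in articles.get_by_author(author.author_id)]
print(titles)  # ['Slicing services']
```

The client fetches articles only when it needs them, so dealing with an Author never drags the rest of the graph along.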
I'm working on an information system (in C#) that, while my users use it, gathers statistical data on which pieces of information (tables and records) each user requests the most, and which parts of the interface he/she uses most. I'm using this data to make the application adapt to the user's needs, both in the way the interface presents itself (e.g. tab/pane ordering) and by showing frequently viewed information higher in search results and suggestion lists.
What I'm looking for is an algorithm/formula to determine the current 'hotness'/relevance of these objects for a specific user. A simple hit counter for each object won't be sufficient, because the user might view some information quite frequently for a period of time and then move on to something else, making the old information less relevant. So I think my algorithm also needs some sort of sliding/historical principle to account for the changing popularity of the objects in the application over time.
So, the question is:
Does anybody have some sort of algorithm that accounts for that 'popularity over time' ?
Preferably with some explanation on the parameters :)
PS I've looked at other posts like Popularity algorithm, but I couldn't quite port it to my specific case. Any help is appreciated.
Rather than trying to guess what the user wants, why not ask the user to design the layout of the information?
My Yahoo, as an example, allows the user to specify what types of information he or she wants to see, and where on the screen the information is placed.
Your statistical information could be used to make suggestions to the user of where to place the information on the screen. Basically, the system could suggest that the most accessed information over the last month be placed on the upper left. But ideally, each user should decide which layout of the information makes the most sense for him or her.
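If the suggestions are driven by "most accessed recently", one option that avoids a hard one-month cut-off is an exponentially decayed hit counter: every hit adds 1, and the accumulated score halves after each half-life. This is a sketch of that general technique, not something taken from the posts above; the class name, the `half_life_days` parameter, and the use of days as the time unit are assumptions.

```python
import math


class DecayingCounter:
    """Hit counter whose score halves every `half_life_days` without new hits."""

    def __init__(self, half_life_days=7.0):
        self.decay_rate = math.log(2) / half_life_days
        self.score = 0.0
        self.last_update = None   # time of the last hit, in days

    def _decayed(self, now):
        """Score as it stands at time `now`, after decay since the last hit."""
        if self.last_update is None:
            return 0.0
        elapsed = now - self.last_update
        return self.score * math.exp(-self.decay_rate * elapsed)

    def hit(self, now):
        """Register one view at time `now` (days since an arbitrary start)."""
        self.score = self._decayed(now) + 1.0
        self.last_update = now

    def value(self, now):
        """Current hotness; objects can be sorted on this value."""
        return self._decayed(now)


old_favourite = DecayingCounter(half_life_days=7.0)
for day in range(5):              # viewed daily during the first week
    old_favourite.hit(float(day))

new_interest = DecayingCounter(half_life_days=7.0)
new_interest.hit(29.0)            # viewed twice, but recently
new_interest.hit(30.0)

today = 30.0
print(old_favourite.value(today) < new_interest.value(today))  # True
```

The only parameter is the half-life: a short one makes the ranking follow the user's current focus closely, a long one makes it behave more like a plain hit counter.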
This was a hard question for me to summarize so we may need to edit this a bit.
About four years ago, we had to translate our asp.net application for our clients in Mexico. Extensibility and scalability were not that much of a concern at the time (oh yes, I just said those dreadful words) because we only have U.S. and Mexican customers.
Rather than use resource files, we replaced every single piece of static text in our application with some type of server control (an ASP.NET Label, for example). We store each and every English word in a SQL database. We have added the ability to translate the English text into another language and can also add cultural overrides. For example, hello can be translated to ¡hola! in one language and overridden to ¡bueno! in a different culture. The business has full control over these translations because we built management utilities for them to control everything. The translation kicks in when we detect that the user has a browser culture other than en-us. Every form descends from a base form that iterates through each server control and executes a translation (translation data is stored as a datatable in an application variable for a culture). I'm still amazed at how fast the control iteration is.
The problem
The business is very happy with how the translations work. In addition to the static content that I mentioned above, the business now wants to have certain data translated as well. System notes are a good example of a translation they want. Example "Sent Letter #XXXX to Customer" - the business wants the "Sent Letter to Customer" text translated based on their browser culture.
I have read a couple of other posts on SO that talk about localization but they don't address my problem. How do you translate a phrase that is dynamically generated? I could easily read the English text and translate "Sent", "Letter", "to" and "Customer", but I guarantee that it will look stupid to the end user because it's a phrase. The dynamic part of the system-generated note would screw up any look-ups that we perform on the phrase if we stored the phrase in English, less the dynamic text.
One thought I had... We don't have a table of system-generated note types. I suppose we could create one that had placeholders for dynamic data, and the translation engine would ignore the placeholder markers. The problem with this approach is that our SQL Server database is a replica of an old Pick database and we don't really know all the types of system-generated phrases (they are deep in the Pick code base, in subroutines, control files, etc.). Things like notes, ticklers, and payment rejection reasons are all stored differently. Trying to normalize this data has proven difficult. It would be a huge effort to go back and identify and change every Pick program that generates a message.
This question is very close; but I'm not dealing with just system-generated status messages but rather an infinite number of phrases and types of phrases with no central generation mechanism.
Any ideas?
The lack of a "bottleneck" -- what you identify as the (missing) "central generation mechanism" -- is the architectural problem in this situation. Ideally, rearchitecting to put such a bottleneck in place (so you can keep using your general approach with a database of culture-appropriate renditions of messages, just with "placeholders" for e.g. the #XXXX in your example) would be best.
If that's just unfeasible, you can place the "bottleneck" at the other end of the pipe -- when a message is about to be emitted. At that point, or few points, you need to try and match the (English) string that's about to be emitted with a series of well-crafted regular expressions (with "placeholders" typically like (.*?)...) and thereby identify the appropriate key for the DB lookup. Yes, that still is a lot of work, but at least it should be feasible without the issues you mention wrt old translated pick code.
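The output-side "bottleneck" can be sketched as follows. It assumes the English message is available as a string just before it is emitted. The pattern list, the translation table, the message keys, and the Spanish wording are all hypothetical; only the "Sent Letter #XXXX to Customer" shape comes from the question.

```python
import re

# One (key, pattern) entry per known message shape; (.*?) marks dynamic data.
PATTERNS = [
    ("note.sent_letter", re.compile(r"^Sent Letter #(.*?) to Customer$")),
    ("note.payment_rejected", re.compile(r"^Payment rejected: (.*?)$")),
]

# (key, culture) -> translated template; {0}, {1}, ... receive the captured data.
TRANSLATIONS = {
    ("note.sent_letter", "es-MX"): "Se envió la carta #{0} al cliente",
}


def translate(message, culture):
    """Translate a generated English message, or return it unchanged."""
    for key, pattern in PATTERNS:
        match = pattern.match(message)
        if match is None:
            continue
        template = TRANSLATIONS.get((key, culture))
        if template is not None:
            return template.format(*match.groups())
        break  # shape recognised, but no translation for this culture yet
    return message  # fall back to the English text


print(translate("Sent Letter #1234 to Customer", "es-MX"))
# Se envió la carta #1234 al cliente
print(translate("Some unknown note", "es-MX"))
# Some unknown note
```

Logging the messages that fall through to the English fallback gives a running inventory of the phrase types that still need a pattern.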
We use the technique you propose, with insertion points:
"Sent letter #{0:Letter Num} to Customer {1:Customer Full Name}"
Which might be (in reverse Pig Latin, say):
"Ustomercay {1:Customer Full Name} asway entsay etterlay #{0:Letter Num}"
Note that this handles cases where the target language reverses the order of the insertions, etc. It does not handle subtleties like first, second, etc., which have to be handled with application logic or more phrases:
"This is your {0:first, second, third} warning"
In a pinch I suppose you could try something like foisting the job off onto Google if you don't have a translation on hand for a particular phrase, and stashing the translation for later.
Stashing the translations for later provides both a data collection point for building a message catalog and a rough (if sometimes laughably wonky) dynamically built starter set of translations. Once you begin the process, track which translations have been reviewed and how frequently each has been hit. Frequently hit machine translations can then be reviewed and refined.
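A minimal sketch of that stash, assuming the machine translation step is some callable you supply. The `machine_translate` parameter is hypothetical and is not a real Google API call, and the in-memory dictionary stands in for a database table.

```python
class TranslationCache:
    """Stores machine translations with a hit count and a reviewed flag."""

    def __init__(self, machine_translate):
        # Hypothetical callable(text, culture) -> str supplied by the caller.
        self._machine_translate = machine_translate
        self._entries = {}   # (text, culture) -> entry dict

    def translate(self, text, culture):
        """Return the stored translation, creating it on first use."""
        key = (text, culture)
        entry = self._entries.get(key)
        if entry is None:
            entry = {
                "translation": self._machine_translate(text, culture),
                "reviewed": False,
                "hits": 0,
            }
            self._entries[key] = entry
        entry["hits"] += 1
        return entry["translation"]

    def mark_reviewed(self, text, culture, corrected):
        """Replace a machine translation with a human-checked one."""
        entry = self._entries[(text, culture)]
        entry["translation"] = corrected
        entry["reviewed"] = True

    def review_queue(self):
        """Unreviewed (text, culture) keys, most frequently hit first."""
        pending = [(k, e) for k, e in self._entries.items() if not e["reviewed"]]
        pending.sort(key=lambda item: item[1]["hits"], reverse=True)
        return [key for key, _ in pending]


# Stand-in translator, used here only so the example runs.
cache = TranslationCache(lambda text, culture: f"[{culture}] {text}")
cache.translate("Payment received", "es-MX")
cache.translate("Account closed", "es-MX")
cache.translate("Account closed", "es-MX")
print(cache.review_queue())
# [('Account closed', 'es-MX'), ('Payment received', 'es-MX')]
```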
Dynamic machine translation is not suitable for a product that you actually expect people to pay money for. The only way to do it is with static templates containing insertion points (as Cade Roux has demonstrated in his answer).
There's no getting around a thorough refactoring of your code to make this feasible. The alternative is to do nothing with those phrases (which is what you're doing now, and it's working out okay, right?). Usually no translation is better than embarrassingly bad translation.
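For reference, static templates with insertion points can be rendered with a very small helper. This sketch assumes the `{index:description}` placeholder syntax shown in the earlier answer, where the text after the colon is only a hint for translators; the `render` function and the sample values are hypothetical.

```python
import re

# {0:Letter Num} -> group 1 is the argument index; the description is ignored.
PLACEHOLDER = re.compile(r"\{(\d+):[^}]*\}")


def render(template, *values):
    """Fill each {index:description} placeholder with values[index]."""
    return PLACEHOLDER.sub(lambda m: str(values[int(m.group(1))]), template)


english = "Sent letter #{0:Letter Num} to Customer {1:Customer Full Name}"
pig_latin = "Ustomercay {1:Customer Full Name} asway entsay etterlay #{0:Letter Num}"

print(render(english, 1234, "Ann Lee"))
# Sent letter #1234 to Customer Ann Lee
print(render(pig_latin, 1234, "Ann Lee"))
# Ustomercay Ann Lee asway entsay etterlay #1234
```

Because placeholders are addressed by index, a translation is free to reorder them, and the calling code passes the same arguments for every language.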