I want to use ML.NET multi-class classification in my current project, which collects error logs from one of my company's systems.
The point is to add tags to errors and, at some point in the future, train a model to predict and assign tags to incoming logs.
I'm using Model Builder and I can't see my table relations; I store all logs in one table, tags in another, and all the relations in a third one.
|Logs| <-- |LogId|TagId| --> |Tags|
My goal is to classify logs with a TagId based on the Logs table - is that possible, or do I have to have everything in one table?
Generally speaking, machine learning algorithms deal with fully 'denormalized' and 'prepared' data: every training example is a vector of floats ('features') plus one 'ground truth' value.
ML.NET helps with some of the typical pre-processing tasks, like text featurization, one-hot encoding, and rescaling/normalization, but it provides essentially no 'relational' functionality (no JOINs).
So, you should de-normalize / 'flatten' your data before you pass it to ML.NET.
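For example, one way to 'flatten' the three tables is to join them in code (or in a SQL view) before handing the rows to Model Builder or the ML.NET API. A minimal sketch, assuming hypothetical Log, Tag and LogTag classes that mirror the three tables; only LoadFromEnumerable is the actual ML.NET call:

using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

public class Log { public int Id { get; set; } public string Message { get; set; } }
public class Tag { public int Id { get; set; } public string Name { get; set; } }
public class LogTag { public int LogId { get; set; } public int TagId { get; set; } }

// One flattened training row: the log text (features) and the tag name (label).
public class TaggedLog
{
    public string Message { get; set; }
    public string TagName { get; set; }
}

public static class TrainingDataBuilder
{
    public static IDataView Build(MLContext ml,
        IEnumerable<Log> logs, IEnumerable<Tag> tags, IEnumerable<LogTag> logTags)
    {
        // 'Flatten' the three tables into one row per (log, tag) pair before ML.NET sees them.
        var flattened =
            from lt in logTags
            join log in logs on lt.LogId equals log.Id
            join tag in tags on lt.TagId equals tag.Id
            select new TaggedLog { Message = log.Message, TagName = tag.Name };

        return ml.Data.LoadFromEnumerable(flattened);
    }
}

Each flattened row pairs the log text (the features) with a single tag name (the label), which is the shape a multi-class trainer expects.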
I'm currently trying to implement a table within my SQL database. I'm looking to create a table that can be used to check whether a user on my website has liked a post. The idea is to have a table with one axis listing the posts on the website and the other axis listing the user IDs, with each cell holding a binary value indicating whether the user has liked that post. I'm just wondering how I would implement this. I have been doing this in C# by creating classes and converting these into server-side code using Entity Framework 6.4.0.
Any help would be great.
What you are suggesting is not a normalized structure for your use case; it would, for example, require adding more columns to the table every time a post is added to the database (or a user, depending on whether you use rows or columns).
A typical database solution would be a bridge table, that represents the many to many relationship between posts and users.
Say table user_like_posts, with the following columns:
user_id -- foreign key to the "users" table
post_id -- foreign key to the "posts" table
You may want to add additional columns to the bridge table, like the timestamp of when the user liked the post, or the like.
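Since the question mentions Entity Framework 6, here is a minimal sketch of how such a bridge table could be mapped; the class and property names are illustrative assumptions, not a fixed design:

using System;
using System.Data.Entity;   // Entity Framework 6

// One row per (user, post) pair; its presence means "this user liked this post".
public class UserLikesPost
{
    public int UserId { get; set; }          // FK to the Users table
    public int PostId { get; set; }          // FK to the Posts table
    public DateTime LikedAt { get; set; }    // optional extra column, e.g. when the like happened
}

public class AppDbContext : DbContext
{
    public DbSet<UserLikesPost> UserLikesPosts { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        // Composite primary key stops the same user liking the same post twice.
        modelBuilder.Entity<UserLikesPost>()
            .HasKey(x => new { x.UserId, x.PostId });
    }
}

Checking whether a user has liked a post is then a single existence check on the composite key, e.g. context.UserLikesPosts.Any(l => l.UserId == userId && l.PostId == postId).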
Will every user have an opinion on every post? If not, then you don't have the data you described. If users and posts are not related one-to-one, then you have a simple relation: for each post that a user likes (or dislikes?), there is an entry for that user:
Likes/Dislikes Table:
User identifier
Post identifier
The binary value that indicates like or dislike
If the table only indicates 'likes' then you don't need the last column.
A design like this would work even if every user and every post is in this table. The table might get large in a hurry and keep growing every time you introduce a new post. But if this table only includes actual 'likes' (and/or 'dislikes') it should be manageable.
For a class you just have an enumerable that has the posts 'liked' (and possibly another that indicates the posts 'disliked.')
Think about what you are trying to represent. Ask yourself questions. Don't just latch on to an idea and try to 'do' it.
Will every user have an opinion of every post?
Do you need to store both 'likes' and 'dislikes?'
Can there be a 'neutral' opinion on a post?
Can users change their opinions?
You can only discover the correct data structure by asking and answering all the questions that matter to your situation (my list is not exhaustive - it is only an example.)
I'm developing a tax calculation system that applies various taxes based on a set of supplied criteria.
The information frequently changes, so I'm trying to create a way to store all these logic rules in the database.
As you can imagine, there is a lot of compound logic involved in applying taxes.
For example, a tax might only apply if A is true, B is less than 100, and C equals 7.
My current design is terrible.
I have a few database columns for very common criteria filtering, such as location and tax year.
For more complex logic, I have a column that holds JavaScript, and in code, I run an interpreter to filter the results. Performance and maintainability suck.
I'd like to improve this design by making the logic entirely data-driven, but I'm having trouble figuring out how to correctly represent this logic within a relational database. What is a good way to model this logic in the database?
I have worked on a similar issue for over a year now, for a manufacturing cost generation application. It takes in loads of product design data as input and, based on the design and other inventory considerations such as quantity, bulk purchase options, part supplier, electrical ratings, etc., produces a list of direct materials, labour and costs.
I knew from the outset that what I needed was some kind of query language rather than a computational one, and it had to be scripted, not compiled. But I have yet to find a perfect solution:
METHOD 1 - SQL
I created tables that represent my objects and columns that represent properties, and then manually typed all the required SQL SELECT statements into an item_rules table. What I did was first save the object into the database, then:
rules = SELECT * FROM item_rules
foreach(rules as _rule)
{
count = SELECT COUNT(*) FROM (_rule[select_statement]) as T1
if(count > 0) itemlist.add(_rule[item_that_satisfy_rule])
}
What it does is take each rule in the item_rules table and run it against my object, which is now in the tables, e.g. SELECT * FROM my_object WHERE A=5 AND B>10. If the rule matches, I get a positive count, and then I know I should include the corresponding rule item in my items list.
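For what it's worth, a slightly more concrete C#/ADO.NET version of that loop might look like the sketch below; the table and column names (item_rules, select_statement, item_that_satisfy_rule) follow the description above, and everything else is an assumption:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class RuleRunner
{
    public static List<string> GetApplicableItems(string connectionString)
    {
        var items = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            // Load every rule: its stored SELECT statement and the item it maps to.
            var rules = new List<(string Sql, string Item)>();
            using (var cmd = new SqlCommand(
                "SELECT select_statement, item_that_satisfy_rule FROM item_rules", conn))
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    rules.Add((reader.GetString(0), reader.GetString(1)));
            }

            // Run each rule's SELECT against the already-saved object; a non-empty
            // result means the rule matches. The rule SQL comes from a trusted table.
            foreach (var rule in rules)
            {
                using (var cmd = new SqlCommand(
                    "SELECT COUNT(*) FROM (" + rule.Sql + ") AS T1", conn))
                {
                    if ((int)cmd.ExecuteScalar() > 0)
                        items.Add(rule.Item);
                }
            }
        }
        return items;
    }
}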
METHOD 2 - NCALC
Instead of storing the queries in SQL format, I found the NCalc open-source expression parsing library. NCalc takes a string expression and optional variables and computes a result. The string expressions can be stored as plain text on the filesystem.
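As an illustration, evaluating one stored rule with NCalc could look roughly like this; the expression text mirrors the example from the question ("A is true, B is less than 100, C equals 7"), and the class and method names are just made up:

using NCalc;

public static class TaxRuleEvaluator
{
    // Evaluates one stored rule expression, e.g. "A == true && B < 100 && C == 7".
    public static bool RuleApplies(string storedExpression, bool a, double b, int c)
    {
        var expression = new Expression(storedExpression);
        expression.Parameters["A"] = a;
        expression.Parameters["B"] = b;
        expression.Parameters["C"] = c;
        return (bool)expression.Evaluate();
    }
}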
METHOD 3 - EXCEL
Excel is actually a very good piece of software for doing data lookups. You can create the formulas in Excel, feed data from your application into Excel, and let Excel run the formulas to give you the results. The advantage is that many people know how to use Excel, so different people can maintain it.
But like I say, none of these are perfect for me. I am just sharing, and hopefully we can get better recommendations.
If you are to go with Jake's approach, you can use dynamic SQL too.
I am working on a large project that contains many reference / lookup type tables. This is maybe not the correct place to ask this question, but I would like to find out the name that people give in English to these kinds of tables:
The tables contain data such as status_code, status_type. They are preloaded and the data in them will probably never change.
Please excuse my not knowing this, but English is not my first language and I need to give a presentation about these kinds of tables.
People give them different names - LookUpData, ReferenceData, StaticData, etc. - depending upon the properties of the data. Does this answer your question? If not, you probably need to elaborate a bit more.
The most often used English name I've encountered is 'static data'. This indicates that the data does not change very often.
Aside: one interesting aspect of static data is that it is a good candidate for caching.
Where the tables contain data that's interesting to the application but does not reflect the outside world - status codes, for instance - I tend to call them enumeration tables.
Where the table contains values that change rarely, but do reflect the outside world - cities, currencies, etc. - I tend to call them lookup tables.
Where tables contain "friendly" values - e.g translating ISO country/locale codes - I tend to call them reference tables.
We tend to use the term "Reference Data". But there isn't a standard term, it varies a lot in my experience. I haven't heard "Static Data" before, but I like it.
Let me first describe the situation. We host many Alumni events over the course of each year and provide online registration forms for each event. There is a large chunk of data that is common for each event:
An Event with dates, times, managers, internal billing info, etc.
A Registration record with info about the payment and total amount charged per form submission
Bio/Demographic and alumni data about the 1 or more attendees (name, address, degree, etc.)
We store all of the above data within columns in tables as you would expect.
The trouble comes with the 'extra' fields we are asked to put on the forms. Maybe it is a dinner and there is a Veggie or Carnivore option, perhaps there is lodging and there are bed or smoking options, or perhaps there is an optional transportation option. There are tons of weird little "can you add this to the form?" types of requests we receive.
Currently, we JSONify any non-standard data and store it all in one column (per attendee) called 'extras'. We can read this data out in code but it is not well suited to querying. Our internal staff would like to generate a quick report on Veggie dinners needed for instance.
Other than creating a separate table for each form that holds the specific 'extra' data items, are there any other approaches that could make my life (and reporting) easier? Is anyone working in a similar environment?
This is actually one of the toughest problems to solve efficiently. The SQL Server Customer Advisory Team has dedicated a white paper to the topic, which I highly recommend you read: Best Practices for Semantic Data Modeling for Performance and Scalability.
You basically have 3 options:
semantic database (entity-attribute-value)
XML column
sparse columns
Each solution comes with ups and downs. Off the top of my head I'd say XML is probably the one that gives you the best balance of power and flexibility, but the optimal solution really depends on many factors: data set sizes, the frequency at which new attributes are created, the actual process (human operators) that creates, populates and uses these attributes, and not least your team's skill set (some might fare better with an EAV solution, some with an XML solution). If the attributes are created and managed under a central authority and adding new attributes is a reasonably rare event, then sparse columns may be a better answer.
Well you could also have the following db structure:
Have a table to store custom attributes
AttributeID
AttributeName
Have a mapping table between events and attributes with:
AttributeID
EventID
AttributeValue
This means you will be able to store custom information per event, and you will be able to reuse your attributes. You can include some metadata, such as
AttributeType
AllowBlankValue
to the attribute to handle it easily afterwards
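To make the shape concrete, here is a rough sketch of those two tables as C# entity classes; the names and types are assumptions drawn from the column list above:

// Illustrative shapes for the two tables described above; names and types are assumptions.
public class EventAttribute
{
    public int AttributeId { get; set; }
    public string AttributeName { get; set; }   // e.g. "DinnerChoice"
    public string AttributeType { get; set; }   // optional metadata, e.g. "text", "bool"
    public bool AllowBlankValue { get; set; }   // optional metadata
}

public class EventAttributeValue
{
    public int EventId { get; set; }            // FK to the Events table
    public int AttributeId { get; set; }        // FK to EventAttribute
    public string AttributeValue { get; set; }  // stored as text, interpreted via AttributeType
}

A report such as "how many Veggie dinners" then becomes a simple filter on AttributeName and AttributeValue.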
Have you considered using XML instead of JSON? The difference: XML is supported (as a dedicated data type) and has query integration ;)
Quick and dirty, but actually nice for querying: simply add new columns. It's not like the empty entries in the previous table should cost a lot.
A more database-y solution: you'll have something like an event ID in your table. You can link this to an n:m table connecting events to additional fields, and then store the additional field data in a table with additional_field_id, record_id (from the original table) and the actual value. This probably creates ugly queries, but seems politically correct in terms of database design.
I understand "NoSQL" (not only SQL ;)) databases like CouchDB let you store arbitrary fields per record, but since you're already on SQL Server, I guess that's not an option.
This is the solution that we first proposed in ASP.NET Forums (which later became Community Server), and that the ASP.NET team built a similar version of into the ASP.NET 2.0 Membership system when they released it:
Property Bags on your domain objects
For example:
Event.Profile() or in your case, Event.Extras().
Basically, a property bag is a serialized collection of data stored in a name/value pair in a column (or columns). The ASP.NET 2.0 Membership went the route of storing names in a semi-colon delimited list, and values in the same:
Table: aspnet_Profile
Column: PropertyNames (separated by semi-colons, and has start index and end index)
Column: PropertyValues (separated by semi-colons, and only stores the string value)
The downside to that approach is that it is all strings and has to be parsed manually (even though the membership system does it for you automatically).
More recently, my approach has been to build FormCollection and NameValueCollection C# extension methods that automatically serialize the collections to XML. I store that XML in the table in its own column associated with that entity. I also have a deserializer C# extension method on XElement that deserializes that data back into the collection at runtime.
This gives you the power of actually querying those properties in XML via SQL (though that can be slow - always flatten out your read-only data).
The final note is on runtime querying: the general rule we follow is, if you are going to query a property of an entity in normal application logic, then move that property to an actual column on the table and create the appropriate indexes. If that data will never be queried directly (for example, via Linq-to-Sql or EF), then leave it in the XML property bag.
Property Bags gives you the power of extending your domain models however you like, without having to modify the db schema.
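A minimal sketch of what such serializer/deserializer extension methods might look like; the XML shape and the method names are assumptions, not the exact code referred to above:

using System.Collections.Specialized;
using System.Xml.Linq;

public static class PropertyBagExtensions
{
    // Serialize a name/value collection into an XML fragment that fits in one column.
    public static XElement ToXml(this NameValueCollection bag, string rootName = "extras")
    {
        var root = new XElement(rootName);
        foreach (string key in bag.AllKeys)
            root.Add(new XElement("property",
                new XAttribute("name", key),
                bag[key]));
        return root;
    }

    // Deserialize the XML fragment back into a name/value collection at runtime.
    public static NameValueCollection ToPropertyBag(this XElement root)
    {
        var bag = new NameValueCollection();
        foreach (var property in root.Elements("property"))
            bag.Add((string)property.Attribute("name"), property.Value);
        return bag;
    }
}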
I have a project that requires user-defined attributes for a particular object at runtime (let's say a Person object in this example). The project will have many different users (1000+), each defining their own unique attributes for their own sets of Person objects.
(E.g. user #1 will have a set of defined attributes which will apply to all Person objects 'owned' by this user. Multiply this by 1000 users, and that's the bottom-line minimum number of users the app will work with.) These attributes will be used to query the Person objects and return results.
I think these are the possible approaches I can use. I will be using C# (and any version of .NET 3.5 or 4) and have free rein over the datastore. (I have MySQL and MSSQL available, although I have the freedom to use any software, as long as it fits the bill.)
Have I missed anything, or made any incorrect assumptions in my assessment?
Out of these choices - what solution would you go for?
Hybrid EAV object model. (Define the database using a normal relational model, and have a 'property bag' table for the Person table.)
Downsides: many joins per query. Poor performance. Can hit a limit on the number of joins / tables used in a query.
I've knocked up a quick sample that has a Subsonic 2.x-esque interface:
Select().From().Where ... etc
Which generates the correct joins, then filters and pivots the returned data in C# to return a DataTable configured with the correctly typed data set.
I have yet to load-test this solution. It's based on the EAV advice in this Microsoft whitepaper:
SQL Server 2008 RTM Documents: Best Practices for Semantic Data Modeling for Performance and Scalability
Allow the user to dynamically create / alter the object's table at run-time. This solution is what I believe NHibernate does in the background when using dynamic properties, as discussed here:
http://bartreyserhove.blogspot.com/2008/02/dynamic-domain-mode-using-nhibernate.html
Downsides:
As the system grows, the number of columns defined will get very large, and may hit the max number of columns. If there are 1000 users, each with 10 distinct attributes for their 'Person' objects, then we'd need a table holding 10k columns. Not scalable in this scenario.
I guess I could allow a person attribute table per user, but if there are 1000 users to start, that's 1000 tables plus the other 10 odd in the app.
I'm unsure whether this would be scalable - but it doesn't seem so. Someone please correct me if I am wrong!
Use a NoSQL datastore, such as CouchDb / MongoDb
From what I have read, these aren't yet proven in large-scale apps, are based on strings, and are very early in their development phase. If I am incorrect in this assessment, can someone let me know?
http://www.eflorenzano.com/blog/post/why-couchdb-sucks/
Using XML column in the people table to store attributes
Drawbacks - no indexing on querying, so every column would need to be retrieved and queried to return a resultset, resulting in poor query performance.
Serializing an object graph to the database.
Drawbacks - no indexing on querying, so every column would need to be retrieved and queried to return a resultset, resulting in poor query performance.
C# bindings for BerkeleyDB
From what I read here: http://www.dinosaurtech.com/2009/berkeley-db-c-bindings/
Berkeley Db has definitely proven to be useful, but as Robert pointed out – there is no easy interface. Your entire wOO wrapper has to be hand coded, and all of your indices are hand maintained. It is much more difficult than SQL / linq-to-sql, but that’s the price you pay for ridiculous speed.
Seems like a large overhead - however, if anyone can provide a link to a tutorial on how to maintain the indices in C#, it could be a goer.
SQL / RDF hybrid.
Odd that I didn't think of this before. Similar to option 1, but instead of a 'property bag' table, just a cross-reference to an RDF store?
Querying would then involve two steps: query the RDF store for people with the correct attributes to return the Person object(s), then use the IDs of those Person objects in the SQL query to return the relational data. Extra overhead, but could be a goer.
The ESENT database engine on Windows is used heavily for this kind of semi-structured data. One example is Microsoft Exchange which, like your application, has thousands of users where each user can define their own set of properties (MAPI named properties). Exchange uses a slightly modified version of ESENT.
ESENT has a lot of features that enable applications with large meta-data requirements: each ESENT table can have about ~32K columns defined; tables, indexes and columns can be added at runtime; sparse columns don't take up any record space when not set; and template tables can reduce the space used by the meta-data itself. It is common for large applications to have thousands of tables/indexes.
In this case you can have one table per user and create the per-user columns in the table, creating indexes on any columns that you want to query. That would be similar to the way that some versions of Exchange store their data. The downside of this approach is that ESENT doesn't have a query engine so you will have to hand-craft your queries as MakeKey/Seek/MoveNext calls.
A managed wrapper for ESENT is here:
http://managedesent.codeplex.com/
In an EAV model you don't have to have many joins, as you only need the joins required for the query's filtering. For the resultset, return the property entries as a separate rowset.
That is what we are doing in our EAV implementation.
For example, a query might return persons with extended property 'Age' > 18:
Properties table:
1 Age
2 NickName
First resultset:
PersonID Name
1 John
2 Mary
Second resultset:
PersonID PropertyID Value
1 1 24
1 2 'Neo'
2 1 32
2 2 'Pocahontas'
For the first resultset, you need an inner join on the 'Age' extended property to query the basic Person entity part:
select p.ID, p.Name from Persons p
join PersonExtendedProperties pp
on p.ID = pp.PersonID
where pp.PropertyName = 'Age'
and pp.PropertyValue > 18 -- probably need to convert to integer here
For the second resultset, we make an outer join of the first resultset with the PersonExtendedProperties table to get the rest of the extended properties. It's a 'narrow' resultset: we do not pivot the properties in SQL, so we don't need multiple joins here.
Actually, we use separate tables for different data types, to avoid data type conversion and to keep the extended properties indexed and easily queryable.
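A hedged sketch of fetching the two resultsets from C#: the table and column names follow the example above, while storing values as strings and filtering with TRY_CAST (SQL Server 2012+) are assumptions on top of it:

using System.Collections.Generic;
using System.Data.SqlClient;

public static class PersonEavQuery
{
    // Returns the two resultsets described above: the filtered Person rows and a
    // 'narrow' list of all their extended properties (one row per property).
    public static (List<(int Id, string Name)> Persons,
                   List<(int PersonId, int PropertyId, string Value)> Properties)
        LoadAdults(string connectionString)
    {
        // First resultset: base Person rows filtered on the 'Age' extended property.
        const string personsSql = @"
            SELECT p.ID, p.Name
            FROM Persons p
            JOIN PersonExtendedProperties pp ON p.ID = pp.PersonID
            WHERE pp.PropertyName = 'Age'
              AND TRY_CAST(pp.PropertyValue AS INT) > 18";

        // Second resultset: every extended property of those persons, not pivoted.
        const string propertiesSql = @"
            SELECT pp.PersonID, pp.PropertyID, pp.PropertyValue
            FROM PersonExtendedProperties pp
            WHERE pp.PersonID IN (
                SELECT age.PersonID
                FROM PersonExtendedProperties age
                WHERE age.PropertyName = 'Age'
                  AND TRY_CAST(age.PropertyValue AS INT) > 18)";

        var persons = new List<(int, string)>();
        var properties = new List<(int, int, string)>();

        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();

            using (var cmd = new SqlCommand(personsSql, conn))
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    persons.Add((reader.GetInt32(0), reader.GetString(1)));

            using (var cmd = new SqlCommand(propertiesSql, conn))
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    properties.Add((reader.GetInt32(0), reader.GetInt32(1), reader.GetString(2)));
        }

        return (persons, properties);
    }
}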
My recommendation:
Allow properties to be marked as indexable. Have a smallish hard limit on number of indexable properties, and on columns per object. Have a large hard limit on total column types in all objects.
Implement indexes as separate tables (one per index) joined with main table of data (main table has large unique key for object). (Index tables can then be created/dropped as required).
Serialize the data, including the index columns, and also put the indexed properties in first-class relational columns in their dedicated index tables. Use JSON instead of XML to save space in the table. Enforce a short-column-name policy (or a long display name and short stored name policy) to save space and increase performance.
Use quarks for field identifiers (but only in the main engine, to save RAM and speed up some read operations -- don't rely on quark pointer comparison in all cases).
My thought on your options:
1 is possible. Performance will clearly be lower than if the field ID columns were not stored.
2 is a no in general; DB engines are not all happy about dynamic schema changes. But a possible yes if your DB engine handles this well.
3 Possible.
4 Yes though I'd use JSON.
5 seems like 4, only less optimized?
6 sounds good; I would go with it if you're happy to try something new, and happy with its reliability and performance, but usually one would want to go with more mainstream technology. I'd also like to reduce the number of engines involved in coordinating a transaction to fewer than would be true here.
Edit: But of course, though I've recommended something, there can be no general right answer here -- profile various data models and approaches with your data to see what runs best for your application.
Edit: Changed last edit wording.
Assuming you can place a limit, N, on how many custom attributes each user can define, just add N extra columns to the Person table. Then have a separate table where you store per-user metadata describing how to interpret the contents of those columns for each user. Similar to #1 once you've read the data in, but no joins are needed to pull in the custom attributes.
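A small sketch of the idea, with N = 3 generic columns and a per-user map describing what each column means (all names hypothetical):

using System.Collections.Generic;

// Person row with N generic 'overflow' columns (N = 3 here just for brevity).
public class PersonRow
{
    public int OwnerUserId { get; set; }
    public string Name { get; set; }
    public string Custom1 { get; set; }
    public string Custom2 { get; set; }
    public string Custom3 { get; set; }
}

// Per-user metadata: which custom column holds which user-defined attribute.
public class UserAttributeMap
{
    // e.g. for user #42: { "FavouriteColour" -> 1, "ShoeSize" -> 2 }
    public Dictionary<string, int> ColumnByAttributeName { get; } =
        new Dictionary<string, int>();

    public string GetAttribute(PersonRow row, string attributeName)
    {
        switch (ColumnByAttributeName[attributeName])
        {
            case 1: return row.Custom1;
            case 2: return row.Custom2;
            case 3: return row.Custom3;
            default: return null;
        }
    }
}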
For a problem similar to yours, we used the "XML column" approach (the fourth one in your survey of methods). But you should note that many databases (DBMSs) support indexes on XML values.
I recommend you use one table for Person which contains a single XML column along with the other common columns. In other words, design the Person table with columns that are common to all person records, and add one XML column for the dynamic and differing attributes.
We are using Oracle. It supports indexes on its XML type. Two types of indexes are supported: 1) XMLIndex, for indexing elements and attributes within an XML document, and 2) Oracle Text Index, for enabling full-text search in the text fields of the XML.
For example, in Oracle you can create an index such as:
CREATE INDEX index1 ON table_name (XMLCast(XMLQuery ('$p/PurchaseOrder/Reference'
PASSING XML_Column AS "p" RETURNING CONTENT) AS VARCHAR2(128)));
and XML queries are supported in SELECT statements:
SELECT count(*) FROM purchaseorder
WHERE XMLCast(XMLQuery('$p/PurchaseOrder/Reference'
PASSING OBJECT_VALUE AS "p" RETURNING CONTENT)
AS INTEGER) = 25;
As far as I know, other databases such as PostgreSQL and MS SQL Server (but not MySQL) support similar index models for XML values.
see also:
http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb_indexing.htm#CHDEADIH