I'm currently trying to implement a table within my SQL database. I'm looking to create a table that can be used to check whether a user on my website has liked a post. The idea is to have a table with one axis iterating over the posts on the website and the other axis iterating over the userID values, and to hold in each cell a binary value indicating whether that user has liked that post. I'm just wondering how I would implement this. I have been doing this in C# by creating classes and converting these into server-side code using Entity Framework 6.4.0.
Any help would be great.
What you are suggesting is not a normalized structure for your use case; it would, for example, require adding more columns to the table every time a post is added to the database (or a user, depending on whether you use rows or columns).
A typical database solution would be a bridge table that represents the many-to-many relationship between posts and users.
Say table user_like_posts, with the following columns:
user_id -- foreign key to the "users" table
post_id -- foreign key to the "posts" table
You may want to add additional columns to the bridge table, like the timestamp when the user liked the post, or the like.
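A minimal sketch of such a bridge table, written in Python with the standard-library sqlite3 module so it can be run as-is. The users and posts tables, the liked_at column, and the like/has_liked helper names are assumptions for illustration; only user_like_posts, user_id and post_id come from the answer above, and the real schema would be created through Entity Framework rather than by hand.

```python
import sqlite3

# Hypothetical schema: users/posts tables and liked_at are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE user_like_posts (
    user_id  INTEGER NOT NULL REFERENCES users(user_id),
    post_id  INTEGER NOT NULL REFERENCES posts(post_id),
    liked_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id, post_id)   -- at most one like per user per post
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts VALUES (10, 'first post'), (11, 'second post');
""")


def like(user_id, post_id):
    # INSERT OR IGNORE turns a repeated like into a no-op instead of an error.
    conn.execute(
        "INSERT OR IGNORE INTO user_like_posts (user_id, post_id) VALUES (?, ?)",
        (user_id, post_id),
    )


def has_liked(user_id, post_id):
    # A row exists only when the user has liked the post.
    row = conn.execute(
        "SELECT 1 FROM user_like_posts WHERE user_id = ? AND post_id = ?",
        (user_id, post_id),
    ).fetchone()
    return row is not None


like(1, 10)
like(1, 10)  # liking twice leaves a single row
print(has_liked(1, 10), has_liked(2, 10))
```

The composite primary key is one way to guarantee a single row per (user, post) pair; the table grows only with actual likes, not with users times posts.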
Will every user have an opinion on every post? If not then you don't have the data you described. If users and posts are not related one to one then you have a simple relation. For each post that a user likes (or dislikes?) there is an entry for that user:
Likes/Dislikes Table:
User identifier
Post identifier
The binary value that indicates like or dislike
If the table only indicates 'likes' then you don't need the last column.
A design like this would work even if every user and every post is in this table. The table might get large in a hurry and keep growing every time you introduce a new post. But if this table only includes actual 'likes' (and/or 'dislikes') it should be manageable.
For a class you just have an enumerable that has the posts 'liked' (and possibly another that indicates the posts 'disliked.')
Think about what you are trying to represent. Ask yourself questions. Don't just latch on to an idea and try to 'do' it.
Will every user have an opinion of every post?
Do you need to store both 'likes' and 'dislikes?'
Can there be a 'neutral' opinion on a post?
Can users change their opinions?
You can only discover the correct data structure by asking and answering all the questions that matter to your situation (my list is not exhaustive - it is only an example.)
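One possible set of answers to the questions above, sketched in Python with sqlite3: both likes and dislikes are stored, a missing row means "no opinion", and users can change their minds. The table name post_opinions and the helper functions are hypothetical, and a different set of answers would lead to a different structure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE post_opinions (
    user_id INTEGER NOT NULL,
    post_id INTEGER NOT NULL,
    liked   INTEGER NOT NULL CHECK (liked IN (0, 1)),  -- 1 = like, 0 = dislike
    PRIMARY KEY (user_id, post_id)
)""")


def set_opinion(user_id, post_id, liked):
    # INSERT OR REPLACE overwrites an earlier opinion by the same user.
    conn.execute(
        "INSERT OR REPLACE INTO post_opinions (user_id, post_id, liked) "
        "VALUES (?, ?, ?)",
        (user_id, post_id, int(liked)),
    )


def clear_opinion(user_id, post_id):
    # Deleting the row returns the user to the neutral state.
    conn.execute(
        "DELETE FROM post_opinions WHERE user_id = ? AND post_id = ?",
        (user_id, post_id),
    )


def opinion(user_id, post_id):
    # True = like, False = dislike, None = no opinion recorded.
    row = conn.execute(
        "SELECT liked FROM post_opinions WHERE user_id = ? AND post_id = ?",
        (user_id, post_id),
    ).fetchone()
    return None if row is None else bool(row[0])


set_opinion(1, 10, True)
set_opinion(1, 10, False)  # the user changed their mind
print(opinion(1, 10), opinion(2, 10))
```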
looking for examples/tutorial for custom user fields, not via EAV
EAV is going to be problematic for various reasons such as performance
there are many base entities/tables with over 100000 records each
there will likely be over a dozen attributes
the records are to be displayed in a flat ui grid incl. custom fields so flattening them would be an issue while maintaining performance
Looking at enabling this via DDL where all custom fields would go into a matching table such as
<tablename>_custom_<userid>
and all user attributes would map to a column each and all their metadata stored in a metadata table
the retrieval would be simpler where the query would simply be
select *
from <tablename> A, <tablename>_custom_<userid> B
where B.KeyField = A.KeyField --( perhaps using outer join, haven't gone that far yet )
Wondering if there are any gotchas down the road that i need to be aware of ?
of course any samples/pointers would be helpful to kickstart the effort
specifically would appreciate any advice on using DDL for SQL Server Compact 4
One technique I have seen used is to use a sort of 'hard-coded' EAV pattern. Don't hang up! It worked well with the dataset sizes you were talking about and didn't actually use EAV - it was only EAV-esque.
The idea is to have a set of tables to store these custom attributes, with some triggers (described below) on them. The custom attribute tables store metadata about each attribute (what table it goes with, data type, constraints, etc). You can get very fancy with this but I did not have the need.
The triggers on your meta-tables are there to re-generate views that rollup base+extension into first class objects within the DB. So instead of table person + employee extension table, you have an employee view that includes both. When you drop a new value into the custom attributes tables, the triggers will re-roll the views and include the new stuff. If you wanted to go nuts, you could also have the triggers re-write stored procedures as well. Depending on how your mid-tier code is structured, you would still be forced to re-code some, however this would be the case anyway should you be applying rules that read the data.
In testing, I found that for the relatively small # of records you're talking about, performance was somewhat slower but followed roughly the same pattern of degradation (2x the number of records, ~2x as slow).
-- edits --
How I saw it done, you had a table that represented your first class objects, so a row for 'person' and a row for 'employee,' etc. We'll call that FCO. Then you had a secondary table that stored which tables represented each FCO. We'll call that Srcs. For Person, there would be one row, which is the person table. For Employee, there would be two rows, the person table and the Employee extension. There is a third table, called Attribs, which stores the columns from the tables that constitute the FCO. For simplicity, we'll say Person has ID, Name and Address, and Employee has Hire Date and Department, and obviously PersonID referring back to the Person table. So, 2 rows in the FCO table (person and employee), 3 rows in the Srcs table, 8 rows in Attribs.
The view, we'll call it vw_Employee, selects PersonID, Name, Address, Hire Date, Department from the two tables. It is built by a SQL stored procedure we'll call OnMetadataChange.
This SP is fired (by trigger or batch process), and its purpose is to generate the CREATE VIEW statements. It will iterate through every First Class Object, collect which fields from which tables constitute the view, and issue a CREATE statement based on that. So OnMetadataChange produces a DROP and CREATE for each view: it generates a dynamic SQL statement that is executed once per entry in the FCO table. It is preferable to do this with triggers but not necessary. Hopefully your FCO definitions won't change too often, and when they do, there will probably be a code release as well. You can run your OnMetadataChange SP at that time.
The end result is a 2-layer database. The views constitute the First Class Object layer, which is meaningful to the application. The application only uses views. The tables constitute the 'physical' layer, which the application shouldn't care about. The meta-tables are essentially your mapping between the FCO layer and the physical layer. It takes some time to set it up, but it's quite effective, and gives you many of the benefits of EAV, while at the same time giving you the concrete benefits of 3nf tables (indexability, etc).
If you'd like I can throw some sample SQL out there.
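For readers who want something concrete, here is a rough sketch of the view-regeneration idea in Python with sqlite3. It is not the answerer's code: the meta-table column layout (seq, tbl, key_col, alias), the lowercase names, and the on_metadata_change function are assumptions, and SQLite stands in for SQL Server. The row counts match the description above (2 FCO rows, 3 source rows, 8 attribute rows).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Physical layer: base table plus extension table.
CREATE TABLE person   (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE employee (person_id INTEGER PRIMARY KEY REFERENCES person(id),
                       hire_date TEXT, department TEXT);

-- Meta-tables (hypothetical layout): first class objects, their source
-- tables, and the columns each view should expose.
CREATE TABLE fco     (fco TEXT PRIMARY KEY);
CREATE TABLE srcs    (fco TEXT, seq INTEGER, tbl TEXT, key_col TEXT);
CREATE TABLE attribs (fco TEXT, tbl TEXT, col TEXT, alias TEXT);

INSERT INTO fco VALUES ('person'), ('employee');
INSERT INTO srcs VALUES
    ('person',   0, 'person',   'id'),
    ('employee', 0, 'person',   'id'),
    ('employee', 1, 'employee', 'person_id');
INSERT INTO attribs VALUES
    ('person',   'person',   'id',         'person_id'),
    ('person',   'person',   'name',       'name'),
    ('person',   'person',   'address',    'address'),
    ('employee', 'person',   'id',         'person_id'),
    ('employee', 'person',   'name',       'name'),
    ('employee', 'person',   'address',    'address'),
    ('employee', 'employee', 'hire_date',  'hire_date'),
    ('employee', 'employee', 'department', 'department');

INSERT INTO person   VALUES (1, 'Ann', '1 Main St');
INSERT INTO employee VALUES (1, '2020-01-01', 'R&D');
""")


def on_metadata_change(conn):
    """Drop and re-create one view per first class object."""
    for (fco,) in conn.execute("SELECT fco FROM fco").fetchall():
        srcs = conn.execute(
            "SELECT tbl, key_col FROM srcs WHERE fco = ? ORDER BY seq", (fco,)
        ).fetchall()
        cols = conn.execute(
            "SELECT tbl, col, alias FROM attribs WHERE fco = ? ORDER BY rowid",
            (fco,),
        ).fetchall()
        base_tbl, base_key = srcs[0]
        # Identifiers come from the meta-tables, so this is dynamic SQL and
        # the metadata must be trusted.
        select_list = ", ".join(f'"{t}"."{c}" AS "{a}"' for t, c, a in cols)
        joins = "".join(
            f' JOIN "{t}" ON "{t}"."{k}" = "{base_tbl}"."{base_key}"'
            for t, k in srcs[1:]
        )
        conn.execute(f'DROP VIEW IF EXISTS "vw_{fco}"')
        conn.execute(
            f'CREATE VIEW "vw_{fco}" AS SELECT {select_list} '
            f'FROM "{base_tbl}"{joins}'
        )


on_metadata_change(conn)
print(conn.execute("SELECT * FROM vw_employee").fetchall())
```

Adding a custom field then amounts to altering the extension table, inserting one row into attribs, and re-running on_metadata_change; the application keeps reading vw_employee.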
Part of the problem you are having is that you are trying to store schema-less data in a SQL database, which is not its strength. There are three approaches that would make your life far easier:
1) Have a column which stores the serialized custom fields, in whatever format is most convenient. For example, this column could store xml. Upsides are that you can use SQL Server Compact and pulling back a record is trivial. Downsides are that you always have to pull/push the entire xml blob to do an update, and it is difficult to impossible to query on any custom fields.
2) Upgrade to SQL Server Express, and use XML columns. This is nearly the same as the first suggestion, except that any server ready version of SQL Server has native support for XML data. These columns can have indexes added and fields within the data can be used in queries.
3) Use a Schema-less Database, like MongoDB or CouchDB. These databases are all about storing schemaless data, so your custom fields will be no different than any other field. As such, you can index and query custom fields. Upsides are that custom data is incredibly easy to work with, downsides are that you would have to spend some time rethinking how you store data to fit within their model.
If you do not need to query based on custom fields, or if you can query custom fields within business logic, then the first option can work for you. In any other case, I would err towards something with more capabilities than compact. If cost is the deciding factor, both SQL Server Express and MongoDB are free.
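A small sketch of the first option, assuming JSON as the serialization format (the answer only says "whatever format is most convenient") and a hypothetical customer table. It shows both trade-offs mentioned above: an update pulls and pushes the whole blob, and a "query" on a custom field is really a scan in business logic.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, custom_fields TEXT)"
)
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ann'), (2, 'Bob')")


def load_custom(conn, cust_id):
    # Pull the whole blob back and deserialize it.
    (blob,) = conn.execute(
        "SELECT custom_fields FROM customer WHERE id = ?", (cust_id,)
    ).fetchone()
    return json.loads(blob) if blob else {}


def set_custom_field(conn, cust_id, key, value):
    # Changing one field means reading and rewriting the entire blob.
    fields = load_custom(conn, cust_id)
    fields[key] = value
    conn.execute(
        "UPDATE customer SET custom_fields = ? WHERE id = ?",
        (json.dumps(fields), cust_id),
    )


def find_by_custom(conn, key, value):
    # Filtering happens in application code: every row is read and parsed.
    return [
        row_id
        for row_id, blob in conn.execute("SELECT id, custom_fields FROM customer")
        if blob and json.loads(blob).get(key) == value
    ]


set_custom_field(conn, 1, "region", "north")
set_custom_field(conn, 1, "tier", "gold")
set_custom_field(conn, 2, "region", "south")
print(find_by_custom(conn, "region", "north"))
```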
Let me first describe the situation. We host many Alumni events over the course of each year and provide online registration forms for each event. There is a large chunk of data that is common for each event:
An Event with dates, times, managers, internal billing info, etc.
A Registration record with info about the payment and total amount charged per form submission
Bio/Demographic and alumni data about the 1 or more attendees (name, address, degree, etc.)
We store all of the above data within columns in tables as you would expect.
The trouble comes with the 'extra' fields we are asked to put on the forms. Maybe it is a dinner and there is a Veggie or Carnivore option, perhaps there is lodging and there are bed or smoking options, or perhaps there is an optional transportation option. There are tons of weird little "can you add this to the form?" types of requests we receive.
Currently, we JSONify any non-standard data and store it all in one column (per attendee) called 'extras'. We can read this data out in code but it is not well suited to querying. Our internal staff would like to generate a quick report on Veggie dinners needed for instance.
Other than creating a separate table for each form that holds the specific 'extra' data items, are there any other approaches that could make my life (and reporting) easier? Anyone working in a similar environment?
This is actually one of the toughest problems to solve efficiently. The SQL Server Customer Advisory Team has dedicated a white paper to the topic, which I highly recommend you read: Best Practices for Semantic Data Modeling for Performance and Scalability.
You basically have 3 options:
semantic database (entity-attribute-value)
XML column
sparse columns
Each solution comes with ups and downs. Off the top of my head I'd say XML is probably the one that gives you the best balance of power and flexibility, but the optimal solution really depends on lots of factors like data set sizes, the frequency at which new attributes are created, the actual process (human operators) that creates, populates and uses these attributes, and not least your team's skill set (some might fare better with an EAV solution, some might fare better with an XML solution). If the attributes are created/managed under a central authority and adding new attributes is a reasonably rare event, then sparse columns may be a better answer.
Well you could also have the following db structure:
Have a table to store custom attributes
AttributeID
AttributeName
Have a mapping table between events and attributes with:
AttributeID
EventID
AttributeValue
This means you will be able to store custom information per event, and you will be able to reuse your attributes. You can also add some metadata columns to the attribute table, such as
AttributeType
AllowBlankValue
to make the values easier to handle afterwards.
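A sketch of this structure in Python with sqlite3, aimed at the "Veggie dinners" report from the question. The AttendeeID column is an assumption that is not in the answer above (the question stores extras per attendee, so the value table is keyed per attendee here), and the table names and count_by_value helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute (
    AttributeID     INTEGER PRIMARY KEY,
    AttributeName   TEXT NOT NULL UNIQUE,
    AttributeType   TEXT NOT NULL,               -- metadata suggested above
    AllowBlankValue INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE event_attribute_value (
    EventID        INTEGER NOT NULL,
    AttendeeID     INTEGER NOT NULL,             -- assumption: not in the answer
    AttributeID    INTEGER NOT NULL REFERENCES attribute(AttributeID),
    AttributeValue TEXT,
    PRIMARY KEY (EventID, AttendeeID, AttributeID)
);
INSERT INTO attribute VALUES
    (1, 'DinnerChoice', 'text', 0),
    (2, 'Smoking',      'bool', 1);
INSERT INTO event_attribute_value VALUES
    (100, 1, 1, 'Veggie'),
    (100, 2, 1, 'Carnivore'),
    (100, 3, 1, 'Veggie'),
    (100, 1, 2, 'no'),
    (200, 1, 1, 'Veggie');
""")


def count_by_value(conn, event_id, attribute_name):
    # One GROUP BY answers "how many of each option for this event?".
    rows = conn.execute(
        """
        SELECT v.AttributeValue, COUNT(*)
        FROM event_attribute_value AS v
        JOIN attribute AS a ON a.AttributeID = v.AttributeID
        WHERE v.EventID = ? AND a.AttributeName = ?
        GROUP BY v.AttributeValue
        """,
        (event_id, attribute_name),
    ).fetchall()
    return dict(rows)


print(count_by_value(conn, 100, "DinnerChoice"))
```

The same query shape works for any attribute, which is the reuse the answer points to; the cost is that every value is stored as text and needs AttributeType to be interpreted.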
Have you considered using XML instead of JSON? Difference: XML is supported (special data type) and has query integration ;)
quick and dirty, but actually nice for querying: simply add new columns. it's not like the empty entries in the previous table should cost a lot.
more databasy solution: you'll have something like an event ID in your table. You can link this to an n:m table connecting events to additional fields. And then store the additional field data in a table with additional_field_id, record_id (from the original table) and the actual value. Probably creates ugly queries, but seems politically correct in terms of database design.
I understand "NoSQL" (not only sql ;) databases like couchdb let you store arbitrary fields per record, but since you're already with SQL Server, I guess that's not an option.
This is the solution that we first proposed in ASP.NET Forums (that later became Community Server), and that the ASP.NET team built a similar version of in the ASP.NET 2.0 Membership when they released it:
Property Bags on your domain objects
For example:
Event.Profile() or in your case, Event.Extras().
Basically, a property bag is a serialized collection of data stored in a name/value pair in a column (or columns). The ASP.NET 2.0 Membership went the route of storing names in a semi-colon delimited list, and values in the same:
Table: aspnet_Profile
Column: PropertyNames (separated by semi-colons, and has start index and end index)
Column: PropertyValues (separated by semi-colons, and only stores the string value)
The downside to that approach is it is all strings, and manually has to be parsed (even though the membership system does it for you automatically).
My current method is this: I've built FormCollection and NameValueCollection C# extension methods that automatically serialize the collections to an XML result, and I store that XML in the table in its own column associated with that entity. I also have a deserializer C# extension on XElement that deserializes that data back to the collection at runtime.
This gives you the power of actually querying those properties in XML via SQL (though that can be slow, so always flatten out your read-only data).
The final note is runtime querying: The general rule we follow is, if you are going to query a property of an entity in normal application logic, then you move that property to an actual column on the table - and create the appropriate indexes. If that data will never be queried directly (for example, Linq-to-Sql or EF), then leave it in the XML Property Bag.
Property Bags give you the power of extending your domain models however you like, without having to modify the db schema.
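To make the serialize/deserialize round trip concrete, here is a minimal sketch in Python using xml.etree.ElementTree. It is only an analogue of the C# extension methods described above, not their actual code; the element names (properties, property) and the function names are assumptions.

```python
import xml.etree.ElementTree as ET


def bag_to_xml(bag):
    """Serialize a name/value collection to an XML string for one column."""
    root = ET.Element("properties")
    for name, value in bag.items():
        # Every value is stored as text, like the string-only profile columns.
        ET.SubElement(root, "property", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_bag(xml_text):
    """Deserialize the stored XML back into a name/value dictionary."""
    root = ET.fromstring(xml_text)
    return {p.get("name"): (p.text or "") for p in root.findall("property")}


extras = {"Dinner": "Veggie", "Beds": 2}
stored = bag_to_xml(extras)      # this string would go into the Extras column
restored = xml_to_bag(stored)
print(stored)
print(restored)
```

Note that values come back as strings, so any property that needs a real type or an index is a candidate for promotion to its own column, as the rule above suggests.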
I have a SQL lookup table like this:
CREATE TABLE Product(Id INT IDENTITY PRIMARY KEY, Name VARCHAR(255))
I've databound an ASP.NET DropDownList to an LLBLGen entity. The user selects a product, and the Id gets saved. Now I need to display some product-specific details later on. Should I use the Product's ID, and hope the ID is always the same between installations?
switch (selectedProduct.Id)
{
    case 1: //product one
        break;
    case 2:
    case 3: //product two or three
        break;
}
or use the name, and hope that never changes?
switch (selectedProduct.Name)
{
    case "product one":
        break;
}
Or is there a better alternative?
If you know of all the items in this table (which I guess you do if you can do a switch on them) and want them the same for each installation then maybe it should not be an identity column and you should insert 1, 2, 3 with the products themselves.
For this situation, there are three common solutions I have seen:
Hard code the ID - this is quick and dirty, not self-documenting (you don't know what product is being referred to), and prone to breakage as you pointed out. I never use this method anymore.
Enums - I use this when the table is small and static. So, ProductType would be a possible candidate for this. This is self-documenting code, but still creates an awkward connection between code and data where if records are inserted with different IDs than you planned for, then things break. You can mitigate this by automating the Enum generation in various ways, but it still feels wrong. E.g., if your unit tests are inserting records into the Product table, it will be difficult for them to recreate the Enum at that point. Also, if you have 100,000 records, the Enum approach starts to look pretty dumb.
Add an additional column, that is a non-changing identifier. I often use AlphaCode as my column name. So in your case it would look like:
switch (selectedProduct.AlphaCode)
{
    case "PRODUCT_ONE":
        break;
}
This lets you use an AlphaCode that is self-documenting, allows you to reinsert data without caring about the autoincrement PK value, and lets you change the product name without affecting anything. If you use the AlphaCode approach, ensure that you put a unique index on this column.
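A quick sketch of the AlphaCode approach in Python with sqlite3, including the unique index. The index name, the HANDLERS dictionary and the describe function are hypothetical, and SQLite's AUTOINCREMENT stands in for the IDENTITY column in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    Id        INTEGER PRIMARY KEY AUTOINCREMENT,
    Name      TEXT NOT NULL,
    AlphaCode TEXT NOT NULL
);
-- The unique index keeps two products from sharing a code.
CREATE UNIQUE INDEX UX_Product_AlphaCode ON Product (AlphaCode);
INSERT INTO Product (Name, AlphaCode) VALUES
    ('Product one', 'PRODUCT_ONE'),
    ('Product two', 'PRODUCT_TWO');
""")

# Code branches on the stable AlphaCode, never on Id or Name.
HANDLERS = {
    "PRODUCT_ONE": lambda: "details for product one",
    "PRODUCT_TWO": lambda: "details for product two",
}


def describe(conn, product_id):
    (code,) = conn.execute(
        "SELECT AlphaCode FROM Product WHERE Id = ?", (product_id,)
    ).fetchone()
    handler = HANDLERS.get(code)
    return handler() if handler else "no special handling"


selected_id = conn.execute(
    "SELECT Id FROM Product WHERE AlphaCode = 'PRODUCT_ONE'"
).fetchone()[0]
print(describe(conn, selected_id))
```

Renaming a product or reinserting rows with different Id values leaves the branching untouched, since only AlphaCode is compared.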
The other solution, which is often the most preferable one, is to move this logic to the database. E.g., if product 1 is the product you always want to show by default when its category is selected, you could add a column to your table called IsHeroProduct. Then your query becomes:
if (selectedProduct.IsHeroProduct)
{
    //do stuff
}
If you want your ProductIDs to be fixed (which doesn't seem to be a good idea), then you can use IDENTITY_INSERT (in SQL Server, at least) to ensure ProductID values are the same between installations. But I would normally only do this for static reference data.
You can also use Visual Studio's T4 templates to generate enums directly off the database data
Some ORMs (LLBLGen at least) can handle this for you by generating strongly typed enums. I've never used that though.
In these cases, I always just go with an enum that I write myself, but I make sure that all the fields are equal, and update if any change. It becomes more interesting when you work across databases (as I do), but if you take care, it is simple enough.
In one sentence, what I ultimately need to know is how to share objects between mid-tier functions without requiring the application tier to pass the data model objects.
I'm working on building a mid-tier layer in our current environment for the company I am working for. Currently we are using primarily .NET for programming and have built custom data models around all of our various database systems (ranging from Oracle, OpenLDAP, MSSQL, and others).
I'm running into issues trying to pull our model from the application tier and move it into a series of mid-tier libraries. The main issue I'm running into is that the application tier has the ability to hang on to a cached object throughout the duration of a process and make updates based on the cached data, but the Mid-Tier operations do not.
I'm trying to keep the model objects out of the application as much as possible so that when we make a change to the underlying database structure, we can edit and redeploy the mid-tier easily and multiple applications will not need to be rebuilt. I'll give a brief update of what the issue is in pseudo-code, since that is what us developers understand best :)
main
{
    MidTierServices.UpdateCustomerName("testaccount", "John", "Smith");
    // since the data takes up to 4 seconds to be replicated from
    // write server to read server, the function below is going to
    // grab old data that does not contain the first name and last
    // name update.... John Smith will be overwritten w/ previous
    // data
    MidTierServices.UpdateCustomerPassword("testaccount", "jfjfjkeijfej");
}
MidTierServices
{
    void UpdateCustomerName(string username, string first, string last)
    {
        Customer custObj = DataRepository.GetCustomer(username);
        /*******************
        validation checks and business logic go here...
        *******************/
        custObj.FirstName = first;
        custObj.LastName = last;
        DataRepository.Update(custObj);
    }
    void UpdateCustomerPassword(string username, string password)
    {
        // does not contain first and last updates
        Customer custObj = DataRepository.GetCustomer(username);
        /*******************
        validation checks and business logic go here...
        *******************/
        custObj.Password = password;
        // overwrites changes made by other functions since data is stale
        DataRepository.Update(custObj);
    }
}
On a side note, the options I've considered are building a home-grown caching layer, which takes a lot of time and is a very difficult concept to sell to management, or using a different modeling layer that has built-in caching support, such as nHibernate. The latter would also be hard to sell to management, because it would take a very long time to tear apart our entire custom model and replace it w/ a third-party solution. Additionally, not a lot of vendors support our large array of databases. For example, .NET has LINQ to ActiveDirectory, but not a LINQ to OpenLDAP.
Anyway, sorry for the novel, but it's a more of an enterprise architecture type question, and not a simple code question such as 'How do I get the current date and time in .NET?'
Edit
Sorry, I forgot to add some very important information in my original post. I feel very bad because Cheeso went through a lot of trouble to write a very in depth response which would have fixed my issue were there not more to the problem (which I stupidly did not include).
The main reason I'm facing the current issue is in concern to data replication. The first function makes a write to one server and then the next function makes a read from another server which has not received the replicated data yet. So essentially, my code is faster than the data replication process.
I could resolve this by always reading and writing to the same LDAP server, but my admins would probably murder me for that. They specifically set up a server that is only used for writing and then 4 other servers, behind a load balancer, that are only used for reading. I'm in no way an LDAP administrator, so I'm not aware if that is standard procedure.
You are describing a very common problem.
The normal approach to address it is through the use of Optimistic Concurrency Control.
If that sounds like gobbledegook, it's not. It's a pretty simple idea. The concurrency part of the term refers to the fact that there are updates happening to the data-of-record, and those updates are happening concurrently. Possibly many writers. (Your situation is a degenerate case where a single writer is the source of the problem, but it's the same basic idea.) The optimistic part I'll get to in a minute.
The Problem
It's possible when there are multiple writers that the read+write portion of two updates become interleaved. Suppose you have A and B, both of whom read and then update the same row in a database. A reads the database, then B reads the database, then B updates it, then A updates it. If you have a naive approach, then the "last write" will win, and B's writes may be destroyed.
Enter optimistic concurrency. The basic idea is to presume that the update will work, but check. Sort of like the trust but verify approach to arms control from a few years back. The way to do this is to include a field in the database table, which must also be included in the domain object, that provides a way to distinguish one "version" of the db row or domain object from another. The simplest is to use a timestamp field, named LastUpdated, which holds the time of the last update. There are other more complex ways to do the consistency check, but a timestamp field is good for illustration purposes.
Then, when the writer or updater wants to update the DB, it can only update the row for which the key matches (whatever your key is) and for which LastUpdated still matches the value that was read. This is the verify part.
Since developers understand code, I'll provide some pseudo-SQL. Suppose you have a blog database, with an index, a headline, and some text for each blog entry. You might retrieve the data for a set of rows (or objects) like this:
SELECT ix, Created, LastUpdated, Headline, Dept FROM blogposts
WHERE CONVERT(Char(10),Created,102) = #targdate
This sort of query might retrieve all the blog posts in the database for a given day, or month, or whatever.
With simple optimistic concurrency, you would update a single row using SQL like this:
UPDATE blogposts Set Headline = #NewHeadline, LastUpdated = #NewLastUpdated
WHERE ix=#ix AND LastUpdated = #PriorLastUpdated
The update can only happen if the index matches (and we presume that's the primary key), and the LastUpdated field is the same as what it was when the data was read. Also note that you must make sure to update the LastUpdated field on every update to the row.
A more rigorous update might insist that none of the columns had been updated. In this case there's no timestamp at all. Something like this:
UPDATE Table1 Set Col1 = #NewCol1Value,
    Col2 = #NewCol2Value,
    Col3 = #NewCol3Value
WHERE Col1 = #OldCol1Value AND
    Col2 = #OldCol2Value AND
    Col3 = #OldCol3Value
Why is it called "optimistic"?
OCC is used as an alternative to holding database locks, which is a heavy-handed approach to keeping data consistent. A DB lock might prevent anyone from reading or updating the db row, while it is held. This obviously has huge performance implications. So OCC relaxes that, and acts "optimistically", by presuming that when it comes time to update, the data in the table will not have been updated in the meantime. But of course it's not blind optimism - you have to check right before update.
Using Optimistic Concurrency in practice
You said you use .NET. I don't know if you use DataSets for your data access, strongly typed or otherwise. But .NET DataSets, or specifically DataAdapters, include built-in support for OCC. You can specify and hand-code the UpdateCommand for any DataAdapter, and that is where you can insert the consistency checks. This is also possible within the Visual Studio design experience.
If you get a violation, the update will return a result showing that ZERO rows were updated. You can check this in the DataAdapter.RowUpdated event. (Be aware that in the ADO.NET model, there's a different DataAdapter for each sort of database. The link there is for SqlDataAdapter, which works with SQL Server, but you'll need a different DA for different data sources.)
In the RowUpdated event, you can check for the number of rows that have been affected, and then take some action if the count is zero.
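The same check can be sketched outside ADO.NET. The example below uses Python with sqlite3 rather than a DataAdapter, and an integer version counter in the LastUpdated column instead of a real timestamp so the run is deterministic; both substitutions, along with the read_post and update_headline names, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE blogposts ("
    "ix INTEGER PRIMARY KEY, Headline TEXT, LastUpdated INTEGER NOT NULL)"
)
conn.execute("INSERT INTO blogposts VALUES (1, 'Original', 1)")


def read_post(conn, ix):
    # Return the data together with the version it was read at.
    return conn.execute(
        "SELECT Headline, LastUpdated FROM blogposts WHERE ix = ?", (ix,)
    ).fetchone()


def update_headline(conn, ix, new_headline, prior_last_updated):
    # The WHERE clause is the "verify" step: the row is changed only if
    # LastUpdated still holds the value seen at read time.
    cur = conn.execute(
        "UPDATE blogposts SET Headline = ?, LastUpdated = LastUpdated + 1 "
        "WHERE ix = ? AND LastUpdated = ?",
        (new_headline, ix, prior_last_updated),
    )
    # Zero affected rows signals a concurrency violation.
    return cur.rowcount == 1


# Writers A and B both read the same version of the row.
_, version_a = read_post(conn, 1)
_, version_b = read_post(conn, 1)

b_ok = update_headline(conn, 1, "B's headline", version_b)
a_ok = update_headline(conn, 1, "A's headline", version_a)  # stale read
print(b_ok, a_ok, read_post(conn, 1))
```

When the update reports zero rows, the caller can re-read the row and retry or surface the conflict, instead of silently overwriting B's change.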
Summary
Verify the contents of the database have not been changed, before writing updates. This is called optimistic concurrency control.
Other links:
MSDN on Optimistic Concurrency Control in ADO.NET
Tutorial on using SQL Timestamps for OCC