Language: C#
Compiler: Visual Studio 2012
O/S: Windows 7 Home Premium
Here is a question that has come up many times and sparked a few debates.
I know there are provisional .NET controls for a functional timeline, as well as hints and tips on how the process could be done, but I have not found (so far) a complete tutorial on a well-maintained, SQL-stored timeline system.
I need to document almost every change that my site will have: from additions to user reputation, to members joining, creating, and eventually submitting clans, games, etc.
As far as I know, DateTime in a SQL database should be avoided, especially in large quantities.
What would be the implementation, process, and eventual output of a Timeline?
What you're describing is sometimes known as "audit history", and it's often implemented using a single, denormalized table. Many DB purists will argue against it, however, because you lose strong typing.
The table looks like this:
AuditTable( EventId bigint, DateTime datetime, Subject nvarchar, Table varchar, Column varchar, TablePK bigint, OldValueInt bigint nullable, OldValueStr nvarchar nullable )
-- add more nullable columns for more types, if necessary
Each time a value is changed, such as a user's reputation being increased, you would add a row to this table, like so:
INSERT INTO AuditTable ( DateTime, Subject, [Table], [Column], TablePK, OldValueInt )
VALUES ( GETDATE(), N'User reputation increased', 'Users', 'Reputation', @userId, 100 )
You only need to store the old value (the value before the change) because the new (i.e. current) value will be in the actual table row.
Adding to the Audit table can be done entirely automatically with SQL Server table triggers.
To view a user's reputation history, you would do this:
SELECT * FROM AuditTable WHERE [Table] = 'Users' AND [Column] = 'Reputation' AND TablePK = @userId
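To make the pattern concrete, here is a minimal sketch of the audit-table idea, including the trigger-based logging mentioned below, using SQLite through Python's sqlite3 module. It is an illustration only: the table and column names are assumptions, and SQLite types (TEXT, INTEGER) stand in for the T-SQL ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Users (
    UserId     INTEGER PRIMARY KEY,
    Reputation INTEGER NOT NULL
);

CREATE TABLE AuditTable (
    EventId     INTEGER PRIMARY KEY AUTOINCREMENT,
    EventTime   TEXT    NOT NULL DEFAULT (datetime('now')),
    Subject     TEXT,
    TableName   TEXT,
    ColumnName  TEXT,
    TablePK     INTEGER,
    OldValueInt INTEGER
);

-- Log the OLD value automatically whenever Reputation changes,
-- so the audit row is added without any application code.
CREATE TRIGGER trg_users_reputation
AFTER UPDATE OF Reputation ON Users
BEGIN
    INSERT INTO AuditTable (Subject, TableName, ColumnName, TablePK, OldValueInt)
    VALUES ('User reputation changed', 'Users', 'Reputation', OLD.UserId, OLD.Reputation);
END;
""")

conn.execute("INSERT INTO Users (UserId, Reputation) VALUES (1, 100)")
conn.execute("UPDATE Users SET Reputation = 110 WHERE UserId = 1")
conn.execute("UPDATE Users SET Reputation = 125 WHERE UserId = 1")

# Viewing one user's reputation history is a plain filtered SELECT.
history = conn.execute("""
    SELECT OldValueInt FROM AuditTable
    WHERE TableName = 'Users' AND ColumnName = 'Reputation' AND TablePK = 1
    ORDER BY EventId
""").fetchall()
```

Note that only the pre-change values appear in the audit rows; the current value (125) lives in the Users row itself, exactly as described above.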
Now, as I said, this design is more for auditing than for maintaining an easily user-accessible history. These are the disadvantages:
You cannot semantically index the table, so lookups and lists will always be slow
You're storing database metadata as strings, so there's a lot of overhead
There's no referential integrity (this can be a good thing, in that the data will remain if you re-architect the original tables, e.g. removing the Reputation field from the Users table)
If you want to be more "pure" then you really have to design a table structure that directly supports the history-tracking you want to build. You don't need to create a history table for every field - even Stackoverflow doesn't store a history of everything. For example:
UserReputationHistory ( UserId bigint, ReputationChange int, When datetime, Subject nvarchar )
Of course it does complicate your code to have to maintain these disparate FooHistory tables.
The other things you mention in your original question, such as a member's join date, don't need a history table; you can get that from a DateJoined field in the member's own DB row.
I am developing a classifieds website using ASP.NET, and my DB is MySQL. MSSQL users, please, I need your support too; this is a database-schema problem, not one tied to a specific database provider.
I just want a little bit of clarification from you.
Since this is a classifieds website, you can post Job ads, Vehicle ads, Real Estate ads, etc.
So I have a header table to store the details common to all ads, like title, description, and so on.
CREATE TABLE `ad_header` (
`ad_header_id_pk` int(10) unsigned NOT NULL AUTO_INCREMENT,
`district_id_fk` tinyint(5) unsigned NOT NULL,
`district_name` varchar(50) DEFAULT NULL,
`city_id_fk` tinyint(5) unsigned DEFAULT NULL,
`city_name` varchar(50) DEFAULT NULL,
`category_id_fk` smallint(3) unsigned NOT NULL,
`sub_category_id_fk` smallint(3) unsigned DEFAULT NULL,
`title` varchar(100) NOT NULL,
`description` text NOT NULL,
...............
PRIMARY KEY (`ad_header_id_pk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
If it is a Job ad, I have another table to store the attributes relevant only to a Job ad, like salary, employment type, and working hours.
If it is a Vehicle ad, I have a separate table to store fuel type, transmission type, etc.
So I have 10 categories. These categories are not going to change in a decade. So now I have these 2 approaches:
1) One header table, plus 10 category-specific tables that each store one category's attributes
2) One header table, plus one attribute table that holds all attributes of every classified group; attributes that are not relevant to a listing hold NULL values
What is the best way to do this with regard to performance and scalability?
For those who have built classified websites, please give me some guidance. Thanks in advance.
The question is not entirely clear to me, but I can give some advice:
First of all, if you find yourself wanting to store delimited values in a single column/cell, you need to step back and create a new table to hold that info. NEVER store delimited data in a single column.
If I understand your question correctly, Ads have Categories like "Job", "For Sale", "Vehicle", "Real Estate", etc. Categories should then have Attributes, where attributes might be things unique to each category, like "Transmission Type" or "Mileage" for the Vehicle category, or "Square Feet" or "Year Constructed" for the Real Estate category.
There is more than one correct way to handle this situation.
If the master categories are somewhat fixed, it is a legitimate design choice to have a separate table for the attributes from each category, such that each ad listing would have one record from ad_header, and one record from the specific Attribute table for that category. So a vehicle listing would have an ad_header record and a vehicle_attributes record.
If the categories are more fluid, it is also a legitimate design choice to have one CategoryAttributes table that defines the attributes used with each category, along with an Ad_Listing_Attributes table that holds the attribute data for each listing and includes a foreign key to both CategoryAttributes and ad_header. Note that the schema for this table effectively follows the Entity/Attribute/Value (EAV) pattern, which is widely considered to be more of an anti-pattern; that is, something to be avoided in most cases. But if you expect to be frequently adding new categories, it may be the best you can do here.
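As an illustration only, here is a small sketch of that CategoryAttributes / Ad_Listing_Attributes design using SQLite via Python's sqlite3; every table and column name here is an assumption based on the description above, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ad_header (
    ad_header_id INTEGER PRIMARY KEY,
    category     TEXT NOT NULL,
    title        TEXT NOT NULL
);

-- Defines which attributes each category uses.
CREATE TABLE category_attributes (
    attribute_id INTEGER PRIMARY KEY,
    category     TEXT NOT NULL,
    name         TEXT NOT NULL
);

-- One row per (listing, attribute) pair: the EAV part.
CREATE TABLE ad_listing_attributes (
    ad_header_id INTEGER NOT NULL REFERENCES ad_header,
    attribute_id INTEGER NOT NULL REFERENCES category_attributes,
    value        TEXT,
    PRIMARY KEY (ad_header_id, attribute_id)
);
""")

conn.execute("INSERT INTO category_attributes VALUES (1, 'Vehicle', 'FuelType')")
conn.execute("INSERT INTO category_attributes VALUES (2, 'Vehicle', 'Transmission')")
conn.execute("INSERT INTO ad_header VALUES (10, 'Vehicle', 'Clean 2008 sedan')")
conn.execute("INSERT INTO ad_listing_attributes VALUES (10, 1, 'Petrol')")
conn.execute("INSERT INTO ad_listing_attributes VALUES (10, 2, 'Manual')")

# Reassembling a listing's attributes requires a join: the cost of EAV.
rows = conn.execute("""
    SELECT ca.name, ala.value
    FROM ad_listing_attributes ala
    JOIN category_attributes ca ON ca.attribute_id = ala.attribute_id
    WHERE ala.ad_header_id = 10
    ORDER BY ca.attribute_id
""").fetchall()
```

The join in the final query shows the trade-off directly: adding a new category needs no DDL, but every read must pivot rows back into attributes.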
A final option is to put attributes from all categories in a single large table, and populate only what you need. So a vehicle listing would have only an ad_header record, but there would be a lot of NULL columns in it. I'd avoid that in this case, because your ideal scenario would require some attributes for certain categories (i.e. NOT NULL columns) but leave others optional.
This is another case where Postgresql may have been the better DB choice. Postgresql has something called table inheritance, that is specifically designed to address this situation to allow you to avoid an EAV table schema.
Full disclosure: I'm actually a Sql Server guy for most situations, but it does seem like Postgresql may be a better fit for you. My experience is MySql was good in the late 90's and early 00's, but has really lagged behind since. It continues to be popular today mainly because of that early momentum along with some advantage in cheap hosting availability, rather than any real technical merit.
I wanted to see what others have experienced when working with types like List<> or Dictionary<> and, in turn, storing and retrieving that data.
Here's an example scenario: users will be creating their own "templates", where each template is essentially a Dictionary. For example, for user1 the values are (1, Account), (2, Bank), (3, Code), (4, Savings), and for user2 the (unrelated) values could be (1, Name), (2, Grade), (3, Class), and so on. These templates/lists can be of varying length, but they will always have an index and a value. Also, each list/template has one and only one user linked to it.
What types did you choose on the database side?
Any pain points and/or advice I should be aware of?
As far as the types within the collection go, there is a fairly 1-to-1 mapping between .Net types and SQL types: SQL Server Data Type Mappings. You mostly need to worry about string fields:
Will they always be ASCII values (0 - 255)? Then use VARCHAR. If they might contain non-ASCII / UCS-2 characters, then use NVARCHAR.
What is their likely max length?
Of course, sometimes you might want to use a slightly different numeric type in the database. The main reason would be if an int was chosen on the app side because it is "easier" (or so I have been told) to deal with than Int16 or byte; if the values will never be above 32,767 or 255, then you should most likely use SMALLINT or TINYINT respectively. The difference between int and byte in terms of memory in the app layer might be minimal, but it does have an impact in terms of physical storage, especially as row counts increase. And if that is not clear, "impact" means slowing down queries and sometimes costing more money when you need to buy more SAN space. But the reason I said to "most likely use SMALLINT or TINYINT" is that if you have Enterprise Edition and Row Compression or Page Compression enabled, then the values will be stored in the smallest datatype that they fit in.
As far as retrieving the data from the database, that is just a simple SELECT.
As far as storing that data (at least in terms of doing it efficiently), well, that is more interesting :). A nice way to transport a list of fields to SQL Server is to use Table-Valued Parameters (TVPs). These were introduced in SQL Server 2008. I have posted a code sample (C# and T-SQL) in this answer on a very similar question here: Pass Dictionary<string,int> to Stored Procedure T-SQL. There is another TVP example on that question (the accepted answer), but instead of using IEnumerable<SqlDataRecord>, it uses a DataTable which is an unnecessary copy of the collection.
EDIT:
With regards to the recent update of the question that specifies the actual data being persisted, that should be stored in a table similar to:
UserID INT NOT NULL,
TemplateIndex INT NOT NULL,
TemplateValue VARCHAR(100) NOT NULL
The PRIMARY KEY should be (UserID, TemplateIndex) as that is a unique combination. There is no need (at least not with the given information) for an IDENTITY field.
The TemplateIndex and TemplateValue fields would get passed in the TVP as shown in my answer to the question that I linked above. The UserID would be sent by itself as a second SqlParameter. In the stored procedure, you would do something similar to:
INSERT INTO SchemaName.TableName (UserID, TemplateIndex, TemplateValue)
SELECT @UserID,
       tmp.TemplateIndex,
       tmp.TemplateValue
FROM @ImportTable tmp;
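For illustration, here is the whole round trip sketched with SQLite via Python's sqlite3. There are no TVPs or stored procedures in SQLite, so a parameterized executemany stands in for the TVP call; the table and column names follow the sketch above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE UserTemplates (
    UserID        INTEGER NOT NULL,
    TemplateIndex INTEGER NOT NULL,
    TemplateValue TEXT    NOT NULL,
    PRIMARY KEY (UserID, TemplateIndex)  -- unique per user, no IDENTITY needed
)
""")

def save_template(user_id, template):
    # template is a dict of index -> value, e.g. {1: "Account", 2: "Bank"};
    # each entry becomes one row, keyed by (UserID, TemplateIndex).
    conn.executemany(
        "INSERT INTO UserTemplates (UserID, TemplateIndex, TemplateValue) VALUES (?, ?, ?)",
        [(user_id, idx, val) for idx, val in template.items()],
    )

save_template(1, {1: "Account", 2: "Bank", 3: "Code", 4: "Savings"})
save_template(2, {1: "Name", 2: "Grade", 3: "Class"})

# Retrieval is a simple SELECT, and the rows rebuild the Dictionary directly.
user1 = dict(conn.execute(
    "SELECT TemplateIndex, TemplateValue FROM UserTemplates WHERE UserID = 1"
).fetchall())
```

The point of the sketch is that the Dictionary round-trips losslessly through plain rows, with no serialization involved.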
And just to have it stated explicitly, unless there is a very specific reason for doing so (which would need to include never, ever needing to use this data in any queries, such that this data is really just a document and no more usable in queries than a PDF or image), then you shouldn't serialize it to any format. Though if you were inclined to do so, XML is a better choice than JSON, at least for SQL Server, as there is built-in support for interacting with XML data in SQL Server but not so much for JSON.
A List or any other collection is represented in a database as a table. Always think of it as a collection and relate it to what a database offers.
Though you can always serialize a collection, I do not suggest it: when updating or inserting records you would have to rewrite the whole serialized value, whereas with a table you only have to query by the KEY, which in a Dictionary you already have.
I'm quite a beginner in general but I have a theory, idea, etc...
I want to create a task database with a unique TaskID column [primary key or not] based on the date. I need the entry to be auto-generated. To avoid collisions, I want to append a number to the end, which should achieve the goal of having all entries unique. So a series of entries would look like this:
201309281 [2013-09-28]
201309282
201309291
My thought is that I could use an auto-increment that resets at midnight EST and starts again for the new date, or something like that.
The advantage, to me, of having it work like this is that you could see all tasks created on a given day, while a particular task might not be completed or invoiced until, say, a week later. This way you could search by creation date, completion date, or invoice date.
I realize that there are many ways to achieve the end goal of task database. I was just curious if this was possible, or if anyone had any thoughts on how to implement it as the primary key column, or any other column for that matter.
I also want to apologize if this question is unclear. I will try to sum up here.
Can you have an auto-increment column based on the date the row is created, so that it automatically generates the date as a number [20130929] with an extra digit on the end in the following format, AND have that extra digit reset to "1" every day at midnight EST or UTC?
Any thoughts on how to accomplish this?
eg:
201309291
EDIT: BTW, I would like to use an MVC4 web app to give users CRUD functionality. Using C#. I thought this fact may expand the options.
EDIT: I found this q/a on stack, and it seems similar, but doesn't quite answer my question. My thought is posting the link here might help find an answer. Resetting auto-increment column back to 0 daily
I take it you're new to DB design, Nick, but this sort of design would make any seasoned DBA cringe. You should avoid putting any information in primary keys. The results you're trying to achieve can be attained using something like the code below. Remember, PKs should always be dumb IDs; no intelligent keys!
Disclaimer: I'm a very strong proponent of surrogate key designs and I'm biased in that direction. I've been stung many times by architectures that didn't fully consider the trade-offs or the downstream implications of a natural key design. I humbly respect and understand the opinions of natural key advocates, but in my experience developing relational business apps, surrogate designs are the better choice 99% of the time.
(BTW, you don't really even need the createdt field in the RANK clause; you could use the auto-increment PK instead in the ORDER BY clause of the PARTITION.)
CREATE TABLE tbl(
id int IDENTITY(1,1) NOT NULL,
dt date NOT NULL,
createdt datetime NOT NULL,
CONSTRAINT PK_tbl PRIMARY KEY CLUSTERED (id ASC)
)
go
-- I usually have this done for me by the database
-- rather than passed in from the middle tier:
-- ALTER TABLE tbl ADD CONSTRAINT DF_tbl_createdt
--     DEFAULT (getdate()) FOR createdt
insert into tbl(dt,createdt) values
('1/1/13','1/1/13 1:00am'),('1/1/13','1/1/13 2:00am'),('1/1/13','1/1/13 3:00am'),
('1/2/13','1/2/13 1:00am'),('1/2/13','1/2/13 2:00am'),('1/2/13','1/2/13 3:00am')
go
go
SELECT id,dt,rank=RANK() OVER (PARTITION BY dt ORDER BY createdt ASC)
from tbl
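To see the per-day numbering this produces, here is a small sketch using SQLite through Python's sqlite3 (window functions require SQLite 3.25 or later, which ships with recent Python builds; the data mirrors the inserts above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, dt TEXT, createdt TEXT)")
conn.executemany("INSERT INTO tbl (dt, createdt) VALUES (?, ?)", [
    ("2013-01-01", "2013-01-01 01:00"),
    ("2013-01-01", "2013-01-01 02:00"),
    ("2013-01-01", "2013-01-01 03:00"),
    ("2013-01-02", "2013-01-02 01:00"),
    ("2013-01-02", "2013-01-02 02:00"),
])

# RANK restarts at 1 for each dt value, giving the per-day sequence
# without storing it anywhere or ever resetting an identity.
rows = conn.execute("""
    SELECT id, dt,
           RANK() OVER (PARTITION BY dt ORDER BY createdt) AS day_rank
    FROM tbl
    ORDER BY id
""").fetchall()
```

Because the rank is computed at query time from the dumb surrogate key and the date, nothing needs to be "reset at midnight" at all.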
I would say that this is a very bad design idea. Primary keys should ideally be surrogate in nature and thus automatically created by SQL Server.
The logic you have drafted might get implemented well, but due to the amount of manual engineering it could lead to a lot of complexity, maintenance overhead, and performance issues.
For creating PKs you should restrict yourself to either IDENTITY property, SEQUENCES (new in SQL Server 2012), or GUID (newID()).
Even if you want to go with your design, you can have a combination of a Date column and an IDENTITY int/bigint column, and add a computed column that concatenates them. Resetting the IDENTITY column every midnight would not be a good idea.
Ok, I found an answer. There may be problems with this method that I don't know about, so comments would be welcome. But this method does work.
CREATE TABLE [dbo].[MainOne](
[DocketDate] NVARCHAR(8),
[DocketNumber] NVARCHAR(10),
[CorpCode] NVARCHAR(5),
CONSTRAINT pk_Docket PRIMARY KEY (DocketDate,DocketNumber)
)
GO
INSERT INTO [dbo].[MainOne] VALUES('20131003','1','CRH')
GO
CREATE TRIGGER AutoIncrement_Trigger ON [dbo].[MainOne]
INSTEAD OF INSERT AS
BEGIN
DECLARE @number INT
SELECT @number = COUNT(*) FROM [dbo].[MainOne]
WHERE [DocketDate] = CONVERT(NVARCHAR(8), GETDATE(), 112)
INSERT INTO [dbo].[MainOne] (DocketDate, DocketNumber, CorpCode)
SELECT CONVERT(NVARCHAR(8), GETDATE(), 112), (@number + 1), inserted.CorpCode
FROM inserted
END
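As a sanity check of the per-day numbering idea (not the exact T-SQL above), here is a sketch using SQLite via Python's sqlite3. SQLite only supports INSTEAD OF triggers on views, so an AFTER INSERT trigger fills in the number instead; like the COUNT(*) approach above, this is not safe under concurrent inserts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MainOne (
    DocketDate   TEXT,
    DocketNumber INTEGER,
    CorpCode     TEXT
);

-- Number each new docket within its day, starting from 1.
-- COUNT(*) already includes the row just inserted.
CREATE TRIGGER docket_number AFTER INSERT ON MainOne
BEGIN
    UPDATE MainOne
    SET DocketNumber = (SELECT COUNT(*) FROM MainOne
                        WHERE DocketDate = NEW.DocketDate)
    WHERE rowid = NEW.rowid;
END;
""")

for date, corp in [("20131003", "CRH"), ("20131003", "ABC"), ("20131004", "CRH")]:
    conn.execute("INSERT INTO MainOne (DocketDate, CorpCode) VALUES (?, ?)", (date, corp))

numbers = conn.execute(
    "SELECT DocketDate, DocketNumber FROM MainOne ORDER BY rowid"
).fetchall()
```

Two inserts on 20131003 get numbers 1 and 2, and the counter effectively "resets" for 20131004 simply because the count is scoped to the date.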
Any thoughts? I will wait three days before I mark an answer.
The only reason I'm not marking 'sisdog' is that his answer doesn't appear to make this happen automatically when an insert query is run.
I am building a web application with ASP.NET MVC and I'm choosing between the long and Guid data types for my entities' keys, but I don't know which one is better. Some say that long is much faster. Guid might also have some advantages. Does anybody know?
When GUIDs can be Inappropriate
GUIDs are almost always going to be slower because they are larger. That makes your indexes larger. That makes your tables larger. That means that if you have to scan your tables, either wholly or partially, it will take longer and you will see less performance. This is a huge concern in reporting based systems. For example, one would never use a GUID as a foreign key in a fact table because its length would usually be significant, as fact tables are often partially scanned to generate aggregates.
Also consider whether or not it is appropriate to use a "long". That's an enormously large number. You only need it if you think you might have over 2 BILLION entries in your table at some point. It's rare that I use them.
GUIDs can also be tough to use and debug. Saying, "there's a problem with Customer record 10034, Frank, go check it out" is a lot easier than saying "there's a problem with {2f1e4fc0-81fd-11da-9156-00036a0f876a}..." Ints and longs are also easier to type into queries when you need to.
Oh, and it's not the case that you never get the same GUID twice. It has been known to happen on very large, disconnected systems, so that's something to consider, although I wouldn't design for it in most apps.
When GUIDs can be Appropriate
GUIDs are appropriate when you're working with disconnected systems where entities are created and then synchronized. For example, someone makes a record in your database on a mobile device and syncs it, or you have entities created at different branch offices and synced to a central store at night. That's the kind of flexibility they give you.
GUIDs also allow you the ability to associate entities without persisting them to the database, in certain ORM scenarios. Linq to SQL (and I believe the EF) don't have this problem, though there are times you might be forced to submit your changes to the database to get a key.
If you create your GUIDs on the client, then since the GUIDs you create are not sequential, insert performance could suffer because of page splits on the DB.
My Advice
A lot of stuff to consider here. My vote is to not use them unless you have a compelling use case for them. If performance really is your goal, keep your tables small. Keep your fields small. Keep your DB indexes small and selective.
SIZE:
Long is 8 bytes
Guid is 16 bytes
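That size difference is easy to confirm from the app layer; here is a quick check using Python purely for illustration (struct packs a signed 64-bit integer, and a GUID/UUID is 128 bits):

```python
import struct
import uuid

# A 64-bit integer key (a "long") occupies 8 bytes on disk and in indexes.
long_key = struct.pack("<q", 123456789)   # little-endian signed 64-bit

# A GUID occupies 16 bytes: twice the key size in every index that uses it.
guid_key = uuid.uuid4().bytes

long_size = len(long_key)
guid_size = len(guid_key)
```

Doubling the key width matters most in indexes and foreign keys, where the key is repeated on every row.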
A GUID has a very high probability of being unique, and is well suited to identifying individual records across a database (or several databases).
A long (an IDENTITY in the DB) may represent a unique record within a table, but you can have records represented by the same ID (identity value) in two or more different tables, as follows:
TableA: PersonID int, name varchar(50)
TableB: ProductID int, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
Both queries can return the same value. But in the case of a GUID:
TableA: PersonID uniqueidentifier, name varchar(50)
TableB: ProductID uniqueidentifier, name varchar(50)
SELECT PersonID from TableA where name =''
SELECT ProductID from TableB where name =''
you will only very rarely get the same value returned as the ID from the two tables.
Have a look here
SQL Server - Guid VS. Long
GUIDs as PRIMARY KEYs and/or the clustering key
Guids make it much easier to create a 'fresh' entity in your API because you simply assign it the value of Guid.NewGuid(). There's no reliance on auto-incremented keys from a database, so this better decouples the Domain Model from the underlying persistence mechanism.
On the downside, if you use a Guid as the clustered index in SQL Server, inserts become expensive because new rows are rarely appended at the end of the table, so the table suffers frequent page splits.
Another issue is that if you perform selects from such a database without specifying an explicit ordering, you get results back in an essentially random order.
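The decoupling point above can be sketched in a few lines. This uses Python's uuid module purely for illustration, and the Order/OrderLine names are made up; the idea is that the entity's identity, and even foreign-key relationships, exist before anything touches the database.

```python
import uuid

class Order:
    """Domain entity whose identity exists before any database round-trip."""
    def __init__(self):
        self.id = uuid.uuid4()   # no reliance on an auto-incremented DB key
        self.lines = []

class OrderLine:
    def __init__(self, order):
        self.id = uuid.uuid4()
        self.order_id = order.id  # FK value known immediately, pre-persistence
        order.lines.append(self)

order = Order()
line = OrderLine(order)
# Both objects are fully associated here, without a single INSERT having run.
```

With auto-incremented keys, the line could not know its order_id until the order had been saved and the database had handed back a key.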
I have several tables within my database that contain nothing but "metadata".
For example, we have different group types, contentItemTypes, languages, etc.
The problem is that if you use automatic numbering, it is possible to create gaps.
The IDs are used within our code, so the numbers are very important.
Now I wonder whether it would be better not to use autonumbering in these tables?
Currently we have to create the row in the database first, before we can write our code, and in my opinion this should not be the case.
What do you guys think?
I would use an identity column, as you suggest, to be your primary key (surrogate key), and then make your candidate key (the identifier from your system) a standard column with a unique constraint applied to it. This way you can ensure you do not insert duplicate records.
Make sense?
If these are FK tables used just to expand codes into a description, or to hold other attributes, then I would NOT use an IDENTITY. Identities are good for ever-growing user data; metadata tables are usually static. When you deploy an update to your code, you don't want to be surprised by an IDENTITY value different from what you expect.
For example, you add a new value to the "Languages" table and expect the ID to be 6, but for some reason (development is out of sync, another person has not implemented their next language type, etc.) the next identity you get is different, say 7. You then insert or convert a bunch of rows using Language ID = 6, which all fail because it does not exist (it is 7 in the metadata table). Worse yet, they might all actually insert or update, because the value 6 you thought was yours was already in the metadata table; you would then have two items sharing the same value 6, and your new value 7 left unused.
I would pick the proper data type based on how many codes you need and how often you will look at the raw data (CHARs are nice to look at for a few values, and easy to remember).
For example, if you only have a few groups and you'll often look at the raw data, then a char(1) may be good:
GroupTypes table
-----------------
GroupType char(1) --'M'=manufacturing, 'P'=purchasing, 'S'=sales
GroupTypeDescription varchar(100)
However, if there are many different values, then some form of int (tinyint, smallint, int, bigint) may do:
EmailTypes table
----------------
EmailType smallint --2 bytes, up to 32k different positive values
EmailTypeDescription varchar(100)
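A tiny sketch of such a lookup table, using SQLite via Python's sqlite3 for illustration (the names follow the GroupTypes example above; the CHECK constraint stands in for char(1)):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hand-assigned, human-readable codes: no IDENTITY involved,
-- so the values are exactly what the application code expects.
CREATE TABLE GroupTypes (
    GroupType            TEXT PRIMARY KEY CHECK (length(GroupType) = 1),
    GroupTypeDescription TEXT NOT NULL
);
""")
conn.executemany("INSERT INTO GroupTypes VALUES (?, ?)", [
    ("M", "manufacturing"),
    ("P", "purchasing"),
    ("S", "sales"),
])

desc = conn.execute(
    "SELECT GroupTypeDescription FROM GroupTypes WHERE GroupType = 'M'"
).fetchone()[0]
```

Because the codes are explicit in the INSERTs, a deployment script produces the same values in every environment, which is the whole point of avoiding IDENTITY here.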
If the numbers are hardcoded in your code, don't use identity fields. Hardcode them in the database as well, so they'll be less prone to changing because someone scripted the database badly.
I would also use an identity column as the primary key, just for the simplicity of inserting records into the database, but then use a column for the type of metadata, which I call LookUpType (int), as well as columns for LookUpId (the int value used in code, or the value in select lists) and LookUpName (string). If those values require additional settings, so to speak, use extra columns; I personally use two extras, LookUpKey for hierarchical relations and LookUpValue for abbreviations or alternate values of LookUpName.
Well, if those numbers are important to you because they'll be in code, I would probably not use an IDENTITY.
Instead, just make sure you use an INT column and make it the primary key; in that case, you will have to provide the IDs yourself, and they'll have to be unique.