Structure a table with many columns - c#

On a SQL Server database I have a table with about 200 columns:
create table dbo.Reports
(
Id int identity not null,
HealthRating int not null, --- { One of: NoProblems; TemporaryChange; ... }
Hobbies int not null, --- { Many of: None, Running, Tennis, Football, Swimming }
HobbiesOthers nvarchar (400) null
-- 100 more columns
);
So I have about 200 columns with types: INT, NVARCHAR, BIT and DATETIME.
Some of the INT columns, like HealthRating, store a single value.
Others, like Hobbies, hold many items ... and usually have an extra column to store other options as text (nvarchar) ...
How should I structure this table? I see 3 options:
Have one column for each property, so:
HealthRatingNoProblems bit not null,
HealthRatingTemporaryChange bit not null,
Create lookup tables for HealthRatings, Hobbies, ...
I will probably end up with 60 or so more tables ...
Use enums and Flags enums, which are now supported in Entity Framework, and store single-choice and multiple-choice items in INT columns as I posted.
What would you suggest?

By all means -- please! -- normalize that poor table. If you end up with 50 or 60 tables, then so be it. That is the design. If a user has a hobby, that information will be in the Hobby table. If he has three hobbies, there will be three entries in the Hobby table. If he doesn't have any hobbies, there will be nothing in the Hobby table. And so on with all the other tables.
And for all those times when you are only interested in hobbies, you only involve the Hobby table with the Reports table and leave all the other tables alone. You can't do that with one huge, all-encompassing row that attempts to hold everything. There, if you only want to look at hobby information, you still have to read in the entire row, bringing in all that data you don't want. Why read in data you are just going to discard?
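A minimal sketch of what the normalized hobby piece could look like, assuming dbo.Reports.Id is (or becomes) the primary key and that a lookup table plus a junction table replace the Hobbies flags column and the HobbiesOthers text column (the table and column names here are illustrative):

create table dbo.Hobbies
(
Id int identity not null primary key,
Name nvarchar (100) not null unique -- Running, Tennis, Football, Swimming, ...
);

create table dbo.ReportHobbies
(
ReportId int not null references dbo.Reports (Id),
HobbyId int not null references dbo.Hobbies (Id),
OtherText nvarchar (400) null, -- replaces the HobbiesOthers catch-all column
primary key (ReportId, HobbyId)
);

-- "Only interested in hobbies" then touches just these two tables:
declare @ReportId int = 1;
select h.Name
from dbo.ReportHobbies rh
join dbo.Hobbies h on h.Id = rh.HobbyId
where rh.ReportId = @ReportId;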

Related

BULK INSERT across multiple related tables?

I need to do a BULK INSERT of several hundred-thousand records across 3 tables. A simple breakdown of the tables would be:
TableA
--------
TableAID (PK)
TableBID (FK)
TableCID (FK)
Other Columns
TableB
--------
TableBID (PK)
Other Columns
TableC
--------
TableCID (PK)
Other Columns
The problem with a bulk insert, of course, is that it only works with one table, so FKs become a problem.
I've been looking around for ways to work around this, and from what I've gleaned from various sources, using a SEQUENCE column might be the best bet. I just want to make sure I have correctly cobbled together the logic from the various threads and posts I've read on this. Let me know if I have the right idea.
First, I would modify the tables to look like this:
TableA
--------
TableAID (PK)
TableBSequence
TableCSequence
Other Columns
TableB
--------
TableBID (PK)
TableBSequence
Other Columns
TableC
--------
TableCID (PK)
TableCSequence
Other Columns
Then, from within the application code, I would make five calls to the database with the following logic:
Request X Sequence numbers from TableC, where X is the known number of records to be inserted into TableC. (1st DB call.)
Request Y Sequence numbers from TableB, where Y is the known number of records to be inserted into TableB (2nd DB call.)
Modify the existing objects for A, B and C (which are models generated to mirror the tables) with the now known Sequence numbers.
Bulk insert to TableA. (3rd DB call)
Bulk insert to TableB. (4th DB call)
Bulk insert to TableC. (5th DB call)
And then, of course, we would always join on the Sequence.
I have three questions:
Do I have the basic logic correct?
In Tables B and C, would I remove the clustered index from the PK and put it on the Sequence instead?
Once the Sequence numbers are requested from Tables B and C, are they then somehow locked between the request and the bulk insert? I just need to make sure that between the request and the insert, some other process doesn't request and use the same numbers.
Thanks!
EDIT:
After typing this up and posting it, I've been reading deeper into the SEQUENCE documentation. I think I misunderstood it at first. SEQUENCE is not a column type. For the actual column in the table, I would just use an INT (or maybe a BIGINT, depending on the number of records I expect to have). The actual SEQUENCE object is an entirely separate entity whose job is to generate numeric values on request and keep track of which ones have already been generated. So, if I understand correctly, I would create two SEQUENCE objects, one to be used in conjunction with Table B and one with Table C.
So that answers my third question.
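Based on that understanding, a minimal sketch of the setup might look like this (the sequence names and the BIGINT choice are assumptions for illustration):

-- One SEQUENCE object per table whose key values we want to hand out in ranges.
CREATE SEQUENCE dbo.SeqTableB AS BIGINT START WITH 1 INCREMENT BY 1;
CREATE SEQUENCE dbo.SeqTableC AS BIGINT START WITH 1 INCREMENT BY 1;

-- The columns themselves are just ordinary integer columns.
ALTER TABLE TableB ADD TableBSequence BIGINT NULL;
ALTER TABLE TableC ADD TableCSequence BIGINT NULL;
ALTER TABLE TableA ADD TableBSequence BIGINT NULL, TableCSequence BIGINT NULL;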
Do I have the basic logic correct?
Yes. The other common approach here is to bulk load your data into a staging table, and do something similar on the server-side.
From the client you can request ranges of sequence values using the sp_sequence_get_range stored procedure.
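For example, reserving a block of 10,000 values from one of the sequences sketched above (dbo.SeqTableB is an assumed name) could look like this; the first value of the reserved range comes back as a sql_variant:

DECLARE @first sql_variant;

-- Atomically reserves 10000 consecutive values; no other caller is handed the same block.
EXEC sys.sp_sequence_get_range
    @sequence_name = N'dbo.SeqTableB',
    @range_size = 10000,
    @range_first_value = @first OUTPUT;

SELECT CONVERT(bigint, @first) AS FirstReservedValue;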
In Tables B and C, would I remove the clustered index from the PK
No, as you later noted the sequence just supplies the PK values for you.
Sorry, I read your question wrong at first. I see now that you are trying to generate your own PKs rather than allow MS SQL to generate them for you. Scratch my above comment.
As David Browne mentioned, you might want to use a staging table to avoid the strain you'll put on your app's heap. Use tempdb and do the modifications directly on the table using a single transaction for each table. Then copy the staging tables over to their targets, or use a MERGE if appending. If you are enforcing FKs, you can temporarily remove those constraints if you choose to insert in reverse order (C => B => A). You may also want to consider temporarily removing indexes if you experience performance issues during the insert. Last, consider using SSIS instead of a custom app.
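A rough sketch of that staging-table variant for one of the tables, assuming the bulk load goes into a temp table and that TableB has at least one column besides its sequence value (SomeColumn is made up for illustration):

-- Staging table the client bulk-loads into (SqlBulkCopy, BULK INSERT, SSIS, ...).
CREATE TABLE #StageB
(
    TableBSequence BIGINT NOT NULL,
    SomeColumn NVARCHAR(255) NULL
);

-- Append into the real table in a single server-side statement.
MERGE dbo.TableB AS target
USING #StageB AS src
    ON target.TableBSequence = src.TableBSequence
WHEN NOT MATCHED BY TARGET THEN
    INSERT (TableBSequence, SomeColumn)
    VALUES (src.TableBSequence, src.SomeColumn);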

Holding different datatypes dynamically in database

Let's say I have a table Person, and I need users to be able to add different attributes to themselves.
User should be able to add a date, string, number, boolean, multiple values.
Let's say he wants to add:
Date of birth
Name
Height
Children names
How would I hold this in database?
I have 2 ideas:
I can hold all the values as strings (varchar) and always parse the value back to its original format when used. Multiple values would be stored like text1#text2#text3 or similar.
Having a table with a column for each type (date, string, number); only the one that is needed will be populated and the others will stay null.
Any suggestions?
Good database design keeps each column single-valued: a row should relate N:1 (many to one) or 1:1 (one to one) to the things it references, and should never try to pack a 1:N (one to many) or N:N (many to many) relationship into a single column. That means that if a user has multiple related values, you should make a new table that refers to the user.
Since a user can only have one birth date though, you should keep that as a column to the Users table.
For example, in this case you want children names as the "multiple", assigned to one user.
A simple table for that could look like this:
ID int primary key
UserID int references User(ID)
Name varchar
That way, you can make multiple children names for one user, while still being able to keep constraints in the database (which helps ensure code correctness if you're interfacing with it through an application!)
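In T-SQL, the table outlined above might look something like this (the Users table, the varchar lengths, and the table name are assumptions; Users(ID) is assumed to be a primary key):

CREATE TABLE UserChildren
(
    ID int IDENTITY(1,1) PRIMARY KEY,
    UserID int NOT NULL REFERENCES Users (ID),
    Name varchar(100) NOT NULL
);

-- One user, many children names:
INSERT INTO UserChildren (UserID, Name) VALUES (1, 'Alice'), (1, 'Bob');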
Some people will suggest having a table for each of the values, just to avoid nulls. For example, to store their birthdate, you make a table similar to the Children names table above, since you won't have to make a column in the Users table that might be null.
Personally I think using nulls are fine, as they allow you to see if there is a relevant result set without joining (or worse, left joining) an entire table of potentially irrelevant information.
Use your second approach. In your table 'Person', have a row for each record, with multiple columns that each hold a single value for your desired fields.
So..
tbPerson
ID | Date Of Birth | Name | Height | Childrens names | etc...
To Create a table...
CREATE TABLE tbPerson([ID] INT IDENTITY(1,1), [Date Of Birth] DATE, [Name] VARCHAR(50), Height INT, [Childrens names] VARCHAR(250))
This is the best and easiest way and makes editing one field of a person's record simple. With your first approach you will have endless nightmares storing everything as one long string.

What is a better approach performance-wise

Let's say I need to fetch some records from the database and filter them based on an enumeration-type property.
fetch List<SomeType>
filter on SomeType.Size
enumeration Size { Small, Medium, Large }
When displaying records, there will be a predefined value for the Size filter (e.g. Medium). In most cases, the user will select a value from the data already filtered by the predefined value.
There is a possibility that a user could also filter to Large, then filter to Medium, then filter to Large again.
I have different situations with same scenario:
List contains less than 100 records and 3-5 properties
List contains 100-500 records and 3-5 properties
List contains max 2000 records with 3-5 properties
What is my best approach here? Should I have a tab that contains a grid for each enum value, or should I have one common enum and always filter, or something else?
I would do the filtering right on the database; if those fields are indexed, I suspect having the DB filter would be much faster than filtering with C# after the fact.
Of course, you can always cache the filtered database result to prevent multiple unnecessary database calls.
EDIT: as for storing the information in the database, suppose you had this field setup:
CREATE TABLE Tshirts
(
id int not null identity(1,1),
name nvarchar(255) not null,
tshirtsizeid int not null,
primary key(id)
)
CREATE TABLE TshirtSizes
(
id int not null primary key, -- not auto-increment
name nvarchar(255)
)
INSERT INTO TshirtSizes(id, name) VALUES(1, 'Small')
INSERT INTO TshirtSizes(id, name) VALUES(2, 'Medium')
INSERT INTO TshirtSizes(id, name) VALUES(3, 'Large')
ALTER TABLE Tshirts ADD FOREIGN KEY(tshirtsizeid) REFERENCES TshirtSizes(id)
then in your C#
public enum TShirtSizes
{
Small = 1,
Medium = 2,
Large = 3
}
In this example, the table TshirtSizes is only used for the reader to know what the magic numbers 1, 2, and 3 mean. If you don't care about database read-ability you can omit those tables and just have an indexed column.
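For instance, filtering on the size server-side is then just a lookup on an indexed column (the index name below is an assumption):

-- Index the column you filter on so the WHERE clause can seek instead of scan.
CREATE INDEX IX_Tshirts_tshirtsizeid ON Tshirts (tshirtsizeid);

-- Filter in the database; 2 corresponds to TShirtSizes.Medium in the C# enum above.
SELECT id, name
FROM Tshirts
WHERE tshirtsizeid = 2;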
Memory is usually cheap. Otherwise, you could sort all the values once and retrieve based on comparison, which would be O(n). You could keep track of the positions of things and retrieve faster that way.

TSQL Large Insert of Relational Data, W/ Foreign Key Upsert

Relatively simple problem.
Table A has ID int PK, unique Name varchar(500), and cola, colb, etc
Table B has a foreign key to Table A.
So, in the application, we are generating records for both table A and table B into DataTables in memory.
We would be generating thousands of these records on a very large number of "clients".
Eventually we make the call to store these records. However, records from table A may already exist in the database, so we need to get the primary keys for the records that already exist, and insert the missing ones. Then insert all records for table B with the correct foreign key.
Proposed solution:
I was considering sending an XML document to SQL Server to open as a rowset into TableVarA, updating TableVarA with the primary keys for the records that already exist, then inserting the missing records and outputting them to TableVarNew. I would then select the Name and primary key from TableVarA union all TableVarNew.
Then in code populate the correct FKs into TableB in memory, and insert all of these records using SqlBulkCopy.
Does this sound like a good solution? And if so, what is the best way to populate the FKs in memory for TableB to match the primary keys from the returned DataSet?
Sounds like a plan - but I think the handling of Table A can be simpler (a single in-memory table/table variable should be sufficient):
have a TableVarA that contains all rows for Table A
update the ID for all existing rows with their ID (should be doable in a single SQL statement)
insert all non-existing rows (that still have an empty ID) into Table A and make a note of their ID
This could all happen in a single table variable - I don't see why you need to copy stuff around....
Once you've handled your Table A, as you say, update Table B's foreign keys and bulk insert those rows in one go.
What I'm not quite clear on is how Table B references Table A - you just said it has an FK, but you didn't specify which column it is on (I'm assuming ID). Also, how do your rows from Table B reference Table A for new rows that aren't inserted yet and thus don't have an ID in Table A yet?
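A sketch of that single table variable approach, assuming rows can be matched on the unique Name column and that an extra column rides along (ColA is illustrative):

DECLARE @TableVarA TABLE (ID INT NULL, Name VARCHAR(500) NOT NULL, ColA INT NULL);

-- ... fill @TableVarA from the client (XML rowset, TVP, bulk-loaded staging table, ...)

-- 1) Pick up the IDs of the rows that already exist in Table A.
UPDATE t
SET t.ID = a.ID
FROM @TableVarA t
JOIN dbo.TableA a ON a.Name = t.Name;

-- 2) Insert the rows that are still missing (ID is still NULL).
INSERT INTO dbo.TableA (Name, ColA)
SELECT Name, ColA
FROM @TableVarA
WHERE ID IS NULL;

-- 3) Pick up the freshly generated IDs the same way as in step 1.
UPDATE t
SET t.ID = a.ID
FROM @TableVarA t
JOIN dbo.TableA a ON a.Name = t.Name
WHERE t.ID IS NULL;

-- @TableVarA now maps every Name to its ID; return it so the client can set TableB's FKs.
SELECT ID, Name FROM @TableVarA;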
This is more of a comment than a complete answer but I was running out of room so please don't vote it down for not being up to answer criteria.
My concern would be that by evaluating a set for missing keys and then inserting in bulk, you take the risk that a key got added elsewhere in the meantime. You stated this could come from a large number of clients, so this is going to happen. Yes, you could wrap it in a big transaction, but big transactions are hogs and would lock out other clients.
My thought is to deal with the rows that already have keys in bulk separately, assuming there is no risk the PK would be deleted. A TVP is efficient, but you need explicit knowledge of which rows got processed. I think you need to first search on Name to get the list of PKs that exist, then process those via a TVP.
For data integrity, process the rest one at a time via a stored procedure that creates the PK as necessary.
Thousands of records is not scary (millions is). The large number of "clients" is the scary part.
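A minimal sketch of such a get-or-create procedure, assuming Table A's Name column carries a unique constraint so a concurrent duplicate insert fails instead of creating a second row (the procedure name is illustrative):

CREATE PROCEDURE dbo.GetOrCreateTableA
    @Name VARCHAR(500),
    @ID INT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    SELECT @ID = ID FROM dbo.TableA WHERE Name = @Name;

    IF @ID IS NULL
    BEGIN
        BEGIN TRY
            INSERT INTO dbo.TableA (Name) VALUES (@Name);
            SET @ID = SCOPE_IDENTITY();
        END TRY
        BEGIN CATCH
            -- Another client inserted the same Name between our SELECT and INSERT;
            -- the unique constraint rejected ours, so just read theirs.
            SELECT @ID = ID FROM dbo.TableA WHERE Name = @Name;
        END CATCH
    END
END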

How can I handle multiple row revisions in a single table?

I'm working on an app where users enter pricing quotes. They want to be able to have multiple revisions of the quotes and have access to all of them to revise and view.
Currently the data is stored in a Quotes table that looks like this:
QuoteID (PK, autonumber)
data1, data2, data3 and so on.
QuoteID foreign keys to other tables for one to many relationships for details about the quote.
Is there a way to keep all of the revisions in the Quotes table AND handle revisions? This way, the FK relationships to other tables would not be broken.
Based on what you said and some guesses as to what else/what more you need, I came up with the following table structure outline (tables in ALLCAPS, columns in CamelCase; columns ending in Id are identities or suitable natural keys; where the ColumnId name matches that table name, it's a primary key, otherwise it's a foreign key into the referenced table):
-- CUSTOMER ----
CustomerId
-- QUOTE ----
QuoteId
CustomerId
Data1Id
-- QUOTEREVISION ----
QuoteRevisionid
QuoteId
CreatedAt
Data2Id
Data3Id
-- DATA1 ----
Data1Id
-- DATA2 ----
Data2Id
-- DATA3 ----
Data3Id
CUSTOMER records who can make quotes.
QUOTE tracks a customer's pricing quotes. One row for every given [whatever] that they're entering quotes for.
QUOTEREVISION records each quote revision they enter. When a Quote is first created, the first QuoteRevision will also be created. CreatedAt would be a datetime, to keep track of when they occurred. QuoteId + CreatedAt is the natural key for the table, so you might not need QuoteRevisionId.
DATA1, DATA2, DATA3, and others as needed contain the extra information. I configured Data1 to hold information relevant to the quote level--that is, the same fact would apply to each quote revision. Data2 and Data3 would contain data that could vary from revision to revision.
I've no doubt there's stuff in here that doesn't apply to your problem, but hopefully this gives you some ideas for possible solutions.
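Concretely, the QUOTE / QUOTEREVISION part of the outline could be sketched in T-SQL like this (the data types and identity choices are assumptions):

CREATE TABLE Quote
(
    QuoteId int IDENTITY(1,1) PRIMARY KEY,
    CustomerId int NOT NULL, -- references CUSTOMER
    Data1Id int NULL -- quote-level facts shared by every revision
);

CREATE TABLE QuoteRevision
(
    QuoteRevisionId int IDENTITY(1,1) PRIMARY KEY,
    QuoteId int NOT NULL REFERENCES Quote (QuoteId),
    CreatedAt datetime NOT NULL DEFAULT GETDATE(),
    Data2Id int NULL,
    Data3Id int NULL,
    UNIQUE (QuoteId, CreatedAt) -- the natural key mentioned above
);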
You could add a Revision column to both your Quotes table and the other tables making a compound key but that would probably be a bit awkward to keep in sync. I think your best bet is to make the QuoteID column NOT be a primary key and add a new primary key that is used to link the Quotes table to the other tables. The QuoteID then becomes just a field that you can search on (you'd probably want to create an index on it).
I agree with Philip Kelley's design; I would only note that the quote revision number can be calculated in the output using ROW_NUMBER() or its emulations, depending on your DBMS.
There is also a nice book about storing historical data: http://www.cs.arizona.edu/people/rts/tdbbook.pdf
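A quick illustration of computing the revision number on the way out rather than storing it (table names follow the outline above):

-- Revision number per quote, calculated at query time.
SELECT qr.QuoteId,
       qr.QuoteRevisionId,
       qr.CreatedAt,
       ROW_NUMBER() OVER (PARTITION BY qr.QuoteId ORDER BY qr.CreatedAt) AS RevisionNumber
FROM QuoteRevision qr;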
I think that having separate database tables for the data elements (as follows) may make the database structure more complex.
-- DATA1 ----
Data1Id
-- DATA2 ----
Data2Id
-- DATA3 ----
Data3Id
What's your take on creating these data elements as columns in the revision table instead?
