Table design for SQL - c#

I hope you can help me out here. I have a question about designing a SQL table. I know how to write to a database and perform queries using C#, but I have never really had to design one, so I thought I would give it a shot just out of interest.
Let's assume I have a table named family_surname. For each entry in that table there could be a varying number of family_members, say anywhere from 2 to 22 people. How can I reference family_members against family_surname?
So I would have
FAMILY SURNAME
Smith,
Jones,
Brown,
Taylor,
etc.
Smith may have 5 members, for each of whom I would like to record age, height, weight, whatever. Jones may have 8 members; it could vary between families.
I don't really want 'surname' listed 8 times, once for each member. Ideally each member row would reference (or somehow point to) the corresponding surname row in another table. And that's where I'm having trouble!
I hope I make sense. Like I say, I'm just interested, but I would like to know how to do this with two tables.
Anyway, thank you for your help, I appreciate it.
EDIT
Thank you to everyone who commented. There is certainly some useful information here, which I appreciate. I'm reading up on and researching some SQL bits and pieces, and so far it's pretty good.
Thanks again, guys!

What you are asking is a question about normalization. The table would look like:
Create table surname (
SurnameID int,
Surname varchar(255)
)
The other tables would reference the surname by using the ID. Also, you probably want SurnameID to be unique, a primary key, and auto-incrementing. Those are more advanced topics.
That said, I'm not sure surname is a great candidate for splitting out like this. One reason to normalize data is to maintain relational integrity. In this case, it means that when you change "Smith" to "Jones", all the Smiths change at once. I don't think this is a concern in your case.
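For the two-table layout the question asks about, here is a minimal sketch. It uses Python's built-in sqlite3 module purely so that it can be run as-is; the family_member table, its columns, and the sample rows are assumptions for illustration and are not part of the original answer.

```python
import sqlite3

# In-memory database; table names, columns and sample rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
CREATE TABLE surname (
    SurnameID INTEGER PRIMARY KEY,          -- assigned automatically when omitted
    Surname   VARCHAR(255) NOT NULL UNIQUE
);
CREATE TABLE family_member (
    MemberID  INTEGER PRIMARY KEY,
    SurnameID INTEGER NOT NULL REFERENCES surname(SurnameID),
    FirstName VARCHAR(255) NOT NULL,
    Age       INTEGER
);
""")

# Each surname is stored once; members point at it through SurnameID.
smith_id = conn.execute("INSERT INTO surname (Surname) VALUES ('Smith')").lastrowid
jones_id = conn.execute("INSERT INTO surname (Surname) VALUES ('Jones')").lastrowid
conn.executemany(
    "INSERT INTO family_member (SurnameID, FirstName, Age) VALUES (?, ?, ?)",
    [(smith_id, "Anna", 41), (smith_id, "Ben", 12), (jones_id, "Cara", 30)],
)
conn.commit()


def members_of(surname):
    """Return (FirstName, Age) for every member of the given surname."""
    return conn.execute(
        """SELECT m.FirstName, m.Age
           FROM family_member AS m
           JOIN surname AS s ON s.SurnameID = m.SurnameID
           WHERE s.Surname = ?
           ORDER BY m.FirstName""",
        (surname,),
    ).fetchall()


print(members_of("Smith"))
```

Note that every lookup by surname now needs a join, which is the trade-off the other answers discuss.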

Yes, the previous answer about learning database normalization is probably accurate, but for starters...
Breaking down the person's name (first and last) is probably a bit much, unless you are assuming that everyone named "Jones" is related. Think of each table as an entity/object and try to connect them to real-world "objects" as much as possible. Since a person needs both a first and a last name (at minimum) to be uniquely identified, the name should not be normalized in that way.
In the scenario you've painted, you should have a Persons table that has PersonId, FirstName, LastName. And if need be, a separate table to store other information. However, since the person can only be of one height, weight, age, etc... those should be stored in the Persons table.
Therefore, you really only need one table here. Unless you start getting into phone numbers, addresses, etc.

The decomposition can be done as follows
CREATE TABLE SURNAMES(ID INT PRIMARY KEY, SURNAME VARCHAR2(200))
CREATE TABLE DETAILS(ID INT PRIMARY KEY, SURNAME_ID INT, PARAM1, PARAM2 ....., FOREIGN KEY(SURNAME_ID) REFERENCES SURNAMES(ID))
A rough sketch of the decomposition is
Get the list of attributes (SURNAME, PARAM1, PARAM2,....).
Based on the list of attributes, the following keys can be inferred:
1. (SURNAME)
2. (PARAM1, PARAM2, ...)
A separate table is created for each set of keys.

I don't really want 'surname' listed 8 times for each member
Why? Have you measured on realistic amounts of data and determined it's actually a problem?
Unless you plan having additional data specific to the surname (and independent of persons who have that surname), there is nothing wrong about surname not being in its own table. You are not breaking any normal form.
In fact, what you propose can be a really bad idea, for the following reasons:
First and foremost, you'd need a JOIN just to find out a person's surname, which is bad for performance.
It complicates (and slows down) insertion, modification and deletion of persons.
When inserting a new person, you'd have to search the surname table to decide whether you can "reuse" an existing surname or must insert a new one.
Modification (e.g. when a wife takes her husband's surname) is a combination of deletion (see below) and insertion.
Can a surname exist without any person having it? If not, there is no good declarative integrity constraint to enforce this. At best you'd need to write some triggers.
You might not end up saving much space after all: the additional table will have its own storage overheads (such as the index "underneath" the primary key), which may "eat" much of the anticipated storage saving.

Related

Is levenshtein distance the best tool for the job when I know the proper spelling of a string and historical misspellings?

I have two tables.
Table A has a single entry for each current employee, and contains the proper spelling of each employee's name. There are only ever 80 employees at a given time, but the names themselves change periodically.
It looks a bit like this:
FirstName  MiddleName  LastName  EmployeeID
John                   Smith     1234
Michael    Doe         Tabler    1235
I have another table, Table B, with millions of entries. This table is populated by users in the field entering in full names as they hear them in person.
Name            DateEntered
JOHN SMITH      20210701
JONATHAN SMITH  20210701
MICHAEL DOE     20210630
MIKE DOE        20210425
JON R. SMITH    20201231
To see what I'm up against, I ran a simple query attempting to view certain variations on names. Something like:
SELECT TOP 50 Name, COUNT(*) as hits
FROM Table_B
WHERE Name like 'Jo%' and Name like '%Sm%'
GROUP BY Name
ORDER BY hits desc;
Which returns:
Name                        Hits
JOHN SMITH                  171
JOHN R. SMITH               98
JONATHAN SMITH              67
JOHN R SMITH                45
JOHNSMITH                   35
JOHN SMIHT                  12
JOSIE SMULLET               9
JOHN DOE FOR BRAD SMATTEX   1
And so on and so on, with as many variations as you can think of on a given name.
Quite simply, I need to be able to see future misspellings and properly associate them with a user.
Now, I've managed to get a C# project that can determine the levenshtein distance between strings, so this question isn't really about how to generate the distance itself, or even how to write the code that will solve my problem.
I'm more so wondering if I am using the right tool for the job by assuming that a levenshtein function is my secret key, or if I am creating an XY problem and should be investigating other avenues to solve this, or if I even have enough data in front of me to achieve the task at hand.
It becomes a design choice, both in the schema and in the UI.
Who will be managing the name Alias data?
Is there a UX to clarify when a unique match cannot be made with certainty?
How many different processes need to use the Alias?
How often is the Alias lookup going to be used?
What level of certainty do you need, and how critical is the data?
If you want the users to be able to manage the known Aliases or common misspellings, then by all means create a table (or array) that allows the users (or administrators) to manage the lookup.
It also comes down to the scenario. If you need this for frequent importing of data then you need a definitive source of data to match on to give you confidence that your process will work.
In this scenario, I would validate the input against the mapped Alias values for each name. If a unique name cannot be identified, fail the input until a unique result can be found; this forces the DBA, an admin or the users to update the Alias list accordingly.
If this is very infrequent, then it might be simpler to manage this in a script that parses and modifies the input first, rather than building this into your schema. Then you or the DBA performing the input can manage the script when the list of employees changes, or a new misspelling appears.
Be careful not to over-engineer solutions like this. Levenshtein is great to sort lists of users against a search argument to assist users to find someone, but due to internationalization, multiculturalism and general quirky choices of people out there, the number of names that clash or return false matches might not be acceptable.
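The "Alias list first, fail when there is no unique match" approach described above could be sketched roughly as follows. This is only an illustration in Python: the EMPLOYEES and ALIASES data, the resolve function and the max_distance threshold are all assumptions, and the Levenshtein fallback is an optional extra that only accepts a single clear winner.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical data: canonical names by EmployeeID plus a curated alias list.
EMPLOYEES = {1234: "JOHN SMITH", 1235: "MICHAEL DOE TABLER"}
ALIASES = {"JONATHAN SMITH": 1234, "JOHN R. SMITH": 1234, "MIKE DOE": 1235}


def normalize(name):
    """Upper-case the name and collapse repeated whitespace."""
    return " ".join(name.upper().split())


def resolve(name, max_distance=2):
    """Return an EmployeeID, or None when no unique, confident match exists."""
    key = normalize(name)
    for emp_id, canonical in EMPLOYEES.items():
        if key == canonical:
            return emp_id
    if key in ALIASES:
        return ALIASES[key]
    # Fuzzy fallback: accept only a single best candidate within the threshold.
    scored = sorted((levenshtein(key, canonical), emp_id)
                    for emp_id, canonical in EMPLOYEES.items())
    best_distance, best_id = scored[0]
    unique = len(scored) == 1 or scored[1][0] > best_distance
    if best_distance <= max_distance and unique:
        return best_id
    return None  # caller should reject the input and have the alias list updated


for entered in ["john smith", "Mike Doe", "JOHN SMIHT", "JOSIE SMULLET"]:
    print(entered, "->", resolve(entered))
```

Anything that comes back as None would be queued for a person to map by hand, which keeps the curated list as the definitive source and limits the fuzzy match to obvious typos.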

Which is faster for select queries: adding a foreign key to another table or creating a flat table?

I want to find the fastest approach for select queries.
I have a table that contains two million rows, and I want to add country information to each row.
For example, the table is:
strain(id,name,sequenceinformations,depositor,numberofsequences)
and I want to add country information: country(id,name,code)
Which is faster: storing the country information in the same table, or adding a country table and storing just the country id?
I know that separate tables are the better design and are much better for maintenance, but in my case I care only about speed.
The age-old normalization vs denormalization debate. At first glance, a separate table (the normalized approach) seems like the logical choice. However, for country data (which tends to be relatively static), adding it directly to the first table is a viable option. On the rare occasion when a country changes its name, the amount of maintenance is fairly minimal. Sure, it takes up more space, but space is cheap.
That said, for relatively small databases, the performance difference is probably negligible. Therefore, the best approach is whatever you find easiest to understand and maintain.
Also consider if the country information is likely to be used in other tables: if you're not careful, maintenance could become difficult and error prone.
So, to address your specific question: yes, a denormalized approach will, in most cases, be technically faster for select queries, but slower in update queries. Whether the difference is sufficient to justify it is another question.
As an aside, I saw an interesting approach recently where a separate table with country data was kept for the purpose of populating dropdown lists, etc., but the country name itself was added to the other tables. Obviously this approach isn't as robust as full normalization, but it certainly helped enforce a certain level of consistency.
Since your country table will never have more rows than there are countries in the world, it will be a small table, so you can keep the country data in a separate table and use a join to get it.
I believe a hash join would be the better option, but MySQL has traditionally resolved all joins using nested-loop joins (hash joins only arrived in MySQL 8.0.18). In a nested-loop join, the driving table is read once, and for each row in the driving table the inner table is processed once. The smaller the inner result set, the better the performance, so you want the country table to be the inner input. If the inner input is indexed, it will be faster still.
In the end, it depends on how often your main table is updated versus selected. With more updates, go for the separate table; with fewer updates, go for the other approach.
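Rather than guessing, the two layouts can be timed side by side. The sketch below is only an illustration: it uses Python's built-in sqlite3 rather than MySQL, trims the strain table to two columns, and the table names strain_norm and strain_flat, the sample countries and the ROWS constant are all assumptions. Absolute timings will differ on a real server, so treat it as a template for your own measurement.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT, code TEXT);
-- Normalized: strain rows carry only the country id.
CREATE TABLE strain_norm (id INTEGER PRIMARY KEY, name TEXT,
                          country_id INTEGER REFERENCES country(id));
-- Denormalized: country name and code are repeated on every strain row.
CREATE TABLE strain_flat (id INTEGER PRIMARY KEY, name TEXT,
                          country_name TEXT, country_code TEXT);
""")

countries = [(1, "France", "FR"), (2, "Germany", "DE"), (3, "Japan", "JP")]
conn.executemany("INSERT INTO country VALUES (?, ?, ?)", countries)

ROWS = 20000  # raise towards the real two million rows when measuring
conn.executemany(
    "INSERT INTO strain_norm VALUES (?, ?, ?)",
    ((i, f"strain{i}", countries[i % 3][0]) for i in range(ROWS)),
)
conn.executemany(
    "INSERT INTO strain_flat VALUES (?, ?, ?, ?)",
    ((i, f"strain{i}", countries[i % 3][1], countries[i % 3][2]) for i in range(ROWS)),
)
conn.commit()

JOIN_SQL = """SELECT s.id, s.name, c.name, c.code
              FROM strain_norm AS s JOIN country AS c ON c.id = s.country_id
              ORDER BY s.id"""
FLAT_SQL = """SELECT id, name, country_name, country_code
              FROM strain_flat ORDER BY id"""


def timed(sql):
    """Run a query, returning its rows and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return rows, time.perf_counter() - start


join_rows, join_secs = timed(JOIN_SQL)
flat_rows, flat_secs = timed(FLAT_SQL)
print(f"join: {join_secs:.4f}s  flat: {flat_secs:.4f}s  same rows: {join_rows == flat_rows}")
```

Both queries return identical rows, so the only question left is whether the measured difference on your data and engine is large enough to justify the denormalized layout.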

How to create a copy of items in database without breaking FK constraints?

Ok so I have a structure like this (don't know how to format this properly but I'll try):
Table Location, contains id, name and address.
Table Room, which contains id, name, size and a foreign key to a location.
Table RoomItem (think tables, chairs etc), contains id, name, type, value and a foreign key to a room.
So let's say I have set up a location with a dozen rooms and a hundred roomitems using this structure. Now I want to create a new location, but I want to have exactly the same rooms and roomitems as a template for the new location. So basically I'd want to copy all rooms pointing to a certain location and change their FK to the new location, as well as copy all the roomitems pointing to said rooms, and change their FKs to point to the newly created rooms.
I am using Entity Framework 4.3, but anything executable from a standard C# WinForms project would work just fine.
And my problem is figuring out how to do this "properly". Doing it all manually seems like quite the hassle, meaning something like going through the rooms one by one, copying all their fields and creating a new one, then diving deeper into its roomitems and copying them, after saving the first room so I have its id available.
Even writing this out sounds quite confusing, so is there a better way of accomplishing this?
I provide a way of doing this below, in SQL. Please note this answer is subject to the following caveats:
I am answering without having available any development tools, or reference material. This means that the answer comes out of my memory.
In particular this implies that the solution has not been compiled; also its syntax and semantics may be off in some respects.
I have used the variant of SQL that I know best, which is Firebird's. The exact syntax you need may vary depending on the SQL dialect that is available to you.
In addition I have assumed that the variable names are sufficiently self explanatory to warrant no further comment.
Finally, this answer can probably be improved - the repetition of SELECT statements indicates to me that there is a refinement that will remove this repetition.
SET TERM !! ;
/* Written as a stored procedure so that Copy_From_id can be passed in. */
/* Names such as "value" may be reserved words in your dialect and need quoting. */
CREATE PROCEDURE Deep_Copy_Location (Copy_From_id INTEGER)
RETURNS (New_Loc_id INTEGER)
AS
  DECLARE VARIABLE Old_Roo_id INTEGER;
  DECLARE VARIABLE New_Roo_id INTEGER;
BEGIN
  /* Copy the location row itself under a freshly generated id. */
  New_Loc_id = GEN_ID(id_GEN, 1);
  INSERT INTO Location (id, name, address)
    SELECT :New_Loc_id, name, address FROM Location WHERE id = :Copy_From_id;
  /* Copy each room, remembering its new id so that its items can follow it. */
  FOR SELECT id FROM Room WHERE fk = :Copy_From_id INTO :Old_Roo_id DO
  BEGIN
    New_Roo_id = GEN_ID(id_GEN, 1);
    INSERT INTO Room (id, name, size, fk)
      SELECT :New_Roo_id, name, size, :New_Loc_id FROM Room WHERE id = :Old_Roo_id;
    INSERT INTO RoomItem (id, name, type, value, fk_id)
      SELECT GEN_ID(id_GEN, 1), name, type, value, :New_Roo_id
      FROM RoomItem WHERE fk_id = :Old_Roo_id;
  END
END !!
SET TERM ; !!
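The same idea (insert the new location, copy each room while remembering its new id, then copy that room's items against the new id) can be sketched in a form that runs anywhere. The example below uses Python's built-in sqlite3 rather than Entity Framework or Firebird; the column names location_id and room_id, the copy_location function and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the FK constraints
conn.executescript("""
CREATE TABLE Location (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE Room (id INTEGER PRIMARY KEY, name TEXT, size INTEGER,
                   location_id INTEGER NOT NULL REFERENCES Location(id));
CREATE TABLE RoomItem (id INTEGER PRIMARY KEY, name TEXT, type TEXT, value REAL,
                       room_id INTEGER NOT NULL REFERENCES Room(id));
""")


def copy_location(conn, source_id, new_name, new_address):
    """Create a new location with copies of the source's rooms and room items."""
    with conn:  # one transaction: the whole copy commits or rolls back together
        new_loc = conn.execute(
            "INSERT INTO Location (name, address) VALUES (?, ?)",
            (new_name, new_address),
        ).lastrowid
        rooms = conn.execute(
            "SELECT id, name, size FROM Room WHERE location_id = ?", (source_id,)
        ).fetchall()
        for old_room, name, size in rooms:
            new_room = conn.execute(
                "INSERT INTO Room (name, size, location_id) VALUES (?, ?, ?)",
                (name, size, new_loc),
            ).lastrowid
            # Copy this room's items in one statement, pointing them at the new room.
            conn.execute(
                """INSERT INTO RoomItem (name, type, value, room_id)
                   SELECT name, type, value, ? FROM RoomItem WHERE room_id = ?""",
                (new_room, old_room),
            )
    return new_loc


# Hypothetical template location with two rooms and three items.
conn.execute("INSERT INTO Location (id, name, address) VALUES (1, 'HQ', '1 Main St')")
conn.execute("INSERT INTO Room (id, name, size, location_id) "
             "VALUES (1, 'Lobby', 40, 1), (2, 'Office', 20, 1)")
conn.execute("INSERT INTO RoomItem (name, type, value, room_id) "
             "VALUES ('Sofa', 'seat', 300.0, 1), ('Desk', 'table', 150.0, 2), "
             "('Chair', 'seat', 50.0, 2)")
conn.commit()

new_id = copy_location(conn, 1, "Branch", "2 Side St")
print("new location id:", new_id)
```

Because parents are always inserted before their children, no foreign key constraint is ever violated along the way.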

Can I manage employee-related data in one table?

I have seen people use a 'DataListValue' table to store values (Call_Types, DepartmentCodes, Divisions, etc.) that are used quite often in drop-down lists in the UI.
This way I can manage them in one table and have one screen to update codes.
I am wondering whether it is okay to keep DepartmentCode, RoleCode and CountryCode in the Data_List table, or whether I should have them in separate tables.
It is common practice to have a single codes table that stores code/description pairs with some kind of table type column. For instance, you might commonly see a table like this:
CodesTable (TableId, Code, Description)
However, I have never thought it was a good idea. Even if all you store is a code and a description, it's better to make a new table for each set of codes. That way your foreign key relationships will be more clear. Plus, inevitably, you will one day need to store some additional data about one of those code sets and you'll end up needing to add an additional column that only applies to one of the code sets that are stored in the table and the column will be null for all the other rows. It always turns ugly fast.
For instance, let's say, as in the example above, you set TableId to "C" for all the Country codes and you set it to "D" for all Department codes. But then next month a new requirement comes in that requires you to store a postal abbreviation for each Country code. Do you add a PostalAbbreviation column to the table even though it will never apply to Department codes? Or do you create another table that just stores additional data for each country code? How do you know what "C" and "D" mean unless you have some other place to look them up? All around it's just a bad idea.
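A rough sketch of the separate-tables alternative is shown here, using Python's built-in sqlite3 so it can be run directly. The Employee table, the add_employee helper and the sample codes are assumptions for illustration. The point is that each foreign key names exactly one code set, and PostalAbbreviation lives only where it applies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the FK constraints
conn.executescript("""
-- One table per code set; extra columns are added only where they apply.
CREATE TABLE Country (
    Code               TEXT PRIMARY KEY,
    Description        TEXT NOT NULL,
    PostalAbbreviation TEXT
);
CREATE TABLE Department (
    Code        TEXT PRIMARY KEY,
    Description TEXT NOT NULL
);
CREATE TABLE Employee (
    EmployeeID     INTEGER PRIMARY KEY,
    Name           TEXT NOT NULL,
    CountryCode    TEXT NOT NULL REFERENCES Country(Code),
    DepartmentCode TEXT NOT NULL REFERENCES Department(Code)
);
INSERT INTO Country VALUES ('CA', 'Canada', 'CAN');
INSERT INTO Department VALUES ('HR', 'Human Resources');
""")


def add_employee(name, country_code, department_code):
    """Insert an employee; return False if either code is not in its own table."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO Employee (Name, CountryCode, DepartmentCode) VALUES (?, ?, ?)",
                (name, country_code, department_code),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(add_employee("Ann", "CA", "HR"))  # valid codes
print(add_employee("Bob", "HR", "CA"))  # codes swapped between the two sets
```

With a single shared codes table, the swapped codes in the second call could only be caught by additionally checking TableId somewhere, since both values would exist in the same table.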

Saving multiple items per single database cell

I have a list of countries. Each user can check multiple countries. Once saved, this "user country list" will be used to determine whether other users fit into the countries a given user chose.
The question is what the most efficient approach to this problem would be...
I have one idea: save the user's selection as a delimited list like Canada,USA,France ... in a single varchar(max) field. The problem is what happens when a user from Germany enters the page I perform this check on: to search for Germany I would need to fetch all items and split each field to check against the value, or use SQL 'like', which again is pretty damn slow.
If you have a better solution or some tips, I would be glad to hear them.
Just to be clear, many users will have their own selection of countries, and they want only users from those countries to land on their page, while millions of users will reach those pages. So the faster the approach, the better.
Technology: MSSQL and ASP.NET
Thanks
You should not store a list of values in one cell. Consider having a separate table that stores each of the selected countries with a foreign key reference to the user table. This is standard Database Normalization.
PLEASE don't go down the route you're thinking of, storing multiple entries in one field. I've had to re-write more applications because of bad database design than for any other reason, and that is a bad design.
Added
I have this poster on my wall at work: http://www.informationqualitysolutions.com/FreeStuff/rettigNormalizationPoster.pdf
One of my predecessors was a newbie to DB Design, and this helped her a lot. I keep it for any new hires that may need it. It explains normalization very nicely, with examples.
Do not save delimited fields into your database. Your database will not be normalized.
You need a many-to-many table for users and countries:
UserId
CountryId
If you do start using a delimited field, you end up needing to parse it (either in SQL or in your code). It is more difficult to query and optimize.
In this case, you will want to create a table called UserCountries (or some such) which would store the UserID and CountryID. This is a standard relational construct. To beginners, it seems strange and too involved, but this structure makes it very easy and very fast to write flexible queries against this type of data. No delimiting required!
I think it would be better to use a UserCountry table, which contains a link to the User and the Country table. This creates a lot more possibilities to query against the database. Example queries that are much simpler this way:
Number of Countries per user
All users which selected a particular country
Sort all popular countries
Do not store multiple countries in a single field. Add 2 additional tables - Countries (ID, Name) and UserCountries (UserID, CountryID)
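The UserCountries layout suggested in these answers could look roughly like the sketch below. It runs on Python's built-in sqlite3 rather than MSSQL, and the Users table, the extra index, the helper functions and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the FK constraints
conn.executescript("""
CREATE TABLE Users     (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Countries (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
-- One row per (user, country) selection; the composite key prevents duplicates.
CREATE TABLE UserCountries (
    UserID    INTEGER NOT NULL REFERENCES Users(ID),
    CountryID INTEGER NOT NULL REFERENCES Countries(ID),
    PRIMARY KEY (UserID, CountryID)
);
-- Supports "which users selected this country" lookups.
CREATE INDEX IX_UserCountries_Country ON UserCountries (CountryID, UserID);

INSERT INTO Users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO Countries VALUES (1, 'Canada'), (2, 'USA'), (3, 'France'), (4, 'Germany');
INSERT INTO UserCountries VALUES (1, 1), (1, 2), (1, 3), (2, 3), (2, 4);
""")


def user_allows(user_id, country_name):
    """True if the given user selected the given country."""
    row = conn.execute(
        """SELECT EXISTS (
               SELECT 1 FROM UserCountries AS uc
               JOIN Countries AS c ON c.ID = uc.CountryID
               WHERE uc.UserID = ? AND c.Name = ?)""",
        (user_id, country_name),
    ).fetchone()
    return bool(row[0])


def users_allowing(country_name):
    """Names of all users who selected the given country."""
    rows = conn.execute(
        """SELECT u.Name FROM Users AS u
           JOIN UserCountries AS uc ON uc.UserID = u.ID
           JOIN Countries AS c ON c.ID = uc.CountryID
           WHERE c.Name = ? ORDER BY u.Name""",
        (country_name,),
    ).fetchall()
    return [name for (name,) in rows]


def countries_by_popularity():
    """(country, number of users) pairs, most selected first."""
    return conn.execute(
        """SELECT c.Name, COUNT(*) AS Picks FROM Countries AS c
           JOIN UserCountries AS uc ON uc.CountryID = c.ID
           GROUP BY c.ID, c.Name ORDER BY Picks DESC, c.Name""",
    ).fetchall()


print(user_allows(1, "Germany"), users_allowing("France"), countries_by_popularity())
```

The per-visitor check becomes a lookup on the composite key instead of a string split or a 'like' scan over every row.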
