Hypothetically, I have two SQL tables: Table and AuditTable. Table has a column, org_id, of type float with nulls allowed; org_id is not a primary key. A column with the same name resides in AuditTable. I also have an EditTable class used to make changes to Table and AuditTable. The members of EditTable are set via a user interface. EditTable also contains an org_id member.
There is no good reason why Table.org_id was made a float; it will always contain an integer value. However, since Table already exists, I can't change the type of Table.org_id. Since I created AuditTable and EditTable, though, I can give AuditTable.org_id and EditTable.org_id any type.
When Visual Studio converts Table into a C# class, Table.org_id becomes a Nullable<double>. Should I make AuditTable.org_id a float with nulls allowed and make EditTable.org_id a nullable double to match Table.org_id? Or should I make both AuditTable.org_id and EditTable.org_id ints and then do some casting? I was thinking of staying away from casting to be on the safe side and just making the types match the original Table.
Thanks for any suggestions.
Oh, it is a bad idea to store join keys as floating-point numbers; I wish SQL actually banned the practice. The issue is that 0.9999999999 might display as 1.00000000, but the two values don't match when joining (or in a WHERE clause). It is much better to have what-you-see-is-what-you-get for such conditions.
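A minimal C# sketch of the pitfall (the values are illustrative, not from any real table):

// Two doubles that print identically at eight decimal places but are not
// equal, which is exactly what breaks float-typed join/WHERE comparisons.
double a = 0.3;
double b = 0.1 + 0.2; // binary rounding makes this 0.30000000000000004

Console.WriteLine(a.ToString("F8")); // 0.30000000
Console.WriteLine(b.ToString("F8")); // 0.30000000
Console.WriteLine(a == b);           // False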
First, go to whoever you can and beg/bribe/flatter/encourage them to do:
ALTER TABLE [Table] ALTER COLUMN org_id int NULL;
If that doesn't work, you have a conundrum. It is much better for query performance to have join keys of the same type, even a type I don't agree with; consistent join-key types are a pretty important concept for databases. So you cannot simply change the type on your side of the join.
Instead, I think you should add a new column to your table, called something like org_id_int. This would have the correct type, a useful index -- everything except a pretty name. Use it for joins among the tables you control, and use the float key for joins to the existing table, until it gets fixed.
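If you do end up converting in C# instead, here is a hedged sketch of moving the Nullable<double> into a Nullable<int> safely (the helper name and tolerance are my assumptions, not from the question):

// Convert Table.org_id (double?) into an int? for AuditTable/EditTable,
// failing loudly if the float column ever holds a non-integral value.
static int? ToOrgIdInt(double? orgId)
{
    if (!orgId.HasValue) return null;

    double rounded = Math.Round(orgId.Value);
    if (Math.Abs(orgId.Value - rounded) > 1e-6) // tolerance is an assumption
        throw new InvalidOperationException(
            "org_id " + orgId.Value + " is not an integral value.");

    return checked((int)rounded);
}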
I have a field in a sqlite database, we'll call it field1, over which I'm trying to iterate for each record (there are over a thousand records). The field type is string. The values of field1 in the first four rows are as follows:
DEPARTMENT
09:40:24
PARAM
350297
Here is some simple code I use to iterate over each row and display the value:
while (sqlite_datareader.Read())
{
    strVal = sqlite_datareader.GetString(0);
    Console.WriteLine(strVal);
}
The first 3 values display correctly. However, when it gets to the numerical entry 350297, the GetString() method errors out with the following exception:
An unhandled exception of type 'System.InvalidCastException' occurred in System.Data.SQLite.dll
I've tried casting to a string, and a bunch of other things, but I can't get to the bottom of why this is happening. For now, I'm forced to use GetValue(), which returns an object, and then convert back to a string. But I'd like to figure out why GetString() isn't working here.
Any ideas?
EDIT: Here's how I currently deal with the problem:
object objVal; // This is declared before the loop starts...
objVal = sqlite_datareader.IsDBNull(i) ? "" : sqlite_datareader.GetValue(i);
if (!Equals(objVal, ""))
{
    // Convert.ToString handles any storage class; a direct (string) cast
    // would throw InvalidCastException when the value is a boxed integer.
    strVal = Convert.ToString(objVal);
}
What the question should have included is
The table schema, preferably the CREATE TABLE statement used to define the table.
The SQL statement used in opening the sqlite_datareader.
Any time you're dealing with data-type issues from a database, it is prudent to include such information. Otherwise there is much unnecessary guessing and floundering (as is apparent in the comments) when crucial, very useful information is explicitly defined in the schema DDL. The underlying query used to get the data is perhaps less critical, but it could very well be part of the issue if there are CASTs and/or other expressions that might be affecting the returned types. If I were debugging the issue on my own system, these are the first things I would have checked!
The comments contain good discussion, but the best solution will come from understanding how sqlite handles data types, straight from the official docs. The key takeaway is that sqlite defines type affinities on columns and then stores actual values according to a limited set of storage classes. A type affinity is the type to which data will attempt to be converted before being stored. But (from the docs) ...
The important idea here is that the type is recommended, not required. Any column can still store any type of data.
But now consider...
A column with TEXT affinity stores all data using storage classes NULL, TEXT or BLOB. If numerical data is inserted into a column with TEXT affinity it is converted into text form before being stored.
So even though values of any storage class can be stored in any column, the default behavior should have been to convert a numeric value like 350297 to a string before storing it... if the column had been properly declared as a TEXT type.
But if you read carefully enough, you'll eventually come to the following at the end of section 3.1.1. Affinity Name Examples:
And the declared type of "STRING" has an affinity of NUMERIC, not TEXT.
So if the question details are taken literally and field1 was defined like field1 STRING, then technically it has NUMERIC affinity, and so a value like 350297 would have been stored as an integer, not a string. And the behavior described in the question is precisely what one would expect when retrieving data into a strictly-typed data model like System.Data.SQLite.
It is very easy to cuss at such an unintuitive design decision, and I won't defend the behavior, but
at least the results of the "STRING" type are clearly stated, so the column can be redefined as TEXT in order to fix the problem, and
"STRING" is actually not a standard SQL data type. SQL strings are instead defined with TEXT, NTEXT, CHAR, NCHAR, VARCHAR, NVARCHAR, etc.
The solution is either to use the code as currently implemented: get all values as objects and then convert to string values... which should be universally possible with .NET objects, since they all have a ToString() method.
Or, redefine the column to have TEXT affinity like
CREATE TABLE myTable (
    ...
    field1 TEXT,
    ...
)
Exactly how to redefine an existing column filled with data is another question altogether. However, when converting data from the original column to the new one, remember to use CAST(field1 AS TEXT) to ensure the storage class is changed for the existing data. (I'm not certain whether type affinity is "enforced" when simply copying/inserting data from one table into another, or whether the original storage class is preserved by default. That's why I suggest the cast: to force each value to text.)
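Going back to the first option, a minimal C# sketch of the defensive read loop (the reader and column index come from the question; the empty-string fallback is my assumption):

while (sqlite_datareader.Read())
{
    // GetValue returns the value in its actual storage class (long for
    // INTEGER, string for TEXT, DBNull for NULL), so convert explicitly.
    object raw = sqlite_datareader.GetValue(0);
    string strVal = raw is DBNull ? "" : Convert.ToString(raw);
    Console.WriteLine(strVal);
}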
I am allowing users to generate expressions against predefined columns on a table. A user can create columns and tables, and can define constraints such as unique and not-null columns. I also want to allow them to generate "calculated columns". I am aware that PostgreSQL does not allow calculated columns, so to get around that I'll use expressions like this:
SELECT CarPrice, TaxRate, CarPrice + (CarPrice * TaxRate) AS FullPrice FROM CarMSRP
The user can enter something like this
{{CarPrice}} + ({{CarPrice}} * {{TaxRate}})
Then it gets translated to
CarPrice + (CarPrice * TaxRate)
I'm not sure whether this is vulnerable to SQL injection. If so, how would I make it secure?
Why don't you utilize STORED PROCEDURES to do this?
That way you can, for instance, define variables to receive what the user wrote and check them for BLACKLISTED words (like DELETE, TRUNCATE, ALL, *, and so forth).
I don't know PostgreSQL, but if that's not possible there, you can still check for those problematic commands BEFORE translating the input into your SELECT statement.
If I understand you correctly, you just take user input as described above and substitute it into the select column list. If so, that is surely not safe, because something like:
"* from SomeSystemTable--({{CarPrice}} + ({{CarPrice}} * {{TaxRate}})"
will allow the user to select anything from any other table he has permissions for. You can try to build an expression tree to avoid that: parse user input into some structure describing variables and the arithmetic operations between them (like parsing arithmetic expressions). Otherwise, you can remove all {{...}} placeholders from your string (ensuring that each {{...}} corresponds to a column in the table) and check that only "+-*()" and whitespace characters are left.
Note that from a user-experience viewpoint you will need to parse the expression anyway, to warn the user about errors without actually running the query.
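A hedged C# sketch of that placeholder-whitelist check (the column list, regex, and method name are illustrative assumptions):

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ExpressionGuard
{
    // Hypothetical whitelist; in practice, load the table's real column names.
    static readonly HashSet<string> AllowedColumns =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "CarPrice", "TaxRate" };

    public static bool TryTranslate(string input, out string translated)
    {
        translated = null;
        bool allKnown = true;

        // Strip each {{Name}} placeholder, verifying the name is a real column.
        string residue = Regex.Replace(input, @"\{\{(\w+)\}\}", m =>
        {
            if (!AllowedColumns.Contains(m.Groups[1].Value)) allKnown = false;
            return "";
        });

        // Whatever remains may only be numbers, arithmetic operators, and whitespace.
        if (!allKnown || !Regex.IsMatch(residue, @"^[\d+\-*/().\s]*$"))
            return false;

        // Safe to substitute: {{CarPrice}} -> CarPrice, etc.
        translated = Regex.Replace(input, @"\{\{(\w+)\}\}", "$1");
        return true;
    }
}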
I have an application which has rows of data in a relational database, and the table needs a status which will always be one of:
Not Submitted, Awaiting Approval, Approved, Rejected
Now, since these will never change, I was trying to decide the best way to implement them. I can think of a Status enum with these values and an assigned int, where the int is placed into the status column on the table row.
Or a status table linked to the main table, where the user selects one of its rows as the current status.
I can't decide which is the better option. I currently have an enum in place with these values, used by the approval pages to populate the dropdowns etc. and to set up the SQL (the table currently uses two bools, Approved and SubmittedForApproval, but this is dirty for various reasons and needs to change).
I'm wondering what your thoughts on this are, and whether I should go for one or the other.
If it makes any difference I am using Entity framework.
I would go with the Enum if it never changes, since this will be more performant (no join to get the status). Also, it's the simpler solution :).
Now since these will never change...
You can count on this assumption being false, and sooner than you think.
I would use a lookup table. It's far easier to add or change values in a lookup table than to change the definition of an enum.
You can use a natural primary key in the lookup table so you don't need a join to get the value. Yes, a string takes a bit more space than an integer id, but if your goal is to avoid the join, this will accomplish that goal.
I use Enums with the [Description("asdf")] attribute to attach meaningful sentences or other things that aren't allowed in enum names. Then I use the enum text itself as the value in dropdowns and the Description as the visible text.
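A minimal sketch of that pattern, using the question's statuses (the extension method is one common way to read the attribute, not the only one):

using System;
using System.ComponentModel;
using System.Reflection;

public enum Status
{
    [Description("Not Submitted")]     NotSubmitted,
    [Description("Awaiting Approval")] AwaitingApproval,
    [Description("Approved")]          Approved,
    [Description("Rejected")]          Rejected
}

static class EnumExtensions
{
    // Reads the [Description] attribute off an enum member,
    // falling back to the member name if none is present.
    public static string GetDescription(this Enum value)
    {
        FieldInfo field = value.GetType().GetField(value.ToString());
        var attr = field?.GetCustomAttribute<DescriptionAttribute>();
        return attr?.Description ?? value.ToString();
    }
}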
I have a SQL lookup table like this:
CREATE TABLE Product(Id INT IDENTITY PRIMARY KEY, Name VARCHAR(255))
I've databound an ASP.NET DropDownList to an LLBLGen entity. The user selects a product, and the Id gets saved. Now I need to display some product-specific details later on. Should I use the product's Id, and hope the Id is always the same between installations?
switch (selectedProduct.Id)
{
    case 1: // product one
        break;
    case 2:
    case 3: // product two or three
        break;
}
or use the name, and hope that never changes?
switch (selectedProduct.Name)
{
    case "product one":
        break;
}
Or is there a better alternative?
If you know all of the items in this table (which I guess you do if you can do a switch on them) and want them the same for each installation, then maybe it should not be an identity column, and you should insert 1, 2, 3 along with the products themselves.
For this situation, there are three common solutions I have seen:
Hard code the ID - this is quick and dirty, not self-documenting (you don't know what product is being referred to), and prone to breakage as you pointed out. I never use this method anymore.
Enums - I use this when the table is small and static. So, ProductType would be a possible candidate for this. This is self-documenting code, but it still creates an awkward connection between code and data: if records are inserted with different IDs than you planned for, things break. You can mitigate this by automating the Enum generation, or by checking the Enum against the table at startup (see the sketch after this list), but it still feels wrong. E.g., if your unit tests are inserting records into the Product table, it will be difficult for them to recreate the Enum at that point. Also, if you have 100,000 records, the Enum approach starts to look pretty dumb.
Add an additional column that is a non-changing identifier. I often use AlphaCode as my column name. In your case it would look like:
switch (selectedProduct.AlphaCode)
{
    case "PRODUCT_ONE":
        break;
}
This lets you use an AlphaCode that is self-documenting, allows you to reinsert data without caring about the autoincrement PK value, and lets you change the product name without affecting anything. If you use the AlphaCode approach, ensure that you put a unique index on this column.
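As the mitigation mentioned in the Enums option above, a hedged sketch of checking the enum against the table at application startup (the connection handling, table name, and ProductType values are assumptions):

using System;
using System.Collections.Generic;
using System.Data;

// Hypothetical enum mirroring the Product table.
enum ProductType { ProductOne = 1, ProductTwo = 2, ProductThree = 3 }

static class ProductEnumCheck
{
    // Fail fast at startup if the enum and the Product table have drifted apart.
    public static void VerifyProductEnum(IDbConnection conn)
    {
        var dbIds = new HashSet<int>();
        using (IDbCommand cmd = conn.CreateCommand())
        {
            cmd.CommandText = "SELECT Id FROM Product";
            using (IDataReader reader = cmd.ExecuteReader())
                while (reader.Read())
                    dbIds.Add(reader.GetInt32(0));
        }

        foreach (ProductType p in Enum.GetValues(typeof(ProductType)))
            if (!dbIds.Contains((int)p))
                throw new InvalidOperationException(
                    "ProductType." + p + " has no matching Product row.");
    }
}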
The other solution, which is often the preferable one, is to move this logic into the database. E.g., if product 1 is the product you always want to show by default when its category is selected, you could add a column to your table called IsHeroProduct. Then your check becomes:
if (selectedProduct.IsHeroProduct)
{
    // do stuff
}
If you want your ProductIDs to be fixed (which doesn't seem to be a good idea), then you can use SET IDENTITY_INSERT (in SQL Server, at least) to ensure ProductID values are the same between installations. But I would normally only do this for static reference data.
You can also use Visual Studio's T4 templates to generate enums directly from the database data.
Some ORMs (LLBLGen, at least) can handle this for you by generating a strongly-typed enum. I've never used that, though.
In these cases, I always just go with an enum that I write myself, but I make sure that all the values match the table, and update the enum if any change. It becomes more interesting when you work across databases (as I do), but if you take care, it is simple enough.
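A minimal sketch of that hand-written-enum approach, assuming the Product rows were inserted with known, fixed IDs (names and values here are illustrative):

// Mirrors the Product lookup table; the numeric values must match the
// Id column in every installation for this to be safe.
public enum KnownProduct
{
    ProductOne = 1,
    ProductTwo = 2,
    ProductThree = 3
}

// Then switch on the strongly-typed value instead of a bare int:
switch ((KnownProduct)selectedProduct.Id)
{
    case KnownProduct.ProductOne:
        break;
    case KnownProduct.ProductTwo:
    case KnownProduct.ProductThree:
        break;
}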
I have several questions regarding where to handle nulls. Let me set up a scenario: imagine a table with five varchar(50) columns, to use as an example when providing reasons for using nulls or empty strings.
Is it better to handle NULLs in code or in the database? By this I mean: is it better to assign an empty string to a varchar(50) that contains no value, or to assign null to the varchar(50) and handle that null in code?
Does assigning an empty string to a column add performance overhead?
How does using a null vs. an empty string affect indexing?
I am under the impression that if you do not allow your database to contain nulls, you do not have to handle them in code. Is that true?
Do other datatypes besides varchars pose the same problems when using a default value or is it more problematic with string datatypes?
What is the overhead of using the ISNULL function if the table contains nulls?
What are other Advantages/Disadvantages?
My general advice is to declare fields in a database as NOT NULL unless you have a specific need to allow null values as they tend to be very difficult for people new to databases to handle.
Note that an empty string and a null string field do not necessarily mean the same thing (unless you define them to). Often null means "unknown" or "not provided", whereas an empty string is just that: a provided and known empty string.
Allowing or disallowing null fields depends entirely on your needs.
The main advantage is that you can handle null and empty strings separately in both the .NET and SQL code - they can, after all, mean different things.
The downside is that you need to be careful: in .NET you must not call obj.SomeMethod() on a null reference, and in SQL you need to watch out, because nulls tend to propagate when combined (unlike, for example, C# string concatenation).
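A small C# sketch of that contrast (the SQL comment assumes SQL Server's default + concatenation semantics):

string s = null;

// C# string concatenation quietly treats null as ""...
Console.WriteLine("Hello " + s + "!"); // prints "Hello !"

// ...but member access on a null reference throws.
try
{
    Console.WriteLine(s.Length);
}
catch (NullReferenceException)
{
    Console.WriteLine("Dereferenced a null string.");
}

// SQL is the opposite: SELECT 'Hello ' + NULL yields NULL,
// because NULL propagates through concatenation.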
There isn't really a noticeable size difference between null and empty. In the .NET code I'd hope that it uses the interned empty string, but it isn't going to matter hugely.
NULL is stored more efficiently (via the NULL bitmap) than an empty string (2 bytes for the varchar length, or n bytes for a char(n)).
Storage engine blog: Why is the NULL bitmap in a record an optimization?
I've seen some articles that say different, but for char/varchar I've found NULL to be useful, and I tend to treat the empty string the same as NULL. I've also found that NULL is quicker in queries than the empty string. YMMV, of course, and I'll evaluate each case on its own merits.
You are intermixing an implementation concern with a logical data architecture concern.
You should decide whether or not to allow nulls in a field purely based on whether it accurately models the data you expect to store in the database. Part of the confusion, as a few others have pointed out, is that null and empty strings are not just two ways of storing the same information.
Null means either there is no value or the value is unknown.
Empty string means there is a value and it is an empty string.
Let me demonstrate with an example. Say you have a middle name field and need to differentiate between situations where the middle name hasn't been populated and where the person doesn't have a middle name. Use the empty string to indicate that there is no middle name, and null to indicate that it hasn't been entered.
In almost all cases where a null makes sense in terms of the data, it should be handled in the application code, not the database, under the assumption that the DB needs to differentiate between the two different states.
The Short Version: Don't pick null vs. empty string based on performance/storage concerns in the DB; pick the one that best models the information you are trying to store.
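A tiny C# sketch of the middle-name convention above (the class and helper are illustrative, not from the answer):

// null = not yet entered / unknown; "" = known to have no middle name.
class Person
{
    public string FirstName { get; set; }
    public string MiddleName { get; set; }
}

static string DescribeMiddleName(Person p)
{
    if (p.MiddleName == null) return "middle name not entered yet";
    if (p.MiddleName.Length == 0) return "has no middle name";
    return p.MiddleName;
}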
I think a null value and an empty string are two different things, both in code and in a database. A variable or field that is null has no value; if either is an empty string, it has a value, which happens to be the empty string.
1: Very subjective; as noted in other answers, there's a tangible difference between NULL (no answer/unknown) and "" (known to be nothing/not applicable - i.e. a person without a middle name).
2: It shouldn't do.
3: AFAIK there should be no effect (I'm still a junior/learning DBA, so take this with a grain of salt).
4: This is arguable. In theory if you apply a NOT NULL constraint to a database field, then you should never have to handle a NULL value. In practice, the gap between theory and practice is smaller in theory than in practice. (In other words, you should probably still handle being given a NULL even if it's theoretically impossible.)
I typically default during design to NOT NULL unless a reason is given otherwise - particularly for money/decimal columns in accounting, where there is usually never an unknown aspect. There might be a case where a money column is optional (like a survey or business-relationship system where you record household/business income - this might not be known until/unless a relationship is formed by the account manager). For datetime, I would never allow a NULL RecordCreated column, for instance, while a BirthDate column would allow NULL.
NOT NULL columns remove a lot of potential extra code and ensure that users will not have to account for NULLs with special handling - especially good in presentation-layer views or data dictionaries for reporting.
I think it's important during design time to devote a great deal of time to the choice of data types (char vs. varchar vs. nchar vs. nvarchar, money vs. decimal, int vs. varchar, GUID vs. identity), NULL/NOT NULL, the primary key, and the choice of clustered index, non-clustered indexes, and INCLUDE columns. I know that probably sounds like everything in DB design, but if answers to all those questions are understood up front, you will have a much better conceptual model.
Note that even in a database where no columns allow NULL, a LEFT JOIN in a view can still produce a NULL.
For a concrete case of the decision process, let's take the simple case of Address1, Address2, Address3, etc., all varchar(50) - a pretty common scenario (which might be better represented as a single TEXT column, but let's assume it's modelled this way). I would not allow NULLs, and I would default to the empty string. The reasons:
1) It's not really unknown - it's blank. The nature of UNKNOWN across multiple address columns is never going to be well-defined. It is highly unlikely you would have a KNOWN Address1 and an UNKNOWN Address2 - you either know the whole address or you don't. Unless you are going to add constraints, let them be blank and don't allow NULLs.
2) As soon as people start naively doing things like Address1 + #CRLF + Address2, NULLs start to NULL out the entire address! Unless you are going to wrap them in a view with ISNULL, or change your ANSI NULL settings, why not let them be blank - after all, that's how users see them.
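On the application side, the empty-string convention keeps that kind of naive concatenation safe; a small sketch (the variable names and values are illustrative):

using System;
using System.Linq;

// With NOT NULL columns defaulting to "", joining address lines just works;
// blank lines are skipped and nothing can NULL out the whole address.
string address1 = "123 Main St", address2 = "", address3 = "Springfield";
string fullAddress = string.Join(Environment.NewLine,
    new[] { address1, address2, address3 }
        .Where(l => !string.IsNullOrWhiteSpace(l)));
// fullAddress: "123 Main St\nSpringfield" - the blank line simply drops out.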
I would probably use the same logic for a middle name or middle initial, depending on how it's used - is there a difference between someone without a middle name and someone whose middle name is unknown?
In some cases, I would not even allow the empty string - and I would enforce this with a constraint. For instance: first and last name on a patient, company name on a customer. These should never be blank or empty (or all whitespace or similar). The more of these constraints that are in place, the better your data quality, and the sooner you catch silly mistakes like import issues, NULL propagation, etc.
Putting faked-up data (an empty string for string data, 0 for numbers, some ridiculous date for dates) instead of null in a database is almost always a poor choice. Those faked-up values do not mean the same thing, and, especially for numeric data, it is hard to pick a fake value that can't collide with a real value. And when you put in bad data, you still have to write code around it to make sure things are handled correctly (such as not returning records which don't have an end date), so you actually save nothing on the development side.
If you cannot know the data at the time the record is inserted, null is the best choice. That said, if the data will always be known, use NOT NULL wherever possible.
You should look into sixth normal form. 6NF was specifically invented to get rid of the problems introduced by the use of NULLs. A lot of those problems are made worse by SQL's three-valued logic (true, false, unknown) and programmers' common use of two-valued logic.
In 6NF, every time a row/column intersection would have to be flagged as NULL, the situation can be handled by simply omitting the row.
However, I generally do not aim for 6NF in database design. Most of the time, NULLable columns are not used as part of search criteria or join criteria, and the problems with NULLs don't surface.