I realise there have been debates about this, although I can't find a really definitive answer.
A lot of the time, this question leads to a discussion of what varchar(MAX) etc. are, their actual limits, and what NOT to use.
What I want to find out is this:
I am giving the user an option to type/paste in whatever they want to in a text box, without limit.
This could be anything from a word, to a byte array dump of an image.
I need to then be able to quickly reference to this data at a later stage, using its TITLE or ID.
What would be the best way to go about storing this data in a DB? I have read that using varchar(MAX) prevents indexing, and that it generally should not be used.
I'm not really well-versed in SQL, so I imagine a possible solution would be to split this string into chunks of 4000 characters and store them like that.
Is this a good lead? Or am I missing something obvious?
General Model:
public string a_Title { get; set; }
public string a_Content { get; set; }
public string a_AdditionalInfo { get; set; }
Where a_Content will be stored as the unknown.
I believe you don't have to worry about your storage type right now. You really have a choice between varchar(n), varchar(max), nvarchar(n), nvarchar(max), even varbinary and text/ntext. All of them have their virtues and drawbacks. You can even chuck your blob-like strings into a file using FILESTREAM.
However, I believe you want to reference your data via some int or short varchar field, and you want to get your results relatively quickly (not blazing-hot-get-me-data-yesterday fast). There are more and less optimal ways to do it, but don't think about that now.
I suggest this:
create a table structure that satisfies your needs
fill it with some dummy data
implement data access using your favorite data access layer
When you feel happy enough, create a performance test with, say, a few thousand accesses. That is the first time you should want to optimize performance. Until then I would even forget about indexing itself.
P.S. - don't split it into 4000-byte chunks; this is a very awkward workaround and it can cause seemingly random search misses (not exactly random, but you might only find the bug after a slow and painful debugging session).
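A minimal sketch of such a starting point, assuming SQL Server; the table and column names are illustrative, not prescriptive:
CREATE TABLE dbo.UserContent
(
    Id      INT IDENTITY(1, 1) PRIMARY KEY,  -- quick lookups by ID
    Title   NVARCHAR(200) NOT NULL,          -- short and indexable
    Content NVARCHAR(MAX) NOT NULL           -- whatever the user typed or pasted
);
-- Index the title, not the blob-like content column
CREATE NONCLUSTERED INDEX IX_UserContent_Title ON dbo.UserContent (Title);
You look rows up by Id or Title, so the unindexable MAX column never needs to participate in a seek.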
VarChar(MAX) is intended for cases like yours.
Even if you chunk the data into NVarChar(4000) (or VarChar(8000)) segments to allow indexes to be built, what worth will those indexes really have?
You'll also be opening yourself up to the headache of figuring out where segments begin and end and then reconstituting them either in a nasty SQL statement or in some middle-tier client code.
Further, VarChar(MAX) and NVarChar(MAX) will be kept in-row when it's possible to do so. That means until you are using 8001 or 4001 characters, respectively.
One approach would be to build two tables, and leverage VARCHAR(MAX):
CREATE TABLE table1 (ID INT PRIMARY KEY IDENTITY(1, 1), Name VARCHAR(1024) NOT NULL);
CREATE TABLE table1Text (ID INT PRIMARY KEY, data VARCHAR(MAX) NOT NULL);
where the ID in table1Text is a foreign key reference to table1. The benefit of this approach is that you can still index table1, and you can provide full-text searching on table1Text, without the two tables ever affecting one another.
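A rough sketch of how the two tables could be tied together and queried; the constraint name is illustrative:
ALTER TABLE table1Text ADD CONSTRAINT FK_table1Text_table1
    FOREIGN KEY (ID) REFERENCES table1 (ID);

-- Fetch the large text by the short, indexed name
SELECT t.ID, t.Name, x.data
FROM table1 t
JOIN table1Text x ON x.ID = t.ID
WHERE t.Name = 'SomeTitle';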
I'd be tempted to use varbinary(max) for this "catch-all" data column you're describing. The varbinary data type can store text (see below), byte arrays and entire files. The size limit is 2GB, but the nice thing about varbinary (as opposed to the deprecated image type) is that you can return/download a portion of the data using a simple substring function. So, if you only wanted the first 100 MB of some data, you'd use:
Declare @data as varbinary(max)
select @data = substring(DataColumn, 1, 100000000) from SomeDataTable where ID = 1
To insert text into varbinary you'd use:
declare @test table (
    data varbinary(max) not null,
    datatype varchar(10) not null
)
--insert varchar
insert into @test (data, datatype)
values (cast('Your Text Here' as varbinary(max)), 'varchar')
-- insert nvarchar
insert into @test (data, datatype) select cast(N'YourTextHere' as varbinary(max)), 'nvarchar'
-- see the results
select data, datatype from @test
select cast(data as varchar(max)) as data_to_varchar, datatype from @test
select cast(data as nvarchar(max)) as data_to_nvarchar, datatype from @test
If I do a query like this
SELECT * from Foo where Bar = '42'
and Bar is an int column. Will that string value be optimized to 42 in the DB engine? Will it have some kind of impact if I leave it as it is instead of changing it to:
Select * from Foo where Bar = 42
This is done on a SQL Compact database if that makes a difference.
I know it's not the correct way to do it, but it's a big pain going through all the code, looking at every query and the DB schema, to see whether the column is an int type or not.
SQL Server automatically converts it to INT, because INT has higher precedence than VARCHAR.
You should also be aware of the impact that implicit conversions can
have on a query’s performance. To demonstrate what I mean, I’ve created and populated the following table in the AdventureWorks2008 database:
USE AdventureWorks2008;
IF OBJECT_ID ('ProductInfo', 'U') IS NOT NULL
DROP TABLE ProductInfo;
CREATE TABLE ProductInfo
(
ProductID NVARCHAR(10) NOT NULL PRIMARY KEY,
ProductName NVARCHAR(50) NOT NULL
);
INSERT INTO ProductInfo
SELECT ProductID, Name
FROM Production.Product;
As you can see, the table includes a primary key configured with the
NVARCHAR data type. Because the ProductID column is the primary key,
it will automatically be configured with a clustered index. Next, I
set the statistics IO to on so I can view information about disk
activity:
SET STATISTICS IO ON;
Then I run the following SELECT statement to retrieve product
information for product 350:
SELECT ProductID, ProductName
FROM ProductInfo
WHERE ProductID = 350;
Because statistics IO is turned on, my results include the following
information:
Table 'ProductInfo'. Scan count 1, logical reads 6, physical reads 0,
read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob
read-ahead reads 0.
Two important items to notice are that the query performed a scan and
that it took six logical reads to retrieve the data. Because my WHERE
clause specified a value in the primary key column as part of the
search condition, I would have expected an index seek to be performed
rather than a scan. The execution plan confirms that the database engine performed a scan rather than a seek, and the scan operator's details (accessed by hovering the mouse over the scan icon) show why.
Notice that in the Predicate section, the CONVERT_IMPLICIT function is
being used to convert the values in the ProductID column in order to
compare them to the value of 350 (represented by @1) that I passed into the
WHERE clause. The reason that the data is being implicitly converted
is because I passed the 350 in as an integer value, not a string
value, so SQL Server is converting all the ProductID values to
integers in order to perform the comparisons.
Because there are relatively few rows in the ProductInfo table,
performance is not much of a consideration in this instance. But if
your table contains millions of rows, you’re talking about a serious
hit on performance. The way to get around this, of course, is to pass
in the 350 argument as a string, as I’ve done in the following
example:
SELECT ProductID, ProductName
FROM ProductInfo
WHERE ProductID = '350';
Once again, the statement returns the product information and the statistics IO data. This time the index is being properly used to locate the record, and if you look at the execution plan you'll see that the values in the ProductID column are no longer being implicitly converted before being compared to the 350 specified in the search condition.
As this example demonstrates, you need to be aware of how performance
can be affected by implicit conversions, just like you need to be
aware of any types of implicit conversions being conducted by the
database engine. For that reason, you’ll often want to explicitly
convert your data so you can control the impact of that conversion.
You can read more about Data Conversion in SQL Server.
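As a rough illustration of that last point (a sketch, not from the quoted article): if the value has to stay an integer in the application, convert it explicitly on the parameter side so the indexed column is left untouched:
DECLARE @ProductID INT = 350;

SELECT ProductID, ProductName
FROM ProductInfo
WHERE ProductID = CAST(@ProductID AS NVARCHAR(10));  -- convert the parameter, not the column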
If you look at the MSDN chart that describes implicit conversions, you will find that a string is implicitly converted into an int.
Both should work in your case, but the norm is to use quotes anyway.
For example, this works:
Select * from Foo where Bar = 42
this does not:
Select * from Foo where Bar LIKE %42%
and this will:
SELECT * from Foo where Bar LIKE '%42%'
P.S.: you should look at Entity Framework and LINQ queries anyway; they make this simpler...
If I am not mistaken, SQL Server will read it as an INT if the string contains only numbers (numeric) and you're comparing it to an INTEGER column, but if the string is alphanumeric, that is when you will encounter an error or get an unexpected result.
My suggestion is: in the WHERE clause, if you are comparing against an integer column, do not put single quotes. That is the best practice to avoid errors and unexpected results.
You should always use parameters when executing SQL from code, to avoid security flaws (e.g. SQL injection).
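A minimal sketch of what a parameterized query looks like on the SQL Server side (roughly what ADO.NET emits when you use SqlParameter); it also sidesteps the int-vs-string question, because the parameter carries its own type:
-- The value is typed and never concatenated into the SQL text
EXEC sp_executesql
    N'SELECT * FROM Foo WHERE Bar = @Bar',
    N'@Bar int',
    @Bar = 42;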
Can I store SOF/2015/01 as my ID, and can I auto-increment the 01 part like a usual primary key?
Can I store SOF/2015/01 as my ID?
Answer: yes, you can.
Can I auto-increment the 01 part like a usual primary key?
Answer: no, you can't.
Auto-increment can only increment numbers.
You have to do that manually.
You can use a function in a trigger to generate your desired auto-incremented number, like this:
create function NextCustomerNumber()
returns char(5)
as
begin
    declare @lastval char(5)
    set @lastval = (select max(customerNumber) from Customers)
    if @lastval is null set @lastval = 'C0001'
    declare @i int
    set @i = right(@lastval,4) + 1
    return 'C' + right('000' + convert(varchar(10),@i),4)
end
This can cause some issues, however:
What if two processes attempt to add a row to the table at the exact
same time? Can you ensure that the same value is not generated for
both processes?
There can be overhead querying the existing data each time you'd like
to insert new data
Unless this is implemented as a trigger, this means that all inserts
to your data must always go through the same stored procedure that
calculates these sequences. This means that bulk imports, or moving
data from production to testing and so on, might not be possible or
might be very inefficient.
If it is implemented as a trigger, will it work for a set-based
multi-row INSERT statement? If so, how efficient will it be? This
function wouldn't work if called for each row in a single set-based
INSERT -- each NextCustomerNumber() returned would be the same value.
You can learn more from this
Create a two-column unique primary key with the string 'SOF/2015' as part one and an auto-incrementing integer as the second column. You can combine the two columns using a function that returns a string to give you the combined key. For syntactic sugar, you can create a view on the table that uses your function to combine the keys into one view column, as sketched below.
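A minimal sketch of that idea, assuming SQL Server; the table name, the fixed 'SOF/2015' prefix and the two-digit padding are illustrative:
CREATE TABLE dbo.Tickets
(
    Prefix VARCHAR(20) NOT NULL DEFAULT 'SOF/2015',
    Num    INT IDENTITY(1, 1) NOT NULL,
    -- combined, human-readable key such as 'SOF/2015/01'; widen the padding as needed
    FullID AS Prefix + '/' + RIGHT('0' + CONVERT(VARCHAR(10), Num), 2),
    CONSTRAINT PK_Tickets PRIMARY KEY (Prefix, Num)
);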
You can certainly use NCHAR or NVARCHAR types as primary keys, on the presumption that any variable-sized columns don't use MAX, and data doesn't exceed the maximum allowable size of your index.
As for using it as an auto-incremented column, that won't work. SQL is very smart, but not quite in that way.
I would suggest pulling that string into two or three separate columns, so that you can store the "01" portion as a separate, IDENTITY'd column. But certainly this is a design question that you'd have to work out on your own.
The other solution would be a trigger, but I'd probably hesitate in general with using something like this as a primary key. Using numeric types is just a lot nicer in many ways, particularly when you have to reference the tables elsewhere. You could always apply a UNIQUE index on the string representation.
I wanted to see what others have experienced when working with types like List<> or Dictionary<> and, in turn, storing and retrieving that data.
Here's an example scenario: users will be creating their own "templates", where each template is essentially a dictionary of index/value pairs, e.g. for user1 the values are (1, Account), (2, Bank), (3, Code), (4, Savings), and for user2 the (unrelated) values could be (1, Name), (2, Grade), (3, Class), and so on. These templates/lists can be of varying length, but they will always have an index and a value. Also, each list/template will have one and only one user linked to it.
What types did you choose on the database side?
Any pain points and/or advice I should be aware of?
As far as the types within the collection go, there is a fairly 1-to-1 mapping between .Net types and SQL types: SQL Server Data Type Mappings. You mostly need to worry about string fields:
Will they always be ASCII values (0 - 255)? Then use VARCHAR. If they might contain non-ASCII / UCS-2 characters, then use NVARCHAR.
What is their likely max length?
Of course, sometimes you might want to use a slightly different numeric type in the database. The main reason would be: if an int was chosen on the app side because it is "easier" (or so I have been told) to deal with than Int16 and byte, but the values will never be above 32,767 or 255, then you should most likely use SMALLINT or TINYINT respectively. The difference between int and byte in terms of memory in the app layer might be minimal, but it does have an impact in terms of physical storage, especially as row counts increase. And if that is not clear, "impact" means slowing down queries and sometimes costing more money when you need to buy more SAN space. But the reason I said to "most likely use SMALLINT or TINYINT" is because, if you have Enterprise Edition and have Row Compression or Page Compression enabled, the values will be stored in the smallest datatype that they fit in.
As far as retrieving the data from the database, that is just a simple SELECT.
As far as storing that data (at least in terms of doing it efficiently), well, that is more interesting :). A nice way to transport a list of fields to SQL Server is to use Table-Valued Parameters (TVPs). These were introduced in SQL Server 2008. I have posted a code sample (C# and T-SQL) in this answer on a very similar question here: Pass Dictionary<string,int> to Stored Procedure T-SQL. There is another TVP example on that question (the accepted answer), but instead of using IEnumerable<SqlDataRecord>, it uses a DataTable which is an unnecessary copy of the collection.
EDIT:
With regards to the recent update of the question that specifies the actual data being persisted, that should be stored in a table similar to:
UserID INT NOT NULL,
TemplateIndex INT NOT NULL,
TemplateValue VARCHAR(100) NOT NULL
The PRIMARY KEY should be (UserID, TemplateIndex) as that is a unique combination. There is no need (at least not with the given information) for an IDENTITY field.
The TemplateIndex and TemplateValue fields would get passed in the TVP as shown in my answer to the question that I linked above. The UserID would be sent by itself as a second SqlParameter. In the stored procedure, you would do something similar to:
INSERT INTO SchemaName.TableName (UserID, TemplateIndex, TemplateValue)
SELECT @UserID,
       tmp.TemplateIndex,
       tmp.TemplateValue
FROM @ImportTable tmp;
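For context, a minimal sketch of the table type and procedure signature that the INSERT above assumes; the names are illustrative, not taken from the linked answer:
-- Table type carried by the TVP
CREATE TYPE dbo.TemplateEntryList AS TABLE
(
    TemplateIndex INT NOT NULL,
    TemplateValue VARCHAR(100) NOT NULL
);
GO

CREATE PROCEDURE dbo.SaveTemplate
    @UserID      INT,
    @ImportTable dbo.TemplateEntryList READONLY  -- TVPs must be declared READONLY
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO SchemaName.TableName (UserID, TemplateIndex, TemplateValue)
    SELECT @UserID, tmp.TemplateIndex, tmp.TemplateValue
    FROM @ImportTable tmp;
END
GO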
And just to have it stated explicitly, unless there is a very specific reason for doing so (which would need to include never, ever needing to use this data in any queries, such that this data is really just a document and no more usable in queries than a PDF or image), then you shouldn't serialize it to any format. Though if you were inclined to do so, XML is a better choice than JSON, at least for SQL Server, as there is built-in support for interacting with XML data in SQL Server but not so much for JSON.
A List or any collection's representation in a database is supposed to be a table. Always think of it as a collection and relate it to what a database offers.
Though you can always serialize a collection, I do not suggest it: when updating or inserting records you would have to rewrite the whole serialized value, whereas with a table you only have to query for the KEY, which in a Dictionary you already have.
I have a SQL Server database designed like this :
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name1 (string)
Name2 (string, can be null)
Name3 (string, can be null)
Name4 (string, can be null)
TableValue
Iteration (int)
IdTableParameter (int, FOREIGN KEY)
Type (string)
Value (decimal)
So, as you've just understood, TableValue is linked to TableParameter.
TableParameter is like a multidimensional dictionary.
TableParameter is supposed to have a lot of rows (more than 300,000 rows)
From my C# client program, I have to fill this database after each Compute() call:
for (int iteration = 0; iteration < 5000; iteration++)
{
Compute();
FillResultsInDatabase();
}
In FillResultsInDatabase() method, I have to :
Check if the label of my parameter already exists in TableParameter. If it doesn't exist, I have to insert a new one.
I have to insert the value in the TableValue
Step 1 takes a long time! I load the whole TableParameter table into an IEnumerable property and then, for each parameter, I do a
.FirstOrDefault( x => x.Name1 == item.Name1 &&
x.Name2 == item.Name2 &&
x.Name3 == item.Name3 &&
x.Name4 == item.Name4 );
in order to detect whether it already exists (and afterwards to get the id).
Performance is very bad like this!
I've tried to do the selection with a WHERE clause in order to avoid loading every row of TableParameter, but performance is even worse!
How can I improve the performance of step 1 ?
For step 2, performance is still bad with a classic INSERT. I am going to try SqlBulkCopy.
How can I improve the performance of step 2 ?
EDITED
I've tried with stored procedures:
CREATE PROCEDURE GetIdParameter
    @Id int OUTPUT,
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
SELECT TOP 1 @Id = Id FROM TableParameter
WHERE
    TableParameter.Name1 = @Name1
    AND (@Name2 IS NULL OR TableParameter.Name2 = @Name2)
    AND (@Name3 IS NULL OR TableParameter.Name3 = @Name3)
GO
CREATE PROCEDURE CreateValue
    @Iteration int,
    @Type nvarchar(50),
    @Value decimal(32, 18),
    @Name1 nvarchar(50) = null,
    @Name2 nvarchar(50) = null,
    @Name3 nvarchar(50) = null
AS
DECLARE @IdParameter int
EXEC GetIdParameter @IdParameter OUTPUT,
    @Name1, @Name2, @Name3
IF @IdParameter IS NULL
BEGIN
    INSERT TableParameter (Name1, Name2, Name3)
    VALUES
    (@Name1, @Name2, @Name3)
    SELECT @IdParameter = SCOPE_IDENTITY()
END
INSERT TableValue (Iteration, IdParameter, Type, Value)
VALUES
(@Iteration, @IdParameter, @Type, @Value)
GO
I still have the same performance... :-( (not acceptable)
If I understand what's happening, you're querying the database to see if the data is there in step 1. I'd use a DB call to a stored procedure that inserts the data if it is not there. So just compute the results and pass them to the SP.
Can you compute the results first, and then insert in batches?
Does the compute function take data from the database? If so, can you turn the operation into a set-based operation and perform it on the server itself? Or maybe part of it?
Remember that SQL Server is designed for large dataset operations.
Edit: reflecting comments
Since the code is slow on the data inserts, and you suspect that it's because the insert has to search back before it can be done, I'd suggest that you may need to place SQL Indexes on the columns that you search on in order to improve searching speed.
However I have another idea.
Why don't you just insert the data without the check and then later when you read the data remove the duplicates in that query?
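A rough sketch of that idea, assuming duplicates would then only differ by Id; the query keeps one row per name combination at read time:
-- Insert freely during Compute(), dedupe when reading
WITH Ranked AS
(
    SELECT Id, Name1, Name2, Name3, Name4,
           ROW_NUMBER() OVER (PARTITION BY Name1, Name2, Name3, Name4
                              ORDER BY Id) AS rn
    FROM TableParameter
)
SELECT Id, Name1, Name2, Name3, Name4
FROM Ranked
WHERE rn = 1;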
Given the fact that Name2 - Name4 can be null, would it be possible to restructure the parameter table:
TableParameter
Id (int, PRIMARY KEY, IDENTITY)
Name (string)
Dimension int
Now you can index it and simplify the query (WHERE Name = 'TheNameIWant' AND Dimension = 2), as sketched below.
(And speaking of indexes, you do have indexes on the name columns in the parameter table?)
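A minimal sketch of that restructured table, assuming one row per name/dimension pair; the names are illustrative:
CREATE TABLE TableParameter
(
    Id        INT IDENTITY(1, 1) PRIMARY KEY,
    Name      NVARCHAR(50) NOT NULL,
    Dimension INT NOT NULL
);

-- Supports the simplified lookup: WHERE Name = 'TheNameIWant' AND Dimension = 2
CREATE NONCLUSTERED INDEX IX_TableParameter_Name_Dimension
    ON TableParameter (Name, Dimension);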
Where do you do your commits on the insert? If you do single-statement commits, group multiple inserts into one transaction.
If you are the only one inserting values, and if speed is really of the essence, load all values from the database into memory and check there.
just some ideas
hth
Mario
I must admit that I'm struggling to grasp the business process that you are trying to achieve here.
On initial review, it appears as if you are performing a data comparison within your application tier. I would advise against this and suggest that you let the database engine do what it is designed to do: manage and implement your data access.
As another poster has mentioned, I concur that you should look to create a Stored Procedure to handle your record insertion logic. The procedure can perform a simple check to see if your records already exist.
You should also consider:
Enforcing the insertion logic/rule by creating a Unique Constraint across the four name columns.
Creating a covering non-clustered index incorporating the four name columns (see the sketch after this list).
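A rough sketch of both suggestions against the table from the question (constraint and index names are illustrative):
-- Enforce the insertion rule: each parameter label may exist only once
ALTER TABLE TableParameter
    ADD CONSTRAINT UQ_TableParameter_Names UNIQUE (Name1, Name2, Name3, Name4);

-- Covering non-clustered index so the existence check can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_TableParameter_Names
    ON TableParameter (Name1, Name2, Name3, Name4);
In practice the unique constraint alone is often enough here, since it is itself backed by an index on the same columns.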
With regard to performance of your inserts, perhaps you can provide some metrics to qualify what it is that you are seeing and how you are measuring it?
To give you a yardstick, the current ETL insertion record for SQL Server is approximately 16 million rows per second. What sort of numbers are you expecting and wanting to see?
The fastest way (that I know of so far) is a bulk insert, but not just a series of single INSERT statements. Try INSERT + SELECT + UNION; it works pretty fast.
insert into myTable
select a1, b1, c1, ...
union select a2, b2, c2, ...
union select a3, b3, c3, ...
We have some code in which we need to maintain our own identity (PK) column in SQL. We have a table in which we bulk insert data, but we add data to related tables before the bulk insert is done, thus we can not use an IDENTITY column and find out the value up front.
The current code selects the MAX value of the field and increments it by 1. Although it is highly unlikely that two instances of our application will be running at the same time, this is still not thread-safe (not to mention that it goes to the database every time).
I am using the ADO.NET entity model. How would I go about 'reserving' a range of IDs to use, grabbing a new block when that range runs out, while guaranteeing that the same range will not be used?
Use a more universal unique identifier data type like UNIQUEIDENTIFIER (UUID) instead of INTEGER. In this case you can basically create it on the client side, pass it to SQL and not have to worry about it. The disadvantage is, of course, the size of this field.
Create a simple table in the database, CREATE TABLE ID_GEN (ID INTEGER IDENTITY), and use it as a factory to give you identifiers. Ideally you would create a stored procedure (or function) to which you pass the number of identifiers you need. The stored procedure then inserts that number of (empty) rows into the ID_GEN table and returns all the new IDs, which you can use in your code (see the sketch below). Obviously, your original tables will not have the IDENTITY anymore.
Create your own variation of the ID factory above.
I would choose simplicity (UUID) if you are not constrained otherwise.
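A minimal sketch of such an ID factory procedure, assuming the ID_GEN table described above; the names and the simple loop are illustrative:
CREATE TABLE ID_GEN (ID INTEGER IDENTITY PRIMARY KEY);
GO

CREATE PROCEDURE ReserveIds
    @Count INT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @NewIds TABLE (ID INT);
    DECLARE @i INT = 0;

    -- Insert @Count empty rows; the IDENTITY column hands out the values
    WHILE @i < @Count
    BEGIN
        INSERT INTO ID_GEN
        OUTPUT inserted.ID INTO @NewIds (ID)
        DEFAULT VALUES;
        SET @i += 1;
    END;

    -- Return the reserved block to the caller
    SELECT ID FROM @NewIds;
END
GO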
If it's viable to change the structure of the table, then perhaps use a uniqueidentifier for the PK instead along with newid() [SQL] or Guid.NewGuid() [C#] in your row generation code.
From Guid.NewGuid() doco:
There is a very low probability that the value of the new Guid is all zeroes or equal to any other Guid.
Why are you using ADO.net Entity Framework to do what sounds like ETL work? (See critique of ADO.NET Entity Framework and ORM in general below. It is rant free).
Why use ints at all? Using a uniqueidentifier would solve the "multiple instances of the application running" issue.
Using a uniqueidentifier as a column default will be slower than using an int IDENTITY... it takes more time to generate a guid than an int. A guid will also be larger (16 byte) than an int (4 bytes). Try this first and if it results in acceptable performance, run with it.
If the delay introduced by generating a guid on each row insert is unacceptable, create guids in bulk (or on another server) and cache them in a table.
Sample TSQL code:
CREATE TABLE testinsert
(
date_generated datetime NOT NULL DEFAULT GETDATE(),
guid uniqueidentifier NOT NULL,
TheValue nvarchar(255) NULL
)
GO
CREATE TABLE guids
(
guid uniqueidentifier NOT NULL DEFAULT newid(),
used bit NOT NULL DEFAULT 0,
date_generated datetime NOT NULL DEFAULT GETDATE(),
date_used datetime NULL
)
GO
CREATE PROCEDURE GetGuid
    @guid uniqueidentifier OUTPUT
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @return int = 0
    BEGIN TRY
        BEGIN TRANSACTION
        SELECT TOP 1 @guid = guid FROM guids WHERE used = 0
        IF @guid IS NOT NULL
            UPDATE guids
            SET
                used = 1,
                date_used = GETDATE()
            WHERE guid = @guid
        ELSE
        BEGIN
            SET @return = -1
            PRINT 'GetGuid Error: No Unused guids are available'
        END
        COMMIT TRANSACTION
    END TRY
    BEGIN CATCH
        SET @return = ERROR_NUMBER() -- some error occurred
        SET @guid = NULL
        PRINT 'GetGuid Error: ' + CAST(ERROR_NUMBER() as varchar) + CHAR(13) + CHAR(10) + ERROR_MESSAGE()
        ROLLBACK
    END CATCH
    RETURN @return
END
GO
CREATE PROCEDURE InsertIntoTestInsert
    @TheValue nvarchar(255)
AS
BEGIN
    SET NOCOUNT ON
    DECLARE @return int = 0
    DECLARE @guid uniqueidentifier
    DECLARE @getguid_return int
    EXEC @getguid_return = GetGuid @guid OUTPUT
    IF @getguid_return = 0
    BEGIN
        INSERT INTO testinsert(guid, TheValue) VALUES (@guid, @TheValue)
    END
    ELSE
        SET @return = -1
    RETURN @return
END
GO
-- generate the guids
INSERT INTO guids(used) VALUES (0)
INSERT INTO guids(used) VALUES (0)
--Insert data through the stored proc
EXEC InsertIntoTestInsert N'Foo 1'
EXEC InsertIntoTestInsert N'Foo 2'
EXEC InsertIntoTestInsert N'Foo 3' -- will fail, only two guids were created
-- look at the inserted data
SELECT * FROM testinsert
-- look at the guids table
SELECT * FROM guids
The fun question is... how do you map this to ADO.Net's Entity Framework?
This is a classic problem that started in the early days of ORM (Object Relational Mapping).
If you use relational-database best practices (never allow direct access to base tables, only allow data manipulation through views and stored procedures), then you add headcount (someone capable and willing to write not only the database schema, but also all the views and stored procedures that form the API) and introduce delay (the time to actually write this stuff) to the project.
So everyone cuts this and people write queries directly against a normalized database, which they don't understand... thus the need for ORM, in this case, the ADO.NET Entity Framework.
ORM scares the heck out of me. I've seen ORM tools generate horribly inefficient queries which bring otherwise performant database servers to their knees. What was gained in programmer productivity was lost in end-user waiting and DBA frustration.
The Hi/Lo algorithm may be of interest to you:
What's the Hi/Lo algorithm?
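For illustration only (not part of the linked answer), a minimal sketch of the database side of Hi/Lo, assuming a single-row HiValue table: each client atomically claims a "hi" block and then generates ids in memory as hi * blockSize + lo:
CREATE TABLE HiValue (NextHi BIGINT NOT NULL);
INSERT INTO HiValue (NextHi) VALUES (1);
GO

-- Atomically claim the next "hi" block for this client
CREATE PROCEDURE GetNextHi
    @Hi BIGINT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE HiValue
    SET @Hi = NextHi,        -- capture the current value
        NextHi = NextHi + 1; -- and advance it in the same statement
END
GO
The client then uses ids @Hi * blockSize + 0 through @Hi * blockSize + (blockSize - 1) without touching the database again.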
Two clients could reserve the same block of id's.
There is no solution short of serializing your inserts by locking.
See Locking Hints in MSDN.
If you have a lot of child tables you might not want to change the PK. Plus, the integer fields are likely to perform better in joins. But you could still add a GUID field and populate it in the bulk insert with pre-generated values. Then you could leave the identity insert alone (it is almost always a bad idea to turn it off) and use the GUID values you pre-generated to get back the identity values you just inserted, for the insert into the child tables.
If you use a regular set-based insert (one with the select clause instead of the values clause) instead of a bulk insert, you could use the output clause to get the identities back for the rows if you are using SQL Server 2008.
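A rough sketch of that OUTPUT-clause approach; the table and column names are illustrative:
DECLARE @NewIds TABLE (Id INT, ClientGuid UNIQUEIDENTIFIER);

-- Set-based insert from a staging table; OUTPUT maps each new identity
-- back to the pre-generated GUID so the child rows can be wired up afterwards
INSERT INTO dbo.ParentTable (ClientGuid, SomeValue)
OUTPUT inserted.Id, inserted.ClientGuid INTO @NewIds (Id, ClientGuid)
SELECT s.ClientGuid, s.SomeValue
FROM dbo.StagingTable s;

SELECT Id, ClientGuid FROM @NewIds;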
The most general solution is to generate client identifiers that never collide with database identifiers - usually negative values - and then replace them with the identifiers generated by the database on insert.
This approach is safe to use in an application where many users insert data simultaneously. Any other approach, except GUIDs, is not multiuser-safe.
But if you have that rare case in which an entity's primary key must be known before the entity is saved to the database, and it is impossible to use a GUID, you may use an identifier generation algorithm that prevents identifier overlap.
The simplest is to assign a unique identifier prefix to each connected client and prepend it to each identifier generated by that client.
If you are using ADO.NET Entity Framework, you probably should not worry about identifier generation: EF generates identifiers by itself; just mark the primary key of the entity as IsDbGenerated=true.
Strictly speaking, Entity Framework, like other ORMs, does not require an identifier for objects that are not yet saved to the database; an object reference is enough to operate correctly with new entities. The actual primary key value is required only when updating/deleting an entity, or when updating/deleting/inserting an entity that references the new entity, i.e. in cases where the actual primary key value is about to be written to the database. If an entity is new, it is impossible to save other entities that reference it until the new entity has been saved, and ORMs maintain a specific order of saving entities that takes the reference map into account.