Incremental ETL on code first many-to-many association table

Incremental ETL on code first many-to-many association table - c#

I'm setting up a data warehouse (in SQL Server) together with our engineers we got almost everything up and running. Our main application also uses SQL Server as backend, and aims to be code first while using the entity framework. In most tables we added a column like updatedAt to allow for incremental loading to our data warehouse, but there is a many-to-many association table created by the entity framework which we cannot modify. The table consists of two GUID columns with a composite key, so they are not iterable like an incrementing integer or dates. We are now basically figuring out the options on how to enable incremental load on this table, but there is little information to be found.
After searching for a while I mostly came across posts which explained how it's not possible to manually add columns (such as updatedAt) to the association table, such as here Create code first, many to many, with additional fields in association table. Suggestions are to split out the table into two one-to-many tables. We would like to prevent this if possible.
Another potential option would be to turn on change data capture on the server, but that would potentially defeat the purpose of code first in the application.
Another thought was to add a column in the database itself, not in code, with a default value of the current datetime. But that might also be impossible / non compatible with the entity framework, as well as defeating the code first principle.
Are we missing anything? Are there other solutions for this? The ideal solution would be a code first solution, or a solution in the ETL process without affecting the base application, without changing too much. Any suggestions are appreciated.

Related

Potential conflict between transferring data and '.ValueGeneratedOnAdd()'

I apologize if this is duplicative; I could find nothing directly pertaining.
The difficulty involves EF Core (v 3.1.8, if it matters), but is not specific or restricted thereto. I am doing code first, creating a number of entities, but the key point is that I am getting my initial data set from an app that I am trying to replace. My new app has a number of structural differences in every corresponding entity, but the data in the old app is still critical, so I will be transferring it to my new database. (Old db is hosted by MS SQL 2008; new db is hosted by MS SQL 2019, if it matters).
Most of the key fields are GUIDs, and the problem is that in EF Core, at the point in the future when I want to use the new app to do more data entry, I will also want the database to choose the GUID. In EF Core Fluent API parlance, that would be, for example:
modelBuilder.Entity("ReplaceOldApp.Models.Address", b =>
{
b.Property<Guid>("AddressID")
.ValueGeneratedOnAdd()
.HasColumnType("uniqueidentifier");
}
However, if I inform EF Core that I want the database to create the key, then it will create the tables such that when I try to transfer the data from the old database (whether using EF or some other means), the new database will ignore the old GUID and create a new, unrelated one. (Or at least, that's what I think will happen. I'm not ready to try it yet.) If that happens, then all of the data from, say, the old Person entity (such as the above-implied Address entity), will no longer be related between their corresponding entities in the new database, because all records will have shiny new GUIDs. I will have all the information, and no way to actually use it.
Obviously I can tell EF Core to inform the database that it will not be creating the GUIDs, and I can then read, unmunge and transfer the data from the old database to the new without fear of data loss (God willing). But then going forward, for any new data entry, the GUIDs will not be automatically genned. I can of course then mod my IEntityTypeConfiguration Fluent API classes for the various entities and do a second migration, re-genning the affected tables, but I'm worried that EF Core will decide that it needs to DROP the tables to accommodate such a change. (Again, I do not know for sure because I have not tried it: sorry.)
So my question is: How would you approach such a situation? Should I ignore EF and do something clever with MS SQL Studio? Should I do two migrations with a transfer in-between? Should I tell the database, even though it has been told to gen the keys, somehow to accept the old keys without changing things, perhaps via LINQ?
============== Edit:
I'm sure SSIS would work to transfer the data from old to new databases, but the learning curve appears daunting, and I am only trying to solve one problem, not gain a new career. Powershell ditto, although it may be a bit more of a hacker's tool, and as such knowledge of it might assist tweaking or help to solve a diverse set of one-time SQL Server headaches. However, again, as would you, I prefer to use what I know, or failing that, learn or learn more about a tool which promises to serve me consistently into the future.
With the very welcome new (to me) information about IDENTITY_INSERT, and information gained from Linq To Sql and identity_insert, I believe I should not use LINQ to SQL because it may assume that IDENTITY_INSERT is OFF and simply filter out the crucial GUID, failing therefore to provide it to the target server. Rather, it seems I can use C# to produce a series of generated SQL statements, and then run each one on the target server inside a TransactionScope(). Because each such insert will thereby run 'in the same connection', the state of IDENTITY_INSERT will be preserved for that entire insert transaction, and (creek don't rise) it should work.
Again, I appreciate your answer, Randy in Marin. It has, it seems, led me to an approach that will work within the potential constraints of my context (EF Core), while allowing me to preserve the crucial existing IDENTITY information. Peace.

Not being an EF programmer, I don't know if there is an option for identity insert that you can enable for a migration. You might search the term to see if it comes up.
Our team support database migrations. We can do it a number of ways. I would not even consider EF because it's not designed for data migrations - or for database design. (And because we tend to use what we know.)
This is not the way I would do it, but it might be better than SSIS if you have not used SSIS. If the tables are in the same database or in databases on the same server, you can use T-SQL to load each table one at a time. Even if not on the same server, a linked server would allow a distributed transaction. (I avoid linked servers like the plague, but for a one time thing like a migration I would tolerate it. I would rather restore a copy of the source database to the destination server to use as a source. Distributed transactions gone wrong have forced me to reboot critical servers.)
Each table can have a 4 part name. If the server part (e.g., using a linked server name) is not present, the local instance is used. If the database part is not present, the current database is used. This is the format I assume for the "src_table" and "dst_table".
[myserver\myinstance].[mydatabase].[myschema].[mytable]
Each table is loaded with T-SQL as follows:
TRUNCATE TABLE dst_table
SET IDENTITY_INSERT dst_table ON
INSERT dst_table (...) SELECT ... FROM src_table
SET IDENTITY_INSERT dst_table OFF -- must be turned off - only 1 table can have this ON
If there are foreign keys, some tables (e.g., def tables) would need to be loaded first.
If the table does not have an IDENTITY column (EF code creates all values), you don't use the IDENTITY_INSERT stuff. It will fail if you use it and there is not an identity column. It will fail if you don't use it and try to insert into an identity column.
If there is a lot of data in a table, the transaction might be too big or slow. Inserting in batches might be called for.
If it was something to run on a schedule, I would likely create a SSIS package to do the load.
If I wanted to try something new, I would use powershell and the DBATools module cmdlets to see if extracting to csv and importing the csv would be efficient. The import cmdlet has a column mapping parameter, among many others. PowerShell could be used to do transformation, but I think this crosses over into SSIS territory.
I have dealt with migrations where the GUIDs and IDs no longer related after the move. Using queries joining the new data to the old data, we were able to fix the related values. It's likely more work to fix it after than to plan for it to be correct from the start.

What is the best pattern for persisting entity order in a relational database?

Environment: ASP.NET MVC application with EF (Database-first with control over the db design) and Sql Server.
I have a lot of entities that have user-generated properties. I need a way for the user to be able to specify the order of these elements.
Slightly more detail: The user can create a "template" that they can then add "properties" to. These "properties" need to be ordered. There is also >= 4 other different entity types that also require user-specified ordering.
I have no problem with the front-end aspect of this, but I want to know what is the best way to persist the element order in sql server.
The obvious solution to me is to give each entity a column "Order" (or another non-keyword name) and, upon a reordering (eg. moving element #4 to #2) update all affected entities.
Is this the best way to solve this problem?

This doesn't sound like a small project and you might have various other dynamic customizations on top of the SortOrder property.
Adding an SortOrder column to your entity table is certainly not a bad approach, but this approach might clog up your data with information that doesn't necessarily belong to that entity (especially if multiple users can customize the same instances).
So I've got an alternative idea for you:
Add a CustomizationNode table (or something similar) to your database
Here you store SortOrder and potentially other kinds of metadata and user customizations which are not necessarily part of the conceptual entity.
Then, should you need to add/change/remove any customization info, you'll only need to do so in one table rather than several. And you don't need to migrate your entities whenever you change the customization capabilities.
Depending on your situation, you can link them in one of several ways:
1. Add a single column CustomizationNodeId to each entity table
This pertains to having a single customization per entity instance and is the simplest solution.
Also one customization could be shared across multiple entities of the same type (or even different types, though that probably doesn't make much sense)
2. Add multiple columns EntityXId, EntityYId to the CustomizationNode table.
In principle, only one of these ID fields would be filled and the others will be empty. Can seem a bit "off", but is not necessarily a wrong way to do it.
While you lose the ability to share a customization across multiple entities of the same type, you gain the ability to have multiple customizations per entity and additional FK's such as the UserID. This would allow you to have a per-user customization.
3. Add a link table between each EntityX and CustomizationNode
This is the most complex but also the most versatile solution. It embodies adding a table with a FK to each table you wish to link.
One important benefit you gain is additional decoupling. Customizations and entities don't know of eachother's existence and change without impacting one another.
Furthermore you can add additional metadata to those link tables, such that you can have things like versioning on top of everything mentioned above in 1 and 2.
The bottom line is, if your application is highly dynamic and customizeable then you might want to store "metadata" separately from your actual "data".

Entity Framework AddObject to only "INSERT INTO" certain columns

We have a system that will use the same code to communicate with different client databases. These databases will use the same EF Model, but different connection strings.
Our problem is, not every site will be using the same version of our database structure; some might be missing a few columns or contain a few old columns.
If we upgrade the system to the current version, now the database model now has an extra EmergencyContact column. All older databases will now fail, because EF is trying to insert into this column (even though we have not set a value for this property).
Is there a way of telling EF to only use columns for which we have a value for, when it generates the INSERT INTO query?

EF will be fine if your schema has missing columns that are in the real database, but it will not work if you have columns in the schema that are not in the database, and there is no way to fix that.
Your only choice is to use different schemas for different databases, and write code that manages them (ie, only instantiates the version of the context you need).

In the case where your model does not match your database schema, EF will only insert/update the columns in the model. However, if the unknown columns are not null, EF will throw an exception. Also, if you created relational constraints on the unknown columns, of course those will not be created as they are not yet known.

If the persistence layer per site is the only part that changes then I would extract your EF model into it's own version e.g.
DbV1.dll
DbV2.dll
You could then load in the appropriate DLL based on some setting from the client i.e. you could pass information as a custom header e.g.
db-version: 1
There are other more reliable ways, however, I don't know what your current setup is like so it's difficult to answer.

Custom Fields in .Net and SQL Server

We have a requirement on our project for custom fields. We have some standard fields on the table and each customer wants to be able to add their own custom fields. At the moment I am not interested in how this will work in the UI, but I want to know what the options are for the back end storage and retrieval of the data. The last time I did something like this was about 10 years ago in VB6 so I would be interested to know what the options are for this problem in today's .Net world.
The project is using SQL server for the backend, linq-to-sql for the ORM and a C# asp.net front end.
What are my options for this?
Thanks

There are four main options here:
actually change the schema (DDL) at runtime - however, pretty much no ORM will like that, and generally has security problems as your "app" account shouldn't normally be redefining the database; it does, however, avoid the "inner platform" effect inherent in the next two
use a key-value store as rows, i.e. a Customer table might have a CustomerValues table with pairs like "dfeeNumber"=12345 (one row per custom key/value pair) - but a pain to work with (instead of a "get", this is a "get" and a "list" per entity)
use a single hunk of data (xml, json, etc) in a CustomFields single cell - again, not ideal to work with, but it easier to store atomically with the main record (downside: forces you to load all the custom fields to read a single one)
use a document database (no schema at all) - but then: no ORM
I've used all 4 at different points. All 4 can work. YMMV.

I have a similar situation on the project I'm working on now.
Forget about linq-to-sql when you are having a flexible database schema. There is no way to update the linq-to-sql models on the fly when the DB schema changes.
Solutions:
Keep an extra table with the table name the values belong to , column name , value etc
Totally dynamically change your table schema each time they add a field.
Use a NOSQL solution like mongoDB or the Azure Table Storage. A NOSQL solution doesn't require a schema and can be changed on the fly.
This is a handy link 2 read:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056

You're referring to an EAV model (entity-attribute-value).
Here's an article: http://hanssens.org/post/Generic-Entity-Attribute-Value-Model-e28093-A-POCO-Implementation.aspx

Using NHibernate with ancient database with some "dynamic" tables

I have a legacy database with a pretty evil design that I need to write some applications for. I am not allowed to touch the database design at all, seeing how this is a fragile old system held together by spit and prayers. I am of course very aware that this is not how the database should have been designed in the first place, but real life some times gets in the way..
For my new application I am using NHibernate (with Fluent for mappings and NHibernate LINQ for querying) and trying to Do Things Right. So there is IoC and repositories and more interfaces than I can count. However, the DB structure is giving me some headaches.
The system is very much focused around the concept of customers, and each customer lives in a campaign. These campaigns are created by one of the old applications. Each campaign in the system is defined in a table called CampaignSettings. One of the columns of this table is simply a text column called "Table", which refers to a database table that is created at the same time as the campaign entry in CampaignSettings. The name of this table is related to the name of the campaign, which can pretty much be anything the customer wants (within the constraints given by SQL Server (2000 or 2005)). In these tables the customers live.
So that is challenge #1 - I won't know the table names until runtime. And it will change from site to site - no static mapping I guess.
To make it even worse, we have challenge #2 - this campaign table is also dynamic in structure, meaning it has a certain number of columns that are always there (customer id, name, phone number, email address and other housekeeping stuff), and then there are two other sets of columns, added depending on the requirements of the customer on a case-by-case basis.
The old applications use SQL to get the column names present in the table, then add the ones it doesn't know about as "custom fields" in the application. I need to handle this.
I know I probably can't handle these challenges simply by using mapping magic, and I am prepared to do some ugly SQL in addition to the ORM goodness that I get from NHibernate (there are 20-some "static" tables in here as well which NHibernate handles beautifully) - but how?
I will create a Customer entity that I guess I can populate manually by doing direct SQL like
SELECT * FROM SomeCampaignTable WHERE id=<?>
and then going through the columns one by one and putting stuff where it belongs. Not fun, but necessary.
And then I guess to discover the structure of the table in the first place, I could run SQL like this:
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'SomeCampaignTable'
ORDER BY ORDINAL_POSITION
And again do some manual work to configure my object to handle the custom fields.
My question is simply - how can I do this in NHibernate? Is it a simple matter of finding a way to run my own SQL, then looping through the results, or is there a more elegant way to take the pain out of it?
While I appreciate that this database design belongs in some kind of Museum of Torture somewhere, answers like "Add some views" or "Change the DB" won't help me - I will be shot if I suggest something like that.
Thanks for anything that could help save my sanity here!

You might be able to use NHibernate using Native SQL Entity Queries. Forget Linq2NH - not that I would recommend Linq2NH for any serious application.
Check this page.
13.1.2. Entity queries
https://www.hibernate.org/hib_docs/nhibernate/1.2/reference/en/html/querysql.html
You could maybe do something like this:
Map your entities based on a 'fake' table to keep NHibernate happy when it compiles the mapping documents (I know you said you can't change the DB, but hopefully ok to make an empty table to keep NH happy).
Then run a query like this, as per 13.1.2 above:
sess.CreateSQLQuery("SELECT tempColumn1 as mappingFileColumn1, tempColumn2 as mappingFileColumn2, tempColumn3 as mappingFileColumn3 FROM tempTableName").AddEntity(typeof(Cat));
NHibernate should stitch together the columns you've returned with the mapped entity and give you the entity of type 'Cat' with all the properties populated. I am speculating here though, I do not know for sure if this will work, its the only way I can think of to use NHibernate for this given you don't know the tables/columns at compile time. You definitely cannot use HQL, Criteria, Linq2NH since you don't know the tables and columns at compile time, and HQL et al all convert your mappings to the mapped column names to produce the underlying SQL. Native SQL Queries are the only way I think.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.