EF and Linq with self referential table

EF and Linq with self referential table - c#

I have a self-referential table in my database that looks sort of like above. Basically its setup in such a way that each row has a unique ID (identity PK) and a DependentID to indicate any other record in the set that it is dependent on. It is very similar to the parent-child type examples you often see in SQL textbooks but my case is subtly unique in the sense that a given record can also be dependent upon itself (see row 1 above)
Two questions:
Can EF be made to represent this relationship properly? I've read several posts on here that suggest that it does not deal with this scenario gracefully so my initial thought was that it might not even be worth it, I might be better off just treating it as a normal table and writing the business logic to ensure the data gets inserted/updated correctly. In my scenario, I won't ever be querying these entities thru EF really, the app will basically load them all at startup and then I'll run linq queries against them at runtime to filter as needed
Assuming I cannot get it to work with EF and as I note in #1 I simply load em all up into memory at startup (there are only going to be 50-100 or so), what would be the most efficient way to join on this via linq? I would want to be able to pass in a DependentId and get all the records associated with it and their properties...so in this example I'd want to pass in '1' and get back:
1 - John - 10
2 - Mike - 25
3 - Bob - 5
thanks for the help

Indeed, the entity framework cannot represent such a relationship, certainly not in in a recursively queryable form.
But you are not asking for recursive queries, so you could treat DependentId as just another data column. Doing that, it would be trivial to build and execute your question-two query against the database.
UPDATE:
That query would look something like
int dependentIdToSearch = 1;
var q = from something in db.mytable
where something.DependentId == dependentIdToSearch
select new { something.Id, something.Name, something.Value };
END UPDATE
If you do need recursive queries (all direct and indirect dependencies of), you need a table valued function with a common table expression. The entity framework cannot deal with that either, at least not in the current version. If you need this support, you can wait for EF 5 or use Linq to SQL (which had support for table valued functions since the first version years ago).
You can indeed also read the entire table in memory, provided that it is read-only, or that there is only "one memory" (single server, not load-balanced or client app with local database).
If it's read-only, you have the option to build an object graph once at load time, enabling efficient execution later. For example, you could define a class with a collection of objects that are dependent on each object. Your query then becomes a trivial iteration over that collection.

Related

Working with a large amount of data

I got c# appliaction and entity famework as ORM.
I got database with table Images. Table have Id, TimeStamp, Data columns.
This table can have really ALOT entities. Also Datacolumn contain large byte array.
I need to take first entity starting from some date, or first 5 as example.
var result = Images.OrderBy(img => img.TimeStamp).FirstOrDefault(img => img.TimeStamp > someDate);
throws out of memory exception.
Is there some way to pass that?
Should i use stored procedure or something else?

If Images is already a queried object, then when you OrderBy it, it accesses the whole set. I'll assume it isn't, and it is directly your DbSet or an EF IQueryable (so you are querying using Linq-To-Entities and not Linq-To-Objects and the ordering is done on the query to the database, and not on the returned whole set).
Unless you need change tracking detection, use AsNoTracking on your DbSet (in this case, Context.Images.AsNoTracking().OrderBy(...). That should lower the memory requirements by a lot (change tracking detection requires more than twice the memory).
Also, if using large blob data, it might be wise to store it in its own table (with just an id and the data) and access it only when you need it (having a reference to this id on the table/entity where you are doing your operations) if you are using an ORM and want to work with the original entity all the time (you could also use a Select to project the query on a new entity without the blob field).
If you need to access the image data for the returned rows all the time, and there's not enough memory in the system for it, then tough luck.

Retrieve just some columns using an ORM

I'm using Entity Framework and SQL Server 2008 with the Database First approach.
My problem is :
I have some tables that holds many many columns (~100), and when I try to retrieve a lot of rows it takes a significant time before it returns the results, even if sometimes I need to use just 3 or 4 columns from that table.
I passed half a day in Stackoverflow trying to find a way to solve this problem, and I came up with two solutions :
Using stored procedures to retrieve data with the columns I want.
Edit the .edmx (xml) and the .cs files to remove the columns that I won't use.
My problem again is :
If I use stored procedures to retrieve the data with the columns that I want, Entity Framework loose it benefit and I can use ADO.NET instead of it and call directly the stored procedures ...
I can't take the second solution, because every time I make a change in the database, I'm obliged to regenerate the .edmx file and I loose the changes I made before :'(
Is there a way to do this somehow in Entity Framework ? Is that possible !
I know that other ORMs exist like NHibernate or Dapper, but I don't know if they can offer this feature without causing a lot of pain.

You don't have to return every column each time. You can specify which columns you need.
var query = from t in db.Table
select new { t.Column1, t.Column2, t.Column3 };

Normally if you project the data into a different poco it will do this automatically in EF / L2S etc:
var slim = from row in db.Customers
select new CustomerViewModel {
Name = row.Name, Id = row.Id };
I would expect that to only read 2 columns.
For tools like dapper: since you control the SQL, only specify columns you want - don't use *

You can create a second project with a code-first DbContext, POCO's and maps that return the subset of columns that you require.
This is a case of cut and paste code but it will get you what you need.
You can just create classes and project the data into them but I'm not sure you can make updates using this method. You can use anonymous types within a single method but you'll need actual classes to pass around between methods.
Another option would be to move to a code first development.

High performance Custom user fields

looking for examples/tutorial for custom user fields, not via EAV
EAV is going to be problematic for various reasons such as performance
there are many base entities/tables with over 100000 records each
there will likely be over a dozen attributes
the records are to be displayed in a flat ui grid incl. custom fields so flattening them would be an issue while maintaining performance
Looking at enabling this via DDL where all custom fields would go into a matching table such as
<tablename>_custom_<userid>
and all user attributes would map to a column each and all their metadata stored in a metadata table
the retrieval would be simpler where the query would simply be
select *
from <tablename> A, tableName_custom_userid B
where B.KeyField = A.KeyField --( perhaps using outer join, haven't gone that far yet )
Wondering if there are any gotchas down the road that i need to be aware of ?
of course any samples/pointers would be helpful to kickstart the effort
specifically would appreciate any advice on using DDL for Sql Server compact 4

One technique I have seen used is to use a sort of 'hard-coded' EAV pattern. Don't hang up! It worked well with the dataset sizes you were talking about and didn't actually use EAV - it was only EAV-esque.
The idea is to have a set of tables to store these custom attributes within it, with some triggers (described below) on them. The custom attributes tablesets store metadata about the attribute (what table it goes with, data type, constraints, etc). You can get very fancy with this but I did not haev the need.
The triggers on your meta-tables are there to re-generate views that rollup base+extension into first class objects within the DB. So instead of table person + employee extension table, you have an employee view that includes both. When you drop a new value into the custom attributes tables, the triggers will re-roll the views and include the new stuff. If you wanted to go nuts, you could also have the triggers re-write stored procedures as well. Depending on how your mid-tier code is structured, you would still be forced to re-code some, however this would be the case anyway should you be applying rules that read the data.
In testing, I found that for the relatively small # of records you're talking about, performance was somewhat slower but followed roughly the same pattern of degradation (2x the number of records, ~2x as slow).
-- edits --
How I saw it done, you had a table that represented your first class objects, so a row for 'person' and a row for 'employee,' etc. We'll call that FCO. Then you had a secondary table that stored what tables represented the FCO. We'll call that Srcs.. For person, there would be one row, which is the person table. For Employee, there would be two rows, the person table and the Employee extension. There is a third table, called Attribs, which stores the columns from the tables that constitute the FCO. For simplicity, we'll say Employee has ID, Name and Address, and Employee has Hire Date and Department, and obviously PersonID referring back to Person table. So, 2 rows in FCO table (person and employee), 3 rows in Src table, 8 rows in Attribs.
The view, we'll call it vw_Employee, selects PersonID, Name, Address, Hire Date, Department from the two tables. It is built by a SQL stored procedure we'll call OnMetadataChange.
This SP is fired (by trigger or batch process), and its purpose is to generate the CREATE VIEW statements. It will iterate through every First Class Object, collect which fields from which tables constitute the view, and will issue a CREATE statement based on that. So OnMetadataChange produces a DROP and CREATE for each view, it generates a dynamic SQL statement that is executed once per entry in FCO table. It is preferable to do this with Triggers but not necessary. Hopefully your FCO definitions won't change too often, and when they do, there will probably be a code release as well. You can run your OnMetadataChange SP at that time.
The end result is a 2-layer database. The views constitute the First Class Object layer, which is meaningful to the application. The application only uses views. The tables constitute the 'physical' layer, which the application shouldn't care about. The meta-tables are essentially your mapping between the FCO layer and the physical layer. It takes some time to set it up, but it's quite effective, and gives you many of the benefits of EAV, while at the same time giving you the concrete benefits of 3nf tables (indexability, etc).
If you'd like I can throw some sample SQL out there.

Part of the problem you are having is that you are trying to store schema-less data in a SQL database, which is not its strength. There are three approaches that would make your life far easier:
1) Have a column which stores the serialized custom fields, with whatever format is mst convenient. For example, this column could store xml. Upsides are that you can use SQL Server Compact and pulling back a record is trivial. Downsides are that you always have to pull/push the entire xml blob to do an update, and it is difficult to impossible to query on any custom fields.
2) Upgrade to SQL Server Express, and use XML columns. This is nearly the same as the first suggestion, except that any server ready version of SQL Server has native support for XML data. These columns can have indexes added and fields within the data can be used in queries.
3) Use a Schema-less Database, like MongoDB or CouchDB. These databases are all about storing schemaless data, so your custom fields will be no different than any other field. As such, you can index and query custom fields. Upsides are that custom data is incredibly easy to work with, downsides are that you would have to spend some time rethinking how you store data to fit within their model.
If you do not need to query based on custom fields, or if you can query custom fields within business logic, then the first option can work for you. In any other case, I would err towards something with more capabilities than compact. If cost is the deciding factor, both SQL Server Express and MongoDB are free.

dynamic data model

I have a project that requires user-defined attributes for a particular object at runtime (Lets say a person object in this example). The project will have many different users (1000 +), each defining their own unique attributes for their own sets of 'Person' objects.
(Eg - user #1 will have a set of defined attributes, which will apply to all person objects 'owned' by this user. Mutliply this by 1000 users, and that's the bottom line minimum number of users the app will work with.) These attributes will be used to query the people object and return results.
I think these are the possible approaches I can use. I will be using C# (and any version of .NET 3.5 or 4), and have a free reign re: what to use for a datastore. (I have mysql and mssql available, although have the freedom to use any software, as long as it will fit the bill)
Have I missed anything, or made any incorrect assumptions in my assessment?
Out of these choices - what solution would you go for?
Hybrid EAV object model. (Define the database using normal relational model, and have a 'property bag' table for the Person table).
Downsides: many joins per / query. Poor performance. Can hit a limit of the number of joins / tables used in a query.
I've knocked up a quick sample, that has a Subsonic 2.x 'esqe interface:
Select().From().Where ... etc
Which generates the correct joins, then filters + pivots the returned data in c#, to return a datatable configured with the correctly typed data-set.
I have yet to load test this solution. It's based on the EA advice in this Microsoft whitepaper:
SQL Server 2008 RTM Documents Best Practices for Semantic Data Modeling for Performance and Scalability
Allow the user to dynamically create / alter the object's table at run-time. This solution is what I believe NHibernate does in the background when using dynamic properties, as discussed where
http://bartreyserhove.blogspot.com/2008/02/dynamic-domain-mode-using-nhibernate.html
Downsides:
As the system grows, the number of columns defined will get very large, and may hit the max number of columns. If there are 1000 users, each with 10 distinct attributes for their 'Person' objects, then we'd need a table holding 10k columns. Not scalable in this scenario.
I guess I could allow a person attribute table per user, but if there are 1000 users to start, that's 1000 tables plus the other 10 odd in the app.
I'm unsure if this would be scalable - but it doesn't seem so. Someone please correct me if I an incorrect!
Use a NoSQL datastore, such as CouchDb / MongoDb
From what I have read, these aren't yet proven in large scale apps, based on strings, and are very early in development phase. IF I am incorrect in this assessment, can someone let me know?
http://www.eflorenzano.com/blog/post/why-couchdb-sucks/
Using XML column in the people table to store attributes
Drawbacks - no indexing on querying, so every column would need to be retrieved and queried to return a resultset, resulting in poor query performance.
Serializing an object graph to the database.
Drawbacks - no indexing on querying, so every column would need to be retrieved and queried to return a resultset, resulting in poor query performance.
C# bindings for berkelyDB
From what I read here: http://www.dinosaurtech.com/2009/berkeley-db-c-bindings/
Berkeley Db has definitely proven to be useful, but as Robert pointed out – there is no easy interface. Your entire wOO wrapper has to be hand coded, and all of your indices are hand maintained. It is much more difficult than SQL / linq-to-sql, but that’s the price you pay for ridiculous speed.
Seems a large overhead - however if anyone can provide a link to a tutorial on how to maintain the indices in C# - it could be a goer.
SQL / RDF hybrid.
Odd I didn't think of this before. Similar to option 1, but instead of an "property bag" table, just XREF to a RDF store?
Querying would them involve 2 steps - query the RDF store for people hitting the correct attributes, to return the person object(s), and use the ID's for these person object in the SQL query to return the relational data. Extra overhead, but could be a goer.

The ESENT database engine on Windows is used heavily for this kind of semi-structured data. One example is Microsoft Exchange which, like your application, has thousands of users where each user can define their own set of properties (MAPI named properties). Exchange uses a slightly modified version of ESENT.
ESENT has a lot of features that enable applications with large meta-data requirements: each ESENT table can have about ~32K columns defined; tables, indexes and columns can be added at runtime; sparse columns don't take up any record space when not set; and template tables can reduce the space used by the meta-data itself. It is common for large applications to have thousands of tables/indexes.
In this case you can have one table per user and create the per-user columns in the table, creating indexes on any columns that you want to query. That would be similar to the way that some versions of Exchange store their data. The downside of this approach is that ESENT doesn't have a query engine so you will have to hand-craft your queries as MakeKey/Seek/MoveNext calls.
A managed wrapper for ESENT is here:
http://managedesent.codeplex.com/

In a EAV model you don't have to have many joins, as you can just have the joins you need for the query filtering. For the resultset, return property entries as a separate rowset.
That is what we are doing in our EAV implementation.
For example, a query might return persons with extended property 'Age' > 18:
Properties table:
1 Age
2 NickName
First resultset:
PersonID Name
1 John
2 Mary
second resultset:
PersonID PropertyID Value
1 1 24
1 2 'Neo'
2 1 32
2 2 'Pocahontas'
For the first resultset, you need an inner join for the 'age' extended property
to query the basic Person object entity part:
select p.ID, p.Name from Persons p
join PersonExtendedProperties pp
on p.ID = pp.PersonID
where pp.PropertyName = 'Age'
and pp.PropertyValue > 18 -- probably need to convert to integer here
For the second resultset, we are making an outer join of the first resultset with PersonExtendedProperties table to get the rest of the extended properties. It's a 'narrow' resultset, we do not pivot the properties in sql, so we don't need multiple joins here.
Actually we use separate tables for different types to avoid data type conversion, to have extended properties indexed and easily queriable.

My recommendation:
Allow properties to be marked as indexable. Have a smallish hard limit on number of indexable properties, and on columns per object. Have a large hard limit on total column types in all objects.
Implement indexes as separate tables (one per index) joined with main table of data (main table has large unique key for object). (Index tables can then be created/dropped as required).
Serialize the data, including the index columns, plus put the index propertoes in first class relational columns in their dedicated index tables. Use JSON instead of XML to save space in the table. Enforce short column name policy (or long display name and short stored name policy) to save space and increase performance.
Use quarks for field identifiers (but only in the main engine to save RAM and speed some read operations -- don't rely on quark pointer comparison in all cases).
My thought on your options:
1 is a possible. Performance clearly will be lower than if field ID columns not stored.
2 is a no in general DB engines not all happy about dynamic schema changes. But a possible yes if your DB engine is good at this.
3 Possible.
4 Yes though I'd use JSON.
5 Seems like 4 only less optimized??
6 Sounds good; would go with if happy to try something new and also if happy about reliability and performance but usually would want to go with more mainstream technology. I'd also like to reduce the number of engines involved in coordinating a transaction to less then would be true here.
Edit: But of course though I've recommened something there can be no general right answer here -- profile various data models and approaches with your data to see what runs best for your application.
Edit: Changed last edit wording.

Assuming you an place a limit, N, on how many custom attributes each user can define; just add N extra columns to the Person table. Then have a separate table where you store per-user metadata to describe how to interpret the contents of those columns for each user. Similar to #1 once you've read in the data, but no joins needed to pull in the custom attributes.

For a problem similar to your problem, we have used the "XML Column" approach (the fourth one in your survey of methods). But you should note that many databases (DBMS) support index for xml values.
I recommend you to use one table for Person which contains one xml column along with other common columns. In other words, design the Person table with columns that are common for all person records and add a single xml column for dynamic and differing attributes.
We are using Oracle. it supports index for its xml-type. Two types of indices are supported: 1- XMLIndex for indexing elements and attributes within an xml, 2- Oracle Text Index for enabling full-text search in text fields of the xml.
For example, in Oracle you can create an index such as:
CREATE INDEX index1 ON table_name (XMLCast(XMLQuery ('$p/PurchaseOrder/Reference'
PASSING XML_Column AS "p" RETURNING CONTENT) AS VARCHAR2(128)));
and xml-query is supported in select queries:
SELECT count(*) FROM purchaseorder
WHERE XMLCast(XMLQuery('$p/PurchaseOrder/Reference'
PASSING OBJECT_VALUE AS "p" RETURNING CONTENT)
AS INTEGER) = 25;
As I know, other databases such as PostgreSQL and MS SQL Server (but not mysql) support such index models for xml value.
see also:
http://docs.oracle.com/cd/E11882_01/appdev.112/e23094/xdb_indexing.htm#CHDEADIH

Programming pattern using typed datasets in VS 2008

I'm currently doing the following to use typed datasets in vs2008:
Right click on "app_code" add new dataset, name it tableDS.
Open tableDS, right click, add "table adapter"
In the wizard, choose a pre defined connection string, "use SQL statements"
select * from tablename and next + next to finish. (I generate one table adapter for each table in my DB)
In my code I do the following to get a row of data when I only need one:
cpcDS.tbl_cpcRow tr = (cpcDS.tbl_cpcRow)(new cpcDSTableAdapters.tbl_cpcTableAdapter()).GetData().Select("cpcID = " + cpcID)[0];
I believe this will get the entire table from the database and to the filtering in dotnet (ie not optimal), is there any way I can get the tableadapter to filer the result set on the database instead (IE what I want to is send select * from tbl_cpc where cpcID = 1 to the database)
And as a side note, I think this is a fairly ok design pattern for getting data from a database in vs2008. It's fairly easy to code with, read and mantain. But I would like to know it there are any other design patterns that is better out there? I use the datasets for read/update/insert and delete.

A bit of a shift, but you ask about different patterns - how about LINQ? Since you are using VS2008, it is possible (although not guaranteed) that you might also be able to use .NET 3.5.
A LINQ-to-SQL data-context provides much more managed access to data (filtered, etc). Is this an option? I'm not sure I'd go "Entity Framework" at the moment, though (see here).
Edit per request:
to get a row from the data-context, you simply need to specify the "predicate" - in this case, a primary key match:
int id = ... // the primary key we want to look for
using(var ctx = new MydataContext()) {
SomeType record = ctx.SomeTable.Single(x => x.SomeColumn == id);
//... etc
// ctx.SubmitChanges(); // to commit any updates
}
The use of Single above is deliberate - this particular usage [Single(predicate)] allows the data-context to make full use of local in-memory data - i.e. if the predicate is just on the primary key columns, it might not have to touch the database at all if the data-context has already seen that record.
However, LINQ is very flexible; you can also use "query syntax" - for example, a slightly different (list) query:
var myOrders = from row in ctx.Orders
where row.CustomerID = id && row.IsActive
orderby row.OrderDate
select row;
etc

There is two potential problem with using typed datasets,
one is testability. It's fairly hard work to set up the objects you want to use in a unit test when using typed datasets.
The other is maintainability. Using typed datasets is typically a symptom of a deeper problem, I'm guessing that all you business rules live outside the datasets, and a fair few of them take datasets as input and outputs some aggregated values based on them. This leads to business logic leaking all over the place, and though it will all be honky-dory the first 6 months, it will start to bite you after a while. Such a use of DataSets are fundamentally non-object oriented
That being said, it's perfectly possible to have a sensible architecture using datasets, but it doesn't come naturally. An ORM will be harder to set up initially, but will lend itself nicely to writing maintainable and testable code, so you don't have to look back on the mess you made 6 months from now.

You can add a query with a where clause to the tableadapter for the table you're interested in.
LINQ is nice, but it's really just shortcut syntax for what the OP is already doing.
Typed Datasets make perfect sense unless your data model is very complex. Then writing your own ORM would be the best choice. I'm a little confused as to why Andreas thinks typed datasets are hard to maintain. The only annoying thing about them is that the insert, update, and delete commands are removed whenever the select command is changed.
Also, the speed advantage of creating a typed dataset versus your own ORM lets you focus on the app itself and not the data access code.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.