C# and SQL Server: normalizing large sets of Urls

I have many tables in the database that have at least one column containing a Url, and these values are repeated a lot throughout the database. So I normalize them to a dedicated table and just use numeric IDs everywhere I need them. I often need to join on them, so numeric IDs are much better than full strings.
In MySQL + C++, to insert a lot of Urls in one go, I used to use multi-row INSERT IGNOREs or mysql_set_local_infile_handler(), then batch SELECT with IN () to pull the IDs back from the database.
In C# + SQLServer I noticed there's a SqlBulkCopy class that's very useful and fast in mass-insertion. But I also need mass-selection to resolve the Url IDs after I insert them. Is there any such helper class that would work the same as SELECT WHERE IN (many, urls, here)?
Or do you have a better idea for turning Urls into numbers in a consistent manner in C#? I thought about crc32'ing the urls or crc64'ing them but I worry about collisions. I wouldn't care if collisions are few, but if not... it would be an issue.
PS: We're talking about tens of millions of Urls to get an idea of scale.
PS: For basic large inserts, SqlBulkCopy is faster than SqlDbType.Structured. Plus it has the SqlRowsCopied event for a status tracking callback.
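For reference, a minimal sketch of the SqlBulkCopy + SqlRowsCopied combination mentioned above (the dbo.Url table, its single Url column, the connection string and the method signature are made-up placeholders, not part of the question):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static void BulkInsertUrls(string connectionString, IEnumerable<string> urls)
{
    // stage the URLs in a DataTable so SqlBulkCopy can stream them to the server
    var table = new DataTable();
    table.Columns.Add("Url", typeof(string));
    foreach (var url in urls)
        table.Rows.Add(url);

    using (var bulk = new SqlBulkCopy(connectionString))
    {
        bulk.DestinationTableName = "dbo.Url";
        bulk.NotifyAfter = 10000;   // raise SqlRowsCopied every 10,000 rows
        bulk.SqlRowsCopied += (s, e) => Console.WriteLine("{0} rows copied", e.RowsCopied);
        bulk.WriteToServer(table);
    }
}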

There is an even better way than SqlBulkCopy.
It's called Structured Parameters and it allows you to pass a table-valued parameter to a stored procedure or query through ADO.NET.
There are code examples in the article, so I will only highlight what you need to do to get it up and working:
Create a user defined table type in the database. You can call it UrlTable
Set up an SP or query which does the SELECT by joining with a table variable of type UrlTable
In your backing code (C#), create a DataTable with the same structure as UrlTable, populate it with URLs and pass it to an SqlCommand as a structured parameter. Note that column order correspondence is critical between the data table and the table type.
What ADO.NET does behind the scenes (you can see this if you profile the query) is that before the query it declares a variable of type UrlTable and populates it with INSERT statements from what you pass in the structured parameter.
Other than that, query-wise, you can do pretty much everything with table-valued parameters in SQL (join, select, etc).
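A minimal sketch of the C# side of those steps (the dbo.UrlTable type, the dbo.GetUrlIds proc and the Id/Url column names are assumptions for illustration, not from the answer):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Assumes something like:
//   CREATE TYPE dbo.UrlTable AS TABLE (Url nvarchar(450) NOT NULL);
//   CREATE PROC dbo.GetUrlIds @Urls dbo.UrlTable READONLY AS
//     SELECT u.Id, u.Url FROM dbo.Url u JOIN @Urls p ON p.Url = u.Url;
static Dictionary<string, long> ResolveUrlIds(string connectionString, IEnumerable<string> urls)
{
    var table = new DataTable();
    table.Columns.Add("Url", typeof(string));   // column order must match dbo.UrlTable
    foreach (var url in urls)
        table.Rows.Add(url);

    var ids = new Dictionary<string, long>();
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand("dbo.GetUrlIds", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        var p = cmd.Parameters.AddWithValue("@Urls", table);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.UrlTable";

        conn.Open();
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                ids[reader.GetString(1)] = reader.GetInt64(0);   // Url -> Id
    }
    return ids;
}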

I think you could use the IGNORE_DUP_KEY option on your index. If you set IGNORE_DUP_KEY = ON on the index of the URL column, the duplicate values are simply ignored and the rest are inserted appropriately.
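As a rough sketch, the index might be declared like this, executed from C# to stay consistent with the other examples (dbo.Url/Url and connectionString are hypothetical names, not from the answer):

// requires System.Data.SqlClient; connectionString is assumed to exist
// With IGNORE_DUP_KEY = ON, rows whose Url already exists are silently skipped
// (with a "Duplicate key was ignored." warning) instead of failing the whole insert.
const string ddl = @"
CREATE UNIQUE INDEX UX_Url_Url
ON dbo.Url (Url)
WITH (IGNORE_DUP_KEY = ON);";

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(ddl, conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}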


Can I Insert the Results of a Select Statement Into Another Table Without a Roundtrip?

I have a web application that is written in MVC.Net using C# and LINQ-to-SQL (SQL Server 2008 R2).
I'd like to query the database for some values, and also insert those values into another table for later use. Obviously, I could do a normal select, then take those results and do a normal insert, but that will result in my application sending the values back to the SQL server, which is a waste as the server is where the values came from.
Is there any way I can get the select results in my application and insert them into another table without the information making a roundtrip from the SQL server to my application and back again?
It would be cool if this was in one query, but that's less important than avoiding the roundtrip.
Assume whatever basic schema you like, I'll be extrapolating your simple example to a much more complex query.
Can I Insert the Results of a Select Statement Into Another Table Without a Roundtrip?
From a "single-query" and/or "avoid the round-trip" perspective: Yes.
From a "doing that purely in Linq to SQL" perspective: Well...mostly ;-).
The three pieces required are:
The INSERT...SELECT construct:
By using this we get half of the goal in that we have selected data and inserted it. And this is the only way to keep the data entirely at the database server and avoid the round-trip. Unfortunately, this construct is not supported by Linq-to-SQL (or Entity Framework): Insert/Select with Linq-To-SQL
The T-SQL OUTPUT clause:
This allows for doing what is essentially the tee command in Unix shell scripting: save and display the incoming rows at the same time. The OUTPUT clause just takes the set of inserted rows and sends it back to the caller, providing the other half of the goal. Unfortunately, this is also not supported by Linq-to-SQL (or Entity Framework). Now, this type of operation can also be achieved across multiple queries when not using OUTPUT, but there is really nothing gained since you then either need to a) create a temp table to dump the initial results into that will be used to insert into the table and then selected back to the caller, or b) have some way of knowing which rows that were just inserted into the table are new so that they can be properly selected back to the caller.
The DataContext.ExecuteQuery<TResult> (String, Object[]) method:
This is needed due to the two required T-SQL pieces not being supported directly in Linq-to-SQL. And even if the clunky approach to avoiding the OUTPUT clause is done (assuming it could be done in pure Linq/Lambda expressions), there is still no way around the INSERT...SELECT construct that would not be a round-trip.
Hence, multiple queries that are all pure Linq/Lambda expressions equates to a round-trip.
The only way to truly avoid the round-trip should be something like:
var _MyStuff = db.ExecuteQuery<Stuffs>(@"
INSERT INTO dbo.Table1 (Col1, Col2, Col3)
OUTPUT INSERTED.*
SELECT Col1, Col2, Col3
FROM dbo.Table2 t2
WHERE t2.Col4 = {0};",
_SomeID);
And just in case it helps anyone (since I already spent the time looking it up :), the equivalent command for Entity Framework is: Database.SqlQuery<TElement> (String, Object[])
Try this query, adjusted to your requirement:
insert into IndentProcessDetails (DemandId, DemandMasterId, DemandQty)
select DemandId, DemandMasterId, DemandQty from DemandDetails

Compare an array with a "very large" table of a SQL Server database

In a C# program I have an array with about 100,000 elements.
Then I have a SQL Server 2008 table whose primary key column contains nearly all elements of the array (but a few are missing). The table can have up to 30,000,000 rows.
Now I want to determine which elements of the array do not exist in the table. How can this be achieved efficiently?
The most efficient method would probably be to bulk-insert those 100,000 elements into a temp table and then perform the comparison within the database itself.
(Note that I haven't tested this theory; it's just an educated guess.)
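An untested sketch of that idea, assuming an int key and hypothetical names (dbo.BigTable, Id, connectionString):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

static List<int> FindMissing(string connectionString, int[] keys)
{
    var missing = new List<int>();
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // 1) the temp table lives for the lifetime of this connection
        using (var create = new SqlCommand("CREATE TABLE #Keys (Id int PRIMARY KEY);", conn))
            create.ExecuteNonQuery();

        // 2) bulk-copy the ~100,000 array elements into it
        var table = new DataTable();
        table.Columns.Add("Id", typeof(int));
        foreach (var k in keys)
            table.Rows.Add(k);
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Keys" })
            bulk.WriteToServer(table);

        // 3) let the server compute the difference against the 30M-row table
        using (var cmd = new SqlCommand(
            "SELECT Id FROM #Keys EXCEPT SELECT Id FROM dbo.BigTable;", conn))
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                missing.Add(reader.GetInt32(0));
    }
    return missing;
}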
Query the table with a
select <primarykey> from <table> where <primarykey> in (<primary keys of your list of elements in C#>)
This should be faster than inserting all rows into a table and then checking with an except/minus command for missing elements, because it does not involve any write operation.
Once you have the list of primary keys which are common, pull it back into C# and compare.
A way to avoid creating temp tables would be to use a stored procedure which accepts a table valued parameter of a user-defined table type (udtt). This table would have a schema of one column of a data type matching that in your array.
If you populate a DataTable (with a schema matching the udtt schema) with your array values and supply the data table as your stored proc's parameter, you can pass up all 100,000 of your items in their sql binary format. The proc can just do a join between the 30M row table and the table-valued parameter, returning the items in the TVP table with no matches in the master table.
This avoids needing to build massive IN statements.
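A rough sketch of that comparison, here with an ad-hoc parameterized query rather than a stored proc (the dbo.IdList type, dbo.BigTable, the int key and connectionString are assumptions):

using System;
using System.Data;
using System.Data.SqlClient;

// Assumes: CREATE TYPE dbo.IdList AS TABLE (Id int PRIMARY KEY);
static void PrintMissing(string connectionString, int[] array)
{
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    foreach (var id in array)
        table.Rows.Add(id);

    const string sql = @"
SELECT p.Id
FROM @Ids p
LEFT JOIN dbo.BigTable t ON t.Id = p.Id
WHERE t.Id IS NULL;";   // array items with no match in the 30M-row table

    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(sql, conn))
    {
        var p = cmd.Parameters.AddWithValue("@Ids", table);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.IdList";

        conn.Open();
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                Console.WriteLine(reader.GetInt32(0));
    }
}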
EDIT Regarding the comment from @Kyro below
I'm now less confident in this approach. I found an article showing the under-the-covers row-by-row inserts that Kyro describes. What you might gain by sending binary data over the network rather than a large TSQL WHERE IN () statement may well be taken away by the performance on the SQL side. However, it's a fairly simple code approach, so it might just be worth a quick test. Let us know how you get on?

Is it possible to insert data into a table through code without using the table name

My question is: generally we write the following in code when inserting data into a table
insert into tblname values('"+txt.text+"','"+txt1.text+"');
As we pass the data from the text boxes like that, is it possible to insert into the table without using the table name directly?
Well you obviously need to know what table to insert into, so there has to be a table name identified to the INSERT statement. The options include:
an INSERT statement with actual table name as per your existing example
an INSERT statement with a synonym as the target (alias for an actual table - see: http://blog.sqlauthority.com/2008/01/07/sql-server-2005-introduction-and-explanation-to-synonym-helpful-t-sql-feature-for-developer/)
an INSERT statement with an updateable view as the target
a sproc call whereby the sproc knows the table to INSERT into (but the calling code does not need to know)
You should also be aware of SQL injection risks with your example - avoid concatenating values directly into a SQL string to execute. Instead, parameterise the SQL.
If you need to dynamically specify the table to insert into at run time, you have to concatenate the table name into the SQL statement you then execute. However, be very wary of SQL injection - make sure you fully validate the tablename to make sure there are no nasties in it. You could even check it is a real table by checking for it in sys.tables.
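A hedged sketch of that last point (the Foo/Bar columns, the method shape and the connection handling are just placeholders):

using System;
using System.Data.SqlClient;

static void InsertInto(string connectionString, string tableName, string foo, string bar)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();

        // validate the runtime-supplied table name against sys.tables first
        using (var check = new SqlCommand(
            "SELECT COUNT(*) FROM sys.tables WHERE name = @name", conn))
        {
            check.Parameters.AddWithValue("@name", tableName);
            if ((int)check.ExecuteScalar() == 0)
                throw new ArgumentException("Unknown table: " + tableName);
        }

        // the values stay parameterised; only the validated table name is concatenated
        using (var insert = new SqlCommand(
            "INSERT INTO [" + tableName + "] (Foo, Bar) VALUES (@foo, @bar)", conn))
        {
            insert.Parameters.AddWithValue("@foo", foo);
            insert.Parameters.AddWithValue("@bar", bar);
            insert.ExecuteNonQuery();
        }
    }
}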
Not possible without the name of the table.
But you can make use of LINQ to SQL (i.e. any ORM) or DataAdapter.Update if you have filled it with the proper table...
You cannot do that without the table name, no. However, the bigger problem is that your code is horribly dangerous and at risk from SQL injection. You should fix this right now, today, immediately. Injection, even for internal apps, is the single biggest risk. Better code would be:
insert into tblname (Foo, Bar) values(@foo, @bar)
adding the parameters @foo and @bar to your command (obviously, replace with sensible names).
Before you ask: no, the table name cannot be parameterised; you cannot use
insert into @tblname -- blah
The table name(s) is(/are) fundamental in any query or operation.
If it is possible at all, I suppose you have to use parameters, as in the example above.

Reading custom data from SQL tables

We have an application that allows the user to add custom columns to our tables (maybe not the best idea, but that's how it is).
We are now (re)designing our data access layer (we didn't really have one before) and we're going to use parameterized queries in our data mappers when querying the SQL database (earlier we concatenated the SQL strings and escaped all input).
Now we're trying to determine the best way of handling the custom columns in order to both query, create and update these records. The custom attributes are going to be stored in a Dictionary on our "business objects" so I was thinking about doing it like this:
Querying data
Use SELECT * to get all columns and populate our properties and store the rest (custom data) in a dictionary on the business object.
Create/Update
Iterate all columns in the table (with something like: SELECT COLUMN_NAME FROM information_schema.columns WHERE TABLE_NAME = 'TableName')
Generate a SQL string (with parameterized variable names) by checking which columns exist in both the dictionary and the table, and then add the values from the dictionary as parameters to the SqlCommand
Or are there any better approaches while still using parameterized queries?
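For what it's worth, a rough sketch of the update path described above (the Id key column, the CustomData dictionary shape and the table-name handling are assumptions, not from the question):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

static void UpdateCustomColumns(
    SqlConnection conn, string tableName, int id, Dictionary<string, object> customData)
{
    // 1) find out which columns actually exist in the table
    var tableColumns = new HashSet<string>();
    using (var cmd = new SqlCommand(
        "SELECT COLUMN_NAME FROM information_schema.columns WHERE TABLE_NAME = @table", conn))
    {
        cmd.Parameters.AddWithValue("@table", tableName);
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                tableColumns.Add(reader.GetString(0));
    }

    // 2) keep only the dictionary entries that match a real column
    var columns = customData.Keys.Where(tableColumns.Contains).ToList();
    if (columns.Count == 0) return;

    // 3) build a parameterized UPDATE: values never end up in the SQL text, column
    //    names come from the schema, and parameters get synthetic names (@p0, @p1, ...)
    //    (the table name itself should come from a trusted list as well)
    var setList = string.Join(", ",
        columns.Select((c, i) => "[" + c + "] = @p" + i));
    var sql = "UPDATE [" + tableName + "] SET " + setList + " WHERE Id = @id";

    using (var update = new SqlCommand(sql, conn))
    {
        update.Parameters.AddWithValue("@id", id);
        for (int i = 0; i < columns.Count; i++)
            update.Parameters.AddWithValue("@p" + i, customData[columns[i]] ?? DBNull.Value);
        update.ExecuteNonQuery();
    }
}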
If you are adding ad-hoc columns, ORM gets very tricky. In some ways, dropping back to DataTable/DataAdapter (of which I am not a fan) may be an option. Personally, I would look first at other options for storing the custom data:
an xml column
a set of key/value pairs against each record (in a second table)
some other delimited format in a [n]varchar(max)
Do you really have to add columns?

How to get the primary key from a table without making a second trip?

How would I get the primary key ID number from a Table without making a second trip to the database in LINQ To SQL?
Right now, I submit the data to a table, and make another trip to figure out what id was assigned to the new field (in an auto increment id field). I want to do this in LINQ To SQL and not in Raw SQL (I no longer use Raw SQL).
Also, the second part of my question: I am always careful to know the ID of a user that's online because I'd rather call their information in various tables using their ID as opposed to using a GUID or a username, which are all long strings. I do this because I think that SQL Server doing a numeric compare is much (?) more efficient than doing a username (string) or even a GUID (very long string) compare. My question is, am I more concerned than I should be? Is the difference worth always keeping the userid (int32) in, say, session state?
@RedFilter provided some interesting/promising leads for the first question. Since I am unable to try them at this stage, can anyone confirm the changes he recommended in the comments section of his answer?
If you have a reference to the object, you can just use that reference and call the primary key after you call db.SubmitChanges(). The LINQ object will automatically update its (Identifier) primary key field to reflect the new one assigned to it via SQL Server.
Example (vb.net):
Dim db As New NorthwindDataContext
Dim prod As New Product
prod.ProductName = "cheese!"
db.Products.InsertOnSubmit(prod)
db.SubmitChanges()
MessageBox.Show(prod.ProductID)
You could probably include the above code in a function and return the ProductID (or equivalent primary key) and use it somewhere else.
EDIT: If you are not doing atomic updates, you could add each new product to a separate Collection and iterate through it after you call SubmitChanges. I wish LINQ provided a 'database sneak peek' like a dataset would.
Unless you are doing something out of the ordinary, you should not need to do anything extra to retrieve the primary key that is generated.
When you call SubmitChanges on your Linq-to-SQL datacontext, it automatically updates the primary key values for your objects.
Regarding your second question - there may be a small performance improvement by doing a scan on a numeric field as opposed to something like varchar() but you will see much better performance either way by ensuring that you have the correct columns in your database indexed. And, with SQL Server if you create a primary key using an identity column, it will by default have a clustered index over it.
Linq to SQL automatically sets the identity value of your class with the ID generated when you insert a new record. Just access the property. I don't know if it uses a separate query for this or not, having never used it, but it is not unusual for ORMs to require another query to get back the last inserted ID.
Two ways you can do this independent of Linq To SQL (that may work with it):
1) If you are using SQL Server 2005 or higher, you can use the OUTPUT clause:
Returns information from, or expressions based on, each row affected by an INSERT, UPDATE, or DELETE statement. These results can be returned to the processing application for use in such things as confirmation messages, archiving, and other such application requirements. Alternatively, results can be inserted into a table or table variable.
2) Alternately, you can construct a batch INSERT statement like this:
insert into MyTable
(field1)
values
('xxx');
select scope_identity();
which works at least as far back as SQL Server 2000.
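For completeness, a minimal sketch of running that batch from plain ADO.NET and reading the identity back (connectionString is assumed; MyTable/field1 follow the example above):

using System;
using System.Data.SqlClient;

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "insert into MyTable (field1) values (@value); select scope_identity();", conn))
{
    cmd.Parameters.AddWithValue("@value", "xxx");
    conn.Open();
    // scope_identity() comes back as numeric, which ADO.NET surfaces as decimal
    int newId = Convert.ToInt32(cmd.ExecuteScalar());
}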
In T-SQL, you could use the OUTPUT clause, saying:
INSERT table (columns...)
OUTPUT inserted.ID
SELECT columns...
So if you can configure LINQ to use that construct for doing inserts, then you can probably get it back easily. But whether LINQ can get a value back from an insert, I'll let someone else answer that.
Calling a stored procedure from LINQ that returns the ID as an output parameter is probably the easiest approach.
