I am looking to insert a 20-25MB xml file into a database on a daily basis. The issue is that each entry needs an extra column added with a calculated value. So what I am wondering is if the most efficient way to do this would be using the SQLXML Bulk Load tools after editing the xml file, running through the xml file and add the new column then loading each item, or using the Bulk Load followed by going through the database adding the new column values.
Comments = answer
There is no need to store this value seperate. Since it's a calculated value with the data you need to calculate it on each record, you can calculate this on the fly instead of storing it as it's own unique value. A mix of where and/or having clauses will allow for filtering (searching) of results based on that calculated value.
Related
This may be a dumb question, but I wanted to be sure. I am creating a Winforms app, and using c# oledbconnection to connect to a MS Access database. Right now, i am using a "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if the performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something" so basically use a "WHERE" statement instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the data on a column before retrieving all of the columns, versus retrieving all of the columns and then filtering out the data. Relational data should be dealt with according to set logic, which is how a WHERE clause works, according to the data set. The loop is not set logic and compares each individual row, expensively, discarding those that don’t meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
yes, of course.
if you have a access database file - say shared on a folder. Then you deploy your .net desktop application to each workstation?
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And say if this is/was multi-user? Then again, even on a network - again ONLY ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? Access data engine needs to have a useable index on that criteria. Of course by default the PK column (ID) always has that index - so no worries there. But if as per above we are pulling invoice numbers from a table - then having a index on that column (InvoiceNumber) is required for the data engine to only pull one row. If no index can be used - then all rows behind the scenes are pulled until a match occurs - and over a network, then this means significant amounts of data will be pulled without that index across that network (or if local - then pulled from the file on the disk).
I am looking for advice on how should I do following:
I have a table in SQL server with about 3 -6 Million Records and 51 Columns.
only one column needs to be updated after calculating a value from 45 columns data been taken in mathematical calculation.
I already have maths done through C#, and I am able to create Datatable out of it [with millions record yes].
Now I want to update them into database with most efficient manner. Options I know are
Run update query with every record, as I use loop on data reader to do math and create DataTable.
Create A temporary table and use SQLBulkCopy to copy data and later use MERGE statement
Though it is very HARD to do, but can try to make Function within SQL to do all math and just run simple update without any condition to update all in once.
I am not sure which method is faster one or better one. Any idea?
EDIT: Why I am afraid of using Stored Procedure
First I have no idea how i wrote it, I am pretty new to do this. Though maybe it is time to do it now.
My Formula is Take one column, apply one forumla on them, along with additional constant value [which is also part of Column name], then take all 45 columns and apply another formula.
The resultant will be stored in 46th column.
Thanks.
If you have a field that contains a calculation from other fields in the database, it is best to make it a calculated field or to maintain it through a trigger so that anytime the data is changed from any source, the calculation is maintained.
You can create a .net function which can be called directly from sql here is the link how to create one http://msdn.microsoft.com/en-us/library/w2kae45k%28v=vs.90%29.aspx. After you created the function run the simple update statement
Can't you create a scalar valued function in c#, and call it in as part of a computed column?
I'm implementing DynamoDB in our project. We have to put large data strings into database so we are splitting data into small pieces and inserting multiple rows with only one attribute value changed - part of string. One column (range key) contains a number of part. Inserting and selecting data works perfectly fine for small and large strings. The problem is deleting an item. I read that when you want to delete an item you need to specify primary key for such item (hash key or hash key and range key - depends on table). But what if I want to delete items that have particular value for one of attributes? Do I need to scan (scan, not query) entire table and for each row run delete or batch delete? Or is there some another solution without using two queries? What I'm trying to do is to avoid scanning entire table. I think we will have about 100 - 1000 milions of rows in such table, so scanning will be very slow.
Thanks for help.
There are no way to delete an arbitrary element in DynamoDB. You indeed need to know the hash_key and the range_key.
If query does not fit your needs for this (ie. you even do not know the hash_key), then you're stuck.
The best would be to re-thing your data modeling. Build a custom index or do 'lazy delete'.
To achieve 'lazy delete', use a table as a queue of element to delete. Periodically, run an EMR on it to do all the delete in the batch in a single scan operation. It's really not the best solution but the only way I can think of to avoid re-modeling.
TL;DR: There is no real way but workarounds. I highly recommend that you re-model at least part of your data.
I am a complete newbie to SSIS.
I have a c#/sql server background.
I would like to know whether it is possible to validate data before it goes into a database. I am grabbing text from a |(pipe) delimited text file.
For example, if a certain datapoint is null, then change it to 0 or if a certain datapoint's length is 0, then change to "nada".
I don't know if this is even possible with SSIS, but it would be most helpful if you can point me into the right direction.
anything is possible with SSIS!
after your flat file data source, use a Derived Column Transformation. Deriving a new column with the expression being something like the following.
ISNULL(ColumnName) ? "nada" : ColumnName
Then use this new column in your data source destination.
Hope it helps.
I don't know if you're dead set on using SSIS, but the basic method I've generally used to import textfile data into a database generally takes two stages:
Use BULK INSERT to load the file into a temporary staging table on the database server; each of the columns in this staging table are something reasonably tolerant of the data they contain, like a varchar(max).
Write up validation routines to update the data in the temporary table and double-check to make sure that it's well-formed according to your needs, then convert the columns into their final formats and push the rows into the destination table.
I like this method mostly because BULK INSERT can be a bit cryptic about the errors it spits out; with a temporary staging table, it's a lot easier to look through your dataset and fix errors on the fly as opposed to rooting through a text file.
When program runs 1st time it just gets some fields from a source database table say:
SELECT NUMBER, COLOR, USETYPE, ROOFMATERIALCODE FROM HOUSE; //number is uniq key
it does some in-memory processing say converting USETYPE and ROOFMATERIAL to destination database format (by using cross ref table).
Then program inserts ALL THE ROWS to destination database:
INSERT INTO BUILDING (BUILDINGID, BUILDINGNUMBER, COLOR, BUILDINGTYPE, ROOFMAT)
VALUES (PROGRAM_GENERATED_ID, NUMBER_FROM_HOUSE, COLOR_FROM_HOUSE,
CONVERTED_USETYPE_FROM_HOUSE, CONVERTED_ROOFMATERIALCODE_FROM_HOUSE);
The above is naturally not SQL but you get the idea (the values with underscores just describe the data inserted).
The next times the program should do the same except:
insert only the ones not found from target database.
update only the ones that have updated color, usetype, roofmaterialcode.
My question is:
How to implement this in efficient way?
-Do I first populate DataSet and convert fields to destination format?
-If I use only 1 DataSet how give destination db BUILDING_IDs (can i add columns to populated DataSet?)
-How to efficiently check if destination rows need refresh (if i select them one # time by BUILDING_NUMBER and check all fields it's gonna be slow)?
Thanks for your answers!
-matti
If you are using Oracle, have you looked at the MERGE statement? You give the merge statement a criteria. If records match the criteria, it performs an UPDATE. If they don't match the criteria (they aren't already in the table), it performs an INSERT. That might be helpful for what you are trying to do.
Here is the spec/example of merge.