I am dealing with a problem: most of our columns were created with the default EF behaviour, which maps string properties to nvarchar(max). However, that doesn't combine well with indexes.
I tried putting the [MaxLength(100)] attribute onto the specific column and generating a migration. That produces an ALTER TABLE statement which, when run on a database with a lot of data, spikes the DTU and basically trashes the DB.
I am now looking for a safe way to proceed with this (let's say the column name is "FileName"):
Create a column FileNameV2 with [MaxLength(100)].
Copy data from FileName column to FileNameV2.
Delete FileName column.
Rename FileNameV2 to FileName
Would this approach work, or is there a better/easier way (especially one that doesn't upset EF)?
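For reference, here is a minimal sketch of what those four steps could look like as a hand-written migration. This assumes EF Core; the table name dbo.Documents is a placeholder, and on a large table you would want to run the data copy in batches so the DTU doesn't spike again:

    using Microsoft.EntityFrameworkCore.Migrations;

    public partial class ShrinkFileNameColumn : Migration
    {
        protected override void Up(MigrationBuilder migrationBuilder)
        {
            // 1. Add the new, properly sized column.
            migrationBuilder.Sql("ALTER TABLE dbo.Documents ADD FileNameV2 nvarchar(100) NULL;");

            // 2. Copy the data over (batch this UPDATE on large tables).
            migrationBuilder.Sql("UPDATE dbo.Documents SET FileNameV2 = LEFT(FileName, 100);");

            // 3. Drop the old nvarchar(max) column.
            migrationBuilder.Sql("ALTER TABLE dbo.Documents DROP COLUMN FileName;");

            // 4. Rename the new column back to the original name.
            migrationBuilder.Sql("EXEC sp_rename 'dbo.Documents.FileNameV2', 'FileName', 'COLUMN';");
        }

        protected override void Down(MigrationBuilder migrationBuilder)
        {
            // Widen the column back if the migration is ever rolled back.
            migrationBuilder.Sql("ALTER TABLE dbo.Documents ALTER COLUMN FileName nvarchar(max) NULL;");
        }
    }

The model itself still only needs the [MaxLength(100)] attribute, since the end state of the column matches it.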
The main issue, as I found out later, was that our Azure SQL database had a max size of 2 GB. The DB was at about 1.5 GB when I made the change, so it probably hit its size limit during the transition from nvarchar(max) to nvarchar(100). So the lesson is: double-check the max size of your Azure DB to be sure you don't hit this threshold.
This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop if it is. I wonder if performance would be improved if I used something like "SELECT * FROM table_name WHERE id=something" - so, basically, a WHERE clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, using a WHERE clause will be better than retrieving the data and then filtering via a loop 100% of the time. This is for a few different reasons, but ultimately you are letting the database filter the rows on a column before anything is retrieved, versus retrieving every row and then throwing data away. Relational data should be dealt with according to set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
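As a concrete illustration, here is a minimal sketch of the WHERE version of the query from the question, with a parameter instead of string concatenation (the connection string, table name and id value are placeholders, not a tested setup):

    using System;
    using System.Data.OleDb;

    class Lookup
    {
        static void Main()
        {
            // Placeholder connection string - adjust the provider and path to your .accdb/.mdb file.
            var connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\mydb.accdb;";

            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand("SELECT * FROM table_name WHERE id = ?", conn))
            {
                cmd.Parameters.AddWithValue("?", 42); // OleDb parameters are positional
                conn.Open();

                using (var reader = cmd.ExecuteReader())
                {
                    if (reader.Read())
                    {
                        // Only the matching row is ever sent back by the database.
                        Console.WriteLine(reader["id"]);
                    }
                }
            }
        }
    }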
Yes, of course.
Say you have an Access database file shared on a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows is going to take a HUGE amount of time, but if you add criteria to your SELECT, then in this case it is about 1 million times faster to pull one row as opposed to the whole table.
And what if this is multi-user? Then again, even over the network, ONLY the one record that meets your criteria will be pulled. The only requirement for this one-row pull over the network? The Access data engine needs a usable index on that criteria. By default the PK column (ID) always has that index - so no worries there. But if, as per the above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs - and over a network this means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
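For completeness, if InvoiceNumber has no index yet, one can be added with a single DDL statement through the same OleDb connection; a sketch (the connection string is a placeholder, the table/column names come from the example above):

    using System.Data.OleDb;

    class AddIndex
    {
        static void Main()
        {
            var connStr = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\data\mydb.accdb;";
            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand(
                "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery(); // lets the data engine satisfy the WHERE clause from the index
            }
        }
    }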
In order to better define my problem I'll explain in steps:
I need to consolidate selected data from 4 databases into one.
Each database logs data obtained from an industrial system (sensors and switches, mainly).
DBs are in .accdb format with encryption
Each source database has 3 columns:
timestamp (datetime format)
point_id (Variable name - text format)
_VAL (Variable value - text format in two DBs, byte in the other two DBs)
Variable value is logged in one row every time it changes (1-second resolution), and all variables are logged once every 15 minutes (to get a snapshot of the system every so often). Example:
1/9/2014 1:35:54 AM - Tank_Volume - 5,763
1/9/2014 1:35:54 AM - Line_Pressure - 14,325
1/9/2014 1:35:55 AM - Tank_Volume - 5,121
1/9/2014 1:35:56 AM - Tank_Volume - 4,911
I'm logging a total of 511 variables
The output DB requirements are:
Each row must contain one second of data for all variables, sequentially and without skipping seconds
Each variable must have its own column (511 variables + 1 for timestamp), preferably with an appropriate format to save on space (output DB must be sent by e-mail)
If the variable value hasn't changed for the given second, it can take the last logged value for that variable
It must contain data only for a selected period of time (e.g.: from 1/8/2014 1:30:00 AM to 1/8/2014 3:45:00 AM) - I have the fields for selection in the UI
The user must be prompted to save this consolidated DB
The DB should be optimized in order to reduce its size after all data is copied to it
I know it's not too complex, but I want an opinion on the best way to deal with all this data. The source databases might be more than 1 GB each (many, many days of logs). I'll usually pull only 3-4 hours of data from them into the output DB, but that's still 14,000+ rows (one per second) with 512 columns, parsed cell by cell... I imagine that's a lot to process, right?
My idea is to:
Establish connection with the 4 source DBs (they are located in one fixed directory)
Select the data to be extracted from each DB (based on the UI Start and End datetime fields) and place it in one large DataTable (SourceData) - a sketch of this step follows the list below
Once SourceData is populated, close connection with the source DBs
Create 3 output DataTables (OutputData) with an algorithm that parses each row of SourceData on a second-by-second basis and places it in the right row/column (based on the timestamp and point_id source columns) - and if there's no data for a given point in time, repeat the value from the previous second (a sketch of this step appears further below)
Open connection to an output DB (supposedly empty), or create one, if possible
Check whether any tables already exist there and drop them if so
Create 3 tables to contain all the columns (timestamp being the primary key for all 3 tables)
Populate these tables with the data from OutputData
Optimize the tables to reduce size
Save the DB to a backup folder, prompt the user to save the DB somewhere else as well, and display the final file size
Clear both SourceData and the OutputData tables to free up RAM
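As mentioned in step 2, here is a minimal sketch of pulling one time window from a single encrypted source DB into a DataTable. The provider/password connection-string keywords and the source table name "LogTable" are assumptions; only the timestamp, point_id and _VAL columns come from the description above:

    using System;
    using System.Data;
    using System.Data.OleDb;

    static class SourceReader
    {
        // Loads one time window from one source DB (step 2 of the plan).
        public static DataTable LoadWindow(string dbPath, string password, DateTime start, DateTime end)
        {
            var connStr = $@"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={dbPath};" +
                          $"Jet OLEDB:Database Password={password};";
            const string sql =
                "SELECT [timestamp], [point_id], [_VAL] FROM LogTable " +
                "WHERE [timestamp] BETWEEN ? AND ? ORDER BY [timestamp]";

            var table = new DataTable();
            using (var conn = new OleDbConnection(connStr))
            using (var cmd = new OleDbCommand(sql, conn))
            {
                // OleDb parameters are positional, so add them in the order they appear in the SQL.
                cmd.Parameters.Add("@start", OleDbType.Date).Value = start;
                cmd.Parameters.Add("@end", OleDbType.Date).Value = end;
                using (var adapter = new OleDbDataAdapter(cmd))
                {
                    adapter.Fill(table); // Fill opens and closes the connection on its own
                }
            }
            return table;
        }
    }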
Is there a more efficient/easier way to achieve my goal? At first I was going for immediate read/write to/from the DBs, but I figured working with variables in memory would be a lot faster than file I/O...
Thank you all in advance!
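And here is a minimal sketch of the second-by-second forward-fill from step 4, working purely in memory. The layout is simplified to a single output table of strings; the real code would split the 511 variables across the 3 output tables and use tighter column types (it also needs a reference to System.Data.DataSetExtensions for AsEnumerable/Field):

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Linq;

    static class Consolidator
    {
        // Builds one wide row per second between start and end, repeating the last
        // logged value of each variable when nothing changed in that second.
        public static DataTable BuildOutput(DataTable sourceData, IList<string> pointIds,
                                            DateTime start, DateTime end)
        {
            var output = new DataTable();
            output.Columns.Add("timestamp", typeof(DateTime));
            foreach (var id in pointIds)
                output.Columns.Add(id, typeof(string));

            // Last known value per variable, carried forward second by second.
            var lastValue = pointIds.ToDictionary(id => id, id => (object)DBNull.Value);

            // Group the source rows by whole second for quick lookup.
            var bySecond = sourceData.AsEnumerable()
                .GroupBy(r => r.Field<DateTime>("timestamp"))
                .ToDictionary(g => g.Key, g => g.ToList());

            for (var t = start; t <= end; t = t.AddSeconds(1))
            {
                if (bySecond.TryGetValue(t, out var changes))
                    foreach (var change in changes)
                        lastValue[change.Field<string>("point_id")] = change["_VAL"];

                var row = output.NewRow();
                row["timestamp"] = t;
                foreach (var id in pointIds)
                    row[id] = lastValue[id];
                output.Rows.Add(row);
            }
            return output;
        }
    }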
I am looking for advice on how I should do the following:
I have a table in SQL Server with about 3-6 million records and 51 columns.
Only one column needs to be updated, after calculating a value from the data in 45 of the columns.
I already have the maths done in C#, and I am able to create a DataTable out of it [with millions of records, yes].
Now I want to update the database in the most efficient manner. The options I know of are:
Run an UPDATE query for every record, as I loop over the data reader to do the maths and build the DataTable.
Create a temporary table, use SqlBulkCopy to copy the data in, and then use a MERGE statement.
Though it would be very HARD to do, try to implement all the maths as a function within SQL and just run a simple unconditional UPDATE to update everything at once.
I am not sure which method is faster or better. Any ideas?
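For what it's worth, here is a minimal sketch of option 2 (stage the calculated values with SqlBulkCopy, then apply them in one MERGE). The table and column names (dbo.MainTable, Id, ComputedValue) are placeholders for the real schema, and the DataTable's columns are assumed to be named and ordered the same way:

    using System.Data;
    using System.Data.SqlClient;

    static class BulkUpdater
    {
        // results has two columns: Id (the key) and ComputedValue (the value calculated in C#).
        public static void Apply(string connStr, DataTable results)
        {
            using (var conn = new SqlConnection(connStr))
            {
                conn.Open();

                // 1. Stage the calculated values in a session-local temp table.
                using (var create = new SqlCommand(
                    "CREATE TABLE #Staging (Id INT PRIMARY KEY, ComputedValue FLOAT)", conn))
                {
                    create.ExecuteNonQuery();
                }

                using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "#Staging" })
                {
                    bulk.WriteToServer(results);
                }

                // 2. Apply everything in one set-based statement.
                using (var merge = new SqlCommand(
                    @"MERGE dbo.MainTable AS target
                      USING #Staging AS source ON target.Id = source.Id
                      WHEN MATCHED THEN UPDATE SET target.ComputedValue = source.ComputedValue;", conn))
                {
                    merge.ExecuteNonQuery();
                }
            }
        }
    }

An UPDATE ... FROM join against the staging table would work just as well as the MERGE here.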
EDIT: Why I am afraid of using a stored procedure
First, I have no idea how to write one; I am pretty new to this. Though maybe it is time to learn.
My formula is: take one column and apply one formula to it, along with an additional constant value [which is also part of the column name], then take all 45 columns and apply another formula.
The result will be stored in the 46th column.
Thanks.
If you have a field that contains a calculation from other fields in the database, it is best to make it a calculated field or to maintain it through a trigger so that anytime the data is changed from any source, the calculation is maintained.
You can create a .NET function which can be called directly from SQL; here is a link on how to create one: http://msdn.microsoft.com/en-us/library/w2kae45k%28v=vs.90%29.aspx. After you have created the function, run the simple UPDATE statement.
Can't you create a scalar-valued function in C# and call it as part of a computed column?
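If you do go down the CLR route from the link above, the shape of such a function is roughly the following; the two-parameter ComputeResult and its formula are only hypothetical stand-ins for the real 45-column calculation:

    using Microsoft.SqlServer.Server;
    using System.Data.SqlTypes;

    public class MathFunctions
    {
        // Deployed to SQL Server as a CLR scalar function; callable from T-SQL,
        // e.g. in a single set-based UPDATE or in a computed column definition.
        [SqlFunction(IsDeterministic = true, IsPrecise = false)]
        public static SqlDouble ComputeResult(SqlDouble col1, SqlDouble col2)
        {
            if (col1.IsNull || col2.IsNull)
                return SqlDouble.Null;

            // Hypothetical formula - replace with the real calculation.
            return col1 * 2.0 + col2;
        }
    }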
In part of my application I have to get the last ID of a table where a condition is met
For example:
SELECT MAX(ID) FROM TABLE WHERE Num = 2
So I can either grab the whole table and loop through it looking for Num = 2, or I can grab the data from the table where Num = 2. In the latter, I know the last item will be the MAX ID.
Either way, I have to do this around 50 times...so would it be more efficient grabbing all the data and looping through the list of data looking for a specific condition...
Or would it be better to grab the data several times based on the condition...where I know the last item in the list will be the max ID?
I have 6 conditions I will have to base the queries on
I'm just wondering which is more efficient...looping through a list of around 3,500 items several times, or hitting the database several times where I can already have the data broken down like I need it.
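For reference, a minimal sketch of asking the database directly, once per condition, instead of pulling the whole table (the connection string and the table name dbo.MyTable are placeholders):

    using System;
    using System.Data.SqlClient;

    class MaxIdLookup
    {
        // Returns the highest ID matching the condition, or null if nothing matches.
        static int? GetMaxId(string connStr, int num)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand("SELECT MAX(ID) FROM dbo.MyTable WHERE Num = @num", conn))
            {
                cmd.Parameters.AddWithValue("@num", num);
                conn.Open();
                var result = cmd.ExecuteScalar();
                return result == DBNull.Value ? (int?)null : (int)result;
            }
        }
    }

With an index on Num, each of these calls should be a cheap seek rather than a scan of the whole table.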
I can speak for SQL Server. If you create a stored procedure where Num is a parameter that you pass, you will get the best performance, because the execution plan of the stored procedure is optimized. Of course, an index on that field is mandatory.
Let the database do this work, it's what it is designed to do.
Does this table have a high insert frequency? Does it have a high update frequency, specifically on the column that you're applying the MAX function to? If the answer is no, you might consider adding an IS_MAX BIT column and set it using an insert trigger. That way, the row you want is essentially cached, and it's trivial to look up.
I'm storing objects in a database as varbinary(MAX) and want to know their filesize. Without getting into the pro and cons of using the varbinary(MAX) datatype, what is the best way to read the file size of an object stored in the database?
Is it:
A. Better to just read the column from the DB and call the .Length property of System.Data.Linq.Binary.
OR
B. Better to determine the file size of the object before it is added to the DB and create another column called Size.
The files I'm dealing with are generally between 0 and 3 MB with a skew towards the smaller size. It doesn't necessarily make sense to hit the DB again for the file size, but it also doesn't really make sense to read through the entire item to determine its length.
Why not add a calculated column in your database that would be DATALENGTH([your_col])?
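To illustrate, the computed column suggested above can be added once and then queried without ever transferring the blob itself; a sketch (dbo.Files, Content and Size are placeholder names):

    using System.Data.SqlClient;

    static class BlobSize
    {
        // One-off: add a computed Size column based on the stored byte length.
        public static void AddSizeColumn(string connStr)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand(
                "ALTER TABLE dbo.Files ADD Size AS DATALENGTH(Content)", conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();
            }
        }

        // Reads the size in bytes for one row without selecting the varbinary(MAX) value.
        // Assumes the row exists.
        public static long GetSize(string connStr, int id)
        {
            using (var conn = new SqlConnection(connStr))
            using (var cmd = new SqlCommand("SELECT Size FROM dbo.Files WHERE Id = @id", conn))
            {
                cmd.Parameters.AddWithValue("@id", id);
                conn.Open();
                return (long)cmd.ExecuteScalar();
            }
        }
    }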