Selecting IDs from a huge database - C#

I have a database with over 3,000,000 rows; each row has an id and an xml field stored as varchar(6000).
If I run SELECT id FROM bigtable it takes about 2 minutes to complete. Is there any way to get this down to 30 seconds?

Build a clustered index on the id column.
See http://msdn.microsoft.com/en-us/library/ms186342.aspx

You could apply indexes to your tables; in your case, a clustered index.
Clustered indexes:
http://msdn.microsoft.com/en-gb/library/aa933131(v=sql.80).aspx
I would also suggest filtering your query so it doesn't return all 3 million rows each time; this can be done using TOP or WHERE.
TOP:
SELECT TOP 1000 ID
FROM bigtable
WHERE:
SELECT ID FROM bigtable
WHERE id IN (1,2,3,4,5)
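If you are running this from C#, a minimal sketch of streaming the filtered IDs with ADO.NET might look like the following (the connection string and the int type of id are assumptions, not given in the question):
using System.Collections.Generic;
using System.Data.SqlClient;

// Minimal sketch (ADO.NET / SQL Server assumed): stream only the first 1000 IDs
// instead of pulling all 3 million rows to the client.
static List<int> GetTopIds(string connectionString)
{
    var ids = new List<int>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand("SELECT TOP 1000 id FROM bigtable ORDER BY id", connection))
    {
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                ids.Add(reader.GetInt32(0)); // assumes id is an int
            }
        }
    }
    return ids;
}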

First of all, 3 million records don't make a table 'huge'.
To optimize your query, you should do the following.
Filter your query; why do you need to get ALL your IDs?
Create a clustered index on the ID column to get a smaller lookup structure to search first before pointing to the selected row.
Helpful threads, here and here

Okay, why are you returning all the Ids to the client?
Even if your table has no clustered index (which I doubt), the vast majority of your processing time will be client-side, transferring the Id values over the network and displaying them on the screen.
Querying for all values rather defeats the point of having a query engine.
The only reason I can think of (perhaps I lack imagination) for getting all the Ids is some sort of misguided caching.
If you want to know how many you have, do
SELECT count(*) FROM [bigtable]
If you want to know whether an Id exists, do
SELECT count([Id]) FROM [bigtable] WHERE [Id] = 1 /* or some other Id */
This will return 1 row with a 1 or 0 indicating existence of the specified Id.
Both these queries will benefit massively from a clustered index on Id and will return minimal data with maximal information.
Both of these queries will return in less than 30 seconds, and in less than 30 milliseconds if you have a clustered index on Id.
Selecting all the Ids will provide no more useful information than these queries, and all it will achieve is a workout for your network and client.
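From C#, the existence check above could be run with ExecuteScalar; a rough sketch, using a parameter in place of the literal Id (the int type of Id and the connection string are assumptions):
using System.Data.SqlClient;

// Sketch only: returns true if the given Id exists in bigtable.
// Assumes Id is an int and a valid connection string is supplied.
static bool IdExists(string connectionString, int id)
{
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT count([Id]) FROM [bigtable] WHERE [Id] = @id", connection))
    {
        command.Parameters.AddWithValue("@id", id);
        connection.Open();
        return (int)command.ExecuteScalar() > 0;
    }
}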

You could index your table for better performance.
There are additional options you could use to improve performance as well, such as the partitioning feature.

Related

improve query performance on a SQL Server table containing 3.5 million rows and growing

I have written an application in C# connected to a SQL Server Express database. From the front end I populate a particular table every few seconds, inserting approximately 200~300 rows each time.
Currently the table contains approximately 3.5 million rows and keeps growing. The table definition is as below:
[DEVICE_ID] [decimal](19, 5) NULL,
[METER_ID] [decimal](19, 5) NULL,
[DATE_TIME] [decimal](19, 5) NULL,
[COL1] [decimal](19, 5) NULL,
[COL2] [decimal](19, 5) NULL,
.
.
.
.
[COL25] [decimal](19, 5) NULL
I have created a non-clustered index on the Date_Time column. Note that no unique column exists; if required I can add an identity (auto-increment) column, but my report generation logic is based entirely on the Date_Time column.
I usually query based on time. For example, if I need to calculate the variation in Col1 over a month, I need the first value of Col1 on the first day and the last value on the last day of the month. Likewise I need to run the query for flexible date ranges, and I usually need only the opening and closing values (based on Date_Time) for any chosen column.
To get the first value of col1 for the first day, the query is:
select top (1) COL1 from VALUEDATA where DeviceId=#DId and MeterId =#MId and Date_Time between #StartDateTime and #EndDateTime order by Date_Time
To get the last value of col1 for the last day, the query is:
select top (1) COL1 from VALUEDATA where DeviceId=#DId and MeterId =#MId and Date_Time between #StartDateTime and #EndDateTime order by Date_Time desc
But when I run the above queries it takes approximately 20~30 seconds. I believe this can be optimized further, but I don't know the way ahead.
One thought I had is to create another table, insert the first and last row for each day into it, and fetch the data from there. But I would rather avoid that if I can do something with the existing table and query.
Any input would be greatly appreciated.
To fully optimize those queries you need two different multi-column indexes:
CREATE INDEX ix_valuedata_asc ON VALUEDATA (DeviceId, MeterId, Date_Time);
CREATE INDEX ix_valuedata_des ON VALUEDATA (DeviceId, MeterId, Date_Time DESC);
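With those indexes in place, the two queries from the question could then be run from C# along these lines (a sketch only; the connection handling is assumed, and the decimal parameter types follow the table definition above):
using System;
using System.Data;
using System.Data.SqlClient;

// Sketch: opening value of COL1 in a date range (pass descending: true for the closing value).
// Assumes DeviceId, MeterId and Date_Time are decimal(19,5) as in the table definition.
static decimal? GetCol1Boundary(string connectionString, decimal deviceId, decimal meterId,
                                decimal startDateTime, decimal endDateTime, bool descending)
{
    string order = descending ? "DESC" : "ASC";
    string sql = "SELECT TOP (1) COL1 FROM VALUEDATA " +
                 "WHERE DeviceId = @DId AND MeterId = @MId " +
                 "AND Date_Time BETWEEN @StartDateTime AND @EndDateTime " +
                 "ORDER BY Date_Time " + order;

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.Add("@DId", SqlDbType.Decimal).Value = deviceId;
        command.Parameters.Add("@MId", SqlDbType.Decimal).Value = meterId;
        command.Parameters.Add("@StartDateTime", SqlDbType.Decimal).Value = startDateTime;
        command.Parameters.Add("@EndDateTime", SqlDbType.Decimal).Value = endDateTime;
        connection.Open();
        object result = command.ExecuteScalar();
        return result == null || result == DBNull.Value ? (decimal?)null : (decimal)result;
    }
}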
I have another suggestion: if your goal is to get the values of COL1, COL2, etc. after the index lookup, the solution with just a nonclustered index on the filtering columns still has to join back to the main table, i.e. do a bookmark / RID lookup.
Your info gives me the impression that your base table is not clustered (has no clustered index) and is in fact a heap table.
If most of your queries on the table follow the pattern you describe, I would make this table clustered. Contrary to what most people think, you do not have to define the clustered index on the (unique) primary key. If you define a clustered index in SQL Server on non-unique data, SQL Server will make it unique 'under water' by adding an invisible row identifier...
If the main, most often used selection / filter criterion on this table is date-time, I would change the table to the following clustered structure:
First, remove all non clustered indexes
Then add the following clustered index:
CREATE CLUSTERED INDEX clix_valuedata ON VALUEDATA (Date_Time, DeviceId, MeterId);
When using queries that follow your pattern, you will (probably!) get very performant clustered index SEEK access to your table if you look at the query execution plan. You now get all the other columns in the table for free, as bookmark lookups are no longer needed. This approach will probably scale better as the table grows too, because of the SEEK behaviour.

Slow Insert Time With Composite Primary Key in Cassandra

I have been working with Cassandra and I have hit a bit of a stumbling block. For how I need to search the data, I found that a composite primary key works great for what I need, but insert times for records in this Column Family go to the dogs with it and I am not entirely sure why.
Table Definition:
CREATE TABLE exampletable (
clientid int,
filledday int,
filledtime bigint,
id uuid,
...etc...
PRIMARY KEY (clientid, filledday, filledtime, id)
);
clientid = The internal id of the client. filledday = The number of days since 1/1/1900. filledtime = The number of ticks of the day at which the record was received. id = A Guid.
The day and time structure exists because I need to be able to filter by day easily and quickly.
I know Cassandra stores Column Families with composite primary keys quite differently. From what I understand, it will store everything as new columns off a base row keyed by the main component of the primary key. Is that the reason the inserts would be slow? When I say slow I mean that if I just have a primary key on id the insert will take ~200 milliseconds, but with the composite primary key (or any subset of it; I tried just clientid and id to the same effect) it will take upwards of 32 seconds for 1000 records. The select times are faster from the composite-key table, since with the standard-key table I have to apply secondary indexes and use 'ALLOW FILTERING' in order to get the proper records back (I know I could do this in code, but the concern is that I am dealing with some massive data sets and that will not always be practical or possible).
Am I declaring the Column Family or the primary key wrong for what I am trying to do? With all the unlisted, non-primary-key columns the table is 37 columns wide; would that be the problem? I am quite stumped at this point. I have not been able to find anything about others having similar problems.
Well, your partition key is the client id, so all writes per client go to one node. If you are writing lots of data per client, you could end up with a hotspot, thus decreasing your overall throughput.
Also, could you give an example of the queries that you run? In Cassandra, the data model always needs to resemble the queries you want to run. If you need "allow filtering", then it seems that something is not quite right with your data model. For instance, I don't really see the point of "filledtime" in your PK. If you want to query by time period, just replace your three column keys with a TimeUUID column "ts". This would create a wide row, with one column per entry with a unique timestamp, clustered/partitioned per client id.
This allows queries like:
select * from exampletable where clientid = 123 and ts > minTimeuuid('2013-06-18 16:23:00') and ts < minTimeuuid('2013-06-18 16:24:00');
Again, this would depend on the queries you actually need to run.
And lastly, for overall guidance on data modelling, take a look at this eBay tech blog. Reading it helped clear up some things for me.
Hope that helps!
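As a rough illustration, running that kind of time-range query from a C# client with the DataStax driver might look like the sketch below. The contact point, keyspace name, and the remodelled table (clientid partition key, "ts" timeuuid clustering column) are assumptions based on the suggestion above, not code from the question:
using System;
using Cassandra;

// Rough sketch only: assumes the DataStax C# driver and the remodelled table
// suggested above (PRIMARY KEY (clientid, ts) with ts as a timeuuid).
static void PrintIdsInRange()
{
    // Contact point and keyspace name are placeholders.
    var cluster = Cluster.Builder().AddContactPoint("127.0.0.1").Build();
    var session = cluster.Connect("examplekeyspace");

    var rows = session.Execute(
        "SELECT * FROM exampletable WHERE clientid = 123 " +
        "AND ts > minTimeuuid('2013-06-18 16:23:00') " +
        "AND ts < minTimeuuid('2013-06-18 16:24:00')");

    foreach (var row in rows)
    {
        Console.WriteLine(row.GetValue<Guid>("id")); // id is the uuid column from the original table
    }
}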

SQL Index Table Join

I am executing the following two queries against a SQL database from within my C# MVC application.
Query1
SELECT tableone.id, name, time, type, grade, product, element, value
FROM dbo.tableone INNER JOIN dbo.tabletwo ON dbo.tableone.id = dbo.tabletwo.id
Where name = '" + Name + "' Order By tableone.id Asc, element
Query2
Select DISTINCT element FROM dbo.tableone
INNER JOIN dbo.tabletwo ON dbo.tableone.id = dbo.tabletwo.id
Where name = '" + Name + "'"
Upon running the method that executes these queries, each query hangs, and oftentimes the next page of my application will not load for over a minute or it will time out on one or the other. When I run the same queries in SQL Server, each of them takes between 10 and 15 seconds, which is still too long.
How can I speed them up? I've never created a SQL index and I'm not sure how to create one for each of these, or if that's the right path to pursue.
Tableone currently has 20,808,805 rows and 3 columns; tabletwo has 597,707 rows and 6 columns.
Tableone
id(int, not null)
element(char(9), not null)
value(real, null)
Tabletwo
id(int, not null)
name(char(7), null)
time(datetime, null)
type(char(5), null)
grade(char(4), null)
product(char(14), null)
Firstly, as @Robert Co said, an index on tabletwo.name will help performance.
Also, are there indexes on tableone.id and tabletwo.id? I will assume there are, given they look like primary keys. If not, you definitely need to put indexes on them. I can see tableone to tabletwo is a many-to-one relation, which means you probably don't have a primary key on tableone. You seriously need to add a primary key to tableone, such as tableoneid, and make it a clustered index!
I think another reason here is that your tableone is much bigger than tabletwo, which is limited down even further by the where clause (name = 'Name'). This means you are joining a large table (tableone) to a small table (tabletwo with the where clause). In SQL, joining a large table to a small table is going to be slow.
The solution I can think of is to move some columns, such as 'type', to tableone, so that you can limit tableone to a small set in your query as well:
Select DISTINCT element FROM dbo.tableone
INNER JOIN dbo.tabletwo ON dbo.tableone.id = dbo.tabletwo.id
Where tableone.type = 'some type' and name = '" + Name + "'"
I am not quite sure how these suggestions fit into your data model; I just hope they help.
10 to 15 seconds with 20 million rows and no index? That's not bad!
As Ethen Li says it's all about indexes. In an ideal world you would create indexes on all columns that feature in a filter (JOINs and WHEREs) or ORDER BYs. However, as this could severely impact UPDATEs and INSERTs you need to be more practical and less ideal. With the information you have provided I would suggest creating the following indexes:
CREATE INDEX index1 ON tableone (name);
If tableone.id is your candidate key (that which uniquely identifies the row) you should also create an index on it (possibly clustered; it depends on how ID is generated):
CREATE UNIQUE INDEX IX1TableOne ON tableone (id);
Or
CREATE UNIQUE CLUSTERED INDEX IX1TableOne ON tableone (id);
For tabletwo: the same applies to ID as for tableone - create at least a unique index on ID.
With these indexes in-place you should find a significant performance improvement.
Alternatively to add primary key constraints:
ALTER TABLE tableone ADD CONSTRAINT pktableone PRIMARY KEY CLUSTERED (id);
ALTER TABLE tabletwo ADD CONSTRAINT pktabletwo PRIMARY KEY CLUSTERED (id);
On tableone this might take a while because the data might have to be physically re-ordered. Therefore, do it during a maintenance period when there are no active users.
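On the C# side, it is also worth passing Name as a parameter instead of concatenating it into the SQL string; that lets SQL Server reuse the query plan and avoids injection issues. A rough sketch of Query1 written that way (the connection string handling is assumed):
using System.Data.SqlClient;

// Sketch: Query1 from the question with Name passed as a parameter instead of
// being concatenated into the SQL text (helps plan reuse and avoids injection).
static void RunQuery1(string connectionString, string name)
{
    string sql =
        "SELECT tableone.id, name, time, type, grade, product, element, value " +
        "FROM dbo.tableone INNER JOIN dbo.tabletwo ON dbo.tableone.id = dbo.tabletwo.id " +
        "WHERE name = @name ORDER BY tableone.id ASC, element";

    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(sql, connection))
    {
        command.Parameters.AddWithValue("@name", name);
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // process each row as the application requires
            }
        }
    }
}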

What is a better approach performance wise

Let's say I need to fetch some records from the database, and filter them based on an enumeration-type property.
fetch List<SomeType>
filter on SomeType.Size
enumeration Size { Small, Medium, Large }
When displaying records, there will be a predefined value for the Size filter (e.g. Medium). In most cases, the user will select a value from the data already filtered by the predefined value.
There is a possibility that a user could also filter to Large, then filter to Medium, then filter to Large again.
I have different situations with same scenario:
List contains less than 100 records and 3-5 properties
List contains 100-500 records and 3-5 properties
List contains max 2000 records with 3-5 properties
What is my best approach here? Should I have a tab containing a grid for each enum value, or should I have one common grid and always filter, or something else?
I would do the filtering right on the database; if those fields are indexed, I suspect having the DB filter them would be much faster than filtering with C# after the fact.
Of course you can always cache the filtered database results so as to prevent multiple unnecessary database calls.
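For example, a very small sketch of that kind of caching, keyed per Size value (this uses the SomeType and Size names from the question; LoadFromDatabase is a hypothetical data-access call, not something from the question):
using System.Collections.Generic;

// Illustrative sketch: cache the database results per Size value so that
// toggling between Medium and Large does not hit the database every time.
private static readonly Dictionary<Size, List<SomeType>> _cache =
    new Dictionary<Size, List<SomeType>>();

public static List<SomeType> GetBySize(Size size)
{
    List<SomeType> items;
    if (!_cache.TryGetValue(size, out items))
    {
        items = LoadFromDatabase(size); // hypothetical data-access call
        _cache[size] = items;
    }
    return items;
}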
EDIT: as for storing the information in the database, suppose you had this field setup:
CREATE TABLE Tshirts
(
id int not null identity(1,1),
name nvarchar(255) not null,
tshirtsizeid int not null,
primary key(id)
)
CREATE TABLE TshirtSizes
(
id int not null, -- not auto-increment
name nvarchar(255)
)
INSERT INTO TshirtSizes(id, name) VALUES(1, 'Small')
INSERT INTO TshirtSizes(id, name) VALUES(2, 'Medium')
INSERT INTO TshirtSizes(id, name) VALUES(3, 'Large')
ALTER TABLE Tshirts ADD FOREIGN KEY(tshirtsizeid) REFERENCES TshirtSizes(id)
then in your C#
public enum TShirtSizes
{
Small = 1,
Medium = 2,
Large = 3
}
In this example, the TshirtSizes table is only there so the reader knows what the magic numbers 1, 2, and 3 mean. If you don't care about database readability you can omit that table and just have an indexed column.
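To tie it together, filtering on the database using the enum's underlying value as a parameter might look roughly like this (a sketch; the connection string is assumed, and the table and enum come from the definitions above):
using System.Collections.Generic;
using System.Data.SqlClient;

// Sketch: filter Tshirts on the database using the enum's underlying value.
static List<string> GetTshirtNames(string connectionString, TShirtSizes size)
{
    var names = new List<string>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = new SqlCommand(
        "SELECT name FROM Tshirts WHERE tshirtsizeid = @sizeId", connection))
    {
        command.Parameters.AddWithValue("@sizeId", (int)size);
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                names.Add(reader.GetString(0));
            }
        }
    }
    return names;
}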
Memory is usually cheap. Otherwise you could sort all the values once and retrieve based on comparison, which would be O(n). You could keep track of the positions of things and retrieve faster that way.

What is the fastest way to update sql table?

I have a C# app which allows the user to update some columns in a DB. My problem is that I have 300,000 records in the DB, and just updating 50,000 took 30 minutes. Can I do something to speed things up?
My update query looks like this:
UPDATE SET UM = 'UM', Code = 'Code' WHERE Material = 'MaterialCode'.
My only unique constraint is Material. I read the file the user selects, put the data in a DataTable, and then I go row by row and update the corresponding material in the DB.
Limit the number of indexes in your database, especially if your application updates data very frequently. Each index takes up disk space and slows the adding, deleting, and updating of rows. You should create new indexes only after analyzing how the data is used, the types and frequencies of the queries performed, and how your queries will use the new indexes.
In many cases, the speed advantages of creating new indexes outweigh the disadvantages of the additional space used and slower row modification. However, avoid redundant indexes and create them only when necessary. For a read-only table, the number of indexes can be increased.
Use a non-clustered index on the table if updates are frequent.
Use a clustered index on the table if updates/inserts are not frequent.
The C# code may not be the problem; your update statement is what matters. The WHERE clause of the update statement is the place to look: you need an indexed column in the WHERE clause.
Another thing: is the field Material indexed? Also, does the WHERE clause need to be on a field with a varchar value? Can't it be an integer-valued field?
Performance will be better if you filter on integer fields rather than strings. Not sure if this is possible for you.
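Since the question mentions updating row by row from a DataTable, one thing that often helps alongside indexing is to reuse a single parameterized UPDATE command inside one transaction instead of building a new statement per row. A rough sketch, assuming SQL Server and ADO.NET; the table name "mytable" and the nvarchar(50) parameter sizes are placeholders because they are not given in the question:
using System.Data;
using System.Data.SqlClient;

// Sketch: one parameterized UPDATE reused for every row of the DataTable,
// wrapped in a single transaction. "mytable" and the column sizes are placeholders.
static void UpdateRows(string connectionString, DataTable table)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();
        using (var transaction = connection.BeginTransaction())
        using (var command = new SqlCommand(
            "UPDATE mytable SET UM = @um, Code = @code WHERE Material = @material",
            connection, transaction))
        {
            command.Parameters.Add("@um", SqlDbType.NVarChar, 50);
            command.Parameters.Add("@code", SqlDbType.NVarChar, 50);
            command.Parameters.Add("@material", SqlDbType.NVarChar, 50);

            foreach (DataRow row in table.Rows) // assumes the DataTable has UM, Code and Material columns
            {
                command.Parameters["@um"].Value = row["UM"];
                command.Parameters["@code"].Value = row["Code"];
                command.Parameters["@material"].Value = row["Material"];
                command.ExecuteNonQuery();
            }
            transaction.Commit();
        }
    }
}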
