I have a bulk insert of around 100,000 records going into an Oracle table that has one unique-value column. This bulk insert will happen two or three times a day, indefinitely, for years.
I need a robust mechanism to generate unique values for that column. I am building the dataset in memory and committing it to the database all at once.
Previously I created a sequence in Oracle and, while building each dataset row, hit the database to fetch a new sequence number and put it into that column. But this causes performance problems: for 100,000 records, 100,000 database round trips are needed.
Is there any other method? The unique-value column is VARCHAR2 with a maximum length of 20.
Why not just create an autonumber sequence using triggers if you're only doing a bulk insert?
You didn't say the numbers must be sequential (1..n), so perhaps you could generate GUIDs and represent them in a compact way. In the long run you might encounter a collision, in which case you can simply generate a new GUID.
The only problem I see is that you'd need 24 characters to represent the GUID in Base64 (22 if you strip the padding), which is still more than 20.
You can generate a sequential GUID, strip the '-' symbols, and insert 20 of its characters into the database. A GUID is not user-friendly, though, so no user will be able to remember it easily.
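To make the compact-GUID idea concrete, here is a minimal C# sketch (my own illustration, not code from either answer): it Base64-encodes the 16 GUID bytes, swaps in a URL/DB-friendly alphabet, and truncates from 22 to 20 characters to fit the VARCHAR2(20) column. The truncation discards roughly 12 bits of entropy, so the unique constraint plus regenerate-and-retry on collision, as suggested above, remains necessary.

    using System;

    static class CompactId
    {
        // Sketch: compact a new GUID into 20 Base64-style characters.
        // Truncating from 22 to 20 chars loses ~12 bits of entropy, so
        // keep the unique constraint and regenerate on a collision.
        public static string NewId20()
        {
            byte[] bytes = Guid.NewGuid().ToByteArray();   // 16 bytes
            string b64 = Convert.ToBase64String(bytes)     // 24 chars incl. "==" padding
                .TrimEnd('=')                              // 22 chars
                .Replace('+', '-')                         // avoid characters that can
                .Replace('/', '_');                        // bother URLs and queries
            return b64.Substring(0, 20);                   // fits VARCHAR2(20)
        }
    }

Because the values are generated entirely client-side, building the 100,000-row dataset requires no database round trips at all.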
I have a SQL database where one of the columns is a varchar value. This value is always unique; it's not decided by me but by a 3rd-party application that supplies the data. Its length is undefined and it is a mixture of numbers and letters. I should add that it's not declared as unique in the database because, to my knowledge, you can't do that for a varchar type?
Each week I run an import of this data from a CSV file. However, the only way I know to check whether I'm importing a unique value is to loop through each row in the database and compare it against each line in the CSV file.
Obviously this is very inefficient and is only going to get worse over time as the database gets bigger.
I've tried checking Google, but to no avail; it could be that I'm searching for the wrong thing, though.
Any pointers would be much appreciated.
Application is written in C#
Look at running a MERGE command in SQL instead of an INSERT; it lets you explicitly specify the action to take when a duplicate is found. A sketch follows.
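As a rough illustration of that suggestion, assuming a hypothetical target table dbo.Items whose Code column holds the 3rd-party value, and a staging table dbo.ImportStaging that the CSV rows were bulk-copied into first:

    using System.Data.SqlClient;

    static class CsvImport
    {
        // Sketch only: table and column names are invented for the example.
        // Existing codes are updated; new codes are inserted.
        public static void MergeStagedRows(string connectionString)
        {
            const string mergeSql = @"
                MERGE dbo.Items AS target
                USING dbo.ImportStaging AS source
                    ON target.Code = source.Code
                WHEN MATCHED THEN
                    UPDATE SET target.Payload = source.Payload
                WHEN NOT MATCHED THEN
                    INSERT (Code, Payload) VALUES (source.Code, source.Payload);";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(mergeSql, conn))
            {
                conn.Open();
                cmd.ExecuteNonQuery();  // one set-based statement replaces the per-row loop
            }
        }
    }

One round trip handles the whole file, and the matching is done set-based inside the database.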
Note that if the unique field has a unique index, then searching for a value is O(log n) rather than O(n). This means the overall cost of inserting N values is O(N log N) rather than O(N²). As N gets large, this is a substantial performance improvement.
Index the table on the unique field.
Do an 'if exists' check on the unique key field value. If it returns true, the row exists, so update it. If it returns false, this is a new row, so insert it. A sketch of this pattern follows.
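A minimal T-SQL version of that pattern, reusing the hypothetical dbo.Items/Code names from the MERGE sketch above; @code and @payload would be SqlParameter placeholders filled in per CSV line:

    // Upsert pattern from the answer above (sketch; names are invented).
    const string upsertSql = @"
        IF EXISTS (SELECT 1 FROM dbo.Items WHERE Code = @code)
        BEGIN
            UPDATE dbo.Items SET Payload = @payload WHERE Code = @code
        END
        ELSE
        BEGIN
            INSERT INTO dbo.Items (Code, Payload) VALUES (@code, @payload)
        END";

With an index on Code, each existence check is O(log n), which is where the O(N log N) total cost above comes from.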
I have a Windows Forms program in C# that adds a record to the database and can remove it.
In the database I have an ID column (which is an AutoNumber), but if I delete a record and then add another record in its place, the AutoNumber keeps increasing and doesn't fill in the missing numbers.
I mean that if I have 9 records in my Access database and I remove one, there will be 8, but when I add a new record I get ID 10 instead of 9.
Is there any solution for that?
If it's an AutoNumber, the database will generate a number greater than the last one used; this is how relational databases are supposed to work. Why would there be a solution for this? Imagine deleting 5: what would you want to happen then, have the AutoNumber create the next record as 5? If you are displaying the ID in your C# app (a bad idea), change it to some other value that you can control as you wish.
However, what you are trying to achieve does not make sense.
if I delete a record and then add another record in its place, the AutoNumber keeps increasing and doesn't fill in the missing numbers.
[...]
Is there any solution for that?
The short answer is "No". Once used, AutoNumber values are typically never re-used, even if the deleted record had the largest AutoNumber value in the table. This is due (at least in part) to the fact that the Jet/Ace database engine has to be able to manage AutoNumber values in a multi-user environment.
(One exception to the above rule is if the Access database is compacted then the next available AutoNumber value for a table with a sequential AutoNumber field is reset to Max(current_value)+1.)
For more details on how AutoNumber fields work, see my other answer here.
In MS Access there is no solution for this. But in the case of SQL Server, you can create your own function rather than using an identity column.
I have encoded various text values as ints, and I store the int value in the data table for better and faster searching. I have three options for displaying the text value:
Declare an enum in my code and display the text value according to the int value. This is static, and I have to change code whenever a new value is added.
To make it dynamic, store the int and text values in a table in another database that the admin owns. New values can be added by the admin in this table. I use an inner join to display the text value whenever a record is fetched.
Store the actual text in the respective data table. This will make searches slow.
My question is: which option is best under the following conditions?
Each data table has between 1 and 10 million records.
More than 5,000 users fetch, search, and update these tables.
There are at most 12 distinct text values, each at most 50 characters long.
There are 30 data tables with the above characteristics.
I like a combination of option #2 and option #1: use ints, but keep a dictionary table in another database.
Let me explain:
store the int and text in a table in another database;
in the origin table, store the int only;
do not join to the table in the other database to get the text; instead, cache the dictionary on the client and resolve the text from that cache (a sketch follows below).
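A minimal C# sketch of that client-side cache, assuming a hypothetical lookup table dbo.TextValues(Id, Text) in the admin-owned database: load it once, then resolve ints locally without any join.

    using System.Collections.Generic;
    using System.Data.SqlClient;

    static class TextLookup
    {
        static Dictionary<int, string> _cache;

        // Load the lookup table once (e.g. at startup); table and column
        // names are assumptions for illustration.
        public static void Load(string connectionString)
        {
            var cache = new Dictionary<int, string>();
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("SELECT Id, Text FROM dbo.TextValues", conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        cache[reader.GetInt32(0)] = reader.GetString(1);
            }
            _cache = cache;   // swap in atomically; call Load again to refresh
        }

        public static string Resolve(int id) =>
            _cache.TryGetValue(id, out var text) ? text : id.ToString();
    }

With at most 12 values of up to 50 characters each, the whole dictionary is tiny, so refreshing it periodically (or when the admin signals a change) costs almost nothing.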
I would not go for option 1, for the reason you give: enums are not meant to be lookups. You could replace option 1 with a hard-coded dictionary, but again the code would need to be recompiled every time a change is made, which is bad.
Storing the text in the data table (i.e. option 3) is bad when it is guaranteed to be heavily duplicated, as it is here. This is exactly where you should use a lookup table, as you suggest in option 2.
So yes, store them in a database table and administer them through that.
The join shouldn't take long at all if it is just to a small table. If you are worried, though, an alternative is to load the lookup table into a dictionary in code the first time you need it and resolve the values from there, as in the sketch above. I doubt you'll have problems just doing the join, though.
And I'd take this approach no matter what the conditions are (i.e. number of records, etc.). Your conditions do make it more sensible, though. :)
If you have literally millions of records, there's almost certainly no point in trying to spin up such a structure in server code or on the client in any form. It needs to be kept in a database, IMHO.
The query that creates the list needs to be smart enough to constrain the count of returned records to a manageable number. Perhaps partitioned views or stored procedures might help in this regard.
If this is primarily a read-only list, with updates only done in the context of management activities, it should be possible to make queries against the table very rapid with proper indexes and queries on the client side.
I have a C# app which allows the user to update some columns in the DB. My problem is that I have 300,000 records in the DB, and updating just 50,000 took 30 minutes. Can I do anything to speed things up?
My update query looks like this:
UPDATE MyTable SET UM = 'UM', Code = 'Code' WHERE Material = 'MaterialCode'
My only unique constraint is Material. I read the file the user selects, put the data into a DataTable, and then go row by row, updating the corresponding material in the DB.
Limit the number of indexes in your database, especially if your application updates data very frequently. Each index takes up disk space and slows the adding, deleting, and updating of rows. Create new indexes only after analyzing how the data is used, the types and frequencies of queries performed, and how your queries will use the new indexes.
In many cases the speed advantage of a new index outweighs the disadvantages of the additional space used and the slower row modification. However, avoid redundant indexes and create them only when necessary. For a read-only table, the number of indexes can be higher.
Use a non-clustered index on the table if updates are frequent.
Use a clustered index on the table if updates/inserts are not frequent.
The C# code may not be the problem; your UPDATE statement is what matters. The WHERE clause of the UPDATE statement is the place to look: you need an indexed column in it.
Another thing: is the Material field indexed? And does the WHERE clause really need to filter on a varchar value? Couldn't it be an integer-valued field?
Performance will be better if you filter on integer fields rather than strings. Not sure if that is possible for you. A sketch combining the indexing and batching advice follows.
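Pulling the advice in these answers together, a rough sketch assuming a hypothetical dbo.Materials table with nvarchar(50) columns: index the Material column once, then reuse a single parameterised command inside one transaction rather than issuing 50,000 ad-hoc statements.

    using System.Data;
    using System.Data.SqlClient;

    static class MaterialUpdater
    {
        // One-time DDL, run separately, to speed up the WHERE lookup:
        //   CREATE INDEX IX_Materials_Material ON dbo.Materials (Material);

        public static void UpdateAll(string connectionString, DataTable rows)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                using (var tx = conn.BeginTransaction())
                using (var cmd = new SqlCommand(
                    "UPDATE dbo.Materials SET UM = @um, Code = @code WHERE Material = @material",
                    conn, tx))
                {
                    cmd.Parameters.Add("@um", SqlDbType.NVarChar, 50);
                    cmd.Parameters.Add("@code", SqlDbType.NVarChar, 50);
                    cmd.Parameters.Add("@material", SqlDbType.NVarChar, 50);

                    foreach (DataRow row in rows.Rows)
                    {
                        cmd.Parameters["@um"].Value = row["UM"];
                        cmd.Parameters["@code"].Value = row["Code"];
                        cmd.Parameters["@material"].Value = row["Material"];
                        cmd.ExecuteNonQuery();   // plan compiled once, reused per row
                    }
                    tx.Commit();                 // one commit instead of one per row
                }
            }
        }
    }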
Up until now I've been using C#'s Guid.NewGuid() method to generate unique IDs that are stored as the ID field in some of my SQL Server database tables, using LINQ to SQL.
I've been told that, for indexing reasons, using a GUID is a bad idea and that I should use an auto-incrementing long instead. Will using a long speed up my database transactions? If so, how do I go about generating unique IDs of type long?
Both have pros and cons; what matters is how you use them.
Right off the bat, if you need identifiers that can work across several databases, you need GUIDs. There are some tricks with Long (manually assigning each database a different seed/increment), but these don't scale well.
As far as indexing goes, Long will give much better insert performance if the index is clustered (by default primary keys are clustered, but this can be modified for your table), since the table does not need to be reorganized after every insert.
As far as concurrent inserts are concerned, however, long (identity) columns will be slower than GUIDs: identity-value generation requires a series of exclusive locks to ensure that only one row gets the next sequential number. In an environment with many users inserting many rows all the time, this can be a performance hit. GUID generation in this situation is faster.
Storage-wise, a GUID takes up twice the space of a long (16 bytes vs. 8). Whether those extra 8 bytes make a noticeable difference in how many records fit on one leaf page, and thus in the number of pages pulled from disk during an average request, depends on the overall size of your row.
The "Queen of Indexing" - Kim Tripp - basically says it all in her indexing blog posts:
GUIDs as PRIMARY KEYs and/or the clustering key
The clustered index debate continues...
Ever increasing clustering key - the Clustered Index Debate......again!
Basically, her best practice is that an optimal clustering key should be:
unique
small
stable (never changing)
ever-increasing
GUIDs violate "small" and "ever-increasing" and are thus not optimal.
PLUS: your clustering key is added to each and every entry in each and every non-clustered index (as the lookup used to actually find the record in the database), so you want to make it as small as possible (INT = 4 bytes vs. GUID = 16 bytes). If you have hundreds of millions of rows and several non-clustered indexes, choosing an INT or BIGINT over a GUID can make a major difference, even just space-wise.
Have a look at this:
Is it better to use an uniqueidentifier(GUID) or a bigint for an identity column?
A long (bigint in SQL Server) is 8 bytes and a GUID is 16 bytes, so you halve the number of bytes SQL Server has to compare when doing a lookup.
For generating a long, use IDENTITY(1,1) when you create the field in the database.
So, using either CREATE TABLE or ALTER TABLE:
Field_NAME BIGINT NOT NULL PRIMARY KEY IDENTITY(1,1)
See the comments for posting this with LINQ to SQL.
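If you insert with plain ADO.NET rather than LINQ to SQL, a minimal sketch for getting the generated key back, assuming a hypothetical dbo.Orders table whose ID is a BIGINT IDENTITY as above:

    using System.Data.SqlClient;

    static long InsertOrder(string connectionString, string customer)
    {
        const string sql = @"
            INSERT INTO dbo.Orders (Customer) VALUES (@customer);
            SELECT CAST(SCOPE_IDENTITY() AS BIGINT);";  // identity created in this scope

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@customer", customer);
            conn.Open();
            return (long)cmd.ExecuteScalar();           // the new BIGINT key
        }
    }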
Use GUIDs when you need to consider import/export across multiple databases. GUIDs are often easier to work with than IDENTITY columns when the dataset has multiple child relationships, because you can generate GUIDs randomly in code, in a disconnected state from the database, and then submit all changes at once. When generated properly, GUIDs are insanely hard to duplicate by chance.
With identity columns, you often have to do an initial insert of a parent row and query for its new identity before adding child data. You then have to update all child records with the new parent identity before committing them to the database. The same goes for grandchildren and so on down the hierarchy. It adds up to a lot of work that seems unnecessary and mundane. You can do something similar to GUIDs by coming up with random integers without the IDENTITY specification, but the chance of collision grows greatly as you insert more records over time. (Guid.NewGuid() is similar to a random Int128, which doesn't exist yet.)
I use Byte (TinyInt), Int16 (SmallInt), Int32/UInt16 (Int), Int64/UInt32 (BigInt) for small lookup lists that do not change or data that does not replicate between multiple databases. (Permissions, Application Configuration, Color Names, etc.)
I imagine an index takes just as long to query against regardless of whether you are using a GUID or a long. There are usually other indexed fields in tables that are larger than 128 bits anyway (user names in a user table, for example). The difference between GUIDs and integers is the size of the index in memory, as well as the time spent populating and rebuilding indexes. The majority of database operations are reads; writes are minimal. Concentrate first on optimizing reads from the database, as they usually involve improperly optimized joins, improper paging, or missing indexes.
As with anything, the best thing to do is prove your point. Create a test database with two tables, one with an integer/long primary key and the other with a GUID. Populate each with a few million rows, and monitor the performance of each during CRUD operations (create, read, update, delete). You may find that there is a performance hit, but an insignificant one.
Servers often run on boxes without debugging environments or other applications taking up CPU, memory, and hard-drive I/O (especially with RAID). A development environment only gives you an idea of performance.
Consider creating sequential GUIDs from the .NET application:
http://dotnet-snippets.de/dns/sequential-guid-SID998.aspx
What are the performance improvement of Sequential Guid over standard Guid?
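For illustration, one common way to build such sequential ("comb") GUIDs in C#; this is my own sketch of the technique, not necessarily what the linked snippet does. The trailing bytes of a random GUID are overwritten with the high-order bytes of the current timestamp, and since SQL Server sorts uniqueidentifier values by those trailing bytes first, new values land at the end of a clustered index instead of splitting pages randomly.

    using System;

    static class SequentialGuid
    {
        public static Guid NewComb()
        {
            byte[] guidBytes = Guid.NewGuid().ToByteArray();
            byte[] timeBytes = BitConverter.GetBytes(DateTime.UtcNow.Ticks);

            if (BitConverter.IsLittleEndian)
                Array.Reverse(timeBytes);        // most significant byte first

            // Overwrite the last 6 GUID bytes with the top 48 bits of the
            // tick count (they increase every ~6.5 ms), keeping 10 random bytes.
            Array.Copy(timeBytes, 0, guidBytes, 10, 6);
            return new Guid(guidBytes);
        }
    }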
You can debate GUID or identity all day. I prefer the database to generate the unique value with an identity. If you merge data from multiple databases, add another column (to identify the source database, possibly a tinyint or smallint) and form a composite primary key.
If you do go with an identity, be sure to pick the right datatype, based on number of expected keys you will generate:
bigint - 8 Bytes - max positive value: 9,223,372,036,854,775,807
int - 4 Bytes - max positive value: 2,147,483,647
Note "number of expected keys " is different than the number of rows. If you mainly add and keep rows, you may find that an INT is enough with over 2 billion unique keys. I'll bet your table won't get that big. However, if you have a high volume table where you keep adding and removing rows, you row count may be low, but you'll go through keys fast. You should do some calculations to see how log it would take to go through the INTs 2 billion keys. If it won't use them up any time soon go with INT, otherwise double the key size and go with BIGINT.