Long time process that involved in large inserts batch

Long time process that involved in large inserts batch - c#

I have a process that takes a lot and I'm seeking ways to reduce it.
The process goes like this:
A system manager writes to text box 10-20 lines(1-3 words each)
Remove any empty and only white space lines
The system multiply each line with 3000 different suffixes (by suffix I mean additional 1-2 words)
Check for duplicate lines and remove them
Check for illegal chars and in the process duplicates against the db - if found any - remove the line.
For each line:
Select a line id (for parent id) - this is a query, like: select parentid from table where name='son'
Insert the line with the parent id
As I sees it the for each line insert takes the longest time but it is necessary for the parent id(it is one of the new lines). The Table has, among other, an id, name and parent id. The code is written in c# work with mysql.
For solution I think the only thing may be convert the code to be full mysql stored procedure, but I'm not sure how much it will helps.

Related

Ideas on incorrect ORDER BY results

I want to emphasize that I'm looking for ideas, not necessarily a concrete answer since it's difficult to show what my queries look like, but I don't believe that's needed.
The process looks like this:
Table A keeps filling up, like a bucket - an SQL job keeps calling SP_Proc1 every minute or less and it inserts multiple records into table A.
At the same time a C# process keeps calling another procedure SP_Proc2 every minute or less that does an ordered TOP 5 select from table A and returns the results to the C# method. After C# code finishes processing the results it deletes the selected 5 records from table A.
I bolded the problematic part above. It is necessary that the records from table A be processed 5 at a time in the order specified, but a few times a month SP_Proc2 selects the ordered TOP 5 records in a wrong order even though all the records are present in table A and have correct column values that are used for ordering.
Something to note:
I'm ordering by integers, not varchar.
The C# part is using 1 thread.
Both SP_Proc1 and SP_Proc2 use a transaction and use READ COMMITTED OR READ COMMITTED SNAPSHOT transaction isolation level.
One column that is used for ordering is a computed value, but a very simple one. It just checks if another column in table A is not null and sets the computed column to either 1 or 0.
There's a unique nonclustered index on primary key Id and a clustered index composed of the same columns used for ordering in SP_Proc2.
I'm using SQL Server 2012 (v11.0.3000)
I'm beginning to think that this might be an SQL bug or maybe the records or index in table A get corrupted and then deleted by the C# process and that's why I can't catch it.
Edit:
To clarify, SP_Proc1 commits a big batch of N records to table A at once and SP_Proc2 pulls the records from table A in batches of 5, it orders the records in the table and selects TOP 5 and sometimes a wrong batch is selected, the batch itself is ordered correctly, but a different batch was supposed to be selected according to ORDER BY. I believe Rob Farley might have the right idea.

My guess is that your “out of order TOP 5” is ordered, but that a later five overlaps. Like, one time you get 1231, 1232, 1233, 1234, and 1236, and the next batch is 1235, 1237, and so on.
This can be an issue with locking and blocking. You’ve indicated your processes use transactions, so it wouldn’t surprise me if your 1235 hasn’t been committed yet, but can just be ignored by your snapshot isolation, and your 1236 can get picked up.
It doesn’t sound like there’s a bug here. What I’m describing above is a definite feature of snapshot isolation. If you must have 1235 picked up in an earlier batch than 1236, then don’t use snapshot isolation, and force your table to be locked until each block of inserts is finished.

An alternative suggestion would be to use a table lock (tablock) for the reading and writing procedures.
Though this is expensive, if you desire absolute consistency then this may be the way to go.

SQL Server - Best practice to circumvent large IN (...) clause (>40000 items)

I'm developing an ASP.NET app that analyzes Excel files uploaded by user. The files contain various data about customers (one row = one customer), the key field is CustomerCode. Basically the data comes in form of DataTable object.
At some point I need to get information about the specified customers from SQL and compare it to what user uploaded. I'm doing it the following way:
Make a comma-separated list of customers from CustomerCode column: 'Customer1','Customer2',...'CustomerN'.
Pass this string to SQL query IN (...) clause and execute it.
This was working okay until I ran into The query processor ran out of internal resources and could not produce a query plan exception when trying to pass ~40000 items inside IN (...) clause.
The trivial ways seems to:
Replace IN (...) with = 'SomeCustomerCode' in query template.
Execute this query 40000 times for each CustomerCode.
Do DataTable.Merge 40000 times.
Is there any better way to work this problem around?
Note: I can't do IN (SELECT CustomerCode FROM ... WHERE SomeConditions) because the data comes from Excel files and thus cannot be queried from DB.

"Table valued parameters" would be worth investigating, which let you pass in (usually via a DataTable on the C# side) multiple rows - the downside is that you need to formally declare and name the data shape on the SQL server first.
Alternatively, though: you could use SqlBulkCopy to throw the rows into a staging table, and then just JOIN to that table. If you have parallel callers, you will need some kind of session identifier on the row to distinguish between concurrent uses (and: don't forget to remove your session's data afterwards).

You shouldn't process too many records at once, because of errors as you mentioned, and it is such a big batch that it takes too much time to run and you can't do anything in parallel. You shouldn't process only 1 record at a time either, because then the overhead of the SQL server communication will be too big. Choose something in the middle, process eg. 10000 records at a time. You can even parallelize the processing, you can start running the SQL for the next 10000 in the background while you are processing the previous 10000 batch.

What is the best optimization technique for a wildcard search through 100,000 records in sql table

I am working on an ASP.NET MVC application. This application is used by 200 users. These
users constantly (every 5 mins) search for an item from the list of 100,000 items (this list is going to increase every month by 1-2 %). This list of 100,000 items are stored in a SQL Server table.
The search is a wildcard search
eg:
Select itemCode, itemName, ItemDesc
from tblItems
Where itemName like '%SearchWord%'
The searching needs to really fast since the main business relies on searching and selecting the item.
I would like to know how to get the best performance. The search results have to come up instantaneously.
What I have tried -
I tried pre-loading the entire 100,000 records into memcache and then reading from the memcache. I was trying to avoid the calls to SQL Server for every search.
This takes a lot of time. Every time user searches for an item, we are retrieving 100,000 records from the memcache and then doing the search. This is taking almost 2-3 times more time than direct SQL searches.
I tried doing a direct search on the SQL Server table but limiting the results to only 50 records at a time (using top 50)
This seems to be Ok but still no-where near the performance we are seeking
I would like to hear the possible solutions and links to any articles/code.
Thanks in advance

Run SQL Profiler and do a tuning profile. This will make recommendations on indexes to execute against your database.
Also, a query such as the following would be worth a try.
SELECT *
FROM
(
SELECT ROW_NUMBER() OVER ( ORDER BY ColumnA) AS RowNumber, itemCode, itemName, ItemDesc
FROM tblItems
WHERE itemName LIKE '%FooBar%'
) AS RowResults
WHERE RowNumber >= 1 AND RowNumber < 50
ORDER BY RowNumber
EDIT: Updated query to reflect your real scenario.

How about having a search without the leading wildcard as your primary search....
Where itemName like 'SearchWord%'
and then have having a "More Results" button that loads
Where itemName like '%SearchWord%'
(alternatively exclude results from the first result set)
Where itemName not like 'SearchWord%' and itemName like '%SearchWord%'

A weird alternative which might work, as it depends on several assumptions etc. Sorry not fully explained but am using ipad so hard to type. (and yes, this solution has been used in high txn commericial systems)
This assumes
That your query is cpu constrained not IO
That itemName is not too long, such that it holds all letters and numbers
That searchword, in total, contains enough selective characters and isnt just highly common characters
Your selection predicates are constrained by a %like%
The basic idea is to expand your query to help the optimiser know which rows need the like scanning.
Step 1. Setup your table
Create an additional 26 or 36 columns for each letter/digit. When I've done this for real it has always been a seperate table, but putting it on source table should be ok for a small volume like 100k. Lets call the colmns trig_a, trig_b etc.
Create a trigger for each insert/edit/delete and put a 1 or 0 into the trig_a field if it contains an 'a', do this for all 26/36 columns. The trigger to do this is complex, but possible (at least using Oracle). If you get stuck I'm sure SO'ers can create it, or I can dig it out.
At this point, we have a series of columns that indicate whether a field contains a letter/digit etc.
Step 2. Helping you query
With this extra info, we are in the position to help the optimiser. Add the following to your query
Select ... Where .... And
((trig_a > 0) or (searchword not like '%a%')) and
((trig_b > 0) or (searchword not like '%b%')) and
... Repeat for all columns monitored...
If the optimiser behaves, it can use the (hopefully) lower cost field>0 predicates to reduce the like predicates evaluated.
Notes.
You may need to force the optimiser to scan trig_? Fields first
Indexes can help on trig_? Fields, especically if in the source table
I haven't shown how to handle upper/lower case, dont forget to handle this
You might find just doing a few letters is all you need to do.
This technique doesnt offer performance gains for every use of like, so it isnt a general purpose technique for everywhere you use a like.

Optimizing SDF filesize

I recently started learning Linq and SQL. As a small project I'm writing a dictionary application for Windows Phone. The project is split into two Applications. One Application (that currently runs on my PC) generates a SDF file on my PC. The second App runs on my Windows Phone and searches the database. However I would like to optimize the data usage. The raw entries of the dictionary are written in a TXT file with a filesize of around 39MB. The file has the following layout
germanWord \tab englishWord \tab group
germanWord \tab englishWord \tab group
The file is parsed into a SDF database with the following tables.
Table Word with columns _version (rowversion), Id (int IDENTITY), Word (nvarchar(250)), Language (int)
This table contains every single word in the file. The language is a flag from my code that I used in case I want to add more languages later. A word-language pair is unique.
Table Group with columns _version (rowversion), GroupId (int IDENTITY), Caption (nvarchar(250))
This table contains the different groups. Every group is present one time.
Table Entry with columns _version (rowversion), EntryId (int IDENTITY), WordOneId (int), WordTwoId(int), GroupId(int)
This table links translations together. WordOneId and WordTwoId are foreign keys to a row in the Word Table, they contain the id of a row. GroupId defines the group the words belong to.
I chose this layout to reduce the data footprint. The raw textfile contains some german (or english) words multiple times. There are around 60 groups that repeat themselfes. Programatically I reduce the wordcount from around 1.800.000 to around 1.100.000. There are around 50 rows in the Group table. Despite the reduced number of words the SDF is around 80MB in filesize. That's more than twice the size of the the raw data. Another thing is that in order to speed up the searching of translation I plan to index the Word column of the Word table. By adding this index the file grows to over 130MB.
How can it be that the SDF with ~60% of the original data is twice as large?
Is there a way to optimize the filesize?

The database file must contain all of the data from your raw file, in addition to row metadata -- it also will contain the strings based on the datatypes specified -- I believe your option here is NVARCHAR which uses two bytes per letter. Combining these considerations, it would not surprise me that a database file is over twice as large as a text file of the same data using the ISO-Latin-1 character set.

c# - SQL - speed up code to DB

I have a page with 26 sections - one for each letter of the alphabet. I'm retrieving a list of manufacturers from the database, and for each one, creating a link - using a different field in the Database. So currently, I leave the connection open, then do a new SELECT by each letter, WHERE the Name LIKE that letter. It's very slow, though.
What's a better way to do this?
TIA

Since you are going to fetch them all anyway, you might find it faster to fetch them in one go and split them into letter-groups in the code.
Looking at it from the other end, why do you need to fetch all the lists just to build a set of links? Shouldn't you fetch a single letter when its link is clicked?

It sounds like you are doing up to 26 queries, which will never be fast. Often a single db query can take at least 40 ms, due to network latency, establishing connection, etc. So, doing this 26 times means that it will take around 40 x 26 ms, or more than one second. Of course, it can take much longer depending on your schema, data set, hardware, etc., but this is a rule of thumb that gives you a rough idea of the impact of queries on overall page render time.
One way I deal with this kind of situation is to use a DataTable. Fetch all the records into the DataTable, and then you can iterate through the alphabet, and use the Select method to filter.
DataTable myData = GetMyData();
foreach(string letter in lettersOfTheAlphabet)
{
myData.Filter(String.Format("Name like '{0}%'", letter));
//create your link here
}
Depending on your model layer you may wish to filter in a different way, but this is the basic idea that should improve the performance a lot.
Assuming you are querying to determine which letters are used, so that you know which links to render, you could actually just query for the letters themselves, like this:
select distinct substring(ManufacturerName, 1, 1) as FirstCharacter
from MyTable
order by 1

get one result set from one query and split that up. There is quite a lot of overhead going out the the database 26 times to do basically the same work!

You could probably do it smarter with a stored procedure. Let the SP return all the information you need in one call, and suddenly you only have one database interaction instead of 26...

Bring back all the items in one set (dataset, etc..), either through stored procedure or query, including the field left(col1,1), and sorting by that field..
select left(col1,1) as LetterGroup, col1, url_column from table1 order by left(col1,1)
Then look through the whole resultset, changing sections when the letter changes.

First letter in the alphabet sucks (sorry) as discriminator. You do not neet to split them actually (you could just ask for "where name like 'a%'), but whatever you run for that gives you on average a 1/26 or so split of the names. Not extremely efficient.
What do you mean with "creating a link - using a different field in the Database" - this sounds like a bad design to me.

there are a couple ways u can do this. 1) create a view in your db that has all the manufactures and their website link and then continue to hit the view for each letter. 2) select all the manufactures once and store it in a .net dataset and then use that dataset to populate your links.

This seems dirty to me, but you could create a first letter CHAR column and trigger to populate it. Have the first letter from the manufacturer name stored in that column and index it. Then select * from table where FirstLetter = 'A'.
Or create a lookup table with rows A - Z and set up foreign key in the manufacturer table. Again you would probably need a trigger to update this information. Then you could inner join the lookup table to the manufacturer table.
Then instead of putting 26 datasets in the page, have a list of links (A-Z) which select and show each dataset one at a time.

If I read you right, you're making a query for every manufacturer to get the "different field" you need to construct the link. If so, that's your problem, not the 26 alphabetic queries (though that's no help).
In a case like that, the faster way is this one query:
SELECT manufacturer_name, manufacturer_id, different_field
FROM manufacturers m
INNER JOIN different_field_table d
ON m.manufacturer_id = d.manufacturer_id
ORDER BY manufacturer_name
In your server code, loop through the records as usual. If you want, emit a heading when the first letter of the manufacturer_name changes.
For additional speed:
Put that in a stored procedure.
Index different_field_table on manufacturer_id.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.