I need to do some in-memory merging in C# of two sorted streams of strings coming from one or more SQL Server 2000 databases into a single sorted stream. These streams of data can be huge, so I don't want to pull both streams into memory. Instead, I need to keep one item at a time from each stream in memory and, at each step, compare the current item from each stream, push the minimum onto the final stream, and pull the next item from the appropriate source stream. To do this correctly, though, the in-memory comparison has to match the collation of the database. Consider the streams [A,B,C] and [A,B,C]: the correct merged sequence is [A,A,B,B,C,C], but if the in-memory comparison thinks C < B, the merge will yield A,A,B, at which point it will be looking at a B and a C and will yield the C, producing an incorrectly sorted stream.
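A minimal sketch of the merge I have in mind, assuming each stream is exposed as IEnumerable<string> and already sorted by the database; the open question is which comparer to plug in:

using System.Collections.Generic;

// Merges two already-sorted streams, holding only one item from each in memory.
// "comparer" must order strings exactly like the database collation, or the
// output will not be correctly sorted.
static IEnumerable<string> MergeSorted(
    IEnumerable<string> left, IEnumerable<string> right, IComparer<string> comparer)
{
    using (var l = left.GetEnumerator())
    using (var r = right.GetEnumerator())
    {
        bool hasL = l.MoveNext(), hasR = r.MoveNext();
        while (hasL && hasR)
        {
            if (comparer.Compare(l.Current, r.Current) <= 0)
            {
                yield return l.Current;
                hasL = l.MoveNext();
            }
            else
            {
                yield return r.Current;
                hasR = r.MoveNext();
            }
        }
        while (hasL) { yield return l.Current; hasL = l.MoveNext(); }
        while (hasR) { yield return r.Current; hasR = r.MoveNext(); }
    }
}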
So, my question is: is there any way to mimic any of the collations in SQL Server 2000 with a System.StringComparison enum in C# or vise-versa? The closest I've come is to use System.StringCompaison.Ordinal with the results of the database strings converted to VARBINARY with the standard VARBINARY ordering, which works, but I'd rather just add an "order by name collate X" clause to my SQL queries, where X is some collation that works exactly like the VARBINARY ordering, rather than converting all strings to VARBINARY as they leave the database and then back to strings as they come in memory.
Have a look at the StringComparer class. It provides more robust character and string comparisons than you'll get from String.Compare. There are three sets of static instances (CurrentCulture, InvariantCulture, Ordinal) and case-insensitive versions of each. For more specialized cultures, you can use the StringComparer.Create() method to create a comparer tied to a particular culture.
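For example (a small sketch; the culture name is just an illustration):

using System;
using System.Globalization;

// Built-in comparers:
StringComparer ordinal = StringComparer.Ordinal;               // compares UTF-16 code units numerically
StringComparer invariant = StringComparer.InvariantCulture;    // culture-insensitive linguistic rules
StringComparer ordinalCi = StringComparer.OrdinalIgnoreCase;   // case-insensitive ordinal

// A comparer tied to a specific culture (the culture name is illustrative):
StringComparer swedish = StringComparer.Create(new CultureInfo("sv-SE"), false);

int result = swedish.Compare("ängel", "zebra");   // ordering follows Swedish rules
Console.WriteLine(result);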
With SQL Server 2005 I know that the database engine does not make OS calls to do the sorting; the ordering rules ship statically with the database (they may be updated by a service pack, but they don't change with the OS). So I don't think you can safely assume that application code will order strings the same way as the database server, unless you use a binary collation.
But if you use a binary collation in both the database and the client code, you should have no problem at all.
EDIT - any collation that ends in _BIN will give you binary sorting. The rest of the collation name determines which code page is used for storing CHAR data, but it does not affect the ordering. See http://msdn.microsoft.com/en-us/library/ms143515(SQL.90).aspx
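Pairing a binary ORDER BY on the server with an ordinal comparison in memory might look like this (a sketch; table, column, and collation names are illustrative, and for non-ASCII nvarchar data it's worth verifying that the two orderings agree, since _BIN compares bytes while Ordinal compares UTF-16 code units; on SQL Server 2005+ a _BIN2 collation is the closer match):

using System.Collections.Generic;
using System.Data.SqlClient;

// Stream names from the database sorted with a binary collation, so the
// in-memory merge can use the matching ordinal comparer.
static IEnumerable<string> ReadSortedNames(SqlConnection conn)
{
    const string query =
        "SELECT name FROM dbo.SourceTable ORDER BY name COLLATE Latin1_General_BIN";

    using (var cmd = new SqlCommand(query, conn))
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            yield return reader.GetString(0);
    }
}

// In-memory side: StringComparer.Ordinal, which lines up with the binary
// collation for ASCII data (spot-check anything wider).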
Question: Is it possible to search for values that are in between (i.e., BETWEEN, greater than, and less than type math operators) each other when the data is stored in a VARBINARY data type?
Problem: I have a list of IP addresses (both IPv4 and IPv6) where I need to determine the geolocation of that IP address, which means I need to search between ranges.
Typically, this can be accomplished by converting the address to an integer and then using the BETWEEN operator. However, since IPv6 addresses exceed the range of all numeric, decimal, and integer data types as of this posting, it appears that I need to store the data in a VARBINARY column.
I have not used this data type in the past, so I am not aware of how, or if it is even possible, to search between values. My searches online have not turned up any hits, so I am asking here.
Note: currently using SQL Server 2014, but will be migrating to SQL Server 2017 for this project.
Your approach is correct.
You can use the standard comparison operators (>, <, BETWEEN) directly on VARBINARY values.
Here is an accepted answer on the MSDN forums. The link may break in the future, so I am also pasting the query below.
Questions about dealing with IPV6 varbinary and comparing hex values in a range?
Query:
DECLARE @b1 varbinary(16) = convert(varbinary(16), newid()),
        @b2 varbinary(16) = convert(varbinary(16), newid())

SELECT CASE WHEN @b1 > @b2 THEN '@b1 is bigger' ELSE '@b2 is bigger' END
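Applied to the IP-range lookup, the client side might look roughly like this (a sketch; table and column names are illustrative, and it assumes range_start/range_end were stored with the same byte layout that IPAddress.GetAddressBytes() produces, with IPv4 and IPv6 ranges kept in consistent formats):

using System.Data;
using System.Data.SqlClient;
using System.Net;

// Finds the geolocation range containing an address stored as VARBINARY(16).
static string FindLocation(SqlConnection conn, IPAddress ip)
{
    byte[] key = ip.GetAddressBytes();   // 4 bytes for IPv4, 16 for IPv6

    using (var cmd = new SqlCommand(
        "SELECT TOP (1) location FROM dbo.IpRanges " +
        "WHERE @ip BETWEEN range_start AND range_end", conn))
    {
        cmd.Parameters.Add("@ip", SqlDbType.VarBinary, 16).Value = key;
        return cmd.ExecuteScalar() as string;   // null if no range matches
    }
}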
Overview
This question is a more specific version of this one:
sql server - performance hit when passing argument of C# type Int64 into T-SQL bigint stored procedure parameter
But I've noticed the same performance hit for other data types (and, in fact, in my case I'm not using any bigint types at all).
Here are some other questions that seem like they should cover the answer to this question, but I'm observing the opposite of what they indicate:
c# - When should "SqlDbType" and "size" be used when adding SqlCommand Parameters? - Stack Overflow
.net - What's the best method to pass parameters to SQLCommand? - Stack Overflow
Context
I've got some C# code for inserting data into a table. The code is itself data-driven in that some other data specifies the target table into which the data should be inserted. So, though I could use dynamic SQL in a stored procedure, I've opted to generate the dynamic SQL in my C# application.
The command text is always the same for every row I insert, so I generate it once, before inserting any rows. The command text is of the form:
INSERT SomeSchema.TargetTable ( Column1, Column2, Column3, ... )
VALUES ( SomeConstant, @p0, @p1, ... );
For each insert, I create an array of SqlParameter objects.
For the 'nvarchar' behavior, I'm just using the SqlParameter(string parameterName, object value) constructor method, and not setting any other properties explicitly.
For the 'degenerate' behavior, I was using the SqlParameter(string parameterName, SqlDbType dbType) constructor method and also setting the Size, Precision, and Scale properties as appropriate.
For both versions of the code, the value either passed to the constructor method or separately assigned to the Value property has a type of object.
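For reference, the two construction styles look roughly like this (a sketch; the parameter names and the decimal(18,4) target type are illustrative):

using System.Data;
using System.Data.SqlClient;

// The two parameter-creation styles described above.
static SqlParameter[] BuildParameters(object someValue)
{
    // 'nvarchar' style: ADO.NET infers the type from the runtime value
    // (a string becomes nvarchar) and SQL Server converts it on arrival.
    var inferred = new SqlParameter("@p0", someValue);

    // 'type-specific' style: declare the target type, size, precision, and scale up front.
    var typed = new SqlParameter("@p1", SqlDbType.Decimal)
    {
        Precision = 18,
        Scale = 4,
        Value = someValue
    };

    return new[] { inferred, typed };
}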
The 'nvarchar' version of the code takes about 1-1.5 minutes. The 'degenerate' or 'type-specific' code takes longer than 9 minutes, i.e. 6-9 times slower.
SQL Server Profiler doesn't reveal any obvious culprits. The type-specific code is generating what would seem like better SQL, i.e. a dynamic SQL command whose parameters contain the appropriate data type and type info.
Hypothesis
I suspect that, because I'm passing an object-typed value as the parameter value, the ADO.NET SQL Server client code is casting, converting, or otherwise validating the value before generating and sending the command to SQL Server. I'm surprised, though, that the conversion from nvarchar to each of the relevant target-table column types that SQL Server must be performing is so much faster than whatever the client code is doing.
Notes
I'm aware that SqlBulkCopy is probably the best-performing option for inserting large numbers of rows, but I'm more curious why the 'nvarchar' case outperforms the 'type-specific' case, and my current code is fast enough as-is given the amount of data it routinely handles.
The answer does depend on the database you are running, but it has to do with the character encoding process. SQL Server introduced the NVarChar and NText field types to handle UTF-16 encoded data, which also happens to be the internal string representation of the .NET CLR. NVarChar and NText values don't have to be converted to another character encoding, a conversion that takes a very short but measurable amount of time.
Other databases allow you to define the character encoding at the database level, and others let you define it on a column-by-column basis. The performance differences really depend on the driver.
Also important to note:
Inserting via a prepared statement emphasizes the inefficiency of converting to the database's internal format.
This has no bearing on how efficiently the database queries against a string; however, UTF-16 takes up more space than the default Windows-1252 encoding used for Text and VarChar (see the sketch below).
Of course, in a global application, UTF support is necessary.
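A quick illustration of that storage difference (a sketch):

using System;
using System.Text;

// On .NET Core / .NET 5+, code page 1252 additionally requires:
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
string s = "Hello, world";
int utf16Bytes = Encoding.Unicode.GetByteCount(s);            // 24 bytes (NVarChar / NText)
int ansiBytes = Encoding.GetEncoding(1252).GetByteCount(s);   // 12 bytes (VarChar / Text, Windows-1252)
Console.WriteLine($"UTF-16: {utf16Bytes} bytes, Windows-1252: {ansiBytes} bytes");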
They're Not (but They're Almost as Fast)
My original discrepancy was entirely my fault. The way I was creating the SqlParameter objects for the 'degenerate' or 'type-specific' version of the code used one more loop than the 'nvarchar' version. Once I rewrote the type-specific code to use the same number of loops (one), the performance is almost the same. [About 1-2% slower now instead of 500-800% slower.]
A slightly modified version of the type-specific code is now a little faster, at least based on my (limited) testing: about 3-4% faster for ~37,000 command executions.
But it's still (a little) surprising that it's not even faster, as I'd expect SQL Server converting hundreds of nvarchar values to lots of other data types (for every execution) to be significantly slower than the C# code to add type info to the parameter objects. I'm guessing it's really hard to observe much difference because the time for SQL Server to convert the parameter values is fairly small relative to the time for all of the other code (including the SQL client code communicating with SQL Server).
One lesson I hope to remember is that it's very important to compare like with like.
Another seeming lesson is that SQL Server is pretty fast at converting text to its various other data types.
I understand that collation can be set differently for different tables in a database. My understanding of collation comes from What does character set and collation mean exactly?
There is a query that performs a CAST on a char result, as shown below. There are no tables involved. I assume the encoding applied will be based on the database-level collation. Is this assumption correct?
SELECT CAST ( SSS.id_encrypt ('E','0000000{0}') AS CHAR(100) FOR BIT DATA)
AS ENCRYPT_ID FROM FFGLOBAL.ONE_ROW FETCH FIRST 1 ROW ONLY
QUESTION
In the question Get Byte[] from Db2 without Encoding, the answer given by @AlexFilipovici [.Net BlockCopy] provides a different result when compared to the CAST result. Why is that, if there is no code page associated?
Based on National language support - Character conversion
Bit data (columns defined as FOR BIT DATA, or BLOBs, or binary strings) is not associated with any character set.
REFERENCE
Get Byte[] from Db2 without Encoding
Default code page for new databases is Unicode
National language support - Character conversion
To find out the collation at database level in SQL Server, try this:
SELECT DATABASEPROPERTYEX('databasename', 'Collation');
More: DATABASEPROPERTYEX
To answer your questions:
#1: Specifying FOR BIT DATA on a character-based data type (in DB2) means that DB2 stores / returns the raw data back with no codepage associated (i.e. it's just a string of bytes and will not go through any codepage conversion between client and server).
#2: In DB2 for Linux, UNIX and Windows, you can determine the database's collation by querying SYSIBMADM.DBCFG
select name,value
from sysibmadm.dbcfg
where name in ('codepage','codeset');
#3: Per @Iswanto San:
SELECT DATABASEPROPERTYEX('databasename', 'Collation');
I have a Windows application written in C# that needs to load 250,000 rows from a database and provide a "search as you type" feature: as soon as the user types something in a text box, the application needs to search all 250,000 records (which are, by the way, a single column with 1000 characters per row) using a LIKE search and display the matching records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>:
while (objSQLReader.Read())
{
    lstEmployees.Add(new EmployeesData(
        Convert.ToInt32(objSQLReader.GetString(0)),
        objSQLReader.GetString(1),
        objSQLReader.GetString(2)));
}
2- In the TextChanged event, using LINQ (in combination with a regular expression), I search and attach the IEnumerable<EmployeesData> to a ListView which is in virtual mode:
String strPattern = "(?=.*wood*)(?=.*james*)";

IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
                                           where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
                                           select objEmployee;

lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to supply each item the ListView displays:
e.Item = new ListViewItem(new String[] {
    lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
    e.ItemIndex.ToString() });
Though lstEmployees loads from SQL Server relatively fast (1.5 seconds), the LINQ search in TextChanged takes more than 7 minutes. Searching directly in SQL Server with a LIKE query takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help.
Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented using B+ trees, which have an average search time on the order of log base 2 of n, where n is the number of records. This means that if you have 250,000 records in the table, the number of operations required to search is log base 2 (250,000), or approximately 18 operations.
When you load all of the information into a data reader and then use a LINQ expression, it's a linear operation, O(n), where n is the length of the list. So, worst case, it's going to be 250,000 operations. If you use a DataView, there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string'), which negates the use of an index, and there is an index on the table, you will have really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1), assuming the dictionary is implemented using a hash table.
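A sketch of the dictionary idea (the tokenization is illustrative; EmployeesData and SearchStr in the usage comment come from the question):

using System;
using System.Collections.Generic;

// Index each record under the words in its search string, so an exact-word
// lookup is a single hash probe (O(1) on average).
static Dictionary<string, List<T>> BuildWordIndex<T>(
    IEnumerable<T> records, Func<T, string> searchText)
{
    var index = new Dictionary<string, List<T>>(StringComparer.OrdinalIgnoreCase);
    foreach (var record in records)
    {
        foreach (var word in searchText(record).Split(' ', ',', ';'))
        {
            if (word.Length == 0) continue;
            if (!index.TryGetValue(word, out var bucket))
                index[word] = bucket = new List<T>();
            bucket.Add(record);
        }
    }
    return index;
}

// Usage (types from the question):
//   var index = BuildWordIndex(lstEmployees, e => e.SearchStr);
//   index.TryGetValue("james", out var matches);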
You'd want to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
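A minimal sketch of such a trie (type and member names are illustrative); it trades memory for prefix lookups whose cost is only the length of the typed text:

using System.Collections.Generic;

// Minimal prefix trie: Insert() each record's key, then Search(prefix) returns
// the records whose key starts with the typed text.
class Trie<T>
{
    private sealed class Node
    {
        public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public List<T> Items = new List<T>();   // records reachable through this prefix
    }

    private readonly Node _root = new Node();

    public void Insert(string key, T item)
    {
        var node = _root;
        foreach (char c in key)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new Node();
            node = child;
            node.Items.Add(item);   // every node on the path records the item
        }
    }

    public IReadOnlyList<T> Search(string prefix)
    {
        var node = _root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                return new List<T>();   // no key has this prefix
        return node.Items;
    }
}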
See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1000), how many unique values are there? An in-memory structure based off of keys with pointers to records matching those keys really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full-Text Indexing, which handles searches like this much better than LIKE. This assumes a fast connection from the application to the database.
Full-Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10 GB of data.
If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are implementations in the .NET collections, such as List<T>.BinarySearch() and Array.BinarySearch().
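For instance (a small sketch; the key list is illustrative):

using System;
using System.Collections.Generic;

// Once the keys are sorted, BinarySearch finds a match (or the insertion
// point) in O(log n) comparisons.
var keys = new List<string> { "woods", "adams", "baker" };
keys.Sort(StringComparer.OrdinalIgnoreCase);

int i = keys.BinarySearch("woods", StringComparer.OrdinalIgnoreCase);
bool found = i >= 0;   // a negative result encodes the insertion point as ~i
Console.WriteLine(found ? $"found at index {i}" : "not found");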
I have a situation where I have to dynamically create my SQL strings, and I'm trying to use parameters and sp_executesql where possible so I can reuse query plans. From lots of reading online and personal experience, I have found "NOT IN"s and "INNER/LEFT JOIN"s to be slow performers and expensive when the base (left-most) table is large (1.5M rows with around 50 columns). I have also read that using any type of function should be avoided, as it slows down queries, so I'm wondering which is worse?
I have used this workaround in the past, although I'm not sure it's the best thing to do, to avoid using a "NOT IN" with a list of items when, for example, I'm passing in a list of 3-character strings with a pipe delimiter (only between elements):
LEN(@param1) = LEN(REPLACE(@param1, [col], ''))
instead of:
[col] NOT IN('ABD', 'RDF', 'TRM', 'HYP', 'UOE')
...imagine the list of strings being 1 to about 80 possible values long; this method doesn't lend itself to parameterization either.
In this example I can use "=" in place of a NOT IN; for an IN I would use a traditional list technique, or "!=" if that is faster, although I doubt it. Is this faster than using the NOT IN?
As a possible third alternative, what if I knew all the other possibilities (the IN possibilities, which could potentially be an 80-95x longer list) and passed those instead? This could be done in the application's business layer to take the workload off of SQL Server. It's not very good for query plan reuse, but if it shaves a second or two off a big nasty query, why the hell not.
I'm also adept at SQL CLR function creation. Since the above is string manipulation, would a CLR function be best?
Thoughts?
Thanks in advance for any and all help/advice/etc.
As Donald Knuth is often (mis)quoted, "premature optimization is the root of all evil".
So, first of all, are you sure that if you write your code in the clearest and simplest way (to both write and read), it performs slowly? If you're not sure, check it before reaching for any "clever" optimization tricks.
If the code is slow, check the query plans thoroughly. Most of the time, query execution takes much longer than query compilation, so usually you do not have to worry about query plan reuse. Hence, building optimal indexes and/or table structures usually gives significantly better results than tweaking how the query is built.
For instance, I seriously doubt that your query with LEN and REPLACE performs better than NOT IN; in either case, all the rows will be scanned and checked for a match. For a long enough list, the MSSQL optimizer would automatically create a temp table to optimize the equality comparison.
Even more, tricks like this tend to introduce bugs: for example, your workaround would give the wrong result if [col] = 'AB'.
IN queries are often faster than NOT IN, because for an IN it is enough to check only part of the rows. How efficient that is depends on whether you can produce a correct list for the IN quickly enough.
Speaking of passing a variable-length list to the server, there are many discussions here on SO and elsewhere. Generally, your options are:
table-valued parameters (MSSQL 2008+ only),
dynamically constructed SQL (error prone and/or unsafe),
temp tables (good for long lists, probably too much overhead in writing and execution time for short ones),
delimited strings (good for short lists of 'well-behaved' values - like a handful of integers),
XML parameters (somewhat complex, but works well - if you use a good XML library and do not construct complex XML text 'by hand').
Here is an article with a good overview of these techniques and a few more.
I have found "NOT IN"s and "INNER/LEFT JOIN"s to be slow performers and expensive when the base (left-most) table is large
It shouldn't be slow if you indexed your table correctly. Something that can make the query slow is if you have a dependent subquery. That is, the query must be re-evaluated for each row in the table because the subquery references values from the outer query.
I also have read that using any type of function should be avoided as it slows down queries
It depends. SELECT function(x) FROM ... probably won't make a huge difference to the performance. The problems are when you use function of a column in other places in the query such as JOIN conditions, WHERE clause, or ORDER BY as it may mean that an index cannot be used. A function of a constant value is not a problem though.
Regarding your query, I'd try using [col] NOT IN ('ABD', 'RDF', 'TRM', 'HYP', 'UOE') first. If this is slow, make sure that you have indexed the table appropriately.
First off, since you are only filtering out a small percentage of the records, chances are the index on col isn't being used at all so SARG-ability is moot.
So that leaves query plan reuse.
If you are on SQL Server 2008, replace @param1 with a table-valued parameter, and have your application pass that instead of a delimited list. This solves your problem completely.
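A sketch of what that can look like from C# (the table type dbo.CodeList and the table/column names are illustrative; the type has to be created once on the server):

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Pass the exclusion list as a table-valued parameter (SQL Server 2008+).
// Assumes a table type created once on the server, e.g.:
//   CREATE TYPE dbo.CodeList AS TABLE (code CHAR(3) NOT NULL PRIMARY KEY);
static SqlCommand BuildCommand(SqlConnection conn, IEnumerable<string> codes)
{
    var table = new DataTable();
    table.Columns.Add("code", typeof(string));
    foreach (var code in codes)
        table.Rows.Add(code);

    var cmd = new SqlCommand(
        "SELECT * FROM dbo.BigTable t " +
        "WHERE NOT EXISTS (SELECT 1 FROM @codes c WHERE c.code = t.col)", conn);

    var p = cmd.Parameters.AddWithValue("@codes", table);
    p.SqlDbType = SqlDbType.Structured;
    p.TypeName = "dbo.CodeList";
    return cmd;
}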
If you are on SQL Server 2005, I don't think it matters. You could split the delimited list and use NOT IN/NOT EXISTS against the table, but what's the point if you won't get an index seek on col?
Can anyone speak to the last point? Would splitting the list to a table var and then anti-joining it save enough CPU cycles to offset the setup cost?
EDIT, third method for SQL Server 2005 using XML, inspired by OMG Ponies' link:
DECLARE @not_in_xml XML
SET @not_in_xml = N'<values><value>ABD</value><value>RDF</value></values>'

SELECT * FROM Table1
WHERE @not_in_xml.exist('/values/value[text()=sql:column("col")]') = 0
I have no idea how well this performs compared to a delimited list or TVP.
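For what it's worth, the XML variant can be parameterized from C# roughly like this (a sketch; it builds the document with an XML API rather than string concatenation):

using System.Data;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

// Build the <values> document safely and pass it as an xml parameter to the
// query shown above.
static SqlCommand BuildNotInXmlCommand(SqlConnection conn, string[] excluded)
{
    string xml = new XElement("values",
        excluded.Select(v => new XElement("value", v))).ToString(SaveOptions.DisableFormatting);

    var cmd = new SqlCommand(
        "SELECT * FROM Table1 " +
        "WHERE @not_in_xml.exist('/values/value[text()=sql:column(\"col\")]') = 0", conn);
    cmd.Parameters.Add("@not_in_xml", SqlDbType.Xml).Value = xml;
    return cmd;
}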