Optimizing a small table of data for best search (query) speed - C#

I have a table with 4 columns and N rows. At the start N will be around 1000, with a tendency to grow to about 3000.
1st: string, unique
2nd: int, with N/5 unique values
3rd: int, with 5 unique values
4th: the data value
The objective is to get to the value of the 4th column with different queries, e.g. "get the value where the 1st column is 17", "get all values where the 2nd column is 7", or "does any row have this data value". ~40% of queries will be against the 4th column, ~30% against the 3rd, ~20% against the 2nd, and ~10% against the 1st.
Since there would be around 100 queries per second, and around 2 changes (add/update/remove) per second against this table, I was wondering what would be the fastest way (in C#) to manage this data. Memory is not an issue.
I'm currently using a SortedDictionary, where the key is the 4th column's data value and the dictionary's value is a class containing the first three values. Checking the "4th column" is now easy by just using ContainsKey; and when querying by the other values I use something like:
foreach (var entry in Objects)
    if (entry.Value.Second == searchValue) { /* ... */ }
Any suggestions appreciated.

This is the equivalent of the problem of how much indexing to use on a table in a database. If you want fast lookups on all 4 columns, you could create a SortedDictionary per column and use the corresponding dictionary for lookups, but this will increase your add/update/remove time, since all 4 dictionaries have to be updated (not to mention locking as well). It all depends on how fast you want updates and lookups on the different columns to be.
However, given that multiple rows can have the same value in a column, and SortedDictionary requires unique keys, you may want to either write your own data structure or use one of the MultiSet classes available in C# collection libraries (C5 springs to mind, but there are several others).
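To illustrate the multi-index idea, here is a rough sketch (Row, IndexedTable and the member names are invented for it, and remove/update handling is omitted): each secondary index maps a column value to the rows holding it, and every change must touch all four indexes. Plain Dictionary is used rather than SortedDictionary, since all the queries described are exact matches:

using System.Collections.Generic;
using System.Linq;

class Row
{
    public string Col1;  // 1st column: unique string
    public int Col2;     // 2nd column: ~N/5 distinct values
    public int Col3;     // 3rd column: 5 distinct values
    public int Data;     // 4th column: the data value
}

class IndexedTable
{
    // one index per column; all of them must be kept in sync on every change
    private readonly Dictionary<string, Row> byCol1 = new Dictionary<string, Row>();
    private readonly Dictionary<int, List<Row>> byCol2 = new Dictionary<int, List<Row>>();
    private readonly Dictionary<int, List<Row>> byCol3 = new Dictionary<int, List<Row>>();
    private readonly Dictionary<int, List<Row>> byData = new Dictionary<int, List<Row>>();

    public void Add(Row row)
    {
        byCol1.Add(row.Col1, row);     // unique key, so a plain entry
        AddTo(byCol2, row.Col2, row);
        AddTo(byCol3, row.Col3, row);
        AddTo(byData, row.Data, row);
    }

    private static void AddTo(Dictionary<int, List<Row>> index, int key, Row row)
    {
        if (!index.TryGetValue(key, out List<Row> bucket))
            index[key] = bucket = new List<Row>();
        bucket.Add(row);
    }

    // O(1) average-case lookups instead of a linear scan:
    public bool ContainsData(int data) => byData.ContainsKey(data);

    public IEnumerable<Row> WhereCol2(int value) =>
        byCol2.TryGetValue(value, out List<Row> rows) ? rows : Enumerable.Empty<Row>();
}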

Related

C# Winforms Fastest Way To Query MS Access

This may be a dumb question, but I wanted to be sure. I am creating a WinForms app, and using a C# OleDbConnection to connect to an MS Access database. Right now, I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop when it is. Would performance improve if I used something like "SELECT * FROM table_name WHERE id=something", i.e. a WHERE clause, instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time a WHERE clause will beat retrieving the data and then filtering via a loop. This is for a few different reasons, but ultimately you are filtering the data before retrieving all of the rows, versus retrieving every row and then discarding the ones you don't need. Relational data should be dealt with according to set logic, which is how a WHERE clause works: on the data set as a whole. The loop is not set logic; it compares each individual row, expensively, discarding those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
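For instance, a minimal sketch of the WHERE version using OleDb (table and column names and the connection string are placeholders, not from the original post):

using System.Data.OleDb;

// Sketch: let the database filter instead of looping over SELECT *.
static object FindRow(string connString, int someId)
{
    using (var conn = new OleDbConnection(connString))
    using (var cmd = new OleDbCommand("SELECT * FROM table_name WHERE id = ?", conn))
    {
        cmd.Parameters.AddWithValue("?", someId); // OleDb parameters are positional
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            // only the matching row(s) come back from the database
            return reader.Read() ? reader["id"] : null;
        }
    }
}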
Yes, of course. Say you have an Access database file shared in a folder, and you deploy your .NET desktop application to each workstation. And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe, and this holds true EVEN if the table has 1 million rows. Traversing and pulling 1 million rows would take a HUGE amount of time, but if you add criteria to your SELECT, then in this case pulling one row is about a million times faster than pulling the whole table.
And say this is/was multi-user? Then again, even over a network, ONLY the ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network is that the Access data engine has a usable index on the criteria column. By default the PK column (ID) always has that index, so no worries there. But if, as above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs, and over a network that means significant amounts of data will be pulled without that index (or, if local, pulled from the file on disk).
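If the criteria column has no index yet, one can be added with Access DDL through the same connection. A sketch, assuming an open OleDbConnection named conn (the index name is made up):

using (var cmd = new OleDbCommand(
    "CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", conn))
{
    cmd.ExecuteNonQuery(); // run once; Access/Jet SQL supports CREATE INDEX
}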

Find each string in a list from a table column

I have a table that has about 1 million rows. One of the columns is a string, let's call it column A.
Now I need to work on a list L of about 1,000 strings, mostly one or two words, and I need to find all the records in the table where column A contains one of the 1,000 strings in the list L.
The only way I can think of is to use each string in L to do a full table scan, checking whether the string is a substring of the column A content of each row. But that will be O(n^2), and for a million rows it will take a very long time.
Is there a better way? Either in SQL or in C# code?
One million rows is a relatively small number these days. You should be able to pull all strings from column A, along with your table's primary key, into memory, and do a regex search using a very long regex composed from your 1000 strings:
var regex = new Regex("string one|string two|string three|...|string one thousand");
Since the regex gets compiled into a finite automaton, you should get reasonably fast scanning times for your strings. Once the filtering is complete, collect the IDs and query the full rows from the table using them.
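A sketch of that approach (the row shape here is an assumption, standing for the primary key and column A values already pulled into memory; Regex.Escape is added so the 1,000 strings are matched literally):

using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Build one alternation from the search strings and scan in memory.
static List<int> MatchingIds(IEnumerable<(int Id, string ColA)> rows,
                             IEnumerable<string> searchStrings)
{
    string pattern = string.Join("|", searchStrings.Select(Regex.Escape));
    var regex = new Regex(pattern, RegexOptions.Compiled);
    return rows.Where(r => regex.IsMatch(r.ColA))
               .Select(r => r.Id)
               .ToList();
}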
The best way to do this is using LINQ. Let's say that you have your list
List<string> test = new List<string> { "aaa", "ddd", "ddsc" };
then using LINQ you can construct
var match = YourTable.Where(t => test.Contains(t.YourFieldName));
I suggest looking into full text search; it won't decrease the number of operations you have to perform, but it will increase their performance.
Assuming you use SQL Server (you should always use the relevant tag to specify the RDBMS),
you can create a DataTable from your List<string> and send it to a stored procedure as a table-valued parameter.
Inside the stored procedure you can use a simple join of that table-valued parameter to your table, on database_table.col CONTAINS table_parameter.value (using full text search).
Of course, things will go a lot faster if you create a full text index, as suggested in the comments by Glorfindel.
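A sketch of the C# side of that call; the table type dbo.StringList (with a single Value column) and the procedure dbo.FindMatches are made-up names for the illustration and would have to exist on the server:

using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

// Pass the List<string> as a table-valued parameter.
static void RunSearch(SqlConnection conn, List<string> terms)
{
    var table = new DataTable();
    table.Columns.Add("Value", typeof(string));
    foreach (string term in terms)
        table.Rows.Add(term);

    using (var cmd = new SqlCommand("dbo.FindMatches", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        SqlParameter p = cmd.Parameters.AddWithValue("@terms", table);
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName = "dbo.StringList";
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read()) { /* consume the matching rows */ }
        }
    }
}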

LINQ to SQL performance using multiple LIKEs vs using a List

I have an MVC autocomplete that will search for any number of strings entered into a textbox to find an address.
For example, if they enter John Doe New York, my query will do a LIKE on all the columns in the customer table (first, last, address, city, state, zip) to see if it matches the term. Then will move to the next search word and do the same.
My question is: is it better to hit the SQL Server DB 4 times (in this example), doing LIKEs for each search term on each field, or would it be better to return approx 10,000 rows and search them in memory as a List?
The first would require a lot more DB I/O as it searches the tables, but the 2nd would require a lot more data coming into the app.
None of the data in the Customers table is full-text indexed and, at best, there would be a SQL index on the individual columns.
general part
it is better to let the DB do its job
If you go with the 4-queries approach, you will have:
time for each query: 6 comparisons per row for 1 word, so 6*4 comparisons in total; call it 24*q1 (q1 = average number of rows)
time to transmit 4 result sets: q2*4 (q2 = average number of filtered rows)
time to merge/filter the results on the client side, which is actually almost the same as the first part, 24 comparisons per row: again 24*q2
If you go with the fully-DB approach, you will have:
time for the one query: 6 comparisons for 4 words = 24 comparisons per row, i.e. 24*q1
time to transmit one result set: q2_filtered (q2_filtered < q2)
Since 24*q1 + q2*4 + 24*q2 > 24*q1 + q2_filtered, the answer is obvious: the database should filter the records.
If you want to keep the whole customer table in memory, of course it will be faster to perform your own search, which takes only the 24*q1 part; you get rid of the transmission part, but it will consume the web server's memory and you will have synchronization problems between memory and DB.
some details
depending on how you use LIKE, you can see very different performance: for example, LIKE 'ABC%' can use an index, but LIKE '%ABC%' cannot
there are some possible tricks, like this one: concatenate all columns into one, sort the characters in it and remove duplicates, and store the symbols in separate columns; the same can be done for words. This helps a bit because indexes can then be used, but you will get some false positive matches
if you really need to fetch data fast, use full-text indexes or other approaches specialized for this genuinely huge and general problem
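As a rough sketch of the single-round-trip version, here is a fragment assuming a LINQ to SQL DataContext db with a Customers table whose columns match the question (each Contains translates to LIKE '%term%', which, per the note above, cannot use a plain index):

// One query instead of four round trips; db and Customer are assumptions.
var terms = searchText.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
IQueryable<Customer> query = db.Customers;
foreach (string term in terms)
{
    string t = term; // copy for the closure
    query = query.Where(c =>
        c.FirstName.Contains(t) || c.LastName.Contains(t) ||
        c.Address.Contains(t) || c.City.Contains(t) ||
        c.State.Contains(t) || c.Zip.Contains(t));
}
var results = query.Take(20).ToList(); // autocomplete only needs a handful of rows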

C# SQL Server - More Efficient for Multiple Database accesses or multiple loops through data?

In part of my application I have to get the last ID of a table where a condition is met.
For example:
SELECT MAX(ID) FROM TABLE WHERE Num = 2
So I can either grab the whole table and loop through it looking for Num = 2, or I can grab the data from the table where Num = 2. In the latter, I know the last item will be the MAX ID.
Either way, I have to do this around 50 times... so would it be more efficient to grab all the data once and loop through it looking for each specific condition, or would it be better to grab the data several times based on the condition, where I know the last item in each list will be the max ID?
I have 6 conditions I will have to base the queries on
I'm just wondering which is more efficient: looping through a list of around 3500 items several times, or hitting the database several times, where I can already have the data broken down like I need it.
I can speak for SQL Server: if you create a stored procedure where Num is a parameter that you pass, you will get the best performance, because the engine can reuse the stored procedure's optimized execution plan. Of course, an index on that field is mandatory.
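Something like this on the C# side (the procedure name dbo.GetMaxIdForNum is made up for the sketch; it would wrap SELECT MAX(ID) FROM TABLE WHERE Num = @Num):

using System;
using System.Data;
using System.Data.SqlClient;

// Call the hypothetical stored procedure with Num as a parameter.
static int? GetMaxId(SqlConnection conn, int num)
{
    using (var cmd = new SqlCommand("dbo.GetMaxIdForNum", conn))
    {
        cmd.CommandType = CommandType.StoredProcedure;
        cmd.Parameters.AddWithValue("@Num", num);
        object result = cmd.ExecuteScalar();
        // MAX over an empty set comes back as DBNull
        return result == null || result == DBNull.Value ? (int?)null : (int)result;
    }
}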
Let the database do this work; it's what it is designed to do.
Does this table have a high insert frequency? Does it have a high update frequency, specifically on the column that you're applying the MAX function to? If the answer is no, you might consider adding an IS_MAX BIT column and setting it using an insert trigger. That way, the row you want is essentially cached, and it's trivial to look up.

Simplifying complexity for a table object structure

I have an object structure that mimics the properties of an Excel table. So I have a table object containing properties such as the title, a header row object, and body row objects. Within the header row and each body row object, I have a cell object containing info on each cell per row. I am looking for a more efficient way to store this table structure, since in one of my uses for this object I am printing its structure to screen. Currently, printing each cell of each row is O(n^2) complexity:
foreach (var row in Table.Rows) {
    foreach (var cell in row.Cells) {
        Console.WriteLine(cell.ToString());
    }
}
Is there a more efficient way of storing this structure to avoid the n^2? I ask because this printing functionality exists inside another n^2 loop. Basically, I have a list of table titles and a list of tables, and I need to find the tables whose titles are in the title list. Then, for each of those tables, I need to print their rows and the cells in each row. Can any part of this operation be optimized by using a different data structure for storage, perhaps? I'm not sure exactly how they work, but I have heard of hashing and dictionaries.
Thanks
Since you are looking for tables with specific titles, you could use a dictionary to store the tables by title
Dictionary<string,Table> tablesByTitle = new Dictionary<string,Table>();
tablesByTitle.Add(table.Title, table);
...
table = tablesByTitle["SomeTableTitle"];
This would make finding a table an O(1) operation. Finding n tables would be an O(n) operation.
Printing the tables then of course depends on the number of rows and columns. There is nothing that can change that.
UPDATE:
string tablesFromGuiElement = "Employees;Companies;Addresses";
string[] selectedTables = tablesFromGuiElement.Split(';');
foreach (string title in selectedTables) {
    Table tbl = tablesByTitle[title];
    PrintTable(tbl);
}
There isn't anything more efficient than an N^2 operation for outputting an NxN matrix of values. Worst-case, you will always be doing this.
Now, if instead of storing the values in a multidimensional collection that defines the graphical relationship of rows and columns, you put them in a one-dimensional collection and included the row-column information with each cell, then you would only need to iterate through the cells that had values. Worst-case is still N^2 for a table of N rows and N columns that is fully populated (the one-dimensional array, though linear to enumerate, will have N^2 items), but the best case would be that only one cell in that table is populated (or none are) which would be constant-time.
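A sketch of that one-dimensional, sparse idea (SparseTable is invented for the illustration, and the string payload stands in for whatever a cell holds):

using System;
using System.Collections.Generic;

// Only populated cells are stored, keyed by their (row, column) position.
class SparseTable
{
    private readonly Dictionary<(int Row, int Col), string> cells =
        new Dictionary<(int Row, int Col), string>();

    public void Set(int row, int col, string value) => cells[(row, col)] = value;

    public void Print()
    {
        // linear in the number of populated cells, not rows * columns
        foreach (var cell in cells)
            Console.WriteLine($"({cell.Key.Row},{cell.Key.Col}): {cell.Value}");
    }
}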
This answer applies to the printing-the-table part, but the question was extended. For the getting-the-table part, see the other answer.
No, there is not.
Unless perhaps your values follow some predictable distribution, in which case you could use a function of x and y and store no data at all, or maybe just a seed and a function.
You could cache the print output in a string or StringBuilder if you need it multiple times, as sketched below.
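For example (a minimal sketch reusing the loop from the question):

using System.Text;

// build the output once, reuse the cached string afterwards
var sb = new StringBuilder();
foreach (var row in Table.Rows)
    foreach (var cell in row.Cells)
        sb.AppendLine(cell.ToString());
string cachedOutput = sb.ToString();
Console.WriteLine(cachedOutput); // later prints reuse cachedOutput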
If there is enough data, I guess you might apply some compression algorithm, but I wouldn't say that was simpler or more efficient.
