fastest way to search huge list of big texts - c#

I have a Windows application written in C# that needs to load 250,000 rows from a database and provide a "search as you type" feature: as soon as the user types something into a text box, the application must search all 250,000 records (each, by the way, a single column of about 1,000 characters) using a LIKE-style search and display the matching records.
The approach I followed was:
1- The application loads all the records into a typed List<EmployeesData>:
while (objSQLReader.Read())
{
    lstEmployees.Add(new EmployeesData(
        Convert.ToInt32(objSQLReader.GetString(0)),
        objSQLReader.GetString(1),
        objSQLReader.GetString(2)));
}
2- In the TextChanged event I search using LINQ (combined with a regular expression) and bind the resulting IEnumerable<EmployeesData> to a ListView that is in virtual mode.
String strPattern = "(?=.*wood*)(?=.*james*)";
IEnumerable<EmployeesData> lstFoundItems = from objEmployee in lstEmployees
                                           where Regex.IsMatch(objEmployee.SearchStr, strPattern, RegexOptions.IgnoreCase)
                                           select objEmployee;
lstFoundEmployees = lstFoundItems;
3- The RetrieveVirtualItem event is handled to supply the items the ListView displays.
e.Item = new ListViewItem(new String[] {
    lstFoundEmployees.ElementAt(e.ItemIndex).DateProjectTaskClient,
    e.ItemIndex.ToString() });
Although lstEmployees loads from SQL Server relatively fast (about 1.5 seconds), the LINQ search triggered from TextChanged takes more than 7 minutes. Running the equivalent LIKE search directly against SQL Server takes less than 7 seconds.
What am I doing wrong here? How can I make this search faster (no more than 2 seconds)? This is a requirement from my client, so any help is highly appreciated. Please help...

Does the database column that stores the text data have an index on it? If so, something similar to the trie structure that Nicholas described is already in use. Indexes in SQL Server are implemented as B+ trees, whose search cost is proportional to the height of the tree, roughly log base 2 of n where n is the number of records. So with 250,000 records in the table, a search needs about log2(250,000), or approximately 18, operations.
When you load all of the information into a data reader and then filter it with a LINQ expression, the search is a linear operation, O(n), where n is the length of the list. So worst case, it's going to be 250,000 comparisons. If you use a DataView, there will be indexes that can be used to help with searching, which will drastically improve performance.
At the end of the day, if there will not be too many requests submitted against the database server, leverage the query optimizer to do this. As long as the LIKE operation isn't performed with a wildcard at the front of the string (i.e. LIKE '%some_string', which negates the use of an index) and there is an index on the column, you will get really fast performance. If there are just too many requests that will be submitted to the database server, either put all of the information into a DataView so an index can be used, or use a dictionary as Tim suggested above, which has a search time of O(1), assuming the dictionary is implemented using a hash table.
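For illustration, here is a minimal sketch of that dictionary idea, assuming the EmployeesData type and SearchStr field from the question (txtSearch is a placeholder for the search text box). It only handles whole-word lookups, so substring matches would still need something like the trie described below:

// Sketch: build a word -> matching-records index once, right after loading lstEmployees.
// (uses System.Collections.Generic)
var index = new Dictionary<string, List<EmployeesData>>(StringComparer.OrdinalIgnoreCase);
foreach (var emp in lstEmployees)
{
    foreach (var word in emp.SearchStr.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
    {
        List<EmployeesData> bucket;
        if (!index.TryGetValue(word, out bucket))
        {
            bucket = new List<EmployeesData>();
            index[word] = bucket;
        }
        bucket.Add(emp);
    }
}

// In TextChanged, a whole-word lookup is now a hash lookup rather than a 250,000-row scan.
List<EmployeesData> hits;
if (!index.TryGetValue(txtSearch.Text.Trim(), out hits))
    hits = new List<EmployeesData>();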

You'd be wanting to preload things and build yourself a data structure called a trie.
It's memory-intensive, but it's what the doctor ordered in this case.
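As a rough sketch of what that could look like in C# (illustrative only; EmployeesData comes from the question, everything else here is made up), the trie is keyed on word prefixes so each keystroke narrows the candidate set without scanning the whole list:

// Minimal prefix trie. (uses System.Collections.Generic)
class TrieNode
{
    public Dictionary<char, TrieNode> Children = new Dictionary<char, TrieNode>();
    public List<EmployeesData> Records = new List<EmployeesData>();   // every record reachable through this prefix
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    // Index one word of a record; every prefix of the word will lead to the record.
    public void AddWord(string word, EmployeesData record)
    {
        TrieNode node = _root;
        foreach (char c in word.ToLowerInvariant())
        {
            TrieNode child;
            if (!node.Children.TryGetValue(c, out child))
            {
                child = new TrieNode();
                node.Children[c] = child;
            }
            node = child;
            node.Records.Add(record);   // this duplication is what makes the trie memory-hungry
        }
    }

    // Everything whose indexed word starts with the typed prefix.
    public List<EmployeesData> FindByPrefix(string prefix)
    {
        TrieNode node = _root;
        foreach (char c in prefix.ToLowerInvariant())
        {
            if (!node.Children.TryGetValue(c, out node))
                return new List<EmployeesData>();
        }
        return node.Records;
    }
}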

See my answer to this question. If you need instant response (i.e. as fast as a user types), loading the data into memory can be a very attractive option. It may use a bit of memory, but it is very fast.
Even though there are many characters (250K records * 1,000), how many unique values are there? An in-memory structure keyed on those values, with pointers to the records matching each key, really doesn't have to be that big, even accounting for permutations of those keys.
If the data truly won't fit into memory or changes frequently, keep it in the database and use SQL Server Full Text Indexing, which handles searches like this much better than LIKE. This assumes a fast connection from the application to the database.
Full Text Indexing offers a powerful set of operators/expressions which can be used to make searches more intelligent. It's available with the free SQL Server Express Edition, which will handle up to 10 GB of data.
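As a hedged sketch of querying such an index from C# (the dbo.Employees table, its columns, and connectionString are assumptions; a full-text index must already exist on the column):

// Requires System.Data.SqlClient; dbo.Employees(SearchStr) is a placeholder full-text indexed column.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT EmployeeId, SearchStr FROM dbo.Employees WHERE CONTAINS(SearchStr, @terms)", conn))
{
    // Prefix terms, roughly matching the question's "wood"/"james" example.
    cmd.Parameters.AddWithValue("@terms", "\"wood*\" AND \"james*\"");
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // bind reader.GetString(1) to the virtual ListView, as in step 3 of the question
        }
    }
}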

If the records can be sorted, you may want to go with a binary search, which is much, much faster for large data sets. There are built-in implementations in the .NET collections, such as List<T>.BinarySearch and Array.BinarySearch.
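For example, a tiny sketch using the built-in List<T>.BinarySearch (the values are made up; the list must be kept sorted):

// Sketch: binary search over a sorted list of names. (uses System.Collections.Generic)
var names = new List<string> { "adams", "baker", "jones", "wood" };
names.Sort();                              // keep the list sorted once, up front
int pos = names.BinarySearch("jones");     // O(log n) instead of a linear scan
bool found = pos >= 0;                     // a negative result encodes the insertion point (~pos)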

Related

C# Winforms Fastest Way To Query MS Access

This may be a dumb question, but I wanted to be sure. I am creating a WinForms app and using a C# OleDbConnection to connect to an MS Access database. Right now I am using "SELECT * FROM table_name" and looping through each row to see if it is the row with the criteria I want, then breaking out of the loop when it is. I wonder if performance would improve if I used something like "SELECT * FROM table_name WHERE id=something", so basically a WHERE clause instead of looping through every row?
The best way to validate the performance of anything is to test. Otherwise, a lot of assumptions are made about what is the best versus the reality of performance.
With that said, 100% of the time using a WHERE clause will be better than retrieving the data and then filtering in a loop. There are a few reasons, but ultimately it comes down to filtering the rows on the server before anything is sent back, versus retrieving every row and then discarding most of them. Relational data should be dealt with using set logic, which is how a WHERE clause works: it operates on the data set. The loop is not set logic; it compares each individual row, expensively, and discards those that don't meet the criteria.
Don’t take my word for it though. Try it out. Especially try it out when your app has a lot of data in the table.
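As a small sketch of what the WHERE version could look like (the table and column names are taken from the question; connString and someId are placeholders):

// Sketch: let the database filter with WHERE instead of looping over every row.
// Requires System.Data.OleDb.
using (var conn = new OleDbConnection(connString))
using (var cmd = new OleDbCommand("SELECT * FROM table_name WHERE id = ?", conn))
{
    cmd.Parameters.AddWithValue("?", someId);   // OleDb parameters are positional
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        if (reader.Read())
        {
            // only the matching row comes back over the wire
        }
    }
}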
yes, of course.
Say you have an Access database file shared on a folder, and you deploy your .NET desktop application to each workstation.
And furthermore, say the table has 1 million rows.
If you do this:
SELECT * from tblInvoice WHERE InvoiceNumber = 123245
Then ONLY one row is pulled down the network pipe - and this holds true EVEN if the table has 1 million rows. To traverse and pull 1 million rows is going to take a HUGE amount of time, but if you add criteria to your select, then it would be in this case about 1 million times faster to pull one row as opposed to the whole table.
And what if this is/was multi-user? Then again, even over a network, ONLY the ONE record that meets your criteria will be pulled. The only requirement for this "one row pull" over the network? The Access data engine needs a usable index on that criteria. By default the PK column (ID) always has that index, so no worries there. But if, as above, we are pulling invoice numbers from a table, then an index on that column (InvoiceNumber) is required for the data engine to pull only one row. If no index can be used, then behind the scenes all rows are pulled until a match occurs, and over a network that means significant amounts of data get pulled without that index (or, if local, get read from the file on disk).
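If that index doesn't exist yet, it can be created once; a sketch (the index name and connString are placeholders, and DDL can be sent through OleDb like any other command):

// Sketch: create the index on InvoiceNumber so the engine can pull a single row.
using (var conn = new OleDbConnection(connString))
using (var cmd = new OleDbCommand("CREATE INDEX idxInvoiceNumber ON tblInvoice (InvoiceNumber)", conn))
{
    conn.Open();
    cmd.ExecuteNonQuery();
}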

What is the best optimization technique for a wildcard search through 100,000 records in sql table

I am working on an ASP.NET MVC application used by about 200 users. These users constantly (every 5 minutes) search for an item from a list of 100,000 items (this list grows by 1-2% every month). The list of 100,000 items is stored in a SQL Server table.
The search is a wildcard search
eg:
Select itemCode, itemName, ItemDesc
from tblItems
Where itemName like '%SearchWord%'
The searching needs to be really fast, since the main business relies on searching and selecting the item.
I would like to know how to get the best performance. The search results have to come up instantaneously.
What I have tried -
I tried pre-loading the entire 100,000 records into memcache and then reading from memcache, to avoid calling SQL Server for every search.
This takes a lot of time. Every time a user searches for an item, we retrieve the 100,000 records from memcache and then run the search over them, which takes almost 2-3 times longer than searching SQL Server directly.
I tried doing a direct search on the SQL Server table, limiting the results to only 50 records at a time (using TOP 50).
This seems OK, but it is still nowhere near the performance we are seeking.
I would like to hear the possible solutions and links to any articles/code.
Thanks in advance
Run SQL Profiler and do a tuning profile. This will produce recommendations for indexes to create on your database.
Also, a query such as the following would be worth a try.
SELECT *
FROM
(
    SELECT ROW_NUMBER() OVER (ORDER BY ColumnA) AS RowNumber, itemCode, itemName, ItemDesc
    FROM tblItems
    WHERE itemName LIKE '%FooBar%'
) AS RowResults
WHERE RowNumber >= 1 AND RowNumber < 50
ORDER BY RowNumber
EDIT: Updated query to reflect your real scenario.
How about having a search without the leading wildcard as your primary search....
Where itemName like 'SearchWord%'
and then having a "More Results" button that loads
Where itemName like '%SearchWord%'
(alternatively exclude results from the first result set)
Where itemName not like 'SearchWord%' and itemName like '%SearchWord%'
A weird alternative which might work; it depends on several assumptions, etc. Sorry it's not fully explained, but I am typing on an iPad, so it's hard to write more. (And yes, this solution has been used in high-transaction commercial systems.)
This assumes:
That your query is CPU-constrained, not I/O-bound
That itemName is not so long that it contains every letter and digit (otherwise the flags below would not be selective)
That the search word, in total, contains enough selective characters and isn't just highly common characters
That your selection predicates are constrained by a %like%
The basic idea is to expand your query to help the optimiser know which rows need the like scanning.
Step 1. Setup your table
Create an additional 26 or 36 columns, one per letter (or letter/digit). When I've done this for real it has always been a separate table, but putting it on the source table should be OK for a small volume like 100k. Let's call the columns trig_a, trig_b, etc.
Create a trigger for each insert/update/delete that puts a 1 or 0 into the trig_a field depending on whether the value contains an 'a', and do the same for all 26/36 columns. The trigger to do this is complex, but possible (at least using Oracle). If you get stuck I'm sure SO'ers can create it, or I can dig it out.
At this point, we have a series of columns that indicate whether a field contains a letter/digit etc.
Step 2. Helping you query
With this extra info, we are in a position to help the optimiser. Add the following to your query:
Select ... Where .... And
((trig_a > 0) or (searchword not like '%a%')) and
((trig_b > 0) or (searchword not like '%b%')) and
... Repeat for all columns monitored...
If the optimiser behaves, it can use the (hopefully) lower-cost field > 0 predicates to reduce the number of like predicates evaluated (see the sketch after the notes below).
Notes.
You may need to force the optimiser to scan the trig_? fields first
Indexes can help on the trig_? fields, especially if they are in the source table
I haven't shown how to handle upper/lower case; don't forget to handle this
You might find that doing just a few letters is all you need
This technique doesn't offer performance gains for every use of like, so it isn't a general-purpose technique for everywhere you use a like
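To make Step 2 more concrete, here is a hedged sketch of building the extra predicates on the application side; since the search word is already known when the query is assembled, each letter it contains simply adds a trig_? > 0 condition (the helper name and its usage are made up):

// Sketch: for every letter present in the search word, require the matching flag column.
// Equivalent to the ((trig_a > 0) or (searchword not like '%a%')) pattern above,
// evaluated here because the search word is already known when the query is built.
static string BuildLetterFlagPredicates(string searchWord)
{
    var sb = new System.Text.StringBuilder();
    foreach (char c in "abcdefghijklmnopqrstuvwxyz")
    {
        if (searchWord.IndexOf(c.ToString(), StringComparison.OrdinalIgnoreCase) >= 0)
            sb.Append(" AND trig_").Append(c).Append(" > 0");
    }
    return sb.ToString();
}

// e.g. BuildLetterFlagPredicates("FooBar") returns
// " AND trig_a > 0 AND trig_b > 0 AND trig_f > 0 AND trig_o > 0 AND trig_r > 0"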

How to store a sparse boolean vector in a database?

Let's say I have a book with ~2^40 pages. Each day, I read a random chunk of contiguous pages (sometimes including some pages I've already read). What's the smartest way to store and update the information of "which pages I've read" in a (SQLite) database?
My current idea is to store [firstChunkPage, lastChunkPage] entries in a table, but I'm not sure about how to update this efficiently.
Should I first check for every possible overlap and then update?
Should I just insert my new range and then merge overlapping entries (perhaps multiple times, because multiple overlaps can occur)? I'm not sure how to build such a SQL query.
This looks like a pretty common problem, so I'm wondering if anyone knows a 'recognized' solution for this.
Any help or idea is welcome!
EDIT: The reading isn't actually random; the number of chunks is expected to be pretty much constant and very small compared to the number of pages.
Your idea to store ranges of (firstChunkPage, lastChunkPage) pairs should work if data is relatively sparse.
Unfortunately, queries like you mentioned:
SELECT count(*) FROM table
WHERE firstChunkPage <= page AND page <= lastChunkPage
cannot work effectively, unless you use spatial indexes.
For SQLite, you should use the R-Tree module, which implements support for this kind of index. Quote:
An R-Tree is a special index that is designed for doing range queries. R-Trees are most commonly used in geospatial systems where each entry is a rectangle with minimum and maximum X and Y coordinates. ... For example, suppose a database records the starting and ending times for a large number of events. A R-Tree is able to quickly find all events, for example, that were active at any time during a given time interval, or all events that started during a particular time interval, or all events that both started and ended within a given time interval.
With R-Tree, you can very quickly identify all overlaps before inserting new range and replace them with new combined entry.
To create your RTree index, use something like this:
CREATE VIRTUAL TABLE demo_index USING rtree(
id, firstChunkPage, lastChunkPage
);
For more information, read documentation.
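As a hedged sketch of the insert-with-merge flow against the demo_index table above, from C# with System.Data.SQLite (newFirst, newLast, newId, and the database file name are placeholders):

// Sketch: find every stored range that overlaps the new chunk, widen the new chunk
// to cover them, then replace the old rows with one combined row.
// Requires System.Data.SQLite and System.Collections.Generic.
using (var conn = new SQLiteConnection("Data Source=book.db"))
{
    conn.Open();
    long first = newFirst, last = newLast;
    var idsToDelete = new List<long>();

    // The R-Tree index makes this range query fast.
    using (var cmd = new SQLiteCommand(
        "SELECT id, firstChunkPage, lastChunkPage FROM demo_index " +
        "WHERE firstChunkPage <= @last AND lastChunkPage >= @first", conn))
    {
        cmd.Parameters.AddWithValue("@first", first);
        cmd.Parameters.AddWithValue("@last", last);
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                idsToDelete.Add(Convert.ToInt64(reader[0]));
                first = Math.Min(first, Convert.ToInt64(reader[1]));
                last = Math.Max(last, Convert.ToInt64(reader[2]));
            }
        }
    }

    foreach (long id in idsToDelete)
    {
        using (var del = new SQLiteCommand("DELETE FROM demo_index WHERE id = @id", conn))
        {
            del.Parameters.AddWithValue("@id", id);
            del.ExecuteNonQuery();
        }
    }

    using (var ins = new SQLiteCommand(
        "INSERT INTO demo_index(id, firstChunkPage, lastChunkPage) VALUES (@id, @first, @last)", conn))
    {
        ins.Parameters.AddWithValue("@id", newId);
        ins.Parameters.AddWithValue("@first", first);
        ins.Parameters.AddWithValue("@last", last);
        ins.ExecuteNonQuery();
    }
}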

how to check if value is present in a very big data record or a big list efficiently

Hi guys, I have this doubt...
If I have a record of username and password details for logging in to a website, I'll most probably get the username and password from the form and check whether the given username is present in the database using a contains() Boolean operation, and if it is, check that the password is the same as the one saved in the database.
But for websites like Gmail and Facebook there are millions of records, and the authentication is very quick...
How do they do it? What method do they follow for this?
How do they check whether a value is present in such a large record set that quickly?
Does the process just involve adding more servers for processing speed?
Thanks for the answers...
EDIT: Sorry, I posted this question without knowing about indexes. (I just learned that by creating indexes on one or more columns, the full table scan is avoided and the index path is used instead, which is a less costly and more efficient operation.)
You just need one SQL query:
select 1 from user u
where u.login = :theEnteredLogin
and u.hashed_password = :theHashedEnteredPassword
(where :xxx are parameters of the query).
If you have an index on the login column, or even better a composite index on (login, hashed_password), the query should not take more than a few milliseconds to execute.
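As a hedged illustration of the same check from C# (SqlClient parameter syntax; the table and column names follow the query above, and HashPassword stands in for whatever hashing the application uses):

// Sketch: one indexed lookup; ExecuteScalar returns null when no row matches.
// Requires System.Data.SqlClient.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand(
    "SELECT 1 FROM [user] u WHERE u.login = @login AND u.hashed_password = @hash", conn))
{
    cmd.Parameters.AddWithValue("@login", enteredLogin);
    cmd.Parameters.AddWithValue("@hash", HashPassword(enteredPassword));   // placeholder hash function
    conn.Open();
    bool authenticated = cmd.ExecuteScalar() != null;
}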
Well, they have lots of servers and high-performance databases. At a low level, the table is probably indexed on the hashed key for fast lookup, binary-search style.
For medium to large data sets indexing, combined with proper sizing of disk, memory and cpus, is the most adopted approach.
For very large data sets, the database can be distributed and data partitioned.
For very, very large data sets, aside from the above scenarios, the technologies used usually involve the MapReduce model.

How to manage a million records?

I really need an expert's help to answer my query.
Here is the scenario:
I'm using an SQL SELECT query to retrieve a million records.
I need to perform sorting and grouping on the resulting records, which I'm storing in a DataTable (in one execution)
and looping through for grouping and sorting.
I know this is childish and not the right way to process it.
How can I manage the million records effectively and apply the grouping and sorting to them?
I really need help here. I've heard of executing the select query batch-wise, but how do I implement the grouping and sorting when I don't have the entire data in hand?
I cannot go for SQL ORDER BY and GROUP BY directly; that's against my requirement.
Here is what I'm doing right now:
I have the following objects, i.e. the column names for grouping and sorting:
List<Group> groupList;
List<Sort> sortList;
DataTable reportData; // holds the entire record set from the db
I'm looping through 'reportData' row by row and matching the current and previous rows for the custom grouping and sorting. I would like to know how the same can be done with batch-wise execution, or whether there is any alternative solution.
I need to perform sorting and grouping on the resulting records, which
I'm storing in a DataTable (in one execution) and looping through
for grouping and sorting.
What for?
Seriously.
Do not pull the data and then try playing smart with a stupid object model behind it (and DataSets are not particularly smart, sorry).
Group and sort in your SELECT statement, pull the data already grouped and sorted, and be done with it.
A million records was a small amount of data for SQL Server when the original version was released (4.2 it was, a port of Sybase SQL Server) 17 years or so ago. These days it is something that likely fits into the processor's third-level cache and is nothing a proper SQL Server even notices it has just processed.
SQL is particularly good at doing projections, and ever since they introduced MARS you can even run multiple queries over one connection, which comes in handy here.
So, go back, throw away the DataSet and the "I'll try to program a sort algorithm" approach, and write proper SQL statements that pull the data as you need it.
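For instance, a hedged sketch of pushing the work into the statement itself (the table and column names are placeholders borrowed from the LINQ examples further down):

// Sketch: let SQL Server group and sort; only the aggregated rows cross the wire.
// Requires System.Data and System.Data.SqlClient; "Reports" and the columns are placeholders.
var sql = @"SELECT DocumentTypeID, COUNT(*) AS DocumentCount
            FROM Reports
            GROUP BY DocumentTypeID
            ORDER BY DocumentTypeID";

using (var conn = new SqlConnection(connectionString))
using (var adapter = new SqlDataAdapter(sql, conn))
{
    var grouped = new DataTable();
    adapter.Fill(grouped);   // a handful of grouped rows instead of a million raw ones
}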
Sounds like you should implement Partition Pruning. Partitioning will allow for a separation of content like you are requesting in order to have faster queries.
If I understood correctly, in your case I would create a temporary database table with the structure I want, specifically to cover my grouping.
Then I would select the records from the main tables and insert them into the temporary one, applying all modifications including the grouping.
A specific index matching how you want them sorted should also be applied.
After that, just select from this table, do what you have to do, and finally, if the data is not needed any more, delete the temporary table.
I would choose the above solution because a million records in memory smells like trouble to me...
For example:
1. Let's assume that you would like to group them by their DocumentTypeID
var groupByType = reportData.GroupBy(g=>g.DocumentTypeID);
2. Sorting Alphabetically
var sortAlphabetically = reportData.OrderBy(g=>g.DocumentName);
3. Grouping and Sorting
var groupAndSort = reportData.GroupBy(g=>g.DocumentTypeID)
.OrderBy(g=>g.DocumentName);
4. Sort and Group
var groupAndSort = reportData.OrderBy(g=>g.DocumentName)
.GroupBy(g=>g.DocumentTypeID);
5. Multiple Grouping and sorting
var multipleGroupAndSort = reportData.GroupBy(g=>g.DocumentTypeID)
.GroupBy(g=>g.CreatedOnDate.Month)
.OrderBy(g=>g.DocumentName);
so on and so forth...
But I would still discourage bringing a million rows into the application; it will cost memory. There are of course ways to manage it, through stored procedures etc.
