I have a table containing a column of type VARCHAR. I want to search strings inside that column according to a user's input query, and I want to implement approximate (fuzzy) searching. My table contains lakhs (hundreds of thousands) of records. These are the approaches I am considering:
Load all records into C# and apply the search algorithm in memory. (But this will consume too much memory.)
Fetch records individually or in predefined batches and apply the search algorithm to each batch. (But this will hit the database repeatedly, which may degrade performance.)
I am sure there is some other mechanism to implement this functionality, or some technique for storing the data so that I can search it faster.
Can anybody give me a better idea for implementing this?
Lucene is one of the best ways to search. You can still store your strings in the database, but build a Lucene index from them and then use the index to search.
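A rough sketch of what that can look like with Lucene.NET 4.8 (the field names, sample text, and index path here are invented): index each row as a Lucene document, then query the index. A FuzzyQuery is one way to get the approximate matching asked about.

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion version = LuceneVersion.LUCENE_48;
using var dir = FSDirectory.Open("lucene-index");

// Index each database row as a Lucene document (run once, and on updates).
using (var writer = new IndexWriter(dir, new IndexWriterConfig(version, new StandardAnalyzer(version))))
{
    writer.AddDocument(new Document
    {
        new StringField("id", "42", Field.Store.YES),
        new TextField("body", "some searchable text from the VARCHAR column", Field.Store.YES)
    });
    writer.Commit();
}

// Search: FuzzyQuery tolerates small spelling differences (edit distance).
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);
var query = new FuzzyQuery(new Term("body", "serchable")); // the typo still matches
foreach (var hit in searcher.Search(query, 10).ScoreDocs)
    Console.WriteLine(searcher.Doc(hit.Doc).Get("id"));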
SQL Server has built-in functionality to do exactly what you're looking to do: it's called Full-Text Search.
Overview from Microsoft here: http://msdn.microsoft.com/en-us/library/ms142571.aspx
The general concept is that you tell SQL Server which tables/columns contain searchable text, and it builds space-efficient, query-efficient "full-text indexes". These indexes are built asynchronously (so your updates/inserts are not slowed down), and since SQL Server 2005 they are stored with your database (e.g. included in backups), so they're easily managed.
When you want to search, the query language is different from "normal" text matching.
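For example, here is roughly what a full-text query looks like from C# (the table, columns, and connection string are hypothetical): CONTAINS takes its own predicate language with phrases, prefix terms, and boolean operators rather than LIKE patterns.

using System;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient on older stacks

// Hypothetical "Articles" table whose "Body" column is full-text indexed.
const string sql = @"
    SELECT ArticleId, Title
    FROM Articles
    WHERE CONTAINS(Body, @condition);";

var connectionString = "Server=.;Database=MyDb;Integrated Security=true;";
using var conn = new SqlConnection(connectionString);
using var cmd = new SqlCommand(sql, conn);
cmd.Parameters.AddWithValue("@condition", "\"full-text\" AND \"index*\"");
conn.Open();
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(1)}");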
Full-Text search is even available in the free "SQL Server 2008 Express with Advanced Services" edition, so cost is no longer a concern.
Just a quick question that I am sure someone on here can give me a detailed answer about.
Basically, I am using a DataGridView to display records from a database, which in my case is simply a text file that gets parsed. This feels simple, and if I want to select records based on certain parameters, I iterate through the list searching for matches. However, I wonder if this is inefficient compared with using a full-blown DB such as Mongo or SQL.
Can I get away with this if my software is relatively simple? I really prefer to stay away from complicating things when they don't need to be complicated.
By the way, I am expecting to have a DB (sometimes) larger than 100k entries, so take that into consideration.
@DavidStampher
Even though you may be using just one table or file, I would strongly suggest using a database system for this. Database engines are optimized for speed, so querying is far less frustrating and time-consuming than repeatedly parsing and updating a single text file.
I'm only suggesting MySQL as an option because it's the one I am most familiar with. Other users may have different or better suggestions.
You can easily download and install one from the MySQL installer. The setup is relatively simple and should take less than 10 minutes. You could create a new schema, add a table, then query it to do what you need.
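A minimal sketch of querying it from C# with the MySql.Data (Connector/NET) package; the schema, table, and credentials here are made up:

using System;
using MySql.Data.MySqlClient; // MySql.Data NuGet package

var connectionString = "Server=localhost;Database=mydb;Uid=appuser;Pwd=secret;";
using var conn = new MySqlConnection(connectionString);
conn.Open();

// Let the database do the filtering instead of scanning a parsed text file.
using var cmd = new MySqlCommand(
    "SELECT id, name FROM records WHERE name LIKE @term;", conn);
cmd.Parameters.AddWithValue("@term", "%smith%");

using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt32(0)}: {reader.GetString(1)}");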
I would suggest creating a new user other than root, just in case someone manages to hack into your account.
If you would like an easier way to manage the database than going through the old-fashioned phpMyAdmin, download MySQL Workbench. It's pretty cool and relatively easy to use.
Let me know if you have questions. :-)
I am working on a fairly large enterprise document management system (DMS) for a big company.
The DMS database is Microsoft SQL Server 2012 and the document table is named "Document".
At this time, more than 4,000,000 records exist in the Document table.
I need to search the Document table much like a Google search, via SQL Server Full-Text Search, with very good performance (less than 1 second response time).
The user sees a single text box for intelligent search. For example, the user needs to find a document whose code contains "1107" and whose author name contains "Albert", so in that text box they type: 1107 Albert
I wrote the query below to find this:
select count(*) over() totalRowFound, DocumentID
from dbo.Document
where contains(*, N'("*1107*")') AND contains(*, N'("*Albert*")')
I used * in the CONTAINS function (searching all columns) for better search results, but the response time is about 4-7 seconds.
I know Google's algorithm is very complicated, but I want to implement a Google-like intelligent search over only 4-10 million records with a response time of less than 1 second.
How can I improve this query?
or
What is the best practice for an intelligent, Google-like search?
With the * you are searching all columns.
Try
where contains(code, N'("*1107*")')
How does searching all columns give better search results if they want to search a specific column?
SQL full-text search is not going to behave the same as Google; they are not the same engines.
I don't think the Google engine is available.
Lucene is a free, open-source search engine.
Why did you go down a path of writing your own DMS?
When you say 'Google search', what you really mean is an inverted index. Apache's Lucene project provides this functionality in a similar indexing fashion. SQL Server's Full-Text Search uses inverted indexes as well.
If you want very, very fast text searches, you might want to try Lucene or Solr, as they have some features that SQL Server Full-Text Search does not (and vice-versa) and, when properly configured, can perform very well.
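To make the inverted-index idea concrete, here is a toy sketch in C# (not how Lucene or SQL Server actually store things, just the core concept): map each term to the set of document IDs containing it, then answer a multi-term search by intersecting those sets.

using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index: term -> set of document IDs containing that term.
var index = new Dictionary<string, HashSet<int>>(StringComparer.OrdinalIgnoreCase);

void IndexDocument(int docId, string text)
{
    foreach (var term in text.Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        if (!index.TryGetValue(term, out var docs))
            index[term] = docs = new HashSet<int>();
        docs.Add(docId);
    }
}

// A query like "1107 Albert" intersects the posting lists of both terms,
// so lookups avoid scanning every document.
IEnumerable<int> Search(params string[] terms) =>
    terms.Select(t => index.TryGetValue(t, out var d) ? d : new HashSet<int>())
         .Aggregate((a, b) => new HashSet<int>(a.Intersect(b)));

IndexDocument(1, "quarterly report 1107 by Albert");
IndexDocument(2, "annual report by Marie");
Console.WriteLine(string.Join(", ", Search("1107", "Albert"))); // prints: 1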
I normally use SQL Server and C# for all my projects; however, I am looking at a project that could potentially grow to billions of rows of data, and I don't feel comfortable doing this in SQL Server.
The data I will be storing is
datetime
ipAddress
linkId
possibly other string-related data
I have only ever dealt with relational databases before, and hence was looking for some guidance on which database technology would be best suited for this type of data storage: one that can scale, and do so at a low cost (compared to sharding SQL Server).
I would then need to pull this data out based on linkId.
Also, would I be able to do ordering within the query to the DB, or would that be best done in the application?
EDIT: It will be cloud-based. Hence I was looking at SQL Azure, which I have used extensively; however, it just starts causing issues as the row count goes up.
Since you are looking for general guidance, I feel it is ok to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a Data Warehouse that had 4 nodes, each of which had the main fact table holding 1.2 - 1.5 billion rows (and growing) and responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better.

It is a web-based application with many users hitting it all day long (though some periods of the day much harder than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to properly model that as well).

True, the free Express edition might not meet your needs, but Standard Edition likely would, and it is not super expensive. Enterprise has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.
Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seemed to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
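To make "good index design" slightly more concrete for the table described, here is a hedged sketch (all names and types are hypothetical): cluster the table on the access path, LinkId then time, so that fetching by linkId in time order is a single index seek plus range scan, with no separate sort step. That also answers the ordering question: let the database order the rows when an index already provides the order.

using System;
using Microsoft.Data.SqlClient;

const string ddl = @"
    CREATE TABLE dbo.LinkHits
    (
        LinkId    INT           NOT NULL,
        EventTime DATETIME2(3)  NOT NULL,
        IpAddress VARCHAR(45)   NOT NULL,  -- wide enough for IPv6 text form
        ExtraData NVARCHAR(400) NULL
    );
    -- Cluster on the access path: rows for one LinkId are stored together,
    -- already sorted by time.
    CREATE CLUSTERED INDEX CIX_LinkHits ON dbo.LinkHits (LinkId, EventTime);";

var connectionString = "Server=.;Database=MyDb;Integrated Security=true;";
using var conn = new SqlConnection(connectionString);
conn.Open();
using var cmd = new SqlCommand(ddl, conn);
cmd.ExecuteNonQuery(); // one-time setup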
Good luck :)
EDIT:
When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.

I intentionally left out the Table Partitioning feature, not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/Index partitioning in SQL Server is not a means of "sharding".

I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating. They were introduced in SQL Server 2012 and came with the restriction that the table could not be updated once the Column Store index was created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
Given that this needs to be cloud-based and that you use .Net / C#, if you really are only talking about a few tables (so far just the stated one and the implied "Link" table--source of LinkID) and hence might not need relationships or some of the other RDBMS features, then one option is to use Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development and even the initial stage of rolling out a project are made a bit easier by their low-end, free tier. As of 2013-11-04, the main DynamoDB page states that:
AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.
Here is some documentation: Overview, How to Query with .Net, and general .Net SDK.
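A hedged sketch of the linkId lookup with the AWS SDK for .NET (the table name and key schema are assumptions: linkId as the partition key and the datetime as the sort key):

using System;
using System.Collections.Generic;
using Amazon.DynamoDBv2;        // AWSSDK.DynamoDBv2 NuGet package
using Amazon.DynamoDBv2.Model;

var client = new AmazonDynamoDBClient(); // credentials/region from config

var request = new QueryRequest
{
    TableName = "Clicks",                        // hypothetical table
    KeyConditionExpression = "linkId = :linkId",
    ExpressionAttributeValues = new Dictionary<string, AttributeValue>
    {
        [":linkId"] = new AttributeValue { S = "link-123" }
    },
    ScanIndexForward = false // newest first, if datetime is the sort key
};

var response = await client.QueryAsync(request);
foreach (var item in response.Items)
    Console.WriteLine($"{item["datetime"].S} {item["ipAddress"].S}");

Note that ScanIndexForward also answers the ordering question for this design: DynamoDB can return items ordered by the sort key at query time, but any other ordering would have to be done in the application.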
BE AWARE: When looking into how much you think it might cost, be sure to include related AWS pieces, such as Network usage, etc.
I'm developing a website that (if successful) is going to have a rapidly growing database (maybe terabytes or more). Up to now I have always used SQL Server and didn't know anything about NoSQL.
I just found out about NoSQL while researching database size, and now I'm not sure it will fulfill my needs. Will I have the same power that I had with SQL Server?
My question may seem silly, as I'm a newbie in NoSQL, but I just wanted to know: if it doesn't support SQL queries, how can we do something like:
select *, (select name from cities where id = cityid) from users
How do you join tables? Do you use something like stored procedures, views, or similar?
That's a big question. NoSQL is a broad term, pretty much used to describe a bunch of non-relational data stores. They can range from MongoDB and RavenDB (which are document stores) to things like Redis and other variants of key/value stores. They all operate very differently from the SQL relational model (and the resulting T-SQL).
Document databases like Mongo or Raven typically have a C# driver that (in most cases) allows you to use LINQ queries across the datastore (Mongo example here on this thread and a RavenDB example on their documentation page). They are all specific to their engine, and different.
All these engines are not specifically designed to address the 'space' issue you are describing, but rather try to have a low-friction, fast way of interacting with a datastore. All these data stores will still grow in size in the same way SQL does when you throw massive amounts of data at them. SQL Server will handle massive databases, as will most of the document stores and other NoSQL variants. To be honest, I'd trust SQL Server more than the newer NoSQL stores simply because it has been field-tested for longer; however, as already stated, these document stores (and other stores like Apache Cassandra) can all handle large volumes of data. My only suggestion is to look at how you want to query the data. Document stores typically don't have relational-integrity concepts like foreign keys, and so normalisation rules do not apply. In addition, you need to assess your reporting needs, as SQL typically has an advantage in this area with more tooling. You can also choose a hybrid approach, using SQL for your relational data and document stores for other object blobs and the like.
I would suggest looking into how you want to access your data first, and then assessing which one best suits your needs. One thing to note, too, is that SQL has some great features, but often only in the enterprise versions. This costs a lot. Document databases tend to cost a LOT less for licensing, some being free, with many companies offering hosting, removing the need for you to worry about it. Finally, if going with SQL, I would suggest looking into sharding approaches from the very beginning, given the amount of data you will be processing, as this will make it much more manageable and also allow better query performance.
I've used MongoDB quite a bit. I'd suggest signing up for a sandbox account on MongoLab and playing around with it. There is an excellent C# driver for it too. NoSQL is not really relational, although you can relate documents via IDs. In your example, you'd store an array of cities (if I am reading your example correctly) against the User document and query that, or vice versa. There's less of a concern about data repetition because storage isn't as expensive as it used to be. I write my scripts (the equivalent of stored procs) in JavaScript and run them directly against Mongo; it's incredibly flexible and powerful. Of course, if you have tons of related objects, perhaps a relational database is your best bet.
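For a concrete feel, here is a hedged sketch with the official MongoDB .NET driver (the database, collection, and field names are made up). Instead of the subquery in the question, the city is embedded in the user document, so it comes back in the same read with no join:

using System;
using MongoDB.Bson;
using MongoDB.Driver; // MongoDB.Driver NuGet package

var client = new MongoClient("mongodb://localhost:27017");
var users = client.GetDatabase("mydb").GetCollection<BsonDocument>("users");

// Embed the city directly in each user document instead of joining.
await users.InsertOneAsync(new BsonDocument
{
    ["name"] = "Ada",
    ["city"] = new BsonDocument { ["id"] = 1, ["name"] = "London" }
});

// The equivalent of the question's subquery: the city name comes back
// with the user in one round trip.
var filter = Builders<BsonDocument>.Filter.Eq("city.name", "London");
foreach (var u in await users.Find(filter).ToListAsync())
    Console.WriteLine($"{u["name"]} lives in {u["city"]["name"]}");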
I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full-text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run-of-the-mill internet search engine. Lucene provides a very rich indexing and search platform, as well as some rich C# and .NET APIs for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. Your search index, depending on how frequently you update it, may not be 100% up-to-the-minute. Generally, that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
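Since the requirement is one box searching several tables, one hedged approach (assuming you flatten each row into a Lucene document with "name", "company", and "address" fields at index time) is Lucene.NET's MultiFieldQueryParser, which applies each term to every field:

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic; // Lucene.Net.QueryParser package
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Assumes an index already built with one document per row.
using var dir = FSDirectory.Open("search-index");
using var reader = DirectoryReader.Open(dir);
var searcher = new IndexSearcher(reader);

var parser = new MultiFieldQueryParser(
    LuceneVersion.LUCENE_48,
    new[] { "name", "company", "address" },
    new StandardAnalyzer(LuceneVersion.LUCENE_48));

// Each term ("microsoft", "wa", "gates") can match any of the fields.
var query = parser.Parse("microsoft wa gates");
foreach (var hit in searcher.Search(query, 10).ScoreDocs)
    Console.WriteLine(searcher.Doc(hit.Doc).Get("name"));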
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns, then a single (kludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is part of MS SQL 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself, you'll likely save a ton of time and energy.
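For instance, a hedged sketch of a ranked full-text query from C# (the table, columns, and connection string are hypothetical): FREETEXTTABLE matches inflectional forms of the words and returns a relevance RANK you can sort by.

using System;
using Microsoft.Data.SqlClient;

// Hypothetical: the Users table has a full-text index on (Name, Bio).
const string sql = @"
    SELECT u.UserId, u.Name, ft.RANK
    FROM FREETEXTTABLE(Users, (Name, Bio), @terms) AS ft
    JOIN Users AS u ON u.UserId = ft.[KEY]
    ORDER BY ft.RANK DESC;";

var connectionString = "Server=.;Database=MyDb;Integrated Security=true;";
using var conn = new SqlConnection(connectionString);
using var cmd = new SqlCommand(sql, conn);
cmd.Parameters.AddWithValue("@terms", "microsoft wa gates");
conn.Open();
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"{reader.GetInt32(0)} {reader.GetString(1)} (rank {reader.GetInt32(2)})");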
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
OK, there were a few requests for more info, so let me provide some.
I have several tables (i.e. users, companies, addresses), and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create an SP that receives the search terms as parameters and returns several "selects" (recordsets) to the calling program. It can return a select for each table, and you can do whatever you need with the data in your app code.
If you need to receive only one dataset, you can make a view using a UNION of the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated in the view and filtered.
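A hedged sketch of that view approach (all table, column, and view names are invented): consolidate the searchable columns into one shape with UNION ALL, then filter the view with a single parameterized query from C#.

using System;
using Microsoft.Data.SqlClient;

// One-time setup, e.g. in a deployment script.
const string createView = @"
    CREATE VIEW dbo.SearchItems AS
    SELECT UserId AS Id, 'user' AS Source, Name AS SearchText FROM Users
    UNION ALL
    SELECT CompanyId, 'company', CompanyName FROM Companies
    UNION ALL
    SELECT AddressId, 'address', City + ' ' + Street FROM Addresses;";

// At search time, one query filters the consolidated view.
const string search = @"
    SELECT Id, Source, SearchText
    FROM dbo.SearchItems
    WHERE SearchText LIKE @term;";

var connectionString = "Server=.;Database=MyDb;Integrated Security=true;";
using var conn = new SqlConnection(connectionString);
conn.Open();
using (var setup = new SqlCommand(createView, conn))
    setup.ExecuteNonQuery(); // run once; errors if the view already exists
using var cmd = new SqlCommand(search, conn);
cmd.Parameters.AddWithValue("@term", "%gates%");
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine($"[{reader.GetString(1)}] {reader.GetString(2)}");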