I am working on a classifieds website. I am converting from SQL Server to MongoDB/NoSQL, and probably Node.js. I really want to come up with a better solution for filtering on metadata (e.g. the filter pane on eBay). I currently have all my metadata in a table:
Metadata Value table:
AdvertID
MetaDataValue
ParentID
Level
So as you can see, the metadata takes a hierarchical structure. I then perform a GROUP BY count on this table to produce the filter counts for the left pane of the site. Then, as a user drills into the metadata options, I end up doing recursive GROUP BY count subqueries.
As you can imagine, this gets a little costly in time and performance. As I am going to move to a NoSQL option, I thought this would be a good time to investigate an alternative. I am open to any suggestions, but this is a hobby site, so I am looking for an open-source/free solution.
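Since the move is to MongoDB, one common alternative to the adjacency-list table is to embed each advert's metadata as materialized paths and compute the facet counts with a single aggregation. Below is a minimal sketch using the official .NET driver to stay consistent with the rest of this thread (the same pipeline translates directly to Node.js); the collection, field and path names are all invented for illustration.

using System;
using MongoDB.Bson;
using MongoDB.Driver;

class FacetCounts
{
    static void Main()
    {
        var client = new MongoClient("mongodb://localhost");
        var adverts = client.GetDatabase("classifieds")
                            .GetCollection<BsonDocument>("adverts");

        // Each advert embeds its metadata as materialized paths, e.g.
        // { _id: 1, metaPaths: ["Cars", "Cars/Ford", "Cars/Ford/Focus"] }.
        // Counting the direct children of the node the user drilled into
        // is then one aggregation, not recursive GROUP BY subqueries.
        var childPattern = "^Cars/Ford/[^/]+$";
        var pipeline = new[]
        {
            new BsonDocument("$match",
                new BsonDocument("metaPaths", new BsonDocument("$regex", childPattern))),
            new BsonDocument("$unwind", "$metaPaths"),
            new BsonDocument("$match",
                new BsonDocument("metaPaths", new BsonDocument("$regex", childPattern))),
            new BsonDocument("$group", new BsonDocument
            {
                { "_id", "$metaPaths" },                  // one bucket per child value
                { "count", new BsonDocument("$sum", 1) }
            })
        };

        foreach (var facet in adverts.Aggregate<BsonDocument>(pipeline).ToList())
            Console.WriteLine("{0}: {1}", facet["_id"], facet["count"]);
    }
}

An index on metaPaths keeps the first $match cheap, since prefix-anchored regexes can use it.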
Related
I need to create a website on Umbraco 4.7 where I need to compare products by price and about ten other properties. I need to search and sort the information, and there will be more than 200K products. I have tested with 30K and it already seems a little slow. So, my question: how should I build this system?
If I use Umbraco nodes, how can I increase the speed of searching a collection of 200K+ nodes?
Or should I combine SQL Server and Umbraco, so I can be sure of optimal speed when working with this amount of data?
If you have any experience or ideas on how to implement this, please give some hints, ideally with links to concrete implementations.
There are three architectural options if Umbraco is a given:
Firstly, add the products and product ranges as a tree structure in the content section - but given the bloat this will cause in app_data\umbraco.config, I reckon that 200,000 products will slow things down dreadfully.
Secondly, use a product catalogue package like uCommerce, where you catalogue your products and then use Umbraco to lay out the range, product and search pages, hooking into the uCommerce API to pull the products through from your SQL Server database. This will be more performant and there is good support, but uCommerce has a fee element for large installations (you can try it for nothing) and you won't be able to set up individual range management.
Finally, you could roll your own database and product maintenance system and add your own dedicated section - but that will be costly to develop.
Personally I would use uCommerce or a similar product/catalogue maintenance Umbraco add-on, as this would avoid slowing Umbraco down and give you a pre-written maintenance facility.
I concur with Amelvin as regards your options. Offloading the data to a custom database implementation using LinqToSql would be a valid option. The issue here is purely taking the strain off Umbraco.
WRT the search, I would seriously consider using Examine. It is designed to handle the amount of data you are talking about and more. It is built upon Lucene.Net and so is incredibly fast regardless of the amount of data.
As Digbyswift mentioned, use Examine to perform your searches; this is much faster than the standard search, and you are not hitting the database when performing a search.
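For reference, a minimal Examine search sketch against the API that ships with Umbraco 4.x; the provider and field names below come from ExamineSettings.config / ExamineIndex.config on a real install, so treat them as placeholders.

using System;
using Examine;

public class ProductSearch
{
    public static void Run(string term)
    {
        var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];

        var criteria = searcher.CreateSearchCriteria();
        var query = criteria.Field("nodeName", term)
                            .Or().Field("productCode", term)
                            .Compile();

        // Hits come straight from the Lucene index; the database is never touched.
        foreach (var result in searcher.Search(query))
            Console.WriteLine("{0} (score {1}): {2}",
                result.Id, result.Score, result.Fields["nodeName"]);
    }
}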
I'm developing an ASP.NET MVC3 app which will have a few hundred videos. I want to create a search system based on tags and other parameters, like the type of user that uploaded the video, the date of the video, the video category, etc.
I have been looking around and Lucene.NET seems like a really good tool for full-text search, but I don't know if it's the best solution for my project... I have read the tutorials, and they recommend keeping the search index to a minimum, but also that you should NOT hit your database to retrieve extra data that is not stored in the search index...
How can both be possible?
Let's take an example: I have a video row (as a concept; it is really held in several SQL tables) which has columns for the video id, the video name, the video file name, the full path, user id, user type, tags, creation date, video category, video subcategory, video location, etc. If I want to create a Lucene search index, I think I will have to put all that information in there so that later on I can query on every parameter, right?
This seems to me like a duplicate of the SQL database, but with the overhead of adding, editing and removing documents from the Lucene search index. Is this the standard scenario when using Lucene? All the examples I have seen with Lucene are based on a post id, post title and post body.
What do you think? Can you shed some light on this?
Yes, if you want to query multiple fields (including things like tags) from within Lucene, you'll need to make that data available to Lucene. It might sound like this is duplication, but it is not redundant duplication: it is restructuring the data into a very different layout, indexed for search.
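To make the restructuring concrete, here is a minimal indexing sketch for Lucene.Net 3.x. The field names, sample values, and the choice of what to store versus merely index are illustrative, not prescriptive.

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class Indexer
{
    static void Main()
    {
        int videoId = 42;                                   // sample data
        string videoName = "Herding Cats, Episode 42";
        string[] tags = { "cats", "herding" };

        var dir = FSDirectory.Open(new DirectoryInfo(@"C:\search-index"));
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        using (var writer = new IndexWriter(dir, analyzer, true,
                                            IndexWriter.MaxFieldLength.UNLIMITED))
        {
            var doc = new Document();
            // Stored but not tokenized: the key back into SQL for anything
            // you choose not to copy into the index.
            doc.Add(new Field("id", videoId.ToString(),
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
            // Tokenized for search and stored, so the results page can be
            // rendered without a database hit.
            doc.Add(new Field("name", videoName,
                              Field.Store.YES, Field.Index.ANALYZED));
            // Searchable but not stored: used only for matching.
            doc.Add(new Field("tags", string.Join(" ", tags),
                              Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }
}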
It should work fine; it is pretty much how search works here on stackoverflow (which is using lucene.net to perform the search).
It should be noted, however, that a few hundred is not a large sample: frankly you could do that any way you like, and it'll take about the same amount of time. Writing a complex SQL query should work, as should full-text search in the database (that is how Stack Overflow's search used to work), as should filtering objects in memory (at the few-hundred level, you could trivially just cache all the data, excluding the video frames, in memory).
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses. It's worked great for a long time, but we occasionally run into users providing queries that require full table scans on large tables or inefficient joins, etc.
Some clown might write something like:
select * from big_table where
Name in (select name from some_table where name like '%search everything%')
or name in ('a', 'b', 'c')
or price < 20
or price > 40
or exists (select 1 from some_other_table where col1 + col2 + col3 = 4)
or exists (select 1 from table_a, table_b)
Obviously, this is not a great way to query these tables: computed values, non-indexed columns, lots of ORs, and an unrestricted cross join of table_a and table_b.
But for a user, this may make total sense.
So what's the best way, if any, to allow internal users to supply a query to the database while ensuring that it won't lock a dozen tables and hang the webserver for 5 minutes?
I'm guessing there's a programmatic way in C#/SQL Server to get the execution plan for a query before it runs. If so, what factors contribute to cost? Estimated I/O cost? Estimated CPU cost? And what would be reasonable limits at which to tell the user that his query's no good?
EDIT: We're a market research company. We have thousands of surveys, each with their own data. We have dozens of researchers who want to slice that data in arbitrary ways. We have tools that let them construct "valid" filters using a GUI, but some "power users" want to supply their own queries. I realize this isn't standard or best practice, but how else can I let dozens of users query tables for the rows they want using arbitrarily complex, ever-changing conditions?
The premise of your question states:
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses.
I find this premise to be flawed on its face. I can't imagine a situation where I would allow users to do this. In addition to the problems you have already identified, you are opening yourself up to SQL Injection attacks.
I would highly recommend reassessing your requirements to see if you can't build a safer, more focused way of allowing your users to search.
However, if your users really are sophisticated (and trusted!) enough to be supplying WHERE clauses directly, they need to be educated on what they can and can't submit as a filter.
You can try using the following:
SET SHOWPLAN_ALL ON
GO
SET FMTONLY ON
GO
<<< Your SQL code here >>>
GO
SET FMTONLY OFF
GO
SET SHOWPLAN_ALL OFF
GO
Then you can parse through what you've got. As to where to draw the line on various things, that's going to take some experience. There are some things to watch for, but nothing that is cut and dried. Examining query plans is often more of an art than a science.
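If you want this programmatically from C#, as the question asks, one approach is SHOWPLAN_XML: the server returns the estimated plan instead of running the query, and the plan's StatementSubTreeCost is a reasonable single number to compare against a threshold. A sketch, with the connection string and threshold as placeholders you would tune empirically:

using System;
using System.Data.SqlClient;
using System.Linq;
using System.Xml.Linq;

class PlanCost
{
    // Returns the estimated subtree cost of a query without executing it.
    static double GetEstimatedCost(string connectionString, string sql)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                // Must be the only statement in its batch; subsequent
                // batches return the estimated XML plan, unexecuted.
                cmd.CommandText = "SET SHOWPLAN_XML ON";
                cmd.ExecuteNonQuery();

                cmd.CommandText = sql;
                string planXml = Convert.ToString(cmd.ExecuteScalar());

                cmd.CommandText = "SET SHOWPLAN_XML OFF";
                cmd.ExecuteNonQuery();

                XNamespace ns =
                    "http://schemas.microsoft.com/sqlserver/2004/07/showplan";
                return (double)XDocument.Parse(planXml)
                    .Descendants(ns + "StmtSimple").First()
                    .Attribute("StatementSubTreeCost");
            }
        }
    }

    static void Main()
    {
        double cost = GetEstimatedCost(
            "Server=.;Database=App;Integrated Security=true",   // placeholder
            "SELECT * FROM big_table");
        if (cost > 50)   // arbitrary cutoff; tune to your hardware
            Console.WriteLine("Rejected: estimated cost " + cost);
    }
}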
As others have pointed out though, I think that your problem goes deeper than the technology implications. The fact that you let unqualified people access your database in such a way is the underlying problem. From past experience, I often see this in companies where they are too lazy or too inexperienced to properly capture their application's requirements. I'm not saying that this is necessarily the case with your corporate environment, but that's what I've seen.
In addition to trying to control what the users enter (which is a losing battle; there will always be a new hire who comes up with an imaginative query), I'd look into Resource Governor; see Managing SQL Server Workloads with Resource Governor. You put the ad-hoc queries into a separate pool and cap the allocated resources. This way you mitigate the problem by limiting the amount of damage a bad query can do to other tasks.
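A minimal sketch of that setup, run once by an administrator. The pool and group names are invented, the caps are placeholders, and the classifier function that routes the web app's ad-hoc login into the group is omitted; see the linked article for the full recipe.

using System.Data.SqlClient;

class GovernorSetup
{
    static void Main()
    {
        const string setup = @"
            CREATE RESOURCE POOL AdHocPool
                WITH (MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 20);
            CREATE WORKLOAD GROUP AdHocGroup USING AdHocPool;
            ALTER RESOURCE GOVERNOR RECONFIGURE;";

        using (var conn = new SqlConnection(
            "Server=.;Database=master;Integrated Security=true")) // placeholder
        using (var cmd = new SqlCommand(setup, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}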
You should also consider giving access to the data by other means, like PowerPivot, and let users massage their data as hard as they want in their own Excel workbooks. Business power users love that, and the impact on the transaction processing server is minimal.
Instead of allowing employees to directly write (append to) queries, and then trying to calculate the query cost before running it, why not create some kind of advanced search or filter feature that does NOT produce SQL you cannot control?
In very large enterprise organizations, this is common practice on internal applications. Often during the design phase you will limit the criteria or put sensible limits on data ranges, but once the business gets hold of the app there will be calls from business unit management to remove the restrictions. In my organization this is a management problem, not an engineering issue.
What we did was profile all of the criteria and find the largest offenders: both the users and the types of queries that caused the most problems. We then put limitations on some of the queries. Some very expensive queries that were used on a regular basis were also added to the app, which cached the results and ran the queries when load was low. We also created canned, optimized queries for standard users and gave only specified users the ability to search for anything. Just a couple of ideas.
You could make a data model for your database and allow users to use SQL Reporting Services' Report Builder. It's GUI-based and doesn't require writing WHERE clauses, so there should be a limit to how much damage they can do.
Or you could warehouse a copy of the db for the purpose of user queries, update the db every hour or so, and let them go to town... :)
I have worked a few places where this also came up. What we ended up doing was NOT allowing users unconstrained access, and promising to have IT do their best to provide queries when needed. The issue was that the database is fairly complicated, and even if users could write grammatically and syntactically correct SQL, they don't necessarily understand the relationships between the tables. In other words, even if they could write their own SQL they would get the wrong answers. We convinced the users that the risk of making the wrong decision based on a flawed or incomplete understanding of the 200 tables in the database was too high. Better to get the right answer after a day than the wrong one instantly.
The other part of this is: what does IT do when user A writes a query and gets one answer, then user B writes what he thinks is the same query and gets a different answer? Is it IT's job to find the differences? To fix both pieces of SQL? The bottom line is that I would not allow them access. I would load the system with predefined queries, as others have mentioned, and try to train management on why that is the only way it will work in the long run.
If you have so much data and you want to give your customers the ability to analyse and view the information however they want to, I strongly recommend thinking about OLAP technologies.
I guess you've never heard of SQL injection attacks? What if the user enters a DROP DATABASE command after the WHERE clause?
This is the reason that direct SELECT permission is almost never given to users in the vast majority of applications.
A far better approach would be to engineer your application around use cases so that you are able to cover a reasonable percentage of requirements with specifically designed filters/aggregation/layout options.
There are a myriad of ways to do this so some analysis of your specific problem domain will definitely be required together with research into viable methods.
Whilst direct SQL access is the most flexible option for your users, long-running queries are likely to be just the start of your headaches. SQL injection is a big concern here, whether its source is malicious or simply misguided.
(Chad mentioned this in a comment, but I think it deserves to be an answer.)
Maybe you should copy data that needs to be queried ad-hoc into a separate database, to isolate any problems from the majority of users.
I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run-of-the-mill internet search engine. Lucene provides a very rich indexing and search platform, as well as some rich C# and .NET APIs for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. Your search index, depending on how frequently you update it, may not be 100% up to date. Generally that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
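For the query side, here is a minimal Lucene.Net 3.x sketch against that kind of side-band index; field names and the index path are illustrative.

using System;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

class VideoSearch
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        // Parse the user's raw terms across several fields at once.
        var parser = new MultiFieldQueryParser(Version.LUCENE_30,
            new[] { "name", "tags" }, analyzer);
        var query = parser.Parse("cats herding");

        using (var searcher = new IndexSearcher(
            FSDirectory.Open(new DirectoryInfo(@"C:\search-index"))))
        {
            var hits = searcher.Search(query, 20);   // top 20 by relevance
            foreach (var hit in hits.ScoreDocs)
            {
                var doc = searcher.Doc(hit.Doc);
                Console.WriteLine("{0} ({1:0.00})", doc.Get("name"), hit.Score);
            }
        }
    }
}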
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns, then a single (kludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is part of MS SQL 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself, you'll likely save a ton of time and energy.
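As a taste of what that buys you once the index exists, here is a hedged sketch of a parameterized CONTAINS query from C#. The table, columns and connection string are invented, and it assumes a full-text index already covers Name and Tags.

using System;
using System.Data.SqlClient;

class FullTextSearch
{
    static void Main()
    {
        const string sql = @"
            SELECT VideoId, Name
            FROM dbo.Videos
            WHERE CONTAINS((Name, Tags), @term);";

        using (var conn = new SqlConnection(
            "Server=.;Database=App;Integrated Security=true")) // placeholder
        using (var cmd = new SqlCommand(sql, conn))
        {
            // FORMSOF matches inflected forms: 'gate' also finds 'gates'.
            cmd.Parameters.AddWithValue("@term", "FORMSOF(INFLECTIONAL, gate)");
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine(reader["Name"]);
        }
    }
}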
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
Ok there were a few requests for more info so let me provide some.
I have several tables (i.e. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create an SP that receives the search terms as parameters and returns several "selects" (recordsets) to the calling program. It can return a select for each table, and you can do whatever you need with the data in your app code.
If you need to receive only one dataset, you can make a view that UNIONs the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated and filtered, as sketched below.
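A hedged sketch of that approach; every table, column and view name here is invented for illustration.

using System.Data;
using System.Data.SqlClient;

class ConsolidatedSearch
{
    // Assumed view, created once in SQL, consolidating the tables:
    //   CREATE VIEW dbo.SearchItems AS
    //     SELECT 'User' AS Source, UserId AS Id, UserName AS Text FROM dbo.Users
    //     UNION ALL SELECT 'Company', CompanyId, CompanyName FROM dbo.Companies
    //     UNION ALL SELECT 'Address', AddressId, City FROM dbo.Addresses;
    static DataTable Search(string connectionString, string term)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "SELECT Source, Id, Text FROM dbo.SearchItems WHERE Text LIKE @p",
            conn))
        {
            cmd.Parameters.AddWithValue("@p", "%" + term + "%");
            var results = new DataTable();
            conn.Open();
            results.Load(cmd.ExecuteReader());
            return results;   // one consolidated, filtered dataset
        }
    }
}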
Is there a good tag search system that I can use in my C# .NET prototype that will also run on ASP.NET?
When you say "tag search system", I am going to assume that you mean the ability in a social network to let your users tag content, thereby bubbling up the things that are most popular on your site by way of a tag cloud, and also allowing your users to navigate by way of tagged content, etc.?
I like to create a SystemObjects table which holds the various tables in my system that might have tags applied to them (or comments, or ratings, etc.), thereby allowing me to have a generic tagging system that can span my entire database. Then I would also have a SystemObjectTags table that would have a reference to the SystemObjectID (telling me which table has the record that I am interested in) as well as the SystemObjectRecordID (which tells me which specific record I am interested in). From there I have all of my detail data with regards to the tags and what was tagged. I then like to keep a running list of the tag-specific information in a Tags table, which keeps the unique tag (the string value of a tag), the TagID (which the SystemObjectTags table references), the count of that tag's usage across the system (a summed value of all uses), etc. If you have a high-traffic site, this data should be kept in your cache so that you don't hit the database too frequently.
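Modelled as plain classes for clarity; the three table names follow the description above, while the property names are illustrative.

// One row per taggable table in the system ("Videos", "Articles", ...).
class SystemObject
{
    public int SystemObjectID { get; set; }
    public string TableName { get; set; }
}

// One row per unique tag string, with its running usage count.
class Tag
{
    public int TagID { get; set; }
    public string Value { get; set; }
    public int UsageCount { get; set; }   // summed uses across the system
}

// One row per application of a tag to a specific record.
class SystemObjectTag
{
    public int SystemObjectID { get; set; }         // which table
    public int SystemObjectRecordID { get; set; }   // which record in it
    public int TagID { get; set; }                  // which tag
}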
With this subsystem in place you can then move on to the search capabilities. You have all the data you need in these three tables to easily perform filtering, searching, etc. However, you might find that there is so much data in here, and that the tables are so generic, that your searches are not as fast as they would be against a more optimized table structure. For this reason I suggest that you use a Lucene.NET index to hold all of your searchable data. Lucene.NET provides very fast read times and far more flexibility in search algorithms than SQL Server's freetext features do.
This would then allow you to provide filtering of your content by tags, searching for content by tag, tag counts, etc. Lucene.NET is a big scary topic though! Be prepared to do some reading to get you past the basics.
An option we are using is to put our "tags" in the Meta Keywords on the page, and then we use Bing for our search.
http://msdn.microsoft.com/en-us/library/dd251056.aspx
Our architect said it best. "Let the search engines do what they do best. Search."
You can limit the search to your site only, pull back the results and display them yourself...on your own page with your own formatting.
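A rough sketch of the idea. The endpoint and parameters below belong to the old Bing API 2.0 from the link above and are reproduced from memory, so verify them against the documentation before relying on them; the load-bearing part is the site: operator, which scopes results to your own domain.

using System;
using System.Net;
using System.Web;

class BingSiteSearch
{
    static void Main()
    {
        string appId = "YOUR-BING-APP-ID";        // placeholder credential
        string userTerms = "ford focus";

        // site: restricts the results to your own pages.
        string query = HttpUtility.UrlEncode("site:example.com " + userTerms);
        string url = string.Format(
            "http://api.bing.net/json.aspx?AppId={0}&Sources=Web&Query={1}",
            appId, query);

        using (var client = new WebClient())
        {
            // Parse the JSON and render the hits with your own markup.
            string json = client.DownloadString(url);
            Console.WriteLine(json);
        }
    }
}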
The only downside is that until your site is live and has been indexed, you can't fully test your search.