Umbraco 4.7 system architecture advice to work with 200K+ nodes - c#

I need to create a website on Umbraco 4.7 where I need to compare products by price and about 10 other properties. I need to provide search and sorting of this information, and the number of products will be more than 200K items. I have tested with 30K so far and it already seems a little slow. So, my question: how should I build my system?
If I use Umbraco nodes, how can I increase the speed of searching a collection of 200K+ nodes?
Or should I combine SQL Server and Umbraco, so that I can be sure I get optimal speed when working with this amount of data?
If you have any experience or ideas on how to implement this, please give some hints. Links to concrete implementations would be best.

There are three architectural options if Umbraco is a given:
Firstly, add the products and product ranges as a tree structure in the content section - but given the bloat this will cause in app_data\umbraco.config, I reckon that 200,000 products will slow things down dreadfully.
Secondly, use a product catalogue package such as uCommerce, where you can catalogue your products and then use Umbraco to lay out the range, product and search pages, hooking into the uCommerce API to pull the products through from your SQL Server database. This will be more performant and there is good support, but uCommerce has a fee element (for large installations - you can try it for nothing) and you won't be able to set up individual range management.
Finally, you could roll your own database and product maintenance system and add your own dedicated section - but that will be costly to develop.
Personally I would use uCommerce or a similar product/catalogue maintenance Umbraco add-on, as this would avoid slowing Umbraco down and give you a pre-written maintenance facility.

I concur with Amelvin with regard to your options. Offloading the data to a custom database implementation using LinqToSql would be a valid option. The issue here is purely taking the strain off Umbraco.
WRT the search, I would seriously consider using Examine. It is designed to handle the amount of data you are talking about and more. It is built upon Lucene.net and so is incredibly fast regardless of the amount of data.

As Digbyswift mentioned, use Examine to perform your searches; it is much faster than the standard search and you are not hitting the database when performing a search.
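To illustrate, here is a minimal sketch of searching the Examine (Lucene) index from code; the searcher name "ExternalSearcher" and the field aliases are assumptions, so adjust them to your ExamineSettings.config and product document type:

using Examine;

// Query the Examine index instead of iterating content nodes.
var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
var criteria = searcher.CreateSearchCriteria();
var query = criteria
    .Field("nodeTypeAlias", "Product")   // document type alias (assumed)
    .And()
    .Field("productName", "widget")      // property alias (assumed)
    .Compile();

var results = searcher.Search(query);
foreach (var result in results)
{
    // Each result exposes the indexed fields and a relevance score,
    // so sorting and paging can be done without touching the database.
    string price = result.Fields.ContainsKey("price") ? result.Fields["price"] : null;
}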

Related

Storing a large amount of analytical data

I normally use SQL Server and C# for all projects I do; however, I am looking at a project that could potentially span billions of rows of data, and I don't feel comfortable doing this in SQL Server.
The data I will be storing is
datetime
ipAddress
linkId
possibly other string related data
I have only ever dealt with relational databases before and hence was looking for some guidance on what database technology would be best suited for this type of data storage - one that could scale, and do so at a low cost (compared to sharding SQL Server).
I would then need to pull this data out based on linkId.
Also, would I be able to do ordering within the query to the DB, or would that be best done in the application?
EDIT: It will be cloud-based. Hence I was looking at SQL Azure, which I have used extensively; however, it just starts causing issues as the row count goes up.
Since you are looking for general guidance, I feel it is ok to provide an answer that you have prematurely dismissed ;-). Microsoft SQL Server can definitely handle this situation (in the generic sense of having a table of those fields and billions of rows). I have personally worked on a Data Warehouse that had 4 nodes, each of which had the main fact table holding 1.2 - 1.5 Billion rows (and growing) and responded to queries quickly enough, despite some aspects of the data model and indexing that could have been done better. It is a web-based application with many users hitting it all day long (though some periods of the day much harder than others). Also, that fact table was much wider than the table you are describing, unless that "possibly other string related data" is rather large (but there are ways to properly model that as well). True, the free Express edition might not meet your needs, but Standard Edition likely would and it is not super expensive. Enterprise has a nice feature for doing online index rebuilds, but that alone might not warrant the huge jump in license fees.
Keep in mind that with little to no description of what you are actually trying to accomplish with this data, it is hard for me to say that MS SQL Server will definitely meet your needs. But, given that you seemed to have ruled it out entirely on the basis of the large number of rows you might possibly get, I can at least speak to that situation: with good data modeling, good index design, and regular index maintenance, MS SQL Server can definitely handle billions of rows. Now, whether or not it is the best choice for your project depends on what you are trying to do, what the client is comfortable with maintaining, etc.
Good luck :)
EDIT:
When I said (above) that the queries came back "quickly enough", I meant anywhere from 1 to 90 seconds, depending on various factors. Keep in mind that these were not simple queries, and in my opinion, several improvements could be made to the data modeling and index strategy.
I intentionally left out the Table Partitioning feature not only because it is only in Enterprise Edition, but also because it is more often misunderstood and hence misused than understood and used properly. Table/Index partitioning in SQL Server is not a means of "sharding".
I also did not mention Column Store indexes because they are only available in Enterprise Edition. However, for projects large enough to justify the cost, Column Store indexes are certainly worth investigating. They were introduced in SQL Server 2012 and came with the restriction that the table could not be updated once the Column Store index was created. You can get around that, to a degree, using Table Partitioning, but in SQL Server 2014 that restriction will be removed.
Given that this needs to be cloud-based and that you use .Net / C#, if you really are only talking about a few tables (so far just the stated one and the implied "Link" table--source of LinkID) and hence might not need relationships or some of the other RDBMS features, then one option is to use Amazon's DynamoDB. DynamoDB is part of AWS (Amazon Web Services) and is a NoSQL database. Development and even the initial stage of rolling out a project are made a bit easier by their low-end, free tier. As of 2013-11-04, the main DynamoDB page states that:
AWS Free Tier includes 100MB of Storage, 5 Units of Write Capacity, and 10 Units of Read Capacity with Amazon DynamoDB.
Here is some documentation: Overview, How to Query with .Net, and general .Net SDK.
BE AWARE: When looking into how much you think it might cost, be sure to include related AWS pieces, such as Network usage, etc.
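As a rough illustration of what querying by LinkID looks like with the .NET SDK's document model (the table, key and attribute names below are hypothetical), something along these lines would work:

using System;
using Amazon.DynamoDBv2;
using Amazon.DynamoDBv2.DocumentModel;

// Sketch only: assumes a table named "LinkClicks" with LinkId as the hash key
// and Timestamp as the range key.
var client = new AmazonDynamoDBClient();          // region/credentials come from config
var table = Table.LoadTable(client, "LinkClicks");

var config = new QueryOperationConfig
{
    Filter = new QueryFilter("LinkId", QueryOperator.Equal, 12345),
    BackwardSearch = true,   // newest first, since items are ordered by the range key
    Limit = 100
};

var search = table.Query(config);
while (!search.IsDone)
{
    foreach (Document item in search.GetNextSet())
    {
        DateTime when = item["Timestamp"].AsDateTime();
        string ip = item["IpAddress"].AsString();
        // ... aggregate or display as needed
    }
}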

Issues with Lucene.NET version 3.0.3

I've been using Lucene.NET v3.0.3 on a project for several weeks now. It is a very good library, and FacetedSearch in particular is wonderful; but there are some points I need to raise regarding this version, and I hope someone can tell me the best practice to tackle them:
It does not support nested documents (relations between documents), as the latest Java Lucene versions do. For example, in my domain model I have (Request, Applicant), where one Request contains many Applicants.
a. In the indexing phase I indexed one Request and one Applicant per document, in order to be able to search particular information on both Request and Applicant; but this causes:
redundant Request information across different documents,
difficulty using faceted search on (Request) over such documents.
Can anybody tell me if there is any way (plugin, code) to handle these issues, without using the Solr library?
How can I return unique (distinct) results? Is the only way to return the whole result set and then de-duplicate it in code? That creates a performance problem on 1 million documents.
Is there any implementation of an extra cache level, for example caching a document field (requestID) for faster querying?
Any news regarding the next Lucene.NET release date?
Is there any implementation of nested query results across different index files?
If you can map your relationships to hierarchies, you might look at my Stupid Lucene Tricks: Hierarchies (edit: updated link) which talks about using path enumerations to express and search hierarchies in Lucene.
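The gist of that approach, sketched below with hypothetical field names, is to store the parent path on each child document and match whole subtrees with a PrefixQuery:

using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Each Applicant document carries the path of its parent Request,
// e.g. "/request/123/applicant/456" (the path scheme is made up).
var doc = new Document();
doc.Add(new Field("path", "/request/123/applicant/456",
                  Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("applicantName", "Jane Doe",
                  Field.Store.YES, Field.Index.ANALYZED));
// writer.AddDocument(doc);   // IndexWriter set up elsewhere

// Later, every document under Request 123 can be found in one query:
var query = new PrefixQuery(new Term("path", "/request/123/"));
// var hits = searcher.Search(query, 100);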

handling large datasets in web api & odata

I have been working with ASP.NET Web API over recent weeks with great success. It has really assisted me in producing an interface for mobile clients to program against over HTTP.
I reached a point where I need some assistance.
I have a new endpoint which will scan a database and could return 100K results. I am using OData to filter the data and return a paginated set of it.
As this could happen for multiple requests, I am concerned about performance. Returning 100K records from the database every time is not ideal. So I have some ideas.
The first is to cache the 100K results and let OData do its magic on them every time. I am working with the AppFabric distributed cache as it's a load-balanced environment. However, caching such an amount of data in AppFabric could result in memory complications, so I think I am best avoiding this.
The next option is to forget about the magic of OData, send the filters I use into the database, and return only the required data each time. In other words, hit the DB every time.
I could look at using a caching handler like the one outlined in this article to cache in the HTTP cache -> http://byterot.blogspot.ie/2012/06/aspnet-web-api-caching-handler.html The drawback of this is that if the data gets updated via another system, which it may, the cached data is not expired.
Any other tips as to how I might handle this scenario: a large amount of data, filtered with OData in conjunction with Web API?
This is a question that's likely to result in a wide variety of answers. That said, let me put on my pre-MSFT hat and give you my two cents.
A lot of architecture questions are best answered with the consultant's answer: "It depends." The answer in your case depends on a few specific things. Some developers have a problem with caching layers because there are additional things to think about. An ACID-compliant database buys you a lot of insurance that you have at least a very finite amount of eventual consistency.
If it were me making this decision, I would be considering a few things:
How many rows am I returning on a regular basis?
Are they the same rows over and over?
How big is that in memory? (100k is really not that many rows; you're right about not wanting those 100k rows to hit the disk every time, but it's probably not a problem to keep them all in memory; SQL Server would probably do this for you anyway.)
What am I willing to deal with re: eventual consistency? Do I want some other software to deal with it? (What frequently scares people about caches are things like ensuring that invalidation and insertion get done properly and consistently from different applications/different places in the application.)
Given the information you've already provided (tiered architecture, willingness to try a distributed cache), I think you should pursue a caching layer. There are lots of good caches out there. AppFabric worked fine for us before I worked at Microsoft, but I've dealt with a variety of other caching layers as well.
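If you do stick with AppFabric, the basic get-or-load pattern is straightforward. A minimal sketch (the cache name, key format, ProductDto type and load delegate are all assumptions):

using System;
using System.Collections.Generic;
using Microsoft.ApplicationServer.Caching;

[Serializable]
public class ProductDto { public int Id; public string Name; public decimal Price; }

public static class ProductCache
{
    // One factory per app domain; creating it is expensive.
    private static readonly DataCacheFactory Factory = new DataCacheFactory();

    public static List<ProductDto> GetPage(int page, Func<int, List<ProductDto>> loadFromDb)
    {
        DataCache cache = Factory.GetCache("default");     // cache name is an assumption
        string key = "products:page:" + page;

        var cached = cache.Get(key) as List<ProductDto>;
        if (cached != null)
            return cached;

        var fresh = loadFromDb(page);                       // hit SQL only on a cache miss
        cache.Put(key, fresh, TimeSpan.FromMinutes(10));    // expiry bounds staleness
        return fresh;
    }
}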
Assuming you use Entity Framework, the best option would be to return EF's IQueryable directly. This way the magic of OData will work directly against your database: $filter, $top and $skip are mapped straight onto your SQL query.
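A minimal sketch of that approach (the controller, DbContext and entity are made up; the attribute is [Queryable] in the original Web API OData bits and [EnableQuery] in later versions):

using System.Linq;
using System.Web.Http;
using System.Web.Http.OData;             // [Queryable]; later versions: System.Web.OData, [EnableQuery]
using System.Data.Entity;

public class Product { public int Id { get; set; } public string Name { get; set; } public decimal Price { get; set; } }
public class CatalogContext : DbContext { public DbSet<Product> Products { get; set; } }

public class ProductsController : ApiController
{
    private readonly CatalogContext _db = new CatalogContext();

    // GET /api/products?$filter=Price lt 20&$orderby=Price&$top=50&$skip=100
    [Queryable(PageSize = 200)]          // server-side cap so a client can't pull everything
    public IQueryable<Product> Get()
    {
        // Returning IQueryable lets the OData query options compose onto the EF query,
        // so filtering and paging happen in SQL rather than in memory.
        return _db.Products;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _db.Dispose();
        base.Dispose(disposing);
    }
}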
The best way is a distributed cache, which you are already using. But the cache provider you are using, i.e. AppFabric, has some limitations - by limitations I mean feature limitations. Check out NCache, which is a mature and feature-rich third-party distributed cache provider.
If you want to understand the differences between NCache and AppFabric, check the YouTube link below, FYI:
http://www.youtube.com/watch?v=3CPi1QlskrU
The caching that I pointed out in the blog post http://byterot.blogspot.ie/2012/06/aspnet-web-api-caching-handler.html applies to HTTP caching, also known as output caching. The data itself is not actually cached on the server but on the client or mid-stream cache servers, so it is not suitable for what you have in mind.

Ensure "Reasonable" queries only

In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses. It's worked great for a long time, but we occasionally run into users providing queries that require full table scans on large tables or inefficient joins, etc.
Some clown might write something like:
select * from big_table where
Name in (select name from some_table where name like '%search everything%')
or name in ('a', 'b', 'c')
or price < 20
or price > 40
or exists (select 1 from some_other_table where col1 + col2 + col3 = 4)
or exists (select 1 from table_a, table_b)
Obviously, this is not a great way to query these tables with computed values, non-indexed columns, lots of OR's and an unrestricted join on table_a and table_b.
But for a user, this may make total sense.
So what's the best way, if any, to allow internal users to supply a query to the database while ensuring that it won't lock a dozen tables and hang the webserver for 5 minutes?
I'm guessing there's a programmatic way in C#/SQL Server to get the execution plan for a query before it runs. And if so, what factors contribute to cost? Estimated I/O cost? Estimated CPU cost? What would be reasonable limits at which to tell the user that his query's no good?
EDIT: We're a market research company. We have thousands of surveys, each with their own data. We have dozens of researchers who want to slice that data in arbitrary ways. We have tools to let them construct "valid" filters using a GUI, but some "power users" want to supply their own queries. I realize this isn't standard or best practice, but how else can I let dozens of users query tables for the rows they want using arbitrarily complex and ever-changing conditions?
The premise of your question states:
In our organization we have the need to let employees filter data in our web application by supplying WHERE clauses.
I find this premise to be flawed on its face. I can't imagine a situation where I would allow users to do this. In addition to the problems you have already identified, you are opening yourself up to SQL Injection attacks.
I would highly recommend reassessing your requirements to see if you can't build a safer, more focused way of allowing your users to search.
However, if your users really are sophisticated (and trusted!) enough to be supplying WHERE clauses directly, they need to be educated on what they can and can't submit as a filter.
You can try using the following:
SET SHOWPLAN_ALL ON
GO
SET FMTONLY ON
GO
<<< Your SQL code here >>>
GO
SET FMTONLY OFF
GO
SET SHOWPLAN_ALL OFF
GO
Then you can parse through what you've got. As to where to draw the line on various things, that's going to take some experience. There are some things to watch for, but nothing that is cut and dried; examining query plans is often more of an art than a science.
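As a sketch of how that might be driven from C# (the helper name and the use of TotalSubtreeCost as the yardstick are my assumptions, not a prescription): turn SHOWPLAN_ALL on for the connection, run the user's SQL (which then returns estimated plan rows instead of executing), and read off the cost before deciding whether to really run it.

using System;
using System.Data.SqlClient;

static decimal EstimateQueryCost(string connectionString, string userSql)
{
    using (var conn = new SqlConnection(connectionString))
    {
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            // SET SHOWPLAN_ALL must be the only statement in its batch.
            cmd.CommandText = "SET SHOWPLAN_ALL ON";
            cmd.ExecuteNonQuery();

            decimal totalCost = 0;
            cmd.CommandText = userSql;   // not executed while SHOWPLAN_ALL is on
            using (var reader = cmd.ExecuteReader())
            {
                do
                {
                    while (reader.Read())
                    {
                        object cost = reader["TotalSubtreeCost"];
                        if (cost != DBNull.Value)
                            totalCost = Math.Max(totalCost, Convert.ToDecimal(cost));
                    }
                } while (reader.NextResult());   // one result set per statement
            }

            cmd.CommandText = "SET SHOWPLAN_ALL OFF";
            cmd.ExecuteNonQuery();
            return totalCost;
        }
    }
}

// Usage: reject or warn above some threshold you tune by experience, e.g.
// if (EstimateQueryCost(cs, userSql) > 50) { /* tell the user to narrow the query */ }

Note that this does nothing about SQL injection; the warnings in the other answers still apply.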
As others have pointed out though, I think that your problem goes deeper than the technology implications. The fact that you let unqualified people access your database in such a way is the underlying problem. From past experience, I often see this in companies where they are too lazy or too inexperienced to properly capture their application's requirements. I'm not saying that this is necessarily the case with your corporate environment, but that's what I've seen.
In addition to trying to control what the users enter (which is a losing battle; there will always be a new hire who comes up with an imaginative query), I'd look into Resource Governor - see Managing SQL Server Workloads with Resource Governor. You put the ad-hoc queries into a separate pool and cap the allocated resources. This way you mitigate the problem by limiting the amount of damage a bad query can do to other tasks.
You should also consider giving access to the data by other means, like PowerPivot, and let users massage their data as hard as they want in their own Excel. Business power users love that, and the impact on the transaction processing server is minimal.
Instead of allowing employees to directly write (append to) queries, and then trying to calculate the query cost before running it, why not create some kind of Advanced Search or filter feature that is NOT writing SQL you cannot control?
In very large enterprise organizations this is a common practice on internal applications. Often during the design phase you will limit the criteria or put sensible limits on data ranges, but once the business gets hold of the app there will be calls from business unit management to remove the restrictions. In my organization this is a management problem, not an engineering issue.
What we did was profile all of the criteria, find the largest offenders (both the users and the types of queries that caused the most problems), and put limitations on some of the queries. Also, some very expensive queries that were used on a regular basis were added to the app, and the app cached the results and ran the queries when load was low. We also created canned, optimized queries for standard users and gave only specified users the ability to search for anything. Just a couple of ideas.
You could make a data model for your database and allow users to use SQL Reporting Services' Report Builder. It's GUI-based and doesn't require writing WHERE clauses, so there should be a limit to how much damage they can do.
Or you could warehouse a copy of the db for the purpose of user queries, update the db every hour or so, and let them go to town... :)
I have worked a few places where this also came up. What we ended up doing was NOT allowing users unconstrained access, and promising to have IT do their best to provide queries when needed. The issue was that the database is fairly complicated, and even if users could write grammatically and syntactically correct SQL, they don't necessarily understand the relationships between the tables. In other words, even if they could write their own SQL they would get the wrong answers. We convinced the users that the risk of making the wrong decision based on a flawed or incomplete understanding of the 200 tables in the database was too high. Better to get the right answer after a day than the wrong one instantly.
The other part of this is: what does IT do when user A writes a query and gets one answer, then user B writes what he thinks is the same query and gets a different answer? Is it IT's job to find the differences? To fix both pieces of SQL? Etc. The bottom line is that I would not allow them access. I would load the system with predefined queries, as others have mentioned, and try to explain to management why that is the only way it will work in the long run.
If you have that much data and you want to give your customers the ability to analyse and view the information however they want, I strongly recommend thinking about OLAP technologies.
I guess you've never heard of SQL injection attacks? What if the user enters a DROP DATABASE command after the WHERE clause?
This is the reason that direct SELECT permission is almost never given to users in the vast majority of applications.
A far better approach would be to engineer your application around use cases so that you are able to cover a reasonable percentage of requirements with specifically designed filters/aggregation/layout options.
There are a myriad of ways to do this so some analysis of your specific problem domain will definitely be required together with research into viable methods.
Whilst direct SQL access is the most flexible for your users, long-executing queries are likely to be just the start of your headaches. SQL injection is a big concern here, whether its source is malicious or simply misguided.
(Chad mentioned this in a comment, but I think it deserves to be an answer.)
Maybe you should copy data that needs to be queried ad-hoc into a separate database, to isolate any problems from the majority of users.

What's the best way to implement a search?

I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike
I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run-of-the-mill internet search engine. Lucene provides a very rich indexing and search platform, as well as some rich C# and .NET APIs for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You also have to write your own tool to manage the index. Your search index, depending on how frequently you update it, may not be 100% up to the minute. Generally, that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
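To give a feel for the moving parts, here is a minimal Lucene.Net 3.x sketch (the index path and field names are arbitrary); the indexing half would live in whatever side-band tool rebuilds or incrementally updates the index:

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\search-index"));
var analyzer = new StandardAnalyzer(Version.LUCENE_30);

// Indexing: one document per row pulled from the users/companies/addresses tables.
using (var writer = new IndexWriter(indexDir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var doc = new Document();
    doc.Add(new Field("company", "Microsoft", Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("state", "WA", Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("contact", "Gates", Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
    writer.Commit();
}

// Searching: parse the raw user input across the fields and take the top hits.
using (var searcher = new IndexSearcher(indexDir, true))
{
    var parser = new MultiFieldQueryParser(Version.LUCENE_30,
        new[] { "company", "state", "contact" }, analyzer);
    var hits = searcher.Search(parser.Parse("microsoft wa gates"), 10);
    foreach (var scoreDoc in hits.ScoreDocs)
    {
        var found = searcher.Doc(scoreDoc.Doc);
        // found.Get("company"), scoreDoc.Score, etc.
    }
}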
It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx
This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns then a single (cludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is part of MS SQL 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself you'll likely save a ton of time and energy.
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.
OK, there were a few requests for more info, so let me provide some.
I have several tables (i.e. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.
You can create an SP that receives the search terms as parameters and returns several "selects" (recordsets) to the calling program. It can return a select for each table, and you can do whatever you need with the data in your app code.
If you need to receive only one dataset, you can make a view using a UNION of the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated in the view and filtered.
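A rough sketch of consuming the multi-recordset version from C# (the procedure, parameter and column names are invented): read the first recordset, then move to the next with NextResult().

using System.Data;
using System.Data.SqlClient;

string connectionString = "Server=.;Database=AppDb;Integrated Security=true";   // placeholder

using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("usp_Search", conn) { CommandType = CommandType.StoredProcedure })
{
    cmd.Parameters.AddWithValue("@terms", "microsoft wa gates");
    conn.Open();

    using (SqlDataReader reader = cmd.ExecuteReader())
    {
        // First recordset: users
        while (reader.Read())
        {
            string userName = reader.GetString(reader.GetOrdinal("UserName"));
        }

        // Second recordset: companies (repeat NextResult for addresses, etc.)
        if (reader.NextResult())
        {
            while (reader.Read())
            {
                string companyName = reader.GetString(reader.GetOrdinal("CompanyName"));
            }
        }
    }
}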
