Issues with Lucene.NET version 3.0.3

I've been using Lucene.NET v3.0.3 on a project for several weeks. It is a very good library, and its faceted search is wonderful, but there are some points I need to raise about this version, and I hope someone can tell me the best practice for tackling them:
It does not support nested documents (relations between documents), as the latest versions of Lucene for Java do. For example, my domain model has (Request, Applicant), where one Request contains many Applicants.
a. In the indexing phase, I indexed one Request with one Applicant per document, so that I can search for particular information on the Request and on the Applicant as well, but this causes:
redundant Request information across different documents,
difficulty using faceted search on Request over such documents.
Can anybody tell me whether there is any way (plugin, code) to handle these issues, without using the Solr library?
How can I return unique (distinct) results? Is the only way to return the whole result set and then deduplicate it in code? That causes a performance problem on 1 million documents.
Is there any implementation of an extra cache level, for example caching a document field (requestID) for faster querying?
Any news regarding the next Lucene.NET release date?
Is there any implementation of nested query results across different index files?

If you can map your relationships to hierarchies, you might look at my Stupid Lucene Tricks: Hierarchies (edit: updated link) which talks about using path enumerations to express and search hierarchies in Lucene.
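The path-enumeration idea can be sketched without touching the Lucene API at all. The following is a minimal illustration, not code from the linked article: the class name, method name and path layout are all assumptions. It expands a document's path into every ancestor prefix, and each prefix would then be indexed as one untokenized term on that document.

```csharp
using System;
using System.Collections.Generic;

public static class PathEnumeration
{
    // Expands "/requests/42/applicants/7" into every ancestor prefix plus itself.
    // Each returned string is meant to be stored as one exact-match term.
    public static List<string> AncestorPaths(string path)
    {
        var result = new List<string>();
        var segments = path.Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);
        var current = "";
        foreach (var segment in segments)
        {
            current += "/" + segment;
            result.Add(current);
        }
        return result;
    }
}
```

With terms stored this way, an exact-match query on a hypothetical term such as "/requests/42" would match the request itself and every applicant document beneath it, because each of them carries that prefix.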

Related

Practical performance comparison of Neo4j and MSSQL for C# developers

Assume we have a website with a small social graph where people (say ~1M users) can "like" stuff, follow each other, comment on each other's posts, and so on (the usual scenario).
In .NET for this we have two options:
Using EF (currently 6.1) and MSSQL (v2012 or above) to implement the social graph (the hard way)
Using Neo4j (currently 2.1.4) and Neo4jClient (which as far as I know is the best driver for .NET users)
Given the above scenario, and the fact that Neo4j doesn't have a native driver for .NET and the current version of Neo4jClient (1.0.0.657) uses the REST API to connect to the database engine, which one would be faster for questions like "Who likes stuff like I do?" or "What would a person like (based on the people they follow)?" and other usual questions about social graphs?

You haven't specified that much information; your question may be likely to elicit a lot of opinion, but I'll try to give this a fair shake. (Disclaimer: I'm from the neo4j side of this, but I've worked with most of the other things you mention)
Your question has three elements I want to split apart:
Graph or Relational? (MSSQL vs. Neo4j)
Driver/Engineering issues (Neo4jClient/REST vs. EF/MSSQL)
Modeling practicalities (implementing the social graph "the hard way" vs. in neo4j)
Graph or Relational?
You should read another answer I posted about the general parameters of graph database and graph query performance. I won't recap all of that (since it's already on SO), but here's the executive summary: graph databases are very good and fast at path-associative queries where you need to traverse a bunch of edges. Those operations correspond to things in the relational world where you'd join a whole pile of tables together, or where the join depth is variable. In those situations, graph will be better than relational (performance wise). If you want to do bulk scans of users or single joins, you're probably better off with relational (again, see the other answer for more detail). On this criterion, I am inferring that you only really want to traverse one edge at a time, e.g. "Show me all of the stuff that Bob likes", and that you don't need to do deeper queries like "Show me everyone who is separated by 3-4 degrees from Bob".
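To make "variable join depth" concrete, here is a minimal in-memory sketch of the "N degrees from Bob" question as a breadth-first traversal. It is not Neo4j or Cypher code; the class, method and parameter names are assumptions chosen for illustration. In a relational schema each extra degree would typically mean one more self-join on the follows table.

```csharp
using System;
using System.Collections.Generic;

public static class SocialGraph
{
    // Breadth-first search over an in-memory adjacency list.
    // Returns every user reachable from 'start' within 'maxDegrees' hops,
    // mapped to the number of hops needed to reach them.
    public static Dictionary<string, int> WithinDegrees(
        Dictionary<string, List<string>> follows, string start, int maxDegrees)
    {
        var distance = new Dictionary<string, int> { { start, 0 } };
        var queue = new Queue<string>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            var user = queue.Dequeue();
            int hops = distance[user];
            if (hops == maxDegrees) continue; // do not expand past the limit
            List<string> neighbours;
            if (!follows.TryGetValue(user, out neighbours)) continue;
            foreach (var next in neighbours)
            {
                if (distance.ContainsKey(next)) continue; // already visited
                distance[next] = hops + 1;
                queue.Enqueue(next);
            }
        }
        distance.Remove(start); // the starting user is not part of the answer
        return distance;
    }
}
```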
Driver/Engineering Issues
Speed wise, it's generally known that the Java API is faster than the REST API for neo4j. Performance for the REST API would be variable, and would depend on a lot of other factors, like whether the DB is hosted on the same machine or how "network far" away it is. You always have extra overhead with REST that comes with things like HTTP and serializing/deserializing JSON, which you wouldn't have if you used the Java API. So all other things being equal (disclaimer: they never are ;) the REST API will generally be slower than something like EF.
Modeling Practicalities
Here, neo4j is going to win by a lot. With MSSQL, you'll have the ever-present object-relational impedance mismatch; neo4j lessens (but does not eliminate) those impedance mismatch problems. Modeling wise, neo4j is schemaless, which comes with lots of pros and cons. You can probably cobble together a working model faster with neo4j because your domain is fundamentally graph-y.

Umbraco 4.7 system architecture advice to work with 200K+ nodes

I need to create a website on Umbraco 4.7 where I need to compare products by price and by about 10 other properties. I need to search and sort this information, and the number of products will be more than 200K. So far I have tested with 30K and it seems a little slow. So my question is: how should I build my system?
By using Umbraco nodes? If so, how can I increase search speed over a collection of 200K+ nodes?
Or do I have to combine SQL Server and Umbraco, so that I can be sure of good speed with this amount of data?
If you have any experience or ideas on how to implement this solution, please give some hints. Links to concrete implementations would be best.

There are three architectural options if Umbraco is a given:
Firstly add the products and product ranges as a tree structure in the content section - but given the bloat this will cause in app_data\umbraco.config I reckon that 200,000 products will slow things down dreadfully.
Secondly use a product catalogue product like ucommerce, where you can catalogue your products and then use Umbraco to lay out the range, product and search pages, hooking into the ucommerce API to pull the products through from your SQL Server database. This will be more performant and there is good support, but ucommerce has a fee element (for large installations; you can try it for nothing) and you won't be able to set up individual range management.
Finally you could roll your own database and product maintenance system and add your own dedicated section - but that will be costly to develop.
Personally I would use ucommerce or a similar product/catalogue maintenance Umbraco add-on, as this would avoid slowing Umbraco down and give you a pre-written maintenance facility.

I concur with Amelvin with regards to your options. Offloading the data to a custom database implementation, using LinqToSql would be a valid option. The issue here is purely taking the strain off Umbraco.
WRT the search, I would seriously consider using Examine. It is designed to handle the amount of data you are talking about and more. It is built upon Lucene.net and so is incredibly fast regardless of the amount of data.

As Digbyswift mentioned, use Examine to perform your searches, this is much faster than the standard search and you are not hitting the database when performing a search.

LINQ - Select all in parent-child hierarchy

I was wondering if there is a neat way to do this that DOESN'T use any kind of while loop or similar, preferably one that would run against Linq to Entities as a single SQL round-trip, and also against Linq To Objects.
I have an entity - Forum - that has a parent-child relationship going on. That is, a Forum may (or in the case of the top level, may not) have a ParentForum, and may have many ChildForums. A Forum then contains many Posts.
What I'm after here is a way to get all the Posts from a tree of Forums - i.e. the Forum in question, and all its children, grandchildren etc. I don't know in advance how many sub-levels the Forum in question may have.
(Note - I know this example isn't necessarily a valuable use case, but the Forum object model is one that is familiar to most people, and so serves as a generic and accessible premise rather than my actual domain model.)

One possible way would be to store your actual data tables using a left/right tree (example here: http://www.sitepoint.com/hierarchical-data-database-2/ ). Note that the example is in MySQL/PHP, but it's trivial to implement. Using this, you can find all forums that fall within a parent's left/right values and, given that, retrieve all posts whose forum ID is IN those forum IDs.
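A minimal LINQ to Objects sketch of that lookup follows. The Forum and Post classes, and the Left/Right property names, are assumptions for illustration; the sketch also assumes the Left and Right values have already been assigned by whatever code maintains the tree.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Forum
{
    public int Id;
    public int Left;   // nested-set left value (assumed to be maintained elsewhere)
    public int Right;  // nested-set right value
}

public class Post
{
    public int Id;
    public int ForumId;
}

public static class NestedSetQueries
{
    // A forum is inside the subtree of 'root' when its Left value falls
    // between root.Left and root.Right (inclusive, so the root itself matches).
    public static List<Post> PostsInSubtree(
        IEnumerable<Forum> forums, IEnumerable<Post> posts, Forum root)
    {
        var forumIds = forums
            .Where(f => f.Left >= root.Left && f.Left <= root.Right)
            .Select(f => f.Id)
            .ToList();
        return posts.Where(p => forumIds.Contains(p.ForumId)).ToList();
    }
}
```

The predicate uses only range comparisons and Contains, with no loop over tree levels. Whether a given LINQ provider turns it into a single SQL statement is not verified here.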

I'm sure you might get a few proper answers regarding the Linq queries. I'm posting this as an advisory when it comes to the SQL side of things.
I had a similar issue with a virtual filesystem in SQL. I needed to be able to query files in folders recursively - with folders, of course, having a recursive parent-child relationship. I also needed it to be fast, and I certainly didn't want to be dropping back to client-side processing.
For performance I ended up writing stored procedures and inline functions - unfortunately much too complicated to post here (and I might get the sack for sharing company code!). The key, however, was to learn how to work with Recursive CTEs http://msdn.microsoft.com/en-us/library/ms186243.aspx. It took me a few days to nail it but the performance is incredible (they are very easy to get wrong though - so pay attention to the query plans).

RDF integration with C#

I'm working on a project where I have been asked to build a semantic search. The scenario is a database with a table containing 3 pieces of information: Doctor Name, Patient Name, and Date of Visit. I have been asked to create a form that contains 3 fields: Doctor, Patient and Date. When a user wants to search for a patient's doctor, for a doctor's patients, or for their visit dates, they can fill in any of the fields to retrieve the information from the database. I have done the coding in C#, using regular expressions for string manipulation and information retrieval. But the main task is that the search should work using RDF and URIs.
Now that I have done most of the coding, can someone help me with how to create the search using RDF and URIs? Is there a solution for this? How can I integrate RDF in C#, and is there any documentation?
My supervisor's requirement is a search that works with RDF. I mean that the details of patients (e.g. the patient's name), the doctor's name and the date would be in the form of URIs which locate the patient, doctor and date information in the database, so anyone trying to search for a doctor or patient can just enter the name in the corresponding field and retrieve the information. I'm attaching 2 snapshots of my code for your understanding.
Image 1: http://img29.imageshack.us/i/15035706.jpg
Image 2: http://img31.imageshack.us/img31/1117/86105845.jpg
The first image is where I enter all the details to the database and the second image is the search.
This is the overall idea of my project. Can you advise me on how this can be done?
I would be really grateful if someone could help me with this as soon as possible.

Doing an RDF and URI based search is going to be dependent on whether your data is in RDF in the first place. If it's not you've either got to convert it from its current form into RDF on the fly or permanently. To do it on the fly you could use a technology like D2R which maps relational databases to RDF http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/
There's some other Semantic Web C# stuff about, like Rowlex (http://rowlex.nc3a.nato.int/), which is more OWL based, or there's my own dotNetRDF library (http://www.dotnetrdf.org), but that's only just about to reach a first Alpha release so I wouldn't recommend it for any production systems yet. SemWeb, as Alex mentions, is pretty good and scales particularly well; the only disadvantage is that it's .NET 2.0, so you need a separate library if you want to do LINQ with it.
A question about your question...
Your question is unclear about what you mean by semantic search. Are you sure you actually mean to do an RDF search, or did someone just specify "semantic search" in the spec and you googled it and got articles about RDF? Semantic search doesn't necessarily imply a need for RDF; it could be that you actually want to do natural language search.
By this I mean that it could be that you want the ability to search for things like "patients of Dr Smith" and that your search engine should be able to interpret this as a search for patients where the doctor field corresponds to Dr Smith.
Equally I could be wrong and you could indeed be attempting to build something that sounds very like TimBL's example from his 2001 Scientific American article on the Semantic Web.
Edit
So as you do want to do proper RDF search then I would advise that you put your data into a Triple Store rather than a database and preferably use a Triple Store that provides SPARQL query so you can convert the inputs on your query form into a SPARQL query and query the Triple store with that.
Maybe take a look at Talis http://www.talis.com or Virtuoso http://www.openlinksw.com/virtuoso/
If you decide to use SemWeb then you could just use the Triple Store that it provides.
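Converting the three form inputs into a SPARQL query, as suggested above, might look roughly like the sketch below. The namespace and the predicate names (doctorName, patientName, visitDate) are hypothetical; a real data set defines its own vocabulary. The sketch only builds the query text and does not depend on any particular triple store or library.

```csharp
using System;
using System.Collections.Generic;

public static class VisitQueryBuilder
{
    // Hypothetical vocabulary, used only for illustration.
    const string Ns = "http://example.org/clinic#";

    // Escapes characters that would otherwise end or break the string literal.
    static string Literal(string value)
    {
        return "\"" + value.Replace("\\", "\\\\").Replace("\"", "\\\"")
                           .Replace("\n", "\\n").Replace("\r", "\\r") + "\"";
    }

    // Builds a SELECT query; empty form fields stay unconstrained variables.
    public static string Build(string doctor, string patient, string date)
    {
        var lines = new List<string>
        {
            "PREFIX c: <" + Ns + ">",
            "SELECT ?doctor ?patient ?date WHERE {",
            "  ?visit c:doctorName ?doctor ; c:patientName ?patient ; c:visitDate ?date ."
        };
        if (!string.IsNullOrWhiteSpace(doctor))
            lines.Add("  FILTER(STR(?doctor) = " + Literal(doctor) + ")");
        if (!string.IsNullOrWhiteSpace(patient))
            lines.Add("  FILTER(STR(?patient) = " + Literal(patient) + ")");
        if (!string.IsNullOrWhiteSpace(date))
            lines.Add("  FILTER(STR(?date) = " + Literal(date) + ")");
        lines.Add("}");
        return string.Join("\n", lines);
    }
}
```

Calling VisitQueryBuilder.Build("Dr Smith", "", null) yields a query with a single FILTER on ?doctor, so the patient and date columns come back for every visit to that doctor.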

You may be able to do what you need using LinqToRdf. LinqToRdf exposes two LINQ query providers (i.e. you will need .NET 3.5+) including one that produces standards compliant SPARQL queries.
Here's a typical LinqToRdf Query, which if you're familiar with LINQ to SQL, should be totally natural:
// The context points at a SPARQL endpoint; this URL belongs to the sample setup.
MusicDataContext ctx = new MusicDataContext(@"http://localhost/linqtordf/SparqlQuery.aspx");
// Filter, sort and page the tracks; the query provider translates this into SPARQL.
var q = (from t in ctx.Tracks
         where t.Year == "2006" &&
               t.GenreName == "History 5 | Fall 2006 | UC Berkeley"
         orderby t.FileLocation
         select new {t.Title, t.FileLocation}).Skip(10).Take(5);
foreach (var track in q)
{
    Console.WriteLine(track.Title + ": " + track.FileLocation);
}

I suggest you try RDFSharp (http://rdfsharp.codeplex.com/) because, as far as I can understand from your question, you probably need to quickly set up an RDF application capable of performing elementary triple-based searches like SUBJECT="xxx";PREDICATE=NULL;OBJECT="yyy".
Feel free to try it. Of course more powerful tools exist, but for your scenario I believe it is the simplest to apply.
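To show what an elementary triple-based search of that shape means, here is a minimal in-memory sketch in which a null argument acts as a wildcard. It does not use RDFSharp's API; the Triple class and the Match method are assumptions written only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Triple
{
    public string Subject;
    public string Predicate;
    public string Object;

    public Triple(string s, string p, string o)
    {
        Subject = s;
        Predicate = p;
        Object = o;
    }
}

public static class TripleSearch
{
    // A null argument matches any value in that position.
    public static List<Triple> Match(
        IEnumerable<Triple> triples, string subject, string predicate, string obj)
    {
        return triples.Where(t =>
            (subject == null || t.Subject == subject) &&
            (predicate == null || t.Predicate == predicate) &&
            (obj == null || t.Object == obj)).ToList();
    }
}
```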

Using semantic web technologies for the scenario you describe is overkill. However, if you are interested in a mature .NET library for working with Semantic Web standards in .NET and SQL, definitely take a look at Intellidimension's offerings.

A C# library for RDF which seems to be becoming quite popular in the community is LinqToRDF. The project was originated by Andrew Matthews and has been going since 2007, I think. The software is up on Google Code and can be found here:
LinqToRDF
Together with the library, there's also something called "LinqToRDF designer", which fits into Visual Studio and allows you to model RDF graphically.

What's the best way to implement a search?

I've got a requirement where a user enters a few terms into a search box and clicks "go".
Does anyone have any good resources on how to implement a dynamic search that spans a few database tables?
Thanks,
Mike

I'm gonna throw in my vote for Lucene. While SQL Server does provide full text indexing and some search capabilities, it is not the greatest search engine. In my experience, it does not provide the best results or result ranking until you have a significant volume of indexed items (tens of thousands to hundreds of thousands minimum).
In contrast, Lucene is explicitly a search engine. It is an inverted index, behaving much like your run of the mill internet search engine. Lucene provides a very rich indexing and search platform, as well as some rich C# and .NET API's for querying the indexes. There is even a LINQ to Lucene provider that will allow you to query a Lucene index with LINQ.
The one drawback to using Lucene is that you have to build an index, which is a side-band process that runs independently of the database. You have to write your own tool to manage the index as well. Your search index, depending on how frequently you update it, may not be 100% up to date. Generally, that is not a huge concern, but if you have the resources, the Lucene index could be incrementally updated every few minutes to keep things "fresh".
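For readers unfamiliar with the term, an inverted index maps each term to the documents that contain it. The toy sketch below shows the idea in plain C#; it is not Lucene's API or its ranking model, and the class name, tokenizer and scoring (a count of matching query terms) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TinyInvertedIndex
{
    // term -> ids of the documents that contain it
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    // Lower-cases the text and splits it on a few separator characters.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.ToLowerInvariant()
            .Split(new[] { ' ', ',', '.', ';' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public void Add(int docId, string text)
    {
        foreach (var term in Tokenize(text))
        {
            HashSet<int> ids;
            if (!postings.TryGetValue(term, out ids))
            {
                ids = new HashSet<int>();
                postings[term] = ids;
            }
            ids.Add(docId);
        }
    }

    // Returns ids of documents matching any query term, ordered by how many
    // distinct query terms each document contains (ties broken by id).
    public List<int> Search(string query)
    {
        var hits = new Dictionary<int, int>();
        foreach (var term in Tokenize(query).Distinct())
        {
            HashSet<int> ids;
            if (!postings.TryGetValue(term, out ids)) continue;
            foreach (var id in ids)
            {
                int count;
                hits.TryGetValue(id, out count); // count stays 0 when absent
                hits[id] = count + 1;
            }
        }
        return hits.OrderByDescending(h => h.Value).ThenBy(h => h.Key)
                   .Select(h => h.Key).ToList();
    }
}
```

Rows from several database tables could be added as separate documents, which is how one index can span a few tables. Keeping it in sync with the database is the side-band process described above.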

It is called Full-text Search.
http://msdn.microsoft.com/en-us/library/ms142571.aspx

This is a pretty loaded question given the lack of detail. If you just need a simple search over a few tables/columns then a single (kludgy) search SP may be enough for you.
That said, if you need more features such as:
Searching a large set of tables
Support for large amounts of data
Searching over forms of a word
Logical operations
etc
then you might want to look into Full-Text Search (which is part of MS SQL Server 2000 and above). The initial investment to get up to speed with Full-Text Search can be a bit off-putting, but compared to implementing the above features yourself you'll likely save a ton of time and energy.
Here are some Full-Text Search links to get you started:
Msdn Page
Initial Set Up
Set Up Video
Hope that helps.

Ok there were a few requests for more info so let me provide some.
I have several tables (i.e. users, companies, addresses) and I'd like a user to be able to enter something like this:
"microsoft wa gates"
and bring up a result list containing results for "gates", "microsoft", and "washington".
Lucene seems like it could be pretty cool.

You can create an SP that receives the search terms as parameters and returns some "selects" (recordsets) to the program that called it. It can return a select for each table, and you can do whatever you need with the data in your app code.
If you need to receive only one dataset, you can make a view using a UNION of the tables to consolidate the columns into a common schema, and then filter the view the same way. Your application will then receive a single dataset with all the information consolidated by the view and filtered.
