I'm trying to do a global search on the website (I'm using Sitecore 8.1) using Lucene and field boosting. The idea is that I want to search the content that is on the pages, but not all the pages have the same template, so I cannot know which fields I should be searching to see whether they contain the content I'm looking for.
I also want to integrate field boosting here, for which I haven't yet found any example.
Does anyone know if the way I'm trying to do it is a good idea, and can you point me in some direction?
What I'm trying to find out is how I should create my query, and how to use field boosting to sort my results.
You can boost the importance of specific fields.
For example, you may want to boost the value of specific fields, such as title or abstract. Set the boost attribute of the relevant /configuration/sitecore/contentSearch/configuration/DefaultIndexConfiguration/fieldMap/fieldNames/fieldName element, typically specified in the /App_Config/Include/Sitecore.ContentSearch.Lucene.DefaultIndexConfiguration.config Web.config include file. All indexes share this configuration by default.
You can also boost a field from within the Content Editor, in the Indexing section of the field's definition item. Field boosting applies at indexing time.
You can find more information here:
http://www.sitecore.net/learn/blogs/technical-blogs/john-west-sitecore-blog/posts/2013/04/sitecore-7-six-types-of-search-boosting.aspx
After you set the boost value and reindex your content, use Luke to check how your fields rank. My suggestion is not to boost fields, because it is not really relevant to the end user whether the text they are searching for is in the Title or the Abstract field.
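Back to the question of how to create the query: since your pages don't share a template, one option is to search against Sitecore's aggregated _content field, which the standard SearchResultItem class exposes as Content; the index-time boosts then feed into the relevance order of the results. A rough, untested sketch (the index name is an assumption; use whichever index you actually search):

    using System.Linq;
    using Sitecore.ContentSearch;
    using Sitecore.ContentSearch.Linq;
    using Sitecore.ContentSearch.SearchTypes;

    public SearchResults<SearchResultItem> GlobalSearch(string term)
    {
        // "sitecore_web_index" is a placeholder; substitute your own index name.
        var index = ContentSearchManager.GetIndex("sitecore_web_index");
        using (var context = index.CreateSearchContext())
        {
            return context.GetQueryable<SearchResultItem>()
                // _content aggregates the indexed field values, so this matches
                // regardless of which template a page uses.
                .Where(i => i.Content.Contains(term))
                // GetResults() returns hits ordered by score, which the
                // index-time field boosts influence.
                .GetResults();
        }
    }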
I have a scenario where a single document in a Lucene index could have multiple locations. The document is a representation of a Sitecore item and N location items that are assigned to it. A point and radius would be used to search for all documents that have at least one location in that radius. Other search criteria such as name and tagging would also be considered. The documents would need to be sorted by distance, using the closest matching location assigned to that document. I have used lucene.net.contrib.spatial for single points, but I can't quite piece together how multipoint would, or could, work.
I suggest you use this module, or modify it for your requirements: https://marketplace.sitecore.net/en/Modules/L/Lucene_Spatial_Search_Support.aspx
I don't know which version of Sitecore you are using; from the comments, it looks like the module doesn't work on Sitecore 8.
You can find the source code here:
https://github.com/aokour/Sitecore.ContentSearch.Spatial
After trying a bunch of different solutions, I've created a reverse tagging system.
The short of it is: I use Sitecore's links database to create a computed index field on each location, which stores the ID of each item that is tagged with that location. I then search locations first, and use the IDs from the location results as search parameters for a query over the content I am actually looking for.
I've outlined the full implementation here:
http://alextselevich.com/2016/08/performing-a-geospatial-search-with-lucene-on-a-document-that-has-multiple-locations/
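For a flavor of the first half (the computed field on each location), a sketch along these lines; the class name and details are hypothetical, and the blog post above has the real implementation:

    using System.Linq;
    using Sitecore;
    using Sitecore.ContentSearch;
    using Sitecore.ContentSearch.ComputedFields;
    using Sitecore.Data.Items;

    // Stores, on each location item, the IDs of all items that link to
    // (i.e. are tagged with) that location, found via the links database.
    public class TaggedWithLocationIds : IComputedIndexField
    {
        public string FieldName { get; set; }
        public string ReturnType { get; set; }

        public object ComputeFieldValue(IIndexable indexable)
        {
            var indexableItem = indexable as SitecoreIndexableItem;
            if (indexableItem == null)
                return null;

            Item location = indexableItem.Item;
            return Globals.LinkDatabase.GetReferrers(location)
                .Select(link => link.GetSourceItem())
                .Where(source => source != null)
                .Select(source => source.ID.ToString())
                .Distinct()
                .ToList();
        }
    }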
Sorry if I'm asking a question that already exists or has an easy solution; I'm just a newcomer to SharePoint.
The question relates to the SharePoint 2007 search engine.
I have a few crawled properties and corresponding managed properties. I need to crawl all of these properties except one.
I have my own XSL and use my own presentation of the search results, though it is quite similar to the native SharePoint presentation. There I use the built-in "Description" property, which shows the values of all my properties. That is fine, except that I need to exclude one property from the "Description" property in the search results.
Could you explain how to do this in code (.NET), or give me some advice and references?
Thanks in advance!
I already found the solution. In the SharePoint UI you can uncheck the "Include values for this property in the search index" checkbox on a crawled property, and its values will be removed from the "Description" property in the search results. It isn't hard to work out how to do this programmatically.
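For reference, a rough, untested sketch of doing that through the MOSS 2007 search administration object model; the site URL and the property name are placeholders, and the mapping of IsMappedToContents to that checkbox is my assumption:

    using Microsoft.Office.Server.Search.Administration;
    using Microsoft.SharePoint;

    using (SPSite site = new SPSite("http://myserver"))
    {
        SearchContext searchContext = SearchContext.GetContext(site);
        Schema schema = new Schema(searchContext);

        foreach (Category category in schema.AllCategories)
        {
            foreach (CrawledProperty property in category.GetAllCrawledProperties())
            {
                if (property.Name == "ows_MyProperty") // placeholder name
                {
                    // Programmatic counterpart of unchecking "Include values
                    // for this property in the search index" (my assumption).
                    property.IsMappedToContents = false;
                    property.Update();
                }
            }
        }
    }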
Thanks to all anyway.
I've been working on a project for the last few days, and there is a task in it that I don't know how to do: the project involves analyzing web pages to find tags that characterize each page.
To answer the comment asking what I mean by tags: by tags I mean keywords that summarize what the web page is about. For example, here on SO you write your own tags so people can find your question more easily. What I am talking about is building an algorithm that analyzes a web page and finds its tags from the text within the page.
I started by getting the text from the page -> accomplished
Generally, I'm looking for a way to find the keywords that capture what the web page is about.
However, I don't really know what to do next. Does anyone have a suggestion?
For a really basic approach, you could use the TF-IDF algorithm to find the most important words in your page.
A quick overview from Wikipedia:
The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
Once you find the most important words in your page, you can use them as tags.
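To make that concrete, here is a minimal, self-contained C# sketch of the TF-IDF scoring step; tokenization (lowercasing, stop-word removal) is assumed to have happened already:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class TfIdf
    {
        // Scores every term of one document against a corpus of documents.
        public static Dictionary<string, double> Score(
            IList<string> document, IList<IList<string>> corpus)
        {
            int n = corpus.Count;

            // Document frequency: in how many documents does each term occur?
            var df = new Dictionary<string, int>();
            foreach (var doc in corpus)
                foreach (var term in doc.Distinct())
                    df[term] = df.TryGetValue(term, out var c) ? c + 1 : 1;

            // Term frequency within the target document.
            var tf = document.GroupBy(t => t).ToDictionary(
                g => g.Key, g => (double)g.Count() / document.Count);

            // tf-idf = tf * log(N / df); the highest-scoring terms are
            // the candidate tags.
            return tf.ToDictionary(
                kv => kv.Key,
                kv => kv.Value * Math.Log(
                    (double)n / (df.TryGetValue(kv.Key, out var d) ? d : 1)));
        }
    }

Something like TfIdf.Score(page, allPages).OrderByDescending(kv => kv.Value).Take(5) would then give you five candidate tags.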
If you want to improve your tags and make them more relevant, there are many ways to proceed; for example, you could:
Extract a set of texts for which you already know the main tags.
For each of these texts, run the TF-IDF algorithm and build a vector from the terms with the highest scores.
Try to find a main direction across all these vectors (by running a PCA, for example, or any other machine learning tool).
Use the tag corresponding to that main direction (the largest component of the PCA) to represent the set of words.
Hope it's understandable and that it helps.
Typically you look for certain words surrounded by certain HTML. For example, titles are typically in an H tag such as <h1>.
If you parse a page for all of its H1 tags, then it stands to reason that the content following each tag is related. An example is this very page: it has an H1 tag surrounding the question title. This gives Google a hint that the page is about "algorithm", "analyzing", "web pages", etc.
The hard part is to determine context.
In our example here, the term "pages" is very generic and can relate to anything, whereas "web pages" is a bit more specific. You can handle this with an internal dictionary that is built up over time, based on term frequency, after analyzing a number of documents to find commonality. The frequency should provide a weighted value for determining the top X "tags" for a given page.
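As a hedged illustration of that weighting idea, assuming the HtmlAgilityPack library for parsing (the XPath expressions and the weight values are arbitrary choices of mine):

    using System;
    using System.Collections.Generic;
    using HtmlAgilityPack;

    // Score each word by the HTML element it appears in: the title and
    // headings count for more than ordinary body text.
    static Dictionary<string, double> WeightedTerms(string html)
    {
        var weights = new Dictionary<string, double>
        {
            { "//title", 10.0 }, { "//h1", 5.0 }, { "//h2", 3.0 }, { "//p", 1.0 }
        };

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var scores = new Dictionary<string, double>();
        foreach (var entry in weights)
        {
            var nodes = doc.DocumentNode.SelectNodes(entry.Key);
            if (nodes == null)
                continue; // no such elements on this page

            foreach (var node in nodes)
            {
                var words = node.InnerText.ToLowerInvariant().Split(
                    new[] { ' ', '\t', '\r', '\n', ',', '.', ';', ':' },
                    StringSplitOptions.RemoveEmptyEntries);
                foreach (var word in words)
                    scores[word] = scores.TryGetValue(word, out var s)
                        ? s + entry.Value
                        : entry.Value;
            }
        }
        return scores; // the highest-scoring words are the tag candidates
    }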
This is more of an Information Retrieval and Data Mining question. Reviewing some of Rao's lectures may help.
When you're spidering web pages, you're essentially trying to build an index. You do this by building a global term-frequency dictionary, where each word in the language (often stemmed to account for pluralization and other modifications) is stored as a key, and the number of times it occurs in the document as the value.
From there, you can use algorithms such as PageRank and Authorities and Hubs to do data analysis.
You can implement a number of heuristics:
Acronyms and words in all uppercase (see the sketch after this list)
Words that are not frequent everywhere, i.e. discard words that appear in all or most documents and favour the ones that appear relatively frequently only in this one.
Sequences of words that always appear in the same order in this document and possibly in others as well
etc.
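A sketch of the first heuristic; the length limits and the repeat threshold are arbitrary assumptions:

    using System.Linq;
    using System.Text.RegularExpressions;

    // Collect acronyms / all-uppercase words that occur at least twice,
    // on the theory that repeated acronyms are likely topic markers.
    static string[] AcronymCandidates(string text) =>
        Regex.Matches(text, @"\b[A-Z]{2,10}\b")
             .Cast<Match>()
             .Select(m => m.Value)
             .GroupBy(w => w)
             .Where(g => g.Count() >= 2)
             .Select(g => g.Key)
             .ToArray();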
I need to be able to remove the description text within search results (the part that displays a portion of an indexed document), but I only want this to affect a single library's documents (or sub-site). Is it possible to localize something like this? Through XSLT, the SharePoint object model, custom trimming, or anything else; maybe somehow intercept the index query results, strip out the relevant text, then pass them along.
One idea that almost worked was to wrap the srch-description div in the core web part's XSLT in an if statement that checks whether the item's URL contains my library's name; however, this XSLT change would have to go into every site that searches on my library, and that's not possible. I wonder if there's anything more I can do to remove srch-description or decouple it from my items.
Disclaimer: this is a suggestion - I have not tried this!
I suggest populating (and if necessary creating) a Description field in your document library. This field might contain some innocuous description text. Next, create a new SharePoint search content source pointing to the document library. Map the metadata crawled properties (Description -> ows_Description) and check the "Include values for this property in the search index" checkbox. You may also need to add a crawl rule to the original source to exclude your "special" document library.
I have a textbox and a button on one page. I want to enter a word in the textbox and click the button; after clicking the button, I want to display the names of the web pages containing the word entered in the textbox. Please tell me how to do this. I am using C#.
So you want to create a search engine internal to your website. There are a couple of different options.
You can use something like Google Custom Search, which requires no coding and uses Google's technology, which I think we can all agree does a pretty good job compared to other search engines. More information at http://www.google.com/cse/
Or you can implement it in .NET, which I will try to give some pointers about below.
A search engine in general consists of (some of) the following parts:
an index which is searched against
a query system which allows searches to be specified and results shown
a way to get documents into the index, like a crawler or some event that's handled when documents are created/published/updated.
These are non-trivial things to create, especially if you want a rich feature set like stemming (returning documents containing plural forms of search terms), highlighting results, and indexing different document formats like PDF, RTF, HTML, etc., so you want to use something already made for this purpose. That would leave only the task of connecting and orchestrating the different parts and writing the flow-control logic.
You could use Lucene.Net, an open source project with a lot of features. http://usoniandream.blogspot.com/2007/10/tutorial-implementing-lucenenet-search.html explains how to get started with it.
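To give a flavor of what that looks like, here is a minimal Lucene.Net 3.x sketch; the index path, page name, and field names are made up, and in a real site you would build the index once and only run the search part on each button click:

    using System;
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    var indexDir = FSDirectory.Open(new DirectoryInfo(@"C:\myindex"));
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);

    // Indexing: add one document per web page (repeat for every page).
    using (var writer = new IndexWriter(indexDir, analyzer, true,
                                        IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var doc = new Document();
        doc.Add(new Field("name", "About.aspx",
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("content", "the text extracted from the page ...",
                          Field.Store.NO, Field.Index.ANALYZED));
        writer.AddDocument(doc);
    }

    // Searching: print the names of the pages containing the typed word.
    string searchText = "word"; // whatever the user typed into the textbox
    using (var searcher = new IndexSearcher(indexDir, true))
    {
        var query = new QueryParser(Version.LUCENE_30, "content", analyzer)
                        .Parse(searchText);
        foreach (var hit in searcher.Search(query, 10).ScoreDocs)
            Console.WriteLine(searcher.Doc(hit.Doc).Get("name"));
    }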
The other option is the Microsoft Indexing Service, which comes with Windows, but I would advise against it since it's difficult to tweak to work the way you want, and the results are sub-optimal in my opinion.
You are going to need some sort of backing store plus full-text indexing. To the best of my knowledge, C# alone is not enough.