Having an issue with diacritics and Solr Search - c#

I am working with a .NET MVC application and Apache Solr. I have two fields indexed in Solr: Name and Category. Some values in both fields contain diacritics, and I indexed them using this encoding method:
HttpUtility.UrlEncode()
The reason I indexed them encoded is that I want to display Category as facets.
So all these values are stored in Solr in encoded form. During search, I encode the search term and then query Solr with it, and that returns results.
The problem is that if I search for the same word without the diacritic, I get no results, because the word is stored in encoded form.
Is there any way to solve this?

Create a new field category_norm and add a normalizing analyzer chain to it (I think the example schema has one for text), then use a copyField to automatically copy the content from your category into the new field.
Or you could turn it around and introduce category_facet for faceting with the raw value and let the field name have the search-version, again using copyField to keep them synchronized.
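A sketch of what the first suggestion might look like in the schema. The field type name text_folded is hypothetical, and the snippet assumes that the category field receives the raw text rather than the URL-encoded value (folding cannot do anything useful with "%C3%A9"). ASCIIFoldingFilterFactory is one of the stock Solr filters; it maps accented characters to their ASCII equivalents, so "café" and "cafe" end up as the same indexed term:

```xml
<!-- Hypothetical field type: tokenizes, lowercases, and folds diacritics to ASCII -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Search against category_norm; keep faceting on the original category field -->
<field name="category_norm" type="text_folded" indexed="true" stored="false"/>
<copyField source="category" dest="category_norm"/>
```

Because the same analyzer runs at query time, a search for the unaccented word against category_norm should match the accented value, while the facet labels still come from the untouched field.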

Related

Reverse RegExp from user entered string ( C#)

Is it possible to generate regular expressions from a user entered string? Are there any C# libraries to do this?
For example a user enters a string e.g. ABCxyz123 and the C# code automatically generates [A-Z]{3}[a-z]{3}\d{3}.
This is a simple string but we could have more complicated strings like
MON-0123/AB/5678-abc 2/7
Or
1234-678/abc::1234ABC?246
I already have a string tokeniser (from a previous stackoverflow question) so I could construct a regex from the list of tokens.
But I was wondering if there is a lib or C# code out there that’ll do it.
Edit: Important, I should have also said: it's not the actual characters in the string that are important, but the type of each character and how many there are.
E.g. a user could enter a "pattern" string of ABCxyz123.
This would be interpreted as
3 upper case alphas followed by
3 lower case alphas followed by
3 digits
So other users (once the pattern is compiled) must enter strings that match the pattern [A-Z]{3}[a-z]{3}\d{3}, e.g. QAZplm789.
It's the format of user-entered strings that needs to be checked, not the actual content, if that makes sense.
Jerry has a related link
creating a regular expression for a list of strings
There are a few other links off this.
I'm not trying to do anything complicated, e.g. NLP.
I could use the C# expression builder and dynamic LINQ at a push, but that seems overkill and a code-maintenance nightmare.
I'll write my own "simple" regex builder from the tokenized string.
Example Use Case:
An admin office user where I work could set up the string patterns for each field by typing a sample string. My code converts this to a regex, and I store these in a database.
E.g. field one requires 3 digits at the start. If there are 2 digits, send to workflow 1; if 3, send to workflow 2. I could simply check the number of chars by substr or whatever, but this would be a concrete solution.
I am trying to do this generically for multiple documents with multiple fields. Also, each field could have multiple format checkers.
I don't want to write specific C# checks for every single field in numerous documents.
I'll get on with it, should keep me amused for a couple of days.
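The "simple" regex builder described above could be sketched roughly as follows. This is only one possible reading of the requirement: the names PatternBuilder and FromSample are made up for illustration, the ^ and $ anchors are an assumption (so the whole input must match), and [0-9] is used instead of \d because .NET's \d also matches non-ASCII digits by default. Runs of the same character type are collapsed into class{count}, and everything else is kept as an escaped literal:

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class PatternBuilder   // hypothetical helper, not an existing library
{
    // Returns the regex class for a character's "type",
    // or an empty string when the character should be kept as a literal.
    static string ClassOf(char c)
    {
        if (c >= 'A' && c <= 'Z') return "[A-Z]";
        if (c >= 'a' && c <= 'z') return "[a-z]";
        if (c >= '0' && c <= '9') return "[0-9]";
        return "";
    }

    // Builds an anchored regex from a sample string such as "ABCxyz123".
    public static string FromSample(string sample)
    {
        var sb = new StringBuilder("^");
        int i = 0;
        while (i < sample.Length)
        {
            string cls = ClassOf(sample[i]);
            if (cls.Length == 0)
            {
                // Separators like '-', '/' or '?' are matched literally.
                sb.Append(Regex.Escape(sample[i].ToString()));
                i++;
                continue;
            }

            // Count how many consecutive characters share this class.
            int start = i;
            while (i < sample.Length && ClassOf(sample[i]) == cls) i++;
            sb.Append(cls).Append('{').Append(i - start).Append('}');
        }
        return sb.Append('$').ToString();
    }
}
```

With this sketch, PatternBuilder.FromSample("ABCxyz123") yields ^[A-Z]{3}[a-z]{3}[0-9]{3}$, and Regex.IsMatch("QAZplm789", thatPattern) is true. The generated strings are what would be stored in the database per field.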

How can I use periods in MailMerge Field Names?

I have a List of Merge Field names passed into my application, along with a Word Document. The Merge Fields include periods and underscores, in the form "Here.Is.An.Example.Merge_Field". These separators are required - the characters may be able to be changed, but they cannot be removed altogether.
Several of these Merge Fields are contained within a datasource that is created for the document. The datasource has no data save for the merge field names - the merge itself takes place elsewhere in the process.
I attach the document to the datasource as below:
WordApplication.MailMerge.OpenDataSource(DataFilePath);
This loads the Merge Fields into the menu as desired, but all the periods have gone. I now have "HereIsAnExampleMerge_Field", which causes issues at other points in my application.
How can I prevent Word from removing these characters?
I don't think you can, because AFAIK the merge field names are modified to be valid bookmark names, which have certain restrictions.
(See, e.g. What are the limitations for bookmark names in Microsoft Word? for a discussion, except in this case I think Word will insert an "M" before an initial "_". Plus, names over 40 characters long are mangled to produce unique names.)
So what to do really depends on what issues you are facing.
You can retrieve the fieldnames that Word actually uses (i.e. its mangled names) from either ActiveDocument.MailMerge.DataSource.DataFields or .FieldNames. AFAIK these are in the same sequence as the fields in the data source, even though Word sometimes rearranges fields for display in the Edit Recipients dialog (e.g., it sorts field names that it considers to be address field names in a certain sequence). So that should allow you to match its field names with the originals.
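A minimal sketch of that matching step, assuming the two lists really are in the same order (which, as noted, is an observation rather than documented behavior). MergeFieldMapper and MapByPosition are hypothetical names; the wordNames list is assumed to have been filled beforehand by reading the Name of each entry in ActiveDocument.MailMerge.DataSource.DataFields:

```csharp
using System;
using System.Collections.Generic;

static class MergeFieldMapper   // hypothetical helper
{
    // Pairs each name Word reports with the original name at the same position.
    public static Dictionary<string, string> MapByPosition(
        IList<string> originalNames, IList<string> wordNames)
    {
        if (originalNames.Count != wordNames.Count)
            throw new ArgumentException("Both lists must contain the same number of fields.");

        var map = new Dictionary<string, string>();
        for (int i = 0; i < wordNames.Count; i++)
        {
            // Key: Word's mangled name; value: the name as originally supplied.
            map[wordNames[i]] = originalNames[i];
        }
        return map;
    }
}
```

Looking up "HereIsAnExampleMerge_Field" in the resulting dictionary would then give back "Here.Is.An.Example.Merge_Field" for the rest of the application to use.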
Alternatively, if your code needs to insert { MERGEFIELD } fields in a Mail Merge Main Document and knows the sequence of the fields in the data source, you can use field numbers (1 for the first field retrieved by Word etc.), e.g. { MERGEFIELD 1 }. But beware, as I have never seen that facility documented anywhere by Microsoft.

ASP.Net Resource Strings in Link format display

The requirement is as follows. I have localized strings in my ASP.NET web application, and I have an aspx page. Say there is a string key-value pair in the resource file. The value is translated into different languages and stored in different resource files, e.g. Strings.fr-FR.resx. Say the value in the resource file is "Hello World", translated into each language.
In my aspx page, I want to retrieve the string from the resource file and display it on the page, but I want only "World" to be a link. How can I do it? If I put the entire string in an anchor tag, then the whole of "Hello World" becomes a link.
Again, my question is how to display only "World" as a link after retrieving the string from the resource file.
Thanks in advance.
Normally you can't do that. If you need a specific string to be localized and you create a resource element for that purpose, then you should treat it as an atomic element. Therefore, create a separate "hello" resource item.
If you really need to split that string, you can call .ToString(), then .Split() on whitespace, and finally take the second element with [1].
But I don't advise you to do this, because in some languages the word corresponding to "world" may come first, or the phrase may use a different number of words. Even if you have only N languages, all translating "hello world" with 2 words and with "world" as the second word, I'd not do that, because you are creating a strong coupling between the form of the string and its meaning, which is a source of bugs when new languages are added or the string changes.
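To make the fragility concrete, here is a small sketch of the Split approach described above. LinkWord and SecondWord are hypothetical names, and the French string is just an illustrative translation, not taken from any real resource file:

```csharp
using System;

static class LinkWord   // hypothetical helper
{
    // Takes the second whitespace-separated word of a localized string,
    // or an empty string when there is no second word.
    public static string SecondWord(string localized)
    {
        string[] parts = localized.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return parts.Length > 1 ? parts[1] : "";
    }
}
```

SecondWord("Hello World") returns "World" as intended, but SecondWord("Bonjour le monde") returns "le", so the link would land on the wrong word as soon as a translation uses a different word count or order.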

How to do partial word searches in Lucene.NET?

I have a relatively small index containing around 4,000 locations. Among other things, I'm using it to populate an autocomplete field on a search form.
My index contains documents with a Location field containing values like
Ohio
Dayton, Ohio
Dublin, Ohio
Columbus, Ohio
I want to be able to type in "ohi" and have all of these results appear and right now nothing shows up until I type the full word "ohio".
I'm using Lucene.NET v2.3.2.1 and the relevant portion of my code is as follows for setting up my query....
BooleanQuery keywords = new BooleanQuery();
QueryParser parser = new QueryParser("location", new StandardAnalyzer());
parser.SetAllowLeadingWildcard(true);
keywords.Add(parser.Parse("\"*" + location + "*\""), BooleanClause.Occur.SHOULD);
luceneQuery.Add(keywords, BooleanClause.Occur.MUST);
In short, I'd like to get this working like a LIKE clause similar to
SELECT * from Location where Name LIKE '%ohi%'
Can I do this with Lucene?
Try this query:
parser.Parse(query.Keywords.ToLower() + "*")
Yes, this can be done. But a leading wildcard can result in slow queries. Check the documentation. Also, if you are indexing the entire string (e.g. "Dayton, Ohio") as a single token, most of the queries will degenerate to leading-wildcard queries. Using a tokenizing analyzer like StandardAnalyzer (which I suppose you are already doing) will lessen the need for a leading wildcard.
If you don't want leading wildcards for performance reasons, you can try indexing n-grams. That way, there will not be any leading wildcard queries. An n-gram tokenizer (assuming only length 4) will create tokens for "Dayton Ohio" such as "dayt", "ayto", "yton" and so on.
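To illustrate what such n-gram tokens look like, here is a plain C# sketch that does not use any Lucene.NET classes. NGrams.Of is a hypothetical helper, and splitting on spaces and commas is an assumption about how the location strings should be broken into words; a real setup would do this inside the analyzer at index time:

```csharp
using System;
using System.Collections.Generic;

static class NGrams   // hypothetical helper, independent of Lucene.NET
{
    // Returns every substring of length n from each word in the text.
    public static List<string> Of(string text, int n)
    {
        var grams = new List<string>();
        string[] words = text.ToLowerInvariant().Split(
            new[] { ' ', ',' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string word in words)
        {
            // Slide a window of n characters across the word.
            for (int i = 0; i + n <= word.Length; i++)
                grams.Add(word.Substring(i, n));
        }
        return grams;
    }
}
```

NGrams.Of("Dayton, Ohio", 4) produces "dayt", "ayto", "yton", "ohio". Note that with a fixed length of 4, a three-letter input like "ohi" has no matching token, so an autocomplete index would probably want a range of lengths (for example 2 to 5) rather than a single size.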
It's more a matter of populating your index with partial words in the first place. Your analyzer needs to put the partial keywords into the index as it analyzes (and hopefully weight them lower than full keywords as it does).
Lucene index lookup trees work from left to right. If you want to search in the middle of a keyword, you have to break it up as you analyze. The problem is that partial keywords will usually explode your index size.
People usually use really creative analyzers that break words up into root words (taking off prefixes and suffixes).
Get deep into understanding Lucene. It's good stuff. :-)

validating user input tags

I know this question might sound a little cheesy but this is the first time I am implementing a "tagging" feature to one of my project sites and I want to make sure I do everything right.
Right now, I am using the very same tagging system as SO: space-separated tags, with dashes (-) joining multi-word tags. So when I am validating a user-input tag field, I am doing the following:
1. Check for an empty string (it cannot be empty)
2. Make sure the string doesn't contain particular characters (suggestions are welcome here)
3. Require at least one word
4. If there is a space (more than one word), split the string
5. For each part, insert into the db
Am I missing something here, or is this roughly OK?
Split the string at " ", iterate over the parts, make sure that they comply with your expectations. If they do, put them into the DB.
For example, you can use this regex to check the individual parts:
^[-\w]{2,25}$
This would limit allowed input to consecutive strings of alphanumerics, 2..25 characters long (plus "_", which is part of "\w", and "-", because you asked for it). This essentially removes any code injection threat you might be facing.
EDIT: In place of the "\w", you are free to take any more closely defined range of characters, I chose it for simplicity only.
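Put together, the split-then-check approach might look like this in C#. TagParser and TryParse are hypothetical names, and rejecting the whole input when any single tag fails is an assumption about the desired behavior. Note that in .NET "\w" also matches non-ASCII letters and digits, so swap in a narrower class such as [A-Za-z0-9_] if that matters:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagParser   // hypothetical helper
{
    // Same pattern as above: 2 to 25 word characters or dashes.
    static readonly Regex TagPattern = new Regex(@"^[-\w]{2,25}$");

    // Splits the input on spaces and validates every part.
    // Returns false (with an empty list) if the input is blank or any tag is invalid.
    public static bool TryParse(string input, out List<string> tags)
    {
        tags = new List<string>();
        if (string.IsNullOrWhiteSpace(input)) return false;

        // RemoveEmptyEntries absorbs leading, trailing, and repeated spaces.
        string[] parts = input.Split(
            new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string part in parts)
        {
            if (!TagPattern.IsMatch(part))
            {
                tags.Clear();
                return false;
            }
            tags.Add(part);
        }
        return tags.Count > 0;
    }
}
```

For example, "  c#  asp-net " is rejected because "c#" contains a character outside the allowed set, while "lucene  asp-net" yields the two tags "lucene" and "asp-net". Inserting the accepted tags should still go through parameterized queries.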
I've never implemented a tagging system, but am likely to do so soon for a project I'm working on. I'm primarily a database guy and it occurs to me that for performance reasons it may be best to relate your tagged entities with the tag keywords via a resolution table. So, for instance, with example tables such as:
TechQuestion
    TechQuestionID (pk)
    SubjectLine
    QuestionBody
TechQuestionTag
    TechQuestionID (pk)
    TagID (pk)
    Active (indexed)
Tag
    TagID (pk)
    TagText (indexed)
... you'd only add new Tag table entries when never-before-used tags were used. You'd re-associate previously provided tags via the TechQuestionTag table entry. And your query to pull TechQuestions related to a given tag would look like:
SELECT
    q.TechQuestionID,
    q.SubjectLine,
    q.QuestionBody
FROM
    Tag t
    INNER JOIN TechQuestionTag qt
        ON t.TagID = qt.TagID AND qt.Active = 1
    INNER JOIN TechQuestion q
        ON qt.TechQuestionID = q.TechQuestionID
WHERE
    t.TagText = @tagText
... or what have you. I don't know, perhaps this was obvious to everyone already, but I thought I'd put it out there, because I don't believe the alternative (redundant, indexed, text-tag entries) would query as efficiently.
Be sure your algorithm can handle leading/trailing/extra spaces with no trouble = )
Also worth thinking about might be a tag blacklist for inappropriate tags (profanity for example).
I hope you're doing the usual protection against injection attacks - maybe that's included under #2.
At the very least, you're going to want to escape quote characters and make embedded HTML harmless - in PHP, functions like addslashes and htmlentities can help you with that. Given that it's for a tagging system, my guess is you'll only want to allow alphanumeric characters. I'm not sure what the best way to accomplish that is, maybe using regular expressions.
