Regex word boundaries capturing wrong words - c#

I am having some difficulties trying to get my simple Regex statement in C# working the way I want it to.
If I have a long string and I want to find the word "executive" but NOT "executives" I thought my regex would look something like this:
Regex.IsMatch(input, string.Format(@"\b{0}\b", "executive"));
This, however, is still matching on inputs that contain only executives and not executive (singular).
I thought word boundaries in regex, when used at the beginning and end of your regex text, would specify that you only want to match that word and not any other form of that word?
Edit: To clarify what's happening, I am trying to find all of the Notes among Students that contain the word executive, ignoring words that merely contain "executive". As follows:
var studentMatches =
    Students.SelectMany(o => o.Notes)
        .Where(c => Regex.Match(c.NoteText, string.Format(@"\b{0}\b", query)).Success).ToList();
where query would be "executive" in this case.
What's strange is that while the above code will match on executives even though I don't want it to, the following code will not (i.e. it does what I expect):
foreach (var stu in Students)
{
    foreach (var note in stu.Notes)
    {
        if (Regex.IsMatch(note.NoteText, string.Format(@"\b{0}\b", query)))
            Console.WriteLine(stu.LastName);
    }
}
Why would nested foreach loops with the same regex code produce accurate matches while a LINQ expression seems to return anything that contains the word I am searching for?

Your LINQ query produces the correct result. What you see is what you have written.
Let's give the variables proper names to make it clear:
var noteMatches = Students.SelectMany(student => student.Notes)
    .Where(note => Regex.Match(note.NoteText, string.Format(@"\b{0}\b", query)).Success)
    .ToList();
In this query, SelectMany returns a flattened list of all notes, so the information about which note belonged to which student is lost.
Meanwhile, the sample code with foreach loops outputs information about the student.
I assume that you need a query like the following:
var studentMatches = Students.Where(student => student.Notes
        .Any(note => Regex.IsMatch(note.NoteText, string.Format(@"\b{0}\b", query))))
    .ToList();
However, it is not clear what result you want to obtain if the same student has notes containing both executive and executives.
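To make the difference between the two queries concrete, here is a minimal sketch. The Note and Student classes and the sample data are hypothetical stand-ins for the asker's model, and Regex.Escape is added only so that a query containing regex metacharacters is treated literally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical minimal model; the real classes are not shown in the question.
class Note { public string NoteText = ""; }
class Student { public string LastName = ""; public List<Note> Notes = new List<Note>(); }

static class BoundaryDemo
{
    // Hypothetical sample data: Smith has one singular and one plural note.
    public static List<Student> SampleStudents() => new List<Student>
    {
        new Student { LastName = "Smith", Notes = {
            new Note { NoteText = "Met the executive." },
            new Note { NoteText = "Emailed two executives." } } },
        new Student { LastName = "Jones", Notes = {
            new Note { NoteText = "Only executives here." } } },
    };

    static string Pattern(string query) => @"\b" + Regex.Escape(query) + @"\b";

    // Flattened: one entry per matching note; the owning student is not kept.
    public static List<Note> MatchingNotes(IEnumerable<Student> students, string query) =>
        students.SelectMany(s => s.Notes)
                .Where(n => Regex.IsMatch(n.NoteText, Pattern(query)))
                .ToList();

    // Per student: kept if ANY note matches; all of that student's notes stay attached.
    public static List<Student> MatchingStudents(IEnumerable<Student> students, string query) =>
        students.Where(s => s.Notes.Any(n => Regex.IsMatch(n.NoteText, Pattern(query))))
                .ToList();

    static void Main()
    {
        var students = SampleStudents();
        Console.WriteLine(MatchingNotes(students, "executive").Count);             // 1
        Console.WriteLine(MatchingStudents(students, "executive")[0].LastName);    // Smith
        Console.WriteLine(MatchingStudents(students, "executive")[0].Notes.Count); // 2
    }
}
```

The last line shows why a student-level result can appear to "match executives": Smith is returned because of the singular note, but the plural note is still part of that student's Notes collection.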

Related

how to get a value from json with just the index?

I'm making an app which needs to loop through Steam games.
Reading libraryfolders.vdf, I need to loop through and find the first value of each entry and save it as a string.
"libraryfolders"
{
"0"
{
"path" "D:\\Steam"
"label" ""
"contentid" "-1387328137801257092942"
"totalsize" "0"
"update_clean_bytes_tally" "42563526469"
"time_last_update_corruption" "1663765126"
"apps"
{
"730" "31892201109"
"4560" "9665045969"
"9200" "22815860246"
"11020" "776953234"
"34010" "11967809445"
"34270" "1583765638"
for example, it would record:
730
4560
9200
11020
34010
34270
I'm already using System.Text.Json in the program. Is there any way I could loop through and just get the first value using System.Text.Json, or would I need to do something different, as VDF doesn't separate the values with colons or commas?
That is not JSON, that is the KeyValues format developed by Valve. You can read more about the format here:
https://developer.valvesoftware.com/wiki/KeyValues
There are existing stackoverflow questions regarding converting a VDF file to JSON, and they mention libraries already developed to help read VDF which can help you out.
VDF to JSON in C#
If you want a quick and dirty way to read the file without any external library, I would probably use a regex and do something like this:
string pattern = "\"apps\"\\s+{\\s+(\"(\\d+)\"\\s+\"\\d+\"\\s+)+\\s+}";
string libraryPath = @"C:\Program Files (x86)\Steam\steamapps\libraryfolders.vdf";
string input = File.ReadAllText(libraryPath);
List<string> indexes = Regex.Matches(input, pattern, RegexOptions.Singleline)
    .Cast<Match>().ToList()
    .Select(m => m.Groups[2].Captures).ToList()
    .SelectMany(c => c.Cast<Capture>())
    .Select(c => c.Value).ToList();
foreach (string s in indexes)
{
    Debug.WriteLine(s);
}
See the regular expression explanation here:
https://regex101.com/r/bQSt79/1
It basically captures every occurrence of "apps" { } in group 0, does a repeating capture of the pairs of numbers between the curly brackets in group 1, and also captures the leftmost number of each pair in group 2. Generally a repeated capture group keeps only its last occurrence, but the .NET regex engine stores every capture, so we can still access all the values.
The rest of the code takes each match, the 2nd group of each match, the captures of each group, and the values of those captures, and puts them in a list of strings. Then a foreach will print the value of those strings to log.
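As a self-contained illustration of reading every capture of a repeated group, here is a sketch that runs against an inline sample instead of the real file. The pattern is a simplified variant of the one above (a single capture group, braces escaped), and the ExtractAppIds helper and the sample text are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AppIdDemo
{
    // Simplified pattern (an assumption, not the answer's exact regex):
    // group 1 is the app id, repeated once per "id" "size" pair.
    const string Pattern = "\"apps\"\\s*\\{\\s*(?:\"(\\d+)\"\\s+\"\\d+\"\\s*)+\\}";

    public static List<string> ExtractAppIds(string vdfText) =>
        Regex.Matches(vdfText, Pattern)
             .Cast<Match>()
             // Groups[1].Value would be only the LAST id; Captures holds all of them.
             .SelectMany(m => m.Groups[1].Captures.Cast<Capture>())
             .Select(c => c.Value)
             .ToList();

    static void Main()
    {
        // Hypothetical, shortened sample in the same shape as the file excerpt.
        string sample = "\"apps\"\n{\n\t\"730\"\t\"31892201109\"\n\t\"4560\"\t\"9665045969\"\n}";
        Console.WriteLine(string.Join(", ", ExtractAppIds(sample))); // 730, 4560
    }
}
```

For anything beyond a quick script, a dedicated KeyValues parser is likely to be more robust than a regex, since this sketch assumes the apps block contains only numeric key/value pairs.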

Elastic query to search eliminating spaces from mongodb data C#

I wanted to know if there is a way to search MongoDB data excluding space. My MongoDB has data such as -
"Element":"Ele1"
"Element":"Ele2"
"Element":"Ele 3"
My search query contains the strings without spaces, i.e. "Ele1,Ele2,Ele3", so when I pass "Ele3" to my search query it does not return any result. I am using the ElasticSearch.net NuGet package. My code is something like this:
var elResp = await data.MultiSearchAsync(m => m
.Query(q => q
.Nested(n => n
.Path("Element")
.Query(qq => qq
.Bool(b => b
.Must(m1 => m1
.Term("Element.keyword", "Ele1")))))))
Is there anything I can do, or do I have to create an extra field in MongoDB that stores the Element value without spaces so that I can check against that? Thanks
You'd achieve the best performance by creating an extra field without spaces; however, in MongoDB you can just use $regex to match values both with and without spaces.
db.collection.find({ "Element": { $regex: /Ele\s*3/ } })
where \s is a whitespace character and * means zero or more occurrences. The more places a space could appear, the more you should consider adding a field without any spaces (for performance reasons), but /E\s*l\s*e\s*3/ will still work for this case.
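If the search terms are not known in advance, a pattern like /E\s*l\s*e\s*3/ can be generated from the term. The sketch below uses only .NET's own Regex class to show the idea; BuildPattern is a hypothetical helper, and the ^ and $ anchors are an addition of this sketch (without them "Ele3" would also match a value such as "Ele 31"). Passing the resulting string to a MongoDB $regex filter is left out here.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class SpaceTolerantPattern
{
    // Allows optional whitespace between every character of the term.
    // Each character is escaped so that regex metacharacters are taken literally.
    public static string BuildPattern(string term) =>
        "^" + string.Join(@"\s*",
                  term.Where(c => !char.IsWhiteSpace(c))
                      .Select(c => Regex.Escape(c.ToString()))) + "$";

    static void Main()
    {
        string pattern = BuildPattern("Ele3");
        Console.WriteLine(pattern);                          // ^E\s*l\s*e\s*3$
        Console.WriteLine(Regex.IsMatch("Ele 3", pattern));  // True
        Console.WriteLine(Regex.IsMatch("Ele3", pattern));   // True
        Console.WriteLine(Regex.IsMatch("Ele 31", pattern)); // False
    }
}
```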

Sitecore Lucene index search term with space match same word without space

This seems so simple that I'm convinced I must be overlooking something. I cannot establish how to do the following in Lucene:
The problem
I'm searching for place names.
I have a field called Name
It is using Lucene.Net.Analysis.Standard.StandardAnalyzer
It is TOKENIZED
The value of Name contains 1 space in the value: halong bay.
The search term may or may not contain an extra space due to culturally different spellings or genuine spelling mistakes. E.g. ha long bay instead of halong bay.
If I use the term halong bay I get a hit.
If I use the term ha long bay I do not get a hit.
The attempted solution
Here's the code I'm using to build my predicate using LINQ to Lucene from Sitecore:
var searchContext = ContentSearchManager.GetIndex("my_index").CreateSearchContext();
var term = "ha long bay";
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Name == term);
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
I have also tried a fuzzy match using the .Like() extension:
var predicate = PredicateBuilder.Create<MySearchResultItemClass>(sri => sri.Like(term));
This also yields no results for ha long bay.
How do I configure Lucene in Sitecore to return a hit for both halong bay and ha long bay search terms, ideally without having to do anything fancy with the input term (e.g. stripping space, adding wildcards, etc)?
Note: I recognise that this would also allow the term h a l o n g b a y to produce a hit, but I don't think I have a problem with this.
A TOKENIZED field means that the field value is split into tokens by the analyzer (on the space in this case) and the resulting terms are added to the index dictionary. If you index "halong bay" in such a field, it will create the "halong" and "bay" terms.
It's normal for the search engine to fail to retrieve this result for the "ha long" search query because it doesn't know any result with the "ha" or "long" terms.
A manual approach would be to define all the other ways to write the place name in another multi-value computed index field named AlternateNames. Then you could issue this kind of query: Name==query OR AlternateNames==query.
An automatic approach would be to also index the place names without spaces in a separate computed index field named CompactName. Then you could issue this kind of query: Name==query OR CompactName==compactedQueryWithoutSpaces
I hope this helps
Jeff
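The compaction step behind the CompactName idea can be sketched without any Sitecore API. Compact is a hypothetical helper; the same function would be applied when computing the index field and when preparing the search term. Lower-casing is an assumption of this sketch, added so that the comparison does not depend on case.

```csharp
using System;
using System.Linq;

static class CompactNameDemo
{
    // Removes all whitespace and lower-cases the value.
    public static string Compact(string value) =>
        string.Concat(value.Where(c => !char.IsWhiteSpace(c))).ToLowerInvariant();

    static void Main()
    {
        Console.WriteLine(Compact("halong bay"));                           // halongbay
        Console.WriteLine(Compact("ha long bay"));                          // halongbay
        Console.WriteLine(Compact("halong bay") == Compact("ha long bay")); // True
    }
}
```

Because both spellings collapse to the same string, an exact match on the compacted field finds the item regardless of where the spaces were typed.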
Something like this might do the trick:
var predicate = PredicateBuilder.False<MySearchResultItemClass>();
foreach (var t in term.Split(' '))
{
    var tempTerm = t;
    predicate = predicate.Or(p => p.Name.Contains(tempTerm));
}
var results = searchContext.GetQueryable<MySearchResultItemClass>().Where(predicate);
It does split your input string, but I guess that is not 'fancy' ;)

Find Results with at Least One term from array

I'm writing a search algorithm. For the last portion of it, I want to split their search into individual words and then find any results that have at least one of those words in it. Is there any function that would work something like "ContainsAny" below? Otherwise, how can I make that happen?
string[] splitStr = text.Split();
result = db.Table.Where(x => x.Name.ContainsAny(splitStr)).FirstOrDefault();
For example, if they search for "Metal Spoon" both "Metal Chair" and "spoon book" would be valid results because each contains at least one of the search terms.
There is no ContainsAny, but you can use a combination of Any and Contains like this:
var results = db.Table.Where(x => splitStr.Any(s => x.Name.Contains(s)));
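The same Any/Contains combination can be tried in memory. The sketch below is a LINQ-to-objects version with made-up data and a hypothetical FindAny helper. It compares case-insensitively via IndexOf, which is an assumption added so that "Spoon" finds "spoon book" as in the question's example; against a database, case sensitivity usually depends on the column collation instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ContainsAnyDemo
{
    // Returns every name that contains at least one of the search words.
    public static List<string> FindAny(IEnumerable<string> names, string search)
    {
        string[] terms = search.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return names
            .Where(n => terms.Any(t => n.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0))
            .ToList();
    }

    static void Main()
    {
        var names = new[] { "Metal Chair", "spoon book", "Wooden Table" };
        Console.WriteLine(string.Join("; ", FindAny(names, "Metal Spoon"))); // Metal Chair; spoon book
    }
}
```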

String-parsing-fu: Can you help me find a way to retrieve this value?

I need to somehow detect if there is a parent OU value, and if there is retrieve it.
For example, here there is no parent:
LDAP://servera/OU=Santa Cruz,DC=contoso,DC=com
But here, there is a parent:
LDAP://servera/OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com
So I would need to retrieve that "Ventas" string.
Another example:
LDAP://servera/OU=Contabilidad,OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com
I would need to retrieve that "Ventas" string as well.
Any suggestions on how to tackle this?
string ldap = "LDAP://servera/OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com";
Match match = Regex.Match(ldap, @"LDAP://\w+/OU=(?<toplevelou>\w+?),OU=");
if (match.Success)
{
    Console.WriteLine(match.Result("${toplevelou}"));
}
I'd find the first occurrence of OU=... and get its value. Then I'd check if there was another occurrence after it. If so, return the value I've got. If not, return whatever it is you want if there's no parent (String.Empty, null, or whatever).
You could also use a regular expression like this:
var regex = new Regex(@"OU=(.*?),");
var matches = regex.Matches(ldapString);
Then check how many matches there are. If >1 return the captured value from the first match.
Update
The regex above needs to be improved to allow the case where there's an escaped comma (\,) in the LDAP string. Maybe something like:
var regex = new Regex(@"OU=((.*?(\\\,)+?)+?),");
That may be broken, and there may be a simpler way to do the same thing. I'm not a regex wizard.
Another Update
Per Kimberly's comment below the regex should be @"OU=((?:.*?(?:\\\,)*?)+?),".
Call me crazy, but I'd do it this way (hey ma, look, a one-liner!):
var str = "LDAP://servera/OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com";
var result = str.Substring(str.LastIndexOf('/') + 1).Split(',')
.Select(s => s.Split('='))
.Where(a => a[0] == "OU")
.Select(a => a[1])
.Reverse().Skip(1).FirstOrDefault();
result is either null or has the string you want. This will work no matter how many OUs are in there and return the second-to-last one, as long as the format of the string is valid to begin with.
Update: possible improvements:
The above will not work correctly if your DN contains an escaped forward slash or an escaped comma.
To fix both of these you need to use regular expressions. Change:
str.Substring(str.LastIndexOf('/') + 1).Split(',')
to:
Regex.Split(Regex.Split(str, "(?<!\\\\)/").Last(), "(?<!\\\\),")
What this does is isolate the DN by taking the last part of str after splitting on forward slashes, and then split the DN into parts by splitting on commas. In both cases, a negative lookbehind is used to make sure that the slashes/commas are not escaped.
Not as pretty, I know. But it's still a one-liner (yay!) and it still allows you to use LINQ further down to handle multiple OUs any way you choose to.
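Put together, the one-liner with the lookbehind-based splits could look like the sketch below. GetParentOu is a hypothetical helper name, and the limit of 2 passed to Split('=') is an addition of this sketch so that a value containing '=' stays in one piece.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ParentOuDemo
{
    // Returns the second-to-last OU of the DN, or null when there is only one OU.
    public static string GetParentOu(string ldapPath)
    {
        // The DN is the text after the last unescaped '/'.
        string dn = Regex.Split(ldapPath, @"(?<!\\)/").Last();

        return Regex.Split(dn, @"(?<!\\),")                   // split on unescaped commas
            .Select(part => part.Split(new[] { '=' }, 2))     // "OU=Ventas" -> ["OU", "Ventas"]
            .Where(kv => kv.Length == 2 && kv[0] == "OU")
            .Select(kv => kv[1])
            .Reverse().Skip(1).FirstOrDefault();
    }

    static void Main()
    {
        Console.WriteLine(GetParentOu("LDAP://servera/OU=Santa Cruz,DC=contoso,DC=com") ?? "(none)");
        // (none)
        Console.WriteLine(GetParentOu("LDAP://servera/OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com"));
        // Ventas
        Console.WriteLine(GetParentOu("LDAP://servera/OU=Contabilidad,OU=Ventas,OU=Santa Cruz,DC=contoso,DC=com"));
        // Ventas
    }
}
```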
