find Count of Substring in string C# - c#

I am trying to find how many times the word "Serotonin" appears in the gathered web data but cannot find a method for finding the number of times.
IEnumerator OnMouseDown()
{
string GatheredData;
StringToFind = "Serotonin"
string url = "https://en.wikipedia.org/wiki/Dopamine";
WWW www = new WWW(url);
yield return www;
GatheredData = www.text;
//Attempted methods below
M1_count = GatheredData.Contains(StringToFind);
M1_count = GatheredData.Count(StringToFind);
M1_count = GatheredData.IndexOf(StringToFind);
}
I can easily use the data from those methods 1 and 3 when I tell it what number in the index and method 2 would work but only works for chars not strings
I have checked online and on here but found nothing of finding the count of the StringToFind

Assume string is like this
string test = "word means collection of chars, and every word has meaning";
then just use regex to find how many times word is matched in your test string like this
int count = Regex.Matches(test, "word").Count;
output would be 2

The solution
int count = Regex.Matches(someString, potencialSubstring).Count;
did not work for me. Even thou I used Regex.Escape(str)
So I wrote it myself, it is quite slow, but the performance is not an issue in my app.
private static List<int> StringOccurencesCount(String haystack, String needle, StringComparison strComp)
{
var results = new List<int>();
int index = haystack.IndexOf(needle, strComp);
while (index != -1)
{
results.Add(index);
index = haystack.IndexOf(needle, index + needle.Length, strComp);
}
return results;
}
Maybe someone will find this useful.

Improvement on #Petr Nohejl's excellent answer:
public static int Count (this string s, string substr, StringComparison strComp = StringComparison.CurrentCulture)
{
int count = 0, index = s.IndexOf(substr, strComp);
while (index != -1)
{
count++;
index = s.IndexOf(substr, index + substr.Length, strComp);
}
return count;
}
This does not use Regex.Matches and probably has better performance and is more predictable.
See on .NET Fiddle

If you don't worry about performance, here are 3 alternative solutions:
int Count(string src, string target)
=> src.Length - src.Replace(target, target[1..]).Length;
int Count(string src, string target)
=> src.Split(target).Length - 1;
int Count(string src, string target)
=> Enumerable
.Range(0, src.Length - target.Length + 1)
.Count(index => src.Substring(index, target.Length) == target);

Oh yes, I have it now.
I will split() the array and get the length
2nd to that I will IndexOf until I return a -1
Thanks for help in the comments!

A possible solution would be to use Regex:
var count = Regex.Matches(GatheredData.ToLower(), String.Format("\b{0}\b", StringToFind)).Count;

Related

Odd-Even Index String Sort

Task:
Implement the method at each iteration of which, the odd characters of the string are combined and wrapped to its beginning, and the even characters are wrapped to the end.
"source" The source string.
"count" The count of iterations.
My code:
public static string ShuffleChars(string s, int count)
{
string res = string.Empty;
for (int i = 0; i <= count; i++)
{
res = $"{string.Concat(s.Where((x, i) => i % 2 == 0))}{string.Concat(s.Where((x, i) => i % 2 != 0))}";
}
}
return res;
I sorted string but I don't know how can I do iterations on same value , I tried use "for" , but it is not working, help me pls
i need to sort like this:
1."123456789"
2."135792468" first iteration
3."159483726" second iteration
4."198765432" third iteration
but if I use loop , anyway count = 2 or count = 10 it returns "135792468", I don't know why
The problems with your code are:
You return from inside the loop. This prevents any but the first iteration to complete.
You use <= instead of < in your loop condition. Since we start at 0, this will iterate count + 1 times.
You use the same variable name i for the loop counter as you do in the Where clause, which is illegal since they're in the same scope.
To resolve these issues (and use string.Concat instead of string.Join):
public static string ShuffleChars(string s, int count)
{
for (int i = 0; i < count; i++)
{
s = string.Concat(s.Where((item, index) => index % 2 == 0)) +
string.Concat(s.Where((item, index) => index % 2 != 0));
}
return s;
}
Testing the output:
static void Main()
{
var input = "123456789";
Console.WriteLine($"Starting input = {input}");
Console.WriteLine($"One iteration = {ShuffleChars(input, 1)}");
Console.WriteLine($"Two iterations = {ShuffleChars(input, 2)}");
Console.WriteLine($"Three iterations = {ShuffleChars(input, 3)}");
GetKeyFromUser("\nDone! Press any key to exit...");
}
Output
If I'm reading your problem correctly, you pretty much want to do what your code is doing; shuffle the characters at odd positions to the beginning and even positions to the end.
However, you want to continue to shuffle them, the number of times that you pass in, count. If you try to just loop what you have, you're continuing to use the original string that you passed in, s, and then will always end up returning the same value.
The easiest way to accomplish this is to declare an output string that you continue to assign to until you break out of the loop. So something like:
public static string ShuffleChars(string s, int count)
{
var output = s;
for(int i = 0; i < count; i++) {
output = string.Join("", output.Where((v, j) => j % 2 == 0))
+ string.Join("", output.Where((v, j) => j % 2 != 0));
}
return output;
}
The key here is that you are declaring a new value, output, and initializing it to your string value you passed in. Then for each iteration of the loop, you reassign the value of output to the new value. Finally, once you break out of the loop, you return the final value of output.
As others have stated, there are other ways you could improve on the assignment line. Personally, I probably prefer using string interpolation:
output = $"{output.Where((v, j) => j % 2 == 0)}{output.Where((v, j) => j % 2 != 0)};"
You could convert the string to an IEnumerable<char>, and then apply the same LINQ transformations a count number of times. Finally materialize the IEnumerable<char> to a char[] using the ToArray operator, and then convert the array back to a string.
public static string ShuffleChars(string s, int count)
{
IEnumerable<char> chars = s;
foreach (var _ in Enumerable.Range(0, count))
{
chars = chars
.Select((c, i) => (c, i))
.OrderBy(e => e.i % 2)
.Select(e => e.c);
}
return new String(chars.ToArray());
}

Count occurences of a string pattern, in a string using Linq

Im working to correctly map links on websites.
I need to be able to count how often ../ occurs in a string. At this moment I have a function that loops through the string and counts, while this works, im looking for a Linq solution.
I know that I can count with a single character like this
int count = Href.Count(f => f == '/');
But, can I, by using LINQ , count how often the pattern ../ occurs? Is this possible?
You can do that nicely with Regex
var dotdotslash=new Regex(#"\.\./");
string test="../../bla/../";
int count=dotdotslash.Matches(test).Count;
↓
3
You could use this extension method:
public static int ContainsCount(this string input, string subString, bool countIntersecting = true, StringComparison comparison = StringComparison.CurrentCulture)
{
int occurences = 0;
int step = countIntersecting ? 1 : subString.Length;
int index = -step;
while ((index = input.IndexOf(subString, index + step, comparison)) >= 0)
occurences++;
return occurences;
}
which returns the number of sub-strings in a given string with pure string-methods:
int count = Href.ContainsCount("../");
String-methods are superior to other methods which use LINQ or regex in terms of efficiency.
This method supports counting intersecting sub-strings(default) and non-overlapping sub-strings.
This shows the difference:
string str = "ottotto";
int count = str.ContainsCount("otto"); // 2
count = str.ContainsCount("otto", false); // 1
Yes, it's possible, but it's very awkward, it will be slow, and it will be hard to read. Don't use it.
How would you count occurrences of a string within a string?
src.Select((c, i) => src.Substring(i)).Count(sub => sub.StartsWith(target))
Alternatively, this looks pretty beautiful:
public static class StringExtensions
{
public static IEnumerable<int> IndexOfAll(this string input, string value){
var currentIndex = 0;
while((currentIndex = input.IndexOf(value, currentIndex)) != -1)
yield return currentIndex++;
}
}
and usage:
"TESTHATEST"
.IndexOfAll("TEST")
.Count()
.Dump();
Regular expression (see Dmitry Ledentsov's answer) is much better here; however Linq is also possible:
String source = #"abc../def../";
// 2
int result = source
.Where((item, index) => source.Substring(index).StartsWith(#"../"))
.Count();
Actually, you can do it in a really LINQy (and awkward :) ) way like this:
private static int CountPatternAppearancesInString(string str, string pattern)
{
var count = str
.Select(
(_, index) =>
index < str.Length - pattern.Length + 1 &&
str.Skip(index)
.Take(pattern.Length)
.Zip(pattern, (strChar, patternChar) => strChar == patternChar)
.All(areEqual => areEqual))
.Count(isMatch => isMatch);
return count;
}
Or, using some of the String-provided methods:
private static int CountPatternAppearancesInString(string str, string pattern)
{
var count = str
.Select(
(_, index) =>
index < str.Length - pattern.Length + 1 &&
str.IndexOf(pattern, index, pattern.Length) >= 0)
.Count(isMatch => isMatch);
return count;
}
But, as already said, it is suboptimal and serves for illustration purpose only.

.indexOf for multiple results

Let's say I have a text and I want to locate the positions of each comma. The string, a shorter version, would look like this:
string s = "A lot, of text, with commas, here and,there";
Ideally, I would use something like:
int[] i = s.indexOf(',');
but since indexOf only returns the first comma, I instead do:
List<int> list = new List<int>();
for (int i = 0; i < s.Length; i++)
{
if (s[i] == ',')
list.Add(i);
}
Is there an alternative, more optimized way of doing this?
Here I got a extension method for that, for the same use as IndexOf:
public static IEnumerable<int> AllIndexesOf(this string str, string searchstring)
{
int minIndex = str.IndexOf(searchstring);
while (minIndex != -1)
{
yield return minIndex;
minIndex = str.IndexOf(searchstring, minIndex + searchstring.Length);
}
}
so you can use
s.AllIndexesOf(","); // 5 14 27 37
https://dotnetfiddle.net/DZdQ0L
You could use Regex.Matches(string, string) method. This will return a MatchCollection and then you could determine the Match.Index. MSDN has a good example,
using System;
using System.Text.RegularExpressions;
public class Example
{
public static void Main()
{
string pattern = #"\b\w+es\b";
string sentence = "Who writes these notes?";
foreach (Match match in Regex.Matches(sentence, pattern))
Console.WriteLine("Found '{0}' at position {1}",
match.Value, match.Index);
}
}
// The example displays the following output:
// Found 'writes' at position 4
// Found 'notes' at position 17
IndexOf also allows you to add another parameter for where to start looking. You can set that parameter to be the last known comma location +1. For example:
string s = "A lot, of text, with commas, here and, there";
int loc = s.IndexOf(',');
while (loc != -1) {
Console.WriteLine(loc);
loc = s.IndexOf(',', loc + 1);
}
You could use the overload of the IndexOf method that also takes a start index to get the following comma, but you would still have to do that in a loop, and it would perform pretty much the same as the code that you have.
You could use a regular expression to find all commas, but that produces quite some overhead, so that's not more optimised than what you have.
You could write a LINQ query to do it in a different way, but that also has some overhead so it's not more optimised than what you have.
So, there are many alternative ways, but not any way that is more optimised.
A bit unorthodox, but why not use a split? Might be less aggressive than iterating over the entire string
string longString = "Some, string, with, commas.";
string[] splitString = longString.Split(",");
int numSplits = splitString.Length - 1;
Debug.Log("number of commas "+numSplits);
Debug.Log("first comma index = "+GetIndex(splitString, 0)+" second comma index = "+GetIndex(splitString, 1));
public int GetIndex(string[] stringArray, int num)
{
int charIndex = 0;
for (int n = num; n >= 0; n--)
{
charIndex+=stringArray[n].Length;
}
return charIndex + num;
}

C#: Cleanest way to divide a string array into N instances N items long

I know how to do this in an ugly way, but am wondering if there is a more elegant and succinct method.
I have a string array of e-mail addresses. Assume the string array is of arbitrary length -- it could have a few items or it could have a great many items. I want to build another string consisting of say, 50 email addresses from the string array, until the end of the array, and invoke a send operation after each 50, using the string of 50 addresses in the Send() method.
The question more generally is what's the cleanest/clearest way to do this kind of thing. I have a solution that's a legacy of my VBScript learnings, but I'm betting there's a better way in C#.
You want elegant and succinct, I'll give you elegant and succinct:
var fifties = from index in Enumerable.Range(0, addresses.Length)
group addresses[index] by index/50;
foreach(var fifty in fifties)
Send(string.Join(";", fifty.ToArray());
Why mess around with all that awful looping code when you don't have to? You want to group things by fifties, then group them by fifties.
That's what the group operator is for!
UPDATE: commenter MoreCoffee asks how this works. Let's suppose we wanted to group by threes, because that's easier to type.
var threes = from index in Enumerable.Range(0, addresses.Length)
group addresses[index] by index/3;
Let's suppose that there are nine addresses, indexed zero through eight
What does this query mean?
The Enumerable.Range is a range of nine numbers starting at zero, so 0, 1, 2, 3, 4, 5, 6, 7, 8.
Range variable index takes on each of these values in turn.
We then go over each corresponding addresses[index] and assign it to a group.
What group do we assign it to? To group index/3. Integer arithmetic rounds towards zero in C#, so indexes 0, 1 and 2 become 0 when divided by 3. Indexes 3, 4, 5 become 1 when divided by 3. Indexes 6, 7, 8 become 2.
So we assign addresses[0], addresses[1] and addresses[2] to group 0, addresses[3], addresses[4] and addresses[5] to group 1, and so on.
The result of the query is a sequence of three groups, and each group is a sequence of three items.
Does that make sense?
Remember also that the result of the query expression is a query which represents this operation. It does not perform the operation until the foreach loop executes.
Seems similar to this question: Split a collection into n parts with LINQ?
A modified version of Hasan Khan's answer there should do the trick:
public static IEnumerable<IEnumerable<T>> Chunk<T>(
this IEnumerable<T> list, int chunkSize)
{
int i = 0;
var chunks = from name in list
group name by i++ / chunkSize into part
select part.AsEnumerable();
return chunks;
}
Usage example:
var addresses = new[] { "a#example.com", "b#example.org", ...... };
foreach (var chunk in Chunk(addresses, 50))
{
SendEmail(chunk.ToArray(), "Buy V14gr4");
}
It sounds like the input consists of separate email address strings in a large array, not several email address in one string, right? And in the output, each batch is a single combined string.
string[] allAddresses = GetLongArrayOfAddresses();
const int batchSize = 50;
for (int n = 0; n < allAddresses.Length; n += batchSize)
{
string batch = string.Join(";", allAddresses, n,
Math.Min(batchSize, allAddresses.Length - n));
// use batch somehow
}
Assuming you are using .NET 3.5 and C# 3, something like this should work nicely:
string[] s = new string[] {"1", "2", "3", "4"....};
for (int i = 0; i < s.Count(); i = i + 50)
{
string s = string.Join(";", s.Skip(i).Take(50).ToArray());
DoSomething(s);
}
I would just loop through the array and using StringBuilder to create the list (I'm assuming it's separated by ; like you would for email). Just send when you hit mod 50 or the end.
void Foo(string[] addresses)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < addresses.Length; i++)
{
sb.Append(addresses[i]);
if ((i + 1) % 50 == 0 || i == addresses.Length - 1)
{
Send(sb.ToString());
sb = new StringBuilder();
}
else
{
sb.Append("; ");
}
}
}
void Send(string addresses)
{
}
I think we need to have a little bit more context on what exactly this list looks like to give a definitive answer. For now I'm assuming that it's a semicolon delimeted list of email addresses. If so you can do the following to get a chunked up list.
public IEnumerable<string> DivideEmailList(string list) {
var last = 0;
var cur = list.IndexOf(';');
while ( cur >= 0 ) {
yield return list.SubString(last, cur-last);
last = cur + 1;
cur = list.IndexOf(';', last);
}
}
public IEnumerable<List<string>> ChunkEmails(string list) {
using ( var e = DivideEmailList(list).GetEnumerator() ) {
var list = new List<string>();
while ( e.MoveNext() ) {
list.Add(e.Current);
if ( list.Count == 50 ) {
yield return list;
list = new List<string>();
}
}
if ( list.Count != 0 ) {
yield return list;
}
}
}
I think this is simple and fast enough.The example below divides the long sentence into 15 parts,but you can pass batch size as parameter to make it dynamic.Here I simply divide using "/n".
private static string Concatenated(string longsentence)
{
const int batchSize = 15;
string concatanated = "";
int chanks = longsentence.Length / batchSize;
int currentIndex = 0;
while (chanks > 0)
{
var sub = longsentence.Substring(currentIndex, batchSize);
concatanated += sub + "/n";
chanks -= 1;
currentIndex += batchSize;
}
if (currentIndex < longsentence.Length)
{
int start = currentIndex;
var finalsub = longsentence.Substring(start);
concatanated += finalsub;
}
return concatanated;
}
This show result of split operation.
var parts = Concatenated(longsentence).Split(new string[] { "/n" }, StringSplitOptions.None);
Extensions methods based on Eric's answer:
public static IEnumerable<IEnumerable<T>> SplitIntoChunks<T>(this T[] source, int chunkSize)
{
var chunks = from index in Enumerable.Range(0, source.Length)
group source[index] by index / chunkSize;
return chunks;
}
public static T[][] SplitIntoArrayChunks<T>(this T[] source, int chunkSize)
{
var chunks = from index in Enumerable.Range(0, source.Length)
group source[index] by index / chunkSize;
return chunks.Select(e => e.ToArray()).ToArray();
}

C# Finding relevant document snippets for search result display

In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft Sql Server's Full Text Search engine instead of something more robust like Lucene.Net.
One of the features I would like to have, though, is google-esque relevant document snippets. I quickly found determining "relevant" snippets is more difficult than I realized.
I want to choose snippets based on search term density in the found text. So, essentially, I need to find the most search term dense passage in the text. Where a passage is some arbitrary number of characters (say 200 -- but it really doesn't matter).
My first thought is to use .IndexOf() in a loop and build an array of term distances (subtract the index of the found term from the previously found term), then ... what? Add up any two, any three, any four, any five, sequential array elements and use the one with the smallest sum (hence, the smallest distance between search terms).
That seems messy.
Is there an established, better, or more obvious way to do this than what I have come up with?
Although it is implemented in Java, you can see one approach for that problem here:
http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html
I know this thread is way old, but I gave this a try last week and it was a pain in the back side. This is far from perfect, but this is what I came up with.
The snippet generator:
public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
{
string snippedString = "";
List<int> keywordLocations = new List<int>();
//Get the locations of all keywords
for (int i = 0; i < Keywords.Count(); i++)
keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));
//Sort locations
keywordLocations.Sort();
//Remove locations which are closer to each other than the SnippetLength
if (keywordLocations.Count > 1)
{
bool found = true;
while (found)
{
found = false;
for (int i = keywordLocations.Count - 1; i > 0; i--)
if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength / 2)
{
keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;
keywordLocations.RemoveAt(i);
found = true;
}
}
}
//Make the snippets
if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
snippedString = "... ";
foreach (int i in keywordLocations)
{
int stringStart = Math.Max(0, i - SnippetLength / 2);
int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
snippedString += StringToSnip.Substring(stringStart, stringLength);
if (stringEnd < StringToSnip.Length) snippedString += " ... ";
if (snippedString.Length > 200) break;
}
return snippedString;
}
The function which will find the index of all keywords in the sample text
private static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
{
int pos;
int offset = 0;
int length = needle.Length;
List<int> positions = new List<int>();
while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
{
positions.Add(pos);
offset = pos + length;
}
return positions;
}
It's a bit clumsy in its execution. The way it works is by finding the position of all keywords in the string. Then checking that no keywords are closer to each other than the desired snippet length, so that snippets won't overlap (that's where it's a bit iffy...). And then grabs substrings of the desired length centered around the position of the keywords and stitches the whole thing together.
I know this is years late, but posting just in case it might help somebody coming across this question.
public class Highlighter
{
private class Packet
{
public string Sentence;
public double Density;
public int Offset;
}
public static string FindSnippet(string text, string query, int maxLength)
{
if (maxLength < 0)
{
throw new ArgumentException("maxLength");
}
var words = query.Split(' ').Where(w => !string.IsNullOrWhiteSpace(w)).Select(word => word.ToLower()).ToLookup(s => s);
var sentences = text.Split('.');
var i = 0;
var packets = sentences.Select(sentence => new Packet
{
Sentence = sentence,
Density = ComputeDensity(words, sentence),
Offset = i++
}).OrderByDescending(packet => packet.Density);
var list = new SortedList<int, string>();
int length = 0;
foreach (var packet in packets)
{
if (length >= maxLength || packet.Density == 0)
{
break;
}
string sentence = packet.Sentence;
list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
length += packet.Sentence.Length;
}
var sb = new List<string>();
int previous = -1;
foreach (var item in list)
{
var offset = item.Key;
var sentence = item.Value;
if (previous != -1 && offset - previous != 1)
{
sb.Add(".");
}
previous = offset;
sb.Add(Highlight(sentence, words));
}
return String.Join(".", sb);
}
private static string Highlight(string sentence, ILookup<string, string> words)
{
var sb = new List<string>();
var ff = true;
foreach (var word in sentence.Split(' '))
{
var token = word.ToLower();
if (ff && words.Contains(token))
{
sb.Add("[[HIGHLIGHT]]");
ff = !ff;
}
if (!ff && !string.IsNullOrWhiteSpace(token) && !words.Contains(token))
{
sb.Add("[[ENDHIGHLIGHT]]");
ff = !ff;
}
sb.Add(word);
}
if (!ff)
{
sb.Add("[[ENDHIGHLIGHT]]");
}
return String.Join(" ", sb);
}
private static double ComputeDensity(ILookup<string, string> words, string sentence)
{
if (string.IsNullOrEmpty(sentence) || words.Count == 0)
{
return 0;
}
int numerator = 0;
int denominator = 0;
foreach(var word in sentence.Split(' ').Select(w => w.ToLower()))
{
if (words.Contains(word))
{
numerator++;
}
denominator++;
}
if (denominator != 0)
{
return (double)numerator / denominator;
}
else
{
return 0;
}
}
}
Example:
highlight "Optic flow is defined as the change of structured light in the image, e.g. on the retina or the camera’s sensor, due to a relative motion between the eyeball or camera and the scene. Further definitions from the literature highlight different properties of optic flow" "optic flow"
Output:
[[HIGHLIGHT]] Optic flow [[ENDHIGHLIGHT]] is defined as the change of structured
light in the image, e... Further definitions from the literature highlight diff
erent properties of [[HIGHLIGHT]] optic flow [[ENDHIGHLIGHT]]
Well, here's the hacked together version I made using the algorithm I described above. I don't think it is all that great. It uses three (count em, three!) loops an array and two lists. But, well, it is better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.
private static string FindRelevantSnippets(string infoText, string[] searchTerms)
{
List<int> termLocations = new List<int>();
foreach (string term in searchTerms)
{
int termStart = infoText.IndexOf(term);
while (termStart > 0)
{
termLocations.Add(termStart);
termStart = infoText.IndexOf(term, termStart + 1);
}
}
if (termLocations.Count == 0)
{
if (infoText.Length > 250)
return infoText.Substring(0, 250);
else
return infoText;
}
termLocations.Sort();
List<int> termDistances = new List<int>();
for (int i = 0; i < termLocations.Count; i++)
{
if (i == 0)
{
termDistances.Add(0);
continue;
}
termDistances.Add(termLocations[i] - termLocations[i - 1]);
}
int smallestSum = int.MaxValue;
int smallestSumIndex = 0;
for (int i = 0; i < termDistances.Count; i++)
{
int sum = termDistances.Skip(i).Take(5).Sum();
if (sum < smallestSum)
{
smallestSum = sum;
smallestSumIndex = i;
}
}
int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
int len = Math.Min(smallestSum, infoText.Length - start);
len = Math.Min(len, 250);
return infoText.Substring(start, len);
}
Some improvements I could think of would be to return multiple "snippets" with a shorter length that add up to the longer length -- this way multiple parts of the document can be sampled.
This is a nice problem :)
I think I'd create an index vector: For each word, create an entry 1 if search term or otherwise 0. Then find the i such that sum(indexvector[i:i+maxlength]) is maximized.
This can actually be done rather efficiently. Start with the number of searchterms in the first maxlength words. then, as you move on, decrease your counter if indexvector[i]=1 (i.e. your about to lose that search term as you increase i) and increase it if indexvector[i+maxlength+1]=1. As you go, keep track of the i with the highest counter value.
Once you got your favourite i, you can still do finetuning like see if you can reduce the actual size without compromising your counter, e.g. in order to find sentence boundaries or whatever. Or like picking the right i of a number of is with equivalent counter values.
Not sure if this is a better approach than yours - it's a different one.
You might also want to check out this paper on the topic, which comes with yet-another baseline: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf
I took another approach, perhaps it will help someone...
First it searches if it word appears in my case with IgnoreCase (you change this of course yourself).
Then I create a list of Regex matches on each separators and search for the first occurrence of the word (allowing partial case insensitive matches).
From that index, I get the 10 matches in front and behind the word, which makes the snippet.
public static string GetSnippet(string text, string word)
{
if (text.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) == -1)
{
return "";
}
var matches = new Regex(#"\b(\S+)\s?", RegexOptions.Singleline | RegexOptions.Compiled).Matches(text);
var p = -1;
for (var i = 0; i < matches.Count; i++)
{
if (matches[i].Value.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) != -1)
{
p = i;
break;
}
}
if (p == -1) return "";
var snippet = "";
for (var x = Math.Max(p - 10, 0); x < p + 10; x++)
{
snippet += matches[x].Value + " ";
}
return snippet;
}
If you use CONTAINSTABLE you will get a RANK back , this is in essence a density value - higher the RANK value, the higher the density. This way, you just run a query to get the results you want and dont have to result to massaging the data when its returned.
Wrote a function to do this just now. You want to pass in:
Inputs:
Document text
This is the full text of the document you're taking a snippet from. Most likely you will want to strip out any BBCode/HTML from this document.
Original query
The string the user entered as their search
Snippet length
Length of the snippet you wish to display.
Return Value:
Start index of the document text to take the snippet from. To get the snippet simply do documentText.Substring(returnValue, snippetLength). This has the advantage that you know if the snippet is take from the start/end/middle so you can add some decoration like ... if you wish at the snippet start/end.
Performance
A resolution set to 1 will find the best snippet but moves the window along 1 char at a time. Set this value higher to speed up execution.
Tweaks
You can work out score however you want. In this example I've done Math.pow(wordLength, 2) to favour longer words.
private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
{
// Normalise document text
documentText = documentText.Trim();
if (string.IsNullOrWhiteSpace(documentText)) return 0;
// Return 0 if entire doc fits in snippet
if (documentText.Length <= snippetLength) return 0;
// Break query down into words
var wordsInQuery = new HashSet<string>();
{
var queryWords = originalQuery.Split(' ');
foreach (var word in queryWords)
{
var normalisedWord = word.Trim().ToLower();
if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
if (wordsInQuery.Contains(normalisedWord)) continue;
wordsInQuery.Add(normalisedWord);
}
}
// Create moving window to get maximum trues
var windowStart = 0;
double maxScore = 0;
var maxWindowStart = 0;
// Higher number less accurate but faster
const int resolution = 5;
while (true)
{
var text = documentText.Substring(windowStart, snippetLength);
// Get score of this chunk
// This isn't perfect, as window moves in steps of resolution first and last words will be partial.
// Could probably be improved to iterate words and not characters.
var words = text.Split(' ').Select(c => c.Trim().ToLower());
double score = 0;
foreach (var word in words)
{
if (wordsInQuery.Contains(word))
{
// The longer the word, the more important.
// Can simply replace with score += 1 for simpler model.
score += Math.Pow(word.Length, 2);
}
}
if (score > maxScore)
{
maxScore = score;
maxWindowStart = windowStart;
}
// Setup next iteration
windowStart += resolution;
// Window end passed document end
if (windowStart + snippetLength >= documentText.Length)
{
break;
}
}
return maxWindowStart;
}
Lots more you can add to this, for example instead of comparing exact words perhaps you might want to try comparing the SOUNDEX where you weight soundex matches less than exact matches.

Categories