shortest substring of string in C# - c#

I am trying to write a program that, given a string of lowercase letters in the range ascii[a-z], determines the length of the smallest substring that contains all of the letters present in the string.
However, my solution is terminated due to a timeout.
How can I improve the solution?
I tried:
public static int shortestSubstring(string s){
int n = s.Length;
int max_distinct = max_distinct_char(s, n);
int minl = n;
for (int i = 0; i < n; i++)
{
for (int j = 0; j < n; j++)
{
String subs = null;
if (i < j)
subs = s.Substring(i, s.Length - j);
else
subs = s.Substring(j, s.Length - i);
int subs_lenght = subs.Length;
int sub_distinct_char = max_distinct_char(subs, subs_lenght);
if (subs_lenght < minl && max_distinct == sub_distinct_char)
{
minl = subs_lenght;
}
}
}
return minl;
}
private static int max_distinct_char(String s, int n)
{
int[] count = new int[NO_OF_CHARS];
for (int i = 0; i < n; i++)
count[s[i]]++;
int max_distinct = 0;
for (int i = 0; i < NO_OF_CHARS; i++)
{
if (count[i] != 0)
max_distinct++;
}
return max_distinct;
}
}

I believe there is an O(n) solution to this problem as follows:
We first traverse the string to find out how many distinct characters are in it. After this, we initialize two pointers denoting the left and right index of the substring to 0. We also keep an array counting the number of each character currently present in the substring. If not all characters are contained, we increase the right pointer in order to get another character. If all characters are contained, we increase the left pointer in order to possibly get a smaller substring. Since either the left or the right pointer increases at each step, and neither ever moves backwards, this algorithm should run in O(n) time.
For inspiration for this algorithm, see Kadane's algorithm for the maximum subarray problem.
Unfortunately, I do not know C#. However, I have written a Java solution (which hopefully has similar syntax). I haven't stress tested this rigorously so it's possible I missed an edge case.
import java.io.*;
public class allChars {
public static void main (String[] args) throws IOException {
BufferedReader br = new BufferedReader (new InputStreamReader(System.in));
String s = br.readLine();
System.out.println(shortestSubstring(s));
}
public static int shortestSubstring(String s) {
//If length of string is 0, answer is 0
if (s.length() == 0) {
return 0;
}
int[] charCounts = new int[26];
//Find number of distinct characters in string
int count = 0;
for (int i = 0; i < s.length(); i ++) {
char c = s.charAt(i);
//If new character (current count of it is 0)
if (charCounts[c - 97] == 0) {
//Increase count of distinct characters
count ++;
//Increase count of this character to 1
//Can put inside if statement because don't care if count is greater than 1 here
//Only care if character is present
charCounts[c - 97]++;
}
}
int shortestLen = Integer.MAX_VALUE;
charCounts = new int[26];
//Initialize left and right pointers to 0
int left = 0;
int right = 0;
//Substring already contains first character of string
int curCount = 1;
charCounts[s.charAt(0)-97] ++;
while (Math.max(left,right) < s.length()) {
//If all distinct characters present
if (curCount == count) {
//Update shortest length
shortestLen = Math.min(right - left + 1, shortestLen);
//Decrease character count of left character
charCounts[s.charAt(left) - 97] --;
//If new count of left character is 0
if (charCounts[s.charAt(left) - 97] == 0) {
//Decrease count of distinct characters
curCount --;
}
//Increment left pointer to create smaller substring
left ++;
}
//If not all characters present
else {
//Increment right pointer to get another character
right ++;
//If character is new (old count was 0)
if (right < s.length() && charCounts[s.charAt(right) - 97]++ == 0) {
//Increment distinct character count
curCount ++;
}
}
}
return shortestLen;
}
}
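Since the question asks for C#, the two-pointer idea described above can be sketched in C# roughly as follows. This is a minimal sketch rather than a tested port of the Java code: the class name ShortestWindow is made up for illustration, and the code assumes the input contains only the letters 'a' to 'z'.

```csharp
using System;

public static class ShortestWindow
{
    // Returns the length of the shortest substring of s that contains every
    // distinct character of s. Assumption: s holds only 'a'..'z'.
    public static int ShortestSubstring(string s)
    {
        if (string.IsNullOrEmpty(s)) return 0;

        // Pass 1: count the distinct characters in the whole string.
        var seen = new bool[26];
        int distinct = 0;
        foreach (char c in s)
        {
            if (!seen[c - 'a'])
            {
                seen[c - 'a'] = true;
                distinct++;
            }
        }

        // Pass 2: slide a window [left, right] across the string.
        var counts = new int[26];
        int inWindow = 0;      // distinct characters currently in the window
        int best = s.Length;   // the whole string always qualifies
        int left = 0;
        for (int right = 0; right < s.Length; right++)
        {
            // Take in the character at right; it is new if its count was 0.
            if (counts[s[right] - 'a']++ == 0) inWindow++;

            // Shrink from the left while the window still covers everything.
            while (inWindow == distinct)
            {
                best = Math.Min(best, right - left + 1);
                if (--counts[s[left] - 'a'] == 0) inWindow--;
                left++;
            }
        }
        return best;
    }
}
```

Each index is visited at most once by right and at most once by left, which is where the linear running time comes from.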

I hope I understand it right, here's the code to get the smallest string.
string str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec dictum elementum condimentum. Aliquam commodo ipsum enim. Vivamus tincidunt feugiat urna.";
char[] operators = { ' ', ',', '.', ':', '!', '?', ';' };
string[] vs = str.Split(operators);
string shortestWord = vs[0];
for (int i = 0; i < vs.Length; i++)
{
if (vs[i].Length < shortestWord.Length && vs[i] != "" && vs[i] != " ")
{
shortestWord = vs[i];
}
}
Console.WriteLine(shortestWord);

The approach below is O(n^2) in the worst case. This is not ideal; however, we can do several things to avoid testing sub-strings that cannot be valid candidates.
I suggest returning the sub-string itself instead of its length. This helps to validate the result.
public static string ShortestSubstring(string input)
We begin by counting the occurrence of each character in the range ['a' .. 'z']. We can subtract 'a' from a character to get its zero-based index.
var charCount = new int[26];
foreach (char c in input) {
charCount[c - 'a']++;
}
The length of the shortest possible sub-string equals the number of distinct characters in the input.
int totalDistinctCharCount = charCount.Where(c => c > 0).Count();
To count the number of distinct characters in the sub-string, we need the following Boolean array:
var hasCharOccurred = new bool[26];
Now, let's test sub-strings starting at different positions. The maximum start position must allow sub-strings at least as long as the totalDistinctCharCount (the shortest possible sub-string).
string shortest = input;
for (int start = 0; start <= input.Length - totalDistinctCharCount; start++) {
...
}
return shortest;
Inside this loop we have another loop counting the distinct characters of the sub-string. Note that we work directly on the input string to avoid creating a lot of new strings. We only need to test sub-strings that are shorter than any shortest one found before. Therefore the inner loop uses Math.Min(input.Length, start + shortest.Length - 1) as limit. Content of loop (in place of the ... in the upper code snippet):
int distinctCharCount = 0;
// No need to go past the length of the previously found shortest.
for (int i = start; i < Math.Min(input.Length, start + shortest.Length - 1); i++) {
int chIndex = input[i] - 'a';
if (!hasCharOccurred[chIndex]) {
hasCharOccurred[chIndex] = true;
distinctCharCount++;
if (distinctCharCount == totalDistinctCharCount) {
shortest = input.Substring(start, i - start + 1);
break; // Found a shorter one, exit this inner loop.
}
}
}
// We cannot omit characters occurring only once
if (charCount[input[start] - 'a'] == 1) {
break; // Start cannot go beyond this point.
}
// Clear hasCharOccurred, to avoid creating a new array every time.
for (int i = 0; i < 26; i++) {
hasCharOccurred[i] = false;
}
A further optimization is that we stop as soon as we encounter a character at the start position occurring only once in the input string (charCount[input[start] - 'a'] == 1). Since every distinct character of the input must be present in the sub-string, this one must be part of the sub-string.
We can print the result in the console with
string shortest = ShortestSubstring(TestString);
Console.WriteLine($"Shortest, Length = {shortest.Length}, \"{shortest}\"");

Related

Obtain the lexicographically smallest string possible, by using at most P points

⦁ Replace and/or re-arrange characters of this given string to get the lexicographically smallest string possible. For this, you can perform the following two operations any number of times.
⦁ Swap any two characters in the string. This operation costs 1 point. (any two, need not be adjacent)
⦁ Replace a character in the string with any other lower case English letter. This operation costs 2 points.
Obtain the lexicographically smallest string possible, by using at most P points.
Input:
⦁ Two lines of input, first-line containing two integers N, P.
⦁ The second line contains a string S consisting of N characters.
Output:
Lexicographically smallest string obtained.
For example:
Sample Input:
3 3
bba
Sample Output:
aab
I've tried this, but it doesn't take the P points into account. I don't know how to do that; can you please help me with it?
namespace Lexicographical
{
class GFG
{
// Function to return the lexicographically
// smallest String that can be formed by
// swapping at most one character.
// The characters might not necessarily
// be adjacent.
static String findSmallest(char[] s)
{
int len = s.Length;
// Store last occurrence of every character
int[] loccur = new int[26];
// Set -1 as default for every character.
for (int i = 0; i < 26; i++)
loccur[i] = -1;
for (int i = len - 1; i >= 0; --i)
{
// char index to fill
// in the last occurrence array
int chI = s[i] - 'a';
if (loccur[chI] == -1)
{
// If this is true then this
// character is being visited
// for the first time from the last
// Thus last occurrence of this
// character is stored in this index
loccur[chI] = i;
}
}
char[] sorted_s = s;
Array.Sort(sorted_s);
for (int i = 0; i < len; ++i)
{
if (s[i] != sorted_s[i])
{
// char to replace
int chI = sorted_s[i] - 'a';
// Find the last occurrence
// of this character.
int last_occ = loccur[chI];
// Swap this with the last occurrence
char temp = s[last_occ];
s[last_occ] = s[i];
s[i] = temp;
break;
}
}
return String.Join("", s);
}
// Driver code
public static void Main(String[] args)
{
String s = "abb";
Console.Write(findSmallest(s.ToCharArray()));
}
}
}
The output for this is abb, but it should be aab.
I want to know how I can apply the point budget from the question above to this code.

Skipping a range of values in for loop C#

I'm trying to cycle through chars in a string.
string cycleMe = "Hi StackOverflow! Here is my string.";
However, I want to skip over certain ranges of indexes. The ranges I want to skip over are stored in a List of objects, delims.
List<Delim> delims = delimCreator();
To retrieve each starting index and ending index for a range, I have to write a loop that accesses each "delim":
delims[0].getFirstIndex() //results in, say, index 2
delims[0].getLastIndex() //results in, say, index 4
delims[1].getFirstIndex() //results in, say, index 5
delims[1].getLastIndex() //results in, say, index 7
(there can be infinitely many "delim" objects in play)
If the above were my list, I'd want to print the string cycleMe, but skip all the chars between 2 and 4 (inclusive) and 5 and 7 (inclusive).
Expected output using the numbers above:
HiOverflow! Here is my string.
Here is the code I have written so far. It loops far more often than I'd expect (roughly twice the number of characters in the string). Thanks in advance! =)
List<Delim> delims = delimAggregateInator(displayTextRaw);
for (int x = 0; x < cycleMe.Length;x++){
for (int i = 0; i < delims.Count; i++){
if (!(x >= delims[i].getFirstIndex() && x <= delims[i].getLastIndex())){
Debug.Log("test");
}
}
}
I assume that by skipping you mean you want to omit those characters from the original string. If that is the case, you can try the Aggregate extension method like below.
string result = delims.Aggregate<Delim, string>(cycleMe, (str, d) => str.Remove(d.FirstIndex, (d.LastIndex - d.FirstIndex) + 1));
Make sure that the delim list is ordered from the last range to the first (descending start index); otherwise each removal shifts the indexes of the ranges that come after it.
One solution is to convert the string to a char array, replace the desired parts with spaces, and convert the result back to a string. Note that this blanks the characters out rather than removing them.
Here is the modified version of your code:
string cycleMe = "Hi StackOverflow! Here is my string.";
var charArray = cycleMe.ToCharArray(); // Converting to char array
List<Delim> delims = delimAggregateInator(displayTextRaw);
for (int x = 0; x < cycleMe.Length;x++){
for (int i = 0; i < delims.Count; i++){
// ORIGINAL: if (!(x >= delims[i].getFirstIndex() && x <= delims[i].getLastIndex())){
if (x >= delims[i].getFirstIndex() && x <= delims[i].getLastIndex()){
Debug.Log("test");
charArray[x] = ' '; // Replacing the item with space
}
}
}
string output = new string(charArray); // Converting back to string
P.S. This is probably not the most optimal solution but at least it should work.
You should use LINQ for that
struct Delim
{
public int First { get; set; }
public int Last { get; set; }
}
static void Main(string[] args)
{
string cycleMe = "Hi StackOverflow! Here is my string.";
var delimns = new List<Delim> { new Delim { First=2, Last=4}, new Delim { First = 5, Last = 7 } };
var cut = cycleMe.Where((c, i) =>
!delimns.Any(d => i >= d.First && i <= d.Last));
Console.WriteLine(new string(cut.ToArray()));
}
That means I am basically only selecting letters, at positions which are not part of any cutting range.
Also: Fix your naming. A delimiter is a character, not a position (numeric)
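The answers above test every character against every range. If the string or the list of ranges is long, a single pass that copies the kept stretches between ranges may be preferable. The following is a sketch under a few assumptions: the ranges are passed as (First, Last) tuples instead of the question's Delim objects, both ends are inclusive, and the names RangeSkipper and SkipRanges are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class RangeSkipper
{
    // Copies text to the output, omitting every index covered by a range.
    // Ranges are inclusive on both ends, as in the question.
    public static string SkipRanges(string text, IEnumerable<(int First, int Last)> ranges)
    {
        var sb = new StringBuilder(text.Length);
        int pos = 0; // next index of text that has not been handled yet

        foreach (var (first, last) in ranges.OrderBy(r => r.First))
        {
            // Copy the kept characters in front of this range.
            if (first > pos)
            {
                int end = Math.Min(first, text.Length);
                sb.Append(text, pos, end - pos);
            }

            // Jump past the range; Math.Max copes with overlapping ranges.
            pos = Math.Max(pos, last + 1);
            if (pos >= text.Length) break;
        }

        // Copy whatever follows the last range.
        if (pos < text.Length) sb.Append(text, pos, text.Length - pos);
        return sb.ToString();
    }
}
```

With the question's string and the ranges (2, 4) and (5, 7), SkipRanges returns "HiOverflow! Here is my string.". Sorting the ranges first means the caller does not have to supply them in order.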

Reassigning a value to a character array not working

I am currently having issues reassigning a value to a character array. Below is my code (unfinished solution to find the next smallest palindrome):
public int nextSmallestPalindrome(int number)
{
string numberString = number.ToString();
// Case 1: Palindrome is all 9s
for (int i = 0; i < numberString.Length; i++)
{
if (numberString[i] != '9')
{
break;
}
int result = number + 2;
return result;
}
// Case 2: Is a palindrome
int high = numberString.Length - 1;
int low = 0;
bool isPalindrome = true;
for (low = 0; low <= high; low++, high--)
{
if (numberString[low] != numberString[high])
{
isPalindrome = false;
break;
}
}
char[] array = numberString.ToCharArray();
if (isPalindrome == true)
{
// While the middle character is 9
while (numberString[high] == '9' || numberString[low] == '9')
{
array[high] = '0';
array[low] = '0';
high++;
low--;
}
int replacedvalue1 = (int)Char.GetNumericValue(numberString[high]) + 1;
int replacedvalue2 = (int)Char.GetNumericValue(numberString[low]) + 1;
StringBuilder result = new StringBuilder(new string(array));
if (high == low)
{
result[high] = (char)replacedvalue1;
}
else
{
Console.WriteLine(result.ToString());
result[high] = (char)replacedvalue1;
Console.WriteLine(result.ToString());
result[low] = (char)replacedvalue2;
}
return Int32.Parse(result.ToString());
}
else return -1;
}
Main class runs:
Console.WriteLine(nextSmallestPalindrome(1001));
This prints 1001, then 101, and then throws a FormatException at the return Int32.Parse(result.ToString()); statement.
I am very confused, as I believe "result" should be 1101 after I assign result[high] = (char)replacedvalue1;. Printing replacedvalue1 gives me "1" as expected. However, debugging it line by line shows that "1001" turns into "1 1" at the end, signifying strange characters.
What could be going wrong?
Thanks
Characters and numbers aren't the same thing. I find it easiest to keep an ASCII chart open when doing this sort of thing.
If you look at one of those charts, you'll see that the character 0 actually has a decimal value of 48.
char c = (char)48; // Equals the character '0'
The reverse is also true:
char c = '0';
int i = (int)c; // Equals the number 48
You managed to keep chars and ints separate for the most part, but at the end you got them mixed up:
// Char.GetNumericValue('0') will return the number 0
// so now replacedvalue1 will equal 1
int replacedvalue1 = (int)Char.GetNumericValue(numberString[high]) + 1;
// You are casting the number 1 to a character, which according to the
// ASCII chart is the (unprintable) character SOH (start of heading)
result[high] = (char)replacedvalue1;
FYI you don't actually need to cast a char back-and-forth in order to perform operations on it. char c = 'a'; c++; is valid, and will equal the next character on the table ('b'). Similarly you can increment numeric characters:
char c = '0'; c++; // c now equals '1'
Edit: The easiest way to turn an integer 1 into the character '1' is to "add" the integer to the character '0':
result[high] = (char)('0' + replacedvalue1);
Of course there are much easier ways to accomplish what you are trying to do, but these techniques (converting and adding chars and ints) are good tools to know.
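To make the conversion in both directions explicit, here is a small sketch of two helpers. The names DigitChars, ToDigitChar and ToDigitValue are made up for illustration and are not part of the .NET library.

```csharp
using System;

public static class DigitChars
{
    // Converts a digit value 0..9 to its character '0'..'9'.
    public static char ToDigitChar(int value)
    {
        if (value < 0 || value > 9)
            throw new ArgumentOutOfRangeException(nameof(value));
        return (char)('0' + value);
    }

    // Converts a digit character '0'..'9' back to its value 0..9.
    public static int ToDigitValue(char c)
    {
        if (c < '0' || c > '9')
            throw new ArgumentOutOfRangeException(nameof(c));
        return c - '0';
    }
}
```

For the array in the question, something like array[i] = DigitChars.ToDigitChar(DigitChars.ToDigitValue(array[i]) + 1); would increment a digit in place, provided the digit is not already '9' (the helper throws in that case instead of silently producing a non-digit character).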
You do not have to write that much code to do it.
Here is your IsPalindrome method:
private static bool IsPalindrome(int n)
{
string ns = n.ToString(CultureInfo.InvariantCulture);
var reversed = string.Join("", ns.Reverse());
return (ns == reversed);
}
private static int FindTheNextSmallestPalindrome(int x)
{
for (int i = x; i < 2147483647; i++)
{
if (IsPalindrome(i))
{
return i;
}
}
throw new Exception("Number must be less than 2147483647");
}
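This is how you call it. You do not need an array to call it. You can just enter any number which is less than 2147483647 (the max value of int) and get the next palindrome value. Note that the search starts at x itself, so a number that is already a palindrome is returned unchanged; start the loop at x + 1 if you need a strictly larger one.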
var mynumbers = new[] {10, 101, 120, 110, 1001};
foreach (var mynumber in mynumbers)
{
Console.WriteLine(FindTheNextSmallestPalindrome(mynumber));
}

String Combinations With Character Replacement

I am trying to work through a scenario I haven't seen before and am struggling to come up with an algorithm to implement this properly. Part of my problem is a hazy recollection of the proper terminology. I believe what I am needing is a variation of the standard "combination" problem, but I could well be off there.
The Scenario
Given an example string "100" (let's call it x), produce all combinations of x that swap out one or more of those 0 (zero) characters for an o (lower-case o). So, for the simple example of "100", I would expect this output:
"100"
"10o"
"1o0"
"1oo"
This would need to support varying length strings with varying numbers of 0 characters, but assume there would never be more than 5 instances of 0.
I have this very simple algorithm that works for my sample of "100" but falls apart for anything longer/more complicated:
public IEnumerable<string> Combinations(string input)
{
char[] buffer = new char[input.Length];
for(int i = 0; i != buffer.Length; ++i)
{
buffer[i] = input[i];
}
//return the original input
yield return new string(buffer);
//look for 0's and replace them
for(int i = 0; i != buffer.Length; ++i)
{
if (input[i] == '0')
{
buffer[i] = 'o';
yield return new string(buffer);
buffer[i] = '0';
}
}
//handle the replace-all scenario
yield return input.Replace("0", "o");
}
I have a nagging feeling that recursion could be my friend here, but I am struggling to figure out how to incorporate the conditional logic I need here.
Your guess was correct; recursion is your friend for this challenge. Here is a simple solution:
public static IEnumerable<string> Combinations(string input)
{
int firstZero = input.IndexOf('0'); // Get index of first '0'
if (firstZero == -1) // Base case: no further combinations
return new string[] { input };
string prefix = input.Substring(0, firstZero); // Substring preceding '0'
string suffix = input.Substring(firstZero + 1); // Substring succeeding '0'
// e.g. Suppose input was "fr0d00"
// Prefix is "fr"; suffix is "d00"
// Recursion: Generate all combinations of suffix
// e.g. "d00", "d0o", "do0", "doo"
var recursiveCombinations = Combinations(suffix);
// Return sequence in which each string is a concatenation of the
// prefix, either '0' or 'o', and one of the recursively-found suffixes
return
from chr in "0o" // char sequence equivalent to: new [] { '0', 'o' }
from recSuffix in recursiveCombinations
select prefix + chr + recSuffix;
}
This works for me:
public IEnumerable<string> Combinations(string input)
{
var head = input[0] == '0' //Do I have a `0`?
? new [] { "0", "o" } //If so output both `"0"` & `"o"`
: new [] { input[0].ToString() }; //Otherwise output the current character
var tails = input.Length > 1 //Is there any more string?
? Combinations(input.Substring(1)) //Yes, recursively compute
: new[] { "" }; //Otherwise, output empty string
//Now, join it up and return
return
from h in head
from t in tails
select h + t;
}
You don't need recursion here, you can enumerate your patterns and treat them as binary numbers. For example, if you have three zeros in your string, you get:
0 000 ....0..0....0...
1 001 ....0..0....o...
2 010 ....0..o....0...
3 011 ....0..o....o...
4 100 ....o..0....0...
5 101 ....o..0....o...
6 110 ....o..o....0...
7 111 ....o..o....o...
You can implement that with bitwise operators or by treating the chars that you want to replace like an odometer.
Below is an implementation in C. I'm not familiar with C# and from the other answers I see that C# already has suitable standard classes to implement what you want. (Although I'm surprised that so many people propose recursion here.)
So this is more an explanation or illustration of my comment to the question than an implementation advice for your problem.
int binrep(char str[])
{
int zero[40]; // indices of zeros
int nzero = 0; // number of zeros in string
int ncombo = 1; // number of result strings
int i, j;
for (i = 0; str[i]; i++) {
if (str[i] == '0') {
zero[nzero++] = i;
ncombo <<= 1;
}
}
for (i = 0; i < ncombo; i++) {
for (j = 0; j < nzero; j++) {
str[zero[j]] = ((i >> j) & 1) ? 'o' : '0';
}
printf("%s\n", str); // should yield here
}
return ncombo;
}
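For readers who want the same binary-counter idea in C#, a rough translation of the C function could look like the following. It is a sketch, not the answerer's code: the class name ZeroPatterns is made up, and the limit of 30 zeros is an assumption chosen so that the counter fits in an int (the question only needs 5).

```csharp
using System;
using System.Collections.Generic;

public static class ZeroPatterns
{
    // Yields every variant of input in which each '0' is either kept or
    // replaced by 'o'. Bit j of the counter decides the j-th zero.
    public static IEnumerable<string> Combinations(string input)
    {
        // Remember where the zeros are.
        var zeroPositions = new List<int>();
        for (int i = 0; i < input.Length; i++)
        {
            if (input[i] == '0') zeroPositions.Add(i);
        }

        // Assumption: at most 30 zeros, so that 1 << count fits in an int.
        if (zeroPositions.Count > 30)
            throw new ArgumentException("Too many zeros.", nameof(input));

        int total = 1 << zeroPositions.Count; // number of result strings
        char[] buffer = input.ToCharArray();
        for (int mask = 0; mask < total; mask++)
        {
            for (int j = 0; j < zeroPositions.Count; j++)
            {
                buffer[zeroPositions[j]] = ((mask >> j) & 1) == 1 ? 'o' : '0';
            }
            yield return new string(buffer);
        }
    }
}
```

For "100" this yields "100", "1o0", "10o" and "1oo", in that order. Because the method is an iterator, results are produced lazily, which corresponds to the "should yield here" comment in the C version.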
Here's a solution using recursion, and your buffer array:
private static void Main(string[] args)
{
var a = Combinations("100");
var b = Combinations("10000");
}
public static IEnumerable<string> Combinations(string input)
{
var combinations = new List<string>();
combinations.Add(input);
for (int i = 0; i < input.Length; i++)
{
char[] buffer= input.ToArray();
if (buffer[i] == '0')
{
buffer[i] = 'o';
combinations.Add(new string(buffer));
combinations = combinations.Concat(Combinations(new string(buffer))).ToList();
}
}
return combinations.Distinct();
}
The method adds the raw input as the first result. After that, we loop over the string, replace each 0 we see with an o, and call the method again with that new input, which covers the case of multiple 0s.
Finally, we end up with a couple duplicates, so we use Distinct.
I know that the earlier answers are better. But I don't want my code to go to waste. :)
My approach for this combinatorics problem would be to take advantage of how binary numbers work. My algorithm would be as follows:
List<string> ZeroCombiner(string str)
{
// Get number of zeros.
var n = str.Count(c => c == '0');
var limit = (int)Math.Pow(2, n);
// Create strings of '0' and 'o' based on binary numbers from 0 to 2^n - 1.
var binaryStrings = new List<string>();
for (int i = 0; i < limit; ++i )
{
binaryStrings.Add(Binary(i, n + 1));
}
// Replace each zero with respect to each binary string.
var result = new List<string>();
foreach (var binaryString in binaryStrings)
{
var zeroCounter = 0;
var combinedString = string.Empty;
for (int i = 0; i < str.Length; ++i )
{
if (str[i] == '0')
{
combinedString += binaryString[zeroCounter];
++zeroCounter;
}
else
combinedString += str[i];
}
result.Add(combinedString);
}
return result;
}
string Binary(int i, int n)
{
string result = string.Empty;
while (n != 0)
{
result = result + (i % 2 == 0 ? '0' : 'o');
i = i / 2;
--n;
}
return result;
}

C# Finding relevant document snippets for search result display

In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft Sql Server's Full Text Search engine instead of something more robust like Lucene.Net.
One of the features I would like to have, though, is Google-esque relevant document snippets. I quickly found that determining "relevant" snippets is more difficult than I realized.
I want to choose snippets based on search term density in the found text. So, essentially, I need to find the most search-term-dense passage in the text, where a passage is some arbitrary number of characters (say 200, but it really doesn't matter).
My first thought is to use .IndexOf() in a loop and build an array of term distances (subtract the index of the found term from the previously found term), then ... what? Add up any two, any three, any four, any five, sequential array elements and use the one with the smallest sum (hence, the smallest distance between search terms).
That seems messy.
Is there an established, better, or more obvious way to do this than what I have come up with?
Although it is implemented in Java, you can see one approach for that problem here:
http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html
I know this thread is way old, but I gave this a try last week and it was a pain in the back side. This is far from perfect, but this is what I came up with.
The snippet generator:
public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
{
string snippedString = "";
List<int> keywordLocations = new List<int>();
//Get the locations of all keywords
for (int i = 0; i < Keywords.Count(); i++)
keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));
//Sort locations
keywordLocations.Sort();
//Remove locations which are closer to each other than the SnippetLength
if (keywordLocations.Count > 1)
{
bool found = true;
while (found)
{
found = false;
for (int i = keywordLocations.Count - 1; i > 0; i--)
if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength / 2)
{
keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;
keywordLocations.RemoveAt(i);
found = true;
}
}
}
//Make the snippets
if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
snippedString = "... ";
foreach (int i in keywordLocations)
{
int stringStart = Math.Max(0, i - SnippetLength / 2);
int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
snippedString += StringToSnip.Substring(stringStart, stringLength);
if (stringEnd < StringToSnip.Length) snippedString += " ... ";
if (snippedString.Length > 200) break;
}
return snippedString;
}
The function which will find the index of all keywords in the sample text
private static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
{
int pos;
int offset = 0;
int length = needle.Length;
List<int> positions = new List<int>();
while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
{
positions.Add(pos);
offset = pos + length;
}
return positions;
}
It's a bit clumsy in its execution. The way it works is by finding the position of all keywords in the string. Then checking that no keywords are closer to each other than the desired snippet length, so that snippets won't overlap (that's where it's a bit iffy...). And then grabs substrings of the desired length centered around the position of the keywords and stitches the whole thing together.
I know this is years late, but posting just in case it might help somebody coming across this question.
public class Highlighter
{
private class Packet
{
public string Sentence;
public double Density;
public int Offset;
}
public static string FindSnippet(string text, string query, int maxLength)
{
if (maxLength < 0)
{
throw new ArgumentException("maxLength");
}
var words = query.Split(' ').Where(w => !string.IsNullOrWhiteSpace(w)).Select(word => word.ToLower()).ToLookup(s => s);
var sentences = text.Split('.');
var i = 0;
var packets = sentences.Select(sentence => new Packet
{
Sentence = sentence,
Density = ComputeDensity(words, sentence),
Offset = i++
}).OrderByDescending(packet => packet.Density);
var list = new SortedList<int, string>();
int length = 0;
foreach (var packet in packets)
{
if (length >= maxLength || packet.Density == 0)
{
break;
}
string sentence = packet.Sentence;
list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
length += packet.Sentence.Length;
}
var sb = new List<string>();
int previous = -1;
foreach (var item in list)
{
var offset = item.Key;
var sentence = item.Value;
if (previous != -1 && offset - previous != 1)
{
sb.Add(".");
}
previous = offset;
sb.Add(Highlight(sentence, words));
}
return String.Join(".", sb);
}
private static string Highlight(string sentence, ILookup<string, string> words)
{
var sb = new List<string>();
var ff = true;
foreach (var word in sentence.Split(' '))
{
var token = word.ToLower();
if (ff && words.Contains(token))
{
sb.Add("[[HIGHLIGHT]]");
ff = !ff;
}
if (!ff && !string.IsNullOrWhiteSpace(token) && !words.Contains(token))
{
sb.Add("[[ENDHIGHLIGHT]]");
ff = !ff;
}
sb.Add(word);
}
if (!ff)
{
sb.Add("[[ENDHIGHLIGHT]]");
}
return String.Join(" ", sb);
}
private static double ComputeDensity(ILookup<string, string> words, string sentence)
{
if (string.IsNullOrEmpty(sentence) || words.Count == 0)
{
return 0;
}
int numerator = 0;
int denominator = 0;
foreach(var word in sentence.Split(' ').Select(w => w.ToLower()))
{
if (words.Contains(word))
{
numerator++;
}
denominator++;
}
if (denominator != 0)
{
return (double)numerator / denominator;
}
else
{
return 0;
}
}
}
Example:
highlight "Optic flow is defined as the change of structured light in the image, e.g. on the retina or the camera’s sensor, due to a relative motion between the eyeball or camera and the scene. Further definitions from the literature highlight different properties of optic flow" "optic flow"
Output:
[[HIGHLIGHT]] Optic flow [[ENDHIGHLIGHT]] is defined as the change of structured
light in the image, e... Further definitions from the literature highlight diff
erent properties of [[HIGHLIGHT]] optic flow [[ENDHIGHLIGHT]]
Well, here's the hacked-together version I made using the algorithm I described above. I don't think it is all that great. It uses three (count 'em, three!) loops, an array, and two lists. But, well, it is better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.
private static string FindRelevantSnippets(string infoText, string[] searchTerms)
{
List<int> termLocations = new List<int>();
foreach (string term in searchTerms)
{
int termStart = infoText.IndexOf(term);
while (termStart >= 0) // IndexOf returns -1 when nothing is found; index 0 is a valid match
{
termLocations.Add(termStart);
termStart = infoText.IndexOf(term, termStart + 1);
}
}
if (termLocations.Count == 0)
{
if (infoText.Length > 250)
return infoText.Substring(0, 250);
else
return infoText;
}
termLocations.Sort();
List<int> termDistances = new List<int>();
for (int i = 0; i < termLocations.Count; i++)
{
if (i == 0)
{
termDistances.Add(0);
continue;
}
termDistances.Add(termLocations[i] - termLocations[i - 1]);
}
int smallestSum = int.MaxValue;
int smallestSumIndex = 0;
for (int i = 0; i < termDistances.Count; i++)
{
int sum = termDistances.Skip(i).Take(5).Sum();
if (sum < smallestSum)
{
smallestSum = sum;
smallestSumIndex = i;
}
}
int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
int len = Math.Min(smallestSum, infoText.Length - start);
len = Math.Min(len, 250);
return infoText.Substring(start, len);
}
Some improvements I could think of would be to return multiple "snippets" with a shorter length that add up to the longer length -- this way multiple parts of the document can be sampled.
This is a nice problem :)
I think I'd create an index vector: For each word, create an entry 1 if search term or otherwise 0. Then find the i such that sum(indexvector[i:i+maxlength]) is maximized.
This can actually be done rather efficiently. Start with the number of search terms in the first maxlength words. Then, as you move on, decrease your counter if indexvector[i]=1 (i.e. you're about to lose that search term as you increase i) and increase it if indexvector[i+maxlength]=1 (the word that enters the window). As you go, keep track of the i with the highest counter value.
Once you have your favourite i, you can still do some fine-tuning, such as checking whether you can reduce the actual size without compromising your counter, e.g. in order to find sentence boundaries or whatever. Or you can pick the right i out of several i's with equal counter values.
Not sure if this is a better approach than yours - it's a different one.
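The index-vector idea above could be sketched in C# as follows. This is only an illustration under some assumptions: the text has already been split into words, the search terms are given as a lower-case set, and the names DensestWindow and FindBestStart are made up.

```csharp
using System;
using System.Collections.Generic;

public static class DensestWindow
{
    // Returns the index of the first word of the window of windowSize words
    // that contains the most search terms (the earliest such window on ties).
    // Assumption: the entries of terms are already lower-case.
    public static int FindBestStart(string[] words, ISet<string> terms, int windowSize)
    {
        if (words.Length == 0 || windowSize <= 0) return 0;
        int size = Math.Min(windowSize, words.Length);

        // Index vector: 1 where the word is a search term, otherwise 0.
        var hit = new int[words.Length];
        for (int i = 0; i < words.Length; i++)
        {
            hit[i] = terms.Contains(words[i].ToLowerInvariant()) ? 1 : 0;
        }

        // Count the hits in the first window.
        int count = 0;
        for (int i = 0; i < size; i++) count += hit[i];

        int best = count;
        int bestStart = 0;

        // Slide by one word: drop the word leaving on the left,
        // add the word entering on the right.
        for (int start = 1; start + size <= words.Length; start++)
        {
            count += hit[start + size - 1] - hit[start - 1];
            if (count > best)
            {
                best = count;
                bestStart = start;
            }
        }
        return bestStart;
    }
}
```

The snippet would then be the windowSize words starting at the returned index. The window here is measured in words rather than characters, which differs from the 200-character passage in the question; converting between the two is left open.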
You might also want to check out this paper on the topic, which comes with yet-another baseline: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf
I took another approach, perhaps it will help someone...
First it checks whether the word appears at all, in my case ignoring case (you can of course change this yourself).
Then I create a list of Regex matches, splitting the text on separators, and search for the first match that contains the word (allowing partial, case-insensitive matches).
From that index, I take the 10 matches in front of and behind the word, which make up the snippet.
public static string GetSnippet(string text, string word)
{
if (text.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) == -1)
{
return "";
}
var matches = new Regex(@"\b(\S+)\s?", RegexOptions.Singleline | RegexOptions.Compiled).Matches(text);
var p = -1;
for (var i = 0; i < matches.Count; i++)
{
if (matches[i].Value.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) != -1)
{
p = i;
break;
}
}
if (p == -1) return "";
var snippet = "";
for (var x = Math.Max(p - 10, 0); x < Math.Min(p + 10, matches.Count); x++) // stay inside the match list
{
snippet += matches[x].Value + " ";
}
return snippet;
}
If you use CONTAINSTABLE you will get a RANK back, which is in essence a density value: the higher the RANK value, the higher the density. This way, you just run a query to get the results you want and don't have to resort to massaging the data when it's returned.
Wrote a function to do this just now. You want to pass in:
Inputs:
Document text
This is the full text of the document you're taking a snippet from. Most likely you will want to strip out any BBCode/HTML from this document.
Original query
The string the user entered as their search
Snippet length
Length of the snippet you wish to display.
Return Value:
Start index of the document text to take the snippet from. To get the snippet simply do documentText.Substring(returnValue, snippetLength). This has the advantage that you know if the snippet is taken from the start/end/middle so you can add some decoration like ... if you wish at the snippet start/end.
Performance
A resolution set to 1 will find the best snippet but moves the window along 1 char at a time. Set this value higher to speed up execution.
Tweaks
You can work out the score however you want. In this example I've used Math.Pow(wordLength, 2) to favour longer words.
private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
{
// Normalise document text
documentText = documentText.Trim();
if (string.IsNullOrWhiteSpace(documentText)) return 0;
// Return 0 if entire doc fits in snippet
if (documentText.Length <= snippetLength) return 0;
// Break query down into words
var wordsInQuery = new HashSet<string>();
{
var queryWords = originalQuery.Split(' ');
foreach (var word in queryWords)
{
var normalisedWord = word.Trim().ToLower();
if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
if (wordsInQuery.Contains(normalisedWord)) continue;
wordsInQuery.Add(normalisedWord);
}
}
// Create moving window to get maximum trues
var windowStart = 0;
double maxScore = 0;
var maxWindowStart = 0;
// Higher number less accurate but faster
const int resolution = 5;
while (true)
{
var text = documentText.Substring(windowStart, snippetLength);
// Get score of this chunk
// This isn't perfect, as window moves in steps of resolution first and last words will be partial.
// Could probably be improved to iterate words and not characters.
var words = text.Split(' ').Select(c => c.Trim().ToLower());
double score = 0;
foreach (var word in words)
{
if (wordsInQuery.Contains(word))
{
// The longer the word, the more important.
// Can simply replace with score += 1 for simpler model.
score += Math.Pow(word.Length, 2);
}
}
if (score > maxScore)
{
maxScore = score;
maxWindowStart = windowStart;
}
// Setup next iteration
windowStart += resolution;
// Window end passed document end
if (windowStart + snippetLength >= documentText.Length)
{
break;
}
}
return maxWindowStart;
}
There is lots more you can add to this. For example, instead of comparing exact words, you might want to compare their SOUNDEX codes and weight soundex matches less than exact matches.
