For my program, I've prompted the user to put 20 names into an array (the array size is 5 for testing for now), this array is then sent to a text document. I need to make it so that it will randomly pick a name from the list and display it (which I have done). But I now need to make it increase the chances of a name being picked, how would I go about doing this?
Eg. I want to increase the chances of the name 'Jim' being picked from the array.
class Programt
{
static void readFile()
{
}
static void Main(string[] args)
{
string winner;
string file = #"C:\names.txt";
string[] classNames = new string[5];
Random RandString = new Random();
Console.ForegroundColor = ConsoleColor.White;
if (File.Exists(file))
{
Console.WriteLine("Names in the text document are: ");
foreach (var displayFile in File.ReadAllLines(file))
Console.WriteLine(displayFile);
Console.ReadKey();
}
else
{
Console.WriteLine("Please enter 5 names:");
for (int i = 0; i < 5; i++)
classNames[i] = Console.ReadLine();
File.Create(file).Close();
File.WriteAllLines(file, classNames);
Console.WriteLine("Writing names to file...");
winner = classNames[RandString.Next(0, classNames.Length)];
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\nThe winner of the randomiser is: {0} Congratulations! ", winner);
Thread.Sleep(3000);
Console.Write("Completed");
Thread.Sleep(1000);
}
}
}
There's two ways of doing this. You can either produce a RNG with a normal distribution targeting one number.
Or the simpler way is a translational step. Generate in the range 0-100 and then produce code which translates to the answer in a biased way e.g.
0-5 : Answer 1
6-10: Answer 2
11-90: Answer 3
91-95: Answer 4
96-100: Answer 5
This gives an 80% chance of picking Answer 3, the others only get a 5% chance
So where you currently have RandString.Next(0, classNames.Length) you can replace that with a function something like GetBiasedIndex(0, classNames.Length, 3)
The function would look something like this (with test code):
public Form1()
{
InitializeComponent();
int[] results = new int[5];
Random RandString = new Random();
for (int i = 0; i < 1000; i++)
{
var output = GetBiasedIndex(RandString, 0, 4, 3);
results[output]++;
}
StringBuilder builder = new StringBuilder();
for (int i = 0; i < 5; i++)
{
builder.AppendLine(results[i].ToString());
}
label1.Text = builder.ToString();
}
private int GetBiasedIndex(Random rng, int start, int end, int target)
{
//Number between 0 and 100 (Helps to think percentage)
var check = rng.Next(0, 100);
//There's a few ways to do this next bit, but I'll try to keep it simple
//Allocate x% to the target and split the remaining y% among all the others
int x = 80;//80% chance of target
int remaining = 100 - x;//20% chance of something else
//Take the check for the target out of the last x% (we can take it out of any x% chunk but this makes it simpler
if (check > (100 - x))
{
return target;
}
else
{
//20% left if there's 4 names remaining that's 5% each
var perOther = (100 - x) / ((end - start) - 1);
//result is now in the range 0..4
var result = check / perOther;
//avoid hitting the target in this section
if (result >= target)
{
//adjust the index we are returning since we have already accounted for the target
result++;
}
//return the index;
return result;
}
}
and the output:
52
68
55
786
39
If you're going to call this function repeatedly you'll need to pass in the instance of the RNG so that you don't reset the seed each call.
If you want to target a name instead of an index you just need to look up that name first and have an else condition for when that name isn't found
Related
I have a code that generates a user specified amount of random number from a range of also user specified numbers without repeats. The problem I have is that for a while the program works fine and then it just keeps giving the same number it seems, which breaks it.
Here's the whole code, some of it is in hungarian but if you want to try it for yourself to see what the problem really is: the program first asks you the range of numbers(first the minimum then the maximum number) and then the amount of numbers you want generated. After that you will see the your numbers generated... or not if it decides to not work. If you did get your numbers if you press "1" it will generate again with the same data you provided, pressing "2" will allow you to give different parameters and pressing "0" will quit the program.
If needed I can translate the program to help solve it.
using System;
using System.Collections.Generic;
namespace numgen
{
class Program
{
static int quantity, min, max, temp;
static bool check = false;
static int control = 2;
public static void Main(string[] args)
{
var gen = new List<int>();
Random rnd = new Random();
while(control > 0) {
Console.Clear();
if(control == 2) {
Console.Write("Add meg a minimum számot: ");
min = int.Parse(Console.ReadLine());
Console.Write("Add meg a maximum számot: ");
max = int.Parse(Console.ReadLine());
Console.Write("Add meg a hány számot kérsz: ");
quantity = int.Parse(Console.ReadLine());
}
gen.Clear();
//gen.Add(rnd.Next(min, max));
while(gen.Count < quantity) {
temp = rnd.Next(min, max);
foreach(int num in gen) {
if(temp == num) {
check = true;
break;
}
}
if(!check) {
gen.Add(temp);
//Uncomment to see number getting generated and when the program breaks
//Console.WriteLine(gen[gen.Count-1]);
}
}
gen.Sort();
foreach(int num in gen){
Console.WriteLine(num);
}
Console.WriteLine("\n[2] Új adatok megadása");
Console.WriteLine("[1] Számok újragenerálása");
Console.WriteLine("[0] Kilépés");
control = int.Parse(Console.ReadLine());
}
}
}
}
A HashSet is an excellent choice for this task. Elements are added as a key--so it will refuses any value that is already in the set.
Try something like this:
private static void TestHashSet(int min, int max, int quantity)
{
var gen = new HashSet<int>();
Random rnd = new Random();
while (gen.Count < quantity)
{
var temp = rnd.Next(min, max);
gen.Add(temp);
}
gen.AsEnumerable().OrderBy(s => s).ToList().ForEach(x => Console.WriteLine(x));
}
You are not setting check back to false, so your code breaks after the first duplicate is found.
You're not setting check back to false at any point in time.
Once you get the first collision you cease adding value to the list, so while (gen.Count < quantity) will never be true.
You might want to consider using a keyed Dictionary which will handle the indexing for you rather than some logic.
Also you're not checking if quantity is greater than the provided range. If it is you will always get collisions.
Consider a keyed dictionary, to take the logic checking away and let the Dictionary object take care of it.
var gen = new Dictionary<int, int>();
if (quantity > (max - min))
throw new ArgumentOutOfRangeException("Quantiy must be smaller than range");
while (gen.Count < quantity)
{
temp = rnd.Next(min, max);
if(!gen.ContainsKey(temp))
gen.Add(temp, temp);
}
foreach (int num in gen.Values)
{
Console.WriteLine(num);
}
Issues found in your script
When you set check = true;, you Never change it to false. gen.Add() never happens.
if someone selects minimum 2, max 3 and ask for 10, it will always have duplicate values.. Your code doesnt allow that and will run infinite loop. Calculate total numbers that can be generated between the two numbers and see if it can produce required quantity.
Not an issue but I would recommend using gen = new List<int>(); instead of Clear()
You can use Contains method to quickly check if the value is part of the list or not (instead of iterating over the entire list each time).
Updated your code and got it working like this,
while (control > 0)
{
Console.Clear();
if (control == 2)
{
Console.Write("Add meg a minimum számot: ");
min = int.Parse(Console.ReadLine());
Console.Write("Add meg a maximum számot: ");
max = int.Parse(Console.ReadLine());
Console.Write("Add meg a hány számot kérsz: ");
quantity = int.Parse(Console.ReadLine());
}
else if (control == 1)
gen = new List<int>();
if (max - min < quantity)
Console.WriteLine($"You cannot generate {quantity} between [{min} and {max}]");
else
{
while (gen.Count < quantity)
{
temp = rnd.Next(min, max);
if (!gen.Contains(temp))
gen.Add(temp);
}
gen.Sort();
gen.ForEach(x => Console.WriteLine(x));
}
Console.WriteLine("\n[2] Új adatok megadása");
Console.WriteLine("[1] Számok újragenerálása");
Console.WriteLine("[0] Kilépés");
control = int.Parse(Console.ReadLine());
}
What I need is:
e.g:
Sum should be equal to:
120 (user input)
Number of numbers/items:
80 (user input)
Range of numbers to be used in set(from):
0 (user input)
Range of numbers to be used in set(to):
4 (user input)
Output:
1,1,3,2,1,1,0,0,1,1,2,1,0,2,3,3,1,2,0,0,0,1,3,2,3,1,0,0,2,3,2,3,2,2,1,1,0,0,2,0,1,0,1,1,3,3,1,3,1,0,0,3,2,1,0,0,2,1,2,3,0,3,1,1,3,3,2,2,1,1,3,1,3,3,3,3,3,1,2,0
These are all numbers that are between 0 and 4, their sum is 120 and are 80 in total.
What i've done is:
static void Main(string[] args)
{
bool loopOn = true;
Program p = new Program();
Console.WriteLine("____________________________________________________________________________");
Console.WriteLine("");
Console.WriteLine("Sum should be equal to:");
int sum = int.Parse(Console.ReadLine());
Console.WriteLine("Number of items:");
int items = int.Parse(Console.ReadLine());
Console.WriteLine("Range(from):");
int from = int.Parse(Console.ReadLine());
Console.WriteLine("Range(to):");
int to = int.Parse(Console.ReadLine());
while (loopOn == true)
{
List<int> number_list = p.createNumberSet(items, from, to);
if (number_list.Sum() == sum)
{
loopOn = false;
Console.WriteLine("____________________________________________________________________________");
Console.WriteLine("Start");
number_list.ForEach(Console.WriteLine);
Console.WriteLine("Stop");
Console.WriteLine("____________________________________________________________________________");
}
}
Console.WriteLine("Press any key to exit....");
Console.ReadLine();
}
public List<int> createNumberSet(int itemNumber, int range_from, int range_to)
{
List<int> number_set = new List<int>();
Random r = new Random();
for (int i = 0; itemNumber > i; i++)
{
number_set.Add(r.Next(range_from, range_to));
}
return number_set;
}
But this seems extremely in-efficent and doesn't seem to work with a lot of other examples. Does anyone have a better way of doing this?
Well, I am a bit lazy right now, so this is just an idea
Keep the first part:
bool loopOn = true;
Program p = new Program();
Console.WriteLine("____________________________________________________________________________");
Console.WriteLine("");
Console.WriteLine("Sum should be equal to:");
int sum = int.Parse(Console.ReadLine());
Console.WriteLine("Number of items:");
int items = int.Parse(Console.ReadLine());
Console.WriteLine("Range(from):");
int from = int.Parse(Console.ReadLine());
Console.WriteLine("Range(to):");
int to = int.Parse(Console.ReadLine());
Now, first of all, check is a solution exists:
if (from * items > sum) {
// There is no solution, handle accordingly
}
Let's focus on the interesting part now:
First create the list of necessary items
int[] number_set = new int[items];
for(int i = 0; i < items; i++) {
number_set[i] = from;
}
Find the difference between the wanted sum and the current sum of the list
int left_to_add = sum - from * items;
int idx = 0;
Random r = new Random();
while(left_to_add > 0) {
int toAdd = 0;
if (left_to_add < range_to - range_from) {
toAdd = r.Next(1, left_to_add);
} else {
toAdd = r.Next(1, range_to - range_from);
}
left_to_add -= toAdd;
number_set[idx] += toAdd;
idx++;
}
What's left to do is, convert the array to a list and shuffle it.
(I forgot that you actually can access list items by index, so there is no need to use an array as I did here)
At the algorithm level, this is what I would try:
Determine the number of each element, n[0], n[1], n[2], n[3] in your example (i.e. number of 0, number of 1 ...) and then generate a simple sequence by concatenating n[0] "0", n[1] "1", n[2] "2" and n[3] "3". Finally, a random sequence is obtained by performing a random permutation on this simple sequence.
The problem is therefore to determine the n[i].
The first step is to determine the average values of these n[i]. It your example, it is simple, as we can take average n_av[i]=20 for all index i.
In a more general case, we have to insure that
sum_i n_av[i]*i = sum_target (120 here) (1)
knowing that
sum_i (n[i]) = n = 80 here. (2)
In the general case, there is no necessary one unique good solution. I will try to propose an example of solution here if you provide an example of a difficult scenario.
The second step consists in selecting some random n[i] values around these average values. One possibility is to generate rounded Gaussian variables: we already know the averages, we just need to determine the variances. One possibility is to consider the variance that we will get if we were generating directly the random values, i.e. by considering the variance of the corresponding binomial variable :
var = n p(1-p). Here p[i] = n_av[i]/n
The last step consists in adjusting the values of the n[i] such that the sum of the n[i] is equal to the target. This is simply obtained by slightly increasing or decreasing some n[i] values.
I'm trying to take array of numbers (0 or 1 only) and repeatedly transform it as follows:
first and last numbers are always 0;
all the other numbers are
derived from previous state of array: if array[n] and its two neighbours
(array[n-1], array[n] and array[n+1]) have the same value (three 0s or three 1s) then newarray[n] should be be 1, otherwise it should be 0 (that produces nice patterns).
For example, if the array size is 10 and it starts with all zeroes, then program should output this:
0000000000
0111111110
0011111100
0001111000
0100110010
0000000000
0111111110
...and so on
I wrote a code that should do this and it doesn't work as intended. It always does first transformation perfectly but then begins some wrong, asymmetric, crazy stuff:
0000000000
0111111110
0000000000
0101010100
0000000010
0101010000
What makes my code behave the way it does?
Here is the code:
namespace test
{
class Program
{
public static void Main(string[] args)
{
const int leng = 10; //length of array
int[] arrayone = new int[leng];
int[] arraytwo = new int[leng];
for (int i = 0; i<=leng-1; i++)
{
arrayone[i] = 0;
arraytwo[i] = 0;
} //making two arrays and filling them with zeroes
for (int i = 0; i<=leng-1; i++)
{
Console.Write(arrayone[i]);
}
Console.WriteLine(' '); //printing the first state of array
for (int st=1; st<=16; st++) //my main loop
{
arraytwo[0]=0;
arraytwo[leng - 1]=0; //making sure that first and last elements are zero. I'm not sure I need this since I fill both arrays with zeroes in the beginning. But it probably doesn't hurt?
for (int i = 1; i<=leng-2; i++) //calculating new states for elements from second to second-to-last
{
if (((arrayone[i-1]) + (arrayone[i]) + (arrayone[i+1]) == 0) | ((arrayone[i-1]) + (arrayone[i]) + (arrayone[i+1]) == 3) == true)
arraytwo[i] = 1;
else
arraytwo[i] = 0;
}
//by now arraytwo contains transformed version of arrayone
for (int i = 0; i<=leng-1; i++)
{
Console.Write(arraytwo[i]);
} //printing result
arrayone = arraytwo; //copying content of arraytwo to arrayone for the next round of transformation;
Console.WriteLine(' ');
}
Console.Write(" ");
Console.ReadKey(true);
}
}
}
Tweakable version: https://dotnetfiddle.net/8htp9N
As pointed out you're talking an object and working on it, at the end of that you're assigning the reference not the values. one way to combat that would be
for your line: arrayone = arraytwo;
change it to : arrayone = (int[])arraytwo.Clone();
this will copy the values - for integers this will be sufficient.
Please, notice how your current code is complex and thus diffcult to debug. Make it simpler, extract a method:
using System.Linq;
...
private static IEnumerable<int[]> Generate(int width) {
int[] item = new int[width];
while (true) {
yield return item.ToArray(); // .ToArray() - return a copy of the item
int[] next = new int[width];
for (int i = 1; i < item.Length - 1; ++i)
if (item[i - 1] == item[i] && item[i] == item[i + 1])
next[i] = 1;
item = next;
}
}
Then you can put
public static void Main(string[] args) {
var result = Generate(10) // each line of 10 items
.Take(7) // 7 lines
.Select(item => string.Concat(item));
Console.Write(string.Join(Environment.NewLine, result));
}
Outcome:
0000000000
0111111110
0011111100
0001111000
0100110010
0000000000
0111111110
Alright so I am making a program to verify a 4 digit code.
The computer generates a 4 digit code
The user types in a 4 digit code. Their guess.
the computer tells them how many digits are
guessed correctly in the correct place and how many digits have
been guessed correctly but in the wrong place.
The user gets 12 guesses to either win – guess the right code. Or
lose – run out of guesses.
So basically, my program doesn't seem to actually verify whether the code is correct but i cant see why not because i have if and for loops for verification, please take a look.
class Program
{
public static Random random = new Random();
static void Main(string[] args)
{
int DigitOne = random.Next(0, 10);
int DigitTwo = random.Next(0, 10);
int DigitThree = random.Next(0, 10);
int DigitFour = random.Next(0, 10);
byte[] code = new byte[4];
code[0] = Convert.ToByte(DigitOne);
code[1] = Convert.ToByte(DigitTwo);
code[2] = Convert.ToByte(DigitThree);
code[3] = Convert.ToByte(DigitFour);
bool CodeCorrect = false;
Console.WriteLine(code[0] +""+ code[1] +""+ code[2]+""+code [3] );
Console.WriteLine("You have 12 guesses before you will be permenantly locked out.\n");
int AmountOfGuesses = 0;
while (AmountOfGuesses < 12 && !CodeCorrect)
{
Console.WriteLine("Enter 4 digit code to unlock the safe: ");
int[] UserCode = new int[4];
for (int i = 0; i < 4; i++)
{
UserCode[i] = Convert.ToInt32(Console.Read()) - 48;
}
if (UserCode.Length != 4)
{
Console.WriteLine("Error. Try Again.\n");
}
else
{
int UserDigitOne = UserCode[0];
int UserDigitTwo = UserCode[1];
int UserDigitThree = UserCode[2];
int UserDigitFour = UserCode[3];
for (int i = 0; i < 4; i++)
{
if (UserCode[i] == code[i])
{
Console.WriteLine("The digit at position " + (i + 1) + " is correct.");
}
}
if (UserCode[0] == code[0] && UserCode[1] == code[1] && UserCode[2] == code[2] && UserCode[3] == code[3])
{
CodeCorrect = true;
Console.WriteLine("Code Correct. Safe unlocked.");
}
}
AmountOfGuesses++;
}
if (AmountOfGuesses > 12)
{
Console.WriteLine("Code Incorrect. Safe Locked permenantly.");
}
Console.ReadLine();
}
If you step through the code after it generated the number 1246, and then input the same number from the command line, convert it to a char array, then convert each char to a byte, you'll get the following four bytes:
49 50 52 54
These correspond to the ASCII representations of each char, NOT the actual numbers.
Try something like this:
int[] input = new int[4];
for(int i = 0; i < 4; i++ )
{
input[i] = Convert.ToInt32(Console.Read()) - 48;
}
The -48 should turn your ASCII code into the actual numerical representation that was provided. Console.Read() reads individual characters rather than the full line.
Also, you don't have to say:
CodeCorrect == false
This is more simply represented as:
!CodeCorrect
Similarly, if it was set to true, it would just be:
CodeCorrect
I also suggest using a for loop to set multiple elements in an array rather than manually writing out each line of code. It's not a big deal for small arrays, but it's good practice.
UPDATE: Here's a revised version of the full program:
class Program
{
public static Random random = new Random();
static void Main(string[] args)
{
int[] randCombination = new int[4];
for (int i = 0; i < 4; i++)
{
randCombination[i] = random.Next(0, 10);
Console.Write(randCombination[i].ToString());
}
bool CodeCorrect = false;
Console.WriteLine("\nYou have 12 guesses before you will be permenantly locked out.\n");
int AmountOfGuesses = 0;
while(AmountOfGuesses < 12 && !CodeCorrect)
{
Console.WriteLine("Enter 4 digit code to unlock the safe: ");
int[] UserCode = new int[4];
string input = Console.ReadLine();
int n;
bool isNumeric = int.TryParse(input, out n);
int correctCount = 0;
if(input.Length != 4 || !isNumeric)
{
Console.WriteLine("Error. Input code was not a 4 digit number.\n");
}
else
{
for(int i = 0; i < 4; i++)
{
UserCode[i] = Convert.ToInt32(input[i]) - 48;
if(UserCode[i] == randCombination[i])
{
Console.WriteLine("The digit at position " + (i + 1) + " is correct.");
correctCount++;
}
}
if(correctCount == 4)
{
CodeCorrect = true;
Console.WriteLine("Code Correct. Safe unlocked.");
}
}
AmountOfGuesses++;
}
if(AmountOfGuesses >= 12)
{
Console.WriteLine("Code Incorrect. Safe Locked permenantly.");
}
Console.ReadLine();
}
}
A couple of things were changed:
Added a for loop at the top that generates a random number, enters it into an array of ints and then prints it to standard output.
I changed the way user input is read back to the Console.ReadLine(). The reason for this is to check if the user inputted a four digit integer. the int.TryParse statement makes sure the input is an int, and the Length property checks the length.
I also used a counter to count each correct guess. If 4 correct digit guesses were made, the safe is unlocked.
Your final if statement would never have evaluated because Amount of Guesses would equal 12, not be greater than it. Changed it to >= from >. Always be on the lookout for small things like this.
EDIT #2: For more information on int.TryParse, see the following:
http://www.dotnetperls.com/int-tryparse
How the int.TryParse actually works
You are comparing numbers with the character representation of a number. Each value of code[] represents an actual number. You then compare those values with the values in UserCode which is a string, meaning there is a character at each index. It is never the case that ((byte)'4') == ((byte)4) (using 4 as an example, but works for any numerical digit).
One way around this is to parse each user input character into a byte using the byte.Parse method.
For fun learning purposes look at the output from the following code:
for (char i = '0'; i <= '9'; i++)
{
Console.WriteLine("char: " + i + "; value: " + ((byte)i));
}
The output is actually:
char: 0; value: 48
char: 1; value: 49
char: 2; value: 50
char: 3; value: 51
char: 4; value: 52
char: 5; value: 53
char: 6; value: 54
char: 7; value: 55
char: 8; value: 56
char: 9; value: 57
This is due to string encoding.
I would also recommend that one you have your code working that you submit it to the fine folks at the Code Review site to review other aspects of the code which could use work.
In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft Sql Server's Full Text Search engine instead of something more robust like Lucene.Net.
One of the features I would like to have, though, is google-esque relevant document snippets. I quickly found determining "relevant" snippets is more difficult than I realized.
I want to choose snippets based on search term density in the found text. So, essentially, I need to find the most search term dense passage in the text. Where a passage is some arbitrary number of characters (say 200 -- but it really doesn't matter).
My first thought is to use .IndexOf() in a loop and build an array of term distances (subtract the index of the found term from the previously found term), then ... what? Add up any two, any three, any four, any five, sequential array elements and use the one with the smallest sum (hence, the smallest distance between search terms).
That seems messy.
Is there an established, better, or more obvious way to do this than what I have come up with?
Although it is implemented in Java, you can see one approach for that problem here:
http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html
I know this thread is way old, but I gave this a try last week and it was a pain in the back side. This is far from perfect, but this is what I came up with.
The snippet generator:
public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
{
string snippedString = "";
List<int> keywordLocations = new List<int>();
//Get the locations of all keywords
for (int i = 0; i < Keywords.Count(); i++)
keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));
//Sort locations
keywordLocations.Sort();
//Remove locations which are closer to each other than the SnippetLength
if (keywordLocations.Count > 1)
{
bool found = true;
while (found)
{
found = false;
for (int i = keywordLocations.Count - 1; i > 0; i--)
if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength / 2)
{
keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;
keywordLocations.RemoveAt(i);
found = true;
}
}
}
//Make the snippets
if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
snippedString = "... ";
foreach (int i in keywordLocations)
{
int stringStart = Math.Max(0, i - SnippetLength / 2);
int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
snippedString += StringToSnip.Substring(stringStart, stringLength);
if (stringEnd < StringToSnip.Length) snippedString += " ... ";
if (snippedString.Length > 200) break;
}
return snippedString;
}
The function which will find the index of all keywords in the sample text
private static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
{
int pos;
int offset = 0;
int length = needle.Length;
List<int> positions = new List<int>();
while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
{
positions.Add(pos);
offset = pos + length;
}
return positions;
}
It's a bit clumsy in its execution. The way it works is by finding the position of all keywords in the string. Then checking that no keywords are closer to each other than the desired snippet length, so that snippets won't overlap (that's where it's a bit iffy...). And then grabs substrings of the desired length centered around the position of the keywords and stitches the whole thing together.
I know this is years late, but posting just in case it might help somebody coming across this question.
public class Highlighter
{
private class Packet
{
public string Sentence;
public double Density;
public int Offset;
}
public static string FindSnippet(string text, string query, int maxLength)
{
if (maxLength < 0)
{
throw new ArgumentException("maxLength");
}
var words = query.Split(' ').Where(w => !string.IsNullOrWhiteSpace(w)).Select(word => word.ToLower()).ToLookup(s => s);
var sentences = text.Split('.');
var i = 0;
var packets = sentences.Select(sentence => new Packet
{
Sentence = sentence,
Density = ComputeDensity(words, sentence),
Offset = i++
}).OrderByDescending(packet => packet.Density);
var list = new SortedList<int, string>();
int length = 0;
foreach (var packet in packets)
{
if (length >= maxLength || packet.Density == 0)
{
break;
}
string sentence = packet.Sentence;
list.Add(packet.Offset, sentence.Substring(0, Math.Min(sentence.Length, maxLength - length)));
length += packet.Sentence.Length;
}
var sb = new List<string>();
int previous = -1;
foreach (var item in list)
{
var offset = item.Key;
var sentence = item.Value;
if (previous != -1 && offset - previous != 1)
{
sb.Add(".");
}
previous = offset;
sb.Add(Highlight(sentence, words));
}
return String.Join(".", sb);
}
private static string Highlight(string sentence, ILookup<string, string> words)
{
var sb = new List<string>();
var ff = true;
foreach (var word in sentence.Split(' '))
{
var token = word.ToLower();
if (ff && words.Contains(token))
{
sb.Add("[[HIGHLIGHT]]");
ff = !ff;
}
if (!ff && !string.IsNullOrWhiteSpace(token) && !words.Contains(token))
{
sb.Add("[[ENDHIGHLIGHT]]");
ff = !ff;
}
sb.Add(word);
}
if (!ff)
{
sb.Add("[[ENDHIGHLIGHT]]");
}
return String.Join(" ", sb);
}
private static double ComputeDensity(ILookup<string, string> words, string sentence)
{
if (string.IsNullOrEmpty(sentence) || words.Count == 0)
{
return 0;
}
int numerator = 0;
int denominator = 0;
foreach(var word in sentence.Split(' ').Select(w => w.ToLower()))
{
if (words.Contains(word))
{
numerator++;
}
denominator++;
}
if (denominator != 0)
{
return (double)numerator / denominator;
}
else
{
return 0;
}
}
}
Example:
highlight "Optic flow is defined as the change of structured light in the image, e.g. on the retina or the camera’s sensor, due to a relative motion between the eyeball or camera and the scene. Further definitions from the literature highlight different properties of optic flow" "optic flow"
Output:
[[HIGHLIGHT]] Optic flow [[ENDHIGHLIGHT]] is defined as the change of structured
light in the image, e... Further definitions from the literature highlight diff
erent properties of [[HIGHLIGHT]] optic flow [[ENDHIGHLIGHT]]
Well, here's the hacked together version I made using the algorithm I described above. I don't think it is all that great. It uses three (count em, three!) loops an array and two lists. But, well, it is better than nothing. I also hardcoded the maximum length instead of turning it into a parameter.
private static string FindRelevantSnippets(string infoText, string[] searchTerms)
{
List<int> termLocations = new List<int>();
foreach (string term in searchTerms)
{
int termStart = infoText.IndexOf(term);
while (termStart > 0)
{
termLocations.Add(termStart);
termStart = infoText.IndexOf(term, termStart + 1);
}
}
if (termLocations.Count == 0)
{
if (infoText.Length > 250)
return infoText.Substring(0, 250);
else
return infoText;
}
termLocations.Sort();
List<int> termDistances = new List<int>();
for (int i = 0; i < termLocations.Count; i++)
{
if (i == 0)
{
termDistances.Add(0);
continue;
}
termDistances.Add(termLocations[i] - termLocations[i - 1]);
}
int smallestSum = int.MaxValue;
int smallestSumIndex = 0;
for (int i = 0; i < termDistances.Count; i++)
{
int sum = termDistances.Skip(i).Take(5).Sum();
if (sum < smallestSum)
{
smallestSum = sum;
smallestSumIndex = i;
}
}
int start = Math.Max(termLocations[smallestSumIndex] - 128, 0);
int len = Math.Min(smallestSum, infoText.Length - start);
len = Math.Min(len, 250);
return infoText.Substring(start, len);
}
Some improvements I could think of would be to return multiple "snippets" with a shorter length that add up to the longer length -- this way multiple parts of the document can be sampled.
This is a nice problem :)
I think I'd create an index vector: For each word, create an entry 1 if search term or otherwise 0. Then find the i such that sum(indexvector[i:i+maxlength]) is maximized.
This can actually be done rather efficiently. Start with the number of searchterms in the first maxlength words. then, as you move on, decrease your counter if indexvector[i]=1 (i.e. your about to lose that search term as you increase i) and increase it if indexvector[i+maxlength+1]=1. As you go, keep track of the i with the highest counter value.
Once you got your favourite i, you can still do finetuning like see if you can reduce the actual size without compromising your counter, e.g. in order to find sentence boundaries or whatever. Or like picking the right i of a number of is with equivalent counter values.
Not sure if this is a better approach than yours - it's a different one.
You might also want to check out this paper on the topic, which comes with yet-another baseline: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.4357&rep=rep1&type=pdf
I took another approach, perhaps it will help someone...
First it searches if it word appears in my case with IgnoreCase (you change this of course yourself).
Then I create a list of Regex matches on each separators and search for the first occurrence of the word (allowing partial case insensitive matches).
From that index, I get the 10 matches in front and behind the word, which makes the snippet.
public static string GetSnippet(string text, string word)
{
if (text.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) == -1)
{
return "";
}
var matches = new Regex(#"\b(\S+)\s?", RegexOptions.Singleline | RegexOptions.Compiled).Matches(text);
var p = -1;
for (var i = 0; i < matches.Count; i++)
{
if (matches[i].Value.IndexOf(word, StringComparison.InvariantCultureIgnoreCase) != -1)
{
p = i;
break;
}
}
if (p == -1) return "";
var snippet = "";
for (var x = Math.Max(p - 10, 0); x < p + 10; x++)
{
snippet += matches[x].Value + " ";
}
return snippet;
}
If you use CONTAINSTABLE you will get a RANK back , this is in essence a density value - higher the RANK value, the higher the density. This way, you just run a query to get the results you want and dont have to result to massaging the data when its returned.
Wrote a function to do this just now. You want to pass in:
Inputs:
Document text
This is the full text of the document you're taking a snippet from. Most likely you will want to strip out any BBCode/HTML from this document.
Original query
The string the user entered as their search
Snippet length
Length of the snippet you wish to display.
Return Value:
Start index of the document text to take the snippet from. To get the snippet simply do documentText.Substring(returnValue, snippetLength). This has the advantage that you know if the snippet is take from the start/end/middle so you can add some decoration like ... if you wish at the snippet start/end.
Performance
A resolution set to 1 will find the best snippet but moves the window along 1 char at a time. Set this value higher to speed up execution.
Tweaks
You can work out score however you want. In this example I've done Math.pow(wordLength, 2) to favour longer words.
private static int GetSnippetStartPoint(string documentText, string originalQuery, int snippetLength)
{
// Normalise document text
documentText = documentText.Trim();
if (string.IsNullOrWhiteSpace(documentText)) return 0;
// Return 0 if entire doc fits in snippet
if (documentText.Length <= snippetLength) return 0;
// Break query down into words
var wordsInQuery = new HashSet<string>();
{
var queryWords = originalQuery.Split(' ');
foreach (var word in queryWords)
{
var normalisedWord = word.Trim().ToLower();
if (string.IsNullOrWhiteSpace(normalisedWord)) continue;
if (wordsInQuery.Contains(normalisedWord)) continue;
wordsInQuery.Add(normalisedWord);
}
}
// Create moving window to get maximum trues
var windowStart = 0;
double maxScore = 0;
var maxWindowStart = 0;
// Higher number less accurate but faster
const int resolution = 5;
while (true)
{
var text = documentText.Substring(windowStart, snippetLength);
// Get score of this chunk
// This isn't perfect, as window moves in steps of resolution first and last words will be partial.
// Could probably be improved to iterate words and not characters.
var words = text.Split(' ').Select(c => c.Trim().ToLower());
double score = 0;
foreach (var word in words)
{
if (wordsInQuery.Contains(word))
{
// The longer the word, the more important.
// Can simply replace with score += 1 for simpler model.
score += Math.Pow(word.Length, 2);
}
}
if (score > maxScore)
{
maxScore = score;
maxWindowStart = windowStart;
}
// Setup next iteration
windowStart += resolution;
// Window end passed document end
if (windowStart + snippetLength >= documentText.Length)
{
break;
}
}
return maxWindowStart;
}
Lots more you can add to this, for example instead of comparing exact words perhaps you might want to try comparing the SOUNDEX where you weight soundex matches less than exact matches.