A little help needed in code translation (Python to C#) - c#

Good night everyone,
This question leaves me a little embarrassed because, of course, I know I should be able to work out the answer on my own. However, my knowledge of Python is barely more than nothing, so I need help from someone more experienced with it than I am.
The following code comes from Norvig's "Natural Language Corpus Data" chapter in a recently published book, and it is about transforming a string such as "likethisone" into "[like, this, one]" (that is, segmenting it into words correctly).
I have ported all of the code to C# (in fact, I rewrote the program myself) except for the function segment, whose syntax I am having a lot of trouble even understanding. Can someone please help me translate it into a more readable form in C#?
Thank you very much in advance.
################ Word Segmentation (p. 223)
@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
def splits(text, L=20):
    "Return a list of all possible (first, rem) pairs, len(first)<=L."
    return [(text[:i+1], text[i+1:])
            for i in range(min(len(text), L))]
def Pwords(words):
    "The Naive Bayes probability of a sequence of words."
    return product(Pw(w) for w in words)
#### Support functions (p. 224)
def product(nums):
    "Return the product of a sequence of numbers."
    return reduce(operator.mul, nums, 1)
class Pdist(dict):
    "A probability distribution estimated from counts in datafile."
    def __init__(self, data=[], N=None, missingfn=None):
        for key,count in data:
            self[key] = self.get(key, 0) + int(count)
        self.N = float(N or sum(self.itervalues()))
        self.missingfn = missingfn or (lambda k, N: 1./N)
    def __call__(self, key):
        if key in self: return self[key]/self.N
        else: return self.missingfn(key, self.N)
def datafile(name, sep='\t'):
    "Read key,value pairs from file."
    for line in file(name):
        yield line.split(sep)
def avoid_long_words(key, N):
    "Estimate the probability of an unknown word."
    return 10./(N * 10**len(key))
N = 1024908267229 ## Number of tokens
Pw = Pdist(datafile('count_1w.txt'), N, avoid_long_words)

Let's tackle the first function first:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
It takes a word and returns the most likely list of words that it could be, so its signature will be static IEnumerable<string> segment(string text). Obviously if text is an empty string, its result should be an empty list. Otherwise, it creates a recursive list comprehension defining the possible candidate lists of words and returns the maximum based on its probability.
static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    var candidates = from pair in splits(text)
                     select new[] {pair.Item1}.Concat(segment(pair.Item2));
    // Equivalent of Python's max(candidates, key=Pwords): take the most probable candidate
    return candidates.OrderByDescending(Pwords).First();
}
Of course, now we have to translate the splits function. Its job is to return a list of all possible tuples of the beginning and end of a word. It's fairly straightforward to translate:
static IEnumerable<Tuple<string, string>> splits(string text, int L = 20)
{
    return from i in Enumerable.Range(1, Math.Min(text.Length, L))
           select Tuple.Create(text.Substring(0, i), text.Substring(i));
}
Next is Pwords, which just calls the product function on the result of Pw on each word in its input list:
static double Pwords(IEnumerable<string> words)
{
    return product(from w in words select Pw(w));
}
And product is pretty simple:
static double product(IEnumerable<double> nums)
{
    // Seed with 1, as the Python version does, so an empty sequence yields 1 instead of throwing
    return nums.Aggregate(1.0, (a, b) => a * b);
}
ADDENDUM:
Looking at the full source code, it is apparent that Norvig intends the results of the segment function to be memoized for speed. Here's a version that provides this speed-up:
static Dictionary<string, IEnumerable<string>> segmentTable =
    new Dictionary<string, IEnumerable<string>>();
static IEnumerable<string> segment(string text)
{
    if (text == "") return new string[0]; // C# idiom for empty list of strings
    if (!segmentTable.ContainsKey(text))
    {
        var candidates = from pair in splits(text)
                         select new[] {pair.Item1}.Concat(segment(pair.Item2));
        // Take the most probable candidate (Python's max) and cache a materialized copy
        segmentTable[text] = candidates.OrderByDescending(Pwords).First().ToList();
    }
    return segmentTable[text];
}

I don't know C# at all, but I can explain how the Python code works.
@memo
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
    candidates = ([first]+segment(rem) for first,rem in splits(text))
    return max(candidates, key=Pwords)
The first line,
@memo
is a decorator. This causes the function, as defined in the subsequent lines, to be wrapped in another function. Decorators are commonly used to filter inputs and outputs. In this case, based on the name and the role of the function it's wrapping, I gather that this function memoizes calls to segment.
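The definition of memo is not shown in the question, so the following is only a minimal sketch of what a memoizing decorator typically looks like. The names cache, wrapper and square are illustrative assumptions, not taken from the book.

```python
def memo(func):
    """Wrap func so that each distinct argument tuple is computed only once."""
    cache = {}

    def wrapper(*args):
        # Compute and store the result the first time these arguments are seen.
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


calls = []  # records every real (non-cached) invocation


@memo
def square(n):
    calls.append(n)
    return n * n


print(square(4), square(4), calls)
```

The second call to square(4) is answered from the cache, so the wrapped function body runs only once for that argument.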
Next:
def segment(text):
    "Return a list of words that is the best segmentation of text."
    if not text: return []
Declares the function proper, gives a docstring, and sets the termination condition for this function's recursion.
Next is the most complicated line, and probably the one that gave you trouble:
candidates = ([first]+segment(rem) for first,rem in splits(text))
The outer parentheses, combined with the for..in construct, create a generator expression. This is an efficient way of iterating over a sequence, in this case splits(text). Generator expressions are a sort of compact for-loop that yields values. In this case, the values become the elements of the iterable candidates. "Genexps" are similar to list comprehensions, but are more memory-efficient because they do not retain each value that they produce.
So for each value in the iteration returned by splits(text), a list is produced by the generator expression.
Each of the values from splits(text) is a (first, rem) pair.
Each produced list starts with the object first; this is expressed by putting first inside a list literal, i.e. [first]. Then another list is added to it; that second list is determined by a recursive call to segment. Adding lists in Python concatenates them, i.e. [1, 2] + [3, 4] gives [1, 2, 3, 4].
Finally, in
return max(candidates, key=Pwords)
the lazily produced candidate lists and a key function are passed to max. The key function is called on each candidate to compute the value that max compares (here, the candidate's probability), and max returns the candidate with the highest such value.
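To see the pieces described above working together, here is a small self-contained Python 3 sketch. The COUNTS table is made up for illustration (the original reads count_1w.txt), functools.lru_cache stands in for the memo decorator, and tuples replace lists so that cached results are immutable; all of these are assumptions, not Norvig's code.

```python
from functools import lru_cache, reduce
import operator

# Toy unigram counts, invented for this example only.
COUNTS = {"like": 50, "this": 80, "one": 60, "is": 90, "on": 70, "e": 1}
N = sum(COUNTS.values())


def pw(word):
    """Probability of a single word; unknown words are penalised by length."""
    if word in COUNTS:
        return COUNTS[word] / N
    return 10.0 / (N * 10 ** len(word))


def pwords(words):
    """Naive Bayes probability of a sequence of words (product of pw)."""
    return reduce(operator.mul, (pw(w) for w in words), 1)


def splits(text, max_len=20):
    """All (first, rest) pairs with 1 <= len(first) <= max_len."""
    return [(text[:i + 1], text[i + 1:]) for i in range(min(len(text), max_len))]


@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation of text, returned as a tuple of words."""
    if not text:
        return ()
    # Generator expression: one candidate per split, built lazily.
    candidates = ((first,) + segment(rest) for first, rest in splits(text))
    # max calls pwords on every candidate and keeps the most probable one.
    return max(candidates, key=pwords)


print(list(segment("likethisone")))
```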

Related

How to create a list of a list of integers based on a tree of Or and And objects?

I want this result:
[[1,2],[1,3],[1,4,5,6],[1,4,5,7]]
from the following set of objects:
var andsOrs =
    Ands(items: [
        single(1),
        Ors(items: [
            single(2),
            single(3),
            Ands(items: [
                single(4),
                single(5),
                Ors(items: [
                    single(6),
                    single(7),
                ]),
            ]),
        ]),
    ]);
I'm struggling to write a function that would produce the output, can anyone help? Any modern object oriented language would be fine really, c#, js, dart, c++.
If I understand correctly you are looking for functions named Ands, Ors and single that will give the desired result. The format of the desired result is a 2D array where the inner arrays list atomic values that are AND'ed, and the outer array lists items that are OR'ed.
Each of those functions should return a value in that format, including single. So, for instance, single(1) should return [[1]].
The Ors function should just concatenate the 2D items it gets as argument into a longer 2D array. So if for instance, Ors gets two items: [[1,2]] and [[3,4],[5,6]] then the return value should be [[1,2],[3,4],[5,6]].
The more complex one is Ands: it should perform a Cartesian product on the arguments it gets.
Here is an implementation in JavaScript. The notation of the input is a bit different, as items: as function argument would be invalid syntax. We can just pass each item as a separate argument and use spread syntax:
function single(i) {
    return [[i]];
}
function Ands(first, ...rest) {
    // Cartesian product, using recursion
    if (!rest.length) return first;
    let restResult = Ands(...rest);
    let result = [];
    for (let option of first) {
        for (let restOption of restResult) {
            result.push([...option, ...restOption]);
        }
    }
    return result;
}
function Ors(...items) {
    return items.flat(1); // Reduces 1 level of array depth (from 3 to 2)
    // Alternative syntax to get the same result:
    // return [].concat(...items)
}
var andsOrs =
    Ands(
        single(1),
        Ors(
            single(2),
            single(3),
            Ands(
                single(4),
                single(5),
                Ors(
                    single(6),
                    single(7),
                ),
            ),
        ),
    );
console.log(andsOrs);
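Since any language is acceptable to the asker, the same idea can also be sketched in Python, where itertools.product makes the Cartesian product explicit. The lowercase names ands, ors and single are this sketch's own choices, not part of the question.

```python
from itertools import product


def single(value):
    """One alternative consisting of one value."""
    return [[value]]


def ors(*items):
    """Concatenate the alternatives of every item into one list."""
    return [option for item in items for option in item]


def ands(*items):
    """Cartesian product: pick one alternative per item and join them."""
    return [
        [value for option in combo for value in option]
        for combo in product(*items)
    ]


tree = ands(
    single(1),
    ors(
        single(2),
        single(3),
        ands(single(4), single(5), ors(single(6), single(7))),
    ),
)
print(tree)
```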

Replace a string placeholder with consecutive elements from a list in c#

I know similar questions have been asked, but I couldn't find anything that specifically fits my need, and I am supremely ignorant about Regex.
I have sentences of varying length like this one:
Provides a +$modifier% bonus to Maximum Quality and a +$modifier% chance for Special Traits when developing a Recipe.
so that $modifier is my placeholder in all of them. I have a list of floats whose values should replace the placeholders in order.
In this case I have a List<float> values {5, 0.5}. The replaced string should end up as
Provides a +5% bonus to Maximum Quality and a +0.5% chance for Special Traits when developing a Recipe.
I would like to avoid string.Replace, as the texts might get longer and I wouldn't like to loop over them multiple times. Could anyone suggest a good approach?
Cheers and thanks
H
The method Regex.Replace has an overload where you can specify a callback method to provide the value to use for each replacement.
private string ReplaceWithList(string source, string placeHolder, IEnumerable<object> list)
{
    // Escape placeholder so that it is a valid regular expression
    placeHolder = Regex.Escape(placeHolder);
    // Get enumerator for list
    var enumerator = list.GetEnumerator();
    // Use Regex engine to replace all occurrences of placeholder
    // with next entry from enumerator
    string result = Regex.Replace(source, placeHolder, (m) =>
    {
        enumerator.MoveNext();
        return enumerator.Current?.ToString();
    });
    return result;
}
Use it like this:
string s = "Provides a +$modifier% bonus to Maximum Quality and a +$modifier% chance for Special Traits when developing a Recipe.";
List<object> list = new List<object> { 5, 0.5 };
s = ReplaceWithList(s, "$modifier", list);
Note that you need to add sensible error handling.
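The callback technique is not specific to .NET. As a rough illustration, here is the same idea in Python using re.sub with a function as the replacement. Leaving the placeholder untouched when the values run out is this sketch's own assumption about what "sensible error handling" could mean.

```python
import re


def replace_with_list(source, placeholder, values):
    """Replace each occurrence of placeholder with the next item of values."""
    remaining = iter(values)

    def next_value(match):
        # Assumption: if the values are exhausted, keep the placeholder text.
        return str(next(remaining, match.group(0)))

    # re.escape makes the placeholder safe to use as a pattern ("$" is special).
    return re.sub(re.escape(placeholder), next_value, source)


text = ("Provides a +$modifier% bonus to Maximum Quality and a "
        "+$modifier% chance for Special Traits when developing a Recipe.")
print(replace_with_list(text, "$modifier", [5, 0.5]))
```

The string is scanned once, and the callback runs once per match, which is the property the question asks for.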
As strings are immutable in the CLR there's probably no sensible way around splitting your string and putting it back together in some way.
One would be to split at your desired marker string, insert your replacement values and afterwards concatenate your parts again:
var s = "Provides a +$modifier % bonus to Maximum Quality and a +$modifier % chance for Special Traits when developing a Recipe.";
var v = new List<float> { 5.0f, 0.5f };
// The lambda parameter is named "part" so that it does not clash with the outer variable s
var result = string.Concat(s.Split("$modifier").Select((part, i) => $"{part}{(i < v.Count ? v[i] : string.Empty)}"));

How to combine items in List<string> to make new items efficiently

I have a case where I have the name of an object, and a bunch of file names. I need to match the correct file name with the object. The file name can contain numbers and words, separated by either hyphen(-) or underscore(_). I have no control of either file name or object name. For example:
10-11-12_001_002_003_13001_13002_this_is_an_example.svg
The object name in this case is just a string representing a number
10001
I need to return true or false if the file name is a match for the object name. The different segments of the file name can match on their own, or any combination of two segments. In the example above, it should be true for the following cases (not every true case, just examples):
10001
10002
10003
11001
11002
11003
12001
12002
12003
13001
13002
And, we should return false for this case (among others):
13003
What I've come up with so far is this:
public bool IsMatch(string filename, string objectname)
{
    var namesegments = GetNameSegments(filename);
    var match = namesegments.Contains(objectname);
    return match;
}
public static List<string> GetNameSegments(string filename)
{
    var segments = filename.Split('_', '-').ToList();
    var newSegments = new List<string>();
    foreach (var segment in segments)
    {
        foreach (var segment2 in segments)
        {
            if (segment == segment2)
                continue;
            var newToken = segment + segment2;
            newSegments.Add(newToken);
        }
    }
    return segments.Concat(newSegments).ToList();
}
One segment, or two segments combined, can make a match, and that is enough. Combinations of three or more segments should not be considered.
This does work so far, but is there a better way to do it, perhaps without nesting foreach loops?
First: don't change debugged, working, sufficiently efficient code for no reason. Your solution looks good.
However, we can make some improvements to your solution.
public static List<string> GetNameSegments(string filename)
Making the output a list puts restrictions on the implementation that are not required by the caller. It should be IEnumerable<String>. Particularly since the caller in this case only cares about the first match.
var segments = filename.Split('_', '-').ToList();
Why ToList? A list is array-backed. You've already got an array in hand. Just use the array.
Since there is no longer a need to build up a list, we can transform your two-loop solution into an iterator block:
public static IEnumerable<string> GetNameSegments(string filename)
{
    var segments = filename.Split('_', '-');
    foreach (var segment in segments)
        yield return segment;
    foreach (var s1 in segments)
        foreach (var s2 in segments)
            if (s1 != s2)
                yield return s1 + s2;
}
Much nicer. Alternatively we could notice that this has the structure of a query and simply return the query:
public static IEnumerable<string> GetNameSegments(string filename)
{
    var q1 = filename.Split('_', '-');
    var q2 = from s1 in q1
             from s2 in q1
             where s1 != s2
             select s1 + s2;
    return q1.Concat(q2);
}
Again, much nicer in this form.
Now let's talk about efficiency. As is often the case, we can achieve greater efficiency at a cost of increased complication. This code looks like it should be plenty fast enough. Your example has nine segments. Let's suppose that nine or ten is typical. Our solutions thus far consider the ten or so singletons first, and then the hundred or so combinations. That's nothing; this code is probably fine. But what if we had thousands of segments and were considering millions of possibilities?
In that case we should restructure the algorithm. One possibility would be this general solution:
public bool IsMatch(HashSet<string> segments, string name)
{
    if (segments.Contains(name))
        return true;
    var q = from s1 in segments
            where name.StartsWith(s1)
            let s2 = name.Substring(s1.Length)
            where s1 != s2
            where segments.Contains(s2)
            select 1; // Dummy. All we care about is if there is one.
    return q.Any();
}
Your original solution is quadratic in the number of segments. This one is linear; we rely on the constant order contains operation. (This assumes of course that string operations are constant time because strings are short. If that's not true then we have a whole other kettle of fish to fry.)
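For readers who want to experiment with the linear approach outside C#, here is a rough Python equivalent. Splitting the file name with a regular expression and leaving the ".svg" extension attached to the last segment are simplifying assumptions of this sketch.

```python
import re


def is_match(filename, name):
    """True if name equals one segment or two different segments joined."""
    segments = set(re.split(r"[-_]", filename))
    if name in segments:
        return True
    for first in segments:
        # One pass over the segments; each set lookup is expected O(1).
        if name.startswith(first):
            rest = name[len(first):]
            if rest != first and rest in segments:
                return True
    return False


FILENAME = "10-11-12_001_002_003_13001_13002_this_is_an_example.svg"
print(is_match(FILENAME, "12003"), is_match(FILENAME, "13003"))
```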
How else could we extract wins in the asymptotic case?
If we happened to have the property that the collection was not a hash set but rather a sorted list then we could do even better; we could binary search the list to find the start and end of the range of possible prefix matches, and then pour the list into a hashset to do the suffix matches. That's still linear, but could have a smaller constant factor.
If we happened to know that the target string was small compared to the number of segments, we could attack the problem from the other end. Generate all possible combinations of partitions of the target string and check if both halves are in the segment set. The problem with this solution is that it is quadratic in memory usage in the size of the string. So what we'd want to do there is construct a special hash on character sequences and use that to populate the hash table, rather than the standard string hash. I'm sure you can see how the solution would go from there; I shan't spell out the details.
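The "attack from the other end" idea can be sketched as follows. This version uses plain string slices and Python's built-in string hashing, not the special hash on character sequences mentioned above, so it illustrates only the partitioning step; the function name is hypothetical.

```python
def is_match_by_partition(segments, name):
    """Check name against a set of segments by cutting name in two."""
    if name in segments:
        return True
    for cut in range(1, len(name)):
        head, tail = name[:cut], name[cut:]
        # Both halves must be known segments, and they must differ,
        # mirroring the s1 != s2 condition of the other solutions.
        if head != tail and head in segments and tail in segments:
            return True
    return False


SEGMENTS = {"10", "11", "12", "001", "002", "003", "13001", "13002"}
print(is_match_by_partition(SEGMENTS, "11002"))
```

The work here grows with the length of the target string rather than with the number of segments, which is why it pays off only when the target is short.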
Efficiency is very much dependent on the business problem that you're attempting to solve. Without knowing the full context/usage it's difficult to define the most efficient solution. What works for one situation won't always work for others.
I would always advocate to write working code and then solve any performance issues later down the line (or throw more tin at the problem as it's usually cheaper!) If you're having specific performance issues then please do tell us more...
I'm going to go out on a limb here and say (hope) that you're only going to be matching the filename against the object name once per execution. If that's the case I reckon this approach will be just about the fastest. In a circumstance where you're matching a single filename against multiple object names then the obvious choice is to build up an index of sorts and match against that as you were already doing, although I'd consider different types of collection depending on your expected execution/usage.
public static bool IsMatch(string filename, string objectName)
{
    var segments = filename.Split('-', '_');
    for (int i = 0; i < segments.Length; i++)
    {
        if (string.Equals(segments[i], objectName)) return true;
        for (int ii = 0; ii < segments.Length; ii++)
        {
            if (ii == i) continue;
            if (string.Equals($"{segments[i]}{segments[ii]}", objectName)) return true;
        }
    }
    return false;
}
If you are willing to use the MoreLINQ NuGet package then this may be worth considering:
public static HashSet<string> GetNameSegments(string filename)
{
    var segments = filename.Split(new char[] {'_', '-'}, StringSplitOptions.RemoveEmptyEntries).ToList();
    var matches = segments
        .Cartesian(segments, (x, y) => x == y ? null : x + y)
        .Where(z => z != null)
        .Concat(segments);
    return new HashSet<string>(matches);
}
StringSplitOptions.RemoveEmptyEntries handles adjacent separators (e.g. --). Cartesian is roughly equivalent to your existing nested for loops. The Where is to remove null entries (i.e. if x == y). Concat is the same as your existing Concat. The use of HashSet allows for your Contains calls (in IsMatch) to be faster.

Returning potential strings from a list of strings and their next chars performance

This is a question about efficiently returning strings and chars from a string array, specifically:
The strings in the array that start with the supplied user input.
The next letter of each of those strings, as a collection of chars.
The idea is that when the user types a letter, the potential responses are displayed along with their next letters. Therefore response time is important, hence a performant algorithm is required.
E.g. If the string array contained:
string[] stringArray = new string[] { "Moose", "Mouse", "Moorhen", "Leopard", "Aardvark" };
If the user types in “Mo”, then “Moose”, “Mouse” and “Moorhen” should be returned along with chars “o” and “u” for the potential next letters.
This felt like a job for LINQ, so my current implementation as a static method is (I store the output to a Suggestions object which just has properties for the 2 returned lists):
public static Suggestions GetSuggestions(String userInput, String[] stringArray)
{
    // Get all possible strings based on the user input. This will always contain
    // values which are the same length or longer than the user input.
    IEnumerable<string> possibleStrings = stringArray.Where(x => x.StartsWith(userInput));
    IEnumerable<char> nextLetterChars = null;
    // If we have possible strings and we have some input, get the next letter(s)
    if (possibleStrings.Any() &&
        !string.IsNullOrEmpty(userInput))
    {
        // the user input contains chars, so lets find the possible next letters.
        nextLetterChars =
            possibleStrings.Select<string, char>
            (x =>
            {
                // The input is the same as the possible string so return an empty char.
                if (x == userInput)
                {
                    return '\0';
                }
                else
                {
                    // Remove the user input from the start of the possible string, then get
                    // the next character.
                    return x.Substring(userInput.Length, x.Length - userInput.Length)[0];
                }
            });
    } // End if
I implemented a second version which actually stored all typing combinations to a list of dictionaries; one for each word, with key on combination and value as the actual animal required, e.g.:
Dictionary 1:
Key     Value
“M”     “Moose”
“MO”    “Moose”
Etc.
Dictionary 2:
Key     Value
“M”     “Mouse”
“MO”    “Mouse”
Etc.
Since dictionary access has an O(1) retrieval time – I thought perhaps this would be a better approach.
So for loading the dictionaries at start up:
List<Dictionary<string, string>> animalCombinations = new List<Dictionary<string, string>>();
foreach (string animal in stringArray)
{
    Dictionary<string, string> animalCombination = new Dictionary<string, string>();
    string accumulatedAnimalString = string.Empty;
    foreach (char character in animal)
    {
        accumulatedAnimalString += character;
        animalCombination[accumulatedAnimalString] = animal;
    }
    animalCombinations.Add(animalCombination);
}
And then at runtime to get possible strings:
// Select value entries from the list of dictionaries which contain
// keys which match the user input and flatten into one list.
IEnumerable<string> possibleStrings =
    animalCombinations.SelectMany
    (animalCombination =>
    {
        return animalCombination.Values.Where(x =>
            animalCombination.ContainsKey(userInput));
    });
So questions are:
Which approach is better?
Is there a better approach to this which has better performance?
Are LINQ expressions expensive to process?
Thanks
Which approach is better?
Probably the dictionary approach, but you'll have to profile to find out.
Is there a better approach to this which has better performance?
Use a prefix tree.
Are LINQ expressions expensive to process?
Written correctly, they add very little overhead to imperative versions of the same code. Since they are easier to read and maintain and write, they are usually the way to go.
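The prefix tree suggested above can be sketched as follows. Storing, at every node, the list of words that pass through it is one possible design that trades memory for lookup speed; the names TrieNode, build_trie and suggestions are hypothetical, and matching is case-sensitive.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> child node
        self.words = []     # every word whose prefix ends at this node


def build_trie(words):
    """Build the tree once at start-up."""
    root = TrieNode()
    for word in words:
        node = root
        node.words.append(word)
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.words.append(word)
    return root


def suggestions(root, prefix):
    """Return (matching words, possible next letters) for prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return [], []
    return node.words, sorted(node.children)


ANIMALS = ["Moose", "Mouse", "Moorhen", "Leopard", "Aardvark"]
ROOT = build_trie(ANIMALS)
print(suggestions(ROOT, "Mo"))
```

A lookup walks one node per typed character, so its cost depends on the length of the input rather than on the number of words stored.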

c# - BinarySearch StringList with wildcard

I have a sorted StringList and wanted to replace
foreach (string line3 in CardBase.cardList)
    if (line3.ToLower().IndexOf((cardName + Config.EditionShortToLong(edition)).ToLower()) >= 0)
    {
        return true;
    }
with a binary search, since the cardList is rather large (~18k entries) and this search takes up around 80% of the time.
So I found the List.BinarySearch method, but my problem is that the lines in the cardList look like this:
Brindle_Boar_(Magic_2012).c1p247924.prod
But I have no way to generate the c1p... part, which is a problem because List.BinarySearch only finds exact matches.
How do I modify List.BinarySearch so that it finds a match if only a part of the string matches?
e. g.
searching for Brindle_Boar_(Magic_2012) should return the position of Brindle_Boar_(Magic_2012).c1p247924.prod
If an exact match is not found, List.BinarySearch returns the bitwise complement of the index of the next item that is larger than the requested one.
So, you can do it like this (assuming you'll never get an exact match):
var key = (cardName + Config.EditionShortToLong(edition)).ToLower();
var list = CardBase.cardList;
var index = ~list.BinarySearch(key);
return index != list.Count && list[index].StartsWith(key);
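The same "find the insertion point, then test the prefix" idea can be tried out in Python with the bisect module. This sketch assumes the list is sorted with the same ordinal comparison that the search uses; the function name and the sample entries other than the Brindle_Boar line are made up.

```python
from bisect import bisect_left


def contains_prefix(sorted_items, prefix):
    """True if any entry of the sorted list starts with prefix."""
    # Index of the first entry that is >= prefix. Every entry that starts
    # with prefix sorts at or after this position, so checking this one
    # entry is enough.
    index = bisect_left(sorted_items, prefix)
    return index < len(sorted_items) and sorted_items[index].startswith(prefix)


CARDS = sorted([
    "Brindle_Boar_(Magic_2012).c1p247924.prod",
    "Aven_Fleetwing_(Magic_2012).c1p000001.prod",  # made-up entry
    "Zombie_Goliath_(Magic_2012).c1p000002.prod",  # made-up entry
])
print(contains_prefix(CARDS, "Brindle_Boar_(Magic_2012)"))
```

Unlike the complement trick, bisect_left returns a plain index in both the exact-match and the no-match case, so no special handling is needed for an exact hit.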
BinarySearch() has an overload that takes an IComparer<T> as its second parameter. Implement a custom comparer that returns 0 when you have a match within the string; you can use the same IndexOf() method there.
Edit:
Does a binary search make sense in your scenario? How do you determine that a certain item is "less" or "greater" than another item? Right now you only provide what would constitute a match. Only if you can answer this question, binary search applies in the first place.
You can take a look at the C5 Generic Collection Library (you can install it via NuGet also).
Use the SortedArray(T) type for your collection. It provides a handful of methods that could prove useful. You can even query for ranges of items very efficiently.
var data = new SortedArray<string>();
// query for first string greater than "Brindle_Boar_(Magic_2012)" and check if it starts
// with "Brindle_Boar_(Magic_2012)"
var a = data.RangeFrom("Brindle_Boar_(Magic_2012)").FirstOrDefault();
return a != null && a.StartsWith("Brindle_Boar_(Magic_2012)");
// query for first 5 items that start with "Brindle_Boar"
var b = data.RangeFrom("string").Take(5).Where(s => s.StartsWith("Brindle_Boar"));
// query for all items that start with "Brindle_Boar" (provided only ascii chars)
var c = data.RangeFromTo("Brindle_Boar", "Brindle_Boar~").ToList();
// query for all items that start with "Brindle_Boar", iterates until first non-match
var d = data.RangeFrom("Brindle_Boar").TakeWhile(s => s.StartsWith("Brindle_Boar"));
The RangeFrom... methods perform a binary search to find the first element greater than or equal to your argument, and return an iterator starting at that position.
