Fastest starts-with search algorithm - C#

I need to implement a search algorithm which only searches from the start of the string rather than anywhere within the string.
I am new to algorithms but from what I can see it seems as though they go through the string and find any occurrence.
I have a collection of strings (over 1 million) which needs to be searched every time the user types a keystroke.
EDIT:
This will be an incremental search. I currently have it implemented with the following code, and my searches come back in 300-700 ms across over 1 million possible strings. The collection isn't ordered, but there is no reason it couldn't be.
private ICollection<string> SearchCities(string searchString) {
return _cityDataSource.AsParallel().Where(x => x.ToLower().StartsWith(searchString)).ToArray();
}

I've adapted the code from this article from Visual Studio Magazine that implements a Trie.
The following program demonstrates how to use a Trie to do fast prefix searching.
To run this program, you will need a text file called "words.txt" with a large list of words. You can download one from GitHub here.
After you compile the program, copy the "words.txt" file into the same folder as the executable.
When you run the program, type a prefix (such as prefix ;)) and press return, and it will list all the words beginning with that prefix.
This should be a very fast lookup - see the Visual Studio Magazine article for more details!
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
namespace ConsoleApp1
{
class Program
{
static void Main()
{
var trie = new Trie();
trie.InsertRange(File.ReadLines("words.txt"));
Console.WriteLine("Type a prefix and press return.");
while (true)
{
string prefix = Console.ReadLine();
if (string.IsNullOrEmpty(prefix))
continue;
var node = trie.Prefix(prefix);
if (node.Depth == prefix.Length)
{
foreach (var suffix in suffixes(node))
Console.WriteLine(prefix + suffix);
}
else
{
Console.WriteLine("Prefix not found.");
}
Console.WriteLine();
}
}
static IEnumerable<string> suffixes(Node parent)
{
var sb = new StringBuilder();
return suffixes(parent, sb).Select(suffix => suffix.TrimEnd('$'));
}
static IEnumerable<string> suffixes(Node parent, StringBuilder current)
{
if (parent.IsLeaf())
{
yield return current.ToString();
}
else
{
foreach (var child in parent.Children)
{
current.Append(child.Value);
foreach (var value in suffixes(child, current))
yield return value;
--current.Length;
}
}
}
}
public class Node
{
public char Value { get; set; }
public List<Node> Children { get; set; }
public Node Parent { get; set; }
public int Depth { get; set; }
public Node(char value, int depth, Node parent)
{
Value = value;
Children = new List<Node>();
Depth = depth;
Parent = parent;
}
public bool IsLeaf()
{
return Children.Count == 0;
}
public Node FindChildNode(char c)
{
return Children.FirstOrDefault(child => child.Value == c);
}
public void DeleteChildNode(char c)
{
// RemoveAll avoids skipping an element, which RemoveAt inside a forward loop can do
Children.RemoveAll(child => child.Value == c);
}
}
public class Trie
{
readonly Node _root;
public Trie()
{
_root = new Node('^', 0, null);
}
public Node Prefix(string s)
{
var currentNode = _root;
var result = currentNode;
foreach (var c in s)
{
currentNode = currentNode.FindChildNode(c);
if (currentNode == null)
break;
result = currentNode;
}
return result;
}
public bool Search(string s)
{
var prefix = Prefix(s);
return prefix.Depth == s.Length && prefix.FindChildNode('$') != null;
}
public void InsertRange(IEnumerable<string> items)
{
foreach (string item in items)
Insert(item);
}
public void Insert(string s)
{
var commonPrefix = Prefix(s);
var current = commonPrefix;
for (var i = current.Depth; i < s.Length; i++)
{
var newNode = new Node(s[i], current.Depth + 1, current);
current.Children.Add(newNode);
current = newNode;
}
current.Children.Add(new Node('$', current.Depth + 1, current));
}
public void Delete(string s)
{
if (!Search(s))
return;
var node = Prefix(s).FindChildNode('$');
while (node.Parent != null && node.IsLeaf()) // stop at the root, whose Parent is null
{
var parent = node.Parent;
parent.DeleteChildNode(node.Value);
node = parent;
}
}
}
}

A couple of thoughts:
First, your million strings need to be ordered, so that you can "seek" to the first matching string and return strings until you no longer have a match...in order (seek via C# List<string>.BinarySearch, perhaps). That's how you touch the least number of strings possible.
Second, you should probably not try to hit the string list until there's a pause in input of at least 500 ms (give or take).
Third, your queries into the vastness should be async and cancelable, because it's certainly going to be the case that one effort will be superseded by the next keystroke.
Finally, any subsequent query should first check that the new search string is an append of the most recent search string...so that you can begin your subsequent seek from the last seek (saving lots of time).
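A minimal sketch of the first point, assuming the strings are sorted once up front with ordinal comparison and are already normalized (e.g. lower-cased) if the search should be case-insensitive. PrefixIndex, Build, and StartingWith are illustrative names, not from any library. It uses a hand-written lower-bound binary search rather than List<string>.BinarySearch so that it always lands on the first match:

```csharp
using System;
using System.Collections.Generic;

public static class PrefixIndex
{
    // Sort once, up front, with ordinal ordering so that all strings
    // sharing a prefix end up next to each other.
    public static string[] Build(IEnumerable<string> items)
    {
        var sorted = new List<string>(items).ToArray();
        Array.Sort(sorted, StringComparer.Ordinal);
        return sorted;
    }

    // Index of the first element that is >= key in ordinal order.
    private static int LowerBound(string[] sorted, string key)
    {
        int lo = 0, hi = sorted.Length;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sorted[mid], key) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Seek to the first candidate, then yield until the prefix stops matching.
    public static IEnumerable<string> StartingWith(string[] sorted, string prefix)
    {
        for (int i = LowerBound(sorted, prefix); i < sorted.Length; i++)
        {
            if (!sorted[i].StartsWith(prefix, StringComparison.Ordinal)) yield break;
            yield return sorted[i];
        }
    }
}
```

With the array built from Boston, Berlin, Bern, Paris, Bergen, calling PrefixIndex.StartingWith(index, "Ber") yields Bergen, Berlin, Bern and touches only those three entries plus the one that ends the run.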

I suggest using LINQ.
string x = "searchterm";
List<string> y = new List<string>();
List<string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();
Where x is your keystroke search text term, y is your collection of strings to search, and Matches is the matches from your collection.
I tested this with the first 1 million prime numbers, here is the code adapted from above:
Stopwatch SW = new Stopwatch();
SW.Start();
string x = "2";
List<string> y = System.IO.File.ReadAllText("primes1.txt").Split(' ').ToList();
y.RemoveAll(xo => xo == " " || xo == "" || xo == "\r\r\n");
List <string> Matches = y.Where(xo => xo.StartsWith(x)).ToList();
SW.Stop();
Console.WriteLine("matches: " + Matches.Count);
Console.WriteLine("time taken: " + SW.Elapsed.TotalSeconds);
Console.Read();
Result is:
matches: 77025
time taken: 0.4240604
Of course this is testing against numbers, and I don't know whether LINQ converts the values first, or whether numbers make any difference.
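One detail worth noting: string.StartsWith(string) does a culture-sensitive comparison by default. A small variation of the query above, assuming a case-insensitive ordinal match is what is wanted (PrefixFilter and Matching are made-up names for this sketch), is usually cheaper per item and avoids allocating a lower-cased copy of every string:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixFilter
{
    // Still a linear scan over every item; only the per-item comparison changes.
    public static List<string> Matching(IEnumerable<string> items, string prefix)
    {
        return items
            .Where(item => item.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

This does not change the fact that every string is visited on every keystroke, so it only trims the constant factor.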

Related

Group list of strings by subparts of item

Sorry for the ambiguous title...
I find explaining my issue difficult - let me know if you need to know more.
I've got a list that I'd like to group by part of a string.
This string is also in the list.
This is the complete list; it's not static and will contain different values.
CookieMaker_TransportSettingsManual
CookieMaker_TransportSettingsParameters
Cookie_WrapperSettings
Cookie_WrapperSettingsManual
Cookie_WrapperSettingsParameters
Cookie_ProfileBendSettings
Cookie_ProfileBendSettingsParameters
Cookie_HopperSettings
Cookie_HopperSettingsManual
Cookie_HopperSettingsParameters
Cookie_CutterSettings
Cookie_CutterSettingsManual
Cookie_CutterSettingsParameters
General_SpeedSetting
General_SpeedSettingManual
General_SpeedSettingSettings
General_CalibrationSettings
General_CalibrationSettingsCalibration
Bonbon_Vertical
Bonbon_VerticalAligner
Bonbon_VerticalHopper
Bonbon_VerticalManual
Bonbon_VerticalTransporter
Bonbon_Horizontal
Bonbon_HorizontalHopper
Bonbon_HorizontalManual
Bonbon_HorizontalCookie
Bonbon_HorizontalTransporter
Bonbon_Bonbon
Bonbon_BonbonExhaust
Bonbon_BonbonManual
Bonbon_BonbonSection1
Bonbon_BonbonSection2
Bonbon_BonbonSection3
Bonbon_Compensator
Bonbon_CompensatorCarriage
Bonbon_CompensatorHopper
Bonbon_CompensatorManual
Bonbon_CollectingUnit
Bonbon_CollectingUnitManual
Bonbon_CollectingUnitTransporter
Bonbon_CollectingUnitTubeMaker
CookieMaker_TransportSettings
CookieMaker_TransportSettingsBonbon
CookieMaker_TransportSettingsPandora
The expected result would be groups like so:
General_SpeedSetting
==> General_SpeedSettingManual
==> General_SpeedSettingSettings
Cookie_WrapperSettings
==> Cookie_WrapperSettingsManual
==> Cookie_WrapperSettingsParameters
The resulting datatype does not matter.
Also, I don't mind LINQ.
Code / fiddle to get up and running quickly;
using System;
public class Program
{
public static void Main()
{
var inputString = "CookieMaker_TransportSettingsManual|CookieMaker_TransportSettingsParameters|Cookie_WrapperSettings|Cookie_WrapperSettingsManual|Cookie_WrapperSettingsParameters|Cookie_ProfileBendSettings|Cookie_ProfileBendSettingsParameters|Cookie_HopperSettings|Cookie_HopperSettingsManual|Cookie_HopperSettingsParameters|Cookie_CutterSettings|Cookie_CutterSettingsManual|Cookie_CutterSettingsParameters|General_SpeedSetting|General_SpeedSettingManual|General_SpeedSettingSettings|General_CalibrationSettings|General_CalibrationSettingsCalibration|Bonbon_Vertical|Bonbon_VerticalAligner|Bonbon_VerticalHopper|Bonbon_VerticalManual|Bonbon_VerticalTransporter|Bonbon_Horizontal|Bonbon_HorizontalHopper|Bonbon_HorizontalManual|Bonbon_HorizontalCookie|Bonbon_HorizontalTransporter|Bonbon_Bonbon|Bonbon_BonbonExhaust|Bonbon_BonbonManual|Bonbon_BonbonSection1|Bonbon_BonbonSection2|Bonbon_BonbonSection3|Bonbon_Compensator|Bonbon_CompensatorCarriage|Bonbon_CompensatorHopper|Bonbon_CompensatorManual|Bonbon_CollectingUnit|Bonbon_CollectingUnitManual|Bonbon_CollectingUnitTransporter|Bonbon_CollectingUnitTubeMaker|CookieMaker_TransportSettings|CookieMaker_TransportSettingsBonbon|CookieMaker_TransportSettingsPandora";
var inputList = inputString.Split('|');
var result = inputList; // Code here ;)
foreach(var r in result)
{ Console.WriteLine(r);}
}
}
https://dotnetfiddle.net/neCUEL
What about something like this?
using System;
using System.Collections.Generic;
using System.Linq;
public class Program
{
static List<string> myList = new List<string>(){
"CookieMaker_TransportSettingsManual",
"CookieMaker_TransportSettingsParameters",
"Cookie_WrapperSettings",
"Cookie_WrapperSettingsManual",
"Cookie_WrapperSettingsParameters",
"Cookie_ProfileBendSettings",
"Cookie_ProfileBendSettingsParameters",
"Cookie_HopperSettings",
"Cookie_HopperSettingsManual",
"Cookie_HopperSettingsParameters",
"Cookie_CutterSettings",
"Cookie_CutterSettingsManual",
"Cookie_CutterSettingsParameters",
"General_SpeedSetting",
"General_SpeedSettingManual",
"General_SpeedSettingSettings",
"General_CalibrationSettings",
"General_CalibrationSettingsCalibration",
"Bonbon_Vertical",
"Bonbon_VerticalAligner",
"Bonbon_VerticalHopper",
"Bonbon_VerticalManual",
"Bonbon_VerticalTransporter",
"Bonbon_Horizontal",
"Bonbon_HorizontalHopper",
"Bonbon_HorizontalManual",
"Bonbon_HorizontalCookie",
"Bonbon_HorizontalTransporter",
"Bonbon_Bonbon",
"Bonbon_BonbonExhaust",
"Bonbon_BonbonManual",
"Bonbon_BonbonSection1",
"Bonbon_BonbonSection2",
"Bonbon_BonbonSection3",
"Bonbon_Compensator",
"Bonbon_CompensatorCarriage",
"Bonbon_CompensatorHopper",
"Bonbon_CompensatorManual",
"Bonbon_CollectingUnit",
"Bonbon_CollectingUnitManual",
"Bonbon_CollectingUnitTransporter",
"Bonbon_CollectingUnitTubeMaker",
"CookieMaker_TransportSettings",
"CookieMaker_TransportSettingsBonbon",
"CookieMaker_TransportSettingsPandora"
};
static Dictionary<string, List<string>> results = new Dictionary<string, List<string>>();
//-------------------------------------------------------------------------//
public static void Main()
{
var orderedList = myList.OrderBy(i=>i).ToList();
int i = 0;
while(i < myList.Count){
var prefix = orderedList[i];
results[prefix] = new List<string>();
if(++i >= orderedList.Count) break;
while(orderedList[i].StartsWith(prefix)){
results[prefix].Add(orderedList[i]);
i++;
if(i >= orderedList.Count) {
Print();
return;
}
}//while
}//while
Print();
}//main
//-------------------------------------------------------------------------//
private static void Print(){
foreach (string prefix in results.Keys)
{
Console.WriteLine($"Prefix =>{prefix} - {results[prefix].Count}");
foreach (string result in results[prefix])
{
Console.WriteLine($" ======>{result}");
}//foreach;
}//foreach
}//Print
}//Cls
Fiddle:
https://dotnetfiddle.net/GTI4vV
I'm surprised you accepted a solution that pre-sorted the items. When I tried that, the Bonbon sections got terribly messed up.
My solution is a bit hacky - to get this to work the way I think you want it took a lot of special cases (and fixing off-by-one issues).
The code takes care of this kind of pattern:
CookieMaker_TransportSettingsManual
CookieMaker_TransportSettingsParameters
extracting CookieMaker_TransportSettings and putting both entries under it. It also copes with the fact that you have CookieMaker_TransportSettings at the beginning and the end of the file.
It also handles this:
Bonbon_BonbonSection1
Bonbon_BonbonSection2
Bonbon_BonbonSection3
Figuring that you want the three of those to be part of the Bonbon_Bonbon section and not a new Bonbon_BonbonSection section with three entries (1, 2 and 3).
It also deals with all the Cookie** and Bonbon** sections.
Here's the main code:
//get all the strings from somewhere
var inputStrings = File.ReadAllLines("DataFile.txt");
string lastTitle = null;
var results = new Dictionary<string, List<string>>();
string veryLastItem = string.Empty;
var currentItems = new List<string>();
for (var i = 0; i < inputStrings.Length - 1; ++i)
{
var commonPrefix = FindLongestCommonPrefix(inputStrings[i], inputStrings[i + 1]);
if (string.IsNullOrEmpty(commonPrefix) || (!string.IsNullOrEmpty(lastTitle) && commonPrefix != lastTitle))
{
if (string.IsNullOrEmpty(lastTitle))
{
throw new Exception("This isn't going to work - you need to have at least two common things in a row");
}
if (inputStrings[i].StartsWith(lastTitle) && inputStrings[i] != lastTitle)
{
currentItems.Add(inputStrings[i]);
}
AddResultsToDictionary(results, lastTitle, currentItems);
currentItems = new List<string>();
}
if (commonPrefix != inputStrings[i] &&
((commonPrefix == lastTitle && commonPrefix != inputStrings[i]) ||
(!string.IsNullOrEmpty(commonPrefix) && inputStrings[i].StartsWith(commonPrefix))))
{
currentItems.Add(inputStrings[i]);
}
lastTitle = commonPrefix;
veryLastItem = inputStrings[i + 1];
}
//ok, we're out of the loop:
//add the last item to the current list
currentItems.Add(veryLastItem);
//and add the last set of items to the dictionary
if (lastTitle != null)
{
AddResultsToDictionary(results, lastTitle, currentItems);
}
foreach (var result in results)
{
Debug.WriteLine(result.Key);
foreach (var item in result.Value)
{
Debug.WriteLine($" ==> {item}");
}
}
void AddResultsToDictionary(Dictionary<string, List<string>> dictionary, string s, List<string> list)
{
if (dictionary.TryGetValue(s, out var existingList))
{
existingList.AddRange(list);
}
else
{
dictionary.Add(s, list);
}
}
}
And it calls this function to determine the section headings:
private string FindLongestCommonPrefix(string s1, string s2)
{
var minLen = Math.Min(s1.Length, s2.Length);
for (var i = 0; i < minLen; ++i)
{
if (s1[i] != s2[i])
{
if (i == 0)
{
return string.Empty;
}
else
{
//if the common part is not s1, we need to find the last place where the following
// the last letter of the common part is a lower case letter followed by either
// an underscore or a capital letter
if (i == s1.Length)
{
return s1;
}
if (s1[i] == '_' || s1[i - 1] == '_' || s2[i] == '_' || s2[i - 1] == '_')
{
return string.Empty;
}
for (var j = i; j > 0; --j)
{
if (char.IsLower(s1[j-1]) && (char.IsUpper(s1[j]) /*|| s1[j] == '_'*/))
{
return s1.Substring(0, j);
}
}
//I shouldn't get here, but, if I do
return string.Empty;
}
}
}
//otherwise
return s1.Substring(0, minLen);
}
The result ends up looking like:
CookieMaker_TransportSettings
==> CookieMaker_TransportSettingsManual
==> CookieMaker_TransportSettingsParameters
==> CookieMaker_TransportSettingsBonbon
==> CookieMaker_TransportSettingsPandora
Cookie_WrapperSettings
==> Cookie_WrapperSettingsManual
==> Cookie_WrapperSettingsParameters
Cookie_ProfileBendSettings
==> Cookie_ProfileBendSettingsParameters
Cookie_HopperSettings
==> Cookie_HopperSettingsManual
==> Cookie_HopperSettingsParameters
Cookie_CutterSettings
==> Cookie_CutterSettingsManual
==> Cookie_CutterSettingsParameters
General_SpeedSetting
==> General_SpeedSettingManual
==> General_SpeedSettingSettings
General_CalibrationSettings
==> General_CalibrationSettingsCalibration
Bonbon_Vertical
==> Bonbon_VerticalAligner
==> Bonbon_VerticalHopper
==> Bonbon_VerticalManual
==> Bonbon_VerticalTransporter
Bonbon_Horizontal
==> Bonbon_HorizontalHopper
==> Bonbon_HorizontalManual
==> Bonbon_HorizontalCookie
==> Bonbon_HorizontalTransporter
Bonbon_Bonbon
==> Bonbon_BonbonExhaust
==> Bonbon_BonbonManual
==> Bonbon_BonbonSection1
==> Bonbon_BonbonSection2
==> Bonbon_BonbonSection3
Bonbon_Compensator
==> Bonbon_CompensatorCarriage
==> Bonbon_CompensatorHopper
==> Bonbon_CompensatorManual
Bonbon_CollectingUnit
==> Bonbon_CollectingUnitManual
==> Bonbon_CollectingUnitTransporter
==> Bonbon_CollectingUnitTubeMaker
Try the following:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
namespace ConsoleApplication1
{
class Program
{
const string FILENAME = @"c:\temp\test.txt";
static void Main(string[] args)
{
List<string> lines = File.ReadLines(FILENAME).ToList();
lines = lines.OrderBy(x => x).ToList();
List<Group> groups = new List<Group>();
Group group = new Group();
groups.Add(group);
group.basename = lines[0].Trim();
List<List<string>> results = new List<List<string>>();
for (int i = 1; i < lines.Count; i++) // start at 1: lines[0] is already the first basename
{
string line = lines[i].Trim();
if (!line.StartsWith(group.basename))
{
group = new Group();
groups.Add(group);
group.basename = line;
}
else
{
if(group.values == null) group.values = new List<string>();
group.values.Add(line.Substring(group.basename.Length));
}
}
}
}
public class Group
{
public string basename { get; set; }
public List<string> values { get; set; }
}
}

fill in delimited sequence

I have a file that contains a list of delimited sequence numbers as a record key. I need to fill in the missing sequence. So if I have
8
8.2
8.3.4.1
I need to add
8.1
8.3
8.3.1
8.3.2
8.3.3
8.3.4
I have come up with a few algorithms, but they're all horribly complex and have too many cases. Is there an easy way to do this or do I have to plod through? I'm using C# but Java would do.
Not sure if my solution is easy to understand, but let's try. The idea is that we recursively insert missing sequences between existing ones.
First, you need to parse your file to create a List of items representing existing sequence. Every item should have reference to the next one (linked list idea).
public class Item
{
public int Value { get; set; }
public Item SubItem { get; set; }
public Item NextItem { get; set; }
public Item(int value, Item subItem)
{
Value = value;
SubItem = subItem;
}
public Item CreatePreviousItem()
{
if (SubItem == null)
{
return Value == 1 ? null : new Item(Value - 1, null);
}
return new Item(Value, SubItem.CreatePreviousItem());
}
public bool IsItemMissingPrior(Item item)
{
if (item == null)
{
return false;
}
return
item.Value - Value > 1
|| (SubItem == null && item.SubItem != null && item.SubItem.Value > 1) //edge case
|| (SubItem != null && SubItem.IsItemMissingPrior(item.SubItem));
}
public override string ToString()
{
return Value + (SubItem != null ? "." + SubItem : "");
}
}
Assuming that sequences are delimited by new line symbol, you can use the following Parse method.
private List<Item> Parse(string s)
{
var result = new List<Item>();
var numberLines = s.Split(new[] {Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries);
foreach (var numberLine in numberLines)
{
var numbers = numberLine.Split(new[] {'.'}).Reverse();
Item itemInstance = null;
foreach (var number in numbers)
{
itemInstance = new Item(Convert.ToInt32(number), itemInstance);
}
if (result.Count > 0)
{
result.Last().NextItem = itemInstance;
}
result.Add(itemInstance);
}
return result;
}
Here is a recursive method which inserts missing sequences between two existing ones
private void UpdateSequence(Item item)
{
if (item.IsItemMissingPrior(item.NextItem))
{
var inBetweenItem = item.NextItem.CreatePreviousItem();
inBetweenItem.NextItem = item.NextItem;
item.NextItem = inBetweenItem;
UpdateSequence(item);
}
}
And finally the use case:
var inputItems = Parse(inputString);
foreach (var item in inputItems)
{
UpdateSequence(item);
}
That's it. To see the result, you just need to get the first item from the list and keep moving forward using NextItem property. For example
var displayItem = inputItems.FirstOrDefault();
while (displayItem != null)
{
Console.WriteLine(displayItem.ToString());
displayItem = displayItem.NextItem;
}
Hope it helps.
One easy way (though not optimal in some cases) is to maintain a set of existing keys. For each key in your initial sequence, add all of its preceding keys to that set. This can be done in two loops: the inner loop adds the key to the set and decreases the key's last number by one while that number is greater than zero; the outer loop shortens the key by one component.
And then you just need to output the set in sorted order.
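A rough sketch of that set-based idea, with two assumptions that are not in the answer: numbering starts at 1 on every level, and the top-level number is not back-filled (so 8 does not generate 1 through 7), which matches the expected output in the question. SequenceFiller and Fill are made-up names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SequenceFiller
{
    // Returns the input keys plus every missing predecessor, sorted numerically.
    public static List<string> Fill(IEnumerable<string> keys)
    {
        var seen = new HashSet<string>();
        foreach (var key in keys)
        {
            var parts = key.Split('.').Select(s => int.Parse(s)).ToList();
            // Outer loop: shorten the key by one component per pass.
            while (parts.Count > 1)
            {
                // Inner loop: count the last component down to 1.
                for (int last = parts[parts.Count - 1]; last >= 1; last--)
                {
                    parts[parts.Count - 1] = last;
                    seen.Add(string.Join(".", parts));
                }
                parts.RemoveAt(parts.Count - 1);
            }
            // Assumption: the top-level number is kept but not back-filled.
            seen.Add(parts[0].ToString());
        }
        return seen
            .Select(k => k.Split('.').Select(s => int.Parse(s)).ToArray())
            .OrderBy(p => p, Comparer<int[]>.Create(CompareKeys))
            .Select(p => string.Join(".", p))
            .ToList();
    }

    // Component-wise numeric comparison; a shorter key sorts before its extensions.
    private static int CompareKeys(int[] a, int[] b)
    {
        for (int i = 0; i < Math.Min(a.Length, b.Length); i++)
        {
            if (a[i] != b[i]) return a[i].CompareTo(b[i]);
        }
        return a.Length.CompareTo(b.Length);
    }
}
```

For the input 8, 8.2, 8.3.4.1 this yields 8, 8.1, 8.2, 8.3, 8.3.1, 8.3.2, 8.3.3, 8.3.4, 8.3.4.1. Sorting on parsed integers rather than on the raw strings keeps 8.10 after 8.9.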

Grouping by an unknown initial prefix

Say I have the following array of strings as an input:
foo-139875913
foo-aeuefhaiu
foo-95hw9ghes
barbazabejgoiagjaegioea
barbaz8gs98ghsgh9es8h
9a8efa098fea0
barbaza98fyae9fghaefag
bazfa90eufa0e9u
bazgeajga8ugae89u
bazguea9guae
aifeaufhiuafhe
There are 3 different prefixes used here, "foo-", "barbaz" and "baz" - however these prefixes are not known ahead of time (they could be something completely different).
How could you establish what the different common prefixes are so that they could then be grouped by? This is made a bit tricky since in the data I've provided there's two that start with "bazg" and one that starts "bazf" where of course "baz" is the prefix.
What I've tried so far is sorting them into alphabetical order, and then looping through them in order and counting how many characters in a row are identical to the previous. If the number is different or when 0 characters are identical, it starts a new group. The problem with this is it falls over at the "bazg" and "bazf" problem I mentioned earlier and separates those into two different groups (one with just one element in it)
Edit: Alright, let's throw a few more rules in:
Longer potential groups should generally be preferred over shorter ones, unless there is a closely matching group of less than X characters difference in length. (So where X is 2, baz would be preferred over bazg)
A group must have at least Y elements in it or not be a group at all
It's okay to simply throw away elements that don't match any of the 'groups' to within the rules above.
To clarify the first rule in relation to the second, if X was 0 and Y was 2, then the two 'bazg' entries would be in a group, and the 'bazf' would be thrown away because its on its own.
Well, here's a quick hack, probably O(something_bad):
IEnumerable<Tuple<String, IEnumerable<string>>> GuessGroups(IEnumerable<string> source, int minNameLength=0, int minGroupSize=1)
{
// TODO: error checking
return InnerGuessGroups(new Stack<string>(source.OrderByDescending(x => x)), minNameLength, minGroupSize);
}
IEnumerable<Tuple<String, IEnumerable<string>>> InnerGuessGroups(Stack<string> source, int minNameLength, int minGroupSize)
{
if(source.Any())
{
var tuple = ExtractTuple(GetBestGroup(source, minNameLength), source);
if (tuple.Item2.Count() >= minGroupSize)
yield return tuple;
foreach (var element in GuessGroups(source, minNameLength, minGroupSize))
yield return element;
}
}
Tuple<String, IEnumerable<string>> ExtractTuple(string prefix, Stack<string> source)
{
return Tuple.Create(prefix, PopWithPrefix(prefix, source).ToList().AsEnumerable());
}
IEnumerable<string> PopWithPrefix(string prefix, Stack<string> source)
{
while (source.Any() && source.Peek().StartsWith(prefix))
yield return source.Pop();
}
string GetBestGroup(IEnumerable<string> source, int minNameLength)
{
var s = new Stack<string>(source);
var counter = new DictionaryWithDefault<string, int>(0);
while(s.Any())
{
var g = GetCommonPrefix(s);
if(!string.IsNullOrEmpty(g) && g.Length >= minNameLength)
counter[g]++;
s.Pop();
}
return counter.OrderBy(c => c.Value).Last().Key;
}
string GetCommonPrefix(IEnumerable<string> coll)
{
return (from len in Enumerable.Range(0, coll.Min(s => s.Length)).Reverse()
let possibleMatch = coll.First().Substring(0, len)
where coll.All(f => f.StartsWith(possibleMatch))
select possibleMatch).FirstOrDefault();
}
public class DictionaryWithDefault<TKey, TValue> : Dictionary<TKey, TValue>
{
TValue _default;
public TValue DefaultValue {
get { return _default; }
set { _default = value; }
}
public DictionaryWithDefault() : base() { }
public DictionaryWithDefault(TValue defaultValue) : base() {
_default = defaultValue;
}
public new TValue this[TKey key]
{
get { return base.ContainsKey(key) ? base[key] : _default; }
set { base[key] = value; }
}
}
Example usage:
string[] input = {
"foo-139875913",
"foo-aeuefhaiu",
"foo-95hw9ghes",
"barbazabejgoiagjaegioea",
"barbaz8gs98ghsgh9es8h",
"barbaza98fyae9fghaefag",
"bazfa90eufa0e9u",
"bazgeajga8ugae89u",
"bazguea9guae",
"9a8efa098fea0",
"aifeaufhiuafhe"
};
GuessGroups(input, 3, 2).Dump();
Ok, well as discussed, the problem wasn't initially well defined, but here is how I'd go about it.
Create a tree T
Parse the list, for each element:
    for each letter in that element
        if a branch labelled with that letter exists then
            Increment the counter on that branch
            Descend that branch
        else
            Create a branch labelled with that letter
            Set its counter to 1
            Descend that branch
This gives you a tree where each of the leaves represents a word in your input. Each of the non-leaf nodes has a counter representing how many leaves are (eventually) attached to that node. Now you need a formula to weight the length of the prefix (the depth of the node) against the size of the prefix group. For now:
S = (a * d) + (b * q) // d = depth, q = quantity, a, b coefficients you'll tweak to get desired behaviour
So now you can iterate over each of the non-leaf node and assign them a score S. Then, to work out your groups you would
For each non-leaf node
    Assign score S
    Insertion sort the node into a list, so the head is the highest scoring node
Starting at the root of the tree, traverse the nodes
    If the node is the highest scoring node in the list
        Mark it as a prefix
        Remove all nodes from the list that are a descendant of it
        Pop itself off the front of the list
        Return up the tree
This should give you a list of prefixes. The last part feels like some clever data structures or algorithms could speed it up (the step of removing all the children feels particularly weak, but if your input size is small, I guess speed isn't too important).
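Here is a small sketch of the counting tree and the score S = (a * d) + (b * q) described above. It only builds the tree and ranks candidate prefixes; the final step of pruning the descendants of a chosen prefix is left out. CountingTrie, Add, ScoredPrefixes, and the minCount cut-off are names and choices made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CountingTrie
{
    private sealed class Node
    {
        public int Count; // how many input strings pass through this node
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
    }

    private readonly Node _root = new Node();

    // Walk the word letter by letter, creating branches and bumping counters.
    public void Add(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
            {
                child = new Node();
                node.Children[c] = child;
            }
            child.Count++;
            node = child;
        }
    }

    // Scores every prefix shared by at least minCount strings,
    // using S = a * depth + b * count, highest score first.
    public List<(string Prefix, int Count, double Score)> ScoredPrefixes(double a, double b, int minCount)
    {
        var result = new List<(string Prefix, int Count, double Score)>();
        Walk(_root, "", a, b, minCount, result);
        return result.OrderByDescending(r => r.Score).ToList();
    }

    private static void Walk(Node node, string prefix, double a, double b, int minCount,
        List<(string Prefix, int Count, double Score)> result)
    {
        foreach (var pair in node.Children)
        {
            var child = pair.Value;
            // Counts never grow further down, so a branch below minCount can be skipped whole.
            if (child.Count < minCount) continue;
            var childPrefix = prefix + pair.Key;
            result.Add((childPrefix, child.Count, a * childPrefix.Length + b * child.Count));
            Walk(child, childPrefix, a, b, minCount, result);
        }
    }
}
```

With a = 1, b = 1 and minCount = 2, the inputs foo-1, foo-2, bazf, bazg rank foo- first (score 6); baz appears with a count of 2 while bazg is dropped. Tuning a against b shifts the preference between longer prefixes and bigger groups.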
I'm wondering if your requirements aren't off. It seems as if you are looking for a specific group size as opposed to specific key size requirements. Below is a program that, based on a specified group size, breaks the strings into the largest possible groups up to and including that size. So if you specify a group size of 5, it will group items on the smallest key possible to make a group of size 5. In your example it would group foo- as f, since there is no need for a more complex key as an identifier.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace ConsoleApplication2
{
class Program
{
/// <remarks><c>true</c> in returned dictionary key are groups over <paramref name="maxGroupSize"/></remarks>
public static Dictionary<bool,Dictionary<string, List<string>>> Split(int maxGroupSize, int keySize, IEnumerable<string> items)
{
var smallItems = from item in items
where item.Length < keySize
select item;
var largeItems = from item in items
where keySize < item.Length
select item;
var largeItemsq = (from item in largeItems
let key = item.Substring(0, keySize)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
if (smallItems.Any())
{
var smallestLength = items.Aggregate(int.MaxValue, (acc, item) => Math.Min(acc, item.Length));
var smallItemsq = (from item in smallItems
let key = item.Substring(0, smallestLength)
group item by key into x
select new { Key = x.Key, Items = x.ToList() } into aGrouping
group aGrouping by aGrouping.Items.Count() > maxGroupSize into x2
select x2).ToDictionary(a => a.Key, a => a.ToDictionary(a_ => a_.Key, a_ => a_.Items));
return Combine(smallItemsq, largeItemsq);
}
return largeItemsq;
}
static Dictionary<bool, Dictionary<string,List<string>>> Combine(Dictionary<bool, Dictionary<string,List<string>>> a, Dictionary<bool, Dictionary<string,List<string>>> b) {
var x = new Dictionary<bool,Dictionary<string,List<string>>> {
{ true, null },
{ false, null }
};
foreach(var condition in new bool[] { true, false }) {
var hasA = a.ContainsKey(condition);
var hasB = b.ContainsKey(condition);
x[condition] = hasA && hasB ? a[condition].Concat(b[condition]).ToDictionary(c => c.Key, c => c.Value)
: hasA ? a[condition]
: hasB ? b[condition]
: new Dictionary<string, List<string>>();
}
return x;
}
public static Dictionary<string, List<string>> Group(int maxGroupSize, IEnumerable<string> items, int keySize)
{
var toReturn = new Dictionary<string, List<string>>();
var both = Split(maxGroupSize, keySize, items);
if (both.ContainsKey(false))
foreach (var key in both[false].Keys)
toReturn.Add(key, both[false][key]);
if (both.ContainsKey(true))
{
var keySize_ = keySize + 1;
var xs = from needsFix in both[true]
select needsFix;
foreach (var x in xs)
{
var fixedGroup = Group(maxGroupSize, x.Value, keySize_);
toReturn = toReturn.Concat(fixedGroup).ToDictionary(a => a.Key, a => a.Value);
}
}
return toReturn;
}
static Random rand = new Random(unchecked((int)DateTime.Now.Ticks));
const string allowedChars = "aaabbbbccccc"; // "aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ";
static readonly int maxAllowed = allowedChars.Length - 1;
static IEnumerable<string> GenerateText()
{
var list = new List<string>();
for (int i = 0; i < 100; i++)
{
var stringLength = rand.Next(3,25);
var chars = new List<char>(stringLength);
for (int j = stringLength; j > 0; j--)
chars.Add(allowedChars[rand.Next(0, maxAllowed)]);
var newString = chars.Aggregate(new StringBuilder(), (acc, item) => acc.Append(item)).ToString();
list.Add(newString);
}
return list;
}
static void Main(string[] args)
{
// runs 1000 times over autogenerated groups of sample text.
for (int i = 0; i < 1000; i++)
{
var s = GenerateText();
Go(s);
}
Console.WriteLine();
Console.WriteLine("DONE");
Console.ReadLine();
}
static void Go(IEnumerable<string> items)
{
var dict = Group(3, items, 1);
foreach (var key in dict.Keys)
{
Console.WriteLine(key);
foreach (var item in dict[key])
Console.WriteLine("\t{0}", item);
}
}
}
}

C# Linq question

I have a text file in which I am storing entries for an address book.
The layout is like so:
Name:
Contact:
Product:
Quantity:
I have written some LINQ code to grab the name plus the next four lines, for a search by name feature.
I also want to be able to search by contact.
The challenge is to match the contact info, grab the next 3 lines, and also grab the line prior to the match.
That way if Search By Contact is used, the full list of info will be returned.
private void buttonSearch_Click(object sender, EventArgs e)
{
string[] lines = File.ReadAllLines("C:/AddressBook/Customers.txt");
string name = textBoxSearchName.Text;
string contact = textBoxContact.Text;
if (name == "" && contact == "")
{
return;
}
var byName = from line in lines
where line.Contains(name)
select lines.SkipWhile(f => f != line).Take(4);
//var byContact = from line in lines
// where line.Contains(name)
// select lines.SkipWhile(f => f != name).Take(4);
if (name != "")
{
foreach (var item in byName)
foreach (var line in item) { listBox2.Items.Add(line); }
listBox2.Items.Add("");
}
//if (contact != "")
//{
// foreach (var item in byContact)
// foreach (var line in item) { listBox2.Items.Add(line); }
//listBox2.Items.Add("");
}
}
Firstly, I would recommend changing your data storage approach if you can.
Secondly, I would recommend reading the file into an object, something like this:
public class Customer // not "Contact": a class cannot have a member with its own name
{
public string Name {get; set;}
public string Contact {get; set;}
public string Product {get; set;}
public int Quantity {get; set;}
}
...
public IEnumerable<Customer> GetContacts()
{
//make this read line by line if it is big!
string[] lines = File.ReadAllLines("C:/AddressBook/Customers.txt");
//stop before a trailing, incomplete record
for (int i = 0; i + 3 < lines.Length; i += 4)
{
//add error handling/validation!
yield return new Customer()
{
Name = lines[i],
Contact = lines[i+1],
Product = lines[i+2],
Quantity = int.Parse(lines[i+3])
};
}
}
private void buttonSearch_Click(object sender, EventArgs e)
{
...
var results = from c in GetContacts()
where c.Name == name ||
c.Contact == contact
select c;
...
}
See if this will work
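// pair each line with its index first, so the index refers to lines and not to the filtered results
var contactLinesList = lines.Select((l, i) => new { Line = l, Index = i })
.Where(x => x.Line.Contains(contact))
.Select(x => lines.Skip(Math.Max(0, x.Index - 1)).Take(4).ToArray()).ToList();
// add the individual lines rather than the sequence object itself
contactLinesList.ForEach(cl => listBox2.Items.AddRange(cl));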
This is not the shortest code on earth, but it shows how to do a couple of things. I don't recommend using it as is, because it is quite hard to understand; treat it as hobbyist code written while learning. I suggest you load the file into a well-known structure and run LINQ on that. Anyway, this is a C# console application that does what you proposed using LINQ syntax and one extension method:
using System;
using System.Collections.Generic;
using System.Linq;
namespace stackoverflow.com_questions_5826306_c_linq_question
{
public class Program
{
public static void Main()
{
string fileData = @"
Name: Name-1
Contact: Xpto
Product: Abc
Quantity: 12
Name: Name-2
Product: Xyz
Contact: Acme
Quantity: 16
Name: Name-3
Product: aammndh
Contact: YKAHHYTE
Quantity: 2
";
string[] lines = fileData.Replace("\r\n", "\n").Split('\n');
var result = Find(lines, "contact", "acme");
foreach (var item in result)
Console.WriteLine(item);
Console.WriteLine("");
Console.WriteLine("Press any key");
Console.ReadKey();
}
private static string[] Find(string[] lines, string searchField, string searchValue)
{
var result = from h4 in
from g4 in
from i in (0).To(lines.Length)
select ((from l in lines select l).Skip(i).Take(4))
where !g4.Contains("")
select g4
where h4.Any(
x => x.Split(new char[] { ':' }, 2)[0].Equals(searchField, StringComparison.OrdinalIgnoreCase)
&& x.Split(new char[] { ':' }, 2)[1].Trim().Equals(searchValue, StringComparison.OrdinalIgnoreCase))
select h4;
var list = result.FirstOrDefault();
// FirstOrDefault returns null when no group matches
return list == null ? new string[0] : list.ToArray();
}
}
public static class NumberExtensions
{
public static IEnumerable<int> To(this int start, int end)
{
for (int it = start; it < end; it++)
yield return it;
}
}
}
If your text file is small enough, I'd recommend using regular expressions instead. This is exactly the sort of thing they're designed to do. Off the top of my head, the expression will look something like this:
(?im)^Name:(.*?)$ ^Contact:search_term$^Product:(.*?)$^Quantity:(.*?)$
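A slightly fuller sketch of the same idea, assuming every record is four consecutive labelled lines in the order Name, Contact, Product, Quantity. The helper name `FindByContact` and the output format are made up for illustration. In .NET, `^` and `$` are zero-width even in multiline mode, so the line break between fields has to be matched explicitly:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AddressBookRegex
{
    // Hypothetical helper: returns "name|contact|product|quantity" for the first
    // record whose Contact value equals searchTerm (ignoring case), or null.
    public static string FindByContact(string text, string searchTerm)
    {
        // \r?\n consumes each line break; [^\r\n]*? keeps every capture on one line.
        string pattern =
            @"^Name:[ \t]*(?<name>[^\r\n]*?)[ \t]*\r?\n" +
            @"Contact:[ \t]*(?<contact>" + Regex.Escape(searchTerm) + @")[ \t]*\r?\n" +
            @"Product:[ \t]*(?<product>[^\r\n]*?)[ \t]*\r?\n" +
            @"Quantity:[ \t]*(?<quantity>[^\r\n]*?)[ \t]*\r?$";

        Match m = Regex.Match(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        if (!m.Success)
            return null;

        return string.Join("|",
            m.Groups["name"].Value, m.Groups["contact"].Value,
            m.Groups["product"].Value, m.Groups["quantity"].Value);
    }
}
```

`Regex.Escape` is there so that a search term containing characters such as `.` or `(` is matched literally.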

Please critique my class

I took a few school classes a long time ago, and to be honest I never really understood the concept of classes. I recently "got back on the horse" and have been trying to find some real-world application for creating a class.
You may have seen that I'm trying to parse a lot of family tree data that is in a very old and antiquated format called GEDCOM.
I created a Gedcom reader class to read in the file, process it, and make it available as two lists that contain the data I found necessary to use.
More importantly to me, I created a class to do it, so I would very much like the experts here to tell me what I did right and what I could have done better (I won't say wrong, because the thing works and that's good enough for me).
Class:
using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
namespace GedcomReader
{
class Gedcom
{
private string GedcomText = "";
public struct INDI
{
public string ID;
public string Name;
public string Sex;
public string BDay;
public bool Dead;
}
public struct FAM
{
public string FamID;
public string Type;
public string IndiID;
}
public List<INDI> Individuals = new List<INDI>();
public List<FAM> Families = new List<FAM>();
public Gedcom(string fileName)
{
using (StreamReader SR = new StreamReader(fileName))
{
GedcomText = SR.ReadToEnd();
}
ReadGedcom();
}
private void ReadGedcom()
{
string[] Nodes = GedcomText.Replace("0 #", "\u0646").Split('\u0646');
foreach (string Node in Nodes)
{
string[] SubNode = Node.Replace("\r\n", "\r").Split('\r');
if (SubNode[0].Contains("INDI"))
{
Individuals.Add(ExtractINDI(SubNode));
}
else if (SubNode[0].Contains("FAM"))
{
Families.Add(ExtractFAM(SubNode));
}
}
}
private FAM ExtractFAM(string[] Node)
{
string sFID = Node[0].Replace("# FAM", "");
string sID = "";
string sType = "";
foreach (string Line in Node)
{
// If node is HUSB
if (Line.Contains("1 HUSB "))
{
sType = "PAR";
sID = Line.Replace("1 HUSB ", "").Replace("#", "").Trim();
}
//If node for Wife
else if (Line.Contains("1 WIFE "))
{
sType = "PAR";
sID = Line.Replace("1 WIFE ", "").Replace("#", "").Trim();
}
//if node for multi children
else if (Line.Contains("1 CHIL "))
{
sType = "CHIL";
sID = Line.Replace("1 CHIL ", "").Replace("#", "");
}
}
FAM Fam = new FAM();
Fam.FamID = sFID;
Fam.Type = sType;
Fam.IndiID = sID;
return Fam;
}
private INDI ExtractINDI(string[] Node)
{
//If a individual is found
INDI I = new INDI();
if (Node[0].Contains("INDI"))
{
//Create new Structure
//Add the ID number and remove extra formatting
I.ID = Node[0].Replace("#", "").Replace(" INDI", "").Trim();
//Find the name and remove extra formatting for last name
I.Name = Node[FindIndexinArray(Node, "NAME")].Replace("1 NAME", "").Replace("/", "").Trim();
//Find Sex and remove extra formatting
I.Sex = Node[FindIndexinArray(Node, "SEX")].Replace("1 SEX ", "").Trim();
//Determine if there is a birthday; -1 means no
if (FindIndexinArray(Node, "1 BIRT ") != -1)
{
// add birthday to Struct
I.BDay = Node[FindIndexinArray(Node, "1 BIRT ") + 1].Replace("2 DATE ", "").Trim();
}
// determine if there is a death tag; FindIndexinArray returns -1 if not found
if (FindIndexinArray(Node, "1 DEAT ") != -1)
{
//convert Y or N to true or false (defaults to false, so no need to change unless Y is found)
if (Node[FindIndexinArray(Node, "1 DEAT ")].Replace("1 DEAT ", "").Trim() == "Y")
{
//set death
I.Dead = true;
}
}
}
return I;
}
private int FindIndexinArray(string[] Arr, string search)
{
int Val = -1;
for (int i = 0; i < Arr.Length; i++)
{
if (Arr[i].Contains(search))
{
Val = i;
}
}
return Val;
}
}
}
Implementation:
using System;
using System.Windows.Forms;
using GedcomReader;
namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void button1_Click(object sender, EventArgs e)
{
string path = @"C:\mostrecent.ged";
string outpath = @"C:\gedcom.txt";
Gedcom GD = new Gedcom(path);
GraphvizWriter GVW = new GraphvizWriter("Family Tree");
foreach(Gedcom.INDI I in GD.Individuals)
{
string color = "pink";
if (I.Sex == "M")
{
color = "blue";
}
GVW.ListNode(I.ID, I.Name, "filled", color, "circle");
if (I.ID == "ind23800")
{MessageBox.Show("stop");}
//"ind23800" [ label="Sarah Mandley",shape="circle",style="filled",color="pink" ];
}
foreach (Gedcom.FAM F in GD.Families)
{
// compare against the upper-case values that ExtractFAM assigns
if (F.Type == "PAR")
{
GVW.ConnNode(F.FamID, F.IndiID);
}
else if (F.Type == "CHIL")
{
GVW.ConnNode(F.IndiID, F.FamID);
}
}
string x = GVW.SB.ToString();
GVW.SaveFile(outpath);
MessageBox.Show("done");
}
}
}
I am particularly interested in whether anything could be done about the structures. I don't know if the way I use them in the implementation is the best, but again, it works.
Thanks a lot
Quick thoughts:
Nested types should not be visible.
ValueTypes (structs) should be immutable.
Fields (class variables) should not be public. Expose them via properties instead.
Check passed arguments for invalid values, like null.
It could be more readable. As written, it's hard to read and understand.
You may study SOLID principles (http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod)
Robert C. Martin gave a good presentation at Oredev 2008 about clean code (http://www.oredev.org/topmenu/video/agile/robertcmartincleancodeiiifunctions.4.5a2d30d411ee6ffd2888000779.html)
Some recommended books about code readability:
Kent Beck, "Implementation Patterns"
Robert C. Martin, "Clean Code"
Robert C. Martin, "Agile Principles, Patterns, and Practices in C#"
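To make a few of the points above concrete (no publicly visible nested types, no mutable structs, properties instead of public fields, argument checks), here is a rough sketch of how the `INDI` struct could be reshaped. The names `Individual`, `BirthDate` and `IsDead` are invented for this example, and using a class rather than a struct is just one possible choice:

```csharp
using System;

// Hypothetical top-level replacement for the nested, mutable INDI struct.
public sealed class Individual
{
    // Read-only from the outside: values are set once, in the constructor.
    public string Id { get; private set; }
    public string Name { get; private set; }
    public string Sex { get; private set; }
    public string BirthDate { get; private set; } // may be null when unknown
    public bool IsDead { get; private set; }

    public Individual(string id, string name, string sex, string birthDate, bool isDead)
    {
        // Validate arguments up front instead of failing later.
        if (string.IsNullOrEmpty(id))
            throw new ArgumentException("An ID is required.", "id");
        if (name == null)
            throw new ArgumentNullException("name");

        Id = id;
        Name = name;
        Sex = sex;
        BirthDate = birthDate;
        IsDead = isDead;
    }
}
```

`ExtractINDI` would then collect the values into local variables and call the constructor once at the end, rather than assigning fields one by one.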
I suggest you check this place out: http://refactormycode.com/.
For some quick things, your naming is the biggest thing I would start to change.
No need to use ALL-CAPS or abbreviated terms.
Also, FxCop will help with a lot of suggested changes. For example, FindIndexinArray would be named FindIndexInArray.
EDIT:
I don't know if this is a bug in your code or by-design, but in FindIndexinArray, you don't break from your loop once you find a match. Do you want the first (break) or last (no break) match in the array?
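If it helps, both behaviours can be spelled out explicitly with the built-in `Array.FindIndex` and `Array.FindLastIndex`, which return -1 when nothing matches. This is only a sketch; the class and method names are made up:

```csharp
using System;

public static class ArraySearch
{
    // Index of the FIRST line containing search, or -1 (what a loop with break would return).
    public static int FirstIndexContaining(string[] lines, string search)
    {
        if (lines == null) throw new ArgumentNullException("lines");
        if (search == null) throw new ArgumentNullException("search");
        return Array.FindIndex(lines, line => line.Contains(search));
    }

    // Index of the LAST line containing search, or -1 (what the current loop without break returns).
    public static int LastIndexContaining(string[] lines, string search)
    {
        if (lines == null) throw new ArgumentNullException("lines");
        if (search == null) throw new ArgumentNullException("search");
        return Array.FindLastIndex(lines, line => line.Contains(search));
    }
}
```

Naming the method after the behaviour you actually want makes the intent obvious to the next reader.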
