Data structure for storing keywords and synonyms in C#?


I'm working on a project in C# where I need to store somewhere between 10 and 15 keywords and their synonyms.
The first way I thought of to store these was a 2D list, something like List<List<string>>, so that it would look like:
keyword1 synonym1 synonym2
keyword2 synonym1
keyword3 synonym1 synonym2
etc.
What I started to wonder is this: if I take an input string and split it so I can check whether each word is a keyword or a synonym of a keyword in the list, will a 2D list be fine, or will searching it be too slow?
Hopefully my question makes sense. If anything is unclear, just ask and I can clarify. Thanks!

will searching [the list] be too slow?
When you are talking about 10..15 keywords, it is hard to come up with an algorithm inefficient enough to make end-users notice the slowness. There's simply not enough data to slow down a modern CPU.
One approach would be to build a Dictionary<string,string> that maps every synonym to its "canonical" keyword. This would include the canonical version itself:
var keywords = new Dictionary<string,string> {
    ["keyword1"] = "keyword1",
    ["synonym1"] = "keyword1",
    ["synonym2"] = "keyword1",
    ["keyword2"] = "keyword2",
    ["synonym3"] = "keyword2",
    ["keyword3"] = "keyword3"
};
Note how both keywords and synonyms appear as keys, while only keywords appear as values. This lets you look up a keyword or synonym, and get back a guaranteed keyword.
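To tie this back to the question's "split the input and check each word" step, here is a minimal sketch of the lookup. The class name, method name, and sample words are made up for illustration, and the case-insensitive comparer is an assumption about what you want, not something the question asked for.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordLookup
{
    // Hypothetical sample data. OrdinalIgnoreCase makes "PC" and "pc" hit the same entry.
    private static readonly Dictionary<string, string> Canonical =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["computer"] = "computer",
            ["desktop"] = "computer",
            ["pc"] = "computer",
            ["car"] = "car",
            ["automobile"] = "car"
        };

    // Splits the input on whitespace and returns each distinct canonical keyword,
    // in order of first appearance. Words that are not in the map are skipped.
    public static List<string> FindKeywords(string input)
    {
        var found = new List<string>();
        foreach (var word in input.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            string keyword;
            if (Canonical.TryGetValue(word, out keyword) && !found.Contains(keyword))
                found.Add(keyword);
        }
        return found;
    }

    public static void Main()
    {
        // Prints: computer, car
        Console.WriteLine(string.Join(", ", FindKeywords("my PC and my automobile and a desktop")));
    }
}
```

TryGetValue is used instead of the indexer so that ordinary words in the input are ignored rather than throwing a KeyNotFoundException.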

I would probably use a Dictionary, where the key is your synonym and the value is your key word. That way you do a lookup in the Dictionary for any word and get the actual key word you want. For example:
private Dictionary<string, string> synonymKeywordDict = new Dictionary<string, string>();
public SearchResult Search(IEnumerable<string> searchTerms)
{
    // Skip terms that are neither a key word nor a synonym; the indexer throws on unknown keys.
    var keywords = searchTerms
        .Where(x => synonymKeywordDict.ContainsKey(x))
        .Select(x => synonymKeywordDict[x])
        .Distinct()
        .ToList();
    //keywords now contains your key words after being translated from any synonyms
}
Just in case I'm not clear enough the Dictionary would be loaded like so.
private void LoadDictionary()
{
//So our lookup doesn't fail on the key word itself.
synonymKeywordDict.Add("computer", "computer");
//Then all our synonyms
synonymKeywordDict.Add("desktop", "computer");
synonymKeywordDict.Add("PC", "computer");
}
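If you already have the data in the "keyword followed by its synonyms" shape from the question, you don't have to type every Add call by hand. A rough sketch of flattening it into the lookup dictionary, assuming each row lists the key word first (the names SynonymMapBuilder and Build are mine):

```csharp
using System;
using System.Collections.Generic;

public static class SynonymMapBuilder
{
    // Takes rows shaped like the question's 2D list (key word first, then its synonyms)
    // and flattens them into a synonym -> key word dictionary.
    public static Dictionary<string, string> Build(List<List<string>> rows)
    {
        var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var row in rows)
        {
            if (row.Count == 0) continue;   // ignore empty rows
            var keyword = row[0];
            foreach (var word in row)
                map[word] = keyword;        // index 0 maps the key word to itself
        }
        return map;
    }

    public static void Main()
    {
        var rows = new List<List<string>>
        {
            new List<string> { "computer", "desktop", "PC" },
            new List<string> { "car", "automobile" }
        };
        var map = Build(rows);
        Console.WriteLine(map["pc"]);       // computer (lookups ignore case in this sketch)
    }
}
```

Note that with the indexer assignment, a synonym listed under two key words silently ends up pointing at the later one; use Add instead if you would rather get an exception for that.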

Related

Parsing this special format file

I have a file that is formatted this way --
{2000}000000012199{3100}123456789*{3320}110009558*{3400}9876
54321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX
78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLAS
TX 73920**
Basically, the number in curly brackets denotes field, followed by the value for that field. For example, {2000} is the field for "Amount", and the value for it is 121.99 (implied decimal). {3100} is the field for "AccountNumber" and the value for it is 123456789*.
I am trying to figure out a way to split the file into "records" and each record would contain the record type (the value in the curly brackets) and record value, but I don't see how.
How do I do this without a loop going through each character in the input?

A different way to look at it.... The { character is a record delimiter, and the } character is a field delimiter. You can just use Split().
var input = @"{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var rows = input.Split( new [] {"{"} , StringSplitOptions.RemoveEmptyEntries);
foreach (var row in rows)
{
var fields = row.Split(new [] { "}"}, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine("{0} = {1}", fields[0], fields[1]);
}
Output:
2000 = 000000012199
3100 = 123456789*
3320 = 110009558*
3400 = 987654321*
3600 = CTR
4200 = D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**
5000 = D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**

This regular expression should get you going:
Match a literal {
Match 1 or more digits ("a number")
Match a literal }
Match all characters that are not an opening {
\{\d+\}[^{]+
It assumes that the values themselves cannot contain an opening curly brace. If they can, you need to be more clever, e.g. @"\{\d+\}(?:\\{|[^{])+", which allows a backslash-escaped brace inside a value (there are likely better ways)
Create a Regex instance and have it match against the text. Each "field" will be a separate match
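var text = @"{123}abc{456}xyz";
var regex = new Regex(@"\{\d+\}[^{]+", RegexOptions.Compiled);
// Declare the loop variable as Match: on older frameworks MatchCollection only enumerates as object.
foreach (Match match in regex.Matches(text)) {
    Console.WriteLine(match.Groups[0].Value);
}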
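If you want the field number and the value separately rather than as one matched string, the same pattern can be given named groups. This is only a sketch: the group names, the FieldParser class, and the ParseImpliedDecimal helper are my own, and the "two implied decimal places" rule is taken from the question's description of field {2000}.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Same pattern as above, with named groups for the tag and the value.
    private static readonly Regex FieldPattern =
        new Regex(@"\{(?<tag>\d+)\}(?<value>[^{]+)");

    // Returns the fields in file order. A list is used rather than a dictionary
    // because nothing in the question says a tag cannot repeat.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var fields = new List<KeyValuePair<string, string>>();
        foreach (Match m in FieldPattern.Matches(text))
            fields.Add(new KeyValuePair<string, string>(m.Groups["tag"].Value, m.Groups["value"].Value));
        return fields;
    }

    // Assumption from the question: amounts carry two implied decimal places.
    public static decimal ParseImpliedDecimal(string digits)
    {
        return decimal.Parse(digits, NumberStyles.None, CultureInfo.InvariantCulture) / 100m;
    }

    public static void Main()
    {
        foreach (var field in Parse("{2000}000000012199{3100}123456789*"))
            Console.WriteLine(field.Key + " = " + field.Value);
    }
}
```

With the sample from the question, ParseImpliedDecimal("000000012199") gives 121.99 as a decimal.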

This doesn't fully answer the question, but it was getting too long to be a comment, so I'm leaving it here in Community Wiki mode. It does, at least, present a better strategy that may lead to a solution:
The main thing to understand here is it's rare — like, REALLY rare — to genuinely encounter a whole new kind of a file format for which an existing parser doesn't already exist. Even custom applications with custom file types will still typically build the basic structure of their file around a generic format like JSON or XML, or sometimes an industry-specific format like HL7 or MARC.
The strategy you should follow, then, is to first determine exactly what you're dealing with. Look at the software that generates the file; is there an existing SDK, reference, or package for the format? Or look at the industry surrounding this data; is there a special set of formats related to that industry?
Once you know this, you will almost always find an existing parser ready and waiting, and it's usually as easy as adding a NuGet package. These parsers are genuinely faster, need less code, and will be less susceptible to bugs (because most will have already been found by someone else). It's just an all-around better way to address the issue.
Now what I see in the question isn't something I recognize, so it's just possible you genuinely do have a custom format for which you'll need to write a parser from scratch... but even so, it doesn't seem like we're to that point yet.

Here is how to do it with LINQ, without a regex:
string x = "{2000}000000012199{3100}123456789*{3320}110009558*{3400}987654321*{3600}CTR{4200}D2343984*JOHN DOE*1232 STREET*DALLAS TX78302**{5000}D9210293*JANE DOE*1234 STREET*SUITE 201*DALLASTX 73920**";
var result =
    x.Split(new[] { '{' }, StringSplitOptions.RemoveEmptyEntries)
     .Aggregate(new List<Tuple<string, string>>(),
         (l, z) =>
         {
             var az = z.Split('}');
             l.Add(new Tuple<string, string>(az[0], az[1]));
             return l;
         });

How to localize a string in unity where different languages may have different grammars

I'm translating a Unity game and some of the lines go like
Unlock at XXXX
where "XXXX" is replaced at runtime by an arbitrary substring. It's easy enough to replace the wildcards, but to translate the line I can't simply concatenate a + b, as some languages will put the value before or inside the string. I figured I needed to, effectively, de-replace it, i.e. isolate and keep the substring and translate whatever's around it.
Problem is that while I can easily do the second part, I can't think of any avenues for the first. I know to get the character index of what I'm looking for, but the value takes up an arbitrary number of characters, and I can't use whitespace since some languages don't use it. Can't use digit detection since not all of the values are going to be numbers. I tried asking Google, but I couldn't translate "find whatever replaces a wildcard" into something keyword-searchable.
In short, what I'm looking for is a way to find the "XXXX" (the easy part) and then find whatever replaces it in the string (the less-easy part).
Thanks in advance.

I eventually found a workaround, thanks to everybody's kind advice. I stored the substring and referred to it in a special translation method that does take in a value. Thanks for your kind help, everybody.
public static string TranslateWithValue (string text, string value, int language) {
string sauce = text.Replace (value, "XXXX");
sauce = Translate (sauce, language);
sauce = sauce.Replace ("XXXX", value);
return sauce;
}

Usually, I use string.Format in such cases. In your case, I'd declare 2 localizable strings:
string unlockFormat = "Unlock at {0}";
string unlockValue = "next level";
When you need the unlock condition displayed, you can combine the strings like that:
string unlockCondition = string.Format(unlockFormat, unlockValue);
which will produce the string "Unlock at next level".
Both unlockFormat and unlockValue can be translated, and the translator can move {0} wherever needed.
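To show the placeholder actually moving, here is a small sketch with one format string per language. The UnlockText class, the language codes, and the lookup table are hypothetical, and the Japanese string is only there to illustrate the value coming first; it is not a vetted translation.

```csharp
using System;
using System.Collections.Generic;

public static class UnlockText
{
    // Hypothetical per-language format strings. Each translator decides where {0} goes.
    private static readonly Dictionary<string, string> UnlockFormats =
        new Dictionary<string, string>
        {
            { "en", "Unlock at {0}" },
            { "ja", "{0}でアンロック" }
        };

    // Falls back to English when there is no format for the requested language.
    public static string Build(string language, string value)
    {
        string format;
        if (!UnlockFormats.TryGetValue(language, out format))
            format = UnlockFormats["en"];
        return string.Format(format, value);
    }

    public static void Main()
    {
        Console.WriteLine(Build("en", "level 5")); // Unlock at level 5
        Console.WriteLine(Build("ja", "level 5")); // level 5でアンロック
    }
}
```

The calling code never needs to know where the value sits in the sentence; that decision lives entirely in the translated format string.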

Load string collection from external data in C#

I have a class that has a collection of strings that is used to validate values before storing them. The declaration for the collection looks like this:
public readonly System.Collections.Specialized.StringCollection RetentionRange =
new System.Collections.Specialized.StringCollection()
{ "NR", "AR", "PE", "EX", "LI", "TE", "FR" };
I'd like to maintain the list of valid codes outside of the compiled class. How do I go about that? BTW, there's no requirement for the strings to be limited to two characters, they just happen to be in the current scheme.
EDIT: By "external data" I mean something like a config file.

You can store a StringCollection in the application settings.
In the settings editor, each value goes on its own line.
You can read it this way:
var myStringCollection = Properties.Settings.Default.MyCollection;
foreach (String value in myStringCollection)
{
// do something
}

Storing it in a text file is an option.
How to store:
var dir = @"somedirectory\textfile.txt";
var text = string.Join(",", RetentionRange.Cast<string>());
File.WriteAllText(dir, text);
How to retrieve:
var text = File.ReadAllText(@"somedirectory\textfile.txt");
foreach (var str in text.Split(','))
    RetentionRange.Add(str); // add each code, not the whole file contents

The right answer to your question depends a great deal on why you want to store the strings outside the assembly.
Assuming the reason you want this is because the collection of strings is expected to change over time, I would suggest you create your own System.Configuration.ConfigurationSection implementation that defines one of the elements as a System.Configuration.ConfigurationElementCollection instance. If this is a tad too complicated for your requirements, then an appSettings key with a value consisting of a comma separated list of strings (from which you would build your StringCollection at runtime) might actually be a better solution.
The answers to this question have examples of both approaches.
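For the simpler of the two options, a sketch of turning a comma-separated value into a StringCollection at runtime might look like this. The class and method names are made up, and the raw string is passed in as a parameter so the example runs on its own; in an application it would typically come from something like ConfigurationManager.AppSettings["RetentionRange"], where the key name is hypothetical.

```csharp
using System;
using System.Collections.Specialized;

public static class RetentionCodes
{
    // Builds a StringCollection from a comma-separated setting value.
    // Whitespace around codes is trimmed; empty entries and duplicates are dropped.
    public static StringCollection Parse(string raw)
    {
        var codes = new StringCollection();
        if (string.IsNullOrWhiteSpace(raw))
            return codes;

        foreach (var part in raw.Split(new[] { ',' }, StringSplitOptions.RemoveEmptyEntries))
        {
            var code = part.Trim();
            if (code.Length > 0 && !codes.Contains(code))
                codes.Add(code);
        }
        return codes;
    }

    public static void Main()
    {
        var codes = Parse("NR,AR,PE,EX,LI,TE,FR");
        Console.WriteLine(codes.Count);          // 7
        Console.WriteLine(codes.Contains("PE")); // True
    }
}
```

Nothing here limits codes to two characters, which matches the note in the question.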

Highlighting keywords in text

I have a SQL Server 2008 database that has one table with a varchar(1000) field that contains a bunch of user input about books. I have another table that contains a bunch of keywords. When I render the user's info about the books, I want to highlight (or eventually create a hyperlink) on these keywords. I'm looking for suggestions on the most efficient method for scanning through the text and matching the keywords up. I wasn't sure if there's a way to do it right in SQL or if it needs to be in the code. Thanks.

I recommend doing it in code. It's business logic, and besides doing so lightens the load in the database - so if the database is in a machine other than the servers running your app, you don't hog that machine's resources.
I think regex would do the trick for you - it's the most efficient method of text matching ever invented, and the internal implementation in most technologies (not only .NET) is pretty much the best you can get. If you try to come up with something else you'll at best be reinventing the wheel.
So I'd do this: place every keyword in a hashtable or dictionary, which has the bonus of cutting off duplicates, then iterate through that. Then, for each match of a keyword in the main text, you can get the first and last index of the match and wrap that with markup for the highlight and the link.
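As a rough sketch of the regex route, one option is to build a single alternation out of all the keywords and let Regex.Replace do the wrapping, instead of looping keyword by keyword. The class name, the mark tag, and the case-insensitive matching are my assumptions; the keywords would come from your keyword table.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordHighlighter
{
    // Wraps every whole-word occurrence of a keyword in <mark>...</mark>.
    public static string Highlight(string text, IEnumerable<string> keywords)
    {
        // Longest first, so "book review" wins over "book" when both are keywords.
        var alternatives = keywords
            .Where(k => !string.IsNullOrWhiteSpace(k))
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .OrderByDescending(k => k.Length)
            .Select(k => Regex.Escape(k))
            .ToList();
        if (alternatives.Count == 0)
            return text;

        // \b keeps "book" from matching inside "bookish".
        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        return Regex.Replace(text, pattern,
            m => "<mark>" + m.Value + "</mark>", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(Highlight(
            "An okay review of an awesome book, not bookish.",
            new[] { "okay", "book" }));
    }
}
```

Two caveats: \b only works as intended for keywords that start and end with a letter, digit, or underscore, and user-entered text should be HTML-encoded before the markup is added.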

Here's a quick LINQPad program that I wrote up that will replace values in a string with dictionary values that match up on the keys. Let me know if this is what you were looking for.
Side Note: I agree with everyone else that you should be doing this in the application layer for a variety of reasons.
void Main()
{
Dictionary<string, string> links = new Dictionary<string, string>();
links.Add("awesome", "link-to-awesome");
links.Add("okay", "link-to-okay");
string text = "This is some text about an okay book review of an otherwise awesome book.";
string result = links.Aggregate(text, (current, kvp) => current.Replace(kvp.Key, kvp.Value));
text.Dump();
result.Dump();
}
Results:
This is some text about an okay book review of an otherwise awesome book.
This is some text about an link-to-okay book review of an otherwise link-to-awesome book.
Edit: This isn't a perfect example. You'd have to strip out punctuation and what-not for a final version. Hopefully this gets you on the right track.

If the links are fixed then personally I would do this when WRITING to the DB, as it will only be done once and then requires no extra effort to display.

How to enumerate Linq results?

Referring to this topic: Minimize LINQ string token counter
And using the following provided code:
string src = "for each character in the string, take the rest of the " +
"string starting from that character " +
"as a substring; count it if it starts with the target string";
var results = src.Split() // default split by whitespace
.GroupBy(str => str) // group words by the value
.Select(g => new
{
str = g.Key, // the value
count = g.Count() // the count of that value
});
I need to enumerate all values (keywords and number of occurrences) and load them into a NameValueCollection. Due to my very limited LINQ knowledge, I can't figure out how to do it. Please advise. Thanks.

I won't guess why you want to put anything in a NameValueCollection, but is there some reason why
foreach (var result in results)
collection.Add(result.str, result.count.ToString());
is not sufficient?
(EDIT: Changed accessor to Add, which may be better for your use case.)
If the answer is "no, that works" you should probably stop and figure out what the hell the above code is doing before using it in your project.

Looks like your particular problem could just as easily use a Dictionary instead of a NameValueCollection. I forget if this is the correct ToDictionary syntax, but just google the ToDictionary() method:
Dictionary<string, int> useADictionary = results.ToDictionary(x => x.str, x => x.count);

You certainly want a Dictionary instead of NameValueCollection. The whole point is to show unique tokens (strings) with each token's occurrence count (an int), yes?
NameValueCollection is a special-purpose collection that requires both the key and the value to be strings - Dictionary<string, int> is the mainstream .NET way to associate a unique string key with its corresponding int value.
Take a look at the various System.Collections namespaces to understand what each is intended to achieve. Typically these days, System.Collections.Generic is the most widely-seen, with System.Collections.Concurrent for multithreaded programs.
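Putting the pieces from the answers together, a self-contained sketch could look like the following. TokenCounter and its method names are mine; the NameValueCollection conversion is only there in case you really are forced to use that type.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Linq;

public static class TokenCounter
{
    // Counts how often each whitespace-separated token occurs.
    public static Dictionary<string, int> Count(string src)
    {
        return src.Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                  .GroupBy(word => word)
                  .ToDictionary(g => g.Key, g => g.Count());
    }

    // Only needed if a NameValueCollection is really required: values must be strings.
    public static NameValueCollection ToNameValueCollection(Dictionary<string, int> counts)
    {
        var collection = new NameValueCollection();
        foreach (var pair in counts)
            collection.Add(pair.Key, pair.Value.ToString());
        return collection;
    }

    public static void Main()
    {
        var counts = Count("the string and the target string");
        foreach (var pair in counts)
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

Keep in mind that NameValueCollection.Add appends to an existing key instead of replacing its value, which is another reason the dictionary is the more natural fit for counts.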
