Finding Definition of an Acronym in a sentence - c#

I am working on a sample program in C# with Visual Studio 2013. I have some logic that finds an all-uppercase acronym, as shown below:
string docStr = "Made at Training And Doctrine (TAD)";
string allUpperRegStr = "\\([A-Z]{2,}\\)";
Match mUpper = Regex.Match(docStr, allUpperRegStr);
if (mUpper.Success)
{
string remWS = mUpper.Value.Trim();
}
The above logic finds the (TAD) acronym. What I need is a way to parse the sentence and find the definition of the acronym, which here is Training And Doctrine. Any help would be appreciated.

You should construct a new regex that looks something like (T[a-z]+\sA[a-z]+\sD[a-z]+), which should capture "Training And Doctrine". You may also have to handle definitions that contain punctuation or other variations (multiple spaces, for example) and adjust the regex string accordingly.
EDIT: Full Solution - EDIT2: Ignore case (this was not validated to work yet)
string docStr = "Made at Training And Doctrine (TAD)";
string allUpperRegStr = "\\([A-Z]{2,}\\)";
Match mUpper = Regex.Match(docStr, allUpperRegStr);
if (mUpper.Success)
{
string remWS = mUpper.Value.Trim();
char[] chars = remWS.ToCharArray();
IEnumerable<string> lowerUpper = from l in chars
where l !='(' && l != ')'
select string.Format("[{0}{1}][a-z]+", Char.ToLower(l), Char.ToUpper(l));
string regex2 = string.Format("({0})", string.Join("\\s", lowerUpper));
Match mDefinition = Regex.Match(docStr, regex2);
if (mDefinition.Success)
{
string definition = mDefinition.Value.Trim();
}
}
Latest fix included:
Fixing target regex string (the last character has to be followed by [a-z] as well)
Strip '(' & ')' from lowerUpper
Working regex sample
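The steps in this answer can be wrapped into one helper. This is only a minimal sketch: the class and method names (AcronymFinder, FindDefinition) are made up for illustration, and two details are assumptions that differ from the answer's code: a \b word boundary before each word and \s+ between words, so a match cannot start mid-word and multiple spaces are tolerated.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of the original answer.
public static class AcronymFinder
{
    // Returns the words that spell out the first "(ABC)"-style acronym in text,
    // or null when there is no acronym or no matching definition.
    public static string FindDefinition(string text)
    {
        // Capture only the letters between the parentheses.
        Match acronym = Regex.Match(text, @"\(([A-Z]{2,})\)");
        if (!acronym.Success)
        {
            return null;
        }

        // One sub-pattern per acronym letter, e.g. \b[tT][a-z]+ for 'T'.
        var wordPatterns = acronym.Groups[1].Value.Select(
            c => string.Format(@"\b[{0}{1}][a-z]+", char.ToLowerInvariant(c), c));

        // Assumption: words are separated by one or more whitespace characters.
        string definitionPattern = string.Join(@"\s+", wordPatterns);

        Match definition = Regex.Match(text, definitionPattern);
        return definition.Success ? definition.Value : null;
    }
}
```

Calling AcronymFinder.FindDefinition(docStr) with the sample sentence should return the definition text; definitions that contain punctuation or hyphens would still need a looser word pattern.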

Just use the words directly as your regex, like \Training And Doctrine\
A regex parser will recognize the pattern based on a string.

Related

C# Extract part of the string that starts with specific letters

I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[@class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
And I'm trying to extract the B002755TC0 value, but the problem here is that the string will vary by its length and I cannot simply use Substring method that C# offers to extract that value...
Instead, I was wondering whether there's a clever way to do this, perhaps by matching the part of the string that precedes the value I'm looking for?
For example I know for a fact that each href has this structure like I've shown, So I would simply match these keywords:
offer-listing/
So I would find this keyword and start extracting the part of the string B002755TC0 until the next " / " sign ?
Can someone help me out with this ?
This is a perfect job for a regular expression :
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(@"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation : we just match the exact pattern you need.
'offer-listing/'
followed by any combination of (at least one) 'word characters' (letters, digits, and the underscore),
followed by a slash.
The parentheses () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you want to extract also from this : /dp/B01KRHBT9Q/
Then you could use this pattern :
Regex pattern = new Regex(@"/(\w+)/$");
which will match both this string and the previous. The $ stands for the end of the string, so this literally means :
capture the characters in between the last two slashes of the string
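The snippet above reads match.Groups[1] without checking whether the match succeeded. A minimal sketch of the same pattern with that check added; the names HrefParser and ExtractListingId are hypothetical, and returning null for a missing segment is just one possible convention.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class HrefParser
{
    // Returns the path segment that follows "offer-listing/",
    // or null when the href does not contain that segment.
    public static string ExtractListingId(string href)
    {
        Match match = Regex.Match(href, @"offer-listing/(\w+)/");
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

With the href from the question, HrefParser.ExtractListingId("gp/offer-listing/B002755TC0/") yields the B002755TC0 part.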
Though there is already an accepted answer, I thought of sharing another solution that doesn't use Regex. Find the position of your pattern in the input and add its length; the wanted text starts at that index. To find the end, search for the first "/" after the beginning of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int begining = input.IndexOf(pat)+pat.Length;
int end = input.IndexOf("/",begining);
string result = input.Substring(begining,end-begining);
If your desired output is always the last piece, you could also use split and get the last non-empty piece:
string result2 = input.Split(new string[]{"/"},StringSplitOptions.RemoveEmptyEntries)
.ToList().Last();

Find all words without figures using RegEx

I found this code to get all words of a string,
static string[] GetWords(string input)
{
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
var words = from m in matches.Cast<Match>()
where !string.IsNullOrEmpty(m.Value)
select TrimSuffix(m.Value);
return words.ToArray();
}
static string TrimSuffix(string word)
{
int apostrapheLocation = word.IndexOf('\'');
if (apostrapheLocation != -1)
{
word = word.Substring(0, apostrapheLocation);
}
return word;
}
Please explain the code.
How can I get words without figures?
2 How can I get words without figures?
You'll have to replace \w with [A-Za-z]
So that your RegEx becomes @"\b[A-Za-z']*\b"
And then you'll have to think about TrimSuffix(). The regEx allows apostrophes but TrimSuffix() will extract only the left part. So "it's" will become "it".
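As a rough sketch of that change (WordExtractor and GetLetterWords are hypothetical names), the letters-only pattern can be applied like this. It uses + instead of *, which is an assumption that avoids empty matches, and it keeps apostrophes rather than calling TrimSuffix.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class WordExtractor
{
    // Returns words made only of ASCII letters and apostrophes.
    // Tokens that mix letters and digits (e.g. "abc123") are skipped entirely,
    // because \b cannot match between a letter and a digit.
    public static string[] GetLetterWords(string input)
    {
        return Regex.Matches(input, @"\b[A-Za-z']+\b")
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }
}
```

Note that [A-Za-z] leaves out accented letters; if those matter, \p{L} may be a better fit than the ASCII ranges.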
In
MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");
the code uses a regex that looks for any word: \b is a word boundary and \w is the word-character class, which matches letters (with or without accents), digits and the underscore. The ' is simply added to the set alongside \w. So the pattern selects everything between the beginning and the end of each word.
then
var words = from m in matches.Cast<Match>()
where !string.IsNullOrEmpty(m.Value)
select TrimSuffix(m.Value);
is LINQ query syntax, which lets you write SQL-like queries inside your code. It takes every match from the regex, skips the empty ones, and passes each value through TrimSuffix. It's also where you can add your figure validation.
And this:
static string TrimSuffix(string word)
{
int apostrapheLocation = word.IndexOf('\'');
if (apostrapheLocation != -1)
{
word = word.Substring(0, apostrapheLocation);
}
return word;
}
removes the apostrophe and everything after it, keeping only the part before it.
For example, for the word don't it returns only don.

Google-like search query tokenization & string splitting

I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces within double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for regexes. If it gets significantly more complicated, a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary, as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
// Groups[1] is the token itself; Groups[0] would also include the trailing whitespace
string group = m.Groups[1].Value;
}
The real benefit of this method is that it can easily be extended to include your "-" requirement, like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
// Groups[1] is the token itself; Groups[0] would also include the trailing whitespace
string group = m.Groups[1].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
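The explanation above could be turned into a small helper along these lines. It is a sketch rather than the answer's exact code: QueryTokenizer and Tokenize are hypothetical names, the four alternatives are folded into two with an optional leading minus (-?), and stripping the quotes from each token is an added assumption about the desired output.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class QueryTokenizer
{
    // Returns each term with its quotes removed.
    // Excluded terms keep their leading "-" so the caller can detect them.
    public static List<string> Tokenize(string query)
    {
        var tokens = new List<string>();
        // -?"..." matches an (optionally excluded) quoted phrase,
        // -?\w+ matches an (optionally excluded) single word.
        foreach (Match m in Regex.Matches(query, @"(-?""[^""]+""|-?\w+)\s*"))
        {
            // Group 1 holds the token without the trailing whitespace.
            tokens.Add(m.Groups[1].Value.Replace("\"", ""));
        }
        return tokens;
    }
}
```

A caller can then test token.StartsWith("-") to separate excluded terms from ordinary ones.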
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
Go through the string char by char, like this (sort of pseudo code):
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
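Since the question asks for C#, the pseudo code might translate roughly as follows. CharTokenizer, Split and Flush are hypothetical names. One deliberate difference, which is an assumption on my part: an empty pair of quotes ("") adds nothing here, whereas the pseudo code would append an empty word.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the names are illustrative only.
public static class CharTokenizer
{
    // Splits on spaces, except inside double quotes, and drops the quotes.
    public static List<string> Split(string query)
    {
        var words = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in query)
        {
            if (c == '"')
            {
                // A closing quote ends the current phrase; an opening quote only toggles the state.
                if (inQuotes)
                {
                    Flush(words, word);
                }
                inQuotes = !inQuotes;
            }
            else if (c == ' ' && !inQuotes)
            {
                // A space outside quotes ends the current word.
                Flush(words, word);
            }
            else
            {
                word.Append(c);
            }
        }

        // Whatever is left after the last separator.
        Flush(words, word);
        return words;
    }

    // Moves the buffered word into the list if it is not empty.
    private static void Flush(List<string> words, StringBuilder word)
    {
        if (word.Length > 0)
        {
            words.Add(word.ToString());
            word.Clear();
        }
    }
}
```

Supporting the "-" exclusion prefix would only need one more branch that records whether a word started with '-'.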
I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's regex. Thought I would share it here even though the question asks for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}

Problem creating regex to match filename

I am trying to create a regex in C# to extract the artist, track number and song title from a filename named like: 01.artist - title.mp3
Right now I can't get the thing to work, and am having problems finding much relevant help online.
Here is what I have so far:
string fileRegex = "(?<trackNo>\\d{1,3})\\.(<artist>[a-z])\\s-\\s(<title>[a-z])\\.mp3";
Regex r = new Regex(fileRegex);
Match m = r.Match(song.Name); // song.Name is the filename
if (m.Success)
{
Console.WriteLine("Artist is {0}", m.Groups["artist"]);
}
else
{
Console.WriteLine("no match");
}
I'm not getting any matches at all, and all help is appreciated!
You might want to put ?'s before the <> tags in all your groupings, and put a + sign after your [a-z]'s, like so:
string fileRegex = "(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]+)\\s-\\s(?<title>[a-z]+)\\.mp3";
Then it should work. The ?'s are required so that the contents of the angled brackets <> are interpreted as a grouping name, and the +'s are required to match 1 or more repetitions of the last element, which is any character between (and including) a-z here.
Your artist and title groups are matching exactly one character. Try:
"(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]+)\\s-\\s(?<title>[a-z]+)\\.mp3"
I really recommend http://www.ultrapico.com/Expresso.htm for building regular expressions. It's brilliant and free.
P.S. I like to type my regex string literals like so:
@"(?<trackNo>\d{1,3})\.(?<artist>[a-z]+)\s-\s(?<title>[a-z]+)\.mp3"
Maybe try:
"(?<trackNo>\\d{1,3})\\.(?<artist>[a-z]*)\\s-\\s(?<title>[a-z]*)\\.mp3";
CODE
String fileName = @"01. Pink Floyd - Another Brick in the Wall.mp3";
String regex = @"^(?<TrackNumber>[0-9]{1,3})\. ?(?<Artist>(.(?!= - ))+) - (?<Title>.+)\.mp3$";
Match match = Regex.Match(fileName, regex);
if (match.Success)
{
Console.WriteLine(match.Groups["TrackNumber"]);
Console.WriteLine(match.Groups["Artist"]);
Console.WriteLine(match.Groups["Title"]);
}
OUTPUT
01
Pink Floyd
Another Brick in the Wall

How to write a regex with group matching?

Here is the data source, lines stored in a txt file:
servers[i]=["name1", type1, location3];
servers[i]=["name2", type2, location3];
servers[i]=["name3", type1, location7];
Here is my code:
string servers = File.ReadAllText("servers.txt");
string pattern = "^servers[i]=[\"(?<name>.*)\", (.*), (?<location>.*)];$";
Regex reg = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
Match m;
for (m = reg.Match(servers); m.Success; m = m.NextMatch()) {
string name = m.Groups["name"].Value;
string location = m.Groups["location"].Value;
}
No lines are matching. What am I doing wrong?
If you don't care about anything except the servername and location, you don't need to specify the rest of the input in your regex. That lets you avoid having to escape the brackets, as Graeme correctly points out. Try something like:
string pattern = "\"(?<name>.+)\".+\\s(?<location>[^ ]+)];$";
That's
\" = quote mark,
(?<name> = start capture group 'name',
.+ = match one or more chars (could use \w+ here for 1+ word chars)
) = end the capture group
\" = ending quote mark
.+\s = one or more chars, ending with a space
(?<location> = start capture group 'location',
[^ ]+ = one or more non-space chars
) = end the capture group
];$ = immediately followed by ]; and end of string
I tested this using your sample data in Rad Software's free Regex Designer, which uses the .NET regex engine.
I don't know if C# regex's are the same as perl, but if so, you probably want to escape the [ and ] characters. Also, there are extra characters in there. Try this:
string pattern = "^servers\\[i\\]=\\[\"(?<name>.*)\", (.*), (?<location>.*)\\];$";
Edited to add: After wondering why my answer was downvoted and then looking at Val's answer, I realized that the "extra characters" were there for a reason. They are what perl calls "named capture buffers", which I have never used but the original code fragment does. I have updated my answer to include them.
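Putting the escaped brackets together with a verbatim string gives something like the sketch below. ServerListParser and Parse are hypothetical names, and two choices are assumptions beyond the answers: negated character classes instead of .* to keep each group inside its own field, and an optional \r before $ because File.ReadAllText keeps Windows line endings and $ in Multiline mode only matches right before \n.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class ServerListParser
{
    // Returns one (name, location) pair per servers[i]=[...]; line in text.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        // In a verbatim string, "" stands for a single quote character.
        var regex = new Regex(
            @"^servers\[i\]=\[""(?<name>[^""]*)"", [^,]*, (?<location>[^\]]*)\];\r?$",
            RegexOptions.Multiline);

        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in regex.Matches(text))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, m.Groups["location"].Value));
        }
        return result;
    }
}
```

Usage would be ServerListParser.Parse(File.ReadAllText("servers.txt")), then reading pair.Key for the name and pair.Value for the location.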
Try this:
string pattern = "servers\\[i\\]=\\[\"(?<name>.*)\", (.*), (?<location>.*)\\];$";
