I'm trying to find out how to analyze the syntax of a sentence in C#.
In my case I have a syntax which every sentence has to follow.
The syntax looks like this:
A 'B' is a 'C'.
Every sentence has to contain five words. The first word of my sentence has to be 'A', the third 'is' and the fourth 'a'.
Now I would like to check whether a test sentence matches my syntax.
Test sentence:
A Dog is no Cat.
In this example the test sentence would be wrong, because the fourth word is 'no' and not 'a', as the syntax requires.
I read about LINQ where I can query sentences that contain a specified set of words.
The code would look something like this:
//Notice the third sentence would have the correct syntax
string text = "A Dog is no Cat. My Dog is a Cat. A Dog is a Cat.";
//Splitting text into single sentences
string[] sentences = text.Split(new char[] { '.'});
//Defining the search terms
string[] wordsToMatch = { "A", "is" };
//Find sentences that contain all terms I'm looking for
var sentenceQuery = from sentence in sentences
let w = sentence.Split(new char[] { ' ' })
where w.Distinct().Intersect(wordsToMatch).Count() == wordsToMatch.Count()
select sentence;
With this code I could check if the sentences contain my terms I'm looking for, but the problem is it's not checking the position of the words in the sentence.
Is there a way I could check the position as well or maybe a better way to check the syntax of a sentence with C#?
Try using regular expressions, something like this:
using System.Text.RegularExpressions;
...
string source = "A Dog is no Cat.";
bool result = Regex.IsMatch(source, @"^A\s+[A-Za-z0-9]+\s+is\s+a\s+[A-Za-z0-9]+\.$");
Pattern explanation:
^ - start of the string (anchor)
A - Letter A
\s+ - one or more whitespace characters (spaces)
[A-Za-z0-9]+ - 1st word (can contain A..Z, a..z letters and 0..9 digits)
\s+ - one or more whitespace characters (spaces)
is - is
\s+ - one or more whitespace characters (spaces)
a - a
\s+ - one or more whitespace characters (spaces)
[A-Za-z0-9]+ - 2nd word (can contain A..Z, a..z letters and 0..9 digits)
\. - full stop
$ - end of the string (anchor)
You can slightly modify the code and obtain actual 1st and 2nd strings' values:
string source = "A Dog is a Cat."; // valid string
string pattern =
@"^A\s+(?<First>[A-Za-z0-9]+)\s+is\s+a\s+(?<Second>[A-Za-z0-9]+)\.$";
var match = Regex.Match(source, pattern);
if (match.Success) {
string first = match.Groups["First"].Value; // "Dog"
string second = match.Groups["Second"].Value; // "Cat"
...
}
A regular expression would work for this, and would be the most concise, but may not be the most readable solution. Here is a simple method that will return true if the sentence is valid:
private bool IsSentenceValid(string sentence)
{
// split the sentence into an array of words
char[] splitOn = new char[] {' '};
string[] words = sentence.ToLower().Split(splitOn); // make all chars lowercase for easy comparison
// check for 5 words.
if (words.Length != 5)
return false;
// check for required words
if (words[0] != "a" || words[2] != "is" || words[3] != "a")
return false;
// if we got here, we're fine!
return true;
}
Just want to throw out some ideas. I would write three classes for this:
SentenceManager: takes a string as a sentence and has a public method public string GetWord(word_index). For example, GetWord(3) would return the 3rd word of the sentence that was given to the class constructor.
SentenceSyntax: in this class you can say how many words your sentence must have, which words are fixed, and at which index each of those words must appear.
SyntaxChecker: this class takes a SentenceSyntax object and a SentenceManager object and has a function called Check, which returns true if the sentence matches the syntax.
Remember, there can be thousands of ways to make something work, but only a few ways to do it right.
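The three classes described above could be sketched roughly like this. It is only one possible reading of the idea: the member names beyond GetWord and Check (WordCount, RequiredWords), the 1-based word index, and the decision to drop a trailing full stop before splitting are all assumptions of mine, not something the answer specifies.

```csharp
using System;
using System.Collections.Generic;

// Holds one sentence and exposes its words by 1-based position.
public sealed class SentenceManager
{
    private readonly string[] _words;

    public SentenceManager(string sentence)
    {
        // Assumption: a trailing full stop is not part of the last word.
        _words = sentence.Trim().TrimEnd('.')
                         .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public int WordCount => _words.Length;

    // GetWord(3) returns the 3rd word of the sentence.
    public string GetWord(int wordIndex) => _words[wordIndex - 1];
}

// Describes how many words a sentence must have and which words are fixed.
public sealed class SentenceSyntax
{
    public int WordCount { get; }

    // Key: 1-based word position, value: the word required at that position.
    public IReadOnlyDictionary<int, string> RequiredWords { get; }

    public SentenceSyntax(int wordCount, IReadOnlyDictionary<int, string> requiredWords)
    {
        WordCount = wordCount;
        RequiredWords = requiredWords;
    }
}

// Compares a sentence against a syntax.
public sealed class SyntaxChecker
{
    private readonly SentenceSyntax _syntax;
    private readonly SentenceManager _sentence;

    public SyntaxChecker(SentenceSyntax syntax, SentenceManager sentence)
    {
        _syntax = syntax;
        _sentence = sentence;
    }

    public bool Check()
    {
        if (_sentence.WordCount != _syntax.WordCount)
            return false;

        foreach (KeyValuePair<int, string> required in _syntax.RequiredWords)
        {
            // Case-sensitive comparison, since the syntax asks for 'A', 'is', 'a'.
            if (!string.Equals(_sentence.GetWord(required.Key), required.Value,
                               StringComparison.Ordinal))
                return false;
        }
        return true;
    }
}
```

For the syntax in the question you would build a SentenceSyntax with a word count of 5 and the required words 1 => "A", 3 => "is", 4 => "a", then call Check() once per sentence.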
You should definitely do this using Regex or something similar, like Dmitry has answered.
Just for kicks, I wanted to do it your way. This is how I would do it if I was going nuts :)
//Notice the third sentence would have the correct syntax
string text = "A Dog is no Cat.My Dog is a Cat.A Dog is a Cat.";
//Splitting text into single sentences
string[] sentences = text.Split(new char[] { '.' });
string[] wordsToMatch = { "A", "*", "is", "a", "*" };
var sentenceQuery = from sentence in sentences
let words = sentence.Split(' ')
where words.Length == wordsToMatch.Length &&
wordsToMatch.Zip(words, (f, s) => f == "*" || f == s).All(p => p)
select sentence;
With this code you can also add flexibility such as case-insensitive comparison or trimming the space around each word. Of course, you will have to code for that.
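A minimal sketch of that extra flexibility, assuming the same "*" wildcard template as above. The class and method names (TemplateMatcher, Matches, FindMatches) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TemplateMatcher
{
    // "*" stands for any single word; every other entry must match the word
    // at the same position, ignoring case.
    private static readonly string[] Template = { "A", "*", "is", "a", "*" };

    public static bool Matches(string sentence)
    {
        // RemoveEmptyEntries tolerates leading, trailing and repeated spaces.
        string[] words = sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        return words.Length == Template.Length &&
               Template.Zip(words, (expected, actual) =>
                       expected == "*" ||
                       string.Equals(expected, actual, StringComparison.OrdinalIgnoreCase))
                   .All(ok => ok);
    }

    // Splits the text on '.' and keeps only the sentences that fit the template.
    public static List<string> FindMatches(string text)
    {
        return text.Split('.')
                   .Select(s => s.Trim())
                   .Where(s => Matches(s))
                   .ToList();
    }
}
```

Splitting on '.' alone is still as naive as in the original query, so abbreviations or decimals inside a sentence would break it.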
My task is to select first sentence from a text (I'm writing in C#). I suppose that the most appropriate way would be using regex but some troubles occurred. What regex pattern should I use to select the first sentence?
Several examples:
Input: "I am a lion and I want to be free. Do you see a lion when you look inside of me?" Expected result: "I am a lion and I want to be free."
Input: "I drink so much they call me Charlie 4.0 hands. Any text." Expected result: "I drink so much they call me Charlie 4.0 hands."
Input: "So take out your hands and throw the H.U. up. 'Now wave it around like you don't give a fake!'" Expected result: "So take out your hands and throw the H.U. up."
The third is really confusing me.
Since you already provided some assumptions:
sentences are divided by a whitespace
task is to select first sentence
You can use the following regex:
^.*?[.?!](?=\s+(?:$|\p{P}*\p{Lu}))
See RegexStorm demo
Regex breakdown:
^ - start of string (thus, only the first sentence will be matched)
.*? - any number of characters, as few as possible (use RegexOptions.Singleline to also match a newline with .)
[.?!] - a final punctuation symbol
(?=\s+(?:$|\p{P}*\p{Lu})) - a look-ahead making sure the punctuation is followed by one or more whitespace symbols (\s+) and then either the end of the string ($) or optional punctuation (\p{P}*) and a capital letter (\p{Lu}).
UPDATE:
Since it turns out you can have single sentence input, and your sentences can start with any letter or digit, you can use
^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)
See another demo
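For reference, here is a small sketch of how the updated pattern might be used from C#. The wrapper class FirstSentence and its Extract method are hypothetical names, and returning the whole input when nothing matches is my own assumption about the desired fallback.

```csharp
using System.Text.RegularExpressions;

public static class FirstSentence
{
    // The updated pattern from above. Singleline lets '.' also match newlines.
    private static readonly Regex Pattern = new Regex(
        @"^.*?[.?!](?=\s+\p{P}*[\p{Lu}\p{N}]|\s*$)",
        RegexOptions.Singleline);

    // Returns the first sentence, or the whole input if no sentence end is found.
    public static string Extract(string text)
    {
        Match match = Pattern.Match(text);
        return match.Success ? match.Value : text;
    }
}
```

Called on the three inputs from the question, Extract should stop after "free.", after "hands." and after "up." respectively, because the dots in "4.0" and "H.U." are not followed by whitespace plus a capital letter or digit.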
I came up with a regular expression that uses several negative look-aheads to exclude certain cases, e.g. a punctuation mark must not be followed by a lowercase character, and a dot before a capital letter does not close a sentence. This splits the whole text into its separate sentences. If you are given a text, just take the first match.
/[\s\S]*?(?![A-Z]+)(?:\.|\?|\!)(?!(?:\d|[A-Z]))(?! [a-z])/gm
Sentence separators should be searched with following scanner:
if it's sentence-finisher character (like [.!?])
it must be followed by space or allowed sequence of characters and then space:
like sequence of '.' for '.' (A sentence...)
...or sequence of '!' and/or '?' for '!' and '?' (Exclamation here!?)
then it must be followed by either:
capital character (ignore quotes, if any)
numeric
which must be followed by lowercase or another sentence-finisher
dialog-starter character (Blah blah blah... - And what next, Elric?)
Tip: don't forget to add extra space character to input source string.
Upd:
Some wild pseudocode xD:
func sentence(inputString) {
finishers = ['.', '!', '?']
allowedSequences = ['.' => ['..'], '!' => ['!!', '?'], '?' => ['??', '!']]
input = inputString
result = ''
found = false
while input != '' {
finisherPos = min(pos(input, finishers))
if !finisherPos
return inputString
result += substr(input, 0, finisherPos + 1)
input = substr(input, finisherPos)
p = finisherPos
finisher = input[p]
p++
if input[p] != ' '
if match = testSequence(substr(input, p), allowedSequences[finisher]) {
result += match
found = true
break
} else {
continue
}
else {
p++
if input[p] in [A-Z] {
found = true
break
}
if input[p] in [0-9] {
p++
if input[p] in [a-z] or input[p] in finishers {
found = true
break
}
p--
}
if input[p] in ['-'] {
found = true;
break
}
}
}
if !found
return inputString
return result
}
func testSequence(str, sequences) {
foreach (sequence: sequences)
if startsWith(str, sequence)
return sequence
return false
}
I have a long string composed of a number of different words.
I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
bool add = true;
foreach (char c in s)
{
if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
{
add = false;
break;
}
}
// Add the word once, after all of its characters have been checked
if (add)
{
response.Add(s);
}
}
Edit 2:
For me a word should be a number of characters (a..z) separated by a space. ,/./!/... at the end shouldn't count for the 'special character' condition (which is really mostly just to remove URLs or the like).
So:
"I saw a dog. It was black!"
should result in
{saw,a,dog,was,black}
So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
Example (ideone):
var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
How about this?
(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
Word can only be preceded by start of line or some number of whitespace characters
Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
Word can only contain lower case (latin) letters and dashes
The named group containing each word is "word"
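A short sketch of reading the "word" group from C#, using the edited pattern above unchanged. WordFilter and LowercaseWords are illustrative names only.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // The edited pattern from above; .NET allows the variable-length look-behind.
    private static readonly Regex WordPattern = new Regex(
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))");

    // Returns every lowercase (or hyphenated) word, without trailing punctuation.
    public static List<string> LowercaseWords(string input)
    {
        return WordPattern.Matches(input)
                          .Cast<Match>()
                          .Select(m => m.Groups["word"].Value)
                          .ToList();
    }
}
```

Because the trailing punctuation sits inside the look-ahead, it is checked but never becomes part of the captured word.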
Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
for (int i = strings.Count - 1; i >= 0; i--)
{
if (strings[i].Contains("-"))
{
strings.RemoveAt(i);
}
}
This could be a starting point. Right now it only checks for "." as a special character. This outputs: "this an of words in an like-so from"
string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");
You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable; with a black-list it's the opposite.
Let's assume the white-list way and that you accept only the characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an upper case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example
You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
I want to count the letter "a" in only the first sentence. The code below counts "a" in all sentences, but I want only the first sentence.
static void Main(string[] args)
{
string text; int k = 0;
text = "bla bla bla. something second. maybe last sentence.";
foreach (char a in text)
{
char b = 'a';
if (b == a)
{
k += 1;
}
}
Console.WriteLine("number of a in first sentence is " + k);
Console.ReadKey();
}
This splits the string on '.', '!' and '?', then counts the 'a' characters in the first element of the resulting array (the first sentence).
var count = text.Split(new[] { '.', '!', '?' })[0].Count(c => c == 'a');
This example assumes a sentence is separated by a ., ? or !. If you have a decimal number in your string (e.g. 123.456), that will count as a sentence break. Breaking up a string into accurate sentences is a fairly complex exercise.
This is perhaps more verbose than what you were looking for, but hopefully it'll breed understanding as you read through it.
public static void Main()
{
//Make an array of the possible sentence enders. Doing this pattern lets us easily update
// the code later if it becomes necessary, or allows us easily to move this to an input
// parameter
string[] SentenceEnders = new string[] {"$", @"\.", @"\?", @"\!" /* Add Any Others */};
string WhatToFind = "a"; //What are we looking for? Regular Expressions Will Work Too!!!
string SentenceToCheck = "This, but not to exclude any others, is a sample."; //First example
string MultipleSentencesToCheck = @"
Is this a sentence
that breaks up
among multiple lines?
Yes!
It also has
more than one
sentence.
"; //Second Example
//This will split the input on all the enders put together(by way of joining them in [] inside a regular
// expression.
string[] SplitSentences = Regex.Split(SentenceToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase);
//SplitSentences is an array, with sentences on each index. The first index is the first sentence
string FirstSentence = SplitSentences[0];
//Now, split that single sentence on our matching pattern for what we should be counting
string[] SubSplitSentence = Regex.Split(FirstSentence, WhatToFind, RegexOptions.IgnoreCase);
//Now that it's split, it's split a number of times that matches how many matches we found, plus one
// (The "Left over" is the +1
int HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the first sentence, {0} '{1}'.", HowMany, WhatToFind));
//Do all this again for the second example. Note that ideally, this would be in a separate function
// and you wouldn't be writing code twice, but I wanted you to see it without all the comments so you can
// compare and contrast
SplitSentences = Regex.Split(MultipleSentencesToCheck, "[" + String.Join("", SentenceEnders) + "]", RegexOptions.IgnoreCase | RegexOptions.Singleline);
SubSplitSentence = Regex.Split(SplitSentences[0], WhatToFind, RegexOptions.IgnoreCase | RegexOptions.Singleline);
HowMany = SubSplitSentence.Length - 1;
System.Console.WriteLine(string.Format("We found, in the second sentence, {0} '{1}'.", HowMany, WhatToFind));
}
Here is the output:
We found, in the first sentence, 3 'a'.
We found, in the second sentence, 4 'a'.
You didn't define "sentence", but if we assume it's always terminated by a period (.), just add this inside the loop:
if (a == '.') {
break;
}
Expand from this to support other sentence delimiters.
Simply "break" the foreach(...) loop when you encounter a "." (period)
Well, assuming you define a sentence as ending with a '.':
Use String.IndexOf() to find the position of the first '.'. After that, search in a Substring instead of the entire string.
Find the position of the '.' in the text (you can use Split).
Count the 'a' characters in the text from position 0 to the position of the '.'.
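The IndexOf/Substring idea could look something like this. FirstSentenceCounter and CountInFirstSentence are hypothetical names, and treating the whole text as one sentence when there is no '.' is my assumption.

```csharp
using System;
using System.Linq;

public static class FirstSentenceCounter
{
    // Counts how often 'letter' occurs before the first '.' in 'text'.
    public static int CountInFirstSentence(string text, char letter)
    {
        int end = text.IndexOf('.');

        // No period at all: treat the whole text as a single sentence.
        string firstSentence = end >= 0 ? text.Substring(0, end) : text;

        return firstSentence.Count(c => c == letter);
    }
}
```

With the text from the question, CountInFirstSentence(text, 'a') only looks at "bla bla bla".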
string SentenceToCheck = "Hi, I can wonder this situation where I can do best";
//Here I am giving several way to find this
//Using Regular Expression
int HowMany = Regex.Split(SentenceToCheck, "a", RegexOptions.IgnoreCase).Length - 1;
int i = Regex.Matches(SentenceToCheck, "a").Count;
// Simple way
int Count = SentenceToCheck.Length - SentenceToCheck.Replace("a", "").Length;
//Linq
var _lamdaCount = SentenceToCheck.ToCharArray().Where(t => t.ToString() != string.Empty)
.Where(t => t.ToString().ToUpper().Equals("A")).Count();
var _linqAIEnumareable = from _char in SentenceToCheck.ToCharArray()
where !String.IsNullOrEmpty(_char.ToString())
&& _char.ToString().ToUpper().Equals("A")
select _char;
int a = _linqAIEnumareable.Count();
var _linqCount = from g in SentenceToCheck.ToCharArray()
where g.ToString().Equals("a")
select g;
int b = _linqCount.Count();
I have a text field that accepts user input in the form of delimited lists of strings. I have two main delimiters, a space and a comma.
If an item in the list contains more than one word, a user can delineate it by enclosing it in quotes.
Sample Input:
Apple, Banana Cat, "Dog starts with a D" Elephant Fox "G is tough", "House"
Desired Output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is a tough one
House
I've been working on getting a regex for this, and I can't figure out how to allow the commas. Here is what I have so far:
Regex.Matches(input, @"(?<match>\w+)|\""(?<match>[\w\s]*)""")
.Cast<Match>()
.Select(m => m.Groups["match"].Value.Replace("\"", ""))
.Where(x => x != "")
.Distinct()
.ToList()
That regex is pretty smart if it can turn "G is tough" into G is a tough one :-)
On a more serious note, code up a parser and don't try to rely on a singular regex to do this for you.
You'll find you learn more, the code will be more readable, and you won't have to concern yourself with edge cases that you haven't even figured out yet, like:
Apple, Banana Cat, "Dog, not elephant, starts with a D" Elephant Fox
A simple parser for that situation would be:
state = whitespace
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace:
            process word
            word = ""
            state = whitespace
        else:
            word = word + character
and it's relatively easy to add support for quoting:
state = whitespace
quote = no
word = ""
for each character in (string + " "):
    if state is whitespace:
        if character is not whitespace:
            word = character
            state = inword
    else:
        if character is whitespace and quote is no:
            process word
            word = ""
            state = whitespace
        else:
            if character is quote:
                quote = not quote
            else:
                word = word + character
Note that I haven't tested these thoroughly but I've done these quite a bit in the past so I'm quietly confident. It's only a short step from there to one that can also allow escaping (for example, if you want quotes within quotes like "The \" character is inside").
To get a single regex capable of handling multiple separators isn't that hard, getting it to monitor state, such as when you're within quotes, so you can treat separators differently, is another level.
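As a rough C# take on the quote-aware parser above: this is a simplified variant rather than a line-by-line translation. It folds the whitespace/inword states into a single StringBuilder check, and it treats commas as separators as well, since the question allows both. ListTokenizer and Tokenize are illustrative names.

```csharp
using System.Collections.Generic;
using System.Text;

public static class ListTokenizer
{
    public static List<string> Tokenize(string input)
    {
        var items = new List<string>();
        var word = new StringBuilder();
        bool inQuotes = false;

        // The appended space flushes the last item, as in the pseudocode.
        foreach (char c in input + " ")
        {
            if (c == '"')
            {
                // Quotes only toggle the state; they are not part of the item.
                inQuotes = !inQuotes;
            }
            else if (!inQuotes && (c == ',' || char.IsWhiteSpace(c)))
            {
                // A separator outside quotes ends the current item, if any.
                if (word.Length > 0)
                {
                    items.Add(word.ToString());
                    word.Clear();
                }
            }
            else
            {
                // Inside quotes, spaces and commas are ordinary characters.
                word.Append(c);
            }
        }

        // Assumption: an unterminated quote still yields its text as an item.
        if (word.Length > 0)
            items.Add(word.ToString().TrimEnd());

        return items;
    }
}
```

Escaped quotes inside quotes are not handled here; that would need one more flag for "previous character was a backslash".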
You should choose between using spaces or commas as delimiters. Using both is a bit confusing. If that choice is not yours to make, I would grab things between quotes first. When they are gone, you can just replace all commas with spaces and split the line on spaces.
You could perform two regexes. The first one to match the quoted sections, then remove them. With the second regex you could match the remaining words.
string pat = "\"(.*?)\"", pat2 = "(\\w+)";
string x = "Apple, Banana Cat, \"Dog starts with a D\" Elephant Fox \"G is tough\", \"House\"";
IEnumerable<Match> combined = Regex.Matches(Regex.Replace(x, pat, ""), pat2).OfType<Match>().Union(Regex.Matches(x, pat).OfType<Match>()).Where(m => m.Success);
foreach (Match m in combined)
Console.WriteLine(m.Groups[1].ToString());
Let me know if this isn't what you were looking for.
I like paxdiablo's parser, but if you want to use a single regex, then consider my modified version of a CSV regex parser.
Step 1: the original
string regex = "((?<field>[^\",\\r\\n]+)|\"(?<field>([^\"]|\"\")+)\")(,|(?<rowbreak>\\r\\n|\\n|$))";
Step 2: using multiple delimiters
char quoter = '"'; // quotation mark
string delimiter = " ,"; // either space or comma
string regex = string.Format("((?<field>[^\\r\\n{1}{0}]*)|[{1}](?<field>([^{1}]|[{1}][{1}])*)[{1}])([{0}]|(?<rowbreak>\\r\\n|\\n|$))", delimiter, quoter);
Using a simple loop to test:
Regex re = new Regex(regex);
foreach (Match m in re.Matches(input))
{
string field = m.Result("${field}").Replace("\"\"", "\"").Trim();
// string rowbreak = m.Result("${rowbreak}");
if (field != string.Empty)
{
// Print(field);
}
}
We get the output:
Apple
Banana
Cat
Dog starts with a D
Elephant
Fox
G is tough
House
That's it!
Look at the original CSV regex parser for ideas on handling the matched regex data. You might have to modify it slightly, but you'll get the idea.
Just for interest's sake, if you are crazy enough to want to use multiple characters as a single delimiter, then consider this answer.
I'm looking to tokenize a search query similar to how Google does it. For instance, if I have the following search query:
the quick "brown fox" jumps over the "lazy dog"
I would like to have a string array with the following tokens:
the
quick
brown fox
jumps
over
the
lazy dog
As you can see, the tokens preserve the spaces within double quotes.
I'm looking for some examples of how I could do this in C#, preferably not using regular expressions, however if that makes the most sense and would be the most performant, then so be it.
Also I would like to know how I could extend this to handle other special characters, for example, putting a - in front of a term to force exclusion from a search query and so on.
So far, this looks like a good candidate for RegEx's. If it gets significantly more complicated, then a more complex tokenizing scheme may be necessary, but you should avoid that route unless necessary as it is significantly more work. (On the other hand, for complex schemas, regex quickly turns into a dog and should likewise be avoided.)
This regex should solve your problem:
("[^"]+"|\w+)\s*
Here is a C# example of its usage:
string data = "the quick \"brown fox\" jumps over the \"lazy dog\"";
string pattern = @"(""[^""]+""|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
The real benefit of this method is that it can easily be extended to include your "-" requirement, like so:
string data = "the quick \"brown fox\" jumps over " +
"the \"lazy dog\" -\"lazy cat\" -energetic";
string pattern = @"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*";
MatchCollection mc = Regex.Matches(data, pattern);
foreach(Match m in mc)
{
string group = m.Groups[0].Value;
}
Now I hate reading Regex's as much as the next guy, but if you split it up, this one is quite easy to read:
(
-"[^"]+"
|
"[^"]+"
|
-\w+
|
\w+
)\s*
Explanation
If possible match a minus sign, followed by a " followed by everything until the next "
Otherwise match a " followed by everything until the next "
Otherwise match a - followed by any word characters
Otherwise match as many word characters as you can
Put the result in a group
Swallow up any following space characters
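The matches still carry their quotes and the leading minus sign. One way to turn them into usable search terms might be the following sketch. SearchTerm, QueryTokenizer and Parse are names I made up, and the Excluded flag is just one possible way to represent the "-" prefix.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// A single search term plus a flag saying whether it was prefixed with '-'.
public sealed class SearchTerm
{
    public string Text { get; }
    public bool Excluded { get; }

    public SearchTerm(string text, bool excluded)
    {
        Text = text;
        Excluded = excluded;
    }
}

public static class QueryTokenizer
{
    // Same pattern as above.
    private static readonly Regex TokenPattern =
        new Regex(@"(-""[^""]+""|""[^""]+""|-\w+|\w+)\s*");

    public static List<SearchTerm> Parse(string query)
    {
        var terms = new List<SearchTerm>();
        foreach (Match m in TokenPattern.Matches(query))
        {
            // Group 1 is the token without the trailing whitespace.
            string token = m.Groups[1].Value;

            bool excluded = token[0] == '-';
            if (excluded)
                token = token.Substring(1);

            // Strip the surrounding quotes from quoted phrases.
            terms.Add(new SearchTerm(token.Trim('"'), excluded));
        }
        return terms;
    }
}
```

Note that the pattern would also split a hyphenated word such as like-so into like and an excluded so, which may or may not be what you want.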
I was just trying to figure out how to do this a few days ago. I ended up using Microsoft.VisualBasic.FileIO.TextFieldParser which did exactly what I wanted (just set HasFieldsEnclosedInQuotes to true). Sure it looks somewhat odd to have "Microsoft.VisualBasic" in a C# program, but it works, and as far as I can tell it is part of the .NET framework.
To get my string into a stream for the TextFieldParser, I used "new MemoryStream(new ASCIIEncoding().GetBytes(stringvar))". Not sure if this is the best way to do it.
Edit: I don't think this would handle your "-" requirement, so maybe the RegEx solution is better
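A sketch of the TextFieldParser approach, assuming the Microsoft.VisualBasic assembly is referenced. It passes a StringReader to the parser instead of going through a MemoryStream, which avoids the ASCII encoding step. QuotedFieldSplitter and Split are hypothetical names.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedFieldSplitter
{
    // Splits on spaces, keeping text inside double quotes together.
    public static string[] Split(string query)
    {
        using (var parser = new TextFieldParser(new StringReader(query)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(" ");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null when there is nothing left to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

This only reads the first line of the input, which should be enough for a single search query.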
Go char by char to the string like this: (sort of pseudo code)
array words = {} // empty array
string word = "" // empty word
bool in_quotes = false
for char c in search string:
    if in_quotes:
        if c is '"':
            append word to words
            word = "" // empty word
            in_quotes = false
        else:
            append c to word
    else if c is '"':
        in_quotes = true
    else if c is ' ': // space
        if not empty word:
            append word to words
            word = "" // empty word
    else:
        append c to word
// Rest
if not empty word:
    append word to words
I was looking for a Java solution to this problem and came up with one based on @Michael La Voie's answer. Thought I would share it here despite the question asking for C#. Hope that's okay.
public static final List<String> convertQueryToWords(String q) {
List<String> words = new ArrayList<>();
Pattern pattern = Pattern.compile("(\"[^\"]+\"|\\w+)\\s*");
Matcher matcher = pattern.matcher(q);
while (matcher.find()) {
MatchResult result = matcher.toMatchResult();
if (result != null && result.group() != null) {
if (result.group().contains("\"")) {
words.add(result.group().trim().replaceAll("\"", "").trim());
} else {
words.add(result.group().trim());
}
}
}
return words;
}