I am trying to solve a problem in C#.
Here is the task:
If a word begins with a vowel (a, e, i, o, u or A, E, I, O, U), remove the first letter and append it to the end, then add "che". For example, the word "orange" translates to "rangeoche".
If a word begins with a consonant (i.e. not a vowel), append "che" to the end of the word. For example, the word "chicken" becomes "chickenche".
If the word has an even number of letters, append one more "e" to the end of it.
Print the translated sentence.
Example:
Hello there Amy
Output:
Helloche thereche myAche
Here is what I have done so far :
string Sentence = Console.ReadLine();
string[] output = Sentence.Split(' ');
char letter;
string che = "che";
StringBuilder sb = new StringBuilder(Sentence);
foreach (string s in output)
{
letter = s[0];
if (letter == 'a' || letter == 'A' || letter == 'e' || letter == 'E' || letter == 'i'
|| letter == 'I' || letter == 'o' || letter == 'O' || letter == 'u' || letter == 'U')
{
// Console.WriteLine("first char of the word is a vowel");
}
else
{
sb.Insert(s.Length,che);
// Console.WriteLine("first char of a word is a consonant");
}
if (s.Length % 2 == 0)
{
// Console.WriteLine("the word has even numbers of letters");
}
//Console.WriteLine(firstchar);
int currentWordLength = s.Length;
}
Console.WriteLine(sb);
The problem is that I cannot add "che" or move the vowels of later words, because the indexes shift as soon as I make a change, so I can only change the first word. My ifs are correct: if I uncomment the Console.WriteLine calls, they fire for each word as expected.
I am just struggling with the adding/removing for each word. Can you please point me in the right direction?
I would suggest creating a StringBuilder object for each word and appending the appropriate string inside the if conditions. Try the code below:
string Sentence = Console.ReadLine();
string[] output = Sentence.Split(' ');
char letter;
string che = "che";
StringBuilder sb = null;
Console.WriteLine("\n");
string strFinal = "";
foreach (string s in output)
{
letter = s[0];
sb = new StringBuilder(s);
if (letter == 'a' || letter == 'A' || letter == 'e' || letter == 'E' || letter == 'i'
|| letter == 'I' || letter == 'o' || letter == 'O' || letter == 'u' || letter == 'U')
{
// Console.WriteLine("first char of the word is a vowel");
string s1 = sb.Remove(0, 1).ToString();
sb.Insert(s1.Length, letter);
sb.Insert(sb.Length, che);
}
else
{
// Console.WriteLine("first char of a word is a consonant");
sb.Insert(s.Length, che);
}
if (s.Length % 2 == 0)
{
// Console.WriteLine("the word has even numbers of letters");
// sb.Insert(s.Length, "e");
sb.Insert(sb.Length, "e");
}
//Console.WriteLine(firstchar);
int currentWordLength = s.Length;
strFinal += sb + " ";
}
Console.WriteLine(strFinal);
Console.ReadKey();
First, I'd recommend changing your translation code to have a function that acts on a single word instead of the whole sentence. So the code in foreach (string s in output) should be moved to another function that acts only on that string. And don't try to manipulate the string passed in; create a new one based on the logic you've listed. Once you've created the translated string, return it to the caller. The caller then reconstructs the sentence from the returned translations.
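A minimal sketch of that structure, assuming the rules exactly as listed in the question. The names CheTranslator, TranslateWord and TranslateSentence are made up for illustration, and passing empty entries through unchanged is my own assumption:

```csharp
using System;
using System.Linq;

public static class CheTranslator
{
    private const string Vowels = "aeiouAEIOU";

    // Builds a new string for one word; the input is never modified.
    public static string TranslateWord(string word)
    {
        if (string.IsNullOrEmpty(word))
            return word; // assumption: empty entries pass through unchanged

        // Vowel first: move the first letter to the end. Consonant first: keep the word.
        string stem = Vowels.IndexOf(word[0]) >= 0
            ? word.Substring(1) + word[0]
            : word;

        // Even number of letters in the original word: one extra "e".
        string suffix = word.Length % 2 == 0 ? "chee" : "che";
        return stem + suffix;
    }

    // Splits on spaces, translates each word, and rebuilds the sentence.
    public static string TranslateSentence(string sentence)
    {
        return string.Join(" ", sentence.Split(' ').Select(TranslateWord));
    }
}
```

Calling CheTranslator.TranslateSentence("Hello there Amy") gives "Helloche thereche myAche", which matches the expected output in the question.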
Using the obvious extension methods:
public static class ExtensionMethods {
// ***
// *** int Extensions
// ***
public static bool IsEven(this int n) => n % 2 == 0;
// ***
// *** String Extensions
// ***
public static bool StartsWithOneOf(this string s, HashSet<char> starts) => starts.Contains(s[0]);
public static string Join(this IEnumerable<string> strings, string sep) => String.Join(sep, strings);
}
You can use LINQ to process the rules:
var vowels = "aeiouAEIOU".ToHashSet();
var ans = src.Split(' ')
.Select(w => (w.StartsWithOneOf(vowels) ? w.Substring(1)+w[0] : w)+"che"+(w.Length.IsEven() ? "e" : ""))
.Join(" ");
Let's start by splitting the initial problem into smaller ones, with the help of extracted methods:
using System.Text.RegularExpressions;
...
private static String ConvertWord(string word) {
//TODO: Solution here
return word; // <- Stub
}
private static String ConvertPhrase(string phrase) {
// Regex (not Split) to preserve punctuation:
// we convert each word (continued sequence of letter A..Z a..z or ')
// within the original phrase
return Regex.Replace(phrase, @"[A-Za-z']+", match => ConvertWord(match.Value));
// Or if there's guarantee, that space is the only separator:
// return string.Join(" ", phrase
// .Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
// .Select(item => ConvertWord(item)));
}
Now it's time to implement ConvertWord:
private static String ConvertWord(string word) {
// Do not forget special cases, e.g. the empty string
if (string.IsNullOrEmpty(word))
return "chee";
// If the word has even number of letters append one more "e" to the end of it.
string suffix = word.Length % 2 == 0 ? "chee" : "che";
// Two cases: starting with a vowel / with a consonant
char letter = char.ToUpper(word[0]);
if (letter == 'A' || letter == 'E' || letter == 'I' || letter == 'O' || letter == 'U')
return word.Substring(1) + word.Substring(0, 1) + suffix;
else
return word + suffix;
}
Finally
string Sentence = Console.ReadLine();
Console.Write(ConvertPhrase(Sentence));
For test input
"It's just a simple test (demo only): nothing more!"
Will get
t'sIchee justchee ache simplechee testchee (demochee nlyochee): nothingche morechee!
You mustn't change your word until all the ifs have run. Test all of the conditions first and save the results in separate variables, then apply the changes to the word at the end:
foreach (string s in output)
{
letter = char.ToLower(s[0]);
bool isVowel = false;
bool isEven = false;
if (letter == 'a' || letter == 'e' || letter == 'i'
|| letter == 'o' || letter == 'u')
{
isVowel = true;
}
if (s.Length % 2 == 0)
{
isEven = true;
}
//Now you can change the word
if (isVowel)
{
//Do What you want
}
if (isEven)
{
//Do What you want
}
//Console.WriteLine(firstchar);
int currentWordLength = s.Length;
}
I would say break the logic into specific units so you can write tests:
static void Main(string[] args)
{
var input = Console.ReadLine();
var inputs = input.Split(' ');
var sentence = string.Join(" ", inputs.Select(ConvertWord));
Console.Write(sentence);
Console.Read();
}
internal static string ConvertWord(string input)
{
const string che = "che";
var vowels = new List<string>
{
"a","A", "e", "E", "i", "I", "o", "O", "u", "U"
};
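// Compare as a string: the vowels list holds strings, so a char would never be equal to any entry
var firstChar = input.First().ToString();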
var startsWithVowel = vowels.SingleOrDefault(a => a.Equals(firstChar));
string rule2String, output;
if (string.IsNullOrEmpty(startsWithVowel))
{
output = input + che;
}
else
{
output = input.Substring(1) + startsWithVowel + che;
}
rule2String = IsLengthEven(input)
? output + "e"
: output
;
return rule2String;
}
internal static bool IsLengthEven(string input)
{
return input.Length % 2 == 0;
}
Hope this Helps!
PS: I have not covered the edge cases
I would recommend using String.Join with LINQ and String.Format. Then the solution is quite easy:
private String convertSentence() {
var sentence = "Hello there Amy";
var vowels = "aeiou";
var che = "che";
return String.Join(" ", (
from s in sentence.Split(' ')
let moveFirstToEnd = vowels.Contains(s.ToLower()[0]) && s.Length > 1
select String.Format("{0}{1}{2}{3}"
, moveFirstToEnd ? s.Substring(1) : s
, moveFirstToEnd ? s.Substring(0, 1) : String.Empty
, che
, s.Length % 2 == 0 ? "e" : String.Empty
)
)
);
}
Related
I have a list of 256 words of 8 characters each, such as "DDUUDDUU", "DDDDUUUU", "DDUUUUUU", and I am having a hard time matching any combination of 2 or 3 consecutive identical letters, such as "UUDDUUUU", "DDDUUDDD".
foreach (var eachWord in AAAA.Values) {
int iCountU = 0;
int iCountD = 0;
char iLastChar = (char)106;
foreach (char letter in eachWord) {
if (letter == 'D') {
if (iCountD < 3) {
if (letter != iLastChar) {
iLastChar = letter;
iCountD = 1;
} else {
iCountD += 1;
}
}
}
if (letter == 'U') {
if (iCountU < 3) {
if (letter != iLastChar) {
iLastChar = letter;
iCountU = 1;
} else {
iCountU += 1;
}
}
}
}
if (iCountU > 2 && iCountD > 2) {
BBBB[eachWord] = eachWord;
}
}
Since this implementation counts the maximum consecutive occurrences for each character, the second word evaluates as {2,4} which overrides {2,3} and doesn't count as a match.
int matches = 0;
var wordList = new string[] { "DDUUUDUD", "DDUUUUDU" };
foreach (string word in wordList)
{
char? previous = null;
int count = 0;
var results = new Dictionary<char, int>();
foreach (char letter in word)
{
if (letter == previous)
results[letter] = Math.Max(results.ContainsKey(letter) ? results[letter] : 0, ++count);
else
count = 1;
previous = letter;
}
if (results.Values.SequenceEqual(new int[] {2,3}) || results.Values.SequenceEqual(new int[] {3,2}))
matches++;
}
Console.WriteLine(matches);
I don't have a C# compiler with me, so here is some Python code (with many elements written in a C#-like style):
AAAA=["DDUUDDUU", "DDDDUUUU"]
for word in AAAA:
    isFirst=True
    maxCon=0
    currCon=1
    for c in word:
        if isFirst:
            isFirst=False
        else:
            if c==prev:
                currCon+=1
                maxCon=max(maxCon,currCon)
            else:
                currCon=1
        prev=c
    if maxCon in (2,3):
        print(word,maxCon)
This is surprisingly simple with a Regular Expression:
static bool testString(string test)
{
return Regex.Matches(test, @"([a-zA-Z])\1+").Any(x => x.Length == 2 || x.Length == 3);
}
The main trick is the backreference: ([a-zA-Z]) captures a letter, and \1+ then matches one or more further occurrences of that same letter, so each match is a whole run of identical letters.
Note: on older .NET versions you may need to use Cast<Match>(), as in Regex.Matches(test, @"([a-zA-Z])\1+").Cast<Match>().Any(x => x.Length == 2 || x.Length == 3)
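As a usage sketch, the same check can filter a whole word list. The class name RunFilter and its method names are only for illustration, and treating "a run of exactly 2 or 3 identical letters anywhere in the word" as a match is my reading of the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RunFilter
{
    // ([A-Z]) captures one letter; \1+ matches one or more repeats of it,
    // so each match is a maximal run of at least two identical letters.
    private static readonly Regex Run = new Regex(@"([A-Z])\1+");

    // True when at least one run in the word is exactly 2 or 3 letters long.
    public static bool HasRunOfTwoOrThree(string word)
    {
        return Run.Matches(word).Cast<Match>().Any(m => m.Length == 2 || m.Length == 3);
    }

    // Keeps only the words that contain such a run.
    public static List<string> Filter(IEnumerable<string> words)
    {
        return words.Where(HasRunOfTwoOrThree).ToList();
    }
}
```

With this sketch, "DDUUDDUU" is kept (its runs are all of length 2), while "DDDDUUUU" is dropped because both of its runs have length 4.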
How to split text into words?
Example text:
'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'
The words in that line are:
Oh
you
can't
help
that
said
the
Cat
we're
all
mad
here
I'm
mad
You're
mad
Split text on whitespace, then trim punctuation.
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var punctuation = text.Where(Char.IsPunctuation).Distinct().ToArray();
var words = text.Split().Select(x => x.Trim(punctuation));
This agrees exactly with the example.
First, remove all special characters:
var fixedInput = Regex.Replace(input, "[^a-zA-Z0-9% ._]", string.Empty);
// This regex doesn't support apostrophe so the extension method is better
Then split it:
var split = fixedInput.Split(' ');
For a simpler C# solution for removing special characters (one that you can easily change), add this extension method (I added support for the apostrophe):
public static string RemoveSpecialCharacters(this string str) {
var sb = new StringBuilder();
foreach (char c in str) {
if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'' || c == ' ') {
sb.Append(c);
}
}
return sb.ToString();
}
Then use it like so:
var words = input.RemoveSpecialCharacters().Split(' ');
You'll be surprised to know that this extension method is very efficient (surely much more efficient than the Regex), so I suggest you use it ;)
Update
I agree that this is an English only approach but to make it Unicode compatible all you have to do is replace:
(c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
With:
char.IsLetter(c)
Which supports Unicode. .NET also offers char.IsSymbol and char.IsLetterOrDigit for a variety of other cases.
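A minimal sketch of the extension method with that replacement applied. The names StringCleanupExtensions and RemoveSpecialCharactersUnicode are hypothetical, and I have folded the digit test into char.IsLetterOrDigit, which (as an assumption you may not want) also accepts decimal digits from other scripts:

```csharp
using System.Text;

public static class StringCleanupExtensions
{
    // Keeps letters and digits from any script, plus apostrophes and spaces.
    public static string RemoveSpecialCharactersUnicode(this string str)
    {
        var sb = new StringBuilder(str.Length);
        foreach (char c in str)
        {
            if (char.IsLetterOrDigit(c) || c == '\'' || c == ' ')
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

It is used the same way as before: input.RemoveSpecialCharactersUnicode().Split(' ').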
Just to add a variation on @Adam Fridental's answer which is very good, you could try this Regex:
var text = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
var matches = Regex.Matches(text, @"\w+[^\s]*\w+|\w");
foreach (Match match in matches) {
var word = match.Value;
}
I believe this is the shortest RegEx that will get all the words
\w+[^\s]*\w+|\w
If you don't want to use a Regex object, you could do something like...
string mystring="Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.";
List<string> words=mystring.Replace(",","").Replace(":","").Replace(".","").Split(' ').ToList();
You'll still have to handle the trailing apostrophe at the end of "that,'"
This is one solution; I don't use any helper class or library method.
public static List<string> ExtractChars(string inputString) {
var result = new List<string>();
int startIndex = -1;
for (int i = 0; i < inputString.Length; i++) {
var character = inputString[i];
if ((character >= 'a' && character <= 'z') ||
(character >= 'A' && character <= 'Z')) {
if (startIndex == -1) {
startIndex = i;
}
if (i == inputString.Length - 1) {
result.Add(GetString(inputString, startIndex, i));
}
continue;
}
if (startIndex != -1) {
result.Add(GetString(inputString, startIndex, i - 1));
startIndex = -1;
}
}
return result;
}
public static string GetString(string inputString, int startIndex, int endIndex) {
string result = "";
for (int i = startIndex; i <= endIndex; i++) {
result += inputString[i];
}
return result;
}
If you want to use a for loop to check each char and keep all the punctuation from the input string, I've created this class. The method GetSplitSentence() returns a list of SentenceSplitResult. This list holds all the words as well as all the punctuation and numbers; each punctuation character or digit is saved as its own item in the list. The isAWord field of SentenceSplitResult tells you whether an item is a word or not.
public class SentenceSplitResult
{
public string word;
public bool isAWord;
}
public class StringsHelper
{
private readonly List<SentenceSplitResult> outputList = new List<SentenceSplitResult>();
private readonly string input;
public StringsHelper(string input)
{
this.input = input;
}
public List<SentenceSplitResult> GetSplitSentence()
{
StringBuilder sb = new StringBuilder();
try
{
if (String.IsNullOrEmpty(input)) {
Logger.Log(new ArgumentNullException(), "GetSplitSentence - input is null or empty");
return outputList;
}
bool isAletter = IsAValidLetter(input[0]);
// Each char is checked to see whether it is part of a word.
// If YES > store the char for later
// If NO > save the word (if any) and then save the punctuation
foreach (var _char in input)
{
isAletter = IsAValidLetter(_char);
if (isAletter == true)
{
sb.Append(_char);
}
else
{
SaveWord(sb.ToString());
sb.Clear();
SaveANotWord(_char);
}
}
SaveWord(sb.ToString());
}
catch (Exception ex)
{
Logger.Log(ex);
}
return outputList;
}
private static bool IsAValidLetter(char _char)
{
if ((Char.IsPunctuation(_char) == true) || (_char == ' ') || (Char.IsNumber(_char) == true))
{
return false;
}
return true;
}
private void SaveWord(string word)
{
if (String.IsNullOrEmpty(word) == false)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = true,
word = word
});
}
}
private void SaveANotWord(char _char)
{
outputList.Add(new SentenceSplitResult()
{
isAWord = false,
word = _char.ToString()
});
}
}
You could try using a regex to remove the apostrophes that aren't surrounded by letters (i.e. single quotes) and then using the Char static methods to strip all the other characters. By calling the regex first you can keep the contraction apostrophes (e.g. can't) but remove the single quotes like in 'Oh.
string myText = "'Oh, you can't help that,' said the Cat: 'we're all mad here. I'm mad. You're mad.'";
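// Remove quotes that lack a word character on at least one side
// (a plain "\b" in a non-verbatim string is a backspace, not a word boundary)
Regex reg = new Regex(@"(?<!\w)[""']|[""'](?!\w)");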
myText = reg.Replace(myText, "");
string[] listOfWords = RemoveCharacters(myText);
public string[] RemoveCharacters(string input)
{
StringBuilder sb = new StringBuilder();
foreach (char c in input)
{
if (Char.IsLetter(c) || Char.IsWhiteSpace(c) || c == '\'')
sb.Append(c);
}
return sb.ToString().Split(' ');
}
Just had this Pig Latin problem as "homework". The conditions I have been given are:
For words that begin with consonant sounds, all letters before the initial vowel are placed at the end of the word sequence. Then, ay is added.
For words that begin with vowel sounds move the initial vowel(s) along with the first consonant or consonant cluster to the end of the word and add ay.
For words that have no consonant add way.
Tested with:
Write a method that will convert an English sentence into Pig Latin
That turned into
itewray away ethodmay atthay illway onvertcay anay ishenglay entencesay ointay igpay atinlay
It does what it should, with one exception that is not in the rules and that I have no idea how to implement. The method I created does exactly what the problem asks, but if I try to convert a word made up entirely of consonants into Pig Latin, it does not work. For example, grrrrr in Pig Latin should be grrrrray.
public static string ToPigLatin(string sentencetext)
{
string vowels = "AEIOUaeiou";
//string cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ";
List<string> newWords = new List<string>();
foreach (string word in sentencetext.Split(' '))
{
if (word.Length == 1)
{
newWords.Add(word + "way");
}
if (word.Length == 2 && vowels.Contains(word[0]))
{
newWords.Add(word + "ay");
}
if (word.Length == 2 && vowels.Contains(word[1]) && !vowels.Contains(word[0]))
{
newWords.Add(word.Substring(1) + word.Substring(0, 1) + "ay");
}
if (word.Length == 2 && !vowels.Contains(word[1]) && !vowels.Contains(word[0]))
{
newWords.Add(word + "ay");
}
for (int i = 1; i < word.Length; i++)
{
if (vowels.Contains(word[i]) && (vowels.Contains(word[0])))
{
newWords.Add(word.Substring(i) + word.Substring(0, i) + "ay");
break;
}
}
for (int i = 0; i < word.Length; i++)
{
if (vowels.Contains(word[i]) && !(vowels.Contains(word[0])) && word.Length > 2)
{
newWords.Add(word.Substring(i) + word.Substring(0, i) + "ay");
break;
}
}
}
return string.Join(" ", newWords);
}
static void Main(string[] args)
{
//Console.WriteLine("Enter a sentence to convert to PigLatin:");
// string sentencetext = Console.ReadLine();
string pigLatin = ToPigLatin("Write a method that will convert an English sentence into Pig Latin");
Console.WriteLine(pigLatin);
Console.ReadKey();
}
Give this a go:
public static string ToPigLatin(string sentencetext)
{
string vowels = "AEIOUaeiou";
string cons = "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ";
Func<string, string> toPigLatin = word =>
{
word = word.ToLower();
var result = word;
Func<string, string, (string, string)> split = (w, l) =>
{
var prefix = new string(w.ToArray().TakeWhile(x => l.Contains(x)).ToArray());
return (prefix, w.Substring(prefix.Length));
};
if (!word.Any(w => cons.Contains(w)))
{
result = word + "way";
}
else
{
var (s, e) = split(word, vowels);
var (s2, e2) = split(e, cons);
result = e2 + s + s2 + "ay";
}
return result;
};
return string.Join(" ", sentencetext.Split(' ').Select(x => toPigLatin(x)));
}
The code:
string pigLatin = ToPigLatin("Grrrr Write a method that will convert an English sentence into Pig Latin");
Console.WriteLine(pigLatin);
gives:
grrrray itewray away ethodmay atthay illway onvertcay anay ishenglay entencesay ointay igpay atinlay
EDIT : Here's my current code (21233664 chars)
string str = myInput.Text;
StringBuilder sb = new StringBuilder();
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_' || c==' ')
{
sb.Append(c);
}
}
output.Text = sb.ToString();
Let's say I have a huge text file which contains special characters and normal expressions with underscores.
Here are a few examples of the strings that I'm looking for :
super_test
test
another_super_test
As you can see, only lower case letters are allowed with underscores.
Now, if I have those strings in a text file that looks like this :
> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È
The problem I'm facing is that some lonely letters are still saved. In the example given above, the output would be :
l super_test t
To get rid of those chars, I must go through the whole file again, but here's my question: how can I know whether a letter is lonely or not?
I'm not sure I understand the possibilities with regex, so if anyone can give me a hint I'd really appreciate it.
You clearly need a regular expression. A simple one would be [a-z_]{2,}, which takes all strings of lowercase a to z letters and underscore that are at least 2 characters long.
Just be careful when you are parsing the big file. Since it is huge, I imagine you use some sort of buffer. You need to make sure you don't get half of a word in one buffer and the other half in the next.
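A minimal sketch of how that pattern could be applied to a string that is already in memory. The names TokenExtractor and Extract are made up for illustration; note that the pattern as given also accepts runs made only of underscores, such as "__":

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenExtractor
{
    // Runs of lowercase a-z or underscore that are at least two characters long.
    private static readonly Regex Token = new Regex("[a-z_]{2,}");

    // Returns every matching run in order of appearance.
    public static List<string> Extract(string text)
    {
        return Token.Matches(text).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

string.Join(" ", TokenExtractor.Extract(str)) then rebuilds the cleaned line; for the sample line in the question only super_test survives. For a large file, one option is to call this once per line from File.ReadLines, which assumes that no token spans a line break.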
You can't treat the space just like the other acceptable characters. In addition to being acceptable, the space also serves as a delimiter for your lonesome characters. (This might be a problem with the proposed regular expressions as well; I couldn't say for sure.) Anyway, this does what (I think) you want:
string str = "> §> ˜;# ®> l? super_test D>ÿÿÿÿ “G? tI> €[> €? È";
StringBuilder sb = new StringBuilder();
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
int length = sb.Length;
if (firstLetterOfWord != null)
{
// c is the second character of a word
sb.Append(firstLetterOfWord);
sb.Append(c);
firstLetterOfWord = null;
}
else if (length == 0 || sb[length - 1] == ' ')
{
// c is the first character of a word; save for next iteration
firstLetterOfWord = c;
}
else
{
// c is part of a word; we're not first, and prev != space
sb.Append(c);
}
}
else if (c == ' ')
{
// If you want to eliminate multiple spaces in a row,
// this is the place to do so
sb.Append(' ');
firstLetterOfWord = null;
}
else
{
firstLetterOfWord = null;
}
}
Console.WriteLine(sb.ToString());
It works with singletons and full words at both start and end of string.
If your input contains something like one#two, the output will run together (onetwo with no intervening space). Assuming that's not what you want, and also assuming that you have no need for multiple spaces in a row:
StringBuilder sb = new StringBuilder();
bool previousWasSpace = true;
char? firstLetterOfWord = null;
foreach (char c in str)
{
if ((c >= 'a' && c <= 'z') || c == '_')
{
if (firstLetterOfWord != null)
{
sb.Append(firstLetterOfWord).Append(c);
firstLetterOfWord = null;
previousWasSpace = false;
}
else if (previousWasSpace)
{
firstLetterOfWord = c;
}
else
{
sb.Append(c);
}
}
else
{
firstLetterOfWord = null;
if (!previousWasSpace)
{
sb.Append(' ');
previousWasSpace = true;
}
}
}
Console.WriteLine(sb.ToString());