Match and split string with regex

Match and split string with regex - c#

I want to validate an input string against a regular expression and then split it.
The input string can be any combination of the letter A and letter A followed by an exclamation mark. For example these are valid input strings: A, A!, AA, AA!, A!A, A!A!, AAA, AAA!, AA!A, A!AA, ... Any other characters should yield an invalid match.
My code would probably look something like this:
public string[] SplitString(string s)
{
Regex regex = new Regex(#"...");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return regex.Split(s);
}
How should my regex look like?
Edit - some examples:
input string "AAA", function should return an array of 3 strings ("A", "A", "A")
input string "A!AAA!", function should return an array of 4 strings ("A!", "A", "A", "A!")
input string "AA!b", function should throw an ArgumentException

Doesn't seem like a Regex is a good plan here. Have a look at this:
private bool ValidString(string myString)
{
char[] validChars = new char[] { 'A', '!' };
if (!myString.StartsWith("A"))
return false;
if (myString.Contains("!!"))
return false;
foreach (char c in myString)
{
if (!validChars.Contains(c))
return false;
}
return true;
}
private List<string> SplitMyString(string myString)
{
List<string> resultList = new List<string>();
if (ValidString(myString))
{
string resultString = "";
foreach (char c in myString)
{
if (c == 'A')
resultString += c;
if (c == '!')
{
resultString += c;
resultList.Add(string.Copy(resultString));
resultString = "";
}
}
}
return resultList;
}
The reason for Regex not being a good plan is that you can write the logic out in a few simple if-statements that compile and function a lot faster and cheaper. Also Regex isn't so good at repeating patterns for an unlimited length string. You'll either end up writing a long Regex or something illegible.
EDIT
At the end of my code you will either have a List<string> with the split input string like in your question. Or an empty List<string>. You can adjust it a little to throw an ArgumentException if that requirement is very important to you. Alternatively you can do a Count on the list to see if it was successful.

Regex regex = new Regex(#"^(A!|A)+$");
Edit:
Use something like http://gskinner.com/RegExr/ to play with Regular Expressions
Edit after comment:
Ok, you have made it a bit more clear what you want. Don't approach it like that. Because in what you are doing, you cannot expect to match the entire input and then split as it would be the entire input. Either use separate regular expression for the split part, or use groups to get the matched values.
Example:
//Initial match part
Regex regex2 = new Regex(#"(A!)|(A)");
return regex2.Split(s);
And again, regular expressions are not always the answer. See how this might impact your application.

You could try something like:
Regex regex = new Regex(#"^[A!]+$");

((A+!?)+)
Try looking at Espresso http://www.ultrapico.com/Expresso.htm or Rad Software Regular Expression Designer http://www.radsoftware.com.au/regexdesigner/ for designing and testing RE's.

I think I have a solution that satisfies all examples. I've had to break it into two regular expressions (which I don't like)...
public string[] SplitString(string s)
{
Regex regex = new Regex(#"^[A!]+$");
if (!regex.IsMatch(s))
{
throw new ArgumentException("Wrong input string!");
}
return Regex.Split(s, #"(A!?)").Where(x => !string.IsNullOrEmpty(x)).ToArray();
}
Note the use of linq - required to remove the empty matches.

Related

Remove list of words from string

I have a list of words that I want to remove from a string I use the following method
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
but it is not working And the following text is output
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam

For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!"
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
// find the next dot:
const char dot = '.';
int startIndex = 0;
int dotIndex = text.IndexOf(dot, startIndex);
while (dotIndex != -1)
{
// found a Dot, return the substring until the dot:
int wordLength = dotIndex - startIndex;
yield return text.Substring(startIndex, wordLength;
// find the next dot
startIndex = dotIndex + 1;
dotIndex = text.IndexOf(dot, startIndex);
}
// read until the end of the text. Return everything after the last dot:
yield return text.SubString(startIndex, text.Length);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.

Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that you'll get partials. So for example in your token array you have "720p" but the code I suggested here will still match on "720px" but there are still ways around it. For example instead of using string's implementation of Replace you could use a regex that will match your delimiters something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1") (regex not confirmed but should be close and I added a capture for the delimiter in order to put it back for the next pass)

Get only Whole Words from a .Contains() statement

I've used .Contains() to find if a sentence contains a specific word however I found something weird:
I wanted to find if the word "hi" was present in a sentence which are as follows:
The child wanted to play in the mud
Hi there
Hector had a hip problem
if(sentence.contains("hi"))
{
//
}
I only want the SECOND sentence to be filtered however all 3 gets filtered since CHILD has a 'hi' in it and hip has a 'hi' in it. How do I use the .Contains() such that only whole words get picked out?

Try using Regex:
if (Regex.Match(sentence, #"\bhi\b", RegexOptions.IgnoreCase).Success)
{
//
};
This works just fine for me on your input text.

Here's a Regex solution:
Regex has a Word Boundary Anchor using \b
Also, if the search string might come from user input, you might consider escaping the string using Regex.Escape
This example should filter a list of strings the way you want.
string findme = "hi";
string pattern = #"\b" + Regex.Escape(findme) + #"\b";
Regex re = new Regex(pattern,RegexOptions.IgnoreCase);
List<string> data = new List<string> {
"The child wanted to play in the mud",
"Hi there",
"Hector had a hip problem"
};
var filtered = data.Where(d => re.IsMatch(d));
DotNetFiddle Example

You could split your sentence into words - you could split at each space and then trim any punctuation. Then check if any of these words are 'hi':
var punctuation = source.Where(Char.IsPunctuation).Distinct().ToArray();
var words = sentence.Split().Select(x => x.Trim(punctuation));
var containsHi = words.Contains("hi", StringComparer.OrdinalIgnoreCase);
See a working demo here: https://dotnetfiddle.net/AomXWx

You could write your own extension method for string like:
static class StringExtension
{
public static bool ContainsWord(this string s, string word)
{
string[] ar = s.Split(' ');
foreach (string str in ar)
{
if (str.ToLower() == word.ToLower())
return true;
}
return false;
}
}

c# trying to change first letter to uppercase but doesn't work

I have to convert the first letter of every word the user inputs into uppercase. I don't think I'm doing it right so it doesn't work but I'm not sure where has gone wrong D: Thank you in advance for your help! ^^
static void Main(string[] args)
{
Console.Write("Enter anything: ");
string x = Console.ReadLine();
string pattern = "^";
Regex expression = new Regex(pattern);
var regexp = new System.Text.RegularExpressions.Regex(pattern);
Match result = expression.Match(x);
Console.WriteLine(x);
foreach(var match in x)
{
Console.Write(match);
}
Console.WriteLine();
}

If your exercise isn't regex operations, there are built-in utilities to do what you are asking:
System.Globalization.TextInfo ti = System.Globalization.CultureInfo.CurrentCulture.TextInfo;
string titleString = ti.ToTitleCase("this string will be title cased");
Console.WriteLine(titleString);
Prints:
This String Will Be Title Cased
If you operation is for regex, see this previous StackOverflow answer: Sublime Text: Regex to convert Uppercase to Title Case?

First of all, your Regex "^" matches the start of a line. If you need to match each word in a multi-word line, you'll need a different Regex, e.g. "[A-Za-z]".
You're also not doing anything to actually change the first letter to upper case. Note that strings in C# are immutable (they cannot be changed after creation), so you will need to create a new string which consists of the first letter of the original string, upper cased, followed by the rest of the string. Give that part a try on your own. If you have trouble, post a new question with your attempt.

string pattern = "(?:^|(?<= ))(.)"
^ doesnt capture anything by itself.You can replace by uppercase letters by applying function to $1.See demo.
https://regex101.com/r/uE3cC4/29

I would approach this using Model Extensions.
PHP has a nice method called ucfirst.
So I translated that into C#
public static string UcFirst(this string s)
{
var stringArr = s.ToCharArray(0, s.Length);
var char1ToUpper = char.Parse(stringArr[0]
.ToString()
.ToUpper());
stringArr[0] = char1ToUpper;
return string.Join("", stringArr);
}
Usage:
[Test]
public void UcFirst()
{
string s = "john";
s = s.UcFirst();
Assert.AreEqual("John", s);
}
Obviously you would still have to split your sentence into a list and call UcFirst for each item in the list.
Google C# Model Extensions if you need help with what is going on.

One more way to do it with regex:
string input = "this string will be title cased, even if there are.cases.like.that";
string output = Regex.Replace(input, #"(?<!\w)\w", m => m.Value.ToUpper());

I hope this may help
public static string CapsFirstLetter(string inputValue)
{
char[] values = new char[inputValue.Length];
int count = 0;
foreach (char f in inputValue){
if (count == 0){
values[count] = Convert.ToChar(f.ToString().ToUpper());
}
else{
values[count] = f;
}
count++;
}
return new string(values);
}

Extract sentence with a keyword

So I've got this question.
Write a program that extracts from a text all sentences that contain a particular word.
We accept that the sentences are separated from each other by the character "." and the words are separated from one another by a character which is not a letter.
Sample text:
We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.
Sample result:
We are living in a yellow submarine.
We will move out of it in 5 days.
This my code so far.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
var iter = sentence.GetEnumerator();
while(iter.MoveNext())
{
if(iter.Current.ToString() == keyword)
answer += sentence;
}
}
return answer;
}
Well it does not work. I call it with this code:
string example = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string keyword = "in";
string answer = Extract(example, keyword);
Console.WriteLine(answer);
which does not output anything. It's probably the iterator part since I'm not familiar with iterators.
Anyhow, the hint for the question says we should use split and IndexOf methods.

sentence.GetEnumerator() is returning a CharEnumerator, so you're examining each character in each sentence. A single character will never be equal to the string "in", which is why it isn't working. You'll need to look at each word in each sentence and compare with the term you're looking for.

Try:
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
{
//Add any other required punctuation characters for splitting words in the sentence
string[] words = sentence.Split(new char[] { ' ', ',' });
if(words.Contains(keyword)
{
answer += sentence;
}
}
return answer;
}

Your code goes through each sentence character by character using the iterator. Unless the keyword is a single-character word (e.g. "I" or "a") there will be no match.
One way of solving this is to use LINQ to check if a sentence has the keyword, like this:
foreach(string sentence in arr)
{
if(sentence.Split(' ').Any(w => w == keyword))
answer += sentence+". ";
}
Demo on ideone.
Another approach would be using regular expressions to check for matches only on word boundaries. Note that you cannot use a plain Contains method, because doing so results in "false positives" (i.e. finding sentences where the keyword is embedded inside a longer word).
Another thing to note is the use of += for concatenation. This approach is inefficient, because many temporary throw-away objects get created. A better way of achieving the same result is using StringBuilder.

string input = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
var lookup = input.Split('.')
.Select(s => s.Split().Select(w => new { w, s }))
.SelectMany(x => x)
.ToLookup(x => x.w, x => x.s);
foreach(var sentence in lookup["in"])
{
Console.WriteLine(sentence);
}

I would split the input at the periods and followed by searching each sentence for the given word.
string metin = "We are living in a yellow submarine. We don't have anything else. Inside the submarine is very tight. So we are drinking all the day. We will move out of it in 5 days.";
string[] metinDizisi = metin.Split('.');
string answer = string.Empty;
for (int i = 0; i < metinDizisi.Length; i++)
{
if (metinDizisi[i].Contains(" in "))
{
answer += metinDizisi[i];
}
}
Console.WriteLine(answer);

You can use sentence.Contains(keyword) to check if the string has the word you are looking for.
public static string Extract(string str, string keyword)
{
string[] arr = str.Split('.');
string answer = string.Empty;
foreach(string sentence in arr)
if(sentence.Contains(keyword))
answer+=sentence;
return answer;
}

You could split on the period to get a collection of sentences, then filter those with a regex containing the keyword.
var results = example.Split('.')
.Where(s => Regex.IsMatch(s, String.Format(#"\b{0}\b", keyword)));

Using regex or string manipulation when creating permalinks

I have following method(and looks expensive too) for creating permalinks but it's lacking few stuff that are quite important for nice permalink:
public string createPermalink(string text)
{
text = text.ToLower().TrimStart().TrimEnd();
foreach (char c in text.ToCharArray())
{
if (!char.IsLetterOrDigit(c) && !char.IsWhiteSpace(c))
{
text = text.Replace(c.ToString(), "");
}
if (char.IsWhiteSpace(c))
{
text = text.Replace(c, '-');
}
}
if (text.Length > 200)
{
text = text.Remove(200);
}
return text;
}
Few stuff that it is lacking:
if someone enters text like this:
"My choiches are:foo,bar" would get returned as "my-choices-arefoobar"
and it should be like: "my-choiches-are-foo-bar"
and If someone enters multiple white spaces it would get returned as "---" which is not nice to have in url.
Is there some better way to do this in regex(I really only used it few times)?
UPDATE:
Requirement was:
Any non digit or letter chars at beginning or end are not allowed
Any non digit or letter chars should be replaced by "-"
When replaced with "-" chars should not reapeat like "---"
And finally stripping string at index 200 to ensure it's not too long

Change to
public string createPermalink(string text)
{
text = text.ToLower();
StringBuilder sb = new StringBuilder(text.Length);
// We want to skip the first hyphenable characters and go to the "meat" of the string
bool lastHyphen = true;
// You can enumerate directly a string
foreach (char c in text)
{
if (char.IsLetterOrDigit(c))
{
sb.Append(c);
lastHyphen = false;
}
else if (!lastHyphen)
{
// We use lastHyphen to not put two hyphens consecutively
sb.Append('-');
lastHyphen = true;
}
if (sb.Length == 200)
{
break;
}
}
// Remove the last hyphen
if (sb.Length > 0 && sb[sb.Length - 1] == '-')
{
sb.Length--;
}
return sb.ToString();
}
If you really want to use regexes, you can do something like this (based on the code of Justin)
Regex rgx = new Regex(#"^\W+|\W+$");
Regex rgx2 = new Regex(#"\W+");
return rgx2.Replace(rgx.Replace(text.ToLower(), string.Empty), "-");
The first regex searches for non-word characters (1 or more) at the beginning (^) or at the end of the string ($) and removes them. The second one replaces one or more non-word characters with -.

This should solve the problem that you have explained. Please let me know if it needs any further explanation.
Just as an FYI, the regex makes use of lookarounds to get it done in one run
//This will find any non-character word, lumping them in one group if more than 1
//It will ignore non-character words at the beginning or end of the string
Regex rgx = new Regex(#"(?!\W+$)\W+(?<!^\W+)");
//This will then replace those matches with a -
string result = rgx.Replace(input, "-");
To keep the string from going beyond 200 characters, you will have to use substring. If you do this before the regex, then you will be ok, but if you do it after, then you run the risk of having a trailing dash again, FYI.
example:
myString.Substring(0,200)

I use an iterative approach for this - because in some cases you might want certain characters to be turned into words instead of having them turned into '-' characters - e.g. '&' -> 'and'.
But when you're done you'll also end up with a string that potentially contains multiple '-' - so you have a final regex that collapses all multiple '-' characters into one.
So I would suggest using an ordered list of regexes, and then run them all in order. This code is written to go in a static class that is then exposed as a single extension method for System.String - and is probably best merged into the System namespace.
I've hacked it from code I use, which had extensibility points (e.g. you could pass in a MatchEvaluator on construction of the replacement object for more intelligent replacements; and you could pass in your own IEnumerable of replacements, as the class was public), and therefore it might seem unnecessarily complicated - judging by the other answers I'm guessing everybody will think so (but I have specific requirements for the SEO of the strings that are created).
The list of replacements I use might not be exactly correct for your uses - if not, you can just add more.
private class SEOSymbolReplacement
{
private Regex _rx;
private string _replacementString;
public SEOSymbolReplacement(Regex r, string replacement)
{
//null-checks required.
_rx = r;
_replacementString = replacement;
}
public string Execute(string input)
{
/null-check required
return _rx.Replace(input, _replacementString);
}
}
private static readonly SEOSymbolReplacement[] Replacements = {
new SEOSymbolReplacement(new Regex(#"#", RegexOptions.Compiled), "Sharp"),
new SEOSymbolReplacement(new Regex(#"\+", RegexOptions.Compiled), "Plus"),
new SEOSymbolReplacement(new Regex(#"&", RegexOptions.Compiled), " And "),
new SEOSymbolReplacement(new Regex(#"[|:'\\/,_]", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(#"\s+", RegexOptions.Compiled), "-"),
new SEOSymbolReplacement(new Regex(#"[^\p{L}\d-]",
RegexOptions.IgnoreCase | RegexOptions.Compiled), ""),
new SEOSymbolReplacement(new Regex(#"-{2,}", RegexOptions.Compiled), "-")};
/// <summary>
/// Transforms the string into an SEO-friendly string.
/// </summary>
/// <param name="str"></param>
public static string ToSEOPathString(this string str)
{
if (str == null)
return null;
string toReturn = str;
foreach (var replacement in DefaultReplacements)
{
toReturn = replacement.Execute(toReturn);
}
return toReturn;
}

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Match and split string with regex - c#

You could try something like: Regex regex = new Regex(#"^[A!]+$");

((A+!?)+) Try looking at Espresso http://www.ultrapico.com/Expresso.htm or Rad Software Regular Expression Designer http://www.radsoftware.com.au/regexdesigner/ for designing and testing RE's.

Related

Remove list of words from string

Get only Whole Words from a .Contains() statement

c# trying to change first letter to uppercase but doesn't work

Extract sentence with a keyword

Using regex or string manipulation when creating permalinks

Categories

Resources