Remove list of words from string

I have a list of words that I want to remove from a string. I use the following method:
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
But it is not working, and the following text is output unchanged:
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam

For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!";
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
// find the next dot:
const char dot = '.';
int startIndex = 0;
int dotIndex = text.IndexOf(dot, startIndex);
while (dotIndex != -1)
{
// found a Dot, return the substring until the dot:
int wordLength = dotIndex - startIndex;
yield return text.Substring(startIndex, wordLength);
// find the next dot
startIndex = dotIndex + 1;
dotIndex = text.IndexOf(dot, startIndex);
}
// read until the end of the text. Return everything after the last dot:
yield return text.Substring(startIndex);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.
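One way to settle the TODOs above is to let string.Split do the splitting. The following is only a minimal sketch, not the answer's own implementation: the class and method names (WordCleaner, RemoveWords) are made up for illustration, and it assumes that empty words (from a leading, trailing or doubled dot) should simply be dropped. It filters with Where instead of Except, because Except also removes duplicate words from the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCleaner
{
    // Hypothetical helper: splits the text on '.', drops every word that
    // appears in badWords (compared case-insensitively) and joins the
    // remaining words with a single space.
    public static string RemoveWords(string text, IEnumerable<string> badWords)
    {
        var bad = new HashSet<string>(badWords, StringComparer.OrdinalIgnoreCase);

        // RemoveEmptyEntries discards the empty words produced by
        // inputs such as ".ABC.DEF" or "ABC.DEF.".
        var kept = text
            .Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !bad.Contains(word));

        return string.Join(" ", kept);
    }
}
```

With the string and word list from the question, WordCleaner.RemoveWords(stringToClean, BAD_WORDS) should give "The Flash 2014 S07E06".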

Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that you'll get partial matches. For example, your token array contains "720p", but the code I suggested here will also match inside "720px". There are ways around it: instead of using string's implementation of Replace, you could use a regex that also matches your delimiters, something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1"). The regex is not confirmed but should be close; I added a capture for the trailing delimiter in order to put it back for the next pass.
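The delimiter-aware idea can be sketched in a single pass. This is an illustration under assumptions, not the answer's confirmed regex: the names TokenRemover and RemoveTokens are hypothetical, the only delimiters considered are '.' and ' ', and the delimiters themselves are left in place because lookarounds are used instead of a capture.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenRemover
{
    // Hypothetical helper: removes each word only when it forms a whole
    // token, i.e. it is surrounded by '.', ' ' or the start/end of the text.
    public static string RemoveTokens(string text, string[] words)
    {
        // Escape each word so regex metacharacters such as '.' or '+'
        // are matched literally.
        string alternatives = string.Join("|", words.Select(Regex.Escape));

        // The lookbehind and lookahead check the delimiters without consuming them.
        string pattern = @"(?<=^|[. ])(?:" + alternatives + @")(?=[. ]|$)";

        return Regex.Replace(text, pattern, string.Empty, RegexOptions.IgnoreCase);
    }
}
```

Because the delimiters are kept, TokenRemover.RemoveTokens("The.Flash.720p.720px.WEB-DL", new[] { "720p", "web-dl" }) leaves "The.Flash..720px."; the "720px" token is untouched, and the leftover dots still need to be split or collapsed afterwards.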

Related

c# trying to change first letter to uppercase but doesn't work

I have to convert the first letter of every word the user inputs to uppercase. I don't think I'm doing it right, because it doesn't work, but I'm not sure where it has gone wrong D: Thank you in advance for your help! ^^
static void Main(string[] args)
{
Console.Write("Enter anything: ");
string x = Console.ReadLine();
string pattern = "^";
Regex expression = new Regex(pattern);
var regexp = new System.Text.RegularExpressions.Regex(pattern);
Match result = expression.Match(x);
Console.WriteLine(x);
foreach(var match in x)
{
Console.Write(match);
}
Console.WriteLine();
}

If your exercise isn't regex operations, there are built-in utilities to do what you are asking:
System.Globalization.TextInfo ti = System.Globalization.CultureInfo.CurrentCulture.TextInfo;
string titleString = ti.ToTitleCase("this string will be title cased");
Console.WriteLine(titleString);
Prints:
This String Will Be Title Cased
If your exercise is about regex, see this previous StackOverflow answer: Sublime Text: Regex to convert Uppercase to Title Case?

First of all, your Regex "^" matches only the start of a line. If you need to match each word in a multi-word line, you'll need a different Regex, e.g. @"\b[A-Za-z]", which matches the first letter of each word.
You're also not doing anything to actually change the first letter to upper case. Note that strings in C# are immutable (they cannot be changed after creation), so you will need to create a new string which consists of the first letter of the original string, upper cased, followed by the rest of the string. Give that part a try on your own. If you have trouble, post a new question with your attempt.

string pattern = "(?:^|(?<= ))(.)";
^ doesn't capture anything by itself. You can replace the captured letters with uppercase ones by applying a function to $1. See the demo.
https://regex101.com/r/uE3cC4/29
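"Applying a function to $1" can be done in C# with the Regex.Replace overload that takes a MatchEvaluator. A minimal sketch using the pattern above; the class and method names are made up for illustration, and it only treats a single space as a word separator, exactly as the pattern does.

```csharp
using System.Text.RegularExpressions;

public static class RegexTitleCase
{
    // Upper-cases the character at the start of the string and every
    // character that directly follows a space.
    public static string CapitalizeAfterSpaces(string input)
    {
        return Regex.Replace(
            input,
            @"(?:^|(?<= ))(.)",
            m => m.Groups[1].Value.ToUpperInvariant());
    }
}
```

For example, RegexTitleCase.CapitalizeAfterSpaces("hello big world") returns "Hello Big World".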

I would approach this using extension methods.
PHP has a nice method called ucfirst.
So I translated that into C#
public static string UcFirst(this string s)
{
var stringArr = s.ToCharArray(0, s.Length);
var char1ToUpper = char.Parse(stringArr[0]
.ToString()
.ToUpper());
stringArr[0] = char1ToUpper;
return string.Join("", stringArr);
}
Usage:
[Test]
public void UcFirst()
{
string s = "john";
s = s.UcFirst();
Assert.AreEqual("John", s);
}
Obviously you would still have to split your sentence into a list and call UcFirst for each item in the list.
Search for C# extension methods if you need help with what is going on.
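The "split your sentence and call UcFirst for each item" step could look roughly like this. It is a sketch under assumptions: UcFirstSafe and UcWords are hypothetical names, words are taken to be separated by single spaces, and the first-letter helper is restated with a guard so that empty strings do not throw.

```csharp
using System;
using System.Linq;

public static class StringCaseExtensions
{
    // Same idea as UcFirst above, but returns null or empty input unchanged.
    public static string UcFirstSafe(this string s)
    {
        if (string.IsNullOrEmpty(s))
            return s;
        return char.ToUpperInvariant(s[0]) + s.Substring(1);
    }

    // Splits on spaces, upper-cases the first letter of each word
    // and joins the words back together.
    public static string UcWords(this string sentence)
    {
        return string.Join(" ", sentence.Split(' ').Select(w => w.UcFirstSafe()));
    }
}
```

Usage: "john smith".UcWords() returns "John Smith".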

One more way to do it with regex:
string input = "this string will be title cased, even if there are.cases.like.that";
string output = Regex.Replace(input, @"(?<!\w)\w", m => m.Value.ToUpper());

I hope this may help
public static string CapsFirstLetter(string inputValue)
{
char[] values = new char[inputValue.Length];
int count = 0;
foreach (char f in inputValue){
if (count == 0){
values[count] = Convert.ToChar(f.ToString().ToUpper());
}
else{
values[count] = f;
}
count++;
}
return new string(values);
}

Using Regex.Replace to keep characters that can vary

I have the following:
string text = "version=\"1,0\"";
I want to replace the comma with a dot while keeping the 1 and 0, BUT keep in mind that they may be different in different situations! It could be version="2,3".
The smart ass and noob-unworking way to do it would be:
for (int i = 0; i <= 9; i++)
{
for (int z = 0; z <= 9; z++)
{
text = Regex.Replace(text, "version=\"i,z\"", "version=\"i.z\"");
}
}
But of course, it's a string, so i and z are treated as literal characters in there rather than as the loop variables.
I could also try the lame but working way:
text = Regex.Replace(text, "version=\"1,", "version=\"1.");
text = Regex.Replace(text, "version=\"2,", "version=\"2.");
text = Regex.Replace(text, "version=\"3,", "version=\"3.");
And so on.. but it would be lame.
Any hints on how to single-handedly handle this?
Edit: I have other commas that I don't wanna replace, so text.Replace(",",".") can't do

You need a regex like this to locate the comma
Regex reg = new Regex("(version=\"[0-9]),([0-9]\")");
Then do the replacement:
text = reg.Replace(text, "$1.$2");
You can use $1, $2, etc. to refer to the matching groups in order.

(?<=version=")(\d+),
You can try this. See the demo. Replace the match with $1 followed by a dot, i.e. the replacement string "$1.".
https://regex101.com/r/sJ9gM7/52
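In C# the lookbehind pattern above can be applied as follows. This is a small sketch; the class and method names are assumptions made for the example. Note that it only replaces the first comma after version=", so a value such as version="1,1,0" would need a different approach.

```csharp
using System.Text.RegularExpressions;

public static class VersionFixer
{
    // Replaces the comma that follows the digits right after version="
    // with a dot. Commas elsewhere in the text are left untouched.
    public static string FixVersion(string text)
    {
        return Regex.Replace(text, @"(?<=version="")(\d+),", "$1.");
    }
}
```

For example, VersionFixer.FixVersion("version=\"2,3\" a,b") returns version="2.3" a,b.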

You can perhaps use capture groups to keep the numbers in front and after for replacement afterwards for a more 'traditional way' to do it:
string text = "version=\"1,0\"";
var regex = new Regex(@"version=""(\d*),(\d*)""");
var result = regex.Replace(text, "version=\"$1.$2\"");
Using parens like the above in a regex is to create a capture group (so the matched part can be accessed later when needed) so that in the above, the digits before and after the comma will be stored in $1 and $2 respectively.
But I decided to delve a little bit further and let's consider the case if there are more than one comma to replace in the version, i.e. if the text was version="1,1,0". It would actually be tedious to do the above, and you would have to make one replace for each 'type' of version. So here's one solution that is sometimes called a callback in other languages (not a C# dev, but I fiddled around lambda functions and it seems to work :)):
private static string SpecialReplace(string text)
{
var result = text.Replace(',', '.');
return result;
}
public static void Main()
{
string text = "version=\"1,0,0\"";
var regex = new Regex(@"version=""[\d,]*""");
var result = regex.Replace(text, x => SpecialReplace(x.Value));
Console.WriteLine(result);
}
The above gives version="1.0.0".
"version=""[\d,]*""" will first match any sequence of digits and commas within version="...", then pass it to the next line for the replace.
The replace takes the matched text, passes it to the lambda function which takes it to the function SpecialReplace, where a simple text replace is carried out only on the matched part.

Remove last occurrence of a string in a string

I have a string that is of nature
RTT(50)
RTT(A)(50)
RTT(A)(B)(C)(50)
What I want is to remove the last () occurrence from the string. That is, if the string is RTT(50), then I want only RTT returned. If it is RTT(A)(50), I want RTT(A) returned, etc.
How do I achieve this? I currently use a substring method that takes out any occurrence of the () regardless. I thought of using:
Regex.Matches(node.Text, "( )").Count
To count the number of occurrences so I did something like below.
if(Regex.Matches(node.Text, "( )").Count > 1)
//value = node.Text.Remove(Regex.//Substring(1, node.Text.IndexOf(" ("));
else
value = node.Text.Substring(0, node.Text.IndexOf(" ("));
The else part will do what I want. However, how to remove the last occurrence in the if part is where I am stuck.

The String.LastIndexOf method does what you need - returns the last index of a char or string.
If you're sure that every string will have at least one set of parentheses:
var result = node.Text.Substring(0, node.Text.LastIndexOf("("));
Otherwise, you could test the result of LastIndexOf:
var lastParenSet = node.Text.LastIndexOf("(");
var result =
node.Text.Substring(0, lastParenSet > -1 ? lastParenSet : node.Text.Length);

This should do what you want :
your_string = your_string.Remove(your_string.LastIndexOf(string_to_remove));
It's that simple.

There are a couple of different options to consider.
LastIndexOf
Get the last index of the ( character and take the substring up to that index. The downside of this approach is an additional last index check for ) would be needed to ensure that the format is correct and that it's a pair with the closing parenthesis occurring after the opening parenthesis (I did not perform this check in the code below).
var index = input.LastIndexOf('(');
if (index >= 0)
{
var result = input.Substring(0, index);
Console.WriteLine(result);
}
Regex with RegexOptions.RightToLeft
By using RegexOptions.RightToLeft we can grab the last index of a pair of parentheses.
var pattern = @"\(.+?\)";
var match = Regex.Match(input, pattern, RegexOptions.RightToLeft);
if (match.Success)
{
var result = input.Substring(0, match.Index);
Console.WriteLine(result);
}
else
{
Console.WriteLine(input);
}
Regex depending on numeric format
If you're always expecting the final parentheses to have numeric content, similar to your example values where (50) is getting removed, we can use a pattern that matches any numbers inside parentheses.
var patternNumeric = @"\(\d+\)";
var result = Regex.Replace(input, patternNumeric, "");
Console.WriteLine(result);
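A further variation on the regex options above, offered as a sketch rather than a definitive answer: anchoring the pattern to the end of the string with $ removes only the final pair of parentheses, whatever it contains. The names ParenTrimmer and RemoveLastGroup are hypothetical, and the sketch assumes the last group sits at the very end of the input (optionally followed by whitespace) and contains no nested parentheses.

```csharp
using System.Text.RegularExpressions;

public static class ParenTrimmer
{
    // Removes the last "(...)" group if it ends the string.
    // Input without such a trailing group is returned unchanged.
    public static string RemoveLastGroup(string input)
    {
        return Regex.Replace(input, @"\([^()]*\)\s*$", string.Empty);
    }
}
```

For the example values, RemoveLastGroup("RTT(50)") gives "RTT" and RemoveLastGroup("RTT(A)(B)(C)(50)") gives "RTT(A)(B)(C)".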

It's very simple. You can easily achieve it like this:
string a = "RTT(50)";
string res = a.Substring(0, a.LastIndexOf("("));

As an extension:
namespace CustomExtensions
{
public static class StringExtension
{
public static string ReplaceLastOf(this string str, string fromStr, string toStr)
{
int lastIndexOf = str.LastIndexOf(fromStr);
if (lastIndexOf < 0)
return str;
string leading = str.Substring(0, lastIndexOf);
int charsToEnd = str.Length - (lastIndexOf + fromStr.Length);
string trailing = str.Substring(lastIndexOf+fromStr.Length, charsToEnd);
return leading + toStr + trailing;
}
}
}
Use:
string myFavColor = "My favourite color is blue";
string newFavColor = myFavColor.ReplaceLastOf("blue", "red");

Try a function like this:
public static string ReplaceLastOccurrence(string source, string find, string replace)
{
int place = source.LastIndexOf(find);
// Nothing to replace if the search string is not present
if (place < 0)
return source;
return source.Remove(place, find.Length).Insert(place, replace);
}
It removes the last occurrence of a string and replaces it with another one. Use it like this:
string result = ReplaceLastOccurrence(value, "(", string.Empty);
In this case, it finds the last ( inside the value string and replaces it with string.Empty. It can also be used to substitute any other text.

regex replace matches with function and delete other matches

I have a string like the one below and I want to replace the FieldNN instances with the output from a function.
So far I have been able to replace the NN instances with the output from the function. But I am not sure how I can delete the static "field" portion with the same regex.
input string:
(Field30="2010002257") and Field1="yuan" not Field28="AAA"
required output:
(IncidentId="2010002257") and Author="yuan" not Recipient="AAA"
This is the code I have so far:
public string translateSearchTerm(string searchTerm) {
string result = "";
result = Regex.Replace(searchTerm.ToLower(), @"(?<=field).*?(?=\=)", delegate(Match Match) {
string fieldId = Match.ToString();
return String.Format("_{0}", getFieldName(Convert.ToInt64(fieldId)));
});
log.Info(String.Format("result={0}", result));
return result;
}
which gives:
(field_IncidentId="2010002257") and field_Author="yuan" not field_Recipient="aaa"
The issues I would like to resolve are:
Remove the static "field" prefixes from the output.
Make the regex case-insensitive on the "FieldNN" parts and not lowercase the quoted text portions.
Make the regex more robust so that the quoted string parts can use either double or single quotes.
Make the regex more robust so that spaces are ignored: FieldNN = "AAA" vs. FieldNN="AAA"
I really only need to address the first issue, the other three would be a bonus but I could probably fix those once I have discovered the right patterns for whitespace and quotes.
Update
I think the pattern below solves issues 2. and 4.
result = Regex.Replace(searchTerm, @"(?<=\b(?i:field)).*?(?=\s*\=)", delegate(Match Match)

To fix first issue use groups instead of positive lookbehind:
public string translateSearchTerm(string searchTerm) {
string result = "";
result = Regex.Replace(searchTerm.ToLower(), @"field(.*?)(?=\=)", delegate(Match Match) {
string fieldId = Match.Groups[1].Value;
return getFieldName(Convert.ToInt64(fieldId));
});
log.Info(String.Format("result={0}", result));
return result;
}
In this case "field" prefix will be included in each match and will be replaced.
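The remaining wishes from the question (case-insensitive "FieldNN", untouched quoted text, optional spaces around =) can be covered by matching only the field token itself. A hedged sketch: SearchTermTranslator and Translate are hypothetical names, and the lookup is passed in as a delegate because getFieldName is not shown in the question. The quoted value is never part of the match, so single or double quotes and their casing pass through unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SearchTermTranslator
{
    // Replaces each FieldNN token that is followed by '=' (with optional
    // whitespace in between) by the name returned from getFieldName.
    public static string Translate(string searchTerm, Func<long, string> getFieldName)
    {
        return Regex.Replace(
            searchTerm,
            @"\bfield(\d+)(?=\s*=)",
            m => getFieldName(long.Parse(m.Groups[1].Value)),
            RegexOptions.IgnoreCase);
    }
}
```

With a lookup that maps 30, 1 and 28 to IncidentId, Author and Recipient, the input (Field30="2010002257") and Field1="yuan" not Field28="AAA" becomes (IncidentId="2010002257") and Author="yuan" not Recipient="AAA".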

Find all substrings between two strings

I need to get all substrings from string.
For ex:
StringParser.GetSubstrings("[start]aaaaaa[end] wwwww [start]cccccc[end]", "[start]", "[end]");
that returns 2 string "aaaaaa" and "cccccc"
Suppose we have only one level of nesting.
Not sure about regexp, but I think it will be useful.

private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
MatchCollection matches = r.Matches(input);
foreach (Match match in matches)
yield return match.Groups[1].Value;
}

Here's a solution that doesn't use regular expressions and doesn't take nesting into consideration.
public static IEnumerable<string> EnclosedStrings(
this string s,
string begin,
string end)
{
int beginPos = s.IndexOf(begin, 0);
while (beginPos >= 0)
{
int start = beginPos + begin.Length;
int stop = s.IndexOf(end, start);
if (stop < 0)
yield break;
yield return s.Substring(start, stop - start);
beginPos = s.IndexOf(begin, stop+end.Length);
}
}

You can use a regular expression, but remember to call Regex.Escape on your arguments:
public static IEnumerable<string> GetSubStrings(
string text,
string start,
string end)
{
string regex = string.Format("{0}(.*?){1}",
Regex.Escape(start),
Regex.Escape(end));
return Regex.Matches(text, regex, RegexOptions.Singleline)
.Cast<Match>()
.Select(match => match.Groups[1].Value);
}
I also added the Singleline option so that it will match even if there are new-lines in your text.

You're going to need to better define the rules that govern your matching needs. When building any kind of matching or search code you need to be very clear about what inputs you anticipate and what outputs you need to produce. It's very easy to produce buggy code if you don't take these questions into close consideration. That said...
You should be able to use regular expressions. Nesting may make it slightly more complicated but still doable (depending on what you expect to match in nested scenarios). Something like this should get you started:
var start = "[start]";
var end = "[end]";
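// The lazy quantifier (.*?) stops at the first [end] instead of the last one
var regEx = new Regex(String.Format("{0}(.*?){1}", Regex.Escape(start), Regex.Escape(end)));
var source = "[start]aaaaaa[end] wwwww [start]cccccc[end]";
// Matches returns every occurrence; Groups[1] of each match holds the enclosed text
var matches = regEx.Matches( source );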
It should be trivial to wrap the code above into a function appropriate for your needs.

I was bored, and thus I made a useless micro benchmark which "proves" (on my dataset, which has strings up to 7k of characters and <b> tags for start/end parameters) my suspicion that juharr's solution is the fastest of the three overall.
Results (1000000 iterations * 20 test cases):
juharr: 6371ms
Jake: 6825ms
Mark Byers: 82063ms
NOTE: Compiled regex didn't speed things up much on my dataset.
