I have a
string word = "degree/NN";
What I want is to remove the "/NN" part of the word and take only the word "degree".
I have the following conditions:
The length of the word can be different in different occasions. (can be any word therefore the length is not fixed)
But the word will contain the "/NN" part at the end always.
How can I do this in C# .NET?
Implemented as an extension method:
static class StringExtension
{
public static string RemoveTrailingText(this string text, string textToRemove)
{
if (!text.EndsWith(textToRemove))
return text;
return text.Substring(0, text.Length - textToRemove.Length);
}
}
Usage:
string whatever = "degree/NN".RemoveTrailingText("/NN");
This takes into account that the unwanted part "/NN" is only removed from the end of the word, as you specified. A simple Replace would remove every occurrence of "/NN". However, that might not be a problem in your special case.
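To make the difference concrete, here is a minimal sketch that puts both behaviours side by side. The class name TrailingTextDemo and the sample input "up/NN/down/NN" are made up for illustration, and the ordinal comparison is my own addition so the result does not depend on the current culture.

```csharp
using System;

static class TrailingTextDemo
{
    // Same logic as the extension method above, written as a plain static
    // method and using an ordinal (culture-independent) comparison.
    public static string RemoveTrailingText(string text, string textToRemove)
    {
        if (!text.EndsWith(textToRemove, StringComparison.Ordinal))
            return text;
        return text.Substring(0, text.Length - textToRemove.Length);
    }

    // string.Replace removes every occurrence, not only the trailing one.
    public static string RemoveEverywhere(string text, string textToRemove)
    {
        return text.Replace(textToRemove, string.Empty);
    }
}
```

With the input "up/NN/down/NN", RemoveTrailingText gives "up/NN/down" while RemoveEverywhere gives "up/down".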
You can shorten the input string by three characters using String.Remove like this:
string word = "degree/NN";
string result = word.Remove(word.Length - 3);
If the part after the slash has variable length, you can use String.LastIndexOf to find the slash:
string word = "degree/NN";
string result = word.Remove(word.LastIndexOf('/'));
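One thing to watch: LastIndexOf returns -1 when there is no slash, and Remove throws for a negative start index. A small guarded sketch, assuming you want the input back unchanged in that case (TagStripper and StripTag are hypothetical names):

```csharp
using System;

static class TagStripper
{
    // Returns the text before the last '/', or the input unchanged when
    // it contains no slash at all.
    public static string StripTag(string word)
    {
        int slash = word.LastIndexOf('/');
        return slash >= 0 ? word.Remove(slash) : word;
    }
}
```

Because it looks for the last slash, a word that itself contains a slash, such as "and/or/CC", keeps everything before the tag.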
Simply use
word = word.Replace(@"/NN","");
edit
Forgot to add word =. Fixed that in my example.
Try this:
string.Replace();
If you need to replace patterns, use Regex.Replace:
Regex rgx = new Regex("/NN");
string result = rgx.Replace("degree/NN", string.Empty);
Related
I have a list of words that I want to remove from a string. I use the following method:
string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = {
"720p", "web-dl", "hevc", "x265", "Rmteam", "."
};
var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));
But it is not working, and the following text is output:
The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam
For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read extension methods demystified
public static IEnumerable<string> ToWords(this string text)
{
// TODO implement
}
Usage will be as follows:
string text = "This is some wild text!";
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();
Once you've got this method, the solution to your problem will be easy:
IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);
Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
The implementation of ToWords depends on what you would call a word: everything delimited by a dot? or do you want to support whitespaces? or maybe even new-lines?
The implementation for your problem: A word is any sequence of characters delimited by a dot.
public static IEnumerable<string> ToWords(this string text)
{
// find the next dot:
const char dot = '.';
int startIndex = 0;
int dotIndex = text.IndexOf(dot, startIndex);
while (dotIndex != -1)
{
// found a Dot, return the substring until the dot:
int wordLength = dotIndex - startIndex;
yield return text.Substring(startIndex, wordLength);
// find the next dot
startIndex = dotIndex + 1;
dotIndex = text.IndexOf(dot, startIndex);
}
// read until the end of the text. Return everything after the last dot:
yield return text.Substring(startIndex);
}
TODO:
Decide what you want to return if text starts with a dot ".ABC.DEF".
Decide what you want to return if the text ends with a dot: "ABC.DEF."
Check if the return value is what you want if text is empty.
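If you decide that empty words should simply be dropped in all three cases, a shorter sketch can lean on string.Split. This is one possible answer to the TODO items, not the only one. RemoveWords is a hypothetical helper name, and it filters with Where rather than Except because Except has set semantics and would also collapse repeated words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordExtensions
{
    // Splits on dots and drops empty entries, so ".ABC.DEF." yields
    // ABC and DEF, and an empty string yields no words at all.
    public static IEnumerable<string> ToWords(this string text)
    {
        return text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Joins the words that are not in badWords with spaces, ignoring case.
    public static string RemoveWords(this string text, IEnumerable<string> badWords)
    {
        var bad = new HashSet<string>(badWords, StringComparer.OrdinalIgnoreCase);
        return string.Join(" ", text.ToWords().Where(w => !bad.Contains(w)));
    }
}
```

For the input from the question and the bad words "720p", "web-dl", "hevc", "x265" and "Rmteam", this gives "The Flash 2014 S07E06".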
Your split/join don't match up with your input.
That said, here's a quick one-liner:
string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));
This is basically a "reduce". Not fantastically performant but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option but it should work for the example case you've given us.
Edit: The downside of this approach is that you'll get partial matches. For example, your token array contains "720p", but the code I suggested here will also match inside "720px". There are ways around that: instead of string's implementation of Replace you could use a regex that also matches your delimiters, something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1"). I have not tested that regex, but it should be close; the capture group puts the trailing delimiter back so the next pass can match on it.
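Here is one way the whole-token idea could look. It is a sketch under my own assumptions: dots and spaces are the only delimiters, matching ignores case, and the pattern is a variation that uses a lookbehind instead of consuming the leading delimiter. TokenCleaner and RemoveTokens are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class TokenCleaner
{
    // Removes each bad word only where it forms a whole token: it must be
    // preceded by a delimiter (or the start of the string) and followed by
    // a delimiter (or the end). The trailing delimiter is removed with it.
    public static string RemoveTokens(string input, string[] badWords)
    {
        string result = input;
        foreach (string word in badWords)
        {
            string pattern = @"(?<![^. ])" + Regex.Escape(word) + @"(?:[. ]|$)";
            result = Regex.Replace(result, pattern, string.Empty, RegexOptions.IgnoreCase);
        }
        // Removing the last token leaves its leading delimiter behind.
        return result.TrimEnd('.', ' ');
    }
}
```

With "720p" as a bad word, "Clip.720px.720p" becomes "Clip.720px": the whole token is removed and the partial match is left alone.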
I have a string which I extract from an HTML document like this:
var elas = htmlDoc.DocumentNode.SelectSingleNode("//a[@class='a-size-small a-link-normal a-text-normal']");
if (elas != null)
{
//
_extractedString = elas.Attributes["href"].Value;
}
The HREF attribute contains this part of the string:
gp/offer-listing/B002755TC0/
And I'm trying to extract the B002755TC0 value, but the problem here is that the string will vary by its length and I cannot simply use Substring method that C# offers to extract that value...
Instead, I was wondering whether there's a clever way to do this, perhaps by matching a known part of the string that comes right before what I'm looking for?
For example I know for a fact that each href has this structure like I've shown, So I would simply match these keywords:
offer-listing/
So I would find this keyword and start extracting the part of the string B002755TC0 until the next " / " sign ?
Can someone help me out with this ?
This is a perfect job for a regular expression :
string text = "gp/offer-listing/B002755TC0/";
Regex pattern = new Regex(@"offer-listing/(\w+)/");
Match match = pattern.Match(text);
string whatYouAreLookingFor = match.Groups[1].Value;
Explanation : we just match the exact pattern you need.
'offer-listing/'
followed by any combination of (at least one) 'word characters' (letters, digits and underscore; note that a hyphen is not a word character),
followed by a slash.
The parentheses () mean 'capture this group' (so we can extract it later with match.Groups[1]).
EDIT: if you want to extract also from this : /dp/B01KRHBT9Q/
Then you could use this pattern :
Regex pattern = new Regex(@"/(\w+)/$");
which will match both this string and the previous. The $ stands for the end of the string, so this literally means :
capture the characters in between the last two slashes of the string
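A short sketch that runs the end-anchored pattern against both hrefs and checks match.Success first, so an href that doesn't fit yields null rather than an empty group. LastSegment and Extract are hypothetical names.

```csharp
using System.Text.RegularExpressions;

static class LastSegment
{
    // Returns the text between the last two slashes, or null when the
    // input does not end in "/<word characters>/".
    public static string Extract(string href)
    {
        Match match = Regex.Match(href, @"/(\w+)/$");
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

Extract("gp/offer-listing/B002755TC0/") gives "B002755TC0" and Extract("/dp/B01KRHBT9Q/") gives "B01KRHBT9Q"; an href without a trailing slash gives null.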
Though there is already an accepted answer, I thought I'd share another solution that doesn't use Regex. Find the position of your pattern in the input and add its length; the wanted text starts at that index. To find the end, search for the first "/" after the beginning of the wanted text:
string input = "gp/offer-listing/B002755TC0/";
string pat = "offer-listing/";
int begining = input.IndexOf(pat)+pat.Length;
int end = input.IndexOf("/",begining);
string result = input.Substring(begining,end-begining);
If your desired output is always the last piece, you could also use split and get the last non-empty piece:
string result2 = input.Split(new string[]{"/"},StringSplitOptions.RemoveEmptyEntries)
.ToList().Last();
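The IndexOf version assumes both the marker and the closing slash are present. If that might not hold, a guarded sketch could look like this (SegmentFinder and After are hypothetical names, and returning null for "not found" is my own choice):

```csharp
using System;

static class SegmentFinder
{
    // Returns the text between marker and the next '/', or null when the
    // marker or the closing slash is missing.
    public static string After(string input, string marker)
    {
        int markerIndex = input.IndexOf(marker, StringComparison.Ordinal);
        if (markerIndex < 0)
            return null;
        int start = markerIndex + marker.Length;
        int end = input.IndexOf('/', start);
        return end < 0 ? null : input.Substring(start, end - start);
    }
}
```

Without the checks, a missing marker would make IndexOf return -1 and the Substring call would either throw or return the wrong slice.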
I came across How to search and replace exact matching strings only. However, it doesn't work when there are words that start with @. My fiddle is here: https://dotnetfiddle.net/9kgW4h
string textToFind = string.Format(@"\b{0}\b", "@bob");
Console.WriteLine(Regex.Replace("@bob!", textToFind, "me"));// "@bob!" instead of "me!"
In addition, if a word starts with \@, for example \@myname, and I try to find and replace @myname, it shouldn't do the replace.
I suggest replacing the leading and trailing word boundaries with unambiguous lookaround-based boundaries that will require whitespace chars or start/end of string on both ends of the search word, (?<!\S) and (?!\S). Besides, you need to use $$ in the replacement pattern to replace with a literal $.
I suggest:
using System;
using System.Text.RegularExpressions;
public class Program
{
public static void Main()
{
string text = @"It is @google.com or @google w@google \@google \\@google";
string result = SafeReplace(text,"@google", "some domain", true);
Console.WriteLine(result);
}
public static string SafeReplace(string input, string find, string replace, bool matchWholeWord)
{
string textToFind = matchWholeWord ? string.Format(@"(?<!\S){0}(?!\S)", Regex.Escape(find)) : find;
return Regex.Replace(input, textToFind, replace.Replace("$","$$"));
}
}
See the C# demo.
The Regex.Escape(find) is only necessary if you expect special regex metacharacters in the find variable value.
The regex demo is available at regexstorm.net.
I have a list which contains file names (without their full path)
List<string> list=new List<string>();
list.Add("File1.doc");
list.Add("File2.pdf");
list.Add("File3.xls");
foreach(var item in list) {
var val=item.Split('.');
var ext=val[1];
}
I don't want to use String.Split, how will I get the extension of the file with regex?
You don't need to use regex for that. You can use Path.GetExtension method.
Returns the extension of the specified path string.
string name = "notepad.exe";
string ext = Path.GetExtension(name).Replace(".", ""); // exe
Here is a DEMO.
To get the extension using regex:
foreach (var item in list) {
var ext = Regex.Match( item, "[^.]+$" ).Value;
}
Or if you want to make sure there is a dot:
@"(?<=\.)[^.]+$"
You could use Path.GetExtension().
Example (also removes the dot):
string filename = "MyAwesomeFileName.ext";
string extension = Path.GetExtension(filename).Replace(".", "");
// extension now contains "ext"
The regex is
\.([A-Za-z0-9]+)$
Escaped period, 1 or more alpha-numeric characters, end of string
You could also use LastIndexOf(".")
int delim = fileName.LastIndexOf(".");
string ext = delim >= 0 ? fileName.Substring(delim) : string.Empty; // includes the dot; empty when there is no dot
But using the built in function is always more convenient.
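For reference, a minimal sketch of the built-in route. Path.GetExtension returns the last dot plus whatever follows it, or an empty string when the name has no dot. ExtensionDemo and ExtensionWithoutDot are hypothetical names; TrimStart is just one way to drop the dot.

```csharp
using System.IO;

static class ExtensionDemo
{
    // Path.GetExtension returns ".doc" for "File1.doc" and "" for a name
    // without a dot; TrimStart removes the leading dot when present.
    public static string ExtensionWithoutDot(string fileName)
    {
        return Path.GetExtension(fileName).TrimStart('.');
    }
}
```

For "File1.doc" this gives "doc", for "FirstPart.SecondPart.xml" it gives "xml", and for "README" it gives an empty string.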
For the benefit of googlers -
I was dealing with bizarre filenames e.g. FirstPart.SecondPart.xml, with the extension being unknown.
In this case, Path.GetExtension() got confused by the extra dots.
The regex I used was
\.[A-Za-z]{3,4}$
i.e. match the last instance of 3 or 4 characters with a dot in front only. You can test it here at Regexr. Not a prize winner, but did the trick.
The obvious flaw is that if the second part were 3-4 chars and the file had no extension, it would pick that up; however, I knew that was not a situation I would encounter.
"\\.[^\\.]+" matches anything that starts with a . character followed by one or more non-. characters.
By the way the others are right, regex is overkill here.
I need to write a string replace function with custom wildcards support. I also should be able to escape these wildcards. I currently have a wildcard class with Usage, Value and Escape properties.
So let's say I have a global list called Wildcards. Wildcards has only one member added here:
Wildcards.Add(new Wildcard
{
Usage = @"\Break",
Value = Environment.NewLine,
Escape = @"\\Break"
});
So I need a CustomReplace method to do the trick. I should replace the specified parameter in a given string with another one just like the string.Replace. The only difference here that it must use my custom wildcards.
string test = CustomReplace("Hi there! What's up?", "! ", "!\\Break");
// Value of the test variable should be: "Hi there!\r\nWhat's up?"
// Because \Break is specified in a custom wildcard in Wildcards
// But if I use the value of the wildcard's Escape member,
// it should be replaced with the value of Usage member.
test = CustomReplace("Hi there! What's up?", "! ", "!\\\\Break");
// Value of the test variable should be: "Hi there!\\BreakWhat's up?"
My current method doesn't support escape strings.
It also can't be good when it comes to performance since I call string.Replace two times and each one searches the whole string, I guess.
// My current method. Has no support for escape strings.
string CustomReplace(string text, string oldValue, string newValue)
{
string done = text.Replace(oldValue, newValue);
foreach (Wildcard wildcard in Wildcards)
{
// Doing this:
// done = done.Replace(wildcard.Escape, wildcard.Usage);
// ...would cause trouble when Escape contains Usage.
done = done.Replace(wildcard.Usage, wildcard.Value);
}
return done;
}
So, do I have to write a replace method which searches the string char by char, with the logic to find and separate both Usage and Escape values, then replaces Escape with Usage while replacing Usage with another given string?
Or do you know an already written one?
Can I use regular expressions in this scenario?
If I can, how? (Have no experience in this, a pattern would be nice)
If I do, would it be faster or slower than char by char searching?
Sorry for the long post, I tried to keep it clear and sorry for any typos and such; it's not my primary language. Thanks in advance.
You can try this:
public string CustomReplace(string text, string oldValue, string newValue)
{
string done = text.Replace(oldValue, newValue);
var builder = new StringBuilder();
foreach (var wildcard in Wildcards)
{
builder.AppendFormat("({0}|{1})|", Regex.Escape(wildcard.Usage),
Regex.Escape(wildcard.Escape));
}
builder.Length = builder.Length - 1; // Remove the last '|' character
return Regex.Replace(done, builder.ToString(), WildcardEvaluator);
}
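private string WildcardEvaluator(Match match)
{
// A plain Usage match expands to the wildcard's Value.
var wildcard = Wildcards.Find(w => w.Usage == match.Value);
if (wildcard != null)
return wildcard.Value;
// An Escape match collapses back to the literal Usage text.
var escaped = Wildcards.Find(w => w.Escape == match.Value);
return escaped != null ? escaped.Usage : match.Value;
}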
I think this is the easiest and fastest solution as there is only one Replace method call for all wildcards.
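For anyone who wants to try the single-pass idea in isolation, here is a self-contained sketch. Everything in it is an assumption based on the question: the Wildcard class shape, the WildcardReplacer and Expand names, and the rule that an Escape match turns into the literal Usage text. It only covers the wildcard expansion step, not the initial oldValue/newValue replace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class Wildcard
{
    public string Usage { get; set; } = "";
    public string Value { get; set; } = "";
    public string Escape { get; set; } = "";
}

static class WildcardReplacer
{
    // Expands each Usage to its Value and collapses each Escape to the
    // literal Usage text, in a single regex pass over the input.
    public static string Expand(string text, IList<Wildcard> wildcards)
    {
        if (wildcards.Count == 0)
            return text;

        // Escape forms are listed first so they are tried before the
        // Usage forms at any given position.
        string pattern = string.Join("|",
            wildcards.Select(w => Regex.Escape(w.Escape))
                     .Concat(wildcards.Select(w => Regex.Escape(w.Usage))));

        return Regex.Replace(text, pattern, match =>
        {
            Wildcard escaped = wildcards.FirstOrDefault(w => w.Escape == match.Value);
            if (escaped != null)
                return escaped.Usage;
            Wildcard used = wildcards.FirstOrDefault(w => w.Usage == match.Value);
            return used != null ? used.Value : match.Value;
        });
    }
}
```

Because the input is scanned once, text produced by a replacement is never scanned again, which avoids the problem of Escape containing Usage that two chained string.Replace calls run into.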
So if you are happy to just use Regex to fulfil your needs then you should check out this link. It has some great info on using Regex in .NET. The website also has loads of examples on how to construct Regex patterns for many different needs.
A basic example of a Replace on a string with wildcards might look like this...
string input = "my first regex replace";
string result = System.Text.RegularExpressions.Regex.Replace(input, "rep...e", "result");
//result is now "my first regex result"
Notice how the second argument of the Replace function takes a regex pattern string. In this case, the dots act as wildcard characters; each one means "match any single character".
Hopefully this will help you get what you need.
If you define a pattern for both your wildcard and your escape method, you can create a Regex which will find all the wildcards in your text. You can then use a MatchEvaluator to replace them.
class Program
{
static Dictionary<string, string> replacements = new Dictionary<string, string>();
static void Main(string[] args)
{
replacements.Add("\\Break", Environment.NewLine);
string template = @"This is an \\Break escaped newline and this should \Break contain a newline.";
// (?<=($|[^\\])(\\\\){0,}) will handle double escaped items
string outcome = Regex.Replace(template, @"(?<=($|[^\\])(\\\\){0,})\\\w+\b", ReplaceMethod);
}
public static string ReplaceMethod(Match m)
{
string replacement = null;
if (replacements.TryGetValue(m.Value, out replacement))
{
return replacement;
}
else
{
//return string.Empty?
//throw new FormatException()?
return m.Value;
}
}
}