Remove Adjacent Space near a Special Character using regex - c#

Using a regex, I want to remove the spaces adjacent to a replacement character.
replacementCharacter = '-'
this._adjacentSpace = new Regex($@"\s*([\{replacementCharacter}])\s*");
MatchCollection replaceCharacterMatch = this._adjacentSpace.Matches(extractedText);
foreach (Match replaceCharacter in replaceCharacterMatch)
{
if (replaceCharacter.Success)
{
cleanedText = Extactedtext.Replace(replaceCharacter.Value, replaceCharacter.Value.Trim());
}
}
Extractedtext = - whi, - ch
cleanedtext = -whi, -ch
expected result : cleanedtext = -whi,-ch

You can use
var Extactedtext = "- whi, - ch";
var replacementCharacter = "-";
var _adjacentSpace = new Regex($@"\s*({Regex.Escape(replacementCharacter)})\s*");
var cleanedText = _adjacentSpace.Replace(Extactedtext, "$1");
Console.WriteLine(cleanedText); // => -whi,-ch
NOTE:
replacementCharacter is of type string in the code above
$@"\s*({Regex.Escape(replacementCharacter)})\s*" builds a pattern like \s*(-)\s*. Regex.Escape() escapes any regex-special character (such as + or () so that it is matched literally. The pattern matches the replacementCharacter together with any whitespace on either side of it, and the capturing parentheses store the character itself in Group 1.
There is no need to call Regex.Matches first: Regex.Replace replaces every match and returns the input unchanged if there are none.
_adjacentSpace is a Regex instance; to replace, just call its .Replace() method.
The replacement is a backreference to the Group 1 value, the - char here.
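As a minimal sketch of why Regex.Escape matters, the same idea can be wrapped in a small helper and tried with a character that is special in regex syntax. TrimAround is a hypothetical name used only for illustration; the snippet assumes C# 9 or later so that it can run as top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes whitespace on both sides of every occurrence of token.
static string TrimAround(string text, string token)
{
    // Regex.Escape makes characters such as + or ( match literally.
    var pattern = $@"\s*({Regex.Escape(token)})\s*";
    // "$1" puts back only the captured token, dropping the surrounding whitespace.
    return Regex.Replace(text, pattern, "$1");
}

Console.WriteLine(TrimAround("- whi, - ch", "-")); // -whi,-ch
Console.WriteLine(TrimAround("a + b +c", "+"));    // a+b+c
```

Without the escape, a token of "+" would produce the pattern \s*(+)\s*, which is not a valid regular expression.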

Related

Find hashtags in string

I am working on a Xamarin.Forms PCL project in C# and would like to detect all the hashtags.
I tried splitting at spaces and checking whether each word begins with a #, but the problem is that if the post contains two consecutive spaces, as in "Hello #World Test", the double space is lost.
string body = "Example string with a #hashtag in it";
string newbody = "";
foreach (var word in body.Split(' '))
{
if (word.StartsWith("#"))
newbody += "[" + word + "]";
newbody += word;
}
Goal output:
Example string with a [#hashtag] in it
I also only want it to have A-Z a-z 0-9 and _ stopping at any other character
Test #H3ll0_W0rld$%Test => Test [#H3ll0_W0rld]$%Test
Other Stack questions try to detect the string and extract it, I would like it work with it and put it back in the string without losing anything that methods such as splitting by certain characters would lose.
You can use Regex with #\w+ and $&
Explanation
# matches the character # literally (case sensitive)
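\w matches any word character (letters, digits and underscore; in .NET this includes non-ASCII letters and digits, so use [A-Za-z0-9_] or RegexOptions.ECMAScript to restrict it to ASCII)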
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
$& Includes a copy of the entire match in the replacement string.
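The question asks for only A-Z, a-z, 0-9 and _ to count as part of a hashtag. A minimal sketch of that variant, assuming an explicit ASCII character class in place of \w, could look like this (WrapHashtags is a hypothetical name; C# 9 or later top-level statements are assumed):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: wraps every hashtag in square brackets and leaves
// the rest of the text, including repeated spaces, untouched.
static string WrapHashtags(string body)
{
    // Explicit ASCII class instead of \w, which in .NET also matches
    // non-ASCII letters and digits.
    return Regex.Replace(body, "#[A-Za-z0-9_]+", "[$&]");
}

Console.WriteLine(WrapHashtags("Test #H3ll0_W0rld$%Test"));
// Test [#H3ll0_W0rld]$%Test
Console.WriteLine(WrapHashtags("Example string with a #hashtag in it"));
// Example string with a [#hashtag] in it
```

Because Regex.Replace only rewrites the matched spans, nothing is lost the way it is when splitting on spaces.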
Example
var input = "asdads sdfdsf #burgers, #rabbits dsfsdfds #sdf #dfgdfg";
var regex = new Regex(@"#\w+");
var matches = regex.Matches(input);
foreach (var match in matches)
{
Console.WriteLine(match);
}
or
var result = regex.Replace(input, "[$&]" );
Console.WriteLine(result);
Output
#burgers
#rabbits
#sdf
#dfgdfg
asdads sdfdsf [#burgers], [#rabbits] dsfsdfds [#sdf] [#dfgdfg]
Another Example
Use a regular expression: \#\w*
string pattern = @"\#\w*";
Regex rgx = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matches = rgx.Matches(input);

search string for everything before a set of characters in C#

I'm looking for a way to search a string for everything before a set of characters in C#. For Example, if this is my string value:
This is is a test.... 12345
I want build a new string with all of the characters before "12345".
So my new string would equal "This is is a test.... "
Is there a way to do this?
I've found Regex examples where you can focus on one character but not a sequence of characters.
You don't need to use a Regex:
public string GetBitBefore(string text, string end)
{
var index = text.IndexOf(end);
if (index == -1) return text;
return text.Substring(0, index);
}
You can use a lazy quantifier to match anything, followed by a lookahead:
var match = Regex.Match("This is is a test.... 12345", @".*?(?=\d{5})");
where:
.*? lazily matches everything (up to the lookahead)
(?=…) is a positive lookahead: the pattern must be matched, but is not included in the result
\d{5} matches exactly five digits. I'm assuming this is your lookahead; you can replace it
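If the terminator is a known literal rather than "any five digits", the same lazy-match-plus-lookahead idea can be sketched with the literal escaped via Regex.Escape. TextBefore is a hypothetical helper; returning the whole input when the terminator is absent is an assumption chosen to mirror GetBitBefore above. C# 9 or later top-level statements are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns everything before the first occurrence of end,
// or the whole text if end does not occur.
static string TextBefore(string text, string end)
{
    // .*? matches as little as possible; (?=...) requires end to follow
    // without consuming it. Singleline lets the dot cross line breaks.
    var m = Regex.Match(text, ".*?(?=" + Regex.Escape(end) + ")", RegexOptions.Singleline);
    return m.Success ? m.Value : text;
}

Console.WriteLine("[" + TextBefore("This is is a test.... 12345", "12345") + "]");
// [This is is a test.... ]
Console.WriteLine(TextBefore("a.b", ".")); // a
```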
You can do so with help of regex lookahead.
.*(?=12345)
Example:
var data = "This is is a test.... 12345";
var rxStr = ".*(?=12345)";
var rx = new System.Text.RegularExpressions.Regex (rxStr,
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
var match = rx.Match(data);
if (match.Success) {
Console.WriteLine (match.Value);
}
The snippet above prints everything up to 12345:
This is is a test....
For more detail, see the documentation on regex positive lookahead.
This should get you started:
var reg = new Regex("^(.+)12345$");
var match = reg.Match("This is is a test.... 12345");
var group = match.Groups[1]; // This is is a test....
Of course you'd want to do some additional validation, but this is the basic idea.
^ means start of string
$ means end of string
The asterisk tells the engine to attempt to match the preceding token zero or more times. The plus tells the engine to attempt to match the preceding token once or more
{min,max} indicate the minimum/maximum number of matches.
\d matches a single character that is a digit, \w matches a "word character" (alphanumeric characters plus underscore), and \s matches a whitespace character (includes tabs and line breaks).
[^a] is a negated character class: it matches any single character except a
The dot matches a single character, except line break characters
In your case there are many ways to accomplish the task.
E.g. matching everything up to the first digit: ^[^\d]*
If you know the set of characters and they are not only digits, don't use a regex but IndexOf(). If you know that the separator between the first and second parts is "...", you can use Split().
Take a look at this snippet:
class Program
{
static void Main(string[] args)
{
string input = "This is is a test.... 12345";
// Here we call Regex.Matches.
MatchCollection matches = Regex.Matches(input, @"(?<MySentence>(\w+\s*)*)(?<MyNumberPart>\d*)");
foreach (Match item in matches)
{
Console.WriteLine(item.Groups["MySentence"]);
Console.WriteLine("******");
Console.WriteLine(item.Groups["MyNumberPart"]);
}
Console.ReadKey();
}
}
You could just split, not as optimal as the indexOf solution
string value = "oiasjdoiasj12345";
string end = "12345";
string result = value.Split(new string[] { end }, StringSplitOptions.None)[0]; // Take the first part of the result; not the quickest but fairly simple

Retrieve Alphabet with white space

I would like to retrieve the alphabet only but the code is not enough to make it.
What am I missing?
[A-Öa-ö]+$
16440 dallas
23941 cityO < You also have white space after "O"
931 00 Texas
10581 New Orleans
It's because a range such as A-Ö follows character-code order, and å, ä and ö do not come directly after Z in that order, so the range covers more than the letters you intended.
You can see it here: http://www.asciitable.com/
So what you need is a regex that specifies those separately:
[A-Za-zåäöÅÄÖ]+$
So the complete regex is:
var re = new Regex("([A-Za-zåäöÅÄÖ]+)$", RegexOptions.Multiline);
var matches = re.Matches(data);
Console.WriteLine(matches[0].Groups[1].Value);
However, since you want to allow white spaces within the name (as for "New Orleans") you need to allow it, simply include it in the regex:
var re = new Regex("([A-Za-zåäöÅÄÖ ]+)$", RegexOptions.Multiline);
Unfortunately that also includes white spaces in the beginning and the end:
" New Orleans "
To fix that, you start by making the quantifier lazy (non-greedy), i.e. tell it to match as few characters as possible:
new Regex("([A-Za-zåäöÅÄÖ ]+?)$", RegexOptions.Multiline)
The problem with that is that it does not match any line other than New Orleans. Don't ask me why. To fix that I told the regex that there must be a whitespace character between the digits and the text and that there may be whitespace after the text:
var re = new Regex("\\s([A-Za-zåäöÅÄÖ ]+?)[\\s]*$", RegexOptions.Multiline);
which works with all lines.
Regex breakdown:
\\s A single whitespace character (not included in the captured group, since it is outside the parentheses)
([A-Za-zåäöÅÄÖ ]+?)
[A-Za-zåäöÅÄÖ ] Find a character that is either one of the listed letters or a space
+ There must be one or more
? Make the quantifier lazy, i.e. match as few characters as possible
[\\s]*
[\\s] Find a whitespace character
* There must be zero or more of it
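To see the breakdown in action, here is a small sketch that runs the final pattern over the four sample lines from the question. ExtractNames is a hypothetical name, and the sample text is assumed to use "\n" line breaks; C# 9 or later top-level statements are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the captured name from every line.
static List<string> ExtractNames(string text)
{
    // Same pattern as above, written as a verbatim string so that \s
    // needs no doubling.
    var re = new Regex(@"\s([A-Za-zåäöÅÄÖ ]+?)[\s]*$", RegexOptions.Multiline);
    var names = new List<string>();
    foreach (Match m in re.Matches(text))
        names.Add(m.Groups[1].Value); // Group 1 excludes surrounding whitespace
    return names;
}

var sample = "16440 dallas\n23941 cityO \n931 00 Texas\n10581 New Orleans";
Console.WriteLine(string.Join("|", ExtractNames(sample)));
// dallas|cityO|Texas|New Orleans
```

The lazy +? stops the group before the trailing space in "cityO ", while the requirement to reach the end of the line still lets it grow across the inner space in "New Orleans".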
Alternative
As an alternative to regex you can do something like this:
public IEnumerable<string> GetCodes(string data)
{
var lines = data.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
foreach (var line in lines)
{
for (var i = 0; i < line.Length; i++)
{
if (!char.IsLetter(line[i]))
continue;
var text = line.Substring(i).TrimEnd(' ');
yield return text;
break;
}
}
}
Which is invoked like:
var codes = GetCodes(yourData).ToList();
In C#, you can use the \p{L} Unicode category class to match any Unicode letter. You may match zero or more whitespace characters with \s*. End of string is $ (or \Z or \z). The word you need can be captured, and this capture can easily be retrieved from the match result via its GroupCollection.
Thus, you can use
(\p{L}+)\s*$
or - if you plan to match specific Finnish, etc. letters:
(?i)([A-ZÅÄÖ]+)\s*$
C# demo:
var strs = new string[] {"16440 dallas", "23941 cityO ", "931 00 Texas", "10581 New Orleans"};
foreach (var s in strs) {
var match = Regex.Match(s, @"(\p{L}+)\s*$");
if (match.Success)
{
Console.WriteLine(match.Groups[1].Value);
}
}
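Note that (\p{L}+)\s*$ captures only the last word, so "10581 New Orleans" yields "Orleans". If multi-word names should be kept whole, one possible variation (an assumption on my part, not part of the answer above) is to allow further whitespace-separated letter runs inside the group. TrailingName is a hypothetical name; C# 9 or later top-level statements are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the trailing run of letter-only words,
// or an empty string if the line does not end with one.
static string TrailingName(string line)
{
    // \p{L}+ is one word; (?:\s+\p{L}+)* allows more words after it.
    var m = Regex.Match(line, @"(\p{L}+(?:\s+\p{L}+)*)\s*$");
    return m.Success ? m.Groups[1].Value : "";
}

foreach (var s in new[] { "16440 dallas", "23941 cityO ", "931 00 Texas", "10581 New Orleans" })
    Console.WriteLine(TrailingName(s));
// dallas
// cityO
// Texas
// New Orleans
```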

Single Regex to match URL

Any ideas how I can use a single regular expression to validate a single url and also match urls in a text block?
var x = "http://myurl.com";
var t = "http://myurl.com ref";
var y = "some text that contains a url http://myurl.com some where";
var expression = "\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";
Regex.IsMatch(x, expression, RegexOptions.IgnoreCase); // returns true;
Regex.IsMatch(t, expression, RegexOptions.IgnoreCase); // returns false;
Regex.Matches(y, expression, RegexOptions.IgnoreCase); // returns http://myurl.com;
First of all you have to escape correctly. Use "\\b..." instead of "\b...". IsMatch will also be true for partial matches. You can check if the whole input is matching by doing this:
Match match = Regex.Match(x, expression, RegexOptions.IgnoreCase);
if (match.Success && match.Length == x.Length)
// full match
With this check and the escape fix, your expression will work as it is. You also can write a helper method for it:
private bool FullMatch(string input, string pattern, RegexOptions options)
{
Match match = Regex.Match(input, pattern, options);
return match.Success && match.Length == input.Length;
}
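An alternative to comparing lengths is to anchor the pattern itself, so that the engine only accepts a match spanning the entire input. The following is a sketch under that assumption, not the answerer's code; IsWholeMatch is a hypothetical name, and C# 9 or later top-level statements are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true only if pattern matches the entire input.
static bool IsWholeMatch(string input, string pattern, RegexOptions options)
{
    // \A and \z anchor to the very start and end of the input;
    // (?:...) keeps any top-level alternation in pattern grouped.
    return Regex.IsMatch(input, @"\A(?:" + pattern + @")\z", options);
}

var url = @"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";
Console.WriteLine(IsWholeMatch("http://myurl.com", url, RegexOptions.IgnoreCase));     // True
Console.WriteLine(IsWholeMatch("http://myurl.com ref", url, RegexOptions.IgnoreCase)); // False
Console.WriteLine(IsWholeMatch("ab", "a|ab", RegexOptions.None));                      // True
```

The last call shows where the two approaches differ: Regex.Match("ab", "a|ab") returns the first alternative, "a", so a length comparison reports no full match even though the pattern can match the whole input.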
Your code will change to this:
var x = "http://myurl.com";
var t = "http://myurl.com ref";
var y = "some text that contains a url http://myurl.com some where";
var expression = "\\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]";
FullMatch(x, expression, RegexOptions.IgnoreCase); // returns true;
FullMatch(t, expression, RegexOptions.IgnoreCase); // returns false;
Regex.Matches(y, expression, RegexOptions.IgnoreCase); // returns http://myurl.com;
I think the word boundary is getting you; it will not match for non-word characters.
Try this:
var expression = @"(^|\s)(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[A-Z0-9+&@#/%=~_|]($|\s)";
This binds the start of the match to the beginning of the string or a whitespace character, and the end of the match to the end of the string or a whitespace character.
more info: http://www.regular-expressions.info/wordboundaries.html
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
Simply put: \b allows you to perform a "whole words only" search using a regular expression in the form of \bword\b. A "word character" is a character that can be used to form words. All characters that are not "word characters" are "non-word characters".

Regex help with sample pattern. C#

I decided to use Regex, now I have two problems :)
Given the input string "hello world [2] [200] [%8] [%1c] [%d]",
What would be an appropriate pattern to match the instances of "[%8]", "[%1c]" and "[%d]"? (So a percentage sign, followed by an alphanumeric string of any length, all enclosed in square brackets.)
for the "[2]" and [200], I already use
Regex.Matches(input, "(\\[)[0-9]*?\\]");
Which works fine.
Any help would be appreciated.
MatchCollection matches = null;
try {
Regex regexObj = new Regex(@"\[[%\w]+\]");
matches = regexObj.Matches(input);
if (matches.Count > 0) {
// Access individual matches using matches[i]
} else {
// Match attempt failed
}
} catch (ArgumentException ex) {
// Syntax error in the regular expression
}
The Regex needed to match this pattern of "[%anyLengthAlphaNumeric]" in a string is "\[(%\w+)\]".
The leading "[" is escaped with a "\". Then (...) creates a group, defined here as %\w+. The \w is a shorthand for all word characters: letters, digits and underscore, but no spaces. The + matches one or more instances of the preceding symbol, character or group. Finally, the trailing "]" is escaped with a "\" and matches the closing bracket.
Here is a basic code example:
string input = @"hello world [2] [200] [%8] [%1c] [%d]";
Regex example = new Regex(@"\[(%\w+)\]");
MatchCollection matches = example.Matches(input);
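To show what the group is for, here is a small sketch that reads Group 1 from each match, giving the token without its brackets. PercentTokens is a hypothetical name; C# 9 or later top-level statements are assumed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: returns the contents of every [%...] token.
static List<string> PercentTokens(string input)
{
    var tokens = new List<string>();
    foreach (Match m in Regex.Matches(input, @"\[(%\w+)\]"))
        tokens.Add(m.Groups[1].Value); // Group 1 is the part inside the brackets
    return tokens;
}

Console.WriteLine(string.Join(", ", PercentTokens("hello world [2] [200] [%8] [%1c] [%d]")));
// %8, %1c, %d
```

The purely numeric tokens [2] and [200] are skipped because the pattern requires a % immediately after the opening bracket.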
Try this:
Regex.Matches(input, "\\[%[0-9a-f]+\\]");
Or as a combined regular expression:
Regex.Matches(input, "\\[(\\d+|%[0-9a-f]+)\\]");
How about @"\[%[0-9a-f]*?\]"?
string input = "hello world [2] [200] [%8] [%1c] [%d]";
MatchCollection matches = Regex.Matches(input, @"\[%[0-9a-f]*?\]");
matches.Count // = 3
