How to unescape sequences that include \u and \U? - c#

I have some strings in a .resx file that include sequences like this:
\u26A0 warning
So I use the following code to unescape them:
str = Regex.Unescape(str);
When I look at the result, everything works well with \u, and it shows the related emoji.
But the Regex.Unescape(...) method does not work when the input string includes \U, like this:
\U0001F4D8 book
and it throws this error:
Error: Unrecognized escape sequence \U
My question:
Is there another method in the .NET Framework to unescape sequences that include \u and \U?
If there is no built-in method, how can I write a helper method to do it manually?
Edit:
When I read the string from the resx file it has double backslashes, and I need to convert these Unicode sequences to their characters.

Indeed, according to the source code of Regex.Unescape (RegexParser.ScanCharEscape), \U is not handled.
Instead, you could consider a manual conversion with the help of char.ConvertFromUtf32:
string converted = char.ConvertFromUtf32(int.Parse("0001F4D8", NumberStyles.HexNumber));
This is a draft implementation. (The annoying complexity comes from an attempt to distinguish \U and \\U.)
static string Unescape(string str)
{
StringBuilder builder = new StringBuilder();
int startIndex = 0;
while(true)
{
int index = IndexOfBackslashU(str, startIndex);
if (index == -1)
return builder.Append(Regex.Unescape(str.Substring(startIndex))).ToString();
builder.Append(Regex.Unescape(str.Substring(startIndex, index - startIndex)));
string number = str.Substring(index + 2, 8);
builder.Append(char.ConvertFromUtf32(int.Parse(number, NumberStyles.HexNumber)));
startIndex = index + 10;
}
}
static int IndexOfBackslashU(string str, int startIndex)
{
while (true)
{
int index = str.IndexOf(@"\U", startIndex, StringComparison.Ordinal);
if (index == -1)
return index;
bool evenNumberOfPreviousBackslashes = true;
for (int k = index-1; k >= 0 && str[k] == '\\'; k--)
evenNumberOfPreviousBackslashes = !evenNumberOfPreviousBackslashes;
if (evenNumberOfPreviousBackslashes)
return index;
startIndex = index + 2;
}
}
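As a small illustration of why char.ConvertFromUtf32 is needed here: code points above U+FFFF do not fit in a single char, so the method returns a two-char surrogate pair. The sketch below only wraps the parse-and-convert step from the one-liner above; the names CodePointDemo and FromHex are made up for this example.

```csharp
using System;
using System.Globalization;

public static class CodePointDemo
{
    // Parses hex digits (for example "0001F4D8") and returns the matching string.
    // Code points above U+FFFF come back as two chars (a surrogate pair).
    public static string FromHex(string hexDigits)
    {
        int codePoint = int.Parse(hexDigits, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return char.ConvertFromUtf32(codePoint);
    }
}
```

Note that char.ConvertFromUtf32 throws an ArgumentOutOfRangeException for values outside the Unicode range or inside the surrogate range, so untrusted input may need a try/catch around the call.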

I wrote this method, and it solved the problem:
public static string UnescapeIt(string str)
{
var regex = new Regex(@"(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})", RegexOptions.Compiled);
return regex.Replace(str,
m =>
{
if (m.Value.IndexOf("\\U", StringComparison.Ordinal) > -1)
return char.ConvertFromUtf32(int.Parse(m.Value.Replace("\\U", ""), NumberStyles.HexNumber));
return Regex.Unescape(m.Value);
});
}
It unescapes \u sequences and converts \U sequences to the corresponding characters, so the emojis are displayed.
Usage:
str = UnescapeIt(str);
Update:
I changed the regex from
\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}
to
(?<!\\)(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8})
Now the match fails if there is a backslash immediately before \u or \U.
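One possible variation, not the code from the answer above: a negative lookbehind also rejects a real escape that follows an escaped backslash (three backslashes followed by u26A0). A sketch that instead consumes escaped backslashes as their own alternative might look like this. The class name UnicodeEscapes and the group names u4 and u8 are assumptions made for the example.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class UnicodeEscapes
{
    // Alternation order matters: an escaped backslash (two backslashes) is consumed
    // first, so a 'u' or 'U' right after it is never treated as the start of an escape.
    private static readonly Regex Pattern = new Regex(
        @"\\\\|\\u(?<u4>[0-9a-fA-F]{4})|\\U(?<u8>[0-9a-fA-F]{8})",
        RegexOptions.Compiled);

    public static string Unescape(string input)
    {
        return Pattern.Replace(input, m =>
        {
            if (m.Groups["u4"].Success)
            {
                // Four hex digits always fit in one char (this may be half of a surrogate pair).
                int unit = int.Parse(m.Groups["u4"].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                return ((char)unit).ToString();
            }
            if (m.Groups["u8"].Success)
            {
                // Eight hex digits are a full code point; this throws for invalid values.
                int codePoint = int.Parse(m.Groups["u8"].Value, NumberStyles.HexNumber, CultureInfo.InvariantCulture);
                return char.ConvertFromUtf32(codePoint);
            }
            // Escaped backslash: left untouched in this sketch.
            return m.Value;
        });
    }
}
```

Usage is the same as before, for example str = UnicodeEscapes.Unescape(str);. Whether an escaped backslash should be kept as two characters or collapsed to one depends on what the rest of your pipeline expects; this sketch keeps it as is.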

Related

Can I limit the number of times a character appears in a string?

For instance, I have a string and I only want the character '<' to appear 10 times in the string, and create a substring where the cutoff point is the 10th appearance of that character. Is this possible?
A manual solution could be like the following:
class Program
{
static void Main(string[] args)
{
int maxNum = 10;
string initialString = "a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z";
string[] splitString = initialString.Split('<');
string result = "";
Console.WriteLine(splitString.Length);
if (splitString.Length > maxNum)
{
for (int i = 0; i < maxNum; i++) {
result += splitString[i];
result += "<";
}
}
else
{
result = initialString;
}
Console.WriteLine(result);
Console.ReadKey();
}
}
By the way, it may be better to try to do it using Regex (in case you may have other replacement rules in the future, or need to make changes, etc). However, given your problem, something like that will work, too.
You can use TakeWhile for this. Given the string s, the character < as c, and the count 10 as count, the following function would solve your problem:
public static string foo(string s, char c, int count)
{
var i = 0;
return string.Concat(s.TakeWhile(x => (x == c ? i++ : i) < count));
}
Regex.Matches can be used to count the number of occurrences of a pattern in a string.
Each match also exposes the position of the occurrence through the Capture.Index property.
You can read the Index of the Nth occurrence and cut your string there:
(The RegexOptions are there just in case the pattern is something different. Modify as required.)
int cutAtOccurrence = 10;
string input = "one<two<three<four<five<six<seven<eight<nine<ten<eleven<twelve<thirteen<fourteen<fifteen";
var regx = Regex.Matches(input, "<", RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
if (regx.Count >= cutAtOccurrence) {
input = input.Substring(0, regx[cutAtOccurrence - 1].Index);
}
input is now:
one<two<three<four<five<six<seven<eight<nine<ten
If you need to use this procedure many times, it's better to build a method that returns a StringBuilder instead.
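For a single character, the same cut can also be done without Regex by calling IndexOf repeatedly. This is only a sketch of that idea; the names StringCutter and CutBeforeNth are made up, and, like the Regex version above, it cuts just before the n-th occurrence.

```csharp
using System;

public static class StringCutter
{
    // Returns the part of 'input' before the n-th occurrence of 'c',
    // or the whole string when 'c' occurs fewer than n times.
    public static string CutBeforeNth(string input, char c, int n)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        int index = -1;
        for (int found = 0; found < n; found++)
        {
            // Continue searching right after the previous hit.
            index = input.IndexOf(c, index + 1);
            if (index == -1) return input;
        }
        return input.Substring(0, index);
    }
}
```

If the cut should include the n-th character itself, use input.Substring(0, index + 1) in the last line instead.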

Email address splitting

So I have a string that I need to split by semicolons.
Email address: "one@tw;,.'o"@hotmail.com;"some;thing"@example.com
Both of the email addresses are valid.
So I want to have a List<string> of the following:
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
But the way I am currently splitting the addresses is not working:
var addresses = emailAddressString.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
.Select(x => x.Trim()).ToList();
Because of the multiple ; characters I end up with invalid email addresses.
I have tried a few different ways, even going down working out if the string contains quotes and then finding the index of the ; characters and working it out that way, but it's a real pain.
Does anyone have any better suggestions?
Assuming that double-quotes are not allowed, except for the opening and closing quotes ahead of the "at" sign @, you can use this regular expression to capture e-mail addresses:
((?:[^@"]+|"[^"]*")@[^;]+)(?:;|$)
The idea is to capture either an unquoted [^@"]+ or a quoted "[^"]*" part prior to @, and then capture everything up to semicolon ; or the end anchor $.
Demo of the regex.
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world";
var mm = Regex.Matches(input, "((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");
foreach (Match m in mm) {
Console.WriteLine(m.Groups[1].Value);
}
This code prints
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
Demo 1.
If you would like to allow escaped double-quotes inside double-quotes, you could use a more complex expression:
((?:(?:[^@\"]|(?<=\\)\")+|\"([^\"]|(?<=\\)\")*\")@[^;]+)(?:;|$)
Everything else remains the same.
Demo 2.
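Since the question asks for a List<string>, the matches can be projected into a list with LINQ. The sketch below uses the simpler of the two expressions from this answer, written with @ as the "at" sign; the class and method names (EmailSplitter, Split) are made up for the example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class EmailSplitter
{
    // A quoted or unquoted local part, then '@', then everything
    // up to the next ';' or the end of the input.
    private static readonly Regex AddressPattern =
        new Regex("((?:[^@\"]+|\"[^\"]*\")@[^;]+)(?:;|$)");

    public static List<string> Split(string input)
    {
        return AddressPattern.Matches(input)
            .Cast<Match>()                           // MatchCollection is not generic on older frameworks
            .Select(m => m.Groups[1].Value.Trim())   // group 1 is the address without the trailing ';'
            .ToList();
    }
}
```

This only splits the string; it does not validate that each piece is a well-formed address.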
I obviously started writing my anti regex method at around the same time as juharr (Another answer). I thought that since I already have it written I would submit it.
public static IEnumerable<string> SplitEmailsByDelimiter(string input, char delimiter)
{
var startIndex = 0;
var delimiterIndex = 0;
while (delimiterIndex >= 0)
{
delimiterIndex = input.IndexOf(delimiter, startIndex);
string substring = input;
if (delimiterIndex > 0)
{
substring = input.Substring(0, delimiterIndex);
}
if (!substring.Contains("\"") || substring.IndexOf("\"") != substring.LastIndexOf("\""))
{
yield return substring;
input = input.Substring(delimiterIndex + 1);
startIndex = 0;
}
else
{
startIndex = delimiterIndex + 1;
}
}
}
Then the following
var input = "blah@blah.com;\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;asdasd@asd.co.uk;";
foreach (var email in SplitEmailsByDelimiter(input, ';'))
{
Console.WriteLine(email);
}
Would give this output
blah@blah.com
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
asdasd@asd.co.uk
You can also do this without using regular expressions. The following extension method will allow you to specify a delimiter character and a character to begin and end escape sequences. Note it does not validate that all escape sequences are closed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape)
{
int beginIndex = 0;
int length = 0;
bool escaped = false;
foreach (char c in str)
{
if (c == beginEndEscape)
{
escaped = !escaped;
}
if (!escaped && c == delimiter)
{
yield return str.Substring(beginIndex, length);
beginIndex += length + 1;
length = 0;
continue;
}
length++;
}
yield return str.Substring(beginIndex, length);
}
Then the following
var input = "\"one@tw;,.'o\"@hotmail.com;\"some;thing\"@example.com;hello@world;\"D;D@blah;blah.com\"";
foreach (var address in input.SpecialSplit(';', '"'))
Console.WriteLine(address);
Will give this output
"one@tw;,.'o"@hotmail.com
"some;thing"@example.com
hello@world
"D;D@blah;blah.com"
Here's a version that also works with an additional single escape character. It assumes that two consecutive escape characters should become a single escape character. The escape character can escape the beginEndEscape character, so that it does not trigger the beginning or end of an escape sequence, and it can also escape the delimiter. Anything else that comes after the escape character is left as is, with the escape character removed.
public static IEnumerable<string> SpecialSplit(
this string str, char delimiter, char beginEndEscape, char singleEscape)
{
StringBuilder builder = new StringBuilder();
bool escapedSequence = false;
bool previousEscapeChar = false;
foreach (char c in str)
{
if (c == singleEscape && !previousEscapeChar)
{
previousEscapeChar = true;
continue;
}
if (c == beginEndEscape && !previousEscapeChar)
{
escapedSequence = !escapedSequence;
}
if (!escapedSequence && !previousEscapeChar && c == delimiter)
{
yield return builder.ToString();
builder.Clear();
continue;
}
builder.Append(c);
previousEscapeChar = false;
}
yield return builder.ToString();
}
Finally, you should probably add a null check for the string that is passed in. Note that both versions return a sequence containing one empty string if you pass in an empty string.
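One detail about the null check: in an iterator method (one that uses yield return), the body does not run until the sequence is enumerated, so a check placed there is delayed. A common pattern is a thin public wrapper that validates and then calls a private iterator. The sketch below shows this for the two-parameter version; the names SplitExtensions, SplitOutsideQuotes and Iterate are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class SplitExtensions
{
    // Not an iterator itself, so the null check runs as soon as the method is called.
    public static IEnumerable<string> SplitOutsideQuotes(this string str, char delimiter, char quote)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        return Iterate(str, delimiter, quote);
    }

    private static IEnumerable<string> Iterate(string str, char delimiter, char quote)
    {
        int begin = 0;
        bool quoted = false;
        for (int i = 0; i < str.Length; i++)
        {
            if (str[i] == quote)
            {
                quoted = !quoted; // entering or leaving a quoted section
            }
            else if (!quoted && str[i] == delimiter)
            {
                yield return str.Substring(begin, i - begin);
                begin = i + 1;
            }
        }
        // Last piece; this is an empty string when the input is empty.
        yield return str.Substring(begin);
    }
}
```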

Remove set of elements in an char array

I have a char array from which I have to remove elements if I find a certain char. For example, "paragraphs" is assigned to a char array and the search character is 'g'. If it is found, I have to modify the original char array to "raphs" by removing all elements up to and including the found one.
char[] ch = "paragraphs"
search chr = 'g'
Desired result(if chr found):
char[] ch = "raphs"
To explain a bit more clearly:
I have to write a function that determines whether str (user input) contains all the characters of the word "abcdef" in the same order as in "abcdef". Return true if it contains all the characters in the same order, otherwise false.
User Input: fgajbdcmdgeoplfqz
Output: true
User Input: asrbcyufde
Output: false
You can use LINQ's SkipWhile, which will skip elements until the search character is found.
An additional Skip is necessary to obtain raphs instead of graphs, and ToArray() is needed for both the input string and the result because you want to work with arrays.
char[] ch = "paragraphs".ToArray();
char search = 'g';
ch = ch.SkipWhile(c => c != search).Skip(1).ToArray(); // raphs
But honestly since your input is a string, I'd work with that:
string ch = "paragraphs";
char search = 'g';
ch = ch.Substring(ch.IndexOf(search) + 1);
and, if really necessary, convert it with .ToArray() afterwards.
And now to answer your 'clarification' (which is pretty much another question, by the way).
There are probably better ways to do it, but here's something that will accomplish what you want in O(n):
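private bool ContainsInSequence(string input, string substring)
{
    // An empty sequence is trivially contained; this also avoids
    // indexing into an empty string below.
    if (substring.Length == 0)
    {
        return true;
    }
    int substringIndex = 0;
    for (int i = 0; i < input.Length; i++)
    {
        if (input[i] == substring[substringIndex])
        {
            // Found the current letter, so move on to the next one.
            substringIndex++;
            if (substringIndex == substring.Length)
            {
                return true;
            }
        }
    }
    return false;
}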
Basically, you go through the input string in order, and each time you encounter the current letter from your substring you move to the next one.
When you reach the end of the substring, you know the input string contained all your substring, so you return true.
If we're not at the end after going through all the input this means there was a letter either out of order or missing, so we return false.
ContainsInSequence("fgajbdcmdgeoplfqz", "abcdef"); // true
ContainsInSequence("asrbcyufde ", "abcdef"); // false
ContainsInSequence("abcdfe", "abcdef"); // false
Try
char[] ch = "paragraphs".ToCharArray();
int index = Array.IndexOf(ch, 'g');
char[] result = new string(ch).Substring(index+1).ToCharArray();
Here's my version without using Linq:
static void Main(string[] args)
{
Console.WriteLine("Please provide a list of characters");
string Charactersinput = Console.ReadLine();
Console.WriteLine("Please provide the search character");
char Searchinput = Console.ReadKey().KeyChar;
Console.WriteLine("");
List<char> ListOfCharacters = new List<char>();
// Fill the list of characters with the characters from the string,
// or empty it if the search character is found
for (int i = 0; i < Charactersinput.Length; i++)
{
if (Searchinput == Charactersinput[i])
{
ListOfCharacters.Clear();
}
else
ListOfCharacters.Add(Charactersinput[i]);
}
//get your string back together
string Result = String.Concat(ListOfCharacters);
Console.WriteLine("Here's a list of all characters after processing: {0}", Result);
Console.ReadLine();
}
To answer your "clarification" question, which is very different from the original question:
I have to write a func. to find whether str(user input) contains all
the char of the word "abcdef" in the same sequence as specify in the
"abcdef". Return true if contains all the char in the same sequence or
else false.
Input: fgajbdcmdgeoplfqz Output: true
Input: asrbcyufde Output: false
The following function takes in two strings, a source string to search and a string containing the sequence of characters to match. It then searches through the source string, looking for each character (starting at the found position of the previous character). If any character is not found, it returns false. Otherwise it returns true:
public static bool ContainsAllCharactersInSameSequence(string sourceString,
string characterSequence)
{
// Short-circuit argument check
if (sourceString == null) return characterSequence == null;
if (characterSequence == null) return false;
if (characterSequence.Length > sourceString.Length) return false;
if (sourceString == characterSequence) return true;
int startIndex = 0;
foreach (char character in characterSequence)
{
int charIndex = sourceString.IndexOf(character, startIndex);
if (charIndex == -1) return false;
startIndex = charIndex + 1;
}
return true;
}

Split string after specific character or after max length

I want to split a string the following way:
string s = "012345678x0123x01234567890123456789";
s.SplitString("x",10);
should be split into
012345678
x0123
x012345678
9012345678
9
i.e. the input string should be split at the character "x" or after length 10, whichever comes first.
Here is what I've tried so far:
public static IEnumerable<string> SplitString(this string sInput, string search, int maxlength)
{
int index = Math.Min(sInput.IndexOf(search), maxlength);
int start = 0;
while (index != -1)
{
yield return sInput.Substring(start, index-start);
start = index;
index = Math.Min(sInput.IndexOf(search,start), maxlength);
}
}
I would go with this regular expression:
([^x]{1,10})|(x[^x]{1,9})
which means:
Match 1 to 10 characters that are not x, OR match x followed by 1 to 9 characters that are not x.
Here is a working example:
string regex = "([^x]{1,10})|(x[^x]{1,9})";
string input = "012345678x0123x01234567890123456789";
var results = Regex.Matches(input, regex)
.Cast<Match>()
.Select(m => m.Value);
which produces the values you requested.
Personally I don't like RegEx. It creates code that is hard to de-bug and is very hard to work out what it is meant to be doing when you first look at it. So for a more lengthy solution I would go with something like this.
public static IEnumerable<string> SplitString(this string sInput, char search, int maxlength)
{
var result = new List<string>();
var count = 0;
var lastSplit = 0;
foreach (char c in sInput)
{
if (c == search || count - lastSplit == maxlength)
{
result.Add(sInput.Substring(lastSplit, count - lastSplit));
lastSplit = count;
}
count ++;
}
result.Add(sInput.Substring(lastSplit, count - lastSplit));
return result;
}
Note I changed the first parameter to a char (from a string). This code can probably be optimised some more, but it is nice and readable, which for me is more important.
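If you would rather not build the whole list up front, the same loop can be written as an iterator. This is a sketch under the same assumptions as the method above (split at the search character or after maxLength characters, whichever comes first); the names LazySplitter and SplitLazily are made up. Unlike the list version, it does not emit an empty first piece when the input starts with the search character.

```csharp
using System;
using System.Collections.Generic;

public static class LazySplitter
{
    public static IEnumerable<string> SplitLazily(string input, char search, int maxLength)
    {
        if (maxLength < 1) throw new ArgumentOutOfRangeException(nameof(maxLength));

        int start = 0;
        for (int i = 0; i < input.Length; i++)
        {
            // Start a new piece at the search character (unless the piece is still empty)
            // or once the current piece has reached maxLength characters.
            bool atSearchChar = input[i] == search && i > start;
            bool atMaxLength = i - start == maxLength;
            if (atSearchChar || atMaxLength)
            {
                yield return input.Substring(start, i - start);
                start = i;
            }
        }
        if (start < input.Length)
        {
            yield return input.Substring(start);
        }
    }
}
```

Because this is an iterator, the argument check only runs once enumeration starts.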

How do I extract text that lies between parentheses (round brackets)?

I have a string User name (sales) and I want to extract the text between the brackets, how would I do this?
I suspect sub-string but I can't work out how to read until the closing bracket, the length of text will vary.
If you wish to stay away from regular expressions, the simplest way I can think of is:
string input = "User name (sales)";
string output = input.Split('(', ')')[1];
A very simple way to do it is by using regular expressions:
Regex.Match("User name (sales)", @"\(([^)]*)\)").Groups[1].Value
As a response to the (very funny) comment, here's the same Regex with some explanation:
\( # Escaped parenthesis, means "starts with a '(' character"
( # Parentheses in a regex mean "put (capture) the stuff
# in between into the Groups array"
[^)] # Any character that is not a ')' character
* # Zero or more occurrences of the aforementioned "non ')' char"
) # Close the capturing group
\) # "Ends with a ')' character"
Assuming that you only have one pair of parentheses:
string s = "User name (sales)";
int start = s.IndexOf("(") + 1;
int end = s.IndexOf(")", start);
string result = s.Substring(start, end - start);
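The IndexOf approach throws if either parenthesis is missing, because IndexOf returns -1 in that case. A defensive variant could follow the usual Try pattern. This is only a sketch; the names ParenthesesText and TryExtractBetween are made up for the example.

```csharp
using System;

public static class ParenthesesText
{
    // Returns true and sets 'result' to the text between the first 'open'
    // and the next 'close' after it; returns false when either is missing.
    public static bool TryExtractBetween(string text, char open, char close, out string result)
    {
        result = null;
        if (text == null) return false;

        int start = text.IndexOf(open);
        if (start == -1) return false;

        int end = text.IndexOf(close, start + 1);
        if (end == -1) return false;

        result = text.Substring(start + 1, end - start - 1);
        return true;
    }
}
```

Usage would look like: if (ParenthesesText.TryExtractBetween("User name (sales)", '(', ')', out string role)) Console.WriteLine(role);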
Use this function:
public string GetSubstringByString(string a, string b, string c)
{
return c.Substring((c.IndexOf(a) + a.Length), (c.IndexOf(b) - c.IndexOf(a) - a.Length));
}
and here is the usage:
GetSubstringByString("(", ")", "User name (sales)")
and the output would be:
sales
Regular expressions might be the best tool here. If you are not familiar with them, I recommend you install Expresso - a great little regex tool.
Something like:
Regex regex = new Regex("\\((?<TextInsideBrackets>\\w+)\\)");
string incomingValue = "Username (sales)";
string insideBrackets = null;
Match match = regex.Match(incomingValue);
if(match.Success)
{
insideBrackets = match.Groups["TextInsideBrackets"].Value;
}
string input = "User name (sales)";
string output = input.Substring(input.IndexOf('(') + 1, input.IndexOf(')') - input.IndexOf('(') - 1);
A regex maybe? I think this would work...
\(([a-z]+?)\)
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
private IEnumerable<string> GetSubStrings(string input, string start, string end)
{
Regex r = new Regex(Regex.Escape(start) + "(.*?)" + Regex.Escape(end));
MatchCollection matches = r.Matches(input);
foreach (Match match in matches)
yield return match.Groups[1].Value;
}
int start = input.IndexOf("(") + 1;
int length = input.IndexOf(")") - start;
output = input.Substring(start, length);
Use a Regular Expression:
string test = "(test)";
string word = Regex.Match(test, @"\((\w+)\)").Groups[1].Value;
Console.WriteLine(word);
input.Remove(input.IndexOf(')')).Substring(input.IndexOf('(') + 1);
The regex method is superior I think, but if you wanted to use the humble substring
string input= "my name is (Jayne C)";
int start = input.IndexOf("(");
int stop = input.IndexOf(")");
string output = input.Substring(start+1, stop - start - 1);
or
string input = "my name is (Jayne C)";
string output = input.Substring(input.IndexOf("(") +1, input.IndexOf(")")- input.IndexOf("(")- 1);
var input = "12(34)1(12)(14)234";
var output = "";
for (int i = 0; i < input.Length; i++)
{
if (input[i] == '(')
{
var start = i + 1;
var end = input.IndexOf(')', i + 1);
output += input.Substring(start, end - start) + ",";
}
}
if (output.Length > 0) // remove last comma
output = output.Remove(output.Length - 1);
output : "34,12,14"
Here is a general purpose readable function that avoids using regex:
// Returns the text between 'start' and 'end'.
string ExtractBetween(string text, string start, string end)
{
int iStart = text.IndexOf(start);
iStart = (iStart == -1) ? 0 : iStart + start.Length;
int iEnd = text.LastIndexOf(end);
if(iEnd == -1)
{
iEnd = text.Length;
}
int len = iEnd - iStart;
return text.Substring(iStart, len);
}
To call it in your particular example you can do:
string result = ExtractBetween("User name (sales)", "(", ")");
I'm finding that regular expressions are extremely useful but very difficult to write. So, I did some research and found this tool that makes writing them so easy.
Don't shy away from them because the syntax is difficult to figure out. They can be so powerful.
This code is faster than most solutions here (if not all), packed as String extension method, it does not support recursive nesting:
public static string GetNestedString(this string str, char start, char end)
{
int s = -1;
int i = -1;
while (++i < str.Length)
if (str[i] == start)
{
s = i;
break;
}
int e = -1;
while(++i < str.Length)
if (str[i] == end)
{
e = i;
break;
}
if (e > s)
return str.Substring(s + 1, e - s - 1);
return null;
}
This one is little longer and slower, but it handles recursive nesting more nicely:
public static string GetNestedString(this string str, char start, char end)
{
int s = -1;
int i = -1;
while (++i < str.Length)
if (str[i] == start)
{
s = i;
break;
}
int e = -1;
int depth = 0;
while (++i < str.Length)
if (str[i] == end)
{
e = i;
if (depth == 0)
break;
else
--depth;
}
else if (str[i] == start)
++depth;
if (e > s)
return str.Substring(s + 1, e - s - 1);
return null;
}
I've been using and abusing C#9 recently and I can't help throwing in Spans even in questionable scenarios... Just for the fun of it, here's a variation on the answers above:
var input = "User name (sales)";
var txtSpan = input.AsSpan();
var startPoint = txtSpan.IndexOf('(') + 1;
var length = txtSpan.LastIndexOf(')') - startPoint;
var output = txtSpan.Slice(startPoint, length);
For the OP's specific scenario, it produces the right output.
(Personally, I'd use RegEx, as posted by others. It's easier to get around the more tricky scenarios where the solution above falls apart).
A better version (as extension method) I made for my own project:
//Note: This only captures the first occurrence, but
//can be easily modified to scan across the text (I'd prefer Slicing a Span)
public static string ExtractFromBetweenChars(this string txt, char openChar, char closeChar)
{
ReadOnlySpan<char> span = txt.AsSpan();
int firstCharPos = span.IndexOf(openChar);
int lastCharPos = -1;
if (firstCharPos != -1)
{
for (int n = firstCharPos + 1; n < span.Length; n++)
{
if (span[n] == openChar) firstCharPos = n; //This allows the opening char position to change
if (span[n] == closeChar) lastCharPos = n;
if (lastCharPos > firstCharPos) break;
//This would correctly extract "sales" from this [contrived]
//example: "just (a (name (sales) )))(test"
}
return span.Slice(firstCharPos + 1, lastCharPos - firstCharPos - 1).ToString();
}
return "";
}
Much like @Gustavo Baiocchi Costa's answer, but the offset is calculated with an intermediate Substring.
int innerTextStart = input.IndexOf("(") + 1;
int innerTextLength = input.Substring(innerTextStart).IndexOf(")");
string output = input.Substring(innerTextStart, innerTextLength);
I came across this while I was looking for a solution to a very similar implementation.
Here is a snippet from my actual code. Starts substring from the first char (index 0).
string separator = "\n"; //line terminator
string output;
string input= "HowAreYou?\nLets go there!";
output = input.Substring(0, input.IndexOf(separator));
