Matching Unicode characters in a regular expression

Matching Unicode characters in a regular expression - c#

I retrieve strings from a website using the HttpClient class. The webserver sends them in UTF-8 encoding. The strings have the form abc | a and I'd like to remove the pipe, the space and the character after the space from them, if they are at the end of the string.
sText = Regex.Replace (sText, #"\| .$", "");
works as expected. Now, in some cases, the pipe and the space is followed by another character, for example a smiley. The string has then the form abc | 😉. The regular expression above does not work and I have to use
sText = Regex.Replace (sText, #"\| ..$", "");
instead (two dots).
I'm quite sure it has something to do with the encoding and with the fact that the smiley uses more bytes in UTF-8 than a latin character - and the fact that c# doesn't know the encoding. The smiley is just one character, even if it uses more bytes, so after telling c# the correct encoding (or converting the string), the first regular expression should work in both cases.
How can this be done?

Like it was suggested in the comments, this problem is hard to solve using Regex. What you call "looks like one item" is actually a grapheme cluster. The corresponding .NET term is a "text element" that can be parsed and iterated through using StringInfo.GetTextElementEnumerator.
A possible solution based on text elements can be quite simple: we just need to extract the last 3 text elements from the input string and ensure that they refer to a pipe, a space and the last one can be any. Please find below the proposed approach implementation.
void Main()
{
var inputs = new[] {
"abc | a",
"abc | ab", // The only that shouldn't be trimmed
"abc | 😉",
"abc | " + "\uD83D\uDD75\u200D\u2642\uFE0F" // "man-detective" (on Windows)
};
foreach (var input in inputs)
{
var res = TrimTrailingTextElement(input);
Console.WriteLine("Input : " + input);
Console.WriteLine("Result: " + res);
Console.WriteLine();
}
}
string TrimTrailingTextElement(string input)
{
// A circular buffer for storing the last 3 text elements
var lastThreeElementIdxs = new int[3] { -1, -1, -1 };
// Get enumerator of text elements in the input string
var enumerator = StringInfo.GetTextElementEnumerator(input);
// Iterate through the enitre input string,
// at each step save to the buffer the current element index
var i = -1;
while (enumerator.MoveNext())
{
i = (i + 1) % 3;
lastThreeElementIdxs[i] = enumerator.ElementIndex;
}
// The buffer index must be positive for a non-empty input
if (i >= 0)
{
// Extract indexes of the last 3 elements
// from the circular buffer
var i1 = lastThreeElementIdxs[(i + 1) % 3];
var i2 = lastThreeElementIdxs[(i + 2) % 3];
var i3 = lastThreeElementIdxs[i];
if (i1 >= 0 && i2 >= 0 && i3 >= 0 && // All 3 indexes must be initialized
i3 - i2 == 1 && i2 - i1 == 1 && // The 1 and 2 elements must be 1 char long
input[i1] == '|' && // The 1 element must be a pipe
input[i2] == ' ') // The 2 element must be a space
{
return input.Substring(0, i1);
}
}
return input;
}

Related

Maximum product of 13 adjacent numbers

List<int> arr = new List<int>();
long max = 0;
long mul = 1;
string abc = #"73167176531330624919225119674426574742355349194934
85861560789112949495459501737958331952853208805511
96983520312774506326239578318016984801869478851843
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450";
foreach (char a in abc)
{
if(arr.Count == 13)
{
arr.RemoveAt(0);
}
int value = (int)Char.GetNumericValue(a);
arr.Add(value);
if(arr.Count == 13)
{
foreach(int b in arr)
{
mul = mul * b;
if (mul > max)
{
max = mul;
}
}
mul = 1;
}
}
Console.WriteLine(max);
I am getting 5377010688 which is a wrong answer and when I am trying same logic with given example in project Euler it is working fine, please help me.
Don't say the answer just correct me where I am doing wrong or where the code is not running as it should.

The string constant, as it is written down like above, contains blanks and \r\n's, e.g. between the last '4' of the first line and the first '8' on the second line. Char.GetNumericValue() returns -1 for a blank.
Propably the character sequence with the highest product spans across adjacent lines in your string, therefore there are blanks in between, which count as -1, which disables your code in finding them.
Write your constant like this:
string abc = #"73167176531330624919225119674426574742355349194934" +
"85861560789112949495459501737958331952853208805511" +
"96983520312774506326239578318016984801869478851843" + etc.
The result is then 23514624000, I hope that's correct.

Don't say the answer just correct me where I am doing wrong or where
the code is not running as it should
You have included all characters into calculation but you should not do that. The input string also contains for example carriage return '\n' at the end of each line.
Your actual string look like this:
string abc = #"73167176531330624919225119674426574742355349194934\r\n
85861560789112949495459501737958331952853208805511\r\n
...
How to solve this? You should ignore these characters, one possible solution is to check each char if it is a digit:
if(!char.IsDigit(a))
{
continue;
}

How can find all permutations of spinning text in c#

I have a spinning text : {T1{M1|{A1|B1}|M2}F1|{X1|X2}}
My question is : How can i find all permutations in C# ?
T1M1F1
T1M2F1
T1A1F1
T1B1F1
X1
X2
Any suggestions ?
Edit :
Thank you for your help but M1,A1, .. are examples
With words that could give :
{my name is james vick and i am a {member|user|visitor} on this {forum|website|site} and i am loving it | i am admin and i am a {supervisor|admin|moderator} on this {forum|website|site} and i am loving it}.
my name is james vick and i am a {member|user|visitor} on this {forum|website|site} and i am loving it => 3 * 3 => 9 permutations
i am admin and i am a {supervisor|admin|moderator} on this {forum|website|site} and i am loving it => 3 * 3 => 9 permutations
Result : 18 permutations

Method to generate all permuatuons of spinnable strings
I've implemented a simple method to solve this problem.
It takes an ArrayList argument containing spinnable text string(s).
I use it to generate all the permutations of multiple spinnable strings.
It comes with extra functionality of support of optional blocks, surronded by "[ ]" brackets.
Eq.:
If you have a single string object in the ArrayList with content of:
{A | {B1 | B2 } [B optional] }
It populates the array list with all the permutations, "extracted"
Contents after invocation of method:
A
B1
B1 B optional
B2
B2 B optional
You can also pass multiple strings as argument to generate permutations for all of them:
Eg.:
Input:
ArraList with two string
{A1 | A2}
{B1 | B2}
Contents after invocation:
A1
A2
B1
B2
This implementation works by always finding the inner most bracket pair in the first spinnable section, then extract it. I do this until all the special {}, [] characters are removed.
private void ExtractVersions(ArrayList list)
{
ArrayList IndicesToRemove = new ArrayList();
for (int i = 0; i < list.Count; i++)
{
string s = list[i].ToString();
int firstIndexOfCurlyClosing = s.IndexOf('}');
int firstIndexOfBracketClosing = s.IndexOf(']');
if ((firstIndexOfCurlyClosing > -1) || (firstIndexOfBracketClosing > -1))
{
char type = ' ';
int endi = -1;
int starti = -1;
if ((firstIndexOfBracketClosing == -1) && (firstIndexOfCurlyClosing > -1))
{ // Only Curly
endi = firstIndexOfCurlyClosing;
type = '{';
}
else
{
if ((firstIndexOfBracketClosing > -1) && (firstIndexOfCurlyClosing == -1))
{ // Only bracket
endi = firstIndexOfBracketClosing;
type = '[';
}
else
{
// Both
endi = Math.Min(firstIndexOfBracketClosing, firstIndexOfCurlyClosing);
type = s[endi];
if (type == ']')
{
type = '[';
}
else
{
type = '{';
}
}
}
starti = s.Substring(0, endi).LastIndexOf(type);
if (starti == -1)
{
throw new Exception("Brackets are not valid.");
}
// start index, end index and type found. -> make changes
if (type == '[')
{
// Add two new lines, one with the optional part, one without it
list.Add(s.Remove(starti, endi - starti+1));
list.Add(s.Remove(starti, 1).Remove(endi-1, 1));
IndicesToRemove.Add(i);
}
else
if (type == '{')
{
// Add as many new lines as many alternatives there are. This must be an in most bracket.
string alternatives = s.Substring(starti + 1, endi - starti - 1);
foreach(string alt in alternatives.Split('|'))
{
list.Add(s.Remove(starti,endi-starti+1).Insert(starti,alt));
}
IndicesToRemove.Add(i);
}
} // End of if( >-1 && >-1)
} // End of for loop
for (int i = IndicesToRemove.Count-1; i >= 0; i--)
{
list.RemoveAt((int)IndicesToRemove[i]);
}
}
I hope I've helped.
Maybe it is not the simplest and best implementation, but it works well for me. Please feedback, and vote!

In my opinion, you should proceed like this:
All nested choice lists i.e. between { } should be "flattened" to a single choice list. Like in your example:
{M1|{A1|B1}|M2} -> {M1|A1|B1|M2}
Use recursion to generate all possible combinations. For example, starting from an empty array, first place T1 since it is the only option. Then from the nested list {M1|A1|B1|M2} choose each element in turn an place it on the next position and then finally F1. Repeat until all possibilities are exhausted.
This is just a rough hint, you need to fill in the rest of the details.

How to read and parse specific integer value form a text file and add it to listbox in c#?

i'd like to know how to read and parse specific integer value form a text file and add it to listbox in c#. For example I have a text file MyText.txt like this:
<>
101
192
-
399
~
99
128
-
366
~
101
192
-
403
~
And I want to parse the integer value between '-' and '~' and add each one of it to items in list box for example:
#listBox1
399
366
403
Notice that each line of value separated by Carriage Return and Line Feed. And by the way, it is a data transmitted through RS-232 Serial Communication from microcontroller. Sorry, I'm just new in c# programming. Thanks in advance.

Here's a way to do it with LINQ:
bool keep = false;
listBox1.Items.AddRange(
File.ReadLines("MyText.txt")
.Where(l =>
{
if (l == "-") keep = true;
else if (l == "~") keep = false;
else return keep;
return false;
})
.ToArray());

you could use regular expressions like so:
var s = System.Text.RegularExpressions.Regex.Matches(stringtomatch,#"(?<=-\s*)[0-9]+\b(?=\s*~)");
The regex basically looks for a number. It then checks the characters behind, looks for an optional whitespace and a dash (-). then it matches all the numbers until it encounters another non-word character. it checks for an optional whitespace and then a required ~ (dunno what that's called). Also, it only returns the number (not the whitespace and symbols).
So basically this method returns a list of matches. you could then use it like so:
for (int i = 0; i < s.Count; i++)
{
listBox1.Items.Add(s[i]);
}
EDIT:
typo in the regex and updated the loop (for some reason, foreach doesn't work with the MatchCollection).
you can try running this test script:
var stringtomatch = " asdjasdk jh kjh asd\n-\n123123\n~\nasdasd";
var s = System.Text.RegularExpressions.Regex.Matches(stringtomatch,#"(?<=-\s*)[0-9]+\b(?=\s*~)");
Console.WriteLine(stringtomatch);
for (int i = 0; i < s.Count; i++)
{
listBox1.Items.Add(s[i]);
}

Try
List<Int32> values = new List<Int32>();
bool open = false;
String[] lines = File.ReadAllLines(fileName);
foreach(String line in lines)
{
if( (!open) && (line == "-") )
{
open = true;
}
else if( (open) && (line == "~") )
{
open = false;
}
else if(open)
{
Int32 v;
if(Int32.TryParse(line, out v))
{
values.Add(v);
}
}
}
Listbox.Items.AddRange(values);
This is a easy piece of code with reading a file, converting to integer (although you could stay with strings) and handling lists. You should start with some basic .NET/C# tutorials.
Edit: To add the values to the listbox you can switch to values.ForEach(v => listbox.Items.Add(v.ToString()) if you use .NET 3.5. Otherwise make a foreach yourself.

How can I enable a word-breaking function by length without split inside html-encoded special chars

I would like to implement a functionality that insert a word-breaking TAG if a word is too long to appear in a single line.
protected string InstertWBRTags(string text, int interval)
{
if (String.IsNullOrEmpty(text) || interval < 1 || text.Length < interval)
{
return text;
}
int pS = 0, pE = 0, tLength = text.Length;
StringBuilder sb = new StringBuilder(tLength * 2);
while (pS < tLength)
{
pE = pS + interval;
if (pE > tLength)
sb.Append(text.Substring(pS));
else
{
sb.Append(text.Substring(pS, pE - pS));
sb.Append("");//<wbr> not supported by IE 8
}
pS = pE;
}
return sb.ToString();
}
The problem is: What can I do, if the text contains html-encoded special chars?
What can I do to prevent insertion of a TAG inside a ß?
What can I do to count the real string length (that appears in browser)?
A string like ♡♥♡♥ contains only 2 chars (hearts) in browser but its length is 14.

One solution would be to decode the entities into the Unicode characters they represent and work with that. To do that use System.Net.WebUtility.HtmlDecode() if you're in .NET 4 or System.Web.HttpUtility.HtmlDecode() otherwise.
But be aware that not all Unicode character fit in one char.

You need to pass through whole text character by character, when you find a & than you examine what is next, if you reach a # it is quite sure that after this till a column will be a set of number (you can check it also). I such situation you move your iterator to the position of nearest semicolon and increment the counter.
In Java dialect
int count = 0;
for(int i = 0; i < text.length(); i++) {
if(text.charAt(i) == '&') {
i = text.indexOf(';', i) + 1; // what, from
}
count++;
}
Very simplified version

Is there a better way than String.Replace to remove backspaces from a string?

I have a string read from another source such as "\b\bfoo\bx". In this case, it would translate to the word "fox" as the first 2 \b's are ignored, and the last 'o' is erased, and then replaced with 'x'. Also another case would be "patt\b\b\b\b\b\b\b\b\b\bfoo" should be translated to "foo"
I have come up with something using String.Replace, but it is complex and I am worried it is not working correctly, also it is creating a lot of new string objects which I would like to avoid.
Any ideas?

Probably the easiest is to just iterate over the entire string. Given your inputs, the following code does the trick in 1-pass
public string ReplaceBackspace(string hasBackspace)
{
if( string.IsNullOrEmpty(hasBackspace) )
return hasBackspace;
StringBuilder result = new StringBuilder(hasBackspace.Length);
foreach (char c in hasBackspace)
{
if (c == '\b')
{
if (result.Length > 0)
result.Length--;
}
else
{
result.Append(c);
}
}
return result.ToString();
}

The way I would do it is low-tech, but easy to understand.
Create a stack of characters. Then iterate through the string from beginning to end. If the character is a normal character (non-slash), push it onto the stack. If it is a slash, and the next character is a 'b', pop the top of the stack. If the stack is empty, ignore it.
At the end, pop each character in turn, add it to a StringBuilder, and reverse the result.

Regular expressions version:
var data = #"patt\b\b\b\b\b\b\b\b\b\bfoo";
var regex = new Regex(#"(^|[^\\b])\\b");
while (regex.IsMatch(data))
{
data = regex.Replace(data, "");
}
Optimized version (and this one works with backspace '\b' and not with string "\b"):
var data = "patt\b\b\b\b\b\b\b\b\b\bfoo";
var regex = new Regex(#"[^\x08]\x08", RegexOptions.Compiled);
while (data.Contains('\b'))
{
data = regex.Replace(data.TrimStart('\b'), "");
}

public static string ProcessBackspaces(string source)
{
char[] buffer = new char[source.Length];
int idx = 0;
foreach (char c in source)
{
if (c != '\b')
{
buffer[idx] = c;
idx++;
}
else if (idx > 0)
{
idx--;
}
}
return new string(buffer, 0, idx);
}
EDIT
I've done a quick, rough benchmark of the code posted in answers so far (processing the two example strings from the question, one million times each):
ANSWER | TIME (ms)
------------------------|-----------
Luke (this one) | 318
Alexander Taran | 567
Robert Paulson | 683
Markus Nigbur | 2100
Kamarey (new version) | 7075
Kamarey (old version) | 30902

You could iterate through the string backward, making a character array as you go. Every time you hit a backspace, increment a counter, and every time you hit a normal character, skip it if your counter is non-zero and decrement the counter.
I'm not sure what the best C# data structure is to manage this and then be able to get the string in the right order afterward quickly. StringBuilder has an Insert method but I don't know if it will be performant to keep inserting characters at the start or not. You could put the characters in a stack and hit ToArray() at the end -- that might or might not be faster.

String myString = "patt\b\b\b\b\b\b\b\b\b\bfoo";
List<char> chars = myString.ToCharArray().ToList();
int delCount = 0;
for (int i = chars.Count -1; i >= 0; i--)
{
if (chars[i] == '\b')
{
delCount++;
chars.RemoveAt(i);
} else {
if (delCount > 0 && chars[i] != null) {
chars.RemoveAt(i);
delCount--;
}
}
}

i'd go like this:
code is not tested
char[] result = new char[input.Length()];
int r =0;
for (i=0; i<input.Length(); i++){
if (input[i] == '\b' && r>0) r--;
else result[r]=input[i];
}
string resultsring = result.take(r);

Create a StringBuilder and copy over everything but backspace chars.

We Keep Coding

C# (C-Sharp) is a programming language developed by Microsoft that runs on the .NET Framework.

Matching Unicode characters in a regular expression - c#

Related

Maximum product of 13 adjacent numbers

How can find all permutations of spinning text in c#

How to read and parse specific integer value form a text file and add it to listbox in c#?

How can I enable a word-breaking function by length without split inside html-encoded special chars

Is there a better way than String.Replace to remove backspaces from a string?

Categories

Resources