How to remove 4 byte characters? - c#

I have an integration with Facebook that sends me special characters (smileys and so on, for example U+1F600, which is called a grinning face). It is not possible to store these in my UTF8 (not UTF8mb4) database, so how can I make the string UTF8 (not UTF8mb4) friendly?
I can't convert my database to UTF8mb4.

You can use a simple regex:
var rx = new Regex(@"[\uD800-\uDBFF][\uDC00-\uDFFF]");
string str = "abcd\U0001D11Eabcd";
str = rx.Replace(str, "?"); // abcd?abcd
If you look at http://en.wikipedia.org/wiki/UTF-16 you'll see that non-BMP characters are composed of two 16-bit code units (a surrogate pair), with the ranges given in the Regex. These are exactly the characters that need 4 bytes in UTF-8.
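As an alternative to the regex, the same replacement can be sketched with the surrogate checks built into char. The class and method names below (Utf8Mb3, ReplaceNonBmp) are hypothetical, and treating a lone surrogate as something to replace is an assumption that goes beyond the original answer.

```csharp
using System.Text;

static class Utf8Mb3
{
    // Hypothetical helper: replaces every non-BMP character (a surrogate pair)
    // with `replacement`, so the result needs at most 3 bytes per character in UTF-8.
    public static string ReplaceNonBmp(string input, string replacement = "?")
    {
        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsSurrogatePair(input, i))
            {
                sb.Append(replacement);
                i++; // skip the low surrogate of the pair
            }
            else if (char.IsSurrogate(input[i]))
            {
                // Assumption: a lone surrogate is invalid text, so replace it too.
                sb.Append(replacement);
            }
            else
            {
                sb.Append(input[i]);
            }
        }
        return sb.ToString();
    }
}
```

Calling Utf8Mb3.ReplaceNonBmp("abcd\U0001D11Eabcd") gives the same "abcd?abcd" as the regex version; passing an empty replacement removes the characters instead.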

Related

Is it possible to pad a string using String.Format the same way as numbers?

I know that you can write something like
Console.WriteLine("{0:0000}", 11);
and it will output
0011
I'm curious if it is possible to pad a string the same way without writing a custom formatter? E.g.
Console.WriteLine("{0:00000000}", "string");
// output: 00string
You can use String.PadRight (or PadLeft, depending on your requirements).
Returns a new string that left-aligns the characters in this string by
padding them on the right with a specified Unicode character, for a
specified total length.
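For the "00string" output asked for in the question, PadLeft with an explicit padding character is the variant that fits. A minimal sketch, with hypothetical class and method names; the second method shows that the alignment component of a format item also pads strings, but only with spaces:

```csharp
using System;

static class PaddingExamples
{
    // Pads on the left with '0' up to totalWidth characters.
    // Strings already at least totalWidth long are returned unchanged.
    public static string ZeroPad(string value, int totalWidth)
    {
        return value.PadLeft(totalWidth, '0');
    }

    // "{0,8}" right-aligns the argument in a field of 8 characters, padding with spaces.
    public static string SpacePad(string value)
    {
        return string.Format("{0,8}", value);
    }
}
```

PaddingExamples.ZeroPad("string", 8) returns "00string", while PaddingExamples.SpacePad("string") returns the string preceded by two spaces.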

Parsing a string to get a specific value

I'm new to C#. I'm parsing for a lot number in a 2D barcode. The actual lot number 'A2351' is hidden in this barcode string "+M727PP011/$$3201001A2351S". I would like to break this barcode up in separate string blocks but the delimiters are not consistent.
The letter prefix in front of the 4-digit lot number can be an 'A', 'P', or 'D'. There is a single letter following the lot number that can be ignored.
string Delimiter = "/$$3";
//barcode format:M###PP###/$$3 ddmmyy lotnumprefix 'A' followed by lotNum
string lotNum= "+M727PP011/$$3201001A2351S";
string[] split = lotNum.Split(new[] {Delimiter}, StringSplitOptions.None);
How do I extract the lot number after the date?
Based on your initial example and then the subsequent edit in which you showed how you are solving this, it sounds like the lot number is always in the same place. It would be cleaner (and more in line with standard C# code) to use a single call to string.Substring(int,int) rather than the two lines you are using which also require pulling in the VB library. You just need to call Substring and give it the starting index and the length.
So this code:
string lotNum = Strings.Right(barcode, 6);
lotNum = lotNum.Remove((lotNum.Length - 1), 1);
Can be done with this single substring call:
string lotNum = barcode.Substring(barcode.Length - 6, 5);
Edit
Just further clarification on why it might be better to use the call to Substring. In C# string objects are immutable. That means that when you make the call to Strings.Right you are getting back a new string object. When you then call lotNum.Remove you do not "remove" a character from the existing string, a new string is allocated with the character(s) removed and is returned to you. So with your code there are two new string allocations when trying to extract the lot number. When you make the call to Substring you will get back a new string, but instead of getting a new string that you immediately then modify and get a second new string, you will only need to allocate one new string to extract the lot number. In the example you have given there probably would not be any noticeable performance/memory issue, but it is something that could potentially lead to trouble if this code was in a tight loop or something like that.
If you're just trying to get the lot number, it really depends on the format of the input string (is it a consistent length, are there any reliable prefixes or suffixes relative to the data you're trying to parse, etc.). It looks like your data can be located by its fixed position in the string, so you could use the Substring method (with a start index of 20?) to accomplish what you want.
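If the length of the part before the delimiter can vary, the lot number can instead be located relative to the "/$$3" delimiter from the question. This is only a sketch under assumptions taken from the question's format comment: the delimiter is followed by a 6-digit date and then a 5-character lot number (prefix letter plus 4 digits). BarcodeParser and ExtractLotNumber are hypothetical names.

```csharp
using System;

static class BarcodeParser
{
    const string Delimiter = "/$$3"; // delimiter from the question
    const int DateLength = 6;        // assumed: ddmmyy directly after the delimiter
    const int LotLength = 5;         // assumed: prefix letter + 4 digits

    public static string ExtractLotNumber(string barcode)
    {
        int delimiterIndex = barcode.IndexOf(Delimiter, StringComparison.Ordinal);
        if (delimiterIndex < 0)
            throw new FormatException("Delimiter not found in barcode.");

        int lotStart = delimiterIndex + Delimiter.Length + DateLength;
        if (lotStart + LotLength > barcode.Length)
            throw new FormatException("Barcode is too short to contain a lot number.");

        // One Substring call, so only one new string is allocated.
        return barcode.Substring(lotStart, LotLength);
    }
}
```

For the sample barcode "+M727PP011/$$3201001A2351S" the delimiter starts at index 10, so the lot number starts at index 20 and the method returns "A2351".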

Parsing an array

I need to parse an array of characters that is a fixed length but can have just about any combination of letters or numbers. My 50 digit array looks like this: NL1NAMEOFCO-B032144221111000100600000-A35499001
This array represents a vast combination of settings within our product. I need to extract all reference designators in the array. The first 3 characters represent a particular model NL1, the next 8 characters represent a company NAMEOFCO. The ‘-‘ will always be in the same location. The B (digit 13) represents some value, etc, etc. Also, some values are represented by 2 digits. Digits 20 & 21 (which store the value 22), represent some specific settings.
So by now you get the idea. I can parse the array and extract the values I need by using the following code:
String Company = ConfigCode[3].ToString() +
ConfigCode[4].ToString() +
ConfigCode[5].ToString() +
ConfigCode[6].ToString() +
ConfigCode[7].ToString() +
ConfigCode[8].ToString() +
ConfigCode[9].ToString() +
ConfigCode[10].ToString();
This works without any problems, but to me, there should be an easier way of doing this. I would have thought the following would work, but it does not.
String Company = ConfigCode[3..10].ToString();
Can someone explain to me why it doesn’t work and what would be a better way of extracting the information I need?
Thanks!
I believe that the String.Substring method is what you're looking for. The signature of the overload you need is:
public string Substring(
int startIndex,
int length
)
The documentation for it is here: http://msdn.microsoft.com/en-us/library/aka44szs(v=vs.110).aspx
For example, your Company name would be (going by the description of a character length of 8):
string CompanyName = configCode.Substring(3, 8);
As mentioned before, you can use the Substring method (an instance method of string) like so:
String Company = ConfigCode.Substring(3, 8);
The square-bracket operator on a string, as in ConfigCode[3], returns the individual char at that index. ConfigCode[3..10] does not compile because C# had no slice syntax like array[3..10] when this was written. Ranges were only added in C# 8.0, where ConfigCode[3..11] (the end index is exclusive) returns the same eight characters as Substring(3, 8).
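The field positions described in the question can be collected in one place, for example as below. This is a sketch: ConfigCodeParser and its method names are hypothetical, the offsets are the 0-based equivalents of the positions given in the question, and the range version assumes C# 8.0 or later on a runtime that provides System.Range (for example .NET Core 3.0 or later).

```csharp
static class ConfigCodeParser
{
    // Characters 1-3 of the code: the model, e.g. "NL1".
    public static string Model(string code) => code.Substring(0, 3);

    // The next 8 characters: the company, e.g. "NAMEOFCO".
    public static string Company(string code) => code.Substring(3, 8);

    // "Digits 20 & 21" in the question are the 0-based indices 19 and 20.
    public static string Setting20And21(string code) => code.Substring(19, 2);

    // If the code is held in a char[] rather than a string,
    // this string constructor copies a slice of it.
    public static string CompanyFromChars(char[] code) => new string(code, 3, 8);

    // C# 8.0+ only: range syntax with an exclusive end index.
    public static string CompanyWithRange(string code) => code[3..11];
}
```

With the sample value from the question, Company returns "NAMEOFCO" and Setting20And21 returns "22".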

Dynamic Regex for number range using c#

I'm looking at UK postcodes and trying to work out how I can take data from a database (the first part of a UK postcode) and dynamically create a regexp for them using c#. For example:
AB44-56
I know what I want as an output:
AB([4][4-9]|[5][0-6])+
However, I can't work out how I might be able to do this with logic. Perhaps I need to split the letters from the numbers first, but I can't do that using Split.
I have other combinations too - single range:
AB31 would be AB[3][1]+
Some with just letters:
BT would be BT+
Some with a single letter and 1 or two numbers:
G83 Would be G[8][3]
Any suggestions or guidance on how this might be coded would be very much appreciated.
From Wikipedia, on UK postal codes:
This can be generalised as: (one or two letters)(number between 0 and
99)(zero or one letter)(space)(single digit)(two letters)
so
^[A-Za-z]{1,2}\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$
might work.
EDIT: Also, if you are trying to restrict the postal codes to, say, those with the same prefix as the ones in the database, you could do this.
var source = "BTasdfweasdf"; //from the database
var input = "BT1A 1BB"; //from somewhere else
var regex = Regex.Replace(source, @"(^[A-Za-z]{0,2})(.*)", @"$1\d{1,2}[A-Za-z]?\s\d[A-Za-z]{2}$");
var match = Regex.Match(input,regex);
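For the original question of turning a stored value such as AB44-56 into a pattern, one simple approach is to split the letters from the numbers with a regex and then list every number in the range as an alternative, rather than building digit character classes. The sketch below does that; PostcodePrefix and ToPattern are hypothetical names, and the accepted input shapes (letters only, letters plus one number, letters plus a low-high range) are assumptions based on the examples in the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class PostcodePrefix
{
    // Hypothetical helper: "AB44-56" -> "^AB(?:44|45|...|56)(?!\d)",
    // "G83" -> "^G(?:83)(?!\d)", "BT" -> "^BT(?=\d)".
    public static string ToPattern(string spec)
    {
        // Group 1: letters, group 2: first number, group 3: optional end of range.
        Match m = Regex.Match(spec, @"^([A-Za-z]{1,2})(?:(\d{1,2})(?:-(\d{1,2}))?)?$");
        if (!m.Success)
            throw new ArgumentException("Unrecognised prefix: " + spec, nameof(spec));

        string letters = m.Groups[1].Value;
        if (!m.Groups[2].Success)
            return "^" + letters + @"(?=\d)"; // letters only: any district number

        int low = int.Parse(m.Groups[2].Value);
        int high = m.Groups[3].Success ? int.Parse(m.Groups[3].Value) : low;
        if (high < low)
            throw new ArgumentException("Range end is below range start: " + spec, nameof(spec));

        // Enumerate the range; (?!\d) stops "AB4" from matching "AB44".
        string numbers = string.Join("|", Enumerable.Range(low, high - low + 1));
        return "^" + letters + "(?:" + numbers + @")(?!\d)";
    }
}
```

Regex.IsMatch("AB50 1AA", PostcodePrefix.ToPattern("AB44-56")) is true, while "AB57 1AA" and "AB4 1AA" do not match. Listing the numbers keeps the generator short at the cost of a longer pattern, which is acceptable for ranges within 0 to 99.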

Extract substring from string with Regex

Imagine that users are inserting strings in several computers.
On one computer, the pattern in the configuration will extract some characters of that string, let's say positions 4 to 5.
On another computer, the extract pattern will return other characters, for instance, last 3 positions of the string.
These configurations (the Regex patterns) are different for each computer, and should be available for change by the administrator, without having to change the source code.
Some examples:
Original_String Return_Value
User1 - abcd78defg123 78
User2 - abcd78defg123 78g1
User3 - mm127788abcd 12
User4 - 123456pp12asd ppsd
Can it be done with Regex?
Thanks.
Why do you want to use regex for this? What is wrong with:
string foo = s.Substring(4,2);
string bar = s.Substring(s.Length-3,3);
(you can wrap those up to do a bit of bounds-checking on the length easily enough)
If you really want, you could wrap it up in a Func<string,string> to put somewhere - not sure I'd bother, though:
Func<string, string> get4and5 = s => s.Substring(4, 2);
Func<string,string> getLast3 = s => s.Substring(s.Length - 3, 3);
string value = "abcd78defg123";
string foo = getLast3(value);
string bar = get4and5(value);
If you really want to use regex:
^.{4}(..)
And:
.*(...)$
To have a regex capture values for further use you typically use (). In .NET, as in most regex flavours, parentheses create capture groups, while [] defines a character class rather than a group.
Example
User4 - 123456pp12asd ppsd
is the most interesting in that you have two separate capture areas here. Is there some default rule on how to join them together, or would you want to be able to specify how to build the result?
Perhaps something like
r/......(..)...(..)/\1\2/ for ppsd
r/......(..)...(..)/\2-\1/ for sd-pp
Do you want to run a regex to get the captures and handle them yourself, or do you want to run more advanced manipulation commands?
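If the default rule is simply "join the captured groups in order", the per-computer pattern can be stored in configuration and applied as in the sketch below. ConfiguredExtractor and Extract are hypothetical names, and returning an empty string when the pattern does not match is an assumption.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

static class ConfiguredExtractor
{
    // Applies the administrator-supplied pattern and concatenates
    // the values of all capture groups in order.
    public static string Extract(string input, string pattern)
    {
        Match match = Regex.Match(input, pattern);
        if (!match.Success)
            return string.Empty; // assumption: no match means nothing to extract

        // Group 0 is the whole match, so skip it and join the real captures.
        return string.Concat(match.Groups.Cast<Group>().Skip(1).Select(g => g.Value));
    }
}
```

With the examples from the question, the pattern ^.{4}(..) turns abcd78defg123 into 78, ^.{4}(..).{3}(..) turns it into 78g1, and ^.{6}(..).{3}(..) turns 123456pp12asd into ppsd.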
I'm not sure what you are hoping to get by using RegEx. RegEx is used for pattern matching. If you want to extract based on position, just use substring.
It seems to me that Regex really isn't the solution here. To return a section of a string beginning at position pos (starting at 0) and of length length, you simply call the Substring function as such:
string section = str.Substring(pos, length);
Grouping. You could match on /^.{3}(.{2})/ and then look at group $1 for example.
The question is why? Normal string handling i.e. actual substring methods are going to be faster and clearer in intent.
